WO2018205986A1 - Method and system for parallelizing sequential graph computation - Google Patents

Info

Publication number
WO2018205986A1
Authority
WO
WIPO (PCT)
Prior art keywords
worker
coordinator
workers
preprocessor
query
Application number
PCT/CN2018/086454
Other languages
French (fr)
Inventor
Wenfei Fan
Jingbo XU
Wenyuan Yu
Original Assignee
Shanghai Putu Technology Partnership (General Partnership)
Application filed by Shanghai Putu Technology Partnership (General Partnership) filed Critical Shanghai Putu Technology Partnership (General Partnership)
Publication of WO2018205986A1 publication Critical patent/WO2018205986A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • The parallelization system of the present invention is applicable in contexts such as PageRank, pattern matching (via simulation and isomorphism), connectivity, keyword search and collaborative filtering, as well as other graph computations.
  • GRAPE has been experimentally evaluated against (a) Giraph, an open-source version of Pregel, (b) GraphLab, an asynchronous vertex-centric system, and (c) Blogel, arguably the fastest block-centric system. Over real-life graphs, in addition to its ease of programming, GRAPE achieves performance comparable to these state-of-the-art systems. For instance, on average GRAPE is more than 480, 36 and 15 times faster than Giraph, GraphLab and Blogel for SSSP, respectively; up to 5 orders of magnitude faster for CC; more than 150, 6 and 16 times faster for Sim; 70, 14 and 3 times faster for SubIso; and 5, 3 and 12 times faster for CF.


Abstract

A method and system for parallelizing sequential graph computations are provided. The method comprises: partitioning a graph G into fragments; distributing the fragments and parallelized algorithms across n workers, respectively; receiving a query Q at a coordinator and posting Q to all workers; executing partial evaluation by each worker against its local fragment and generating messages; exchanging messages among the workers; executing incremental evaluation by each worker, upon receiving a message, against its local fragment updated by the message; iterating the incremental evaluation until no further update message can be made for any fragment; computing a complete result by assembling, at the coordinator, the partial results from the workers; and returning the result as the answer to the query Q. The system is also disclosed by illustrating its architecture in a figure.

Description

METHOD AND SYSTEM FOR PARALLELIZING SEQUENTIAL GRAPH COMPUTATION
RELATED APPLICATION
The present application claims the benefit of PCT/CN2017/084083 filed May 12, 2017, which is incorporated herein by reference in its entirety.
FIELD OF THE INVENTION
The present invention relates to graph computations. More particularly, this invention describes solutions on parallel graph computations.
BACKGROUND OF THE INVENTION
A variety of models and systems for parallelizing graph computation are available in the prior art. Such models include: PRAM (Parallel Random Access Machine), BSP (Bulk Synchronous Parallel) and MapReduce (Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM. 51, 1 (2008)). PRAM abstracts parallel RAM access over shared memory. BSP models parallel computations in supersteps (including local computation, communication and a synchronization barrier) to synchronize communication among workers. Pregel (Giraph) (Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N. and Czajkowski, G. 2010. Pregel: A system for large-scale graph processing. SIGMOD (2010)) implements BSP with vertex-centric programming, where a superstep executes a user-defined function at each vertex in parallel. GraphLab (Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C. and Hellerstein, J.M. 2012. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB. 5, 8 (2012)) revises BSP to pass messages asynchronously. Block-centric models extend vertex-centric programming to blocks, to exchange messages among blocks. Popular graph systems also include: GraphX (Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I. 2014. GraphX: Graph processing in a distributed dataflow framework. OSDI (2014)), GRACE (G. Wang, W. Xie, A.J. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013), GPS (Salihoglu, S. and Widom, J. 2013. GPS: a graph processing system. SSDBM (2013)), etc. GraphX recasts graph computation in its distributed dataflow framework as a sequence of join and group-by stages punctuated by map operations over the Spark platform. GRACE provides an operator-level, iterative programming model to enhance synchronous BSP with asynchronous execution. GPS implements Pregel with extended APIs and partition strategies. All these systems require recasting of sequential algorithms into parallel algorithms before one can take advantage of the parallel model and system.
A number of graph algorithms have been developed in MapReduce, vertex-centric models and others. Prior work on automated parallelization, however, has focused on the instruction or operator level, breaking dependencies via symbolic and automata analyses. There has also been work at the data partition level, to perform multi-level partition ("parallel abstraction") and enable locality-optimized access that adapts to different parallel abstractions. Such adaptation is not easy, and hence these approaches target experienced developers of parallel algorithms. Thus, an easier method for parallelizing a sequential graph computation has been eagerly sought.
OBJECTS AND SUMMARY OF THE INVENTION
Therefore, it is an object of the claimed invention to provide a significantly improved method and system for parallelizing sequential graph computations.
In an embodiment, a method for parallelizing a sequential graph computation is provided. The method comprises:
partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
distributing the fragments by the preprocessor across n workers (P 1, …, P n) , respectively;
receiving by the preprocessor three sequential algorithms and the message preambles;
processing by the preprocessor the sequential algorithms into parallel version;
processing the message preambles and determining the communication protocol by the preprocessor;
receiving a query Q at a coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i, and generating message Mi;
exchanging messages among the workers P i;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i (the same definition holds throughout this description);
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
Preferably, the method further comprises: selecting a partition strategy 𝒫 by the preprocessor before partitioning the graph G.
In embodiments of the method, the workers are distributed processors, or processors in a single machine, or threads on a processor.
In an embodiment of the method, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
Preferably, the method further comprises: identifying and initializing by each worker P i while executing the partial evaluation a set of update parameters for each F i that records the status of its border nodes.
Preferably, the method further comprises: routing messages by the coordinator to workers during the message passing.
Preferably, the method further comprises: routing messages to other workers by each worker P i during the message passing.
Preferably, the method further comprises: treating by worker P i
Figure PCTCN2018086454-appb-000004
and 
Figure PCTCN2018086454-appb-000005
as F i and Q (F i) , respectively, for a next computation each time after executing incremental evaluation.
In an embodiment, a method for parallelizing a sequential graph computation is provided, the method comprises:
loading fragments (F 1, …, F n) by n workers (P 1, …, P n) respectively, wherein, the fragments are pre-partitioned from a graph G;
receiving by a preprocessor three sequential algorithms and the message preambles;
processing by the preprocessor the sequential algorithms into parallel version;
processing the message preambles and determining the communication protocol by the preprocessor;
receiving a query Q at a coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i, and generating message Mi;
exchanging messages among the workers P i;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i;
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
In an embodiment, provided is a non-transitory computer readable medium with instructions stored thereon for parallelizing a sequential graph computation, the instructions, when executed by one or more processors, performing the steps comprising:
partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
distributing the graph computation algorithm by the preprocessor to all n workers;
receiving a query Q at the coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i;
exchanging partial results among the workers P i via message passing;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i;
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
In an embodiment, a system for parallelizing a sequential graph computation is provided.
A system for parallelizing a sequential graph computation, comprising a preprocessor, a coordinator and a plurality of workers, wherein:
the preprocessor comprises:
an automatic parallelization module for receiving, parsing and automatically parallelizing inputted algorithms and for distributing the algorithms to workers;
a partition manager module for partitioning graph and managing fragments;
a communication control module for processing the message preambles and determining the communication protocol; and
a storage module for managing graph data in a distributed file system;
the coordinator comprises:
a query parser module for receiving, parsing and distributing a query to workers;
an assemble module for assembling result for a query;
a communication control module for controlling message passing; and
a storage module for managing graph data in a distributed file system; and
each worker comprises:
a partial evaluation module for executing partial evaluation;
an incremental evaluation module for executing incremental evaluation;
a communication control module for controlling message passing; and
a storage module for managing graph data in a distributed file system.
Preferably, in the system, the coordinator comprises a load balancing module for assigning work according to load balance.
Preferably, in the system, the coordinator comprises an indexing module for supporting indexing structure.
Preferably, in the system, the coordinator comprises a compression module for adopting query preserving compression; and
each worker comprises a compression module for adopting query preserving compression.
Preferably, in the system, the coordinator comprises a fault tolerance module for recovering from worker failures or coordinator failures; and
each worker comprises a fault tolerance module for recovering from worker failures or coordinator failures.
Preferably, in the system, the workers are distributed processors, or processors in a single machine, or threads on a processor.
Preferably, in the system, the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
Various other objects, advantages and features of the present invention will become readily apparent from the ensuing detailed description, and the novel features will be particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF FIGURES
The following detailed descriptions, given by way of example, and not intended to limit the present invention solely thereto, will best be understood in conjunction with the accompanying figures:
FIG. 1 is a diagram of GRAPE architecture;
FIG. 2 is a diagram of workflow of GRAPE;
FIG. 3 is a diagram of PEval for SSSP; and
FIG. 4 is a diagram of IncEval for SSSP.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Definitions and terms used in the description.
Graphs. Consider graphs G = (V, E, L), directed or undirected, where (1) V is a finite set of nodes; (2) E ⊆ V × V is a set of edges; (3) each node v in V (resp. edge e ∈ E) carries L (v) (resp. L (e)), indicating its content, as found in social networks, knowledge bases and property graphs.
Graph G' = (V', E', L') is called a subgraph of G if V' ⊆ V, E' ⊆ E, and for each node v ∈ V' (resp. each edge e ∈ E'), L' (v) = L (v) (resp. L' (e) = L (e)).
Subgraph G' is said to be induced by V' if E' consists of all the edges in G whose endpoints are both in V'.
Partition strategy. Given a number m, a strategy 𝒫 partitions graph G into fragments F = (F 1, …, F m) such that each F i = (V i, E i, L i) is a subgraph of G, E = ∪ i∈ [1, m] E i, V = ∪ i∈ [1, m] V i, and F i resides at processor P i. Denote by
F i.I the set of nodes v ∈ V i such that there is an edge (v', v) incoming from a node v' in F j (i≠j);
F i.O the set of nodes v' such that there exists an edge (v, v') in E, v ∈ V i and v' is in some F j (i≠j); and
F.I = ∪ i∈ [1, m] F i.I and F.O = ∪ i∈ [1, m] F i.O.
In vertex-cut partition, F i.I and F i.O correspond to entry vertices and exit vertices, respectively. Refer to nodes in F i.I ∪ F i.O as the border nodes of F i w.r.t. 𝒫.
The fragmentation graph G 𝒫 of G via 𝒫 is an index such that, given each node v in F.I (or F.O), G 𝒫 retrieves the set of pairs (i, j) such that v ∈ F i.O and v ∈ F j.I with i≠j. As will be seen shortly, G 𝒫 helps us deduce the directions of messages.
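As an illustration only (not part of the patent text), the following is a minimal Python sketch of how F i.I, F i.O and the fragmentation-graph index could be derived from an edge-cut partition; the dict-based representation and the helper names border_nodes and fragmentation_graph are assumptions of this sketch.

from collections import defaultdict

def border_nodes(edges, owner):
    """edges: iterable of (u, v); owner: dict mapping node -> fragment id."""
    F_in = defaultdict(set)   # F_i.I: nodes of F_i with an incoming cross edge
    F_out = defaultdict(set)  # F_i.O: nodes outside F_i that F_i points to
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i != j:            # a cut edge (u, v) from fragment i to fragment j
            F_out[i].add(v)
            F_in[j].add(v)
    return F_in, F_out

def fragmentation_graph(F_in, F_out):
    """Index G_P: border node v -> set of (i, j) with v in F_i.O and v in F_j.I."""
    gp = defaultdict(set)
    for i, outs in F_out.items():
        for v in outs:
            for j, ins in F_in.items():
                if i != j and v in ins:
                    gp[v].add((i, j))
    return gp

edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
owner = {1: 0, 2: 0, 3: 1, 4: 1}          # two fragments, F_0 and F_1
F_in, F_out = border_nodes(edges, owner)
print(F_in, F_out, dict(fragmentation_graph(F_in, F_out)))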
To solve the technical problem in the prior art, a novel system is provided such that a sequential graph algorithm may be "plugged" into it as a whole (subject to minor changes), and it parallelizes the computation across multiple processors, without drastic degradation in performance or functionality of existing systems. The system is called GRAPE, i.e. a parallel GRAPh Engine, for graph computations such as traversal, pattern matching, connectivity and collaborative filtering.
A parallel model of GRAPE is provided in an embodiment.
According to the parallel model of GRAPE, a method of programming with GRAPE is also provided. The GRAPE system comprises a preprocessor, a coordinator P 0 and a set of m workers P 1, …, P m. The workers are distributed processors or threads on a processor or processors in a single machine. The preprocessor and the coordinator may be implemented completely or in part as centralized, decentralized or virtual components.
Given a partition strategy 𝒫 and sequential PEval, IncEval and Assemble for a class 𝒬 of graph queries, GRAPE parallelizes the computations as follows. It first partitions G into (F 1, …, F m) with 𝒫, and distributes the F i's across m shared-nothing virtual workers (P 1, …, P m). It maps the m virtual workers to n physical workers. When n<m, multiple virtual workers mapped to the same worker share memory. It also constructs the fragmentation graph G 𝒫. Note that G is partitioned once for all queries Q ∈ 𝒬 posed on G. In an embodiment, the partitioning step can be implemented by a preprocessor. In another embodiment, the fragments are pre-partitioned from a graph G, and loaded by n workers (P 1, …, P n) respectively.
Parallel model in an embodiment is disclosed below. Given Q ∈ 𝒬, GRAPE computes Q (G) in the partitioned G as shown in Fig. 2. Upon receiving Q at coordinator P 0, GRAPE posts the same Q to all the workers. It adopts synchronous message passing following BSP. Its parallel computation comprises three steps.
(1) Partial evaluation (PEval) . In the first superstep, upon receiving Q, each worker P i computes partial results Q (F i) locally at F i using PEval, in parallel (i∈ [1, m] ) . It also identifies and initializes a set of update parameters for each F i that records the status of its border nodes. At the end of the process, it generates a message from the  update parameters at each P i and sends it to coordinator P 0. In another embodiment, the messages are sent to peer workers directly.
(2) Incremental computation (IncEval) . GRAPE iterates the following supersteps until it terminates. Each superstep has two steps, one at P 0 and the other at the workers.
(2. a) Coordinator. Coordinator P 0 checks whether for all i∈ [1, m] , P i is inactive, i.e., P i is done with its local computation and there is no pending message designated for P i. If so, GRAPE invokes Assemble and terminates (see below) . Otherwise, P 0 routes messages from the last superstep to workers and triggers the next superstep.
(2. b) Workers. Upon receiving message M i, worker P i incrementally computes Q (F i ⊕ M i) with IncEval, by treating M i as updates, in parallel for all i∈ [1, m]. It automatically finds the changes to the update parameters in each F i, and sends the changes as a message to P 0. In another embodiment, the messages are sent to peer workers directly.
GRAPE supports data-partitioned parallelism by partial evaluation on local fragments, in parallel by all workers. Its incremental step (2. b) speeds up iterative graph computations by reusing the partial results from the last superstep.
(3) Termination (Assemble). The coordinator P 0 decides to terminate if there is no change to any update parameters (see (2. a) above). If so, P 0 pulls partial results from all workers, and computes Q (G) by Assemble. It returns Q (G).
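To make the three-step workflow concrete, here is a minimal, single-process Python sketch of the superstep loop; the callables peval, inceval, assemble and route stand in for the user-supplied PIE functions and the coordinator's message routing, and are assumptions of this sketch rather than GRAPE's actual API.

def run_pie(query, fragments, peval, inceval, assemble, route):
    # (1) PEval: partial evaluation on every fragment (conceptually in parallel).
    partial, outgoing = {}, {}
    for i, F in fragments.items():
        partial[i], outgoing[i] = peval(query, F)   # result + update parameters
    # (2) IncEval supersteps: iterate while any message is pending.
    inbox = route(outgoing)                          # coordinator groups messages
    while any(inbox.values()):
        outgoing = {}
        for i, M in inbox.items():
            if M:                                    # apply M_i to F_i: F_i ⊕ M_i
                partial[i], outgoing[i] = inceval(query, fragments[i], partial[i], M)
        inbox = route(outgoing)
    # (3) Termination: assemble the partial results into Q(G).
    return assemble(partial)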
The embodiment introduces the programming model of GRAPE. For a class 𝒬 of graph queries, one only needs to provide three core functions, PEval, IncEval and Assemble, referred to as a PIE program. These are conventional sequential algorithms from textbooks, or can be picked from the Library API of GRAPE.
An example of PIE program is disclosed below.
Example 1: Consider Single Source Shortest Path (SSSP) . Given a graph G with edges labeled with weights, and a source node s in G (as a query Q) , it is to find Q (G) including the shortest distance dist (s, v) from s to all nodes v in G.
Using GRAPE, one can pick the familiar Dijkstra's algorithm as PEval, and a bounded sequential incremental algorithm as IncEval. The only addition is that for each fragment F i, an integer variable dist (s, v) is declared for each node v, initially ∞ (except dist (s, s) = 0). As shown in Fig. 2, PEval first computes Q (F i); it then repeats IncEval to compute Q (F i ⊕ M i), where messages M i include updated (smaller) dist (s, u) (due to new "shortcuts" from s) for border nodes u, i.e., nodes with edges across different fragments. GRAPE guarantees the termination of the fixpoint computation, when no more dist (s, v) can be changed to a smaller value. At this point, Assemble takes a union of the Q (F i) as Q (G), which is provably correct.
That is, sequential algorithms can be adopted as PEval, IncEval and Assemble, with variables dist (s, v) specified for updating border nodes. GRAPE takes care of details such as message passing, load balancing and fault tolerance. There is no need to recast the entire algorithms into a new model.
PEval: Partial Evaluation
PEval takes a query Q ∈ 𝒬 and a fragment F i of G as input, and computes partial answers Q (F i) at worker P i in parallel for all i∈ [1, m]. It may be any existing sequential algorithm T for 𝒬, extended with the following:
partial result kept in a designated variable; and
message specification as its interface to IncEval.
Communication between workers is conducted via messages, defined in terms of update parameters as follows.
(1) Message preamble. PEval (a) declares status variables x̄, and (b) specifies a set C i of nodes and edges relative to F i.I or F i.O. The status variables associated with C i are denoted by C i.x̄, referred to as the update parameters of F i.
Intuitively, the variables in C i.x̄ are the candidates to be updated by incremental steps. In other words, messages M i to worker P i are updates to the values of the variables in C i.x̄.
More specifically, C i is specified by an integer d and S, where S is either F i.I or F i.O. That is, C i is the set of nodes and edges within d hops of the nodes in S. If d = 0, C i is F i.I or F i.O. Otherwise, C i may include nodes and edges from other fragments F j of G.
The variables are declared and initialized in PEval. At the end of PEval, it sends the values of C i.x̄ to coordinator P 0. In another embodiment, the values are sent to peer workers directly.
(2) Message segment. PEval may specify function aggregateMsg, to resolve conflicts when multiple messages from different workers attempt to assign different  values to the same update parameter (variable) . When such a strategy is not provided, GRAPE picks a default handler.
(3) Message grouping. GRAPE deduces updates to C i.x̄ for i∈ [1, m], and treats them as messages exchanged among workers. More specifically, at coordinator P 0, GRAPE identifies and maintains C i.x̄ for each worker P i. Upon receiving messages from the P i's, GRAPE works as follows.
(a) Identifying C i. It deduces C i for i∈ [1, m] by referencing the fragmentation graph G 𝒫, and C i remains unchanged in the entire process. It maintains the update parameters C i.x̄ for F i.
(b) Composing M i. For messages from each P i, GRAPE (i) identifies variables in C i.x̄ with changed values; (ii) deduces their designations P j by referencing G 𝒫: if 𝒫 is edge-cut, the variable tagged with a node v in F i.O will be sent to worker P j if v is in F j.I (i.e., if the pair (i, j) is in G 𝒫 (v)); similarly for v in F i.I; if 𝒫 is vertex-cut, it identifies nodes shared by F i and F j (i≠j); and (iii) it combines all changed variable values designated to P j into a single message M j, and sends M j to worker P j in the next superstep, for all j∈ [1, m].
These are automatically conducted by GRAPE, which minimizes communication costs by passing only updated variable values. To reduce the workload at the coordinator, alternatively each worker may maintain a copy of G 𝒫 and deduce the designation of its messages in parallel.
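The grouping logic of step (b) can be sketched in Python as follows; this illustrative fragment assumes the fragmentation-graph index of the earlier sketch and an edge-cut partition, and could serve as the route step of the earlier loop sketch. It is not GRAPE's implementation.

from collections import defaultdict

def compose_messages(changes, gp, aggregate=min):
    """changes: list of (v, value) pairs reported by workers;
       gp: fragmentation-graph index, border node v -> set of (i, j) pairs."""
    best = {}
    for v, val in changes:                     # resolve conflicts on the same v
        best[v] = val if v not in best else aggregate(best[v], val)
    messages = defaultdict(dict)               # M_j, keyed by destination worker j
    for v, val in best.items():
        for _, j in gp.get(v, ()):             # v in F_i.O goes to P_j with v in F_j.I
            messages[j][v] = val
    return messages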
Example 2: How GRAPE parallelizes SSSP is disclosed. Consider a directed graph G = (V, E, L) in which for each edge e, L (e) is a positive number. The length of a path (v 0, …, v k) in G is the sum of L (v i-1, v i) for i∈ [1, k]. For a pair (s, v) of nodes, denote by dist (s, v) the shortest distance from s to v, i.e., the length of a shortest path from s to v. Given graph G and a node s in V, GRAPE computes dist (s, v) for all v ∈ V. It adopts edge-cut partition. It deduces F i.O by referencing G 𝒫 and stores F i.O at each fragment F i.
As shown in Fig. 3, PEval (lines 1-14) is verbally identical to Dijkstra's sequential algorithm. The only changes are the message preamble and segment (underlined). It declares an integer variable dist (s, v) for each node v, initially ∞ (except dist (s, s) = 0). It specifies min as aggregateMsg to resolve conflicts: if there are multiple values for the same dist (s, v), the smallest value is taken by the linear order on integers. The update parameters are C i.x̄ = {dist (s, v) | v ∈ F i.O}. At the end of its process, PEval sends C i.x̄ to coordinator P 0. At P 0, GRAPE maintains dist (s, v) for all v ∈ F.O. Upon receiving messages from all workers, it takes the smallest value for each dist (s, v). It finds those variables with smaller values, deduces their destinations by referencing G 𝒫, groups them into messages M j, and sends M j to P j.
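A hedged Python sketch of such a PEval for SSSP is given below; the fragment layout (an adjacency dict plus the stored F i.O set) and the name peval_sssp are assumptions of the sketch, and Fig. 3 remains the authoritative version. It is Dijkstra's algorithm plus the message preamble: dist variables for all nodes, with the F i.O values returned as update parameters.

import heapq

def peval_sssp(source, frag):
    """frag: dict with 'adj' (node -> {neighbor: weight}) and 'O' (set F_i.O,
       border nodes stored with the fragment)."""
    nodes = set(frag['adj']) | set(frag['O'])
    dist = {v: float('inf') for v in nodes}    # status variables dist(s, v)
    heap = []
    if source in dist:
        dist[source] = 0.0
        heap = [(0.0, source)]
    while heap:                                # plain Dijkstra on the fragment
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for w, wt in frag['adj'].get(u, {}).items():
            if d + wt < dist[w]:
                dist[w] = d + wt
                heapq.heappush(heap, (dist[w], w))
    # update parameters: dist values for the border nodes in F_i.O
    updates = {v: dist[v] for v in frag['O'] if dist[v] < float('inf')}
    return dist, updates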
IncEval: Incremental Evaluation
Given query Q, fragment F i, partial results Q (F i) and message M i (updates to C i.x̄), IncEval computes Q (F i ⊕ M i) incrementally, making maximum reuse of the computation of Q (F i) in the last round. Each time after IncEval is executed, GRAPE treats F i ⊕ M i and Q (F i ⊕ M i) as F i and Q (F i), respectively, for the next round of incremental computation.
IncEval can take any existing sequential incremental algorithm ΔT for 𝒬. It shares the message preamble of PEval. At the end of the process, it identifies the changed values to C i.x̄ at each F i, and sends the changes as messages to P 0. In another embodiment, the changes are sent to peer workers directly. At the destination worker or P 0, GRAPE composes messages as described in 3 (b) above.
Boundedness. Graph computations are typically iterative. GRAPE reduces the costs of iterative computations by promoting bounded incremental algorithms for IncEval.
Consider an incremental algorithm ΔT for 𝒬. Given G, Q ∈ 𝒬, Q (G) and updates M to G, it computes ΔO such that Q (G ⊕ M) = Q (G) ⊕ ΔO, where ΔO denotes the changes to the old output O (G). It is said to be bounded if its cost can be expressed as a function of |CHANGED| = |ΔM| + |ΔO|, i.e., the size of the changes in the input and output [14, 32]. Intuitively, |CHANGED| represents the updating costs inherent to the incremental problem for 𝒬 itself. For a bounded IncEval, its cost is determined by |CHANGED|, not by the size |F i| of the entire F i, no matter how big |F i| is.
Example 3: Continuing with Example 2, IncEval in Fig. 4 is provided. It is the sequential incremental algorithm for SSSP, in response to changed dist (s, v) for v in F i.I (here M i includes the changes to dist (s, v) for v ∈ F i.I, deduced from G 𝒫). Using a queue Que, it starts with M i, propagates the changes to the affected area, and updates the distances. The partial result is now the revised distances (line 11).
At the end of the process, IncEval sends to coordinator P 0 the updated values of those status variables in C i.x̄, as in PEval. It applies aggregateMsg min to resolve conflicts.
The only changes to the algorithm are underlined in Fig. 4. It is easy to see that IncEval is bounded: its cost is determined by the sizes of the "updates" |M i| and the changes to the output. This reduces the cost of the iterative computation of SSSP (the while and for loops).
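A corresponding Python sketch of IncEval for SSSP follows; as above, the fragment layout and the name inceval_sssp are assumptions of the sketch, and Fig. 4 remains the authoritative version. It touches only the affected area, in the spirit of the bounded incremental algorithm.

import heapq

def inceval_sssp(frag, dist, M):
    """M: dict v -> new (smaller) dist(s, v) for border nodes v in F_i.I."""
    heap, changed = [], set()
    for v, d in M.items():                     # aggregateMsg = min on conflicts
        if d < dist.get(v, float('inf')):
            dist[v] = d
            changed.add(v)
            heapq.heappush(heap, (d, v))
    while heap:                                # propagate only the changes
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for w, wt in frag['adj'].get(u, {}).items():
            if d + wt < dist[w]:
                dist[w] = d + wt
                changed.add(w)
                heapq.heappush(heap, (dist[w], w))
    # outgoing message: decreased dist values on the border nodes in F_i.O
    return dist, {v: dist[v] for v in frag['O'] if v in changed}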
Assemble Partial Results
Function Assemble takes the partial results Q (F i) for i∈ [1, m] and the fragmentation graph G 𝒫 as input, and combines them to get Q (G). It is triggered when no more changes can be made to the update parameters C i.x̄ for any i∈ [1, m].
Example 4: Continuing with Example 3, Assemble (not shown) for SSSP takes Q (G) = ∪ i∈ [1, m] Q (F i), the union of the shortest distances for each node in each F i.
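For completeness, a one-function Python sketch of such an Assemble for SSSP (illustrative only): it takes the union of the partial results, keeping the minimum where a border node appears in several fragments.

def assemble_sssp(partials):
    """partials: dict worker id -> dist map computed for its fragment."""
    result = {}
    for dist in partials.values():
        for v, d in dist.items():
            result[v] = min(d, result.get(v, float('inf')))
    return result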
The GRAPE process terminates with correct Q (G). The updates to C i.x̄ are "monotonic": the value of dist (s, v) for each node v decreases or remains unchanged. There are finitely many such variables. Furthermore, dist (s, v) is the shortest distance from s to v, as warranted by the correctness of the sequential algorithms (PEval and IncEval).
Putting these together, it is clear that a PIE program parallelizes a graph query class 𝒬, provided with a sequential algorithm T (PEval) and a sequential incremental algorithm ΔT (IncEval) for 𝒬. Assemble is typically a straightforward sequential algorithm. A large number of sequential (incremental) algorithms are already in place for various 𝒬. Moreover, there have been methods for incrementalizing graph algorithms, to get incremental algorithms from their batch counterparts. Thus, GRAPE makes parallel graph computations accessible to a large group of end users.
In contrast to existing graph systems, GRAPE plugs in T and ΔT as a whole, and confines the communication specification to the message segment of PEval. Users do not have to think "like a vertex" when programming. As opposed to vertex-centric and block-centric systems, GRAPE runs sequential algorithms on entire fragments. Moreover, IncEval employs incremental evaluation to reduce cost, which is a unique feature of GRAPE. Note that IncEval speeds up iterative computations by minimizing unnecessary recomputation of Q (F i), no matter whether it is bounded or not.
In an embodiment, the partitioning step, the fragment-distributing step, the step of receiving the algorithms and message preambles, the step of processing the sequential algorithms into parallel versions, the preamble-processing step, and the communication-protocol-determining step can be implemented by a preprocessor.
An implementation of GRAPE system is discussed below.
Architecture overview. GRAPE adopts a four-tier architecture depicted in Fig. 1, described as follows.
(1) The top layer is a user interface. GRAPE supports interactions with (a) developers, who specify and register sequential PEval, IncEval and Assemble as a PIE program for a class 𝒬 of graph queries (the plug panel); and (b) end users, who plug in PIE programs from the API library, pick a graph G, enter queries Q ∈ 𝒬 and "play" (the play panel). GRAPE parallelizes the PIE program, computes Q (G) and displays Q (G) in the result and analytics consoles.
(2) At the core of the system is a parallel query engine. It manages sequential algorithms registered in GRAPE API, makes parallel evaluation plans for PIE programs, and executes the plans for query answering. It also enforces consistency control and fault tolerance (see below) .
(3) Underlying the query engine are (a) a Communication Controller (message passing interface) for communications between coordinator and workers, (b) an Index Manager for loading indices, (c) a Partition Manager to partition graphs, and (d) a Load Balancer to balance workload (see below) .
(4) The storage layer manages graph data in DFS (distributed file system) . It is accessible to the query engine, Index Manager, Partition Manager and Load Balancer.
Message passing. The Communication Controller of GRAPE makes use of a standard message passing interface for parallel and distributed programs. It currently adopts MPICH, which is also the basis of other systems such as GraphLab and Blogel. It generates messages and coordinates messages in synchronization steps using standard MPI primitives. Moreover, GRAPE also supports alternative communication mechanisms such as  Sockets, RPC and RDMA which support data communication between processors over an interconnected network. It supports both designated messages and key-value pairs.
Graph partition. The Graph Partitioner supports a variety of partition algorithms. Users may pick: (a) an edge-cut algorithm, which may split an edge across partitions, such as METIS/ParMETIS/PuLP/XtraPuLP; (b) a vertex-cut algorithm, which may split a vertex across partitions, such as 1D/2D partitions; or (c) a hybrid vertex- and edge-cut algorithm that may split both edges and vertices. New data partition algorithms, either for online (stream) or offline (batch) processing, can also be plugged into GRAPE.
Graph-level optimization. In contrast to prior graph systems, GRAPE supports data-partitioned parallelism by parallelizing the runs of sequential algorithms. Hence all optimization strategies developed for sequential (batch and incremental) algorithms can be readily plugged into GRAPE, to speed up PEval and IncEval over graph fragments. As examples, some optimization strategies are disclosed below.
(1) Indexing. Any indexing structure effective for sequential algorithm can be computed offline and directly used to optimize PEval, IncEval and Assemble, without recasting. GRAPE supports indices including (1) 2-hop index for reachability queries; and (2) neighborhood-index for candidate filtering in graph pattern matching. Moreover, new indices can be “plugged” into GRAPE API library.
(2) Compression. GRAPE adopts query-preserving compression at the fragment level. Given a query class 𝒬 and a fragment F i, each worker P i computes a smaller compressed fragment F i′ offline via a compression algorithm, such that for any query Q in 𝒬, Q (F i) can be computed from F i′ without decompressing F i′, regardless of what sequential PEval and IncEval are used. This compression scheme is effective for graph pattern matching and graph traversal, among others.
(3) Dynamic grouping. GRAPE dynamically groups a set of border nodes by adding a “dummy” node, and sends messages from the dummy node in batches, instead of one by one. This effectively reduces the amount of message communication in each synchronization step.
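A minimal sketch of this grouping step is given below: messages destined for border nodes in the same fragment are accumulated into one batch, which is then shipped as a single message from the dummy node. The Message type and function name are illustrative assumptions.

```cpp
#include <unordered_map>
#include <vector>

struct Message {
  int target_node;   // border node the update is for
  long long value;   // e.g. a tentative SSSP distance
};

// A minimal sketch of dynamic grouping: rather than one send per border
// node, all messages bound for the same destination fragment are grouped,
// so each synchronization step ships one batch per fragment. Illustrative.
std::unordered_map<int, std::vector<Message>> GroupByFragment(
    const std::vector<Message>& msgs,
    const std::unordered_map<int, int>& node_to_fragment) {
  std::unordered_map<int, std::vector<Message>> batches;  // fragment -> batch
  for (const Message& m : msgs) {
    batches[node_to_fragment.at(m.target_node)].push_back(m);
  }
  return batches;
}
```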
Load balancing. GRAPE groups computation tasks into work units, and estimates the cost at each virtual worker P i in terms of the fragment size |F i| at P i, the number of border nodes in F i, and the complexity of the computation over F i. Its Load Balancer computes an assignment of the work units to physical workers, to minimize both computational cost and communication cost (GRAPE employs m virtual workers and n physical workers, with m > n). The bi-criteria objective makes it easy to deal with skewed graphs, in which a small fraction of the nodes are adjacent to a large fraction of the edges in G, as found in social graphs.
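One way to picture the assignment, under the simplifying assumption that each work unit's bi-criteria cost has been folded into a single number, is the classic greedy longest-processing-time heuristic sketched below; this is an illustration, not the Load Balancer's actual algorithm.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// A minimal sketch: assign m virtual workers (work units) to n physical
// workers (m > n) by always handing the most expensive remaining unit to
// the currently least loaded physical worker. Illustrative only.
std::vector<int> AssignUnits(const std::vector<double>& unit_cost, int n) {
  std::vector<int> order(unit_cost.size());
  for (int i = 0; i < static_cast<int>(order.size()); ++i) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return unit_cost[a] > unit_cost[b]; });

  using Load = std::pair<double, int>;  // (current load, physical worker)
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least;
  for (int w = 0; w < n; ++w) least.push({0.0, w});

  std::vector<int> assignment(unit_cost.size());
  for (int u : order) {
    auto [load, w] = least.top();
    least.pop();
    assignment[u] = w;                    // unit u runs on physical worker w
    least.push({load + unit_cost[u], w});
  }
  return assignment;
}
```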
To the knowledge of the person skilled in the art, these optimization strategies are not supported by the state-of-the-art vertex-centric and block-centric systems. Indexing and query-preserving compression for sequential algorithms do not carry over to vertex programs, and block-centric programming essentially treats blocks as vertices rather than graphs. Moreover, dynamic grouping does not help vertex-level synchronization.
Fault tolerance. GRAPE employs an arbitrator mechanism to recover from both worker failures and coordinator failures (a.k.a. single-point failures). It reserves a worker P a as arbitrator, and a worker S c′ as a standby coordinator. The arbitrator keeps sending heart-beat signals to all workers and the coordinator. In case of failure: (a) if a worker fails to respond, the arbitrator transfers its computation tasks to another worker; and (b) if the coordinator fails, the arbitrator activates the standby coordinator S c′ to continue the computation.
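The failure-detection side of this mechanism can be sketched as follows, assuming the arbitrator records the last time each party answered a heart-beat ping; the class name and timeout value are illustrative assumptions.

```cpp
#include <chrono>
#include <unordered_map>

// A minimal sketch of the arbitrator's bookkeeping: a worker (or the
// coordinator) that has not answered a heart-beat ping within the timeout
// is treated as failed, triggering task transfer or standby activation.
class Arbitrator {
 public:
  using Clock = std::chrono::steady_clock;

  void OnHeartbeatReply(int node_id, Clock::time_point now) {
    last_reply_[node_id] = now;
  }

  bool IsFailed(int node_id, Clock::time_point now,
                Clock::duration timeout = std::chrono::seconds(10)) const {
    auto it = last_reply_.find(node_id);
    return it == last_reply_.end() || now - it->second > timeout;
  }

 private:
  std::unordered_map<int, Clock::time_point> last_reply_;
};
```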
Consistency. Multiple workers may update copies of the same status variable. To cope with this, (a) GRAPE allows users to specify a conflict resolution policy as the function aggregateMsg in PEval, e.g. min for SSSP and CC, based on a partial order on the domain of the status variables, e.g. the linear order on integers. Based on the policy, inconsistencies are resolved in each synchronization step of the PEval and IncEval processes; Theorem 1 guarantees consistency when the policy satisfies the monotonic condition. (b) GRAPE also supports default exception handlers when users opt not to specify aggregateMsg. In addition, GRAPE allows users to specify generic consistency control strategies and register them in the GRAPE API library.
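For instance, with min as the aggregateMsg policy (as for SSSP distances), conflicting updates to copies of the same status variable can be folded as in the sketch below; the types and function name are illustrative assumptions.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// A minimal sketch of min-based conflict resolution: among all updates
// addressed to the same border node in a synchronization step, the
// smallest value wins, matching a partial order under which the policy
// is monotonic. Illustrative only.
std::unordered_map<int, long long> ResolveWithMin(
    const std::vector<std::pair<int, long long>>& updates) {
  std::unordered_map<int, long long> resolved;  // node -> winning value
  for (const auto& [node, value] : updates) {
    auto it = resolved.find(node);
    if (it == resolved.end() || value < it->second) resolved[node] = value;
  }
  return resolved;
}
```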
A lightweight transaction controller is also employed, to support not only queries but also updates such as insertions and deletions of nodes and edges. When the load is light, it adopts non-destructive updates of functional databases. Otherwise, it switches to multi-version concurrency control that keeps track of timestamps and versions, as also adopted by existing distributed systems.
In addition to SSSP in the afore-mentioned Examples 1-4, the parallelization system of the present invention is applicable in such contexts as PageRank, pattern matching (via simulation and isomorphism), connectivity, keyword search and collaborative filtering, as well as other graph computations.
Our invention provides the following advantages.
I. Our invention makes it easier to implement distributed parallel programming. A relatively inexperienced user, by inserting existing sequential algorithms into our parallelization system, is enabled to handle large amounts of graph data in a distributed fashion without recasting her own algorithms for parallel computation.
II. Automated parallelization, together with the deployment of Partial Evaluation and Incremental Evaluation, avoids unnecessary conversion of existing systems into new ones. With simple adaptation, well-established sequential algorithms for graph computation remain functional in distributed contexts.
III. Our invention guarantees the correctness of parallelized graph computation. When the three sequential algorithms provided by the user are correct and the message aggregation is monotonic, the automatically parallelized computation is guaranteed to terminate and return the correct result.
IV. Our invention significantly boosts the efficiency of graph computations. Experiments indicate that our system is up to 400 times faster than existing systems for graph computation, and communication cost can drop to one ten-millionth of that of existing systems.
V. Aided by our invention, computation performance is enhanced and cost drops for such algorithms as graph traversal (DFS, BFS, single-source shortest path), connectivity (strongly connected components, minimum spanning tree), graph pattern matching (via simulation and subgraph isomorphism), keyword search and machine learning (collaborative filtering, triangle counting, PageRank).
VI. GRAPE is experimentally evaluated against (a) Giraph, an open-source version of Pregel; (b) GraphLab, an asynchronous vertex-centric system; and (c) Blogel, possibly the fastest block-centric system. Over real-life graphs, in addition to the ease of programming, GRAPE achieves performance comparable to or better than the state-of-the-art systems. For instance, (a) GRAPE is more than 480, 36 and 15 times faster than Giraph, GraphLab and Blogel for SSSP, respectively, up to 5 orders of magnitude faster for CC, more than 150, 6 and 16 times faster for Sim, 70, 14 and 3 times faster for SubIso, and 5, 3 and 12 times faster for CF on average. (b) In the same setting, GRAPE ships on average 0.07%, 0.12% and 1.7% of the data shipped by Giraph, GraphLab and Blogel for SSSP, 0.9%, 0.14% and 4.9% for Sim, 0.18%, 0.23% and 0.11% for SubIso, and 5.6%, 43.3% and 3.2% for CF, respectively. (c) Incremental steps effectively reduce the cost and improve performance by 9.6 times. (d) Optimization strategies developed for sequential algorithms remain effective in GRAPE, improving performance by 2 times on average.
Having described at least one of the embodiments of the claimed invention with reference to the accompanying drawings, it will be apparent to those skilled in the art that the invention is not limited to those precise embodiments, and that various modifications and variations can be made to the presently disclosed system without departing from the scope or spirit of the invention. Thus, it is intended that the present disclosure cover modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents. Specifically, one or more of the limitations recited throughout the specification can be combined at any level of detail to the extent they are described to improve the method or the system.

Claims (17)

  1. A method for parallelizing a sequential graph computation, comprising:
    partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
    distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
    receiving by the preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator or one of the workers; and
    returning, from the assemble function, the result Q (G) as the answer to the query Q.
  2. The method in claim 1, further comprising: selecting a partition strategy 𝒫 by the preprocessor before partitioning the graph G.
  3. The method in claim 1, wherein the workers are distributed processors, or processors in a single machine, or threads on a processor.
  4. The method in claim 1, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
  5. The method in claim 1, further comprising: identifying and initializing, by each worker P i while executing the partial evaluation, a set of update parameters for its fragment F i that records the status of its border nodes.
  6. The method in claim 1, further comprising routing messages by the coordinator to workers during the message passing.
  7. The method in claim 1, further comprising routing messages to other workers by each worker P i during the message passing.
  8. The method in claim 1, further comprising treating, by worker P i, F i ⊕ M i and Q (F i ⊕ M i) as F i and Q (F i) , respectively, for the next computation each time after executing incremental evaluation.
  9. A method for parallelizing a sequential graph computation, comprising:
    loading fragments (F 1, …, F n) by n workers (P 1, …, P n) , respectively, wherein the fragments are pre-partitioned from a graph G;
    receiving by a preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator; and
    returning, from the coordinator, the result Q (G) as the answer to the query Q.
  10. A non-transitory computer readable medium with instructions stored thereon for parallelizing a sequential graph computation, wherein the instructions, when executed by a processor, perform the steps comprising:
    partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
    distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
    receiving by the preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator or one of the workers; and
    returning, from the assemble function, the result Q (G) as the answer to the query Q.
  11. A system for parallelizing a sequential graph computation, comprising a preprocessor, a coordinator and a plurality of workers, wherein:
    the preprocessor comprises:
    an automated parallelization module for receiving, parsing and automatically parallelizing inputted algorithms, and for distributing the parallelized algorithms to the workers;
    a partition manager module for partitioning the graph and managing fragments;
    a communication control module for processing the message preambles and determining the communication protocol; and
    a storage module for managing graph data in a distributed file system;
    the coordinator comprises:
    a query parser module for receiving, parsing and distributing a query to workers;
    an assemble module for assembling result for a query;
    a storage module for managing graph data in a distributed file system; and
    each worker comprises:
    a partial evaluation module for executing partial evaluation;
    an incremental evaluation module for executing incremental evaluation;
    a communication control module for controlling message passing; and
    a storage module for managing graph data in a distributed file system.
  12. The system in claim 11, wherein the coordinator comprises a load balancing module for assigning work so as to balance load.
  13. The system in claim 11, wherein the coordinator comprises an indexing module for supporting indexing structures.
  14. The system in claim 11, wherein:
    the coordinator comprises a compression module for adopting query preserving compression; and
    each worker comprises a compression module for adopting query preserving compression.
  15. The system in claim 11, wherein:
    the coordinator comprises a fault tolerance module for recovering from worker failures or coordinator failures; and
    each worker comprises a fault tolerance module for recovering from worker failures or coordinator failures.
  16. The system in claim 11, wherein the workers are distributed processors, or processors in a single machine, or threads on a processor.
  17. The system in claim 11, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
PCT/CN2018/086454 2017-05-12 2018-05-11 Method and system for parallelizing sequential graph computation WO2018205986A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNPCT/CN2017/084083 2017-05-12
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data

Publications (1)

Publication Number Publication Date
WO2018205986A1 true WO2018205986A1 (en) 2018-11-15

Family

ID=64104033

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data
PCT/CN2018/086454 WO2018205986A1 (en) 2017-05-12 2018-05-11 Method and system for parallelizing sequential graph computation

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data

Country Status (1)

Country Link
WO (2) WO2018205246A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170097853A1 (en) * 2013-01-31 2017-04-06 International Business Machines Corporation Realizing graph processing based on the mapreduce architecture
CN104158840A (en) * 2014-07-09 2014-11-19 东北大学 Method for calculating node similarity of chart in distributing manner
CN105447156A (en) * 2015-11-30 2016-03-30 北京航空航天大学 Resource description framework distributed engine and incremental updating method
CN106033476A (en) * 2016-05-19 2016-10-19 西安交通大学 Incremental graphic computing method in distributed computing mode under cloud computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN, WENFEI ET AL.: "Distributed Graph Simulation: Impossibility and Possibility", PROCEEDINGS OF THE VLDB ENDOWMENT, vol. 7, no. 12, 5 September 2014 (2014-09-05), pages 1084 - 1090, XP055550374 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11681838B2 (en) 2020-05-26 2023-06-20 Landmark Graphics Corporation Distributed Sequential Gaussian Simulation

Also Published As

Publication number Publication date
WO2018205246A1 (en) 2018-11-15

Similar Documents

Publication Publication Date Title
Fan et al. Parallelizing sequential graph computations
CN107239335B (en) Job scheduling system and method for distributed system
US20220067025A1 (en) Ordering transaction requests in a distributed database according to an independently assigned sequence
US20210027000A1 (en) Simulation Systems and Methods
Fan et al. The Case Against Specialized Graph Analytics Engines.
Malewicz et al. Pregel: a system for large-scale graph processing
Sarwat et al. Horton: Online query execution engine for large distributed graphs
US9400767B2 (en) Subgraph-based distributed graph processing
US11416305B2 (en) Commands for simulation systems and methods
CN109740765A (en) A kind of machine learning system building method based on Amazon server
CN114691658A (en) Data backtracking method and device, electronic equipment and storage medium
CN108153859A (en) A kind of effectiveness order based on Hadoop and Spark determines method parallel
WO2018205986A1 (en) Method and system for parallelizing sequential graph computation
Chen et al. GraphHP: A hybrid platform for iterative graph processing
US20150150011A1 (en) Self-splitting of workload in parallel computation
Kumar et al. Graphsteal: Dynamic re-partitioning for efficient graph processing in heterogeneous clusters
CN117076563A (en) Pruning method and device applied to blockchain
CN110196879B (en) Data processing method, device, computing equipment and storage medium
WO2022247869A1 (en) Method for searching for data, apparatus, and device
Kemme et al. Dagstuhl seminar review: Consistency in distributed systems
Fan et al. Think sequential, run parallel
Higashino et al. Attributed graph rewriting for complex event processing self-management
Qadah et al. Highly Available Queue-oriented Speculative Transaction Processing
RU2813571C1 (en) Method of parallelizing programs in multiprocessor computer
Achar et al. Got: Git, but for Objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18799186

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18799186

Country of ref document: EP

Kind code of ref document: A1