WO2018205986A1 - Method and system for parallelizing sequential graph computation - Google Patents

Info

Publication number
WO2018205986A1
Authority
WO
WIPO (PCT)
Prior art keywords
worker
coordinator
workers
preprocessor
query
Application number
PCT/CN2018/086454
Other languages
French (fr)
Inventor
Wenfei Fan
Jingbo XU
Wenyuan Yu
Original Assignee
Shanghai Putu Technology Partnership (General Partnership)
Application filed by Shanghai Putu Technology Partnership (General Partnership) filed Critical Shanghai Putu Technology Partnership (General Partnership)
Publication of WO2018205986A1 publication Critical patent/WO2018205986A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Definitions

  • The parallelization system of the present invention is applicable in contexts such as PageRank, pattern matching (via simulation and isomorphism), connectivity, keyword search and collaborative filtering, as well as other graph computations.
  • GRAPE has been experimentally evaluated against (a) Giraph, an open-source version of Pregel, (b) GraphLab, an asynchronous vertex-centric system, and (c) Blogel, arguably the fastest block-centric system. Over real-life graphs, in addition to its ease of programming, GRAPE achieves performance comparable to these state-of-the-art systems. For instance, on average GRAPE is more than 480, 36 and 15 times faster than Giraph, GraphLab and Blogel for SSSP, respectively; up to 5 orders of magnitude faster for CC; more than 150, 6 and 16 times faster for Sim; 70, 14 and 3 times faster for SubIso; and 5, 3 and 12 times faster for CF.


Abstract

A method and system for parallelizing sequential graph computations are provided. The method comprises: partitioning a graph G into fragments; distributing the fragments and parallelized algorithms across n workers, respectively; receiving a query Q at a coordinator and posting Q to all workers; executing partial evaluation by each worker against its local fragment and generating messages; exchanging messages among the workers; executing incremental evaluation by each worker, upon receiving a message, against its local fragment updated by the message; iterating the incremental evaluation until no further update message can be made for any fragment; computing a complete result by assembling, at the coordinator, the partial results from the workers; and returning the result as the answer to the query Q. The system is also disclosed by illustrating its architecture in a figure.

Description

METHOD AND SYSTEM FOR PARALLELIZING SEQUENTIAL GRAPH COMPUTATION
RELATED APPLICATION
The present application claims the benefit of PCT/CN2017/084083 filed May 12, 2017, which is incorporated herein by reference in its entirety.
FIELD OF THE INVENTION
The present invention relates to graph computations. More particularly, this invention describes solutions on parallel graph computations.
BACKGROUND OF THE INVENTION
A variety of models and systems for parallelizing graph computation are available in the prior art. Such models include: PRAM (Parallel Random Access Machine), BSP (Bulk Synchronous Parallel) and MapReduce (Dean, J. and Ghemawat, S. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM. 51, 1 (2008)). PRAM abstracts parallel RAM access over shared memory. BSP models parallel computations in supersteps (including local computation, communication and a synchronization barrier) to synchronize communication among workers. Pregel (Giraph) (Malewicz, G., Austern, M.H., Bik, A.J.C., Dehnert, J.C., Horn, I., Leiser, N. and Czajkowski, G. 2010. Pregel: A system for large-scale graph processing. SIGMOD (2010)) implements BSP with vertex-centric programming, where a superstep executes a user-defined function at each vertex in parallel. GraphLab (Low, Y., Gonzalez, J., Kyrola, A., Bickson, D., Guestrin, C. and Hellerstein, J.M. 2012. Distributed GraphLab: A framework for machine learning in the cloud. PVLDB. 5, 8 (2012)) revises BSP to pass messages asynchronously. Block-centric models extend vertex-centric programming to blocks, to exchange messages among blocks. Popular graph systems also include: GraphX (Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I. 2014. GraphX: Graph processing in a distributed dataflow framework. OSDI (2014)), GRACE (G. Wang, W. Xie, A.J. Demers, and J. Gehrke. Asynchronous large-scale graph processing made easy. In CIDR, 2013), GPS (Salihoglu, S. and Widom, J. 2013. GPS: a graph processing system. SSDBM (2013)), etc. GraphX recasts graph computation in its distributed dataflow framework as a sequence of join and group-by stages punctuated by map operations over the Spark platform. GRACE provides an operator-level, iterative programming model to enhance synchronous BSP with asynchronous execution. GPS implements Pregel with extended APIs and partition strategies. All these systems require recasting of sequential algorithms into parallel algorithms before one can take advantage of the parallel model and system.
A number of graph algorithms have been developed in MapReduce, vertex-centric models and others. Prior work on automated parallelization, however, has focused on the instruction or operator level, breaking dependencies via symbolic and automata analyses. There has also been work at the data partition level, to perform multi-level partition ("parallel abstraction") and enable locality-optimized access that adapts to different parallel abstractions. Such adaptation is not easy, and hence these approaches target experienced developers of parallel algorithms. Thus, an easier method for parallelizing a sequential graph computation has been eagerly sought.
OBJECTS AND SUMMARY OF THE INVENTION
Therefore, it is an object of the claimed invention to provide a significantly improved method and system for parallelizing sequential graph computations.
In an embodiment, a method for parallelizing a sequential graph computation is provided. The method comprises:
partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
distributing the fragments by the preprocessor across n workers (P 1, …, P n) , respectively;
receiving by the preprocessor three sequential algorithms and the message preambles;
processing by the preprocessor the sequential algorithms into parallel version;
processing the message preambles and determining the communication protocol by the preprocessor;
receiving a query Q at a coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i, and generating message Mi;
exchanging messages among the workers P i;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i (the same definition holds throughout this description);
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
Preferably, the method further comprises: selecting a partition strategy 𝒫 by the preprocessor before partitioning the graph G.
In embodiments of the method, the workers are distributed processors, or processors in a single machine, or threads on a processor.
In an embodiment of the method, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
Preferably, the method further comprises: identifying and initializing by each worker P i while executing the partial evaluation a set of update parameters for each F i that records the status of its border nodes.
Preferably, the method further comprises: routing messages by the coordinator to workers during the message passing.
Preferably, the method further comprises: routing messages to other workers by each worker P i during the message passing.
Preferably, the method further comprises: treating by worker P i
Figure PCTCN2018086454-appb-000004
and 
Figure PCTCN2018086454-appb-000005
as F i and Q (F i) , respectively, for a next computation each time after executing incremental evaluation.
In an embodiment, a method for parallelizing a sequential graph computation is provided, the method comprises:
loading fragments (F 1, …, F n) by n workers (P 1, …, P n) respectively, wherein, the fragments are pre-partitioned from a graph G;
receiving by a preprocessor three sequential algorithms and the message preambles;
processing by the preprocessor the sequential algorithms into parallel version;
processing the message preambles and determining the communication protocol by the preprocessor;
receiving a query Q at a coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i, and generating message Mi;
exchanging messages among the workers P i;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i;
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
In an embodiment, provided is a non-transitory computer readable medium with instructions stored thereon for parallelizing a sequential graph computation, the instructions, when executed by one or more processors, performing the steps comprising:
partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
distributing the graph computation algorithm by the preprocessor to all n workers;
receiving a query Q at the coordinator and posting the query Q to all n workers;
executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
computing partial result Q (F i) by each worker P i;
exchanging partial results among the workers P i via message passing;
executing incremental evaluation by worker P i upon receiving a message M i against local F i updated by M i;
computing updated partial result Q (F i ⊕ M i) by worker P i;
iterating the incremental evaluation until no further updates M i can be made to any F i;
pulling updated partial results by the coordinator from all workers;
computing a complete result Q (G) via Assemble by the coordinator; and
returning from the coordinator the result Q (G) as the answer to the query Q.
In an embodiment, a system for parallelizing a sequential graph computation is provided.
A system for parallelizing a sequential graph computation, comprising a preprocessor, a coordinator and a plurality of workers, wherein:
the preprocessor comprises:
an automatic parallelization module for receiving, parsing and automatically parallelizing inputted algorithms and for distributing the algorithms to workers;
a partition manager module for partitioning graph and managing fragments;
a communication control module for processing the message preambles and determining the communication protocol; and
a storage module for managing graph data in a distributed file system;
the coordinator comprises:
a query parser module for receiving, parsing and distributing a query to workers;
an assemble module for assembling result for a query;
a communication control module for controlling message passing; and
a storage module for managing graph data in a distributed file system; and
each worker comprises:
a partial evaluation module for executing partial evaluation;
an incremental evaluation module for executing incremental evaluation;
a communication control module for controlling message passing; and
a storage module for managing graph data in a distributed file system.
Preferably, in the system, the coordinator comprises a load balancing module for assigning work according to load balance.
Preferably, in the system, the coordinator comprises an indexing module for supporting indexing structure.
Preferably, in the system, the coordinator comprises a compression module for adopting query preserving compression; and
each worker comprises a compression module for adopting query preserving compression.
Preferably, in the system, the coordinator comprises a fault tolerance module for recovering from worker failures or coordinator failures; and
each worker comprises a fault tolerance module for recovering from worker failures or coordinator failures.
Preferably, in the system, the workers are distributed processors, or processors in a single machine, or threads on a processor.
Preferably, in the system, the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
Various other objects, advantages and features of the present invention will become readily apparent from the ensuing detailed description, and the novel features will be particularly pointed out in the appended claims.
BRIEF DESCRIPTION OF FIGURES
The following detailed descriptions, given by way of example, and not intended to limit the present invention solely thereto, will best be understood in conjunction with the accompanying figures:
FIG. 1 is a diagram of GRAPE architecture;
FIG. 2 is a diagram of workflow of GRAPE;
FIG. 3 is a diagram of PEval for SSSP; and
FIG. 4 is a diagram of IncEval for SSSP.
DETAILED DESCRIPTION OF THE EMBODIMENTS
Definitions and terms used in the description.
Graphs. Consider graphs G = (V, E, L), directed or undirected, where (1) V is a finite set of nodes; (2) E ⊆ V × V is a set of edges; (3) each node v in V (resp. edge e ∈ E) carries L (v) (resp. L (e)), indicating its content, as found in social networks, knowledge bases and property graphs.
Graph G' = (V', E', L') is called a subgraph of G if V' ⊆ V, E' ⊆ E, and for each node v ∈ V' (resp. each edge e ∈ E'), L' (v) = L (v) (resp. L' (e) = L (e)).
Subgraph G' is said to be induced by V' if E' consists of all the edges in G whose endpoints are both in V'.
Partition strategy. Given a number m, a strategy 𝒫 partitions graph G into fragments F = (F 1, …, F m) such that each F i = (V i, E i, L i) is a subgraph of G, E = ∪ i∈ [1, m] E i, V = ∪ i∈ [1, m] V i, and F i resides at processor P i. Denote by
F i.I the set of nodes v ∈ V i such that there is an edge (v', v) incoming from a node v' in F j (i≠j);
F i.O the set of nodes v' such that there exists an edge (v, v') in E, v ∈ V i and v' is in some F j (i≠j); and
F.I = ∪ i∈ [1, m] F i.I and F.O = ∪ i∈ [1, m] F i.O.
In vertex-cut partition, F i.I and F i.O correspond to entry vertices and exit vertices, respectively. Refer to nodes in F i.I ∪ F i.O as the border nodes of F i w.r.t. 𝒫.
The fragmentation graph G 𝒫 of G via 𝒫 is an index such that, given each node v in F.I (or F.O), G 𝒫 retrieves the set of pairs (i, j) such that v ∈ F i.O and v ∈ F j.I with i≠j. As will be seen shortly, G 𝒫 helps us deduce the directions of messages.
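As an illustration only (not part of the patent text), the following is a minimal Python sketch of how F i.I, F i.O and the fragmentation-graph index could be derived from an edge-cut partition; the dict-based representation and the helper names border_nodes and fragmentation_graph are assumptions of this sketch.

from collections import defaultdict

def border_nodes(edges, owner):
    """edges: iterable of (u, v); owner: dict mapping node -> fragment id."""
    F_in = defaultdict(set)   # F_i.I: nodes of F_i with an incoming cross edge
    F_out = defaultdict(set)  # F_i.O: nodes outside F_i that F_i points to
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i != j:            # a cut edge (u, v) from fragment i to fragment j
            F_out[i].add(v)
            F_in[j].add(v)
    return F_in, F_out

def fragmentation_graph(F_in, F_out):
    """Index G_P: border node v -> set of (i, j) with v in F_i.O and v in F_j.I."""
    gp = defaultdict(set)
    for i, outs in F_out.items():
        for v in outs:
            for j, ins in F_in.items():
                if i != j and v in ins:
                    gp[v].add((i, j))
    return gp

edges = [(1, 2), (2, 3), (3, 4), (4, 1)]
owner = {1: 0, 2: 0, 3: 1, 4: 1}          # two fragments, F_0 and F_1
F_in, F_out = border_nodes(edges, owner)
print(F_in, F_out, dict(fragmentation_graph(F_in, F_out)))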
To solve the technical problem in the prior art, a novel system is provided such that a sequential graph algorithm may be "plugged" into it as a whole (subject to minor changes), and it parallelizes the computation across multiple processors, without drastic degradation in performance or functionality of existing systems. The system is called GRAPE, i.e. a parallel GRAPh Engine, for graph computations such as traversal, pattern matching, connectivity and collaborative filtering.
A parallel model of GRAPE is provided in an embodiment.
According to the parallel model of GRAPE, a method of programming with GRAPE is also provided. The GRAPE system comprises a preprocessor, a coordinator P 0 and a set of m workers P 1, …, P m. The workers are distributed processors or threads on a processor or processors in a single machine. The preprocessor and the coordinator may be implemented completely or in part as centralized, decentralized or virtual components.
Given a partition strategy 𝒫 and sequential PEval, IncEval and Assemble for a class 𝒬 of graph queries, GRAPE parallelizes the computations as follows. It first partitions G into (F 1, …, F m) with 𝒫, and distributes the F i's across m shared-nothing virtual workers (P 1, …, P m). It maps the m virtual workers to n physical workers. When n<m, multiple virtual workers mapped to the same worker share memory. It also constructs the fragmentation graph G 𝒫. Note that G is partitioned once for all queries Q ∈ 𝒬 posed on G. In an embodiment, the partitioning step can be implemented by a preprocessor. In another embodiment, the fragments are pre-partitioned from a graph G, and loaded by n workers (P 1, …, P n) respectively.
Parallel model in an embodiment is disclosed below. Given Q ∈ 𝒬, GRAPE computes Q (G) in the partitioned G as shown in Fig. 2. Upon receiving Q at coordinator P 0, GRAPE posts the same Q to all the workers. It adopts synchronous message passing following BSP. Its parallel computation comprises three steps.
(1) Partial evaluation (PEval) . In the first superstep, upon receiving Q, each worker P i computes partial results Q (F i) locally at F i using PEval, in parallel (i∈ [1, m] ) . It also identifies and initializes a set of update parameters for each F i that records the status of its border nodes. At the end of the process, it generates a message from the  update parameters at each P i and sends it to coordinator P 0. In another embodiment, the messages are sent to peer workers directly.
(2) Incremental computation (IncEval) . GRAPE iterates the following supersteps until it terminates. Each superstep has two steps, one at P 0 and the other at the workers.
(2. a) Coordinator. Coordinator P 0 checks whether for all i∈ [1, m] , P i is inactive, i.e., P i is done with its local computation and there is no pending message designated for P i. If so, GRAPE invokes Assemble and terminates (see below) . Otherwise, P 0 routes messages from the last superstep to workers and triggers the next superstep.
(2. b) Workers. Upon receiving message M i, worker P i incrementally computes Q (F i ⊕ M i) with IncEval, by treating M i as updates, in parallel for all i∈ [1, m]. It automatically finds the changes to the update parameters in each F i, and sends the changes as a message to P 0. In another embodiment, the messages are sent to peer workers directly.
GRAPE supports data-partitioned parallelism by partial evaluation on local fragments, in parallel by all workers. Its incremental step (2. b) speeds up iterative graph computations by reusing the partial results from the last superstep.
(3) Termination (Assemble). The coordinator P 0 decides to terminate if there is no change to any update parameters (see (2. a) above). If so, P 0 pulls partial results from all workers, and computes Q (G) by Assemble. It returns Q (G).
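To make the three-step workflow concrete, here is a minimal, single-process Python sketch of the superstep loop; the callables peval, inceval, assemble and route stand in for the user-supplied PIE functions and the coordinator's message routing, and are assumptions of this sketch rather than GRAPE's actual API.

def run_pie(query, fragments, peval, inceval, assemble, route):
    # (1) PEval: partial evaluation on every fragment (conceptually in parallel).
    partial, outgoing = {}, {}
    for i, F in fragments.items():
        partial[i], outgoing[i] = peval(query, F)   # result + update parameters
    # (2) IncEval supersteps: iterate while any message is pending.
    inbox = route(outgoing)                          # coordinator groups messages
    while any(inbox.values()):
        outgoing = {}
        for i, M in inbox.items():
            if M:                                    # apply M_i to F_i: F_i ⊕ M_i
                partial[i], outgoing[i] = inceval(query, fragments[i], partial[i], M)
        inbox = route(outgoing)
    # (3) Termination: assemble the partial results into Q(G).
    return assemble(partial)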
The embodiment introduces the programming model of GRAPE. For a class 𝒬 of graph queries, one only needs to provide three core functions, PEval, IncEval and Assemble, referred to as a PIE program. These are conventional sequential algorithms from textbooks, or can be picked from the Library API of GRAPE.
An example of PIE program is disclosed below.
Example 1: Consider Single Source Shortest Path (SSSP) . Given a graph G with edges labeled with weights, and a source node s in G (as a query Q) , it is to find Q (G) including the shortest distance dist (s, v) from s to all nodes v in G.
Using GRAPE, one can pick the familiar Dijkstra's algorithm as PEval, and a bounded sequential incremental algorithm as IncEval. The only addition is that for each fragment F i, an integer variable dist (s, v) is declared for each node v, initially ∞ (except dist (s, s) = 0). As shown in Fig. 2, PEval first computes Q (F i); it then repeats IncEval to compute Q (F i ⊕ M i), where messages M i include updated (smaller) dist (s, u) (due to new "shortcuts" from s) for border nodes u, i.e., nodes with edges across different fragments. GRAPE guarantees the termination of the fixpoint computation, when no more dist (s, v) can be changed to a smaller value. At this point, Assemble takes a union of the Q (F i) as Q (G), which is provably correct.
That is, sequential algorithms can be adopted as PEval, IncEval and Assemble, with variables dist (s, v) specified for updating border nodes. GRAPE takes care of details such as message passing, load balancing and fault tolerance. There is no need to recast the entire algorithms into a new model.
PEval: Partial Evaluation
PEval takes a query Q ∈ 𝒬 and a fragment F i of G as input, and computes partial answers Q (F i) at worker P i in parallel for all i∈ [1, m]. It may be any existing sequential algorithm T for 𝒬, extended with the following:
partial result kept in a designated variable; and
message specification as its interface to IncEval.
Communication between workers is conducted via messages, defined in terms of update parameters as follows.
(1) Message preamble. PEval (a) declares status variables x̄, and (b) specifies a set C i of nodes and edges relative to F i.I or F i.O. The status variables associated with C i are denoted by C i.x̄, referred to as the update parameters of F i.
Intuitively, the variables in C i.x̄ are the candidates to be updated by incremental steps. In other words, messages M i to worker P i are updates to the values of the variables in C i.x̄.
More specifically, C i is specified by an integer d and S, where S is either F i.I or F i.O. That is, C i is the set of nodes and edges within d hops of the nodes in S. If d = 0, C i is F i.I or F i.O. Otherwise, C i may include nodes and edges from other fragments F j of G.
The variables are declared and initialized in PEval. At the end of PEval, it sends the values of C i.x̄ to coordinator P 0. In another embodiment, the values are sent to peer workers directly.
(2) Message segment. PEval may specify function aggregateMsg, to resolve conflicts when multiple messages from different workers attempt to assign different  values to the same update parameter (variable) . When such a strategy is not provided, GRAPE picks a default handler.
(3) Message grouping. GRAPE deduces updates to C i.x̄ for i∈ [1, m], and treats them as messages exchanged among workers. More specifically, at coordinator P 0, GRAPE identifies and maintains C i.x̄ for each worker P i. Upon receiving messages from the P i's, GRAPE works as follows.
(a) Identifying C i. It deduces C i for i∈ [1, m] by referencing the fragmentation graph G 𝒫, and C i remains unchanged in the entire process. It maintains the update parameters C i.x̄ for F i.
(b) Composing M i. For messages from each P i, GRAPE (i) identifies variables in C i.x̄ with changed values; (ii) deduces their designations P j by referencing G 𝒫: if 𝒫 is edge-cut, the variable tagged with a node v in F i.O will be sent to worker P j if v is in F j.I (i.e., if the pair (i, j) is in G 𝒫 (v)); similarly for v in F i.I; if 𝒫 is vertex-cut, it identifies nodes shared by F i and F j (i≠j); and (iii) it combines all changed variable values designated to P j into a single message M j, and sends M j to worker P j in the next superstep, for all j∈ [1, m].
These are automatically conducted by GRAPE, which minimizes communication costs by passing only updated variable values. To reduce the workload at the coordinator, alternatively each worker may maintain a copy of G 𝒫 and deduce the designation of its messages in parallel.
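The grouping logic of step (b) can be sketched in Python as follows; this illustrative fragment assumes the fragmentation-graph index of the earlier sketch and an edge-cut partition, and could serve as the route step of the earlier loop sketch. It is not GRAPE's implementation.

from collections import defaultdict

def compose_messages(changes, gp, aggregate=min):
    """changes: list of (v, value) pairs reported by workers;
       gp: fragmentation-graph index, border node v -> set of (i, j) pairs."""
    best = {}
    for v, val in changes:                     # resolve conflicts on the same v
        best[v] = val if v not in best else aggregate(best[v], val)
    messages = defaultdict(dict)               # M_j, keyed by destination worker j
    for v, val in best.items():
        for _, j in gp.get(v, ()):             # v in F_i.O goes to P_j with v in F_j.I
            messages[j][v] = val
    return messages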
Example 2: How GRAPE parallelizes SSSP is disclosed. Consider a directed graph G = (V, E, L) in which for each edge e, L (e) is a positive number. The length of a path (v 0, …, v k) in G is the sum of L (v i-1, v i) for i∈ [1, k]. For a pair (s, v) of nodes, denote by dist (s, v) the shortest distance from s to v, i.e., the length of a shortest path from s to v. Given graph G and a node s in V, GRAPE computes dist (s, v) for all v ∈ V. It adopts edge-cut partition. It deduces F i.O by referencing G 𝒫 and stores F i.O at each fragment F i.
As shown in Fig. 3, PEval (lines 1-14) is verbally identical to Dijkstra's sequential algorithm. The only changes are the message preamble and segment (underlined). It declares an integer variable dist (s, v) for each node v, initially ∞ (except dist (s, s) = 0). It specifies min as aggregateMsg to resolve conflicts: if there are multiple values for the same dist (s, v), the smallest value is taken by the linear order on integers. The update parameters are C i.x̄ = {dist (s, v) | v ∈ F i.O}. At the end of its process, PEval sends C i.x̄ to coordinator P 0. At P 0, GRAPE maintains dist (s, v) for all v ∈ F.O. Upon receiving messages from all workers, it takes the smallest value for each dist (s, v). It finds those variables with smaller values, deduces their destinations by referencing G 𝒫, groups them into messages M j, and sends M j to P j.
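A hedged Python sketch of such a PEval for SSSP is given below; the fragment layout (an adjacency dict plus the stored F i.O set) and the name peval_sssp are assumptions of the sketch, and Fig. 3 remains the authoritative version. It is Dijkstra's algorithm plus the message preamble: dist variables for all nodes, with the F i.O values returned as update parameters.

import heapq

def peval_sssp(source, frag):
    """frag: dict with 'adj' (node -> {neighbor: weight}) and 'O' (set F_i.O,
       border nodes stored with the fragment)."""
    nodes = set(frag['adj']) | set(frag['O'])
    dist = {v: float('inf') for v in nodes}    # status variables dist(s, v)
    heap = []
    if source in dist:
        dist[source] = 0.0
        heap = [(0.0, source)]
    while heap:                                # plain Dijkstra on the fragment
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for w, wt in frag['adj'].get(u, {}).items():
            if d + wt < dist[w]:
                dist[w] = d + wt
                heapq.heappush(heap, (dist[w], w))
    # update parameters: dist values for the border nodes in F_i.O
    updates = {v: dist[v] for v in frag['O'] if dist[v] < float('inf')}
    return dist, updates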
IncEval: Incremental Evaluation
Given query Q, fragment F i, partial results Q (F i) and message M i (updates to C i.x̄), IncEval computes Q (F i ⊕ M i) incrementally, making maximum reuse of the computation of Q (F i) in the last round. Each time after IncEval is executed, GRAPE treats F i ⊕ M i and Q (F i ⊕ M i) as F i and Q (F i), respectively, for the next round of incremental computation.
IncEval can take any existing sequential incremental algorithm ΔT for 𝒬. It shares the message preamble of PEval. At the end of the process, it identifies the changed values to C i.x̄ at each F i, and sends the changes as messages to P 0. In another embodiment, the changes are sent to peer workers directly. At the destination worker or P 0, GRAPE composes messages as described in 3 (b) above.
Boundedness. Graph computations are typically iterative. GRAPE reduces the costs of iterative computations by promoting bounded incremental algorithms for IncEval.
Consider an incremental algorithm ΔT for 𝒬. Given G, Q ∈ 𝒬, Q (G) and updates M to G, it computes ΔO such that Q (G ⊕ M) = Q (G) ⊕ ΔO, where ΔO denotes the changes to the old output O (G). It is said to be bounded if its cost can be expressed as a function of |CHANGED| = |ΔM| + |ΔO|, i.e., the size of the changes in the input and output [14, 32]. Intuitively, |CHANGED| represents the updating costs inherent to the incremental problem for 𝒬 itself. For a bounded IncEval, its cost is determined by |CHANGED|, not by the size |F i| of the entire F i, no matter how big |F i| is.
Example 3: Continuing with Example 2, IncEval in Fig. 4 is provided. It is the sequential incremental algorithm for SSSP, in response to changed dist (s, v) for v in F i.I (here M i includes the changes to dist (s, v) for v ∈ F i.I, deduced from G 𝒫). Using a queue Que, it starts with M i, propagates the changes to the affected area, and updates the distances. The partial result is now the revised distances (line 11).
At the end of the process, IncEval sends to coordinator P 0 the updated values of those status variables in C i.x̄, as in PEval. It applies aggregateMsg min to resolve conflicts.
The only changes to the algorithm are underlined in Fig. 4. It is easy to see that IncEval is bounded: its cost is determined by the sizes of the "updates" |M i| and the changes to the output. This reduces the cost of the iterative computation of SSSP (the while and for loops).
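A corresponding Python sketch of IncEval for SSSP follows; as above, the fragment layout and the name inceval_sssp are assumptions of the sketch, and Fig. 4 remains the authoritative version. It touches only the affected area, in the spirit of the bounded incremental algorithm.

import heapq

def inceval_sssp(frag, dist, M):
    """M: dict v -> new (smaller) dist(s, v) for border nodes v in F_i.I."""
    heap, changed = [], set()
    for v, d in M.items():                     # aggregateMsg = min on conflicts
        if d < dist.get(v, float('inf')):
            dist[v] = d
            changed.add(v)
            heapq.heappush(heap, (d, v))
    while heap:                                # propagate only the changes
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for w, wt in frag['adj'].get(u, {}).items():
            if d + wt < dist[w]:
                dist[w] = d + wt
                changed.add(w)
                heapq.heappush(heap, (dist[w], w))
    # outgoing message: decreased dist values on the border nodes in F_i.O
    return dist, {v: dist[v] for v in frag['O'] if v in changed}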
Assemble Partial Results
Function Assemble takes the partial results Q (F i) for i∈ [1, m] and the fragmentation graph G 𝒫 as input, and combines them to get Q (G). It is triggered when no more changes can be made to the update parameters C i.x̄ for any i∈ [1, m].
Example 4: Continuing with Example 3, Assemble (not shown) for SSSP takes Q (G) = ∪ i∈ [1, m] Q (F i), the union of the shortest distances for each node in each F i.
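For completeness, a one-function Python sketch of such an Assemble for SSSP (illustrative only): it takes the union of the partial results, keeping the minimum where a border node appears in several fragments.

def assemble_sssp(partials):
    """partials: dict worker id -> dist map computed for its fragment."""
    result = {}
    for dist in partials.values():
        for v, d in dist.items():
            result[v] = min(d, result.get(v, float('inf')))
    return result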
The GRAPE process terminates with correct Q (G). The updates to C i.x̄ are "monotonic": the value of dist (s, v) for each node v decreases or remains unchanged. There are finitely many such variables. Furthermore, dist (s, v) is the shortest distance from s to v, as warranted by the correctness of the sequential algorithms (PEval and IncEval).
Putting these together, it is clear that a PIE program parallelizes a graph query class 𝒬, provided with a sequential algorithm T (PEval) and a sequential incremental algorithm ΔT (IncEval) for 𝒬. Assemble is typically a straightforward sequential algorithm. A large number of sequential (incremental) algorithms are already in place for various 𝒬. Moreover, there have been methods for incrementalizing graph algorithms, to get incremental algorithms from their batch counterparts. Thus, GRAPE makes parallel graph computations accessible to a large group of end users.
In contrast to existing graph systems, GRAPE plugs in T and ΔT as a whole, and confines the communication specification to the message segment of PEval. Users do not have to think "like a vertex" when programming. As opposed to vertex-centric and block-centric systems, GRAPE runs sequential algorithms on entire fragments. Moreover, IncEval employs incremental evaluation to reduce cost, which is a unique feature of GRAPE. Note that IncEval speeds up iterative computations by minimizing unnecessary recomputation of Q (F i), no matter whether it is bounded or not.
In an embodiment, the partitioning step, the fragment-distributing step, the step of receiving the algorithms and message preambles, the step of processing the sequential algorithms into parallel versions, the preamble-processing step, and the communication-protocol-determining step can be implemented by a preprocessor.
An implementation of GRAPE system is discussed below.
Architecture overview. GRAPE adopts a four-tier architecture depicted in Fig. 1, described as follows.
(1) The top layer is a user interface. GRAPE supports interactions with (a) developers, who specify and register sequential PEval, IncEval and Assemble as a PIE program for a class 𝒬 of graph queries (the plug panel); and (b) end users, who plug in PIE programs from the API library, pick a graph G, enter queries Q ∈ 𝒬 and "play" (the play panel). GRAPE parallelizes the PIE program, computes Q (G) and displays Q (G) in the result and analytics consoles.
(2) At the core of the system is a parallel query engine. It manages sequential algorithms registered in GRAPE API, makes parallel evaluation plans for PIE programs, and executes the plans for query answering. It also enforces consistency control and fault tolerance (see below) .
(3) Underlying the query engine are (a) a Communication Controller (message passing interface) for communications between coordinator and workers, (b) an Index Manager for loading indices, (c) a Partition Manager to partition graphs, and (d) a Load Balancer to balance workload (see below) .
(4) The storage layer manages graph data in DFS (distributed file system) . It is accessible to the query engine, Index Manager, Partition Manager and Load Balancer.
Message passing. The Communication Controller of GRAPE makes use of a standard message passing interface for parallel and distributed programs. It currently adopts MPICH, which is also the basis of other systems such as GraphLab and Blogel. It generates messages and coordinates messages in synchronization steps using standard MPI primitives. Moreover, GRAPE also supports alternative communication mechanisms such as  Sockets, RPC and RDMA which support data communication between processors over an interconnected network. It supports both designated messages and key-value pairs.
Graph partition. The Graph Partitioner supports a variety of partition algorithms. Users may pick: (a) an edge-cut algorithm, which may split an edge across partitions, such as METIS/ParMETIS/PuLP/XtraPuLP; (b) a vertex-cut algorithm, which may split a vertex across partitions, such as 1D/2D partitions; or (c) a hybrid vertex- and edge-cut algorithm that may split both edges and vertices. New data partition algorithms, either for online (stream) or offline (batch) processing, can also be plugged into GRAPE.
Graph-level optimization. In contrast to prior graph systems, GRAPE supports data-partitioned parallelism by parallelizing the runs of sequential algorithms. Hence all optimization strategies developed for sequential (batch and incremental) algorithms can be readily plugged into GRAPE, to speed up PEval and IncEval over graph fragments. As examples, some optimization strategies are disclosed below.
(1) Indexing. Any indexing structure effective for sequential algorithm can be computed offline and directly used to optimize PEval, IncEval and Assemble, without recasting. GRAPE supports indices including (1) 2-hop index for reachability queries; and (2) neighborhood-index for candidate filtering in graph pattern matching. Moreover, new indices can be “plugged” into GRAPE API library.
(2) Compression. GRAPE adopts query-preserving compression at the fragment level. Given a query class 𝒬 and a fragment F i, each worker P i computes a smaller compressed fragment F i′ offline via a compression algorithm, such that for any query Q in 𝒬, Q (F i) can be computed from F i′ without decompressing F i′, regardless of what sequential PEval and IncEval are used. This compression scheme is effective for graph pattern matching and graph traversal, among others.
(3) Dynamic grouping. GRAPE dynamically groups a set of border nodes by adding a “dummy” node, and sends messages from the dummy node in batches, instead of one by one. This effectively reduces the amount of message communication in each synchronization step.
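A minimal sketch of this grouping step is given below: messages destined for border nodes in the same fragment are accumulated into one batch, which is then shipped as a single message from the dummy node. The Message type and function name are illustrative assumptions.

```cpp
#include <unordered_map>
#include <vector>

struct Message {
  int target_node;   // border node the update is for
  long long value;   // e.g. a tentative SSSP distance
};

// A minimal sketch of dynamic grouping: rather than one send per border
// node, all messages bound for the same destination fragment are grouped,
// so each synchronization step ships one batch per fragment. Illustrative.
std::unordered_map<int, std::vector<Message>> GroupByFragment(
    const std::vector<Message>& msgs,
    const std::unordered_map<int, int>& node_to_fragment) {
  std::unordered_map<int, std::vector<Message>> batches;  // fragment -> batch
  for (const Message& m : msgs) {
    batches[node_to_fragment.at(m.target_node)].push_back(m);
  }
  return batches;
}
```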
Load balancing. GRAPE groups computation tasks into work units, and estimates the cost at each virtual worker P i in terms of the fragment size |F i| at P i, the number of border nodes in F i, and the complexity of the computation over F i. Its Load Balancer computes an assignment of the work units to physical workers, to minimize both computational cost and communication cost (GRAPE employs m virtual workers and n physical workers, with m > n). The bi-criteria objective makes it easy to deal with skewed graphs, in which a small fraction of the nodes are adjacent to a large fraction of the edges in G, as found in social graphs.
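One way to picture the assignment, under the simplifying assumption that each work unit's bi-criteria cost has been folded into a single number, is the classic greedy longest-processing-time heuristic sketched below; this is an illustration, not the Load Balancer's actual algorithm.

```cpp
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// A minimal sketch: assign m virtual workers (work units) to n physical
// workers (m > n) by always handing the most expensive remaining unit to
// the currently least loaded physical worker. Illustrative only.
std::vector<int> AssignUnits(const std::vector<double>& unit_cost, int n) {
  std::vector<int> order(unit_cost.size());
  for (int i = 0; i < static_cast<int>(order.size()); ++i) order[i] = i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return unit_cost[a] > unit_cost[b]; });

  using Load = std::pair<double, int>;  // (current load, physical worker)
  std::priority_queue<Load, std::vector<Load>, std::greater<Load>> least;
  for (int w = 0; w < n; ++w) least.push({0.0, w});

  std::vector<int> assignment(unit_cost.size());
  for (int u : order) {
    auto [load, w] = least.top();
    least.pop();
    assignment[u] = w;                    // unit u runs on physical worker w
    least.push({load + unit_cost[u], w});
  }
  return assignment;
}
```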
To the knowledge of the person skilled in the art, these optimization strategies are not supported by the state-of-the-art vertex-centric and block-centric systems. Indexing and query-preserving compression for sequential algorithms do not carry over to vertex programs, and block-centric programming essentially treats blocks as vertices rather than graphs. Moreover, dynamic grouping does not help vertex-level synchronization.
Fault tolerance. GRAPE employs an arbitrator mechanism to recover from both worker failures and coordinator failures (a.k.a. single-point failures). It reserves a worker P a as arbitrator, and a worker S c′ as a standby coordinator. The arbitrator keeps sending heart-beat signals to all workers and the coordinator. In case of failure: (a) if a worker fails to respond, the arbitrator transfers its computation tasks to another worker; and (b) if the coordinator fails, the arbitrator activates the standby coordinator S c′ to continue the computation.
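The failure-detection side of this mechanism can be sketched as follows, assuming the arbitrator records the last time each party answered a heart-beat ping; the class name and timeout value are illustrative assumptions.

```cpp
#include <chrono>
#include <unordered_map>

// A minimal sketch of the arbitrator's bookkeeping: a worker (or the
// coordinator) that has not answered a heart-beat ping within the timeout
// is treated as failed, triggering task transfer or standby activation.
class Arbitrator {
 public:
  using Clock = std::chrono::steady_clock;

  void OnHeartbeatReply(int node_id, Clock::time_point now) {
    last_reply_[node_id] = now;
  }

  bool IsFailed(int node_id, Clock::time_point now,
                Clock::duration timeout = std::chrono::seconds(10)) const {
    auto it = last_reply_.find(node_id);
    return it == last_reply_.end() || now - it->second > timeout;
  }

 private:
  std::unordered_map<int, Clock::time_point> last_reply_;
};
```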
Consistency. Multiple workers may update copies of the same status variable. To cope with this, (a) GRAPE allows users to specify a conflict resolution policy as the function aggregateMsg in PEval, e.g. min for SSSP and CC, based on a partial order on the domain of the status variables, e.g. the linear order on integers. Based on the policy, inconsistencies are resolved in each synchronization step of the PEval and IncEval processes; Theorem 1 guarantees consistency when the policy satisfies the monotonic condition. (b) GRAPE also supports default exception handlers when users opt not to specify aggregateMsg. In addition, GRAPE allows users to specify generic consistency control strategies and register them in the GRAPE API library.
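For instance, with min as the aggregateMsg policy (as for SSSP distances), conflicting updates to copies of the same status variable can be folded as in the sketch below; the types and function name are illustrative assumptions.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// A minimal sketch of min-based conflict resolution: among all updates
// addressed to the same border node in a synchronization step, the
// smallest value wins, matching a partial order under which the policy
// is monotonic. Illustrative only.
std::unordered_map<int, long long> ResolveWithMin(
    const std::vector<std::pair<int, long long>>& updates) {
  std::unordered_map<int, long long> resolved;  // node -> winning value
  for (const auto& [node, value] : updates) {
    auto it = resolved.find(node);
    if (it == resolved.end() || value < it->second) resolved[node] = value;
  }
  return resolved;
}
```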
A lightweight transaction controller is also employed, to support not only queries but also updates such as insertions and deletions of nodes and edges. When the load is light, it adopts non-destructive updates of functional databases. Otherwise, it switches to multi-version concurrency control that keeps track of timestamps and versions, as also adopted by existing distributed systems.
In addition to SSSP in the afore-mentioned Examples 1-4, the parallelization system of the present invention is applicable in such contexts as PageRank, pattern matching (via simulation and isomorphism), connectivity, keyword search and collaborative filtering, as well as other graph computations.
Our invention provides the following advantages.
I. Our invention makes it easier to implement distributed parallel programming. A relatively inexperienced user, by inserting existing sequential algorithms into our parallelization system, is enabled to handle large amounts of graph data in a distributed fashion without recasting her own algorithms for parallel computation.
II. Automated parallelization, together with the deployment of Partial Evaluation and Incremental Evaluation, avoids unnecessary conversion of existing systems into new ones. With simple adaptation, well-established sequential algorithms for graph computation remain functional in distributed contexts.
III. Our invention guarantees the correctness of parallelized graph computation. When the three sequential algorithms provided by the user are correct and the message aggregation is monotonic, the automatically parallelized computation is guaranteed to terminate and return the correct result.
IV. Our invention significantly boosts the efficiency of graph computations. Experiments indicate that our system is up to 400 times faster than existing systems for graph computation, and communication cost can drop to one ten-millionth of that of existing systems.
V. Aided by our invention, computation performance is enhanced and cost drops for such algorithms as graph traversal (DFS, BFS, single-source shortest path), connectivity (strongly connected components, minimum spanning tree), graph pattern matching (via simulation and subgraph isomorphism), keyword search and machine learning (collaborative filtering, triangle counting, PageRank).
VI. GRAPE is experimentally evaluated against (a) Giraph, an open-source version of Pregel; (b) GraphLab, an asynchronous vertex-centric system; and (c) Blogel, possibly the fastest block-centric system. Over real-life graphs, in addition to the ease of programming, GRAPE achieves performance comparable to or better than the state-of-the-art systems. For instance, (a) GRAPE is more than 480, 36 and 15 times faster than Giraph, GraphLab and Blogel for SSSP, respectively, up to 5 orders of magnitude faster for CC, more than 150, 6 and 16 times faster for Sim, 70, 14 and 3 times faster for SubIso, and 5, 3 and 12 times faster for CF on average. (b) In the same setting, GRAPE ships on average 0.07%, 0.12% and 1.7% of the data shipped by Giraph, GraphLab and Blogel for SSSP, 0.9%, 0.14% and 4.9% for Sim, 0.18%, 0.23% and 0.11% for SubIso, and 5.6%, 43.3% and 3.2% for CF, respectively. (c) Incremental steps effectively reduce the cost and improve performance by 9.6 times. (d) Optimization strategies developed for sequential algorithms remain effective in GRAPE, improving performance by 2 times on average.
Having described at least one of the embodiments of the claimed invention with reference to the accompanying drawings, it will be apparent to those skilled in the art that the invention is not limited to those precise embodiments, and that various modifications and variations can be made to the presently disclosed system without departing from the scope or spirit of the invention. Thus, it is intended that the present disclosure cover modifications and variations of this disclosure provided they come within the scope of the appended claims and their equivalents. Specifically, one or more of the limitations recited throughout the specification can be combined at any level of detail to the extent they are described to improve the method or the system.

Claims (17)

  1. A method for parallelizing a sequential graph computation, comprising:
    partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
    distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
    receiving by the preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator or one of the workers; and
    returning, from the assemble function, the result Q (G) as the answer to the query Q.
  2. The method in claim 1, further comprising: selecting a partition strategy 𝒫 by the preprocessor before partitioning the graph G.
  3. The method in claim 1, wherein the workers are distributed processors, or processors in a single machine, or threads on a processor.
  4. The method in claim 1, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
  5. The method in claim 1, further comprising: identifying and initializing, by each worker P i while executing the partial evaluation, a set of update parameters for its fragment F i that records the status of its border nodes.
  6. The method in claim 1, further comprising routing messages by the coordinator to workers during the message passing.
  7. The method in claim 1, further comprising routing messages to other workers by each worker P i during the message passing.
  8. The method in claim 1, further comprising treating, by worker P i, F i ⊕ M i and Q (F i ⊕ M i) as F i and Q (F i) , respectively, for the next computation each time after executing incremental evaluation.
  9. A method for parallelizing a sequential graph computation, comprising:
    loading fragments (F 1, …, F n) by n workers (P 1, …, P n) , respectively, wherein the fragments are pre-partitioned from a graph G;
    receiving by a preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator; and
    returning, from the coordinator, the result Q (G) as the answer to the query Q.
  10. A non-transitory computer readable medium with instructions stored thereon for parallelizing a sequential graph computation, wherein the instructions, when executed by a processor, perform the steps comprising:
    partitioning a graph G by a preprocessor into fragments (F 1, …, F n) ;
    distributing the fragments by the preprocessor across n workers (P 1, …, P n) respectively;
    receiving by the preprocessor three sequential algorithms and the message preambles;
    processing, by the preprocessor, the sequential algorithms into parallel versions;
    processing the message preambles and determining the communication protocol by the preprocessor;
    receiving a query Q at a coordinator and posting the query Q to all n workers;
    executing partial evaluation by each worker P i against its local fragment F i, i∈ [1, n] ;
    computing a partial result Q (F i) by each worker P i, and generating a message M i;
    exchanging messages among the workers;
    executing incremental evaluation by each worker P i, upon receiving a message M i, against its local fragment F i as updated by M i;
    computing an updated partial result Q (F i ⊕ M i) by worker P i, where the operator ⊕ denotes applying the changes M i to F i;
    iterating the incremental evaluation until no further updates M i can be made to any F i;
    pulling updated partial results by the coordinator from all workers;
    computing a complete result Q (G) via the assemble function by the coordinator or one of the workers; and
    returning, from the assemble function, the result Q (G) as the answer to the query Q.
  11. A system for parallelizing a sequential graph computation, comprising a preprocessor, a coordinator and a plurality of workers, wherein:
    the preprocessor comprises:
    an automated parallelization module for receiving, parsing and automatically parallelizing inputted algorithms, and for distributing the parallelized algorithms to the workers;
    a partition manager module for partitioning the graph and managing fragments;
    a communication control module for processing the message preambles and determining the communication protocol; and
    a storage module for managing graph data in a distributed file system;
    the coordinator comprises:
    a query parser module for receiving, parsing and distributing a query to workers;
    an assemble module for assembling result for a query;
    a storage module for managing graph data in a distributed file system; and
    each worker comprises:
    a partial evaluation module for executing partial evaluation;
    an incremental evaluation module for executing incremental evaluation;
    a communication control module for controlling message passing; and
    a storage module for managing graph data in a distributed file system.
  12. The system in claim 11, wherein the coordinator comprises a load balancing module for assigning work so as to balance load.
  13. The system in claim 11, wherein the coordinator comprises an indexing module for supporting indexing structures.
  14. The system in claim 11, wherein:
    the coordinator comprises a compression module for adopting query preserving compression; and
    each worker comprises a compression module for adopting query preserving compression.
  15. The system in claim 11, wherein:
    the coordinator comprises a fault tolerance module for recovering from worker failures or coordinator failures; and
    each worker comprises a fault tolerance module for recovering from worker failures or coordinator failures.
  16. The system in claim 11, wherein the workers are distributed processors, or processors in a single machine, or threads on a processor.
  17. The system in claim 11, wherein the coordinator and the preprocessor may be implemented completely or in part as centralized, decentralized or virtual components.
PCT/CN2018/086454 2017-05-12 2018-05-11 Method and system for parallelizing sequential graph computation WO2018205986A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CNPCT/CN2017/084083 2017-05-12
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data

Publications (1)

Publication Number Publication Date
WO2018205986A1 true WO2018205986A1 (en) 2018-11-15

Family

ID=64104033

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data
PCT/CN2018/086454 WO2018205986A1 (en) 2017-05-12 2018-05-11 Method and system for parallelizing sequential graph computation

Family Applications Before (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/084083 WO2018205246A1 (en) 2017-05-12 2017-05-12 Parallel computation engine for graph data

Country Status (1)

Country Link
WO (2) WO2018205246A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170097853A1 (en) * 2013-01-31 2017-04-06 International Business Machines Corporation Realizing graph processing based on the mapreduce architecture
CN104158840A (en) * 2014-07-09 2014-11-19 东北大学 Method for calculating node similarity of chart in distributing manner
CN105447156A (en) * 2015-11-30 2016-03-30 北京航空航天大学 Resource description framework distributed engine and incremental updating method
CN106033476A (en) * 2016-05-19 2016-10-19 西安交通大学 Incremental graphic computing method in distributed computing mode under cloud computing environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
FAN, WENFEI ET AL.: "Distributed Graph Simulation: Impossibility and Possibility", PROCEEDINGS OF THE VLDB ENDOWMENT, vol. 7, no. 12, 5 September 2014 (2014-09-05), pages 1084 - 1090, XP055550374 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11681838B2 (en) 2020-05-26 2023-06-20 Landmark Graphics Corporation Distributed Sequential Gaussian Simulation

Also Published As

Publication number Publication date
WO2018205246A1 (en) 2018-11-15

Similar Documents

Publication Publication Date Title
Fan et al. Parallelizing sequential graph computations
CN107239335B (en) Job scheduling system and method for distributed system
US20220067025A1 (en) Ordering transaction requests in a distributed database according to an independently assigned sequence
US20210027000A1 (en) Simulation Systems and Methods
Fan et al. The Case Against Specialized Graph Analytics Engines.
Malewicz et al. Pregel: a system for large-scale graph processing
Sarwat et al. Horton: Online query execution engine for large distributed graphs
US9400767B2 (en) Subgraph-based distributed graph processing
US11416305B2 (en) Commands for simulation systems and methods
CN109740765A (en) A kind of machine learning system building method based on Amazon server
CN114691658A (en) Data backtracking method and device, electronic equipment and storage medium
CN108153859A (en) A kind of effectiveness order based on Hadoop and Spark determines method parallel
WO2018205986A1 (en) Method and system for parallelizing sequential graph computation
Chen et al. GraphHP: A hybrid platform for iterative graph processing
US20150150011A1 (en) Self-splitting of workload in parallel computation
Kumar et al. Graphsteal: Dynamic re-partitioning for efficient graph processing in heterogeneous clusters
CN117076563A (en) Pruning method and device applied to blockchain
CN110196879B (en) Data processing method, device, computing equipment and storage medium
WO2022247869A1 (en) Method for searching for data, apparatus, and device
Kemme et al. Dagstuhl seminar review: Consistency in distributed systems
Fan et al. Think sequential, run parallel
Higashino et al. Attributed graph rewriting for complex event processing self-management
Qadah et al. Highly Available Queue-oriented Speculative Transaction Processing
RU2813571C1 (en) Method of parallelizing programs in multiprocessor computer
Achar et al. Got: Git, but for Objects

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18799186

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18799186

Country of ref document: EP

Kind code of ref document: A1