CN113326125A - Large-scale distributed graph calculation end-to-end acceleration method and device - Google Patents

Large-scale distributed graph calculation end-to-end acceleration method and device

Info

Publication number
CN113326125A
CN113326125A
Authority
CN
China
Prior art keywords
algorithm
graph
load balancing
task
radix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110552903.0A
Other languages
Chinese (zh)
Other versions
CN113326125B (en)
Inventor
李丹
刘天峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110552903.0A priority Critical patent/CN113326125B/en
Publication of CN113326125A publication Critical patent/CN113326125A/en
Application granted granted Critical
Publication of CN113326125B publication Critical patent/CN113326125B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/544 Buffers; Shared memory; Pipes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a large-scale distributed graph computation end-to-end acceleration method and device. The method comprises: performing task division on distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task; selecting the corresponding information-flow mode to complete the model selection task; dividing vertices into different graph partitions according to an end-to-end partitioning index, and then allocating the vertices through a streaming chunk partitioning algorithm with an optimal threshold; and extending a load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort. This acceleration scheme, which takes end-to-end time as the optimization target, can greatly accelerate end-to-end graph processing performance.

Description

Large-scale distributed graph calculation end-to-end acceleration method and device
Technical Field
The invention relates to the technical field of distributed computing, and in particular to a method and device for end-to-end acceleration of large-scale distributed graph computation.
Background
In the big data era, applications such as social networks, the Internet of Things and e-commerce generate enormous amounts of data. Such data is generally organized in a graph format, grows continuously, and has already reached the TB scale. To process such large-scale graph data efficiently, many distributed graph computing systems have been proposed.
The processing flow of a distributed graph computing system generally includes two phases. The first is the preprocessing phase: naturally occurring graphs are large and irregular, and must be preprocessed before a particular graph algorithm can be executed. In the preprocessing phase, the format of the input graph is converted and the graph is partitioned across different machines. The second is the algorithm execution phase: a specific graph algorithm is executed on the preprocessed graph. Most graph computing systems optimize only the efficiency of the algorithm execution phase and pay no attention to preprocessing performance, resulting in very long end-to-end processing times.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, an object of the present invention is to provide an end-to-end acceleration method for large-scale distributed graph computation, an acceleration scheme that takes end-to-end time as the optimization target and can greatly accelerate end-to-end graph processing performance.
Another objective of the present invention is to provide an end-to-end acceleration apparatus for large-scale distributed graph computation.
In order to achieve the above object, an embodiment of an aspect of the present invention provides an end-to-end acceleration method for large-scale distributed graph computation, including:
performing task division on the distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task;
selecting the corresponding information-flow mode to complete the model selection task;
dividing vertices into different graph partitions according to an end-to-end partitioning index, and then allocating the vertices through a streaming chunk partitioning algorithm with an optimal threshold; and
extending the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort.
According to the large-scale distributed graph computation end-to-end acceleration method of the embodiment of the present invention, end-to-end time is taken as the optimization target: a static mode selection is proposed based on theoretical analysis to reduce the data preprocessing workload; a more balanced end-to-end partitioning index and a streaming chunk partitioning algorithm are provided; and a faster and more efficient distributed sorting algorithm is provided to accelerate the sorting process.
In addition, the large-scale distributed graph computation end-to-end acceleration method according to the above embodiment of the present invention may further have the following additional technical features:
further, the information flow mode comprises a push mode and a pull mode, wherein the push mode is that each vertex pushes the updated information to a target vertex through an outgoing edge; the pull mode is that each vertex pulls the updated information from the source vertex to itself through an incoming edge.
Further, the end-to-end partitioning index is:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
Further, the optimal threshold of the algorithm is found by binary search.
Further, extending the load-balanced radix sorting algorithm to obtain the NUMA-aware load-balanced radix sorting algorithm comprises:
using shared-memory communication, and performing memory allocation in the specific NUMA memory node of each thread.
In order to achieve the above object, another embodiment of the present invention provides an end-to-end acceleration apparatus for large-scale distributed graph computation, including:
a partitioning module for performing task division on the distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task;
a selection module for selecting the corresponding information-flow mode to complete the model selection task;
an allocation module for dividing vertices into different graph partitions according to the end-to-end partitioning index and allocating the vertices through the streaming chunk partitioning algorithm with an optimal threshold; and
a construction module for extending the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort.
The large-scale distributed graph computation end-to-end acceleration apparatus takes end-to-end time as the optimization target: a static mode selection is proposed based on theoretical analysis to reduce the data preprocessing workload; a more balanced end-to-end partitioning index and a streaming chunk partitioning algorithm are provided; and a faster and more efficient distributed sorting algorithm is provided to accelerate the sorting process.
In addition, the large-scale distributed graph computation end-to-end acceleration apparatus according to the above embodiment of the present invention may further have the following additional technical features:
further, the information flow mode comprises a push mode and a pull mode, wherein the push mode is that each vertex pushes the updated information to a target vertex through an outgoing edge; the pull mode is that each vertex pulls the updated information from the source vertex to itself through an incoming edge.
Further, the end-to-end partitioning index is:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
Further, the optimal threshold of the algorithm is found by binary search.
Further, extending the load-balanced radix sorting algorithm to obtain the NUMA-aware load-balanced radix sorting algorithm comprises:
using shared-memory communication, and performing memory allocation in the specific NUMA memory node of each thread.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow diagram of a large-scale distributed graph computation end-to-end acceleration method according to one embodiment of the invention;
FIG. 2 is a schematic diagram of the optimal-threshold streaming chunk partitioning algorithm according to one embodiment of the invention;
FIG. 3 is a schematic diagram of the NUMA-aware load-balanced radix sorting algorithm according to one embodiment of the invention;
FIG. 4 is a schematic diagram of the distributed NUMA-aware load-balanced radix sorting algorithm according to one embodiment of the invention;
FIG. 5 is a block diagram of a large-scale distributed graph computation end-to-end acceleration apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or to elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the present invention, and are not to be construed as limiting the present invention.
The following describes a method and an apparatus for end-to-end acceleration of large-scale distributed graph computation according to an embodiment of the present invention with reference to the accompanying drawings.
First, the large-scale distributed graph computation end-to-end acceleration method proposed according to an embodiment of the present invention is described with reference to the accompanying drawings.
FIG. 1 is a flow diagram of a large-scale distributed graph computation end-to-end acceleration method according to one embodiment of the invention.
As shown in fig. 1, the large-scale distributed graph computation end-to-end acceleration method includes the following steps:
and step S1, performing task division on the distributed graph calculation to obtain a model selection task, a vertex distribution task and an adjacency linked list construction task.
Distributed graph computation typically involves two phases of preprocessing and algorithm execution, and the lack of attention to the preprocessing phase in existing schemes results in very long end-to-end processing times. The pre-processing stage can be divided into three tasks: the method comprises a model selection task, a vertex distribution task and an adjacency linked list construction task.
Step S2: selecting the corresponding information-flow mode to complete the model selection task.
In the model selection task, the characteristics of the algorithms and of the information-flow modes are fully considered, and a static mode selection is proposed to reduce the preprocessing workload.
The first task of preprocessing is to select the information-flow mode used by the algorithm. There are two information-flow modes: one is the push mode, in which each vertex pushes its updated information to target vertices through its outgoing edges; the other is the pull mode, in which each vertex pulls updated information from source vertices to itself through its incoming edges. Existing graph algorithms can be divided into two types: always-active-style graph algorithms and traversal-style graph algorithms. It can be proved that for always-active-style algorithms the time of the push mode is strictly less than that of the pull mode, and that for traversal-style algorithms the time of the pull mode is strictly less than that of the push mode. Based on this result, a fixed mode can be chosen statically for each specific algorithm.
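A minimal Python sketch of the two modes on a toy graph, assuming a simple summing update rule; the helper names push_step and pull_step are illustrative, not the patent's implementation. Both modes compute the same updates, but push writes along outgoing edges while pull reads along incoming edges:

def push_step(out_edges, values):
    # Push mode: every vertex pushes its value to its target vertices
    # along its outgoing edges (writes are scattered over neighbours).
    updates = [0.0] * len(values)
    for u, targets in enumerate(out_edges):
        for v in targets:            # edge u -> v
            updates[v] += values[u]  # u pushes its value to target v
    return updates

def pull_step(in_edges, values):
    # Pull mode: every vertex pulls the values of its source vertices
    # along its incoming edges (reads are scattered, writes stay local).
    updates = [0.0] * len(values)
    for v, sources in enumerate(in_edges):
        for u in sources:            # edge u -> v
            updates[v] += values[u]  # v pulls the value of source u
    return updates

# Toy graph with edges 0->1, 0->2, 1->2.
out_edges = [[1, 2], [2], []]
in_edges = [[], [0], [0, 1]]
values = [1.0, 2.0, 3.0]
assert push_step(out_edges, values) == pull_step(in_edges, values)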
Step S3: dividing vertices into different graph partitions according to the end-to-end partitioning index, and then allocating the vertices through the streaming chunk partitioning algorithm with an optimal threshold.
In the vertex allocation task, a more representative partitioning index is proposed according to the characteristics of the end-to-end task; a streaming chunk partitioning algorithm with a theoretical guarantee is then provided to make the end-to-end task partitioning more balanced.
The second task is to divide the vertices onto different graph partitions so that the workload on each partition is as equal as possible. First, a more balanced partitioning formula is proposed:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
This formula takes into account the communication and computation loads of both the preprocessing phase and the algorithm execution phase. A streaming chunk partitioning algorithm with an optimal threshold is then proposed. Chunk partitioning (chunking) is the partitioning algorithm with the lowest known preprocessing cost, but existing chunk partitioning algorithms suffer from load imbalance. As shown in FIG. 2, the optimal partitioning strategy is found by searching for the optimal threshold. It can be shown that the function underlying this search is non-decreasing and that the optimal threshold is exactly the point where the function value changes; based on these two properties, the optimal threshold can be found efficiently with a binary search.
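A minimal Python sketch of such a threshold search, under the assumption that the streaming chunk partitioning greedily opens a new chunk whenever the running cost would exceed the threshold, with per-vertex costs taken from the end-to-end partitioning index above; the parameter values (eta, theta) and helper names are illustrative, not the patent's:

def vertex_cost(deg, K, eta, theta):
    # Per-vertex contribution to (1+eta+theta(K-1))*E(P_i) + eta(K-1)*V(P_i).
    return (1 + eta + theta * (K - 1)) * deg + eta * (K - 1)

def chunks_needed(costs, threshold):
    # One streaming pass: cut a new chunk when the next vertex would
    # overflow the threshold. The result is non-increasing in the
    # threshold, which makes the feasibility test below monotone.
    chunks, acc = 1, 0.0
    for c in costs:
        if acc + c > threshold:
            chunks, acc = chunks + 1, 0.0
        acc += c
    return chunks

def optimal_threshold(costs, K, iters=50):
    # Binary search over the threshold: feasibility (<= K chunks) flips
    # from False to True exactly once, at the optimal threshold.
    lo, hi = max(costs), sum(costs)
    for _ in range(iters):
        mid = (lo + hi) / 2
        if chunks_needed(costs, mid) <= K:
            hi = mid  # feasible: the flip point is at or below mid
        else:
            lo = mid
    return hi

degrees = [5, 1, 8, 2, 2, 7, 3, 1]  # out-degrees in streaming (vertex) order
costs = [vertex_cost(d, K=3, eta=0.5, theta=0.1) for d in degrees]
print(optimal_threshold(costs, K=3))  # max per-chunk cost of the best 3-way split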
Step S4: extending the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on it.
In the adjacency list construction task, a load-balanced radix sorting algorithm is used. By exploiting the characteristics of graph computation, the algorithm is extended to the distributed scenario and the overhead of distributed communication is greatly reduced.
The third task is to convert the edge array of the underlying graph data format into adjacency lists using a distributed sorting algorithm; this is the most time-consuming task of the preprocessing phase. For intra-machine sorting, the existing load-balanced radix sorting algorithm is extended into a NUMA-aware load-balanced radix sorting algorithm, as shown in FIG. 3, making it better suited to the NUMA architecture of current servers. The extension covers two aspects: the first is to use shared-memory communication, and the second is to perform memory allocation in the specific NUMA memory node of each thread. For inter-machine sorting, the NUMA-aware load-balanced radix sorting algorithm is then extended to the distributed scenario, as shown in FIG. 4. A characteristic of graph computation, namely that part of the sorting result is known before sorting, is fully exploited: the data can be partially sorted before transmission to reduce the amount of data transmitted.
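A minimal single-machine Python sketch of the counting-and-scatter pass at the core of building adjacency lists by radix sorting edges on their source vertex; the NUMA-aware multi-threaded and distributed aspects are only indicated in comments, and the function name and CSR-style output are illustrative assumptions, not the patent's code:

def edges_to_adjacency(num_vertices, edges):
    # Convert an edge array [(src, dst), ...] into CSR-style adjacency
    # lists with one histogram pass and one scatter pass, i.e. a single
    # radix-sort digit pass keyed on the source vertex.
    counts = [0] * num_vertices
    for src, _ in edges:  # pass 1: out-degree histogram (per-thread
        counts[src] += 1  # histograms in the NUMA-aware parallel variant)
    # Exclusive prefix sum gives each vertex its write offset; in the
    # load-balanced variant it also assigns threads equal shares of writes.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]
    neighbours = [0] * len(edges)
    cursor = list(offsets[:num_vertices])
    for src, dst in edges:  # pass 2: stable scatter into final position
        neighbours[cursor[src]] = dst
        cursor[src] += 1
    return offsets, neighbours

# Edges 2->0, 0->1, 1->2, 0->2 on a 3-vertex graph.
offsets, neighbours = edges_to_adjacency(3, [(2, 0), (0, 1), (1, 2), (0, 2)])
print(offsets)     # [0, 2, 3, 4]
print(neighbours)  # [1, 2, 2, 0]: vertex 0 -> [1, 2], 1 -> [2], 2 -> [0]

In the distributed variant described above, each machine would additionally bucket its edges by target partition before transmission, so that every machine receives pre-sorted runs and only a merge remains.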
According to the large-scale distributed graph computation end-to-end acceleration method provided by the embodiment of the invention, end-to-end time is taken as the optimization target: a static mode selection is proposed based on theoretical analysis to reduce the data preprocessing workload; a more balanced end-to-end partitioning index and a streaming chunk partitioning algorithm are provided; and a faster and more efficient distributed sorting algorithm is provided to accelerate the sorting process.
Next, the large-scale distributed graph computation end-to-end acceleration apparatus proposed according to an embodiment of the present invention is described with reference to the drawings.
FIG. 5 is a block diagram of a large-scale distributed graph computation end-to-end acceleration apparatus according to an embodiment of the present invention.
As shown in FIG. 5, the large-scale distributed graph computation end-to-end acceleration apparatus includes: a partitioning module 501, a selection module 502, an allocation module 503 and a construction module 504.
The partitioning module 501 is configured to perform task division on the distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task.
The selection module 502 is configured to select the corresponding information-flow mode to complete the model selection task.
The allocation module 503 is configured to divide vertices into different graph partitions according to the end-to-end partitioning index, and to allocate the vertices through the streaming chunk partitioning algorithm with an optimal threshold.
The construction module 504 is configured to extend the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and to convert the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort.
Further, the information-flow modes comprise a push mode and a pull mode, wherein in the push mode each vertex pushes its updated information to target vertices through its outgoing edges, and in the pull mode each vertex pulls updated information from source vertices to itself through its incoming edges.
Further, the end-to-end partitioning index is:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
Further, the optimal threshold of the algorithm is found by binary search.
Further, extending the load-balanced radix sorting algorithm to obtain the NUMA-aware load-balanced radix sorting algorithm comprises:
using shared-memory communication, and performing memory allocation in the specific NUMA memory node of each thread.
It should be noted that the foregoing explanation of the method embodiment is also applicable to the apparatus of this embodiment, and is not repeated herein.
According to the large-scale distributed graph computation end-to-end acceleration apparatus provided by the embodiment of the invention, end-to-end time is taken as the optimization target: a static mode selection is proposed based on theoretical analysis to reduce the data preprocessing workload; a more balanced end-to-end partitioning index and a streaming chunk partitioning algorithm are provided; and a faster and more efficient distributed sorting algorithm is provided to accelerate the sorting process.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two or three, unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (10)

1. An end-to-end acceleration method for large-scale distributed graph computation, characterized by comprising the following steps:
performing task division on the distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task;
selecting the corresponding information-flow mode to complete the model selection task;
dividing vertices into different graph partitions according to an end-to-end partitioning index, and then allocating the vertices through a streaming chunk partitioning algorithm with an optimal threshold; and
extending the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort.
2. The method of claim 1, wherein the information-flow modes comprise a push mode and a pull mode, wherein in the push mode each vertex pushes its updated information to target vertices through its outgoing edges, and in the pull mode each vertex pulls updated information from source vertices to itself through its incoming edges.
3. The method of claim 1, wherein the end-to-end partitioning index is:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
4. The method of claim 1, wherein the optimal threshold of the algorithm is found by binary search.
5. The method of claim 1, wherein extending the load-balanced radix sorting algorithm to obtain the NUMA-aware load-balanced radix sorting algorithm comprises:
using shared-memory communication, and performing memory allocation in the specific NUMA memory node of each thread.
6. An end-to-end acceleration apparatus for large-scale distributed graph computation, characterized by comprising:
a partitioning module for performing task division on the distributed graph computation to obtain a model selection task, a vertex allocation task and an adjacency list construction task;
a selection module for selecting the corresponding information-flow mode to complete the model selection task;
an allocation module for dividing vertices into different graph partitions according to the end-to-end partitioning index and allocating the vertices through the streaming chunk partitioning algorithm with an optimal threshold; and
a construction module for extending the load-balanced radix sorting algorithm to obtain a NUMA-aware load-balanced radix sorting algorithm, and converting the edge array of the underlying graph data format into adjacency lists with a distributed sorting algorithm built on the NUMA-aware load-balanced radix sort.
7. The apparatus of claim 6, wherein the information-flow modes comprise a push mode and a pull mode, wherein in the push mode each vertex pushes its updated information to target vertices through its outgoing edges, and in the pull mode each vertex pulls updated information from source vertices to itself through its incoming edges.
8. The apparatus of claim 6, wherein the end-to-end partitioning index is:
(1 + η + θ(K-1)) * E(P_i) + η(K-1) * V(P_i)
where η is a tunable parameter balancing the weight of preprocessing against algorithm execution, θ is the communication ratio in the distributed sorting algorithm, K is the number of partitions the whole graph is divided into, E(P_i) is the number of edges of all vertices in partition P_i, and V(P_i) is the number of vertices in partition P_i.
9. The apparatus of claim 6, wherein the optimal threshold of the algorithm is found by binary search.
10. The apparatus of claim 6, wherein extending the load-balanced radix sorting algorithm to obtain the NUMA-aware load-balanced radix sorting algorithm comprises:
using shared-memory communication, and performing memory allocation in the specific NUMA memory node of each thread.
CN202110552903.0A 2021-05-20 2021-05-20 Large-scale distributed graph calculation end-to-end acceleration method and device Active CN113326125B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110552903.0A CN113326125B (en) 2021-05-20 2021-05-20 Large-scale distributed graph calculation end-to-end acceleration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110552903.0A CN113326125B (en) 2021-05-20 2021-05-20 Large-scale distributed graph calculation end-to-end acceleration method and device

Publications (2)

Publication Number Publication Date
CN113326125A 2021-08-31
CN113326125B CN113326125B (en) 2023-03-24

Family

ID=77416134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110552903.0A Active CN113326125B (en) 2021-05-20 2021-05-20 Large-scale distributed graph calculation end-to-end acceleration method and device

Country Status (1)

Country Link
CN (1) CN113326125B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190278760A1 (en) * 2008-11-14 2019-09-12 Georgetown University Process and Framework For Facilitating Information Sharing Using a Distributed Hypergraph
US20130097321A1 (en) * 2011-10-17 2013-04-18 Yahoo! Inc. Method and system for work load balancing
CN104954823A (en) * 2014-03-31 2015-09-30 华为技术有限公司 Image calculation pretreatment device, method thereof and system thereof
CN104780213A (en) * 2015-04-17 2015-07-15 华中科技大学 Load dynamic optimization method for principal and subordinate distributed graph manipulation system
CN105787020A (en) * 2016-02-24 2016-07-20 鄞州浙江清华长三角研究院创新中心 Graph data partitioning method and device
CN109919826A (en) * 2019-02-02 2019-06-21 西安邮电大学 A kind of diagram data compression method and figure computation accelerator for figure computation accelerator
CN110245135A (en) * 2019-05-05 2019-09-17 华中科技大学 A kind of extensive streaming diagram data update method based on NUMA architecture
US20210081347A1 (en) * 2019-09-17 2021-03-18 Huazhong University Of Science And Technology Graph processing optimization method based on multi-fpga accelerator interconnection
CN111209106A (en) * 2019-12-25 2020-05-29 北京航空航天大学杭州创新研究院 Streaming graph partitioning method and system based on cache mechanism
CN111581443A (en) * 2020-04-16 2020-08-25 南方科技大学 Distributed graph calculation method, terminal, system and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Yin Xiaobo et al., "A relaxed and optimized balanced streaming graph partitioning algorithm", Computer Science (《计算机科学》) *
Wang Tongtong et al., "Survey of distributed graph processing systems", Journal of Software (《软件学报》) *
Luo Dongmei, "Balanced graph partitioning algorithm for distributed graph computing", Information & Computer (Theory Edition) (《信息与电脑(理论版)》) *

Also Published As

Publication number Publication date
CN113326125B (en) 2023-03-24

Similar Documents

Publication Publication Date Title
CN108566659B (en) 5G network slice online mapping method based on reliability
Nabi et al. Resource assignment in vehicular clouds
CN106250233B (en) MapReduce performance optimization system and optimization method
Schlag et al. Scalable edge partitioning
CN112148492A (en) Service deployment and resource allocation method considering multi-user mobility
CN114418127A (en) Machine learning calculation optimization method and platform
Xu et al. Computational experience with a software framework for parallel integer programming
Koh et al. MapReduce skyline query processing with partitioning and distributed dominance tests
CN111538867A (en) Method and system for dividing bounded incremental graph
Badri et al. A sample average approximation-based parallel algorithm for application placement in edge computing systems
CN113326125B (en) Large-scale distributed graph calculation end-to-end acceleration method and device
WO2015055502A2 (en) Method of partitioning storage in a distributed data storage system and corresponding device
Choo et al. Reliable vehicle selection algorithm with dynamic mobility of vehicle in vehicular cloud system
CN116303763A (en) Distributed graph database incremental graph partitioning method and system based on vertex degree
Abdolazimi et al. Connected components of big graphs in fixed mapreduce rounds
Guinand et al. Sensitivity analysis of tree scheduling on two machines with communication delays
Herrera et al. Dynamic and hierarchical load-balancing techniques applied to parallel branch-and-bound methods
Menouer et al. Towards a parallel constraint solver for cloud computing environments
CN113157431A (en) Computing task copy distribution method for edge network application environment
CN110188925A (en) A kind of time domain continuous type space crowdsourcing method for allocating tasks
CN111737531B (en) Application-driven graph division adjusting method and system
Karanik et al. Edge Service Allocation Based on Clustering Techniques
CN117349031B (en) Distributed super computing resource scheduling analysis method, system, terminal and medium
Shahin Using heavy clique base coarsening to enhance virtual network embedding
Cavallo et al. Fragmenting Big Data to boost the performance of MapReduce in geographical computing contexts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant