CN113419862A - GPU card group-oriented graph data partitioning optimization method

Info

Publication number
CN113419862A
Authority
CN
China
Prior art keywords
vertex
thread
edge
block
array
Prior art date
Legal status
Granted
Application number
CN202110750006.0A
Other languages
Chinese (zh)
Other versions
CN113419862B (en)
Inventor
罗鑫 (Luo Xin)
王达 (Wang Da)
吴冬冬 (Wu Dongdong)
Current Assignee
Beijing Zhongke Flux Technology Co ltd
Original Assignee
Beijing Ruixin High Throughput Technology Co ltd
Priority date
Filing date: 2021-07-02
Publication date: 2021-09-21
Application filed by Beijing Ruixin High Throughput Technology Co ltd filed Critical Beijing Ruixin High Throughput Technology Co ltd
Priority to CN202110750006.0A priority Critical patent/CN113419862B/en
Publication of CN113419862A publication Critical patent/CN113419862A/en
Application granted granted Critical
Publication of CN113419862B publication Critical patent/CN113419862B/en
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

A graph data partitioning optimization method oriented to a GPU card group partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method combines dynamic and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.

Description

GPU card group-oriented graph data partitioning optimization method
Technical Field
The invention relates to a GPU card group-oriented graph data partitioning optimization method, and in particular to graph data partitioning optimization that improves the load balance of the breadth-first search algorithm on a GPU card group.
Background
A graph is a mathematical object representing relationships between entities, and many real-world applications are naturally represented by graph data structures, such as protein structure prediction, shortest-path computation, scientific literature citation analysis, and social network analysis. Breadth-First Search (BFS) is a classical graph traversal algorithm.
The BFS algorithm has the following characteristics: intensive memory access, irregularity, data dependency, and poor locality. The conventional BFS algorithm is executed in a Top-down manner, which is also the common implementation in serial programs. Top-down uses a current queue and a lower-layer queue: vertices stored in the current queue are called boundary vertices and are the vertices to be searched in the current layer, while the lower-layer queue holds the lower-layer vertices discovered by searching the neighbors of the vertices in the current queue, i.e., the vertices expanded in the current layer. For a graph G = (V, E), where V is the vertex set and E is the edge set, given a source vertex s, the Top-down algorithm iteratively searches all reachable vertices layer by layer starting from s and finally forms the BFS search tree. The BFS search tree is recorded as a path array and a hierarchy array: the path array records the parent vertex of each expanded vertex, and the hierarchy array records the layer of each expanded vertex.
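For concreteness, the following is a minimal serial sketch of the Top-down procedure just described, assuming the CSR adjacency layout discussed later in this document; identifiers such as bfs_top_down are illustrative and not part of the patented method.

```cpp
#include <cstdint>
#include <vector>

// Minimal serial Top-down BFS over a CSR graph: the current queue holds the
// boundary vertices of this layer, the next queue collects the vertices
// expanded in this layer, parent[] is the path array and level[] the
// hierarchy array described above.
void bfs_top_down(const std::vector<int64_t>& row,      // CSR row offsets
                  const std::vector<int32_t>& edges,    // CSR neighbor list
                  int32_t source,
                  std::vector<int32_t>& parent,
                  std::vector<int32_t>& level) {
    const size_t n = row.size() - 1;
    parent.assign(n, -1);
    level.assign(n, -1);
    std::vector<int32_t> current{source}, next;
    parent[source] = source;
    level[source] = 0;

    for (int32_t depth = 1; !current.empty(); ++depth) {
        next.clear();
        for (int32_t v : current) {
            // Neighbors of v occupy edges[row[v] .. row[v+1]).
            for (int64_t e = row[v]; e < row[v + 1]; ++e) {
                int32_t w = edges[e];
                if (level[w] == -1) {          // not yet visited
                    level[w] = depth;
                    parent[w] = v;             // record parent in path array
                    next.push_back(w);
                }
            }
        }
        current.swap(next);                    // descend one layer
    }
}
```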
Due to these inherent characteristics, parallelizing the BFS algorithm rarely yields good performance out of the box, and BFS on multi-core processor platforms has long been a hot research topic. In recent years, researchers at home and abroad have conducted a series of studies on the BFS algorithm and proposed several effective optimization schemes, including the level-synchronous BFS algorithm, the direction-optimizing BFS algorithm, BFS on NUMA (non-uniform memory access) architectures, and data preprocessing.
The Graphics Processing Unit (GPU) was created to solve the problem of graphics rendering efficiency, but since GPU hardware has developed at a pace far exceeding Moore's law, GPUs have gradually attracted attention in the field of high-performance general-purpose computing. With abundant computing resources and high energy efficiency, the GPU has become an attractive computing platform; it is generally used as a coprocessor to the CPU for high-performance computing and now plays an important role in physics, biology, medicine, finance, and other fields.
On the GPU platform, because the degrees of graph data follow a power-law distribution, vertex degrees vary widely, and problems such as load imbalance, divergent memory accesses, and redundant overhead are severe. Several dedicated load-balancing and memory-access optimization techniques targeting these GPU performance bottlenecks have improved algorithm performance to some extent, but they have not achieved a good speedup ratio, and there is still much room to improve GPU parallel performance on graph algorithms.
Disclosure of Invention
The invention provides a GPU card group-oriented graph data partitioning optimization method that combines two edge-granularity parallelization strategies, dynamic load partitioning and static load partitioning, in the breadth-first search algorithm, aiming to improve load balance during execution of the Top-down algorithm.
In order to achieve the above object, the present invention provides a graph data partitioning optimization method oriented to a GPU card group. The method partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method comprises a dynamic load partitioning method and a static load partitioning method, switched as follows: initially, the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning. Here m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v.
The dynamic load partitioning method comprises the following steps:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array is the total number of vertices of the graph data, and the difference between the next entry and the current entry is the degree of the vertex; the edge array stores the neighbor vertex list of each vertex, and its length is the total number of edges of the graph data;
S12: creating a main kernel function, wherein the dimension of the main kernel function is determined by the number of vertices in the current queue and each thread processes one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions, respectively;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function; that is, when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, while if the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performs the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue.
The static load partitioning method comprises the following steps:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
In one embodiment of the present invention, k has a value of 32.
In an embodiment of the present invention, α is the judgment coefficient for switching from Top-down to Bottom-up, and β is the judgment coefficient for switching from Bottom-up to Top-down.
The GPU card group-oriented graph data partitioning optimization method of the invention combines two parallelization strategies, dynamic load partitioning and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of the switching between load partitioning modes;
FIG. 2 shows the CSR structure of graph data;
FIG. 3 is a schematic flow chart of Top-down with dynamic load partitioning;
FIG. 4 is a schematic diagram of the dynamic load partitioning process;
FIG. 5 is a processing flow diagram of static load partitioning;
FIG. 6 is a diagram illustrating the load distribution of the boundary vertices.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Breadth-first search algorithms typically store graph data using a Compressed Sparse Row (CSR) structure. CSR stores the adjacency matrix in row-compressed form; the neighbors of each vertex are stored contiguously, and all stored entries are valid values. The structure embodies the idea of an adjacency list but has a smaller storage footprint, higher access efficiency, and greater information density.
First, the dynamic load partitioning method and the static load partitioning method applied in the present invention are briefly described:
Dynamic load partitioning in the present invention handles load balancing by using CUDA dynamic parallelism. The dynamic parallelism mechanism allows nested kernel launches: secondary (child) kernel functions are created dynamically from threads of the primary kernel function. Dynamic load partitioning adjusts the dimensions of the primary and secondary kernels during execution to adapt to the scale-free character of real graph data, thereby achieving load balance. The dimension of the primary kernel is determined by the number of boundary vertices, each thread handling the search task of one vertex. The dimension of a secondary kernel is determined by the degree of its vertex, each thread being responsible for traversing a fixed number of adjacent edges.
With dynamic parallelism, the load is partitioned by edges even when the degrees of the boundary vertices differ greatly, so the load among threads stays balanced. The dynamic load partitioning method ensures that the load of high-degree vertices is partitioned effectively and mitigates the effects of the scale-free nature of graph data, but it does not suit every situation in the Top-down algorithm. Dynamic load partitioning changes the number of sub-kernels and threads during execution; when the current queue holds many vertices and the task volume is large, a large number of secondary kernels are created during the search, and too many secondary kernels and threads incur heavy scheduling overhead that hurts program execution. At that point, the Top-down algorithm switches to the static load partitioning mode.
Static load partitioning distributes the edge-search load of all boundary vertices of each layer across the thread blocks, with the threads inside each block then executing the search of individual edges; this keeps the load of the thread blocks equivalent, the load among threads balanced, and every thread active. To divide the load evenly among thread blocks, a prefix-sum operation is applied to the boundary-vertex degrees to obtain the degree prefix-sum array; prefix sum is a common GPU primitive and can be implemented efficiently. The search load is then distributed evenly to the thread blocks; each thread in a block locates, by binary search on its edge ID, the vertex that owns the edge it must process, and then locates and expands the neighbor vertex from the vertex ID and the edge information.
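As an illustration of this step, the degree prefix sum can be obtained with an off-the-shelf device scan; the sketch below uses Thrust's exclusive_scan and is only one possible realization under that assumption, not the patented kernel.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Build degree_prefix_scan (exclusive prefix sum of boundary-vertex degrees)
// and total_degree (the layer's total edge-search load) on the device.
void build_degree_prefix(const thrust::device_vector<int>& degrees,
                         thrust::device_vector<int>& degree_prefix_scan,
                         int& total_degree) {
    degree_prefix_scan.resize(degrees.size());
    thrust::exclusive_scan(degrees.begin(), degrees.end(),
                           degree_prefix_scan.begin());
    // Total load of this layer = last prefix value + last degree.
    total_degree = degrees.empty()
        ? 0
        : (int)degree_prefix_scan.back() + (int)degrees.back();
}
```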
Static load partitioning uses edges as the partitioning granularity, guarantees that the loads of the thread blocks and of the threads inside them are equivalent, and effectively solves the load-imbalance problem. However, it introduces task-partitioning overhead. The Top-down execution under static load partitioning reveals the extra overhead brought by load partitioning: obtaining the degree prefix-sum array, distributing the load to the thread blocks, and locating the vertex each edge belongs to. These overheads are negligible when the layer's task volume is large; however, owing to the scale-free nature of real graphs, in some search layers the current queue holds only a few vertices of small degree, and in that case the task-partitioning overhead weighs heavily on the layer's execution time.
Both dynamic and static load partitioning partition the load at edge granularity. Compared with coarse-grained load partitioning, edge-granularity partitioning balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS. Therefore, during actual Top-down execution, the method combines dynamic and static load partitioning to handle the different situations brought by the power-law distribution.
The graph data partitioning optimization method oriented to the GPU card group partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method comprises a dynamic load partitioning method and a static load partitioning method; as shown in FIG. 1, they are switched as follows: initially the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning. Here m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v.
As FIG. 1 shows, when the algorithm starts executing, the dynamic partitioning method is used; at this point the current queue holds few vertices and the task volume is small. When the number of vertices in the current queue grows rapidly past β and the average edge-search task per vertex exceeds α, the current queue carries a heavy task load; continuing with dynamic load partitioning would create too many sub-kernels, so execution switches to the static load partitioning mode. While in static mode, once the task volume drops markedly, execution switches back to dynamic load partitioning.
When the current queue is small and the task load light, dynamic load partitioning performs better; when the task load is heavy, static load partitioning performs better. Combining the two partitioning modes effectively handles the load imbalance of the Top-down stage, achieving good performance whether the current layer's task load is large or small.
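A host-side sketch of this switching rule follows, under the assumption that m_f and |V_f| are tallied when each lower-layer queue is produced; the function name, the enum, and the parameter passing are illustrative.

```cpp
#include <cstdint>

enum class PartitionMode { Dynamic, Static };

// Choose the load-partitioning mode for the next layer: switch to static
// only when the average per-vertex edge load m_f/|V_f| reaches alpha AND the
// frontier holds at least beta vertices; if either condition fails, stay
// with (or fall back to) dynamic partitioning. alpha and beta are the
// empirical parameters of the switching mechanism.
PartitionMode choose_mode(int64_t m_f, int64_t v_f,
                          double alpha, int64_t beta) {
    if (v_f >= beta &&
        static_cast<double>(m_f) / static_cast<double>(v_f) >= alpha)
        return PartitionMode::Static;
    return PartitionMode::Dynamic;
}
```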
FIG. 3 is a schematic flow chart of Top-down with dynamic load partitioning; the dynamic load partitioning method proceeds as follows:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array is the total number of vertices of the graph data, and the difference between the next entry and the current entry is the degree of the vertex; the edge array stores the neighbor vertex list of each vertex, and its length is the total number of edges of the graph data;
FIG. 2 shows the CSR structure of the graph data. The row array index corresponds to the vertex number, and each element points into the edge array. Vertex 0 stores the value 0 and the next element is 3, so the degree of vertex 0 is 3: it has three neighbor vertices 1, 2 and 3, i.e., three edges (0, 1), (0, 2), (0, 3).
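A minimal sketch of this CSR layout and the degree computation it supports (struct and field names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// CSR layout of FIG. 2: row holds one offset per vertex into edges, and
// edges concatenates every vertex's neighbor list back to back.
struct CSRGraph {
    std::vector<int64_t> row;    // offset of each vertex's first neighbor
    std::vector<int32_t> edges;  // neighbor vertex lists
};

// The degree of v is the difference of two consecutive row offsets.
inline int64_t degree(const CSRGraph& g, int32_t v) {
    return g.row[v + 1] - g.row[v];
}

// The example of FIG. 2: vertex 0 stores offset 0 and the next entry is 3,
// so degree(g, 0) == 3 and its neighbors are edges[0..2] = {1, 2, 3}.
// CSRGraph g{{0, 3 /* ... */}, {1, 2, 3 /* ... */}};
```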
S12: creating a main kernel function, wherein the dimension of the main kernel function is determined by the number of vertices in the current queue and each thread processes one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions, respectively;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function; that is, when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, while if the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performs the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue.
It should be noted that during dynamic load partitioning the number of sub-kernel functions must not grow too large, since too many sub-kernels introduce heavy scheduling overhead that hurts performance; the number of sub-kernels created is therefore controlled by the condition in step S14, while vertices of extremely high degree are exempt from the sub-kernel count limit. In addition, the number of threads in a sub-kernel should be neither too large nor too small: too few threads make each thread's load too heavy and leave computing resources underutilized, while too many threads introduce excessive atomic overhead and thread-scheduling problems that degrade program performance.
Experiments show good performance with k = 32, i.e., each thread is responsible for processing 32 neighbor vertices. FIG. 4 is a schematic diagram of the dynamic load partitioning process: a thread of the main kernel function handles each vertex, and a sub-kernel function is then created according to the vertex degree to search the vertex's edges. The edge-granularity partitioning strategy guarantees that the load of a high-degree vertex is divided effectively among all threads, addressing the load imbalance caused by the power-law distribution, while the kernel dimensions change dynamically with the actual load, avoiding large numbers of idle threads. Dynamic load partitioning thus handles the load imbalance caused by the power-law distribution of graph data and keeps the loads of all threads close even for vertices of extremely high degree. It cannot, however, be used unconditionally: when the current queue holds many vertices, the dynamic parallelism mechanism creates excessive sub-kernels, introducing scheduling overhead that can cancel out its advantage.
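A hedged CUDA sketch of steps S12-S16 follows; it assumes dynamic parallelism (compile with -rdc=true, compute capability 3.5+), and the threshold values and identifier names are illustrative assumptions, not those of the patented implementation.

```cuda
#define BLOCK_DIM 256   // fixed thread-block dimension block_dim
#define DEG_MIN   128   // illustrative deg_min
#define DEG_MAX   4096  // illustrative deg_max
#define KERNEL_TH 64    // illustrative kernel_th
#define K         32    // empirical parameter k

// Child kernel (S15): the whole grid cooperatively scans one vertex's edges.
__global__ void expand_vertex(const int64_t* row, const int32_t* edges,
                              int32_t v, int32_t* level, int32_t depth,
                              int32_t* next_q, int32_t* next_size) {
    for (int64_t e = row[v] + blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
         e < row[v + 1]; e += (int64_t)gridDim.x * blockDim.x) {
        int32_t w = edges[e];
        if (atomicCAS(&level[w], -1, depth) == -1)   // claim unvisited vertex
            next_q[atomicAdd(next_size, 1)] = w;     // push to lower queue
    }
}

// Main kernel (S12): one thread per boundary vertex in the current queue.
__global__ void top_down_dynamic(const int64_t* row, const int32_t* edges,
                                 const int32_t* cur_q, int32_t cur_size,
                                 int32_t* level, int32_t depth,
                                 int32_t* next_q, int32_t* next_size,
                                 int* kernel_num) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= cur_size) return;
    int32_t v = cur_q[tid];
    int64_t deg = row[v + 1] - row[v];               // S13

    // S14: spawn a sub-kernel for heavy vertices; the kernel_num read is
    // racy, which is tolerable for a heuristic cap on sub-kernel count.
    bool spawn = (deg > DEG_MIN && atomicAdd(kernel_num, 0) < KERNEL_TH)
                 || deg > DEG_MAX;
    if (spawn) {
        atomicAdd(kernel_num, 1);
        // S15: grid_dim = ceil(vertex_degree / (block_dim * k))
        int grid_dim = (int)((deg + (int64_t)BLOCK_DIM * K - 1)
                             / ((int64_t)BLOCK_DIM * K));
        expand_vertex<<<grid_dim, BLOCK_DIM>>>(row, edges, v, level, depth,
                                               next_q, next_size);
    } else {
        // S16: the main-kernel thread searches the neighbors itself.
        for (int64_t e = row[v]; e < row[v + 1]; ++e) {
            int32_t w = edges[e];
            if (atomicCAS(&level[w], -1, depth) == -1)
                next_q[atomicAdd(next_size, 1)] = w;
        }
    }
}
```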
FIG. 5 is a schematic processing flow diagram of static load partitioning; the static load partitioning method proceeds as follows:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
That is, in the static load partitioning method, a thread block divides its edge-processing interval [block_process_start, block_process_end] among its threads; each thread binary-searches its edge_id against the degree prefix-sum array to identify the owning vertex, and from the vertex information locates the edge's storage position in the edge array and obtains the neighbor vertex. FIG. 6 illustrates the load distribution of the boundary vertices: different numbers of threads are assigned according to each vertex's load.
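A companion CUDA sketch of steps S23-S26 under the same assumptions follows; the prefix array comes from the scan of S21-S22, and the types and the owner search are illustrative.

```cuda
// Binary search (S24): largest index i with prefix[i] <= edge_id, i.e. the
// boundary vertex that owns this edge under the exclusive degree prefix sum.
__device__ int find_owner(const int* prefix, int n, int64_t edge_id) {
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) >> 1;
        if ((int64_t)prefix[mid] <= edge_id) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}

__global__ void top_down_static(const int64_t* row, const int32_t* edges,
                                const int32_t* cur_q, int32_t cur_size,
                                const int* degree_prefix_scan,
                                int64_t total_degree,
                                int32_t* level, int32_t depth,
                                int32_t* next_q, int32_t* next_size) {
    // S23: each block owns an (almost) equal share of the layer's edges.
    int64_t edge_per_block = (total_degree + gridDim.x - 1) / gridDim.x;
    int64_t block_start = blockIdx.x * edge_per_block;
    int64_t block_end = block_start + edge_per_block;
    if (block_end > total_degree) block_end = total_degree;

    // S24/S26: threads stride through the block's edge range.
    for (int64_t edge_id = block_start + threadIdx.x; edge_id < block_end;
         edge_id += blockDim.x) {
        int i = find_owner(degree_prefix_scan, cur_size, edge_id);
        int32_t v = cur_q[i];
        // Offset within v's adjacency list, then position in the edge array.
        int64_t pos = row[v] + (edge_id - degree_prefix_scan[i]);
        int32_t w = edges[pos];                       // S25
        if (atomicCAS(&level[w], -1, depth) == -1)    // expand if unvisited
            next_q[atomicAdd(next_size, 1)] = w;
    }
}
```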
The GPU card group-oriented graph data partitioning optimization method of the invention combines two parallelization strategies, dynamic load partitioning and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.
Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and the blocks or flows in the figures are not necessarily required to practice the present invention.
Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (3)

1. A graph data partitioning optimization method oriented to a GPU card group, characterized in that the method partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w; the method comprises a dynamic load partitioning method and a static load partitioning method, switched as follows: initially, the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning, where m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v,
the dynamic load partitioning method comprising the following steps:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array being the total number of vertices of the graph data and the difference between the next entry and the current entry being the degree of the vertex, and the edge array stores the neighbor vertex list of each vertex, its length being the total number of edges of the graph data;
S12: creating a main kernel function, the dimension of which is determined by the number of vertices in the current queue, each thread processing one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are respectively the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function, wherein when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, and when the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performing the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue;
the static load partitioning method comprising the following steps:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into the parameter total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
2. The method of claim 1, wherein k has a value of 32.
3. The method of claim 1, wherein α is a coefficient of judgment when switching from Top-down to Bottom-up, and β is a coefficient of judgment when switching from Bottom-up to Top-down.
CN202110750006.0A 2021-07-02 2021-07-02 GPU card group-oriented graph data division optimization method Active CN113419862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110750006.0A CN113419862B (en) 2021-07-02 2021-07-02 GPU card group-oriented graph data division optimization method

Publications (2)

Publication Number Publication Date
CN113419862A 2021-09-21
CN113419862B (en) 2023-09-19

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN110245135A 2019-05-05 2019-09-17 Huazhong University of Science and Technology Large-scale streaming graph data update method based on NUMA architecture
CN110619595A 2019-09-17 2019-12-27 Huazhong University of Science and Technology Graph computation optimization method based on interconnection of multiple FPGA accelerators
CN112445940A 2020-10-16 2021-03-05 Suzhou Inspur Intelligent Technology Co., Ltd. Graph partitioning method, graph partitioning device and computer-readable storage medium

Cited By (1)

Publication number Priority date Publication date Assignee Title
WO2023184836A1 2022-03-31 2023-10-05 Research Institute of Tsinghua University in Shenzhen Subgraph segmented optimization method based on inter-core storage access, and application

Also Published As

Publication number Publication date
CN113419862B (en) 2023-09-19

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 711c, 7/F, Block A, Building 1, Yard 19, Ronghua Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing 102600

Patentee after: Beijing Zhongke Flux Technology Co., Ltd.

Address before: Room 711c, 7/F, Block A, Building 1, Yard 19, Ronghua Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing 102600

Patentee before: Beijing Ruixin High Throughput Technology Co., Ltd.