CN113419862A - GPU card group-oriented graph data partitioning optimization method

Info

Publication number
CN113419862A
Authority
CN
China
Prior art keywords
vertex
thread
edge
block
array
Prior art date
Legal status
Granted
Application number
CN202110750006.0A
Other languages
Chinese (zh)
Other versions
CN113419862B (en)
Inventor
罗鑫 (Luo Xin)
王达 (Wang Da)
吴冬冬 (Wu Dongdong)
Current Assignee
Beijing Zhongke Flux Technology Co ltd
Original Assignee
Beijing Ruixin High Throughput Technology Co ltd
Priority date
Filing date: 2021-07-02
Publication date: 2021-09-21
Application filed by Beijing Ruixin High Throughput Technology Co ltd filed Critical Beijing Ruixin High Throughput Technology Co ltd
Priority to CN202110750006.0A priority Critical patent/CN113419862B/en
Publication of CN113419862A publication Critical patent/CN113419862A/en
Application granted granted Critical
Publication of CN113419862B publication Critical patent/CN113419862B/en
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Abstract

A graph data partitioning optimization method oriented to a GPU card group partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method combines dynamic and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.

Description

GPU card group-oriented graph data partitioning optimization method
Technical Field
The invention relates to a GPU card group-oriented graph data partitioning optimization method, and in particular to graph data partitioning optimization that improves the load balance of the breadth-first search algorithm on a GPU card group.
Background
A graph is a mathematical object representing relationships between entities, and many real-world applications are naturally represented by graph data structures, such as protein structure prediction, shortest-path computation, scientific literature citation analysis, and social network analysis. Breadth-First Search (BFS) is a classical graph traversal algorithm.
The BFS algorithm has the following characteristics: intensive memory access, irregularity, data dependency, and poor locality. The conventional BFS algorithm is executed in a Top-down manner, which is also the common implementation in serial programs. Top-down uses a current queue and a lower-layer queue: vertices stored in the current queue are called boundary vertices and are the vertices to be searched in the current layer, while the lower-layer queue holds the lower-layer vertices discovered by searching the neighbors of the vertices in the current queue, i.e., the vertices expanded in the current layer. For a graph G = (V, E), where V is the vertex set and E is the edge set, given a source vertex s, the Top-down algorithm iteratively searches all reachable vertices layer by layer starting from s and finally forms the BFS search tree. The BFS search tree is recorded as a path array and a hierarchy array: the path array records the parent vertex of each expanded vertex, and the hierarchy array records the layer of each expanded vertex.
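For concreteness, the following is a minimal serial sketch of the Top-down procedure just described, assuming the CSR adjacency layout discussed later in this document; identifiers such as bfs_top_down are illustrative and not part of the patented method.

```cpp
#include <cstdint>
#include <vector>

// Minimal serial Top-down BFS over a CSR graph: the current queue holds the
// boundary vertices of this layer, the next queue collects the vertices
// expanded in this layer, parent[] is the path array and level[] the
// hierarchy array described above.
void bfs_top_down(const std::vector<int64_t>& row,      // CSR row offsets
                  const std::vector<int32_t>& edges,    // CSR neighbor list
                  int32_t source,
                  std::vector<int32_t>& parent,
                  std::vector<int32_t>& level) {
    const size_t n = row.size() - 1;
    parent.assign(n, -1);
    level.assign(n, -1);
    std::vector<int32_t> current{source}, next;
    parent[source] = source;
    level[source] = 0;

    for (int32_t depth = 1; !current.empty(); ++depth) {
        next.clear();
        for (int32_t v : current) {
            // Neighbors of v occupy edges[row[v] .. row[v+1]).
            for (int64_t e = row[v]; e < row[v + 1]; ++e) {
                int32_t w = edges[e];
                if (level[w] == -1) {          // not yet visited
                    level[w] = depth;
                    parent[w] = v;             // record parent in path array
                    next.push_back(w);
                }
            }
        }
        current.swap(next);                    // descend one layer
    }
}
```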
Due to these inherent characteristics, parallelizing the BFS algorithm rarely yields good performance out of the box, and BFS on multi-core processor platforms has long been a hot research topic. In recent years, researchers at home and abroad have conducted a series of studies on the BFS algorithm and proposed several effective optimization schemes, including the level-synchronous BFS algorithm, the direction-optimizing BFS algorithm, BFS on NUMA (non-uniform memory access) architectures, and data preprocessing.
The Graphics Processing Unit (GPU) was created to solve the problem of graphics rendering efficiency, but since GPU hardware has developed at a pace far exceeding Moore's law, GPUs have gradually attracted attention in the field of high-performance general-purpose computing. With abundant computing resources and high energy efficiency, the GPU has become an attractive computing platform; it is generally used as a coprocessor to the CPU for high-performance computing and now plays an important role in physics, biology, medicine, finance, and other fields.
On the GPU platform, because the degrees of graph data follow a power-law distribution, vertex degrees vary widely, and problems such as load imbalance, divergent memory accesses, and redundant overhead are severe. Several dedicated load-balancing and memory-access optimization techniques targeting these GPU performance bottlenecks have improved algorithm performance to some extent, but they have not achieved a good speedup ratio, and there is still much room to improve GPU parallel performance on graph algorithms.
Disclosure of Invention
The invention provides a GPU card group-oriented graph data partitioning optimization method that combines two edge-granularity parallelization strategies, dynamic load partitioning and static load partitioning, in the breadth-first search algorithm, aiming to improve load balance during execution of the Top-down algorithm.
In order to achieve the above object, the present invention provides a graph data partitioning optimization method oriented to a GPU card group. The method partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method comprises a dynamic load partitioning method and a static load partitioning method, switched as follows: initially, the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning. Here m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v.
The dynamic load partitioning method comprises the following steps:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array is the total number of vertices of the graph data, and the difference between the next entry and the current entry is the degree of the vertex; the edge array stores the neighbor vertex list of each vertex, and its length is the total number of edges of the graph data;
S12: creating a main kernel function, wherein the dimension of the main kernel function is determined by the number of vertices in the current queue and each thread processes one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions, respectively;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function; that is, when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, while if the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performs the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue.
The static load partitioning method comprises the following steps:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
In one embodiment of the present invention, k has a value of 32.
In an embodiment of the present invention, α is the judgment coefficient for switching from Top-down to Bottom-up, and β is the judgment coefficient for switching from Bottom-up to Top-down.
The GPU card group-oriented graph data partitioning optimization method of the invention combines two parallelization strategies, dynamic load partitioning and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of the switching between load partitioning modes;
FIG. 2 shows the CSR structure of graph data;
FIG. 3 is a schematic flow chart of Top-down with dynamic load partitioning;
FIG. 4 is a schematic diagram of the dynamic load partitioning process;
FIG. 5 is a processing flow diagram of static load partitioning;
FIG. 6 is a diagram illustrating the load distribution of the boundary vertices.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
Breadth-first search algorithms typically store graph data using a Compressed Sparse Row (CSR) structure. CSR stores the adjacency matrix in row-compressed form; the neighbors of each vertex are stored contiguously, and all stored entries are valid values. The structure embodies the idea of an adjacency list but has a smaller storage footprint, higher access efficiency, and greater information density.
First, the dynamic load partitioning method and the static load partitioning method applied in the present invention are briefly described:
Dynamic load partitioning in the present invention handles load balancing by using CUDA dynamic parallelism. The dynamic parallelism mechanism allows nested kernel launches: secondary (child) kernel functions are created dynamically from threads of the primary kernel function. Dynamic load partitioning adjusts the dimensions of the primary and secondary kernels during execution to adapt to the scale-free character of real graph data, thereby achieving load balance. The dimension of the primary kernel is determined by the number of boundary vertices, each thread handling the search task of one vertex. The dimension of a secondary kernel is determined by the degree of its vertex, each thread being responsible for traversing a fixed number of adjacent edges.
With dynamic parallelism, the load is partitioned by edges even when the degrees of the boundary vertices differ greatly, so the load among threads stays balanced. The dynamic load partitioning method ensures that the load of high-degree vertices is partitioned effectively and mitigates the effects of the scale-free nature of graph data, but it does not suit every situation in the Top-down algorithm. Dynamic load partitioning changes the number of sub-kernels and threads during execution; when the current queue holds many vertices and the task volume is large, a large number of secondary kernels are created during the search, and too many secondary kernels and threads incur heavy scheduling overhead that hurts program execution. At that point, the Top-down algorithm switches to the static load partitioning mode.
Static load partitioning distributes the edge-search load of all boundary vertices of each layer across the thread blocks, with the threads inside each block then executing the search of individual edges; this keeps the load of the thread blocks equivalent, the load among threads balanced, and every thread active. To divide the load evenly among thread blocks, a prefix-sum operation is applied to the boundary-vertex degrees to obtain the degree prefix-sum array; prefix sum is a common GPU primitive and can be implemented efficiently. The search load is then distributed evenly to the thread blocks; each thread in a block locates, by binary search on its edge ID, the vertex that owns the edge it must process, and then locates and expands the neighbor vertex from the vertex ID and the edge information.
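As an illustration of this step, the degree prefix sum can be obtained with an off-the-shelf device scan; the sketch below uses Thrust's exclusive_scan and is only one possible realization under that assumption, not the patented kernel.

```cpp
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Build degree_prefix_scan (exclusive prefix sum of boundary-vertex degrees)
// and total_degree (the layer's total edge-search load) on the device.
void build_degree_prefix(const thrust::device_vector<int>& degrees,
                         thrust::device_vector<int>& degree_prefix_scan,
                         int& total_degree) {
    degree_prefix_scan.resize(degrees.size());
    thrust::exclusive_scan(degrees.begin(), degrees.end(),
                           degree_prefix_scan.begin());
    // Total load of this layer = last prefix value + last degree.
    total_degree = degrees.empty()
        ? 0
        : (int)degree_prefix_scan.back() + (int)degrees.back();
}
```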
Static load partitioning uses edges as the partitioning granularity, guarantees that the loads of the thread blocks and of the threads inside them are equivalent, and effectively solves the load-imbalance problem. However, it introduces task-partitioning overhead. The Top-down execution under static load partitioning reveals the extra overhead brought by load partitioning: obtaining the degree prefix-sum array, distributing the load to the thread blocks, and locating the vertex each edge belongs to. These overheads are negligible when the layer's task volume is large; however, owing to the scale-free nature of real graphs, in some search layers the current queue holds only a few vertices of small degree, and in that case the task-partitioning overhead weighs heavily on the layer's execution time.
Both dynamic and static load partitioning partition the load at edge granularity. Compared with coarse-grained load partitioning, edge-granularity partitioning balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS. Therefore, during actual Top-down execution, the method combines dynamic and static load partitioning to handle the different situations brought by the power-law distribution.
The graph data partitioning optimization method oriented to the GPU card group partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w. The method comprises a dynamic load partitioning method and a static load partitioning method; as shown in FIG. 1, they are switched as follows: initially the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning. Here m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v.
As FIG. 1 shows, when the algorithm starts executing, the dynamic partitioning method is used; at this point the current queue holds few vertices and the task volume is small. When the number of vertices in the current queue grows rapidly past β and the average edge-search task per vertex exceeds α, the current queue carries a heavy task load; continuing with dynamic load partitioning would create too many sub-kernels, so execution switches to the static load partitioning mode. While in static mode, once the task volume drops markedly, execution switches back to dynamic load partitioning.
When the current queue is small and the task load light, dynamic load partitioning performs better; when the task load is heavy, static load partitioning performs better. Combining the two partitioning modes effectively handles the load imbalance of the Top-down stage, achieving good performance whether the current layer's task load is large or small.
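A host-side sketch of this switching rule follows, under the assumption that m_f and |V_f| are tallied when each lower-layer queue is produced; the function name, the enum, and the parameter passing are illustrative.

```cpp
#include <cstdint>

enum class PartitionMode { Dynamic, Static };

// Choose the load-partitioning mode for the next layer: switch to static
// only when the average per-vertex edge load m_f/|V_f| reaches alpha AND the
// frontier holds at least beta vertices; if either condition fails, stay
// with (or fall back to) dynamic partitioning. alpha and beta are the
// empirical parameters of the switching mechanism.
PartitionMode choose_mode(int64_t m_f, int64_t v_f,
                          double alpha, int64_t beta) {
    if (v_f >= beta &&
        static_cast<double>(m_f) / static_cast<double>(v_f) >= alpha)
        return PartitionMode::Static;
    return PartitionMode::Dynamic;
}
```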
FIG. 3 is a schematic flow chart of Top-down with dynamic load partitioning; the dynamic load partitioning method proceeds as follows:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array is the total number of vertices of the graph data, and the difference between the next entry and the current entry is the degree of the vertex; the edge array stores the neighbor vertex list of each vertex, and its length is the total number of edges of the graph data;
FIG. 2 shows the CSR structure of the graph data. The row array index corresponds to the vertex number, and each element points into the edge array. Vertex 0 stores the value 0 and the next element is 3, so the degree of vertex 0 is 3: it has three neighbor vertices 1, 2 and 3, i.e., three edges (0, 1), (0, 2), (0, 3).
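A minimal sketch of this CSR layout and the degree computation it supports (struct and field names are illustrative):

```cpp
#include <cstdint>
#include <vector>

// CSR layout of FIG. 2: row holds one offset per vertex into edges, and
// edges concatenates every vertex's neighbor list back to back.
struct CSRGraph {
    std::vector<int64_t> row;    // offset of each vertex's first neighbor
    std::vector<int32_t> edges;  // neighbor vertex lists
};

// The degree of v is the difference of two consecutive row offsets.
inline int64_t degree(const CSRGraph& g, int32_t v) {
    return g.row[v + 1] - g.row[v];
}

// The example of FIG. 2: vertex 0 stores offset 0 and the next entry is 3,
// so degree(g, 0) == 3 and its neighbors are edges[0..2] = {1, 2, 3}.
// CSRGraph g{{0, 3 /* ... */}, {1, 2, 3 /* ... */}};
```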
S12: creating a main kernel function, wherein the dimension of the main kernel function is determined by the number of vertices in the current queue and each thread processes one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions, respectively;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function; that is, when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, while if the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performs the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue.
It should be noted that during dynamic load partitioning the number of sub-kernel functions must not grow too large, since too many sub-kernels introduce heavy scheduling overhead that hurts performance; the number of sub-kernels created is therefore controlled by the condition in step S14, while vertices of extremely high degree are exempt from the sub-kernel count limit. In addition, the number of threads in a sub-kernel should be neither too large nor too small: too few threads make each thread's load too heavy and leave computing resources underutilized, while too many threads introduce excessive atomic overhead and thread-scheduling problems that degrade program performance.
Experiments show good performance with k = 32, i.e., each thread is responsible for processing 32 neighbor vertices. FIG. 4 is a schematic diagram of the dynamic load partitioning process: a thread of the main kernel function handles each vertex, and a sub-kernel function is then created according to the vertex degree to search the vertex's edges. The edge-granularity partitioning strategy guarantees that the load of a high-degree vertex is divided effectively among all threads, addressing the load imbalance caused by the power-law distribution, while the kernel dimensions change dynamically with the actual load, avoiding large numbers of idle threads. Dynamic load partitioning thus handles the load imbalance caused by the power-law distribution of graph data and keeps the loads of all threads close even for vertices of extremely high degree. It cannot, however, be used unconditionally: when the current queue holds many vertices, the dynamic parallelism mechanism creates excessive sub-kernels, introducing scheduling overhead that can cancel out its advantage.
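A hedged CUDA sketch of steps S12-S16 follows; it assumes dynamic parallelism (compile with -rdc=true, compute capability 3.5+), and the threshold values and identifier names are illustrative assumptions, not those of the patented implementation.

```cuda
#define BLOCK_DIM 256   // fixed thread-block dimension block_dim
#define DEG_MIN   128   // illustrative deg_min
#define DEG_MAX   4096  // illustrative deg_max
#define KERNEL_TH 64    // illustrative kernel_th
#define K         32    // empirical parameter k

// Child kernel (S15): the whole grid cooperatively scans one vertex's edges.
__global__ void expand_vertex(const int64_t* row, const int32_t* edges,
                              int32_t v, int32_t* level, int32_t depth,
                              int32_t* next_q, int32_t* next_size) {
    for (int64_t e = row[v] + blockIdx.x * (int64_t)blockDim.x + threadIdx.x;
         e < row[v + 1]; e += (int64_t)gridDim.x * blockDim.x) {
        int32_t w = edges[e];
        if (atomicCAS(&level[w], -1, depth) == -1)   // claim unvisited vertex
            next_q[atomicAdd(next_size, 1)] = w;     // push to lower queue
    }
}

// Main kernel (S12): one thread per boundary vertex in the current queue.
__global__ void top_down_dynamic(const int64_t* row, const int32_t* edges,
                                 const int32_t* cur_q, int32_t cur_size,
                                 int32_t* level, int32_t depth,
                                 int32_t* next_q, int32_t* next_size,
                                 int* kernel_num) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= cur_size) return;
    int32_t v = cur_q[tid];
    int64_t deg = row[v + 1] - row[v];               // S13

    // S14: spawn a sub-kernel for heavy vertices; the kernel_num read is
    // racy, which is tolerable for a heuristic cap on sub-kernel count.
    bool spawn = (deg > DEG_MIN && atomicAdd(kernel_num, 0) < KERNEL_TH)
                 || deg > DEG_MAX;
    if (spawn) {
        atomicAdd(kernel_num, 1);
        // S15: grid_dim = ceil(vertex_degree / (block_dim * k))
        int grid_dim = (int)((deg + (int64_t)BLOCK_DIM * K - 1)
                             / ((int64_t)BLOCK_DIM * K));
        expand_vertex<<<grid_dim, BLOCK_DIM>>>(row, edges, v, level, depth,
                                               next_q, next_size);
    } else {
        // S16: the main-kernel thread searches the neighbors itself.
        for (int64_t e = row[v]; e < row[v + 1]; ++e) {
            int32_t w = edges[e];
            if (atomicCAS(&level[w], -1, depth) == -1)
                next_q[atomicAdd(next_size, 1)] = w;
        }
    }
}
```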
FIG. 5 is a schematic processing flow diagram of static load partitioning; the static load partitioning method proceeds as follows:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
That is, in the static load partitioning method, a thread block divides its edge-processing interval [block_process_start, block_process_end] among its threads; each thread binary-searches its edge_id against the degree prefix-sum array to identify the owning vertex, and from the vertex information locates the edge's storage position in the edge array and obtains the neighbor vertex. FIG. 6 illustrates the load distribution of the boundary vertices: different numbers of threads are assigned according to each vertex's load.
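A companion CUDA sketch of steps S23-S26 under the same assumptions follows; the prefix array comes from the scan of S21-S22, and the types and the owner search are illustrative.

```cuda
// Binary search (S24): largest index i with prefix[i] <= edge_id, i.e. the
// boundary vertex that owns this edge under the exclusive degree prefix sum.
__device__ int find_owner(const int* prefix, int n, int64_t edge_id) {
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = (lo + hi + 1) >> 1;
        if ((int64_t)prefix[mid] <= edge_id) lo = mid;
        else hi = mid - 1;
    }
    return lo;
}

__global__ void top_down_static(const int64_t* row, const int32_t* edges,
                                const int32_t* cur_q, int32_t cur_size,
                                const int* degree_prefix_scan,
                                int64_t total_degree,
                                int32_t* level, int32_t depth,
                                int32_t* next_q, int32_t* next_size) {
    // S23: each block owns an (almost) equal share of the layer's edges.
    int64_t edge_per_block = (total_degree + gridDim.x - 1) / gridDim.x;
    int64_t block_start = blockIdx.x * edge_per_block;
    int64_t block_end = block_start + edge_per_block;
    if (block_end > total_degree) block_end = total_degree;

    // S24/S26: threads stride through the block's edge range.
    for (int64_t edge_id = block_start + threadIdx.x; edge_id < block_end;
         edge_id += blockDim.x) {
        int i = find_owner(degree_prefix_scan, cur_size, edge_id);
        int32_t v = cur_q[i];
        // Offset within v's adjacency list, then position in the edge array.
        int64_t pos = row[v] + (edge_id - degree_prefix_scan[i]);
        int32_t w = edges[pos];                       // S25
        if (atomicCAS(&level[w], -1, depth) == -1)    // expand if unvisited
            next_q[atomicAdd(next_size, 1)] = w;
    }
}
```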
The GPU card group-oriented graph data partitioning optimization method of the invention combines two parallelization strategies, dynamic load partitioning and static load partitioning, both at edge granularity. Compared with coarse-grained load partitioning, load partitioning at edge granularity balances the load among threads more evenly and improves the utilization of GPU computing resources by BFS.
Those of ordinary skill in the art will understand that the figures are merely schematic representations of one embodiment, and the blocks or flows in the figures are not necessarily required to practice the present invention.
Those of ordinary skill in the art will understand that: modules in the devices in the embodiments may be distributed in the devices in the embodiments according to the description of the embodiments, or may be located in one or more devices different from the embodiments with corresponding changes. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (3)

1. A graph data partitioning optimization method oriented to a GPU card group, characterized in that the method partitions graph data G(V, E), where V is the vertex set, E is the edge set, and (v, w) (v ∈ V, w ∈ V) denotes an edge from a source vertex v to a destination vertex w; the method comprises a dynamic load partitioning method and a static load partitioning method, switched as follows: initially, the dynamic load partitioning method is used; when m_f / |V_f| ≥ α and |V_f| ≥ β hold simultaneously, the method switches to static load partitioning; when m_f / |V_f| < α or |V_f| < β, it switches back to dynamic load partitioning, where m_f is the total number of edges to be traversed from the vertices in the lower-layer queue, α and β are empirical parameters, m_f = |{(v, w) ∈ E | v ∈ NF, w ∈ A(v)}|, V_f = {v | v ∈ V, v ∈ NF}, V_f is the vertex set of the lower-layer queue, NF is the lower-layer queue, and A(v) is the set of neighbor vertices of vertex v,
the dynamic load partitioning method comprising the following steps:
S11: storing the graph data G(V, E) in a CSR structure comprising a row array and an edge array, wherein the row array stores offset indices into the edge array, each entry pointing to the first neighbor of the corresponding vertex, the length of the row array being the total number of vertices of the graph data and the difference between the next entry and the current entry being the degree of the vertex, and the edge array stores the neighbor vertex list of each vertex, its length being the total number of edges of the graph data;
S12: creating a main kernel function, the dimension of which is determined by the number of vertices in the current queue, each thread processing one boundary vertex;
S13: computing the vertex degree deg and the number kernel_num of sub-kernel functions created so far, and setting three threshold parameters deg_min, deg_max and kernel_th, which are respectively the minimum degree, the maximum degree and the threshold on the number of sub-kernel functions;
S14: judging whether (deg > deg_min && kernel_num < kernel_th) || deg > deg_max holds, and if so, creating a sub-kernel function, wherein when the vertex degree is greater than deg_min but not greater than deg_max, a sub-kernel function is created only if kernel_num is less than kernel_th, and when the vertex degree is greater than deg_max, a sub-kernel function is created regardless of the value of kernel_th;
S15: creating a sub-kernel function with grid dimension grid_dim, wherein all threads in the grid are responsible for searching the neighbors of the vertex and the execution load is divided evenly among the threads, the grid dimension being calculated as
grid_dim = ⌈vertex_degree / (block_dim × k)⌉
where vertex_degree is the vertex degree, block_dim is the thread block dimension (a fixed value), and k is an empirical parameter;
S16: if the condition for creating a sub-kernel function is not met, the thread of the main kernel function itself performing the search over the neighbors of the vertex;
S17: accessing the states of the neighbor vertices, selecting those not yet visited, expanding them, and outputting them to the lower-layer queue;
the static load partitioning method comprising the following steps:
S21: each thread processes one vertex in the current queue, obtains its degree and stores it into the array degree_prefix_scan, the number of valid elements in degree_prefix_scan being equal to the number of boundary vertices;
S22: applying a parallel prefix-sum algorithm to the array degree_prefix_scan to obtain each thread block's local prefix sum, communicating with the other thread blocks to obtain the global prefix sum, storing it back into degree_prefix_scan, and storing the sum of the degrees of all boundary vertices into the parameter total_degree, i.e., the total search load of this layer;
S23: distributing the edges of all boundary vertices to be processed in this layer evenly among the thread blocks, the number of edges each thread block must search being edge_per_block = total_degree / block_num, where total_degree is the total number of edges to be searched in the current queue and block_num is the total number of thread blocks in the kernel function; after the load is evenly divided among the thread blocks, the edge range handled by the thread block numbered block_id is denoted [block_process_start, block_process_end];
S24: the thread with thread ID thread_id in a thread block starts processing from edge_id, initially edge_id = block_process_start + thread_id; the vertex to which the edge belongs is located by binary search, and the storage position of the edge in the edge array is then located through the vertex's row-array value and the degree prefix sum;
S25: looking up the edge in the edge array to obtain the number of its neighbor vertex, judging from the vertex's visit state whether it needs to be expanded, and if so outputting it to the lower-layer queue;
S26: the thread updates the edge_id being processed using the total number of threads thread_num in the thread block as the stride; if edge_id < block_process_end, it jumps back to step S24 and continues, otherwise all edges the thread is responsible for have been processed and the thread's search ends.
2. The method of claim 1, wherein k has a value of 32.
3. The method of claim 1, wherein α is a coefficient of judgment when switching from Top-down to Bottom-up, and β is a coefficient of judgment when switching from Bottom-up to Top-down.
CN202110750006.0A 2021-07-02 2021-07-02 GPU card group-oriented graph data division optimization method Active CN113419862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110750006.0A CN113419862B (en) 2021-07-02 2021-07-02 GPU card group-oriented graph data division optimization method

Publications (2)

Publication Number Publication Date
CN113419862A 2021-09-21
CN113419862B (en) 2023-09-19

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN110245135A 2019-05-05 2019-09-17 Huazhong University of Science and Technology Large-scale streaming graph data update method based on NUMA architecture
CN110619595A 2019-09-17 2019-12-27 Huazhong University of Science and Technology Graph computation optimization method based on interconnection of multiple FPGA accelerators
CN112445940A 2020-10-16 2021-03-05 Suzhou Inspur Intelligent Technology Co., Ltd. Graph partitioning method, graph partitioning device and computer-readable storage medium

Cited By (1)

Publication number Priority date Publication date Assignee Title
WO2023184836A1 2022-03-31 2023-10-05 Research Institute of Tsinghua University in Shenzhen Subgraph segmented optimization method based on inter-core storage access, and application

Also Published As

Publication number Publication date
CN113419862B (en) 2023-09-19

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: Room 711c, 7/F, Block A, Building 1, Yard 19, Ronghua Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing 102600

Patentee after: Beijing Zhongke Flux Technology Co., Ltd.

Address before: Room 711c, 7/F, Block A, Building 1, Yard 19, Ronghua Middle Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing 102600

Patentee before: Beijing Ruixin High Throughput Technology Co., Ltd.