CN111754383B - Detection method of strong connectivity graph based on warp reuse and colored partition on GPU

Publication number
CN111754383B
CN111754383B CN202010403115.0A CN202010403115A CN111754383B CN 111754383 B CN111754383 B CN 111754383B CN 202010403115 A CN202010403115 A CN 202010403115A CN 111754383 B CN111754383 B CN 111754383B
Authority
CN
China
Prior art keywords
vertex
graph
strong
degree
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010403115.0A
Other languages
Chinese (zh)
Other versions
CN111754383A (en)
Inventor
侯骏腾
吴广君
王树鹏
王振宇
贾思宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS filed Critical Institute of Information Engineering of CAS
Priority to CN202010403115.0A priority Critical patent/CN111754383B/en
Publication of CN111754383A publication Critical patent/CN111754383A/en
Application granted granted Critical
Publication of CN111754383B publication Critical patent/CN111754383B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Generation (AREA)

Abstract

The invention provides a GPU-accelerated method for detecting strong connectivity graphs, that is, strongly connected components (SCCs), on a heterogeneous system; the method optimizes thread scheduling and partitioning.

Description

Detection method of strong connectivity graph based on warp reuse and colored partition on GPU
Technical Field
The invention relates to a method for detecting strong connectivity graphs, that is, strongly connected components (SCCs), on a heterogeneous system, and in particular to an SCC detection algorithm based on warp reuse and coloring partition on a GPU.
Background
Graph data is a basic data structure in data processing. Because it expresses the relationships among data well, it is widely used in many fields such as biology, chemistry, artificial intelligence and social networks. A strongly connected component (SCC) is a basic graph structure: a maximal subset of the vertices of a directed graph in which every pair of vertices is mutually reachable. SCC detection (called strong connectivity graph detection in this document) has been studied for a long time, and excellent detection methods exist, such as the Tarjan algorithm, the Kosaraju algorithm and the Dijkstra algorithm. However, most of these are serial algorithms based on depth-first search (DFS); their running time grows sharply as the graph grows, and they are difficult to parallelize. With the wide application of general-purpose parallel computing devices such as GPUs in high-performance computing, researchers have proposed GPU-based parallel SCC detection algorithms. Most of them take the intersection of the regions reached by forward and backward traversal from a central point, and increase the parallelism of each detection iteration by partitioning. During traversal, however, every vertex whose detection is unfinished is assigned a thread or a warp (a group of 32 threads). Many of these vertices need no processing, so their threads are released immediately, while the numbers of adjacent vertices of the remaining vertices differ greatly, so the load across threads or thread groups is severely unbalanced. In addition, the existing partitioning method only separates vertices that have no connection with each other into different partitions, so it generates few partitions.
After each detection iteration, the states of all vertices whose detection is unfinished are reset to zero, so the results computed in each iteration are not fully used. Optimizing thread task allocation and the partitioning method in current GPU-based parallel SCC detection algorithms is therefore the key to improving their efficiency.
Disclosure of Invention
The invention provides a GPU-accelerated strongly connected component (SCC) detection method that optimizes thread scheduling and partitioning. By dividing each warp into several virtual warps, assigning several vertex tasks to each virtual warp, and replacing the traditional WCC partition with a coloring partition, it balances thread allocation and increases the number of SCCs produced by each iteration, thereby improving the running efficiency of the algorithm.
The technical scheme of the invention is as follows:
a method for detecting a strong connection graph based on warp reuse and colored partition on a GPU comprises the following steps:
loading graph data and unifying storage formats;
performing a first-type pruning operation on the graph data to detect 1-SCCs, that is, strongly connected components (SCCs) consisting of only one vertex;
performing first-type center point selection, using the vertex with the largest product of in-degree and out-degree as the center point;
using a warp reuse method to traverse forward and backward in parallel from the center point to obtain one SCC and three partitions, wherein the SCC consists of the center point and the vertices reached by both the forward and the backward traversal, and the three partitions consist respectively of the vertices reached only by the forward traversal, the vertices reached only by the backward traversal, and the vertices reached by neither traversal;
judging whether a second-type pruning operation is needed and, if so, performing it to detect 1-SCCs consisting of one vertex and 2-SCCs consisting of two vertices;
further partitioning the three partitions by using a coloring partition method to obtain more and smaller partitions;
performing second-type center point selection in the resulting small partitions, selecting in each partition the vertex whose initial color value equals the partition color value as the center point, and traversing forward from the center point, wherein the traversed vertices and the center point form an SCC and the untraversed vertices form new partitions;
performing third-type center point selection, directly selecting a random vertex in each partition as the center point, and traversing forward and backward in parallel from the center point;
performing the first-type pruning operation again, and updating the SCCs and the partitions;
judging whether a new SCC has been generated; if not, ending; if so, judging whether the number of iterations exceeds a threshold k_t; if the threshold is exceeded, performing the coloring partition operation using the coloring partition method, and otherwise performing the third-type center point selection operation.
Further, the step of loading graph data and unifying storage formats includes:
storing the data in the compressed sparse row (CSR) format, allocating host memory for an array C and an array R according to the size of the graph data, and storing the graph data into the array C and the array R;
swapping the start point and the end point of every edge, building the reverse CSR graph data in the same way, and storing it into an array C' and an array R';
and allocating on the GPU device a memory space of the same size as that on the host, copying all the arrays to the device, and allocating in device memory an array M that represents the vertex states.
Further, the step of the first type of pruning operation comprises: detecting the in-degree and the out-degree of each vertex on the GPU in parallel; if the in-degree or the out-degree of a vertex is zero, the vertex forms a 1-SCC, its state is marked as a central point in the array M, and its detection is complete.
Further, the step of selecting the first type of center point includes: allocating on the GPU a variable that stores the central point and initializing it to 0; computing in parallel the product of in-degree and out-degree for the central point and for each vertex; if the product for the current vertex is larger than the product for the central point, replacing the central point with that vertex; and repeating the process until the value of the central point no longer changes.
Further, the step of traversing forward and backward in parallel from the center point using the warp reuse method comprises:
dividing each warp in the GPU into a plurality of equal-size virtual warps according to a preset parameter k_w, wherein each virtual warp comprises k_w threads;
assigning all vertex tasks to the virtual warps in sequence, wherein each virtual warp is assigned k_v vertex tasks;
starting from the center point, performing a forward breadth-first traversal on the arrays R and C and, in parallel, a backward breadth-first traversal on the arrays R' and C';
in each virtual warp, for each assigned vertex that needs processing, all threads checking in parallel whether the adjacent vertices of that vertex have been visited, and marking the result in the array M.
Furthermore, each subarea independently contains all the strong connection graphs therein, and the detection of the strong connection graphs can be completed in parallel; vertices in the strongly connected graph and the concurrently detectable partition are labeled accordingly in array M.
Further, the step of judging whether the second type of pruning operation is needed comprises: deciding whether to perform the pruning operation according to a preset parameter k_t, where one coloring partition is performed for every k_t rounds of strongly connected component (SCC) detection; if k_t is 0, no pruning operation is needed; if k_t is greater than 0, the pruning operation is required.
Further, the second type of pruning operation comprises the steps of: first detecting 1-SCCs repeatedly until no new 1-SCC is generated, then detecting 2-SCCs once, and finally detecting 1-SCCs repeatedly again until no new 1-SCC is generated, wherein 1-SCCs are detected by the same steps as in the first type of pruning operation, and the step of detecting 2-SCCs comprises: examining the adjacent vertices of each undetected vertex in parallel; if an adjacent vertex has a directed edge back to the current vertex, and, apart from this mutual connection, the out-degree or the in-degree of both vertices is zero, the two vertices form an SCC and are marked in the array M.
Further, the method for colored partition includes the following steps: and taking the ID of each vertex as the color value of the vertex, detecting the color of the adjacent vertex of all the vertices which do not form the strong communication graph, if the color of the adjacent vertex is less than that of the current vertex, modifying the color of the current vertex into the color of the adjacent vertex, repeating the above processes until the color values of all the vertices do not change, dividing all the vertices which do not form the strong communication graph into a plurality of subareas according to different color values, wherein each subarea independently contains all the strong communication graphs.
The method of the invention has the following advantages and effects:
1. Thread waste is reduced and the workload of different threads is balanced. The warp reuse method divides each warp into equal-size virtual warps and assigns the same number of vertex tasks to each virtual warp. This effectively reduces the waste that arises when a thread or thread group is assigned a vertex that needs no processing. At the same time, because each virtual warp processes several vertices, the workload across virtual warps is effectively balanced.
2. The number of partitions and the number of strongly connected components (SCCs) generated in each iteration are increased. Partitioning with the coloring algorithm replaces the WCC partitioning method of the traditional algorithm and increases the number of partitions generated by each iteration; SCC detection can be executed in every partition in parallel, which effectively increases the parallelism of the algorithm.
3. The coloring partition algorithm saves the time of one SCC detection iteration. In the coloring partition algorithm, the vertex whose ID equals the color value of a partition is used directly as the central point of that partition, and one SCC can be detected in each partition by a forward traversal from its central point.
4. The depths of partition detection and SCC detection are optimized. The frequency with which the partition method and the FB-Trim algorithm are called is adjusted through preset parameters, so that the algorithm balances increasing parallelism against detecting SCCs and achieves its best effect.
Drawings
Fig. 1 is a flowchart of a method for detecting a strong connectivity graph based on warp reuse and colored partition on a GPU according to the present invention.
FIG. 2 is a schematic diagram of a CSR format map data store used in the present invention.
FIG. 3 is a diagram comparing thread allocation of warp reuse method with conventional method in the present invention.
FIG. 4 is a schematic diagram illustrating the feasibility of the coloring partition method of the present invention.
FIG. 5 is a bar chart of the runtime speedup of different algorithms relative to the Tarjan algorithm.
Detailed Description
The present invention will be described in further detail below with reference to specific examples and the accompanying drawings.
The method provided by the invention for detecting strongly connected components (SCCs) based on warp reuse and coloring partition on a GPU covers a warp reuse method for allocating the GPU's parallel threads and a coloring partition method, and it deeply fuses the partition method with the FB-Trim algorithm.
The invention provides a new thread allocation method, namely the warp reuse method. Most graph data contains one very large strongly connected component (SCC), whose vertices can account for more than 80% of the whole graph, and many traditional parallel methods improve efficiency by detecting this giant SCC separately. In detecting the giant SCC, however, the traditional method assigns one thread or one warp (a group of 32 threads) to every vertex under detection, although many vertices need no processing in the current iteration, which causes unnecessary thread allocation. Meanwhile, the numbers of adjacent vertices of different vertices in real graph data differ greatly, so the workloads of different threads differ greatly and the algorithm's workload is unbalanced. The warp reuse method provided by the invention divides the threads of each warp into several virtual warps with the same number of threads, and then assigns an equal number of vertex tasks to each virtual warp in sequence. During vertex processing, each virtual warp acts as one parallel unit: all threads in the virtual warp process an assigned vertex task in parallel and, after one vertex task is finished, move on to the next until all vertex tasks assigned to the virtual warp are finished, at which point the threads are released.
The invention provides a more efficient partition method, namely the coloring partition method. After the giant strongly connected component (SCC) has been detected, the remaining vertices form many trivial SCCs that are only sparsely connected with each other. If SCC detection proceeds only in parallel within the partitions generated by the previous iteration, the number of SCCs detected in each iteration is limited. The traditional method performs weakly connected component (WCC) detection once after detecting the giant SCC, and divides vertices that have no connection between them into independent partitions. In each detection iteration, detecting SCCs in parallel within every WCC partition effectively improves the parallelism of the algorithm. However, separating only mutually unconnected vertices into independent partitions is a coarse partitioning that can be refined further, and the WCC partitioning process itself takes a certain amount of time. The invention finds that every partition formed by the coloring process independently contains whole SCCs; that is, no SCC has some of its vertices in one coloring partition and the rest in another. The invention therefore provides a coloring partition method to replace the WCC partition method of the traditional method.
After the giant SCC has been detected, all adjacent vertices of every vertex that does not yet belong to an SCC are examined, and if the color value of an adjacent vertex is smaller than that of the current vertex, the color value of the current vertex is set to that of the adjacent vertex. This is iterated until no color value changes, which yields several partitions of different colors, and SCC detection can then proceed in all partitions in parallel. Next, in each partition, the vertex whose initial color equals the partition color is selected as the central point, and a forward traversal is started from it; the traversed vertices form an SCC. Compared with the traditional partition method, the coloring partition method forms more partitions, increases the parallelism of each detection iteration, and produces as many SCCs as there are partitions, thereby improving the overall efficiency of the algorithm.
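The coloring step and the follow-up forward traversal can be sketched sequentially as below. This is a minimal illustration, not the patented GPU kernel: it assumes that "adjacent vertices" means the out-neighbors stored in the CSR arrays R and C, that R has one extra trailing entry, and that a boolean list `active` (a hypothetical name) marks vertices not yet assigned to an SCC. The function names are also illustrative.

```python
from collections import deque

def color_partition(R, C, active):
    """Propagate the smallest color along out-edges until nothing changes."""
    color = list(range(len(active)))      # initial color = vertex ID
    changed = True
    while changed:
        changed = False
        for v in range(len(active)):
            if not active[v]:
                continue
            for i in range(R[v], R[v + 1]):
                u = C[i]
                if active[u] and color[u] < color[v]:
                    color[v] = color[u]
                    changed = True
    return color

def scc_of_roots(R, C, active, color):
    """Forward traversal from each root (color == own ID), kept inside its partition."""
    sccs = {}
    for root in range(len(active)):
        if not active[root] or color[root] != root:
            continue
        members, queue = {root}, deque([root])
        while queue:
            v = queue.popleft()
            for i in range(R[v], R[v + 1]):
                u = C[i]
                if active[u] and color[u] == root and u not in members:
                    members.add(u)
                    queue.append(u)
        sccs[root] = members
    return sccs

# Edges: 0->1, 1->2, 2->0, 2->3, 4->0, 5->3
R = [0, 1, 2, 4, 4, 5, 6]
C = [1, 2, 0, 3, 0, 3]
active = [True] * 6
color = color_partition(R, C, active)
sccs = scc_of_roots(R, C, active, color)
```

In this small graph the colors split the vertices into two partitions, and each partition yields one SCC around its root; vertices 4 and 5 are left over and would form new partitions in the next iteration.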
The invention provides a method that deeply fuses partitioning with strongly connected component (SCC) detection. The method adopts the FB-Trim algorithm: an SCC is detected from the results of the Forward and Backward (FB) traversals of a central point, which also produces different partitions, and when SCCs are detected in each partition in parallel, trivial SCCs are detected quickly through a pruning operation (Trim). Because the partitioning process takes a certain amount of time, the traditional method performs it only once. The partition method of the invention, however, produces as many SCCs as partitions, so it can itself be used as an SCC detection method. The invention adjusts, through parameter settings, how often the partition method and the FB-Trim algorithm are called, thereby optimizing the running efficiency of the algorithm.
The flow of the whole method is shown in the attached figure 1, which describes a strong connection graph detection method for balancing workload and increasing algorithm parallelism by adjusting thread allocation in a GPU and optimizing a partition method.
As shown in FIG. 1, the process of the present invention can be roughly divided into two stages, the first stage of detecting the large-scale huge strong connection graph and the second stage of detecting the remaining numerous small-scale trivial strong connection graphs. The warp reuse method in the first phase plays the roles of saving thread allocation and balancing workload. In the second stage, the colored subarea is combined with the traditional strong connection graph detection method, so that the parallelism of the algorithm is effectively increased. Meanwhile, the method comprises three central point selection methods and two pruning methods, and corresponding operations are carried out according to algorithm requirements of different stages. Specifically, the operation can be divided into 15 small parts, and the detailed description will be given for each part.
1. Loading graph data and unifying storage formats
The graph data to be processed is loaded into host memory and stored in the unified compressed sparse row (CSR) format. First, host memory is allocated for the CSR arrays C and R according to the size of the graph data. The graph data is then saved into array C and array R as shown in FIG. 2. FIG. 2 (a) is a directed graph in which numbered circles represent vertices and arrowed lines represent directed edges; the arrays C and R in FIG. 2 (b) represent the directed graph of FIG. 2 (a) in CSR format, where the adjacent vertices of each vertex are stored consecutively in array C and the start position of each vertex's adjacent vertices within array C is stored in array R. An array M with as many entries as there are vertices is allocated to mark the vertex states during strongly connected component (SCC) detection. Then the start point and the end point of every edge are swapped, and the reverse CSR graph data is built in the same way and stored in arrays C' and R'. Finally, memory of the same size is allocated on the GPU device, and the arrays are copied to the device.
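The CSR layout described above can be illustrated with a short host-side sketch. The helper name `build_csr` and the small edge list are hypothetical, and the sketch assumes R carries one extra trailing entry so that R[v] and R[v + 1] delimit the neighbors of v; the copy to GPU memory is omitted.

```python
def build_csr(num_vertices, edges):
    """Return (R, C): C[R[v]:R[v + 1]] holds the adjacent vertices of v."""
    R = [0] * (num_vertices + 1)
    for src, _ in edges:
        R[src + 1] += 1                   # count out-edges per vertex
    for v in range(num_vertices):
        R[v + 1] += R[v]                  # prefix sum gives start positions
    C = [0] * len(edges)
    cursor = R[:-1].copy()                # next free slot for each vertex
    for src, dst in edges:
        C[cursor[src]] = dst
        cursor[src] += 1
    return R, C

edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
R, C = build_csr(4, edges)                                # forward graph
R_rev, C_rev = build_csr(4, [(d, s) for s, d in edges])   # arrays R' and C'
M = [0] * 4                                               # one state slot per vertex
```

Building the reverse arrays by swapping edge endpoints means the backward traversal can reuse exactly the same traversal code as the forward one.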
The first stage is as follows:
2. pruning operation of the first kind
The first type of pruning operation is mainly used to quickly detect strongly connected components consisting of only one vertex (1-SCCs). The specific steps are as follows: the in-degree and the out-degree of each vertex are checked on the GPU in parallel; if the in-degree or the out-degree of a vertex is zero, the vertex is a 1-SCC, its state is marked as a central point in the array M, and its detection is finished. Although this method cannot guarantee that all 1-SCCs are detected, it detects most of them and saves time.
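One pass of this pruning can be sketched as follows. The sketch is sequential and makes assumptions the text does not state: M[v] == 0 means "not yet detected" and 1 means "assigned to an SCC", degrees are counted only over undetected vertices, and self-loops are ignored. The function name is illustrative.

```python
def trim_1scc(R, C, R_rev, C_rev, M):
    """Mark every undetected vertex whose live in-degree or out-degree is zero."""
    def live_degree(row, col, v):
        return sum(1 for i in range(row[v], row[v + 1])
                   if M[col[i]] == 0 and col[i] != v)

    # Decide from a snapshot of M first, as parallel threads would, then mark.
    trimmed = [v for v in range(len(M))
               if M[v] == 0 and (live_degree(R, C, v) == 0
                                 or live_degree(R_rev, C_rev, v) == 0)]
    for v in trimmed:
        M[v] = 1
    return trimmed

# Edges: 0->1, 1->2, 2->0, 2->3 (vertex 3 has no out-edge)
R, C = [0, 1, 2, 4, 4], [1, 2, 0, 3]
R_rev, C_rev = [0, 1, 2, 3, 4], [2, 0, 1, 2]
M = [0, 0, 0, 0]
first_pass = trim_1scc(R, C, R_rev, C_rev, M)
second_pass = trim_1scc(R, C, R_rev, C_rev, M)
```

Removing a vertex can expose new zero-degree vertices, which is why a single pass may miss some 1-SCCs and why the pass is repeated elsewhere in the method.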
3. First type center point selection
The first stage of the method mainly detects the giant strongly connected component (SCC). To make it likely that the selected central point lies on the giant SCC, the first type of central point selection uses the vertex with the largest product of in-degree and out-degree as the central point. The specific method is as follows: a variable that stores the central point is allocated on the GPU and initialized to 0; the product of in-degree and out-degree is computed in parallel for the central point and for each vertex; if the product for the current vertex is larger than the product for the central point, the central point is replaced by that vertex; and the process is repeated until the value of the central point no longer changes.
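The selection rule amounts to an arg-max over the degree product. The sketch below replaces the repeated parallel compare-and-replace with a plain sequential loop, which is a substitution made for illustration only; it also assumes CSR arrays with a trailing entry and M[v] == 0 for undetected vertices.

```python
def select_pivot(R, R_rev, M):
    """Return the undetected vertex with the largest in-degree * out-degree."""
    best, best_score = -1, -1
    for v in range(len(M)):
        if M[v] != 0:
            continue
        out_degree = R[v + 1] - R[v]
        in_degree = R_rev[v + 1] - R_rev[v]
        score = in_degree * out_degree
        if score > best_score:            # ties keep the smaller vertex ID
            best, best_score = v, score
    return best

# Edges: 0->1, 1->2, 2->0, 2->3
R = [0, 1, 2, 4, 4]
R_rev = [0, 1, 2, 3, 4]
pivot = select_pivot(R, R_rev, [0, 0, 0, 0])
```

Vertex 2 has out-degree 2 and in-degree 1, the largest product in this graph, so it is chosen.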
4. Parallel forward and backward traversal starting from a central point using warp reuse method
Because the vertices of the giant strongly connected component (SCC) have many adjacent vertices and the number of adjacent vertices differs greatly from vertex to vertex, the invention uses warp reuse to carry out the forward and backward traversals in the detection of the giant SCC, so as to avoid unnecessary thread allocation and balance the workload of different threads. The thread allocation of the warp reuse method and of the traditional methods is shown in FIG. 3. For convenience, one warp is assumed to contain 16 threads (in reality 32). In the figure, a curve represents a thread, a square represents a thread group, a circle represents a vertex task, a shaded circle represents a vertex task that needs no processing in the current iteration, and the number in an unshaded circle is the number of adjacent vertices of that vertex. FIG. 3 (a) shows the traditional thread allocation in which each thread is assigned one vertex. FIG. 3 (b) shows the thread allocation in which each thread group (virtual warp) is assigned one vertex. FIG. 3 (c) shows the thread allocation of the warp reuse method of the invention.
The warp reuse method of the invention comprises the following specific steps: according to a preset parameter k_w, each warp (comprising 32 threads) in the GPU is divided into several equal-size virtual warps, each comprising k_w threads. All vertex tasks are assigned to the virtual warps in sequence, each virtual warp receiving k_v vertex tasks. Starting from the center point, a forward breadth-first traversal is performed on the arrays R and C and, in parallel, a backward breadth-first traversal on the arrays R' and C'. Within each virtual warp, for an assigned vertex that needs processing, all threads check in parallel whether its adjacent vertices have been visited and mark the result in the array M. The other assigned vertices that need processing are then handled in the same manner.
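The index arithmetic behind warp reuse can be simulated on the CPU. The sketch below is an assumption-laden illustration, not the patented kernel: it assumes k_w divides 32, that each virtual warp receives a contiguous block of k_v vertex IDs, and that the lanes of a virtual warp stride over a vertex's neighbor list. The names `virtual_warp_work`, `expand_frontier`, `visited` and `frontier` are hypothetical.

```python
WARP_SIZE = 32

def virtual_warp_work(thread_id, k_w, k_v, num_vertices):
    """Return (virtual warp ID, lane, assigned vertex IDs) for one thread."""
    assert WARP_SIZE % k_w == 0
    vwarp, lane = divmod(thread_id, k_w)
    first = vwarp * k_v
    return vwarp, lane, range(first, min(first + k_v, num_vertices))

def expand_frontier(R, C, visited, frontier, k_w, k_v):
    """One BFS level, run thread by thread in GPU index order."""
    n = len(visited)
    num_vwarps = (n + k_v - 1) // k_v
    next_frontier = [False] * n
    for thread_id in range(num_vwarps * k_w):
        _, lane, vertices = virtual_warp_work(thread_id, k_w, k_v, n)
        for v in vertices:
            if not frontier[v]:
                continue                  # nothing to do; move to the next task
            for i in range(R[v] + lane, R[v + 1], k_w):   # lanes share the list
                if not visited[C[i]]:
                    next_frontier[C[i]] = True
    for u in range(n):
        if next_frontier[u]:
            visited[u] = True
    return next_frontier

# Edges: 0->1, 1->2, 2->0, 2->3; start the forward traversal at vertex 2.
R, C = [0, 1, 2, 4, 4], [1, 2, 0, 3]
visited = [False, False, True, False]
frontier = [False, False, True, False]
level = expand_frontier(R, C, visited, frontier, k_w=4, k_v=2)
```

A virtual warp that meets an idle vertex simply advances to its next assigned vertex instead of being released, which is the reuse the text describes.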
5. Updating states to obtain strongly connected graphs and partitions
The detected strongly connected component (SCC) and the partitions that can be processed in parallel are determined from the traversal status of each vertex obtained in part 4. The central vertex together with all vertices reached by both its forward and its backward traversal forms a complete SCC. The remaining vertices fall into three partitions: those reached only by the forward traversal, those reached only by the backward traversal, and those reached by neither. Each of the three partitions independently contains all the SCCs within it, so SCC detection can proceed in them in parallel. These results are marked in array M.
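The state update can be pictured with plain set operations. This sequential sketch uses an ordinary breadth-first search in place of the warp-reuse traversal and returns sets instead of writing marks into array M; the function names and the `active` list are assumptions made for illustration.

```python
from collections import deque

def reach(R, C, start, active):
    """Vertices reachable from start through active vertices (BFS over CSR)."""
    seen, queue = {start}, deque([start])
    while queue:
        v = queue.popleft()
        for i in range(R[v], R[v + 1]):
            u = C[i]
            if active[u] and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen

def fb_step(R, C, R_rev, C_rev, pivot, active):
    """Return the pivot's SCC and the three partitions left around it."""
    fwd = reach(R, C, pivot, active)
    bwd = reach(R_rev, C_rev, pivot, active)
    everyone = {v for v in range(len(active)) if active[v]}
    return fwd & bwd, fwd - bwd, bwd - fwd, everyone - fwd - bwd

# Edges: 0->1, 1->2, 2->0, 2->3, 4->0, 5->3
R, C = [0, 1, 2, 4, 4, 5, 6], [1, 2, 0, 3, 0, 3]
R_rev, C_rev = [0, 2, 3, 4, 6, 6, 6], [2, 4, 0, 1, 2, 5]
scc, fwd_only, bwd_only, rest = fb_step(R, C, R_rev, C_rev, 0, [True] * 6)
```

Every other SCC lies entirely inside one of the three returned partitions, which is what allows them to be processed independently.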
6. Judging the condition to finish the second kind of pruning operation
Whether to perform a pruning operation is decided according to a preset parameter k_t. In the second phase of the algorithm, one coloring partition is performed for every k_t rounds of strongly connected component (SCC) detection. If k_t = 0, the algorithm is similar to a pure coloring algorithm and no pruning is required; if k_t > 0, the pruning operation is required. Unlike the first type of pruning operation, the second type detects not only SCCs consisting of one vertex (1-SCCs) but also SCCs consisting of two vertices (2-SCCs). Although graph data contains far fewer 2-SCCs than 1-SCCs, detecting these 2-SCCs separately still saves considerable running time. The second type of pruning operation proceeds as follows: first, 1-SCC detection is repeated until no new 1-SCC is generated; then 2-SCC detection is performed once; finally, 1-SCC detection is repeated again until no new 1-SCC is generated. 1-SCCs are detected as in part 2, and 2-SCCs are detected as follows: the adjacent vertices of each undetected vertex are examined in parallel; if an adjacent vertex has a directed edge back to the current vertex and, apart from this mutual connection, the out-degree or the in-degree of both vertices is zero, the two vertices form an SCC and are marked in the array M.
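The 2-SCC test can be sketched sequentially as below. The sketch assumes M[v] == 0 marks an undetected vertex, counts only edges to undetected vertices, and ignores self-loops; these conventions and the function name are illustrative rather than taken from the source.

```python
def trim_2scc(R, C, R_rev, C_rev, M):
    """Find mutually linked pairs with no other in-edges or no other out-edges."""
    def live(row, col, v):
        return {col[i] for i in range(row[v], row[v + 1])
                if M[col[i]] == 0 and col[i] != v}

    pairs = []
    for v in range(len(M)):
        if M[v] != 0:
            continue
        out_v, in_v = live(R, C, v), live(R_rev, C_rev, v)
        for u in out_v & in_v:            # u and v point at each other
            if u <= v:
                continue                  # report each pair only once
            out_u, in_u = live(R, C, u), live(R_rev, C_rev, u)
            no_other_in = in_v == {u} and in_u == {v}
            no_other_out = out_v == {u} and out_u == {v}
            if no_other_in or no_other_out:
                pairs.append((v, u))
    for v, u in pairs:
        M[v] = M[u] = 1
    return pairs

# Edges: 0->1, 1->0, 1->2, 2->3, 3->2
R, C = [0, 1, 3, 4, 5], [1, 0, 2, 3, 2]
R_rev, C_rev = [0, 1, 2, 4, 5], [1, 0, 1, 3, 2]
M = [0, 0, 0, 0]
pairs = trim_2scc(R, C, R_rev, C_rev, M)
```

Here the pair (0, 1) qualifies because nothing else points into it, and the pair (2, 3) qualifies because nothing leaves it.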
Second stage:
7. Partitioning using the coloring partition method
The invention provides a new partitioning method that produces more partitions than the traditional method and allows the strongly connected graphs in each partition to be detected quickly. Part 5 has already produced three partitions; the coloring partition splits each of them into a larger number of smaller partitions. The specific process is as follows. The ID of each vertex is taken as the initial color value of that vertex. For every vertex that does not yet belong to a strongly connected graph, the colors of its adjacent vertices are examined; if the color of an adjacent vertex is smaller than that of the current vertex, the color of the current vertex is changed to the color of the adjacent vertex. This process is repeated until no color value changes. The vertices that do not yet belong to a strongly connected graph are then divided into partitions according to their color values. It can be shown that each partition independently contains all of the strongly connected graphs within it, so strongly connected graphs can be detected in all partitions in parallel. A brief proof follows:
As shown in FIG. 4, the outer box represents graph data G, which consists of a vertex set V and an edge set E. Let A and B be two partitions produced by the coloring partition process, with center points P_a and P_b respectively. Suppose that partition A contains a vertex V_a, partition B contains a vertex V_b, and V_a and V_b belong to the same SCC (without loss of generality, assume P_a < P_b).
Because V_a and V_b belong to the same SCC, V_a -> V_b and V_b -> V_a ("->" indicates that a directed connection exists).
Because V_a is in partition A, P_a -> V_a.
So P_a -> V_b.
Since P_a < P_b, the smaller color P_a would propagate to V_b, so V_b belongs to A, which contradicts the assumption.
Therefore no strongly connected graph exists that contains two vertices V_a and V_b with V_a in partition A and V_b in partition B. Each partition thus independently contains all of the strongly connected graphs within it.
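The coloring process can be sketched serially in Python as follows. The text does not say which neighbors count as adjacent; this sketch assumes that each vertex reads the colors of its predecessors, so that colors flow along edge direction as in the proof above (P_a -> V_a). The function name and the dictionary-based return value are hypothetical.

```python
def coloring_partition(n, edges, active):
    """Propagate the smallest vertex ID along edges until nothing changes."""
    pred = {v: [] for v in range(n)}
    for u, v in edges:
        pred[v].append(u)
    color = {v: v for v in active}      # initial color = vertex ID
    changed = True
    while changed:
        changed = False
        for v in sorted(active):
            for u in pred[v]:
                # Only undetected vertices carry a color.
                if u in color and color[u] < color[v]:
                    color[v] = color[u]
                    changed = True
    partitions = {}
    for v in sorted(active):
        partitions.setdefault(color[v], []).append(v)
    return partitions

# Three 2-cycles: {0, 1}, {2, 3} and {4, 5}, with one extra edge 3 -> 4.
edges = [(0, 1), (1, 0), (2, 3), (3, 2), (3, 4), (4, 5), (5, 4)]
print(coloring_partition(6, edges, set(range(6))))
# {0: [0, 1], 2: [2, 3, 4, 5]}
```

The partition with color 2 contains two strongly connected graphs, {2, 3} and {4, 5}, and neither is split across partitions, which is the property the proof establishes. The key of each partition is the vertex whose ID equals the partition color.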
8. Second class center point selection
Once the very large strongly connected graph has been detected, the remaining strongly connected graphs are small, and the choice of center point has little effect on the efficiency of the algorithm. To save time, in each partition the vertex whose initial color value equals the partition's color value is therefore selected directly as the center point and marked in array M.
9. Forward traversal from the center point results in SCC
In each partition, a parallel forward breadth-first traversal is performed starting from the selected center point. After the traversal, all traversed vertices in each partition, together with the center point, form a strongly connected graph. These results are marked in array M.
10. Third class center point selection
After the process above, the remaining undetected strongly connected graphs are small and the choice of center point has little effect on the efficiency of the algorithm, so a vertex is selected directly at random in each partition as the center point and marked in array M.
11. Parallel forward and backward traversal starting from a center point
Starting from the center point, a forward breadth-first traversal is performed on arrays R and C and, in parallel, a backward breadth-first traversal is performed on arrays R' and C'; the traversal state of each vertex is marked in array M.
12. Pruning operation of the first kind
This procedure is the same as in section 2.
13. Updating the state to obtain a strongly connected graph and partitions
This procedure is the same as in section 5.
14. Judging whether a new strongly connected graph is generated
Whether a new strongly connected graph has been generated is judged from the result of part 13. If so, the operation of part 15 is performed; if not, the result is output and the algorithm ends.
15. Judging whether the iteration number exceeds k_t
Whether the iteration number exceeds the threshold k_t is judged. The threshold is preset by the user; according to the experimental results, the average running efficiency is highest when the threshold is 2. If the threshold is exceeded, the algorithm jumps to part 7 for the coloring partition operation; otherwise it jumps to part 10 for the center point selection operation.
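The two decisions in parts 14 and 15 amount to a small dispatch step, sketched below in Python. The function name and the returned labels are hypothetical, and the sketch does not model how the iteration counter is maintained or reset, which this passage does not specify.

```python
def next_part(new_scc_found, iterations, k_t=2):
    """Return which part of the method runs next (parts 14 and 15)."""
    if not new_scc_found:
        return "finish"    # part 14: output the result and end
    if iterations > k_t:
        return "part 7"    # coloring partition operation
    return "part 10"       # third class center point selection

print(next_part(True, 3))  # part 7
print(next_part(True, 2))  # part 10
```

The default k_t=2 reflects the value that the experiments report as giving the highest average efficiency.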
Experimental data and conclusions
The experimental data used for comparison comprises two categories of graph data: generated graphs and real graphs. The generated graphs were produced with the Georgia Tech graph generator (GTgraph) in three types: Random, R-MAT and SSCA#2, each with 10M vertices. The real graph data comes from the SNAP database and the Koblenz Network Collection. Details are given in Table 1. The experiment used 3 generated graphs and 8 real graphs in total, which differ greatly in size, vertex degree distribution and SCC distribution. The experiment compared 6 methods: (1) Tarjan: Tarjan's serial strongly connected graph detection algorithm, a representative traditional serial algorithm; (2) Barnat: Barnat's parallel strongly connected graph detection method, a classical FB-Trim algorithm; (3) Hong: the parallel strongly connected graph detection method using WCC partitioning proposed by Hong et al.; (4) wHong: the parallel strongly connected graph detection method of Devshatwar et al., which improves the method of Hong with virtual warps; (5) Slota: the improved parallel coloring-based strongly connected graph detection method of Slota et al.; (6) FBw-Pc: the parallel strongly connected graph detection method based on warp reuse and coloring partition proposed by the present invention.
Table 1. Detailed parameters of the experimental data
Figure BDA0002490258870000091
1) Performance analysis of the warp reuse method
To verify that the warp reuse method significantly accelerates the detection of the very large strongly connected graph in the first stage of the algorithm, this experiment compares several methods. Among the parallel methods, the Slota method assigns one vertex to each thread when detecting the very large strongly connected graph in the first stage, and uses a coloring algorithm to detect the remaining trivial strongly connected graphs in the second stage. The wSlota method proposed by Devshatwar et al. improves the first stage of the Slota method by assigning a vertex task to each virtual warp. For comparison, this experiment improves the first stage of the Slota method with the warp reuse method and leaves the other parts identical to the Slota and wSlota algorithms. The results are shown in Table 2, where ReuseWarps1 and ReusingWarps2 are two different implementations of the warp reuse method.
Table 2. Running time comparison of warp-reuse-related algorithms
Graph name Slota wSlota ReuseWarps1 ReusingWarps2
soc-LiveJournal1 215.075 221.727 200.538 192.508
soc-pokec 104.735 89.440 87.065 87.938
wiki-topcats 415.820 380.122 263.125 146.377
WikiTalk 129.016 55.937 55.108 50.434
youtube-links 120.968 52.340 56.145 56.639
baidu-internallink 198.023 132.268 126.782 132.679
synthetic-rmat 346.220 379.750 385.638 345.330
synthetic-random 344.003 352.820 362.146 346.785
synthetic-ssca 967.494 4264.943 3721.599 1916.128
wikipedia-en 1977.462 2385.520 2041.199 1405.830
uk-2002 1660.969 1666.089 1709.536 1600.710
As can be seen from the table above, the method of the present invention outperforms the other methods on most of the graph data, especially on wiki-topcats, WikiTalk and youtube-links, where the acceleration ratios reach 2.84, 2.56 and 2.14. On the generated graphs synthetic-random and synthetic-ssca, the method is slower than Slota: the warp reuse method mainly targets graph data with severe skew, whereas a generator that connects pairs of vertices at random according to a fixed rule yields relatively even out-degree and in-degree distributions, so the generated graphs are not strongly skewed. On baidu-internallink the method is only slightly slower than wSlota, with a speed difference of only 7 percent. Compared with virtual warps, warp reuse itself introduces a small amount of extra work; if this extra work outweighs the acceleration it provides, the method is slower than the virtual warp method. Overall, the experimental results show that warp reuse provides a clear acceleration.
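The three acceleration ratios quoted above can be reproduced from Table 2 by dividing the Slota running time by the ReusingWarps2 running time. That this is the pair of columns the authors used is an inference from the numbers, not something the text states. A short Python check, with hypothetical variable names:

```python
# (Slota time, ReusingWarps2 time) copied from Table 2.
table2 = {
    "wiki-topcats": (415.820, 146.377),
    "WikiTalk": (129.016, 50.434),
    "youtube-links": (120.968, 56.639),
}

def speedup(baseline, improved):
    """Acceleration ratio: baseline running time over improved running time."""
    return baseline / improved

ratios = {name: round(speedup(*times), 2) for name, times in table2.items()}
print(ratios)
# {'wiki-topcats': 2.84, 'WikiTalk': 2.56, 'youtube-links': 2.14}
```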
2) Performance analysis of the coloring partition method
The invention proposes a coloring partition method to replace the WCC partition method used in the traditional approach. To verify that the method produces more partitions and can quickly detect an equal number of strongly connected graphs, this experiment compares the performance of different partitioning methods. The baseline is the parallel strongly connected graph detection method based on WCC partitioning proposed by Hong et al. In this experiment, the WCC partition in the method of Hong et al. is replaced with the coloring partition method and the other parts are left unmodified, so that the performance of the two partitioning methods can be compared. The results are shown in Table 3:
Table 3. Comparison of different partitioning methods
Figure BDA0002490258870000111
As can be seen from the table above, some real graphs and generated graphs do not require the second stage of the algorithm. For example, wiki-topcats is graph data consisting of the vertices with larger out-degree and in-degree in the strongly connected graph generated from the hyperlinks of Wikipedia in September 2011, and it contains only one strongly connected graph. A study of several R-MAT and Random generated graphs shows that these two kinds of generated graph contain only one very large strongly connected graph and a large number of 1-SCCs, and no other large strongly connected graphs. Because a very large strongly connected graph is well suited to parallel processing and 1-SCCs can be detected quickly by pruning, the parallel methods outperform the serial algorithm by a much larger margin on these two kinds of generated graph than on real graph data. Apart from these special cases, however, most real graph data contains a large number of trivial strongly connected graphs other than 1-SCCs and 2-SCCs, which must be detected in the second stage of the algorithm. As shown in Table 3, on the remaining graph data the method of the present invention generates more partitions, reduces the total number of iterations, and detects as many strongly connected graphs as there are partitions. In the second stage of the algorithm, the coloring partition method therefore clearly improves efficiency.
3) Overall performance analysis
To verify that the method effectively accelerates strongly connected graph detection, this experiment compares the detection time of the method with that of the other 5 methods on the 11 graphs listed in Table 1; the results are shown in FIG. 5. On most of the graph data the method of the present invention is significantly faster than the other methods, achieving average acceleration ratios of 7.23, 30.55, 1.75, 1.26 and 1.10 relative to Tarjan, Barnat, Hong, Slota and wHong respectively. The most significant acceleration occurs on the two generated graphs synthetic-rmat and synthetic-random, where all parallel algorithms achieve an acceleration ratio of more than 10, in some cases more than 20, over the traditional serial method (Tarjan). As the analysis above shows, these two kinds of generated graph are well suited to parallel processing and need no acceleration in the second stage of the algorithm, whereas the other methods, which build on the Barnat method, optimize the second stage and thereby add some extra work, so on these graphs they are less efficient than the Barnat method. The generated graph synthetic-ssca does contain strongly connected graphs other than a very large one, 1-SCCs and 2-SCCs, but it is generated cluster by cluster, with all vertices in each cluster directly connected to one another, so there is no data skew and the warp-based methods wHong and FBw-Pc gain little. Since graph algorithms are mainly used to process real graph data, efficiency on real graph data is more convincing. As can be seen from FIG. 5, although the method of the present invention does not achieve the best acceleration on the generated graphs, it performs well on all of the real graph data, especially on soc-LiveJournal1, soc-pokec and wikipedia-en, where the acceleration ratio exceeds 8. On real graph data the Barnat method is slower than the traditional serial algorithm in almost all cases, and on the two very large graphs (wikipedia-en and uk-2002) it cannot complete detection at all. The experiment shows that the method achieves good acceleration on a variety of real graph data.
The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art may modify the technical solution or replace it with equivalents without departing from the spirit and scope of the present invention, and the scope of protection shall be determined by the claims.

Claims (4)

1. A strong connectivity graph detection method based on warp reuse and colored partition on a GPU is characterized by comprising the following steps:
loading graph data and unifying the storage format, comprising the following steps: storing the data in compressed sparse row (CSR) format, allocating host memory for array C and array R according to the size of the graph data, and storing the graph data in array C and array R; swapping the start point and end point of every edge, and storing the resulting reverse CSR graph data in array C' and array R'; allocating memory of the same size as the host memory on the GPU device, copying all of the arrays to the device, and allocating in device memory an array M that represents the vertex states;
performing a first-type pruning operation on the graph data to detect strongly connected graphs consisting of only one vertex (1-SCC);
selecting a first-type center point, using the vertex with the largest product of in-degree and out-degree as the center point;
traversing forward and backward in parallel from the center point using a warp reuse method, comprising the following steps: dividing each warp in the GPU into several equal-size virtual warps according to a preset parameter kw, each virtual warp comprising kw threads; assigning all vertex tasks to the virtual warps in sequence, each virtual warp being assigned kv vertex tasks; starting from the center point, performing in parallel a forward breadth-first traversal on array R and array C and a backward breadth-first traversal on array R' and array C'; in each virtual warp, for each assigned vertex that needs to be processed, all threads checking simultaneously and in parallel whether the adjacent vertices of that vertex have been detected, and marking the result in array M; obtaining one strongly connected graph and three partitions from the traversal, wherein the strongly connected graph consists of the center point and the vertices traversed both forward and backward, and the three partitions consist respectively of the vertices traversed forward but not backward, the vertices traversed backward but not forward, and the vertices traversed neither forward nor backward; each partition independently containing all of the strongly connected graphs within it, so that detection of the strongly connected graphs can be completed in parallel; labeling the vertices of the strongly connected graph and of the parallel-detectable partitions accordingly in array M;
judging whether a second-type pruning operation is needed, comprising the following steps: deciding whether to perform the pruning operation according to a preset kt, one coloring partition being performed after every kt rounds of strongly connected graph detection; if kt is 0, no pruning operation is required; if kt is greater than 0, a pruning operation is required; if required, performing the second-type pruning operation to detect strongly connected graphs consisting of one vertex (1-SCC) and strongly connected graphs consisting of two vertices (2-SCC); the second-type pruning operation comprising the following steps: first performing 1-SCC detection repeatedly until no new 1-SCC is produced, then performing 2-SCC detection once, and finally performing 1-SCC detection repeatedly until no new 1-SCC is produced, wherein the 1-SCC detection is the same as the steps of the first-type pruning operation, and the 2-SCC detection comprises: examining the adjacent vertices of every undetected vertex in parallel, and if an adjacent vertex has a directed edge connected back to the current vertex, and apart from the mutual connection the out-degree or the in-degree of the two vertices is zero, the two vertices form a strongly connected graph and are marked in array M;
further partitioning the three partitions using the coloring partition method to obtain a larger number of smaller partitions;
performing a second-type center point selection operation in the resulting small partitions, selecting in each partition the vertex whose initial color value equals the partition color value as the center point, and traversing forward from the center point, the traversed vertices and the center point forming a strongly connected graph and the untraversed vertices forming new partitions;
performing a third-type center point selection operation, directly selecting a random vertex in each partition as the center point, and traversing forward and backward in parallel from the center point;
performing the first-type pruning operation again, and updating the strongly connected graphs and the partitions;
judging whether a new strongly connected graph is generated; if not, ending; if so, judging whether the iteration number exceeds the threshold kt; if the threshold is exceeded, performing the coloring partition operation using the coloring partition method, and otherwise performing the third-type center point selection operation; the coloring partition method comprising: taking the ID of each vertex as the color value of that vertex; for every vertex that does not yet form part of a strongly connected graph, examining the colors of its adjacent vertices, and if the color of an adjacent vertex is smaller than that of the current vertex, changing the color of the current vertex to the color of the adjacent vertex; repeating this process until no color value changes; and dividing all vertices that do not yet form part of a strongly connected graph into several partitions according to their color values, each partition independently containing all of the strongly connected graphs within it.
2. The method of claim 1, wherein the first-type pruning operation comprises: checking the in-degree and out-degree of every vertex in parallel on the GPU; if the in-degree or the out-degree of a vertex is zero, the vertex forms a 1-SCC by itself, its state is marked as a center point in array M, and its detection is complete.
3. The method of claim 1, wherein selecting the first-type center point comprises: allocating on the GPU a variable that stores the center point and initializing it to 0; computing in parallel the product of in-degree and out-degree of the center point and of each vertex; if the product for the current vertex is larger than the product for the center point, replacing the center point with that vertex; and repeating this process until the value of the center point no longer changes.
4. The method of claim 1, wherein the value of kt includes 2.
CN202010403115.0A 2020-05-13 2020-05-13 Detection method of strong connectivity graph based on warp reuse and colored partition on GPU Active CN111754383B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010403115.0A CN111754383B (en) 2020-05-13 2020-05-13 Detection method of strong connectivity graph based on warp reuse and colored partition on GPU

Publications (2)

Publication Number Publication Date
CN111754383A CN111754383A (en) 2020-10-09
CN111754383B true CN111754383B (en) 2023-03-10

Family

ID=72674352

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010403115.0A Active CN111754383B (en) 2020-05-13 2020-05-13 Detection method of strong connectivity graph based on warp reuse and colored partition on GPU

Country Status (1)

Country Link
CN (1) CN111754383B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant