CN110264392B

CN110264392B - Strong connection graph detection method based on multiple GPUs

Info

Publication number: CN110264392B
Application number: CN201910371236.9A
Authority: CN
Inventors: 吴广君; 王树鹏; 侯骏腾; 李斌斌
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-05-06
Filing date: 2019-05-06
Publication date: 2021-05-04
Anticipated expiration: 2039-05-06
Also published as: CN110264392A

Abstract

The invention provides a multi-GPU method for detecting strongly connected components (SCCs), comprising the following steps: loading graph data and unifying the storage format; preprocessing the graph data, including dividing the graph according to the number of partitions, storing the partitions, and replicating vertices of different partitions that are linked with each other; storing the preprocessed data on a plurality of GPUs, performing breadth-first traversal centered on each replicated vertex, and recording replicated-edge information; transmitting the replicated edges back to the CPU, detecting the SCCs among them, and marking the vertices that belong to the same SCC; and transmitting the marked vertices back to the plurality of GPUs for SCC detection.

Description

Strong connection graph detection method based on multiple GPUs

Technical Field

The invention relates to detecting strongly connected components (SCCs) on a heterogeneous system, and in particular to an SCC detection method based on multiple GPUs.

Background

The graph is a basic data structure and is widely used because it expresses the relationships among data well. With the development of artificial intelligence and social networks, accelerating computation over large amounts of graph data has become an active field. The strongly connected component (SCC) is a basic graph structure. Detecting SCCs quickly in graph data is a basic problem for many graph computing applications, with important uses in fields such as social network analysis, computer assistance, and model calculation.

Strongly connected component (SCC) detection has been studied for a long time. Before general-purpose GPUs became widely used, researchers commonly ran SCC detection on CPUs and proposed many excellent algorithms, such as the Tarjan algorithm, the Kosaraju algorithm, and the Dijkstra algorithm. Among them, the Tarjan algorithm completes SCC detection by traversing each edge of the graph only once, which has made it the classic serial SCC detection algorithm. However, the Tarjan algorithm is based on depth-first search (DFS), so it is difficult to parallelize and its running speed cannot be increased further. To accelerate SCC detection, researchers have proposed parallel algorithms. The classic ones are the Forward-Backward traversal algorithm (FB) and the Forward-Backward traversal with pruning algorithm (FB-Trim), an improvement on FB. Many later algorithms build on these two, and most of them can exploit the parallel computing power of special hardware such as GPUs. A GPU-based parallel SCC detection algorithm can launch a large number of GPU threads at once, and so can achieve speedups of dozens of times on large-scale graph data. However, current servers are generally equipped with several GPUs, or with graphics cards such as the NVIDIA Tesla K80 that each carry two GPUs. Existing GPU-based SCC detection schemes either use a single GPU, and so cannot make full use of all GPUs on the device, or run SCC detection on an existing graph computing system, which causes a large amount of data exchange between the GPUs and ultimately hurts the execution efficiency of the algorithm. Parallel SCC detection on a single GPU or on a graph system therefore limits the efficiency and applicability of detection on large graphs.

Disclosure of Invention

The invention aims to provide a multi-GPU method for detecting strongly connected components (SCCs), designed for devices that have multiple GPUs or for graphs of large scale. According to the computing requirements of the different stages and the characteristics of the different hardware, it runs different algorithms on the CPU or on the multiple GPUs, and can make full use of the parallel computing capability of the multiple GPUs. It is a heterogeneous multi-GPU graph processing algorithm that integrates breadth-first traversal, the Tarjan algorithm, and an improved FB-Trim algorithm, and it achieves good SCC detection results on graph data.

In order to achieve the purpose, the technical scheme adopted by the invention is as follows:

A method for detecting strongly connected components (SCCs) based on multiple GPUs comprises the following steps:

loading graph data and unifying the storage format;

preprocessing the graph data, including dividing the graph according to the number of partitions, storing the partitions, and replicating vertices of different partitions that are linked with each other;

storing the preprocessed data on a plurality of GPUs, performing breadth-first traversal centered on each replicated vertex, and recording replicated-edge information;

transmitting the replicated edges back to the CPU, detecting the SCCs among them, and marking the vertices that belong to the same SCC;

and transmitting the marked vertices back to the plurality of GPUs for SCC detection.

Further, the graph data includes information such as the number of vertices, the number of edges, and the numbers of the head vertex and the tail vertex of each edge.

Further, storing the graph data uniformly in the CSR format comprises the following steps:

allocating a linked-vertex array C with one more element than the number of edges, and a linked-vertex position array R with one more element than the number of vertices;

storing in array C, in order, the tail vertices of all edges that have each vertex as their head vertex, and assigning a null value to the last element of array C; in array R, the i-th element stores the position in array C of the first tail vertex among the edges whose head vertex is the i-th vertex, and the last element stores the position of the last element of array C.

Further, the step of pre-processing comprises:

partitioning the graph data to obtain the partition number of each vertex and the number of vertices in each partition, and reading the partition numbers into an array P by vertex number, where the i-th element of array P records the partition number to which vertex i belongs;

allocating, according to the total number N of GPUs on the device, an array C_i, an array R_i, a replicated-vertex mark array M_i, and a vertex-number array V_i for each partition, where 1 ≤ i ≤ N;

for each vertex traversed in the graph data, according to the partition number recorded for that vertex in array P, recording the vertex in the array V_i of its partition and recording its linked vertices in array C_i and array R_i;

if a vertex k linked to a vertex j belongs to another partition, also storing vertex k in the array V_i corresponding to vertex j, marking vertex j as a replicated vertex by setting the position corresponding to vertex j in array M_i to 1, and also marking vertex k as a replicated vertex in the partition to which it belongs; if all vertices linked to vertex j belong to the partition of vertex j, setting the position corresponding to vertex j in array M_i to 0.

Further, the breadth-first traversal centered on the replicated vertices and the recording of replicated-edge information run on all GPUs in parallel and comprise the following steps:

for each vertex stored on the i-th GPU, if a vertex j is marked 1 in array M_Gi, performing a breadth-first traversal centered on vertex j;

if a traversed vertex k does not belong to the current partition, or vertex k is marked 1 in array M_Gi, storing the vertex pair (j, k) as a new edge in array E_Gi;

repeating the above two steps until all vertices on each GPU have been checked.

Further, detecting the strongly connected components (SCCs) with the Tarjan algorithm and marking the vertices that belong to the same SCC comprises the following steps:

1) allocating a timestamp array D and a root node array L according to the number of vertices in the current graph, where array D records the sequence number at which each vertex is searched, and array L records the root of the smallest subtree containing the vertex in the current search tree;

2) reading a new vertex u from array C_C and starting the Tarjan algorithm centered on vertex u;

3) setting the values of D(u) and L(u) from the current search depth, and pushing vertex u onto the stack S;

4) for every edge (u, v) leaving vertex u: if v has not been searched, searching v through steps 3) to 5) and, when that search finishes, updating L(u) to the minimum of L(u) and L(v); if v has already been searched, directly updating L(u) to the minimum of L(u) and L(v);

5) if the values of D(u) and L(u) are equal, vertex u is the root of an SCC, and the following process is repeated: popping a vertex v from the stack S and storing it in the component SCC(u) rooted at vertex u, until vertex v equals vertex u;

6) if unsearched vertices remain in the current graph, taking one unsearched vertex u and executing steps 3) to 6), until all vertices in the current graph have been searched.

Further, after the marked vertices are transmitted back to the multiple GPUs, all GPUs execute the FB-reinform algorithm in parallel to detect strongly connected components (SCCs), in the following stages:

SCC detection at the replicated vertices;

data-parallel SCC detection at the non-replicated vertices;

task-parallel SCC detection at the non-replicated vertices.

Further, the detection of strongly connected components (SCCs) at the replicated vertices comprises the following steps:

1) randomly selecting, from all replicated vertices of a GPU, a vertex v that does not yet belong to an SCC as the central point, looking up in memory the vertices that belong to the same SCC as vertex v, and marking them;

2) starting from all vertices that belong to the same SCC as the central point, performing a parallel forward breadth-first traversal over the vertices that have not yet formed an SCC, and marking every such vertex reached as forward-traversed;

3) starting from all vertices that belong to the same SCC as the central point, performing a parallel backward breadth-first traversal over the vertices that have not yet formed an SCC, and marking every such vertex reached as backward-traversed;

4) pruning the vertices that have not yet formed an SCC: traversing all such vertices in parallel and, if a vertex u is not a replicated vertex on the GPU and has in-degree 0 or out-degree 0, marking vertex u as forming an SCC by itself;

5) intersecting, in parallel, the forward-traversed vertices with the backward-traversed vertices, and marking the resulting vertex set as part of the SCC already formed;

6) checking, in parallel, the vertices marked as forward-traversed or backward-traversed, and clearing the traversal marks of those that have not formed an SCC;

7) randomly selecting, from all replicated vertices of the GPU, a vertex v that has not formed an SCC as the central point, and continuing with steps 2) to 7) until all replicated vertices have formed SCCs.

Further, the data-parallel detection of strongly connected components (SCCs) at the non-replicated vertices comprises the following steps:

1) randomly selecting one vertex v as the central point from all vertices that have not yet formed an SCC;

2) starting from vertex v, performing a parallel forward breadth-first traversal over the vertices that have not yet formed an SCC, and marking every such vertex reached as forward-traversed;

3) starting from vertex v, performing a parallel backward breadth-first traversal over the vertices that have not yet formed an SCC, and marking every such vertex reached as backward-traversed;

4) pruning the vertices that have not yet formed an SCC: traversing all such vertices in parallel and, if a vertex u has in-degree 0 or out-degree 0, marking vertex u as forming an SCC by itself;

5) intersecting, in parallel, the forward-traversed vertices with the backward-traversed vertices, and marking the resulting vertex set as a new SCC;

6) checking, in parallel, the vertices marked as forward-traversed or backward-traversed, and clearing the traversal marks of those that have not formed an SCC;

7) arbitrarily selecting one vertex v as the central point from all vertices that have not formed an SCC, and continuing with steps 2) to 7) until the total number of deleted vertices exceeds 1% of the total number of vertices in the graph.
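The pruning step 4) above can be illustrated with a minimal sequential sketch. It assumes an adjacency-list graph and a hypothetical `scc_of` list in which `None` means "has not yet formed a component"; the function name `trim_once` and the choice to label a pruned vertex with its own number are illustrative assumptions, and the loop over vertices stands in for the GPU threads that would do this in parallel.

```python
def trim_once(adj, scc_of):
    """One pruning pass over the vertices that have no component yet.

    adj[u] lists the tail vertices of the edges leaving u.
    scc_of[u] is None while u has not been assigned to a component.
    Returns the vertices that were pruned in this pass.
    """
    n = len(adj)
    active = [scc_of[u] is None for u in range(n)]

    # Degrees are counted only over edges between still-active vertices.
    in_deg = [0] * n
    out_deg = [0] * n
    for u in range(n):
        if not active[u]:
            continue
        for v in adj[u]:
            # Self-loops are ignored: they cannot link u to another vertex.
            if active[v] and v != u:
                out_deg[u] += 1
                in_deg[v] += 1

    # A vertex with no incoming or no outgoing active edge lies on no cycle
    # with another vertex, so it forms a component by itself.
    trimmed = []
    for u in range(n):
        if active[u] and (in_deg[u] == 0 or out_deg[u] == 0):
            scc_of[u] = u
            trimmed.append(u)
    return trimmed
```

For the graph with edges 0→1, 1→2, 2→0 and 2→3, one pass prunes only vertex 3, because the other three vertices lie on a cycle.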

Further, the task-parallel detection of strongly connected components (SCCs) at the non-replicated vertices comprises the following steps:

2) starting from vertex v, performing a parallel forward breadth-first traversal over the vertices in the current partition that have not yet formed an SCC, and marking every such vertex reached as forward-traversed;

3) starting from vertex v, performing a parallel backward breadth-first traversal over the vertices in the current partition that have not yet formed an SCC, and marking every such vertex reached as backward-traversed;

4) pruning the vertices in the current partition that have not yet formed an SCC: traversing all such vertices in parallel and, if a vertex u has in-degree 0 or out-degree 0, marking vertex u as forming an SCC by itself;

6) checking, in parallel, the vertices marked as forward-traversed or backward-traversed, and dividing the vertices that have not formed an SCC into three partitions: vertices that are forward-traversed but not backward-traversed, vertices that are backward-traversed but not forward-traversed, and vertices that are neither forward-traversed nor backward-traversed; then clearing the traversal marks of all vertices in the three partitions;

7) arbitrarily selecting one vertex v as the central point from each partition obtained in step 6), and continuing with steps 2) to 7) until all vertices in the graph have formed SCCs.
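The three-way split in step 6) is the core of forward-backward decomposition, and a small sequential sketch may make it concrete. The names `reach` and `fb_decompose` are hypothetical, the pruning step is left out for brevity, and a plain work list stands in for the independent tasks that would run in parallel; the pivot is taken as the smallest vertex number only to make the result deterministic.

```python
from collections import deque


def reach(adj, start, allowed):
    """Breadth-first traversal from start, restricted to the vertex set allowed."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v in allowed and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def fb_decompose(adj):
    """Label every vertex with the pivot of its strongly connected component."""
    n = len(adj)
    # Reverse adjacency lists are needed for the backward traversal.
    radj = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            radj[v].append(u)

    scc_of = [None] * n
    work = [set(range(n))]          # each entry is an independent sub-partition
    while work:
        sub = work.pop()
        if not sub:
            continue
        pivot = min(sub)            # stand-in for an arbitrary central point
        fwd = reach(adj, pivot, sub)
        bwd = reach(radj, pivot, sub)
        # Vertices reached in both directions form the pivot's component.
        for v in fwd & bwd:
            scc_of[v] = pivot
        # The remaining vertices split into three sets that share no component.
        work.extend([fwd - bwd, bwd - fwd, sub - fwd - bwd])
    return scc_of
```

Because no component can span two of the three remaining sets, each set can be processed without reference to the others, which is what makes the task-parallel formulation possible.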

The principle on which the method of the invention is based is as follows:

1. In a device with multiple GPUs, the graph data is loaded and converted into a unified storage format. The graph is partitioned using the fast, high-quality multilevel partitioning scheme proposed for irregular graphs by George Karypis and Vipin Kumar, yielding the partition number of each vertex and the number of vertices in each partition. The graph data of each partition is stored on the corresponding GPU. After the edges that cross two partitions are given special processing, each GPU only needs to run a general strongly connected component (SCC) detection method independently.

2. Breadth-first traversal, the Tarjan algorithm, and the improved FB-Trim algorithm are integrated. When the graph data of different partitions is stored on different GPUs, many edges still have their start and end points in two different partitions; left unprocessed, these edges can cause some SCCs to be split. The method proceeds as follows: first, each edge whose start and end points lie in different partitions is stored on the partition of its start point, and both endpoints are marked as replicated vertices; second, the connectivity among all replicated vertices is detected by breadth-first traversal and the result is transmitted back to the CPU; third, the CPU collects the connectivity of the replicated vertices from all partitions, detects the SCCs they form with the Tarjan algorithm, and transmits the result back to the GPUs; finally, SCC detection is performed on the remaining vertices on each GPU in parallel.

The method of the invention has the following advantages and effects:

1. Strongly connected component (SCC) detection can run on multiple GPUs, making full use of the computing power of all parallel devices on the machine. High-performance computing devices typically carry multiple GPUs, and the invention can use all of them to accelerate SCC detection. In the multi-GPU computing model, the process is designed around the fact that exchanging data between the GPUs and the CPU is very time-consuming. In the invention, no data is transmitted between GPUs; data transfer between the GPUs and the CPU requires only one read-write operation from the CPU to each GPU, and the data involved is only a small part of the total.

2. Multiple algorithms are fused. The invention uses different algorithms in different stages, according to the computing requirements of each stage and the characteristics of the hardware. When computing the connectivity between the partitions of the graph data, few vertices and edges are involved, so breadth-first traversal and the Tarjan algorithm are used; when computing the SCCs within each partition, the many vertices and edges of the partition must be processed repeatedly, so the improved FB-Trim algorithm is run in parallel on the multiple GPUs.

3. The method is specifically targeted at SCC detection on multiple GPUs. In the prior art, GPU-based parallel SCC detection either uses a single GPU or runs directly on a multi-GPU graph system. However, a single-GPU SCC detection algorithm cannot make full use of the multiple GPUs in the device, and an SCC detection algorithm run on a multi-GPU graph system causes frequent exchange of information among the GPUs, so that a large amount of time is wasted on data transmission. The invention is an initial exploration of SCC detection directly on multiple GPUs; it makes full use of the computing performance of the CPU and the multiple GPUs, and reduces frequent data transmission between the GPUs.

4. The effect is significant. The scheme is designed mainly for devices with multiple GPUs or for graphs of large scale, where it is clearly faster than existing serial or parallel SCC detection schemes; it also achieves good SCC detection results on graph data of ordinary scale.

Drawings

FIG. 1 is a flowchart of a strong connectivity graph detection algorithm based on multiple GPUs according to the present invention.

FIGS. 2A-2B are schematic diagrams of CSR storage formats used in the present invention, wherein FIG. 2A is a directed graph of graph data, and FIG. 2B is a schematic diagram of storage using array C and array R.

FIGS. 3A-3B are graphs comparing an example graph with 13 vertices and 19 edges before and after partitioning based on 3 GPUs.

FIG. 4 is a schematic diagram of the storage of graph data in the second partition of FIGS. 3A-3B using the storage format of the present invention.

Fig. 5 is a diagram illustrating the final strong connection diagram detection result of the example diagram.

Fig. 6 is a graph comparing overall performance.

FIG. 7 is a graph of graph size versus algorithm effect.

FIG. 8 is a graph comparing the effect of edge density on the algorithm.

FIG. 9 is a graph of the impact of replicated vertices on the algorithm.

Detailed Description

In order to make the aforementioned and other features and advantages of the present invention comprehensible, several embodiments accompanied with figures are described in detail below.

The embodiment provides a multi-GPU method for detecting strongly connected components (SCCs). The method makes full use of the parallel computing capability of the multiple GPUs and fuses several algorithms, choosing a different algorithm according to the task requirements of each stage. As shown in FIG. 1, the implementation is divided into two parts: processing that runs on the CPU and processing that runs on the multiple GPUs. At a finer granularity, the whole implementation breaks down into nine steps, each of which is described in detail below:

The first step: loading graph data and converting it into a unified storage format:

Graph data information is read from a graph data file, comprising: the number of vertices, the number of edges, and the numbers of the head vertex and tail vertex of each edge. This information is stored in Compressed Sparse Row (CSR) format, as shown in FIGS. 2A-2B. The specific steps are as follows:

(1) Allocate a linked-vertex array C with one more element than the number of edges, and a linked-vertex position array R with one more element than the number of vertices.

(2) Store in array C, in order, the tail vertices of all edges that have each vertex as their head vertex, and assign a null value to the last element of array C. In array R, the i-th element stores the position in array C of the first tail vertex among the edges whose head vertex is the i-th vertex, and the last element stores the position of the last element of array C.

With this layout, all vertices that vertex i links to can be read from array C, starting at the position recorded in the (i-1)-th element of array R and ending at the position immediately before the one recorded in the i-th element.
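As an illustration of this CSR layout, the following minimal sketch builds the two arrays from a list of (head, tail) pairs and reads back one vertex's linked vertices. It is not the patented implementation: the helper names `build_csr` and `neighbors` are hypothetical, vertices are assumed to be numbered from 0 (so vertex i uses `R[i]` and `R[i + 1]`), and Python's `None` stands in for the null value at the end of array C.

```python
def build_csr(num_vertices, edges):
    """Build the CSR arrays C and R from (head, tail) pairs."""
    # Count the edges leaving each head vertex.
    out_degree = [0] * num_vertices
    for head, _tail in edges:
        out_degree[head] += 1

    # R has one more entry than there are vertices. R[i] is the position in C
    # of the first tail vertex of vertex i; R[-1] is the position of C's
    # last (null) element.
    R = [0] * (num_vertices + 1)
    for i in range(num_vertices):
        R[i + 1] = R[i] + out_degree[i]

    # C has one more entry than there are edges; the extra last entry is null.
    C = [None] * (len(edges) + 1)
    next_slot = R[:-1]              # slicing copies, so R itself is not changed
    for head, tail in edges:
        C[next_slot[head]] = tail
        next_slot[head] += 1
    return C, R


def neighbors(C, R, i):
    """Return the tail vertices of all edges whose head vertex is i."""
    return C[R[i]:R[i + 1]]
```

For the edges 0→1, 0→2, 1→2 and 2→0, this yields `C = [1, 2, 2, 0, None]` and `R = [0, 2, 3, 4]`, so the linked vertices of vertex 0 are `C[0:2]`.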

The second step: graph partitioning and replicated-vertex preprocessing:

According to the configured number of graph partitions, the graph data is partitioned using the fast, high-quality multilevel partitioning scheme proposed for irregular graphs by George Karypis and Vipin Kumar, yielding the partition number of each vertex and the number of vertices in each partition. The partition numbers are read into an array P by vertex number, where the i-th element of array P records the partition number to which vertex i belongs.

The information of the whole graph is then stored per partition in the CSR format of the first step, as follows:

(1) According to the total number N of GPUs on the current device, allocate for each partition an array C_i, an array R_i, a replicated-vertex mark array M_i, and a vertex-number array V_i, where 1 ≤ i ≤ N.

(2) Traverse the data stored in array C and array R. For each traversed vertex j, according to the partition number of vertex j recorded in array P, record its linked vertices in array C_i and array R_i using the same method as in the first step, and record vertex j in array V_i. If some vertex k among the vertices linked from vertex j belongs to another partition, also store vertex k in array V_i and mark vertex j as a replicated vertex, i.e. set the position corresponding to vertex j in array M_i to 1; vertex k is likewise marked as a replicated vertex in the partition to which it belongs. If all vertices linked from vertex j belong to the current partition, set the position corresponding to vertex j in array M_i to 0.
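The replicated-vertex marking can be sketched as follows. This is a simplified illustration under stated assumptions, not the storage layout of the patent: the partition numbers in `P` are taken as given (no partitioner is called), partitions and vertices are numbered from 0, and each partition keeps plain adjacency lists and a dictionary of marks instead of the per-partition CSR arrays. The function name `partition_graph` and the keys `"V"`, `"adj"` and `"M"` are hypothetical.

```python
def partition_graph(adj, P, num_parts):
    """Split a graph into per-partition records and mark replicated vertices.

    adj[j] lists the tail vertices of the edges leaving vertex j.
    P[j] is the partition number of vertex j.
    """
    parts = [{"V": [], "adj": {}, "M": {}} for _ in range(num_parts)]

    # First pass: every vertex is recorded in its own partition, unmarked.
    for j, tails in enumerate(adj):
        part = parts[P[j]]
        part["V"].append(j)
        part["adj"][j] = list(tails)
        part["M"][j] = 0

    # Second pass: handle edges whose endpoints lie in different partitions.
    for j, tails in enumerate(adj):
        home = parts[P[j]]
        for k in tails:
            if P[k] == P[j]:
                continue
            if k not in home["V"]:
                home["V"].append(k)     # copy of the remote vertex k
            home["M"][j] = 1            # j is a replicated vertex
            parts[P[k]]["M"][k] = 1     # k is replicated in its own partition
    return parts
```

With edges 0→1, 1→2, 2→0, 2→3, 3→4, 4→3 and `P = [0, 0, 0, 1, 1]`, the only cross-partition edge is 2→3, so vertices 2 and 3 are marked and a copy of vertex 3 is added to partition 0.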

Fig. 3A-3B are graphs comparing before and after graph splitting for an exemplary graph having 13 vertices and 19 edges based on 3 GPUs, where the vertices marked with diagonal line shading represent replicated vertices.

FIG. 4 is a schematic diagram of the storage format of the present invention for graph data in the second partition of FIGS. 3A-3B, including array C and array R stored using CSR format, and array V for marking the vertex number and array M for marking whether the current vertex is a duplicate vertex.

The third step: storing the preprocessed data on the multiple GPUs:

On each GPU, allocate an array C_Gi, an array R_Gi, a replicated-vertex mark array M_Gi, and a vertex-number array V_Gi for the current partition, where i is the serial number of the current GPU and 1 ≤ i ≤ N.

Array C_i, array R_i, array M_i, and array V_i in host memory are copied into array C_Gi, array R_Gi, array M_Gi, and array V_Gi in the corresponding device memory, respectively.

The fourth step: performing breadth-first traversal centered on the replicated vertices and recording replicated-edge information:

The following process runs on all GPUs in parallel:

(1) For each vertex stored on the i-th GPU, if a vertex j is marked 1 in array M_Gi, perform a breadth-first traversal centered on vertex j.

(2) If a traversed vertex k does not belong to the current partition, or vertex k is marked 1 in array M_Gi, store the vertex pair (j, k) as a new edge in array E_Gi.

(3) Repeat the above two processes until all vertices on each GPU have been checked.
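A sequential sketch of this traversal for a single partition is given below. It reuses a hypothetical per-partition record with keys `"V"`, `"adj"` and `"M"` (adjacency lists and marks rather than the CSR arrays), and it makes one assumption the text does not state: the traversal records an edge when it reaches a replicated or remote vertex but does not expand further from that vertex, since that vertex's own traversal or its home partition covers what lies beyond it. The function name `replicated_edges` is illustrative.

```python
from collections import deque


def replicated_edges(part, P, part_id):
    """Collect edges (j, k) from each local replicated vertex j to every
    replicated or remote vertex k that j reaches inside this partition."""
    adj, M = part["adj"], part["M"]
    edges = []
    for j in part["V"]:
        # Only replicated vertices owned by this partition are BFS centers.
        if P[j] != part_id or M.get(j, 0) != 1:
            continue
        seen = {j}
        queue = deque([j])
        while queue:
            u = queue.popleft()
            for k in adj.get(u, []):
                if k in seen:
                    continue
                seen.add(k)
                if P[k] != part_id or M.get(k, 0) == 1:
                    edges.append((j, k))    # boundary reached: record the edge
                else:
                    queue.append(k)         # interior vertex: keep traversing
    return edges
```

In a partition holding the path 0→1→2 where vertices 0 and 2 are replicated, the traversal from vertex 0 passes through the interior vertex 1 and records the single edge (0, 2).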

The fifth step: transmitting the replicated edges back to the CPU:

The arrays E_Gi obtained in the fourth step are copied from device memory into arrays E_Ci in host memory and merged into a single array E_C. Array E_C is then stored in the CSR format of the first step, giving a linked-vertex array C_C and a linked-vertex position array R_C; together these two arrays represent a new graph.

The sixth step: detecting the strongly connected components (SCCs) with the Tarjan algorithm and marking the vertices that belong to the same SCC:

On the CPU, the SCCs of the new graph obtained in the fifth step are detected with the Tarjan algorithm, and the vertices that belong to the same SCC are marked with the same color. The specific process is as follows:

(1) Allocate a timestamp array D and a root node array L according to the number of vertices in the current graph, where array D records the sequence number at which each vertex is searched, and array L records the root of the smallest subtree containing the vertex in the current search tree.

(2) Read a new vertex u from array C_C and start the Tarjan algorithm centered on vertex u.

(3) Set the values of D(u) and L(u) from the current search depth, and push vertex u onto the stack S.

(4) For every edge (u, v) leaving vertex u: if v has not been searched, search v using processes (3)-(5) and, when that search finishes, update L(u) to the minimum of L(u) and L(v); if v has already been searched, directly update L(u) to the minimum of L(u) and L(v).

(5) If the values of D(u) and L(u) are equal, vertex u is the root of an SCC, and the following process is repeated: pop a vertex v from the stack S and store it in the component SCC(u) rooted at vertex u, until vertex v equals vertex u.

(6) If unsearched vertices remain in the current graph, take one unsearched vertex u and execute processes (3)-(6), until all vertices in the current graph have been searched.

In the above process, the root node of the SCC containing each vertex is recorded in the root node array L, so two vertices whose values in array L are equal belong to the same SCC.
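For reference, a compact recursive sketch of the textbook Tarjan algorithm is shown here, using the same names D, L and S. It differs from the description above in two labelled ways: it only lowers L(u) through an already searched vertex v when v is still on the stack (the standard condition), and it writes the component root into a separate `root` array when a vertex is popped, rather than relying on the final values of L. The function name `tarjan_scc` is hypothetical, and the recursion suits small graphs only.

```python
def tarjan_scc(adj):
    """Return, for every vertex, the root vertex of its strongly connected
    component. adj[u] lists the tail vertices of the edges leaving u."""
    n = len(adj)
    D = [None] * n              # timestamp: order in which a vertex is searched
    L = [None] * n              # lowest timestamp reachable from the subtree
    root = [None] * n           # component root, filled in when popped
    S = []                      # stack of vertices not yet assigned
    on_stack = [False] * n
    counter = 0

    def search(u):
        nonlocal counter
        D[u] = L[u] = counter
        counter += 1
        S.append(u)
        on_stack[u] = True
        for v in adj[u]:
            if D[v] is None:
                search(v)
                L[u] = min(L[u], L[v])
            elif on_stack[v]:
                # Standard condition: only vertices still on the stack count.
                L[u] = min(L[u], D[v])
        if D[u] == L[u]:
            # u is the root of a component: pop until u itself is popped.
            while True:
                v = S.pop()
                on_stack[v] = False
                root[v] = u
                if v == u:
                    break

    for u in range(n):
        if D[u] is None:
            search(u)
    return root
```

Two vertices then belong to the same component exactly when their entries in `root` are equal, which is the property the seventh step relies on when the labels are copied to the GPUs.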

The seventh step: passing the marked vertex information back to the GPUs:

The root node array L is copied from host memory into array L_G in device memory.

The eighth step: detecting strongly connected components (SCCs) with the FB-reinform algorithm:

This process runs on all GPUs in parallel. The FB-Reinfore algorithm used here is an improved version of the FB-Trim algorithm proposed by Jiri Barnat et al. and comprises three stages: SCC detection at the replicated vertices, data-parallel SCC detection at the non-replicated vertices, and task-parallel SCC detection at the non-replicated vertices. For the i-th GPU, the specific process is as follows:

Strong connected graph detection stage at the replicated vertices:

(1) From all vertices marked 1 in array M_Gi, randomly select one vertex v as the center point. If vertex v already belongs to a strong connected graph, select another vertex v from the vertices marked 1 in array M_Gi as the center point; if vertex v does not yet belong to a strong connected graph, use L_G to find and mark the vertices that belong to the same strong connected graph as vertex v.

(2) Starting from all vertices that belong to the same strong connected graph as the center point, perform a parallel forward breadth-first traversal over the vertices that have not yet formed a strong connected graph, and mark every such traversed vertex as forward-traversed.

(3) Starting from all vertices that belong to the same strong connected graph as the center point, perform a parallel backward breadth-first traversal over the vertices that have not yet formed a strong connected graph, and mark every such traversed vertex as backward-traversed.

(4) Prune the vertices that have not yet formed a strong connected graph. Traverse all such vertices in parallel; if a vertex u is marked 0 in array M_Gi and its in-degree is 0 or its out-degree is 0, mark vertex u as forming an independent strong connected graph.

(5) In parallel, intersect the vertices marked as forward-traversed in (2) with the vertices marked as backward-traversed in (3), and mark the resulting vertex set as belonging to the strong connected graph already formed.

(6) In parallel, check the vertices marked as forward-traversed and the vertices marked as backward-traversed, and clear the traversal marks of those that have not formed a strong connected graph.

(7) From all vertices marked 1 in array M_Gi, randomly select one vertex v that has not yet formed a strong connected graph as the center point, and continue operations (2) - (7) until all vertices marked 1 in array M_Gi have formed strong connected graphs.
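The core of this stage, seeding both traversals from every replicated vertex that shares a root label with the center point, can be sketched sequentially as below. The sketch assumes adjacency lists `adj` and `radj` (reversed edges) for one partition, a 0/1 list `is_replica` standing in for M_Gi, and a list `root_label` standing in for L_G; these names are hypothetical. It picks center points in index order instead of randomly, uses Python sets in place of GPU kernels, and omits the pruning of step (4).

```python
from collections import deque

def reach(seeds, edges, active):
    """Breadth-first traversal from all seeds, restricted to `active` vertices."""
    seen = {s for s in seeds if s in active}
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for w in edges[u]:
            if w in active and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def replica_stage(adj, radj, is_replica, root_label):
    """Return scc, where scc[v] is the center point of v's component or -1."""
    n = len(adj)
    scc = [-1] * n
    active = set(range(n))               # vertices not yet in any component
    for v in range(n):
        if not is_replica[v] or scc[v] != -1:
            continue                     # step (1): skip non-replicas and finished vertices
        # Replicated vertices with the same root label are already known
        # (from the CPU-side detection) to share a component with v.
        seeds = [u for u in sorted(active)
                 if is_replica[u] and root_label[u] == root_label[v]]
        fwd = reach(seeds, adj, active)  # step (2): forward traversal
        bwd = reach(seeds, radj, active) # step (3): backward traversal
        comp = fwd & bwd                 # step (5): intersection
        for u in comp:
            scc[u] = v
        active -= comp                   # step (6) is implicit: sets are rebuilt each round
    return scc
```

Seeding from the whole labelled group matters because a path between two replicated vertices may run through another partition; a traversal seeded from the center point alone would miss local vertices that are only reachable through such a path.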

Data-parallel strong connected graph detection stage at the non-replicated vertices:

(1) Randomly select one vertex v from all vertices that have not yet formed a strong connected graph as the center point.

(2) Starting from vertex v, perform a parallel forward breadth-first traversal over the vertices that have not yet formed a strong connected graph, and mark every such traversed vertex as forward-traversed.

(3) Starting from vertex v, perform a parallel backward breadth-first traversal over the vertices that have not yet formed a strong connected graph, and mark every such traversed vertex as backward-traversed.

(4) Prune the vertices that have not yet formed a strong connected graph. Traverse all such vertices in parallel; if a vertex u has in-degree 0 or out-degree 0, mark vertex u as forming an independent strong connected graph.

(5) In parallel, intersect the vertices marked as forward-traversed in (2) with the vertices marked as backward-traversed in (3), and mark the resulting vertex set as a new strong connected graph.

(7) Randomly select one vertex v from all vertices that have not yet formed a strong connected graph as the center point. Operations (2) - (7) continue until the total number of deleted vertices exceeds 1% of the total number of vertices in the graph.
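The loop formed by these steps can be sketched as follows. This is a sequential illustration under stated assumptions, not the GPU implementation: `adj` and `radj` are adjacency lists for the forward and reversed edges, `active` is the set of vertices that have not yet formed a component, the smallest active vertex stands in for the random center point, and the names `trim_once` and `data_parallel_stage` are hypothetical.

```python
from collections import deque

def reach(start, edges, active):
    """Breadth-first traversal from `start`, restricted to `active` vertices."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in edges[u]:
            if w in active and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def trim_once(adj, radj, active):
    """One pruning pass: active vertices with in-degree 0 or out-degree 0."""
    trivial = set()
    for u in active:
        out_deg = sum(1 for w in adj[u] if w in active)
        in_deg = sum(1 for w in radj[u] if w in active)
        if in_deg == 0 or out_deg == 0:
            trivial.add(u)
    return trivial

def data_parallel_stage(adj, radj, active, total_vertices, threshold=0.01):
    """Run steps (1)-(7) until more than `threshold` of all vertices are removed."""
    active = set(active)
    scc = {}
    removed = 0
    while active and removed <= threshold * total_vertices:
        v = min(active)                       # stand-in for the random center point
        fwd = reach(v, adj, active)           # step (2)
        bwd = reach(v, radj, active)          # step (3)
        trivial = trim_once(adj, radj, active)  # step (4)
        for u in trivial:
            scc[u] = u                        # each pruned vertex is its own component
        comp = fwd & bwd                      # step (5)
        for u in comp:
            scc[u] = v
        done = trivial | comp
        active -= done
        removed += len(done)
    return scc, active
```

The returned `active` set holds whatever is left once the 1% threshold is crossed; in the text, that remainder is what the task-parallel stage goes on to process.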

Task-parallel strong connected graph detection stage at the non-replicated vertices:

(6) In parallel, check the vertices marked as forward-traversed and the vertices marked as backward-traversed, and divide the vertices that have not yet formed a strong connected graph into three partitions: vertices that were forward-traversed but not backward-traversed, vertices that were backward-traversed but not forward-traversed, and vertices that were neither forward-traversed nor backward-traversed. The traversal marks of all vertices in the three partitions are then cleared.

(7) Arbitrarily select one vertex v from each partition obtained in (6) as a center point. Operations (2) - (7) continue until all vertices in the graph have formed strong connected graphs.
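The three-way split in step (6) works because every remaining component lies entirely inside one of the three partitions, so the partitions can be processed as independent tasks. A minimal sequential sketch, assuming adjacency lists `adj` and `radj` and using a work queue in place of concurrently scheduled GPU tasks (the function names are hypothetical):

```python
from collections import deque

def reach(start, edges, allowed):
    """Breadth-first traversal from `start`, restricted to `allowed` vertices."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in edges[u]:
            if w in allowed and w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def task_parallel_stage(adj, radj, vertices):
    """Return scc, mapping each vertex to the center point of its component."""
    scc = {}
    tasks = deque([set(vertices)])        # each entry is an independent partition
    while tasks:
        part = tasks.popleft()
        if not part:
            continue
        v = min(part)                     # stand-in for an arbitrary center point
        fwd = reach(v, adj, part)         # forward traversal inside the partition
        bwd = reach(v, radj, part)        # backward traversal inside the partition
        for u in fwd & bwd:               # the component of the center point
            scc[u] = v
        # Step (6): forward-only, backward-only, and untouched vertices.
        tasks.extend([fwd - bwd, bwd - fwd, part - fwd - bwd])
    return scc
```

Because each partition excludes the center point's own component, every task strictly shrinks and the queue eventually empties.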

The ninth step: collecting and outputting the results:

The strong connected graph detection results are collected and output. As shown in fig. 5, all vertices belonging to the same strong connected graph are identified by replacing each original vertex number with the number of the center point of the strong connected graph to which the vertex belongs.
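With each vertex relabelled by its center point, summarising the result reduces to counting labels. A small sketch, assuming the labels are held in a list `center_of` indexed by vertex number (a hypothetical name):

```python
from collections import Counter

def summarize(center_of):
    """Return (number of components, {center point: component size})."""
    sizes = Counter(center_of)
    return len(sizes), dict(sizes)
```

For instance, `summarize([0, 0, 0, 3, 3])` reports two components, one of size 3 centered at vertex 0 and one of size 2 centered at vertex 3.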

Experimental data and conclusions

The experimental comparison provided by the invention considers several aspects: the total running time, the influence of graph scale, the influence of edge density, and the influence of the number of replicated vertices. The advantages of the present method are illustrated by the experimental analysis below.

Four algorithms are compared: (1) Tarjan's algorithm: the most classic and effective serial strong connected graph detection algorithm at present; (2) the FB-Trim algorithm: a classic parallel strong connected graph detection algorithm on which many other parallel algorithms are based; (3) FB-Hybrid: a parallel strong connected graph detection algorithm for a single GPU that improves on the FB-Trim algorithm; (4) MG-Hybrid: the parallel strong connected graph detection algorithm based on multiple GPUs proposed by the invention. All comparison experiments divide the graph data into two partitions. The value of the invention is demonstrated by comparison experiments on generated graph data of different scales and with different parameters. The R-MAT graph generation scheme is used to generate the required graph data; it takes four probability values that sum to 1 as parameters for adjusting the degree distribution of the vertices. Details of some of the graphs used in the experiments are as follows:

Table 1. Details of the experimental graphs

Graph name	Number of vertices	Number of edges	R-MAT parameters
Graph-A	1,000	10,000	(0.25; 0.25; 0.25; 0.25)
Graph-B	10,000	100,000	(0.25; 0.25; 0.25; 0.25)
Graph-C	100,000	1,000,000	(0.25; 0.25; 0.25; 0.25)
Graph-D	1,000,000	2,000,000	(0.25; 0.25; 0.25; 0.25)
Graph-E	1,000,000	5,000,000	(0.25; 0.25; 0.25; 0.25)
Graph-F	1,000,000	5,000,000	(0.45; 0.15; 0.15; 0.25)
Graph-G	1,000,000	10,000,000	(0.25; 0.25; 0.25; 0.25)
Graph-H	1,000,000	10,000,000	(0.45; 0.15; 0.15; 0.25)
Graph-I	1,000,000	10,000,000	(0.4; 0.2; 0.2; 0.2)
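For readers unfamiliar with R-MAT, the generation scheme referred to above can be sketched as follows. Each edge is placed by repeatedly choosing one of the four quadrants of the adjacency matrix with the four probabilities, one quadrant choice per bit of the vertex numbers. The text does not say how vertex counts that are not powers of two were handled, so rejecting out-of-range endpoints is an assumption of this sketch, as are the function name, the seed parameter, and the decision to keep duplicate edges and self-loops.

```python
import random

def rmat_edges(num_vertices, num_edges, probs=(0.25, 0.25, 0.25, 0.25), seed=0):
    """Generate directed edges (src, dst) with the R-MAT quadrant recursion."""
    a, b, c, d = probs
    assert abs(a + b + c + d - 1.0) < 1e-9, "the four probabilities must sum to 1"
    # Number of bits needed to address every vertex number below num_vertices.
    scale = max(1, (num_vertices - 1).bit_length())
    rng = random.Random(seed)
    edges = []
    while len(edges) < num_edges:
        src = dst = 0
        for _ in range(scale):
            r = rng.random()
            src <<= 1
            dst <<= 1
            if r < a:
                pass                  # top-left quadrant: both bits 0
            elif r < a + b:
                dst |= 1              # top-right quadrant
            elif r < a + b + c:
                src |= 1              # bottom-left quadrant
            else:
                src |= 1              # bottom-right quadrant
                dst |= 1
        # Assumption: discard edges whose endpoints fall outside the vertex range.
        if src < num_vertices and dst < num_vertices:
            edges.append((src, dst))
    return edges
```

A call such as `rmat_edges(1000, 10000)` yields a graph with the size and parameters listed for Graph-A; skewed parameters such as `(0.45, 0.15, 0.15, 0.25)` concentrate edges on low-numbered vertices and produce a less uniform degree distribution.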

1) Comparison of Overall Performance

The running time of each algorithm is compared on all the graph data in Table 1; each algorithm is run 10 times and the results are averaged. Because the graph sizes differ greatly, so do the running times. To show the comparison clearly, the running times of all algorithms are normalized by the running time of the Tarjan algorithm on the same graph data, and the result is shown in FIG. 6. As can be seen from FIG. 6, for small-scale graph data (Graph-A and Graph-B), none of the 3 parallel strong connected graph detection algorithms is as fast as the serial algorithm. This is mainly because the two graphs have few vertices and edges, so many GPU threads are idle and the advantages of the parallel algorithms are not fully exploited. Nevertheless, on Graph-B the proposed algorithm is still 1.3 times faster than the other two parallel algorithms. On Graph-C, with 100K vertices, the parallel algorithms are already significantly faster than the serial Tarjan algorithm: the FB-Trim and FB-Hybrid algorithms achieve speedups of 3.36 times and 3.80 times respectively, and the invention achieves a speedup of 5.56 times. On the remaining 6 large graphs, the parallel capability of the GPU is fully utilized, and FIG. 6 shows that the parallel algorithms can run more than ten or even twenty times faster than the serial algorithm. In particular, the proposed algorithm achieves speedups of 20 times and 22 times on Graph-E and Graph-D respectively. According to Table 1, the distribution of these two graphs is (0.25; 0.25; 0.25; 0.25), so the invention accelerates uniformly distributed graph data particularly well.

2) Graph size impact test

To further analyze the effect of graph size on each strong connected graph detection algorithm, comparative experiments were performed on graph data of different scales. The graph data are divided into 5 levels by number of vertices: 1K, 10K, 100K, 1M, and 10M. The vertex distribution of each graph is (0.25; 0.25; 0.25; 0.25), the number of edges is 10 times the number of vertices, and the number of first-type replicated vertices is within 100. Fig. 7 shows the running time of each algorithm on graph data of different sizes; to show the growth in running time clearly at each scale, the natural logarithm of the actual running time is used as the vertical axis. As can be seen from fig. 7, on small-scale graph data the serial algorithm is significantly faster than the parallel algorithms, mainly because the graph has too few vertices and a large number of GPU threads are idle. However, the running time of the serial algorithm is linear in the size of the graph, and as the number of vertices increases it grows significantly faster than that of the parallel algorithms. When the graph has about 100K vertices, the running time of the serial algorithm is approximately equal to that of the parallel algorithms; when the graph has more than 1M vertices, the three parallel algorithms are about 10 times faster than the serial algorithm. Except on the graphs with 1K and 1M vertices, the proposed algorithm has the shortest running time. The slight slowdown on those two graphs is caused by differences in graph structure between the partitions.

3) Edge density impact test

To test how the running time of each algorithm changes as the edge density of the graph data increases, comparative experiments were performed on 10 graphs whose vertex distributions are all (0.25; 0.25; 0.25; 0.25) and whose number of vertices is 100K, with the number of edges varying from 100K to 1M. To show the comparison clearly, the running time of each algorithm on the graph with 100K edges is used to normalize its running time on the other graphs. As shown in fig. 8, the running time of the serial algorithm rises significantly as the edge density increases. For the three parallel algorithms, once the graph has more than 500K edges the running time is close to half of the running time on the graph with 100K edges, and it tends to a steady state. However, when the number of edges is 200K, the running time roughly doubles, mainly because the internal structure of the graph data varies widely when the edge density is small. Among the parallel algorithms, the FB-Hybrid algorithm shows the smallest fluctuation in running time. Because the graph structure has a large influence on graph partitioning, an algorithm that performs strong connected graph detection on multiple GPUs is not stable when the edge density of the graph data is small. When the number of vertices in the graph data exceeds 100K, the time spent by the algorithm hardly changes. Thus, the algorithm becomes more stable as the edge density of the graph data increases.

4) Duplicate vertex impact testing

This section tests the effect of the number of replicated vertices on the strong connected graph detection speed of the invention. Four graphs with vertex distribution (0.25; 0.25; 0.25; 0.25) and 100K vertices each were used. The number of replicated vertices in each partition is controlled through the number of connecting edges between partitions; the graphs have 400K, 600K, 800K, and 1M connecting edges respectively. The running time of the invention on graph data with 100K connecting edges is used to normalize the running time in the other cases. The experimental results in fig. 9 show that the number of replicated vertices has little effect on the running time of the proposed algorithm. When the number of replicated vertices increases by about 10 times, the time spent on Graph-400K grows the most, but the increase is still within 3 times. On the other three graphs the increase is within 2 times. On Graph-1M in particular, the running time of the algorithm tends to decrease as the number of replicated vertices increases. Finding a partitioning method that produces fewer replicated vertices is therefore also a future focus of the invention.

The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit the same, and a person skilled in the art can modify the technical solution of the present invention or substitute the same without departing from the spirit and scope of the present invention, and the scope of the present invention should be determined by the claims.

Claims

1. A strong connected graph detection method based on multiple GPUs is characterized by comprising the following steps:

loading graph data and unifying storage formats;

returning the copied edges to a CPU, detecting the strong connection graph by using a Tarjan algorithm and marking the vertexes belonging to the same strong connection graph;

and returning the marked vertices to the plurality of GPUs, all the GPUs simultaneously executing strong connected graph detection in parallel by using an FB-Reinfore algorithm, wherein the detection comprises: strong connected graph detection at replicated vertices, data-parallel strong connected graph detection at non-replicated vertices, and task-parallel strong connected graph detection at non-replicated vertices.

2. The method of claim 1, wherein the graph data includes the number of vertices, the number of edges, and the numbers of the head and tail vertices of each edge.

3. The method of claim 1, wherein storing graph data uniformly in CSR format comprises the steps of:

4. The method of claim 1, wherein the step of pre-processing comprises:

5. The method of claim 4, wherein performing breadth-first traversal centered on the replicated vertex and recording replicated side information runs in parallel for all GPUs, comprising the steps of:

6. The method of claim 1, wherein detecting strong connectivity graphs and labeling vertices belonging to the same strong connectivity graph using the Tarjan algorithm comprises the steps of:

7. The method of claim 1, wherein strong connectivity graph detection at a replication vertex comprises the steps of:

8. The method of claim 1, wherein the detection of the data-parallel strong connectivity graph at the non-replicated vertices comprises the steps of:

9. The method of claim 1, wherein task-parallel strong connectivity graph detection at non-replication vertices comprises the steps of: