CN116186339A - Parallel graph calculation processing method and device, electronic equipment and storage medium - Google Patents

Parallel graph calculation processing method and device, electronic equipment and storage medium

Info

Publication number
CN116186339A
CN116186339A (application CN202310305669.0A)
Authority
CN
China
Prior art keywords
vertex, source, graph, target, vertex set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310305669.0A
Other languages
Chinese (zh)
Inventor
张昊涵
康一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority: CN202310305669.0A
Publication: CN116186339A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/901 Indexing; Data structures therefor; Storage structures
    • G06F 16/9024 Graphs; Linked lists
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The disclosure provides a parallel graph computation processing method, apparatus, electronic device, and storage medium, applicable to the computer field. The method comprises the following steps: grouping the graph data set for a processing-unit array to obtain a destination vertex set for each row of processing units and a plurality of source vertex sets corresponding to each destination vertex set; for each processing unit in each row, storing the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit into the corresponding storage units using a distribution scheduler; reading the storage units and the edge data corresponding to the processing unit using a computation module, to obtain the initial vertex weights of the source and destination vertices of each vertex set and the edge weights of the edge data; determining, with the computation module and based on the edge weights and vertex weights, single-source shortest paths between the source and destination vertices, to obtain target vertex weights for the sub-graph destination vertex set and the overlap vertex set; and writing the target vertex weights back to the storage units using the computation module.

Description

Parallel graph calculation processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computers, and in particular, to a parallel graph computing processing method, apparatus, electronic device, storage medium, and program product.
Background
Graph computation is widely used on the Internet, because graph data structures express the relationships between data well. Graph computation is usually handled by a central processing unit (CPU); during processing, the volume of data is large, the degree of association between data items is low, and memory accesses are irregular.
In the course of arriving at the disclosed concept, the inventors found that the related art has at least the following problem: because graph-algorithm memory accesses are irregular, data access efficiency is low and graph computation is slow.
Disclosure of Invention
In view of the foregoing, the present disclosure provides a parallel graph computation processing method, apparatus, electronic device, storage medium, and program product.
According to a first aspect of the present disclosure, there is provided a parallel graph computation processing method, comprising: grouping the graph data set for a processing-unit array based on a preset grouping rule, to obtain a destination vertex set for each row of processing units in the array and a plurality of source vertex sets corresponding to each destination vertex set; for each processing unit in each row, storing the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit into the corresponding storage units using a distribution scheduler in the processing unit, where the overlap vertex set consists of the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set; reading the storage units and the edge data corresponding to the processing unit in the graph data set using a computation module in the processing unit, to obtain the initial vertex weights of the source and destination vertices of each vertex set in the storage units and the edge weights of the edge data; determining, with the computation module and based on the edge weights and vertex weights, single-source shortest paths between the source and destination vertices, to obtain target vertex weights for the sub-graph destination vertex set and the overlap vertex set; and writing the target vertex weights back to the corresponding storage units using the computation module.
According to an embodiment of the present disclosure, grouping the graph data set for the processing-unit array based on a preset grouping rule to obtain a destination vertex set for each row of processing units and a plurality of source vertex sets corresponding to each destination vertex set includes: uniformly grouping a plurality of destination vertices in the graph data set by a hash grouping rule based on their vertex numbers, to obtain the destination vertex set for each row of processing units in the array; and uniformly grouping the source vertices corresponding to each destination vertex set based on the total in-degree of the destination vertices in that set and a preset number of groups, to obtain the plurality of source vertex sets.
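The hash-grouping step described above can be sketched as follows. The function name and the simple modulo rule are illustrative assumptions, not details taken from the disclosure, which only requires that destination vertices be grouped uniformly by vertex number.

```python
def hash_group_destinations(vertex_ids, num_rows):
    """Evenly assign destination vertices to PE rows by vertex number.

    A plain modulo hash is assumed here for illustration.
    """
    groups = [[] for _ in range(num_rows)]
    for v in vertex_ids:
        groups[v % num_rows].append(v)
    return groups

# 8 destination vertices spread over a 4-row PE array: 2 vertices per row.
rows = hash_group_destinations(range(8), 4)
```

Because the hash depends only on the vertex number, each row's destination set can be recomputed anywhere without a lookup table.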
According to an embodiment of the present disclosure, uniformly grouping the source vertices corresponding to a destination vertex set based on the total in-degree and the preset number of groups includes: calculating a target out-degree for each source vertex set based on the total in-degree and the preset number of groups; and uniformly grouping the source vertices based on that target out-degree, to obtain the plurality of source vertex sets.
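One way to realize this degree-balanced grouping is a greedy pass that closes a set once its accumulated out-degree reaches the per-set target; the function and its greedy policy are assumptions for illustration, not the patent's prescribed algorithm.

```python
def group_sources_by_degree(src_out_degree, total_in_degree, num_groups):
    """Greedily pack source vertices into num_groups sets whose summed
    out-degree approaches the per-set target (total in-degree / num_groups)."""
    target = total_in_degree / num_groups
    groups, current, acc = [], [], 0
    for v, deg in src_out_degree.items():
        current.append(v)
        acc += deg
        # close the current set once it reaches the target out-degree,
        # keeping the last set open for any remaining vertices
        if acc >= target and len(groups) < num_groups - 1:
            groups.append(current)
            current, acc = [], 0
    groups.append(current)
    return groups

# 4 source vertices with out-degrees 3, 2, 4, 1; total in-degree 10; 2 groups.
sets = group_sources_by_degree({1: 3, 2: 2, 3: 4, 4: 1}, 10, 2)
```

Each resulting set carries roughly the same number of edges, so the processing units in a row receive balanced work.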
According to an embodiment of the present disclosure, storing the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to a processing unit into the corresponding storage units using the distribution scheduler includes: storing the sub-graph destination vertex set into an on-chip static destination storage unit, storing the sub-graph source vertex set into an on-chip static source storage unit, and storing the overlap vertex set into an off-chip static overlap storage unit.
According to an embodiment of the disclosure, writing the target vertex weights back to the corresponding storage units with the computation module includes: updating the target vertex weights for the sub-graph destination vertex set into the on-chip static destination storage unit; and, when the target vertex weight of a vertex in the overlap vertex set is smaller than its initial vertex weight, updating that vertex's weight into the off-chip static overlap storage unit.
According to an embodiment of the present disclosure, the method further includes: when every processing unit has completed its vertex-weight updates, selecting the fastest-converging processing unit in each row and storing the iteration destination vertex set and iteration source vertex set corresponding to it into an off-chip storage unit; for each iteration destination vertex set and iteration source vertex set in the off-chip storage unit, performing iterative processing with the computation module to obtain a new iteration destination vertex set and a new iteration source vertex set; and, when the difference between the vertex weights of corresponding vertices in the sets produced by the i-th and (i+1)-th iterations is smaller than a preset value, taking the new iteration destination vertex set and new iteration source vertex set of the i-th iteration as the graph computation result.
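The convergence test above amounts to iterating until no vertex weight changes by more than the preset value between rounds. A minimal sketch, with an assumed interface (a `step` callable producing the next round's weights):

```python
def iterate_until_converged(step, weights, eps):
    """Repeat the per-round update `step` until the largest per-vertex
    weight change between consecutive rounds drops below eps."""
    rounds = 0
    while True:
        new_weights = step(weights)
        if max(abs(new_weights[v] - weights[v]) for v in weights) < eps:
            return new_weights, rounds
        weights = new_weights
        rounds += 1

# toy update that shrinks every weight toward zero, so iteration terminates
final, rounds = iterate_until_converged(
    lambda w: {v: x / 2 for v, x in w.items()}, {"a": 8.0, "b": 4.0}, 0.5
)
```

In the disclosed method the per-round update is the processing-unit array's weight pass; the halving step here only stands in to show the stopping rule.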
A second aspect of the present disclosure provides a parallel graph computation processing apparatus, comprising: a grouping module for grouping the graph data set for the processing-unit array based on a preset grouping rule, to obtain a destination vertex set for each row of processing units in the array and a plurality of source vertex sets corresponding to each destination vertex set; a storage module for storing, for each processing unit in each row, the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit into the corresponding storage units using a distribution scheduler in the processing unit, where the overlap vertex set consists of the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set; a reading module for reading the storage units and the edge data corresponding to the processing unit in the graph data set using a computation module in the processing unit, to obtain the initial vertex weights of the source and destination vertices of each vertex set in the storage units and the edge weights of the edge data; a calculation module for determining, based on the edge weights and vertex weights, single-source shortest paths between the source and destination vertices, to obtain target vertex weights for the sub-graph destination vertex set and the overlap vertex set; and an updating module for writing the target vertex weights back to the corresponding storage units.
A third aspect of the present disclosure provides an electronic device, comprising: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method.
A fourth aspect of the present disclosure also provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-described method.
A fifth aspect of the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above method.
According to the parallel graph computation processing method, apparatus, electronic device, storage medium, and program product of the present disclosure, the graph data set for the processing-unit array is grouped based on a preset grouping rule, so that during traversal the data in the graph data set can be accessed sequentially according to the grouping result; no random address accesses are involved, which increases access speed. During processing, the vertex weights of both destination and source vertices are updated, so the single-source shortest path can be determined more quickly; and by maintaining an overlap vertex set, the weights of overlapping vertices can also be determined during the update, further increasing the speed of graph computation.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be more apparent from the following description of embodiments of the disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario diagram of a parallel graph computation processing method, apparatus, electronic device, storage medium, and program product according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a parallel graph computation processing method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a flow chart of a parallel graph computation processing method according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a diagram of a graph dataset grouping, according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a block diagram of a parallel graph computing processing apparatus according to an embodiment of the present disclosure;
fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a parallel graph computation processing method according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where an expression like "at least one of A, B and C" is used, it should generally be interpreted as commonly understood by those skilled in the art (e.g., "a system having at least one of A, B and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, or A, B and C together).
In the technical solution of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including, but not limited to, users' personal information) all comply with the relevant laws and regulations, necessary security measures are taken, and public order and good customs are not violated.
Classical graph algorithms such as breadth-first search (BFS), single-source shortest paths (SSSP), and PageRank are widely used on the Internet.
The graph data sets processed by graph algorithms are large, loosely associated, and irregularly accessed. A graph data set is a collection of vertices and edges; one graph data set may be stored as an adjacency matrix, an adjacency list, or directly as an edge list. Different storage methods essentially correspond to different execution strategies; in general, execution strategies can be roughly divided into vertex-centric processing and edge-centric processing. The former traverses the vertices and visits each destination vertex of a vertex in turn, and favors adjacency-matrix or adjacency-list storage; the latter traverses the edges and accesses the source vertex and destination vertex of each edge, and favors edge-list storage.
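The two strategies can be contrasted in a short sketch; the function names and the tuple layout of the edge list are assumptions for illustration, and both traversals visit the same edges of the same graph.

```python
def vertex_centric(adj, visit):
    """Traverse per vertex: follow each vertex's adjacency list in turn."""
    for u, neighbors in adj.items():
        for v, w in neighbors:
            visit(u, v, w)

def edge_centric(edges, visit):
    """Traverse a flat edge list; the list is fixed across iterations,
    so multiple cores can each scan a disjoint slice independently."""
    for u, v, w in edges:
        visit(u, v, w)

adj = {0: [(1, 5)], 1: [(2, 3)]}    # adjacency-list storage
edges = [(0, 1, 5), (1, 2, 3)]      # edge-list storage of the same graph
seen_v, seen_e = [], []
vertex_centric(adj, lambda u, v, w: seen_v.append((u, v, w)))
edge_centric(edges, lambda u, v, w: seen_e.append((u, v, w)))
```

The edge-centric loop touches only a flat array, which is why the following paragraph favors it for multi-core hardware.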
For architectures that pursue multi-core parallelism, an edge-centric strategy is clearly more appropriate, because the edge list is fixed during graph-algorithm iteration (only the vertex information is continually updated), so multiple cores can independently iterate over different edge lists. When processing large-scale graph data sets, the common practice is to divide a graph into multiple sub-graphs by grouping its vertices; the usual approach is uniform grouping by vertex count, so that every vertex set has the same number of vertices.
In the related art, a central processing unit (CPU) is generally used for graph computation, but it runs into bottlenecks in both parallelism and memory access. When a graphics processing unit (GPU) is used instead, it far exceeds the CPU in parallelism, but video memory is still essentially memory, and the bottleneck caused by irregular accesses remains. In practice, the conventional cache structure is extremely inefficient for graph algorithms: memory accesses are discontinuous and spatial locality is almost completely lost, so memory systems designed for general-purpose computing are no longer suitable.
In view of this, embodiments of the present disclosure provide a parallel graph computation processing method, a parallel graph computation processing apparatus, an electronic device, a readable storage medium, and a computer program product. The parallel graph computation processing method comprises the following steps: grouping the graph data set for a processing-unit array based on a preset grouping rule, to obtain a destination vertex set for each row of processing units in the array and a plurality of source vertex sets corresponding to each destination vertex set; for each processing unit in each row, storing the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit into the corresponding storage units using a distribution scheduler in the processing unit, where the overlap vertex set consists of the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set; reading the storage units and the edge data corresponding to the processing unit in the graph data set using a computation module in the processing unit, to obtain the initial vertex weights of the source and destination vertices of each vertex set in the storage units and the edge weights of the edge data; determining, with the computation module and based on the edge weights and vertex weights, single-source shortest paths between the source and destination vertices, to obtain target vertex weights for the sub-graph destination vertex set and the overlap vertex set; and writing the target vertex weights back to the corresponding storage units using the computation module.
Fig. 1 schematically illustrates an application scenario diagram of a parallel graph computation processing method, an apparatus, an electronic device, a storage medium, and a program product according to an embodiment of the present disclosure.
As shown in fig. 1, an application scenario 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The user may interact with the server 105 through the network 104 using at least one of the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages, etc. Various communication client applications, such as a shopping class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, and the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (by way of example only) providing support for websites browsed by the user using the first terminal device 101, the second terminal device 102, and the third terminal device 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the parallel graph computation processing method provided in the embodiments of the present disclosure may be generally executed by the server 105. Accordingly, the parallel graph computation processing apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The parallel graph calculation processing method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the parallel graph computation processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
The parallel graph calculation processing method of the disclosed embodiment will be described in detail below with reference to the scenario described in fig. 1 through fig. 2 to 4.
Fig. 2 schematically illustrates a flow chart of a parallel graph computation processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S201 to S205.
In operation S201, the graph data set for the processing-unit array is grouped based on a preset grouping rule, to obtain a destination vertex set for each row of processing units in the array and a plurality of source vertex sets corresponding to each destination vertex set.

In operation S202, for each processing unit in each row, the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit are stored into the corresponding storage units using the distribution scheduler in the processing unit, where the overlap vertex set consists of the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set.

In operation S203, the computation module in the processing unit reads the storage units and the edge data corresponding to the processing unit in the graph data set, to obtain the initial vertex weights of the source and destination vertices of each vertex set in the storage units and the edge weights of the edge data.

In operation S204, single-source shortest paths between the source and destination vertices are determined by the computation module based on the edge weights and vertex weights, to obtain target vertex weights for the sub-graph destination vertex set and the overlap vertex set.

In operation S205, the computation module writes the target vertex weights back to the corresponding storage units.
According to an embodiment of the present disclosure, the processing-unit array is composed of a plurality of processing elements (PEs), for example a 4×4 PE array. The graph data set is obtained from an off-chip storage unit and grouped for the processing-unit array based on a preset grouping rule. A graph data set is a collection of vertices and edges, including the vertex weight of each vertex and the edge weight of each edge. By grouping the graph data set, the destination vertex sets within each row of the processing-unit array are identical, and the plurality of source vertex sets corresponding to each row's destination vertex set are contiguous. As a result, no random address accesses are involved during access, which increases access speed.
According to an embodiment of the disclosure, for each processing unit in each row, the sub-graph destination vertex set, sub-graph source vertex set, and overlap vertex set corresponding to that processing unit are stored into the corresponding storage units using the distribution scheduler in the processing unit. The distribution scheduler includes a scheduling module, a merging module, and a storage unit that holds the processing unit's execution information. The scheduling module reads vertex information from the off-chip storage unit and distributes it to the corresponding storage units. After all processing units in the same row have finished executing, the merging module reads the destination vertices from the storage units and sends them back to the off-chip storage unit.
According to embodiments of the present disclosure, each vertex in the graph data set may act as either a destination vertex or a source vertex, so the destination vertex set and source vertex set produced by grouping may intersect. When the destination vertex set and source vertex set share vertices, some source vertices need to be updated in addition to the destination vertices, so one or more storage units accessible to the processing units are introduced to hold the data of these overlapping vertices.
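The overlap vertex set is simply the intersection of one PE's destination and source sets; a minimal sketch (function name assumed for illustration):

```python
def overlap_set(dst_set, src_set):
    """Vertices that appear both as destinations and as sources for one PE."""
    return dst_set & src_set

# vertex 3 is both a destination and a source: its updated weight must stay
# visible to later source-side reads, so it goes into the shared overlap store.
shared = overlap_set({1, 2, 3}, {3, 4})
```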
According to the embodiment of the disclosure, the computation module in a processing unit reads the storage units to obtain the initial vertex weights of the source and destination vertices of each vertex set, and reads the edge data corresponding to the processing unit in the graph data set to obtain the edge weights of that edge data. Depending on the algorithm, a vertex or edge weight may represent a distance in the single-source shortest path algorithm, a page rank (PR value) in the PageRank algorithm, or a level number in the breadth-first search algorithm.
According to the embodiment of the disclosure, based on the edge weights and vertex weights, the computation module determines single-source shortest paths between the source and destination vertices, obtaining target vertex weights for the sub-graph destination vertex set and the overlap vertex set. For example, suppose source vertex 1 has weight 10, destination vertex 10 has initial weight 100, and the edge between them has weight 50. The single-source shortest path from vertex 1 to vertex 10 is then the sum of vertex 1's weight and the edge weight, i.e. 60. Since 60 is smaller than 100, the target vertex weight of vertex 10 is determined to be 60 and its weight is updated.
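The update in this example is the standard edge-relaxation rule; a one-line sketch reproducing the numbers above (the function name is illustrative):

```python
def relax(src_weight, edge_weight, dst_weight):
    """Keep the shorter of the known distance and the path via the source."""
    return min(dst_weight, src_weight + edge_weight)

# the numbers from the example: source vertex weight 10, edge weight 50,
# destination initial weight 100; the path via the source wins with 60
new_weight = relax(10, 50, 100)
```

When the destination's existing weight is already the smaller one, the rule leaves it unchanged.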
According to the parallel graph computation processing method, apparatus, electronic device, storage medium, and program product of the present disclosure, the graph data set for the processing-unit array is grouped based on a preset grouping rule, so that during traversal the data in the graph data set can be accessed sequentially according to the grouping result; no random address accesses are involved, which increases access speed. During processing, the vertex weights of both destination and source vertices are updated, so the single-source shortest path can be determined more quickly; and by maintaining an overlap vertex set, the weights of overlapping vertices can also be determined during the update, further increasing the speed of graph computation.
Fig. 3 schematically illustrates a flow chart of a parallel graph computation processing method according to another embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in fig. 3, the destination vertex sets and source vertex sets in an off-chip memory unit 301 (OFF_CHIP MEMORY) are input into a distribution scheduler 302 (Control Unit) of a processing unit (PE) in a vertex-centric manner. The distribution scheduler 302 includes a scheduling module 303 (Dispatcher) and a merge module 304 (Merger), and further includes a storage unit 305 (MEM_INFO, memory information SRAM) for storing processing unit execution information. The scheduling module 303 is configured to read vertex information from the off-chip storage unit, allocate source vertices to an on-chip static source storage unit 306 (SRC_SRAM), and allocate destination vertex sets to an on-chip static destination storage unit 307 (DST_SRAM). The on-chip static source storage unit 306 and the on-chip static destination storage unit 307 are each composed of a plurality of static banks (SRAM banks), which store the source vertices and destination vertices from the off-chip storage unit 301, respectively.
According to an embodiment of the present disclosure, the edge data in the local dynamic storage unit 308 (Local DRAM) corresponding to the processing unit is input into the computing module 309 (CAL) of the processing unit in an edge-centric manner. Each edge datum includes a source vertex (SRC), a destination vertex (DST), and an edge weight (weight). The computing module 309 uses the edge data to access the on-chip static source storage unit 306 and the on-chip static destination storage unit 307, completes the corresponding weight calculation, and writes the result back to the on-chip static destination storage unit 307.
According to an embodiment of the present disclosure, the processing units further share one off-chip static coincidence storage unit 310 (SRAM for overlap) accessible to every PE, and the distribution scheduler 302 inputs the coincident vertices into the off-chip static coincidence storage unit 310. The computing module 309 uses the edge data to access the off-chip static coincidence storage unit 310, completes the corresponding weight calculation, and writes the result back to the off-chip static coincidence storage unit 310.
According to an embodiment of the present disclosure, grouping a graph dataset of a processing unit array based on a preset grouping rule to obtain a destination vertex set of each row of processing units in the processing unit array and a plurality of source vertex sets corresponding to each destination vertex set may include the following operations:
based on vertex numbers of a plurality of destination vertices in the graph dataset, uniformly grouping the destination vertices by using a hash grouping rule to obtain a destination vertex set of each row of processing units in the processing unit array; and based on the total input degree of the destination vertices in each destination vertex set and a preset group number, uniformly grouping a plurality of source vertices corresponding to the destination vertex set to obtain a plurality of source vertex sets.
According to the embodiment of the disclosure, the destination vertex sets are grouped uniformly by vertex count, that is, the destination vertex set of each row of PEs contains the same number of destination vertices. Hash grouping is performed on the vertex numbers of the destination vertices: assuming a vertex is numbered x and N destination vertex sets are planned, the vertex belongs to the (x mod N)-th vertex set. Hash grouping prevents vertices with high input degree that occur consecutively in the original dataset from making the input degree of a single vertex set excessively high.
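The (x mod N) hash-grouping rule can be sketched as follows (Python; names are illustrative):

```python
def hash_group_destinations(vertex_ids, num_groups):
    """Assign destination vertex x to the (x mod N)-th destination vertex set."""
    groups = [[] for _ in range(num_groups)]
    for x in vertex_ids:
        groups[x % num_groups].append(x)
    return groups

# Eight vertices, N = 4: vertices 0 and 4 land in set 0, 1 and 5 in set 1, ...
groups = hash_group_destinations(range(8), 4)
```

Because consecutive vertex numbers are scattered across sets, a run of high-degree vertices in the original numbering cannot all end up in the same destination vertex set.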
According to the embodiment of the disclosure, the source vertices are grouped uniformly by edges: based on the total input degree of the destination vertices in each destination vertex set and the preset group number, the plurality of source vertices corresponding to the destination vertex set are grouped uniformly to obtain a plurality of source vertex sets, so that the edge sets of the sub-graphs processed by the PEs in the same row have the same scale.
Fig. 4 schematically illustrates a diagram of a graph dataset grouping according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, based on a total input number of destination vertices and a preset group number in each destination vertex set, a plurality of source vertices corresponding to the destination vertex set are uniformly grouped to obtain a plurality of source vertex sets, which may include the following operations:
calculating a target output number for each source vertex set based on the total input number and the preset group number; and uniformly grouping the plurality of source vertices based on the target output number to obtain a plurality of source vertex sets.
According to the embodiment of the disclosure, graph datasets exhibit a power-law distribution. If the traditional method of uniformly grouping the source vertices by vertex count were used, the large differences in output number between vertices would make the edge lists of the sub-graphs differ greatly in size, so the execution times of different PEs in the same row would diverge widely, and a PE that finishes first would have to wait for the last PE to finish before destination vertex data could be synchronized. This embodiment therefore keeps the method of uniformly grouping the destination vertex sets by vertex count, but changes the grouping method for the source vertex sets: for each destination vertex set, the total output number of all source vertices toward that destination vertex set is obtained and averaged, so that the output number of each source vertex set toward the destination vertex set equals the average. Because this output number equals the number of edges of the sub-graph, the edge lists iterated by the PEs are controlled to the same scale.
According to the embodiment of the disclosure, as shown in fig. 4, assume the processing unit array is a 4*4 PE array containing 16 processing units PEx,y, where each PE corresponds to position (x, y). Dx denotes the destination vertex set corresponding to the x-th row of PEs, and Sx,y denotes the y-th source vertex set divided for Dx. Assuming the total input degree of the row is in_dgr and the preset group number is M, the output number of each source vertex set should equal in_dgr/M, so that the output number of every source vertex set toward the row's destination vertex set is the same. In fig. 4, each source vertex set Sx,0 through Sx,M divided for Dx has an output number toward Dx equal to in_dgr/M. This makes the edge sets of the sub-graphs processed by the same row of PEs the same scale, because the number of edges equals the output number from the source vertex set to the destination vertex set, i.e., in_dgr/M.
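Under the stated assumption that the out-degree of each source vertex toward the destination set is known, the edge-uniform grouping can be sketched as follows; the greedy fill below is one possible realization, not necessarily the patent's exact procedure:

```python
def group_sources_by_edges(out_degree_to_dst, num_groups):
    """Partition source vertices so that each source vertex set contributes
    roughly in_dgr / M edges toward the destination vertex set.
    Vertex order is preserved and groups are filled greedily (sketch)."""
    total = sum(out_degree_to_dst.values())   # in_dgr of the destination set
    target = total / num_groups               # target output number per set
    groups, current, current_edges = [], [], 0
    for vertex, degree in out_degree_to_dst.items():
        current.append(vertex)
        current_edges += degree
        # close this group once it reaches the target, keeping one group open
        if current_edges >= target and len(groups) < num_groups - 1:
            groups.append(current)
            current, current_edges = [], 0
    groups.append(current)
    return groups

degrees = {0: 3, 1: 1, 2: 2, 3: 2}          # out-degree toward Dx per source vertex
sets = group_sources_by_edges(degrees, 2)    # two sets of ~4 edges each
```

Each resulting set carries approximately the same number of edges, so the sub-graph edge lists processed by the PEs of one row have the same scale.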
According to an embodiment of the present disclosure, storing, by the distribution scheduler in the processing unit, the sub-graph destination vertex set, the sub-graph source vertex set, and the coincident vertex set corresponding to the processing unit into the corresponding storage units may include the following operations:
and storing the sub-graph destination vertex set into an on-chip static destination storage unit by using a distribution scheduler, storing the sub-graph source vertex set into an on-chip static source storage unit, and storing the coincident vertex set into an off-chip static coincident storage unit.
According to an embodiment of the present disclosure, the processing unit includes the on-chip static destination storage unit and the on-chip static source storage unit, while the off-chip static coincidence storage unit is a storage unit accessible to every processing unit.
According to an embodiment of the disclosure, updating the target vertex weights into the corresponding storage units using the calculation module may include the following operations:
updating the target vertex weights corresponding to the sub-graph destination vertex set into the on-chip static destination storage unit by using the calculation module; and, when the target vertex weight corresponding to a coincident vertex in the coincident vertex set is smaller than its initial vertex weight, updating the target vertex weight corresponding to the coincident vertex set into the off-chip static coincidence storage unit by using the calculation module.
According to the embodiment of the disclosure, the computing module computes the single-source shortest path and updates the resulting target vertex weight of each destination vertex into the on-chip static destination storage unit. A coincident vertex in the coincident vertex set may be either a source vertex or a destination vertex; when the target vertex weight of a coincident vertex is smaller than its initial vertex weight, the target vertex weight corresponding to the coincident vertex set is updated into the off-chip static coincidence storage unit, which speeds up the graph computation.
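The conditional write-back to the off-chip coincidence storage unit can be sketched as follows (Python; the dict standing in for the shared SRAM is an illustrative assumption):

```python
def update_overlap_vertex(overlap_store, vertex, new_weight):
    """Write a coincident vertex back only when the newly computed target
    weight is smaller than the weight currently held for it."""
    current = overlap_store.get(vertex, float('inf'))
    if new_weight < current:
        overlap_store[vertex] = new_weight
        return True    # weight improved, write-back performed
    return False       # no improvement, no write-back

store = {7: 100}
update_overlap_vertex(store, 7, 60)   # 60 < 100, store now holds 60 for vertex 7
update_overlap_vertex(store, 7, 80)   # 80 >= 60, store unchanged
```

Skipping write-backs that would not lower the weight keeps traffic to the shared coincidence storage unit to a minimum.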
According to an embodiment of the present disclosure, the method may further include the operations of:
and under the condition that each processing unit finishes the updating of the vertex weight, selecting the fastest convergence processing unit in the row of processing units, and storing the iteration destination vertex set and the iteration source vertex set corresponding to the fastest convergence processing unit into an off-chip storage unit.
And carrying out iterative processing on the iteration target vertex set and the iteration source vertex set by utilizing a calculation module aiming at each iteration target vertex set and each iteration source vertex set in the off-chip storage unit to obtain a new iteration target vertex set and a new iteration source vertex set.
When, between the new iteration destination vertex sets and new iteration source vertex sets generated in the i-th and (i+1)-th iterations respectively, the difference in vertex weight of each corresponding vertex is smaller than a preset value, the new iteration destination vertex set and new iteration source vertex set corresponding to the i-th iteration are determined as the graph computation result.
According to the embodiment of the disclosure, because uniform grouping by edges gives the sub-graphs processed by the PEs of the same row edge lists of the same scale, the PEs finish execution at similar times, and the time PEs spend waiting for synchronization is greatly reduced. For each row of processing units, once all PEs in the row have finished iterating over their edge lists, destination vertex synchronization starts. Since the destination vertex sets of all PEs in the row are identical, the PEs only need to read the destination vertices in order and write them into the off-chip storage unit. Before the destination vertex sets are written to the off-chip storage unit, the result that makes the algorithm converge fastest is selected as the destination vertex result of this iteration, i.e., the iteration destination vertex set and iteration source vertex set.
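The text does not spell out how "fastest convergence" is measured. As one illustrative proxy (an assumption, not the patent's stated criterion), the row result with the smallest total vertex weight could be chosen, since for shortest-path weights a lower total is closer to the fixed point:

```python
def select_row_result(per_pe_results):
    """Pick one PE's destination-vertex weights to write back for the row.
    Proxy criterion (assumption): smallest total weight, i.e. the result
    whose weights have decreased the most so far."""
    return min(per_pe_results, key=lambda weights: sum(weights.values()))

row = [{1: 5, 2: 9}, {1: 4, 2: 8}, {1: 5, 2: 8}]
best = select_row_result(row)   # the second result, with total weight 12
```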
According to embodiments of the present disclosure, since the goal of graph computation is to iterate until all vertices converge, the vertex result that maximizes the convergence rate is selected for updating. From loading the vertices into the PEs to writing them back to off-chip storage constitutes one multi-sub-graph parallel iteration; these steps are repeated, and when no vertex weight decreases any further, the graph algorithm on the graph dataset has converged.
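The iterate-until-no-weight-decreases loop can be sketched on a single machine as follows (Python; the patent distributes this loop across the PE array, which the sketch omits):

```python
def iterate_until_converged(weights, edges, max_iters=100):
    """Relax every edge repeatedly; stop when no vertex weight decreases,
    which is the convergence criterion described above.
    `weights` maps vertex -> current weight; `edges` holds (src, dst, weight)."""
    for iteration in range(max_iters):
        changed = False
        for src, dst, w in edges:
            candidate = weights[src] + w
            if candidate < weights[dst]:
                weights[dst] = candidate
                changed = True
        if not changed:                   # all weights stopped decreasing
            return weights, iteration     # converged
    return weights, max_iters

INF = float('inf')
w = {1: 0, 2: INF, 3: INF}
final, iters = iterate_until_converged(w, [(1, 2, 5), (2, 3, 2), (1, 3, 10)])
# final holds {1: 0, 2: 5, 3: 7}: vertex 3 is reached via vertex 2 (5 + 2 = 7)
```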
Based on the parallel graph calculation processing method, the disclosure also provides a parallel graph calculation processing device. The device will be described in detail below in connection with fig. 5.
Fig. 5 schematically shows a block diagram of a parallel graph calculation processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 5, the parallel graph computing processing apparatus 500 of this embodiment includes a grouping module 510, a storage module 520, a reading module 530, a computing module 540, and an updating module 550.
The grouping module 510 is configured to group the graph data sets of the processing unit array based on a preset grouping rule, so as to obtain a destination vertex set of each row of processing units in the processing unit array, and a plurality of source vertex sets corresponding to each destination vertex set. In an embodiment, the grouping module 510 may be configured to perform the operation S201 described above, which is not described herein.
The storage module 520 is configured to store, for each processing unit in each row of processing units, the sub-graph destination vertex set, the sub-graph source vertex set, and the coincident vertex set corresponding to the processing unit into the corresponding storage units by using the distribution scheduler in the processing unit, wherein the coincident vertex set is the set formed by the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set. In an embodiment, the storage module 520 may be used to perform the operation S202 described above, which is not described herein.
And the reading module 530 is configured to read the storage unit and the edge data corresponding to the processing unit in the graph data set by using the computing module in the processing unit, so as to obtain initial vertex weights corresponding to the source vertex and the destination vertex of each vertex set in the storage unit, and edge weights of the edge data. In an embodiment, the reading module 530 may be used to perform the operation S203 described above, which is not described herein.
The calculating module 540 is configured to determine a single-source shortest path between the source vertex and the destination vertex by using the calculating module based on the edge weight and the vertex weight, and obtain a destination vertex weight corresponding to the destination vertex set and the coincident vertex set of the sub-graph. In an embodiment, the calculating module 540 may be configured to perform the operation S204 described above, which is not described herein.
And the updating module 550 is configured to update the target vertex weights to the corresponding storage units by using the calculating module. In an embodiment, the update module 550 may be configured to perform the operation S205 described above, which is not described herein.
According to the parallel graph computation processing method, apparatus, electronic device, storage medium and program product of the present disclosure, the graph dataset of the processing unit array is grouped based on a preset grouping rule, so that during data traversal the data in the graph dataset can be accessed sequentially according to the grouping result. No random address access is involved, which increases access speed. During data processing, the vertex weights of the destination and source vertices are updated, so that the single-source shortest path can be determined more quickly; by setting up a coincident vertex set, the weights of the coincident vertices can also be determined during the update process, which further speeds up the graph computation.
According to an embodiment of the present disclosure, the grouping module 510 includes a destination grouping sub-module and a source grouping sub-module.
And the destination grouping sub-module is used for uniformly grouping the destination vertexes by utilizing a hash grouping rule based on vertex numbers of a plurality of destination vertexes in the graph data set to obtain a destination vertex set of each row of processing units in the processing unit array.
And the source grouping sub-module is used for uniformly grouping a plurality of source vertexes corresponding to the target vertex sets based on the total input degree of the target vertexes and the preset group number in each target vertex set to obtain a plurality of source vertex sets.
According to an embodiment of the present disclosure, a source grouping submodule includes a calculation unit and a source grouping unit.
And the calculating unit is used for calculating the target output number of each source vertex set based on the total input number and the preset group number.
And the source grouping unit is used for uniformly grouping the plurality of source vertexes based on the target output number to obtain a plurality of source vertex sets.
According to an embodiment of the present disclosure, the storage module 520 includes a storage sub-module.
And the storage sub-module is used for storing the sub-graph destination vertex set into an on-chip static destination storage unit by utilizing the distribution scheduler, storing the sub-graph source vertex set into the on-chip static source storage unit and storing the coincident vertex set into an off-chip static coincident storage unit.
According to an embodiment of the present disclosure, the update module 550 includes a first update sub-module and a second update sub-module.
And the first updating sub-module is used for updating the target vertex weights corresponding to the target vertex sets of the subgraph into the on-chip static target storage unit by using the computing module.
And the second updating sub-module is used for updating the target vertex weight corresponding to the coincident vertex set into the off-chip static coincident storage unit by using the calculating module under the condition that the target vertex weight corresponding to the coincident vertex in the coincident vertex set is smaller than the initial vertex weight.
According to an embodiment of the present disclosure, the parallel graph computing processing apparatus 500 further includes an off-chip storage module, an iteration module, and a result determination module.
And the off-chip storage module is used for selecting the fastest convergence processing unit in each row of processing units and storing the iteration destination vertex set and the iteration source vertex set corresponding to the fastest convergence processing unit into the off-chip storage unit under the condition that each processing unit completes vertex weight updating.
And the iteration module is used for carrying out iteration processing on the iteration target vertex set and the iteration source vertex set by utilizing the calculation module aiming at each iteration target vertex set and each iteration source vertex set in the off-chip storage unit to obtain a new iteration target vertex set and a new iteration source vertex set.
The result determining module is configured to determine, as a graph calculation result, a new iteration destination vertex set and a new iteration source vertex set corresponding to the ith iteration when a difference between vertex weights of corresponding vertices between the new iteration destination vertex set and the new iteration source vertex set generated in the ith and the (i+1) th iterations is smaller than a preset value.
Any of the grouping module 510, the storage module 520, the reading module 530, the computing module 540, and the updating module 550 may be combined and implemented in one module, or any one of them may be split into a plurality of modules, according to an embodiment of the present disclosure. Alternatively, at least some of the functionality of one or more of these modules may be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of the present disclosure, at least one of the grouping module 510, the storage module 520, the reading module 530, the computing module 540, and the updating module 550 may be implemented at least in part as hardware circuitry, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, or an Application Specific Integrated Circuit (ASIC); by hardware or firmware in any other reasonable manner of integrating or packaging the circuitry; or by any one of, or a suitable combination of, software, hardware, and firmware. Alternatively, at least one of the grouping module 510, the storage module 520, the reading module 530, the computing module 540, and the updating module 550 may be at least partially implemented as a computer program module which, when executed, performs the corresponding functions.
Fig. 6 schematically illustrates a block diagram of an electronic device adapted to implement a parallel graph computation processing method according to an embodiment of the disclosure.
As shown in fig. 6, an electronic device 600 according to an embodiment of the present disclosure includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage section 608 into a Random Access Memory (RAM) 603. The processor 601 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. Processor 601 may also include on-board memory for caching purposes. The processor 601 may comprise a single processing unit or a plurality of processing units for performing different actions of the method flows according to embodiments of the disclosure.
In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are stored. The processor 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604. The processor 601 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 602 and/or the RAM 603. Note that the program may be stored in one or more memories other than the ROM 602 and the RAM 603. The processor 601 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 600 may also include an input/output (I/O) interface 605, which is also connected to the bus 604. The electronic device 600 may also include one or more of the following components connected to the input/output (I/O) interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) display, a speaker, and the like; a storage section 608 including a hard disk and the like; and a communication section 609 including a network interface card such as a LAN card or a modem. The communication section 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the input/output (I/O) interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the present disclosure, the computer-readable storage medium may include ROM 602 and/or RAM 603 and/or one or more memories other than ROM 602 and RAM 603 described above.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowcharts. When the computer program product runs in a computer system, the program code is used for enabling the computer system to realize the parallel graph calculation processing method provided by the embodiment of the disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 601. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed via the communication section 609, and/or installed from the removable medium 611. The computer program may include program code that may be transmitted using any appropriate network medium, including but not limited to wireless, wired, and the like, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Those skilled in the art will appreciate that the features recited in the various embodiments of the disclosure and/or in the claims may be provided in a variety of combinations and/or combinations, even if such combinations or combinations are not explicitly recited in the disclosure. In particular, the features recited in the various embodiments of the present disclosure and/or the claims may be variously combined and/or combined without departing from the spirit and teachings of the present disclosure. All such combinations and/or combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure are described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.

Claims (10)

1. A parallel graph computing processing method comprises the following steps:
grouping a graph dataset of a processing unit array based on a preset grouping rule to obtain a destination vertex set of each row of processing units in the processing unit array and a plurality of source vertex sets corresponding to each destination vertex set;
for each processing unit in each row of processing units, storing a sub-graph destination vertex set, a sub-graph source vertex set and a coincident vertex set corresponding to the processing unit into a corresponding storage unit by using a distribution scheduler in the processing unit, wherein the coincident vertex set is a set formed by the vertices shared between the sub-graph destination vertex set and the sub-graph source vertex set;
respectively reading the storage unit and the edge data corresponding to the processing unit in the graph dataset by using a calculation module in the processing unit, to obtain initial vertex weights corresponding to the source vertices and destination vertices of each vertex set in the storage unit and edge weights of the edge data;
determining a single-source shortest path between the source vertex and the destination vertex by using the calculation module based on the edge weights and the vertex weights, and obtaining target vertex weights corresponding to the sub-graph destination vertex set and the coincident vertex set;
and updating the target vertex weight to a corresponding storage unit by utilizing the calculation module.
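For illustration only, the per-processing-unit shortest-path step recited in claim 1 can be sketched as one Bellman-Ford-style relaxation pass over a sub-graph's edges; all function and variable names below are hypothetical and not part of the claimed apparatus.

```python
def relax_subgraph(edges, weights):
    """One relaxation pass over a sub-graph.

    edges:   list of (src, dst, edge_weight) tuples for this processing unit
    weights: dict mapping vertex id -> current shortest-path weight (mutated)
    Returns the "target" (updated) weights for destination vertices.
    """
    updated = {}
    for src, dst, w in edges:
        candidate = weights[src] + w
        if candidate < weights[dst]:   # shorter path found via src
            weights[dst] = candidate
            updated[dst] = candidate
    return updated

# Tiny example: source vertex 0, edges 0->1 (5), 0->2 (9), 1->2 (3).
INF = float("inf")
weights = {0: 0, 1: INF, 2: INF}
edges = [(0, 1, 5), (0, 2, 9), (1, 2, 3)]
relax_subgraph(edges, weights)
# after one pass vertex 2 is reached more cheaply via 0 -> 1 -> 2 (weight 8)
```

In the claimed design each processing unit would run such a pass only over its own sub-graph edges, writing the updated weights back to its storage unit.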
2. The method of claim 1, wherein grouping the graph dataset of the processing unit array based on the preset grouping rule to obtain the destination vertex set of each row of processing units in the processing unit array and the plurality of source vertex sets corresponding to each destination vertex set comprises:
uniformly grouping a plurality of destination vertices in the graph dataset by using a hash grouping rule, based on the vertex numbers of the destination vertices, to obtain the destination vertex set of each row of processing units in the processing unit array; and
uniformly grouping a plurality of source vertices corresponding to each destination vertex set, based on the total in-degree of the destination vertices in that destination vertex set and a preset number of groups, to obtain the plurality of source vertex sets.
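As an illustrative sketch of the hash grouping rule of claim 2 (names assumed for illustration): destination vertices are spread uniformly over the rows of the processing unit array by hashing their vertex numbers, here with a simple modulo hash.

```python
def hash_group_destinations(vertex_ids, num_rows):
    """Assign each destination vertex to a row of the processing unit array
    by hashing its vertex number (modulo hash for illustration)."""
    groups = [[] for _ in range(num_rows)]
    for v in vertex_ids:
        groups[v % num_rows].append(v)
    return groups

hash_group_destinations(range(8), 4)
# → [[0, 4], [1, 5], [2, 6], [3, 7]]
```

A modulo hash gives each row an equal share of vertices when vertex numbers are dense, which matches the "uniform grouping" recited in the claim.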
3. The method of claim 2, wherein uniformly grouping the source vertices corresponding to the destination vertex set, based on the total in-degree of the destination vertices in each destination vertex set and the preset number of groups, to obtain the plurality of source vertex sets comprises:
calculating a target out-degree for each source vertex set based on the total in-degree and the preset number of groups; and
uniformly grouping the source vertices based on the target out-degree, to obtain the plurality of source vertex sets.
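One plausible reading of claim 3, sketched below under stated assumptions: the per-group edge budget ("target out-degree") is the total in-degree of the destination vertex set divided by the preset group count, and source vertices are then packed greedily until each group reaches that budget. The packing strategy and all names are assumptions for illustration.

```python
import math

def group_sources(out_degree, num_groups):
    """Greedily pack source vertices into num_groups sets of roughly equal
    edge count. out_degree maps source vertex -> edges into the destination set."""
    total = sum(out_degree.values())          # equals the destinations' total in-degree
    budget = math.ceil(total / num_groups)    # target out-degree per source vertex set
    groups, current, filled = [], [], 0
    for v, d in out_degree.items():
        current.append(v)
        filled += d
        if filled >= budget and len(groups) < num_groups - 1:
            groups.append(current)
            current, filled = [], 0
    groups.append(current)
    return groups

group_sources({0: 3, 1: 1, 2: 2, 3: 2}, 2)
# → [[0, 1], [2, 3]]   (4 edges per group, balanced)
```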
4. The method of claim 1, wherein storing the sub-graph destination vertex set, the sub-graph source vertex set and the overlapping vertex set corresponding to the processing unit into the corresponding storage unit by using the distribution scheduler in the processing unit comprises:
storing, by using the distribution scheduler, the sub-graph destination vertex set into an on-chip static destination storage unit, the sub-graph source vertex set into an on-chip static source storage unit, and the overlapping vertex set into an off-chip static overlap storage unit.
5. The method of claim 4, wherein updating the target vertex weights into the corresponding storage unit by using the computation module comprises:
updating, by using the computation module, the target vertex weights corresponding to the sub-graph destination vertex set into the on-chip static destination storage unit; and
updating, by using the computation module, the target vertex weight corresponding to the overlapping vertex set into the off-chip static overlap storage unit when the target vertex weight corresponding to an overlapping vertex in the overlapping vertex set is smaller than the initial vertex weight of that vertex.
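The conditional write-back of claim 5 can be illustrated as follows; the dict-based "store" and all names are hypothetical stand-ins for the off-chip overlap storage unit.

```python
def write_back_overlap(overlap_store, vertex, target_w, initial_w):
    """Write an overlapping vertex's weight off-chip only when the newly
    computed target weight improves on (is smaller than) the initial weight."""
    if target_w < initial_w:
        overlap_store[vertex] = target_w
    return overlap_store

write_back_overlap({}, 7, 3.0, 5.0)  # improved: weight 3.0 is written
write_back_overlap({}, 7, 6.0, 5.0)  # not improved: nothing is written
```

Gating the off-chip write on improvement avoids redundant traffic to the slower overlap storage unit.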
6. The method of claim 1, further comprising:
selecting the fastest-converging processing unit in each row of processing units once every processing unit has completed its vertex weight update, and storing an iteration destination vertex set and an iteration source vertex set corresponding to the fastest-converging processing unit into an off-chip storage unit;
for each iteration destination vertex set and each iteration source vertex set in the off-chip storage unit, performing iterative processing on the iteration destination vertex set and the iteration source vertex set by using the computation module, to obtain a new iteration destination vertex set and a new iteration source vertex set; and
determining the new iteration destination vertex set and the new iteration source vertex set of the i-th iteration as the graph computation result when, between the i-th iteration and the (i+1)-th iteration, the difference in vertex weight of each corresponding vertex in the respective new iteration destination vertex sets and new iteration source vertex sets is smaller than a preset value.
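The stopping condition of claim 6 amounts to a fixed-point test between consecutive iterations; a minimal sketch, with all names assumed for illustration:

```python
def has_converged(prev_weights, next_weights, eps):
    """True when every corresponding vertex weight changed by less than eps
    between two consecutive iterations (the claim's preset value)."""
    return all(abs(next_weights[v] - prev_weights[v]) < eps
               for v in prev_weights)

has_converged({0: 1.0, 1: 2.0}, {0: 1.0005, 1: 2.0}, 1e-3)  # converged
has_converged({0: 1.0, 1: 2.0}, {0: 1.5, 1: 2.0}, 1e-3)     # still changing
```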
7. A parallel graph computation processing apparatus, comprising:
a grouping module configured to group a graph dataset of a processing unit array based on a preset grouping rule, to obtain a destination vertex set for each row of processing units in the processing unit array and a plurality of source vertex sets corresponding to each destination vertex set;
a storage module configured to, for each processing unit in each row of processing units, store a sub-graph destination vertex set, a sub-graph source vertex set and an overlapping vertex set corresponding to the processing unit into a corresponding storage unit by using a distribution scheduler in the processing unit, wherein the overlapping vertex set is the set of vertices shared by the sub-graph destination vertex set and the sub-graph source vertex set;
a reading module configured to read, by using a computation module in the processing unit, the storage unit and the edge data corresponding to the processing unit in the graph dataset, to obtain initial vertex weights corresponding to the source vertices and destination vertices of all vertex sets in the storage unit and the edge weights of the edge data;
a computation module configured to determine a single-source shortest path between the source vertices and the destination vertices based on the edge weights and the vertex weights, to obtain target vertex weights corresponding to the sub-graph destination vertex set and the overlapping vertex set; and
an updating module configured to update, by using the computation module, the target vertex weights into the corresponding storage unit.
8. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any of claims 1-6.
10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 6.
CN202310305669.0A 2023-03-24 2023-03-24 Parallel graph calculation processing method and device, electronic equipment and storage medium Pending CN116186339A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310305669.0A CN116186339A (en) 2023-03-24 2023-03-24 Parallel graph calculation processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN116186339A true CN116186339A (en) 2023-05-30

Family

ID=86446470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310305669.0A Pending CN116186339A (en) 2023-03-24 2023-03-24 Parallel graph calculation processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116186339A (en)

Similar Documents

Publication Publication Date Title
US9003425B2 (en) Optimizing workflow engines
US10423391B2 (en) Agile communication operator
EP3407182B1 (en) Vector computing device
Pradhan et al. Finding all-pairs shortest path for a large-scale transportation network using parallel Floyd-Warshall and parallel Dijkstra algorithms
US8099584B2 (en) Methods for scalably exploiting parallelism in a parallel processing system
US10984073B2 (en) Dual phase matrix-vector multiplication system
KR20100013257A (en) Method and apparatus for partitioning and sorting a data set on a multi-processor system
CN103970604A (en) Method and device for realizing image processing based on MapReduce framework
KR101609079B1 (en) Instruction culling in graphics processing unit
CN104952032A (en) Graph processing method and device as well as rasterization representation and storage method
US20190278574A1 (en) Techniques for transforming serial program code into kernels for execution on a parallel processor
US11169804B2 (en) Method for vectorizing d-heaps using horizontal aggregation SIMD instructions
Nakano An optimal parallel prefix-sums algorithm on the memory machine models for GPUs
Takafuji et al. C2CU: a CUDA C program generator for bulk execution of a sequential algorithm
Jin et al. GPUSGD: A GPU‐accelerated stochastic gradient descent algorithm for matrix factorization
CN103578130B (en) Method and apparatus for ray trace
Huang et al. $ TC-Stream $ T C-S t r e a m: Large-Scale Graph Triangle Counting on a Single Machine Using GPUs
US11030714B2 (en) Wide key hash table for a graphics processing unit
US20230119126A1 (en) Processing sequential inputs using neural network accelerators
CN116186339A (en) Parallel graph calculation processing method and device, electronic equipment and storage medium
US20160371005A1 (en) Memory-aware matrix factorization
CN106462391A (en) Utilizing pipeline registers as intermediate storage
CN117827386A (en) Scheduling method, scheduling device, electronic equipment and storage medium
CN112069185A (en) Index construction method and device, electronic equipment and medium
CN115146112A (en) Graph calculation method, device, calculation node and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination