CN113900786A - Distributed computing method and device based on graph data - Google Patents

Distributed computing method and device based on graph data

Info

Publication number
CN113900786A
CN113900786A (application CN202111183041.5A)
Authority
CN
China
Prior art keywords
calculation
mode
model
computing
graph data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111183041.5A
Other languages
Chinese (zh)
Inventor
周晶
孙喜民
郑斌
李鑫
孙博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid E Commerce Co Ltd
State Grid E Commerce Technology Co Ltd
Original Assignee
State Grid E Commerce Co Ltd
State Grid E Commerce Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid E Commerce Co Ltd, State Grid E Commerce Technology Co Ltd filed Critical State Grid E Commerce Co Ltd
Priority to CN202111183041.5A
Publication of CN113900786A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed computing method and a distributed computing device based on graph data, wherein the distributed computing method comprises the following steps: when a calculation request for current graph data is received, transmitting the current graph data to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model; and performing, by the preset graph data calculation model, distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode. In this process, the calculation of each superstep in the preset graph calculation model is realized based on the BSP model, and the calculation process within each superstep is further subdivided based on the GAS model, which increases the concurrency and scalability of the model calculation. The calculation is performed based on the target calculation mode, and different target calculation modes can be selected, which further increases the scalability of the calculation process.

Description

Distributed computing method and device based on graph data
Technical Field
The invention relates to the technical field of data processing, in particular to a distributed computing method and device based on graph data.
Background
Massive data involves complicated data types and heterogeneous data sources. To better develop the computation and applications related to large-scale massive data, distributed deployment, high-performance parallel processing, dynamic graph computation and the establishment of a large-scale graph data processing and computing model system are particularly necessary. These are, in principle and in practice, essential links for processing large-scale massive data: they enable diversified processing modes, allow the platform access system to achieve good data compatibility, and provide better specifications and mechanisms for the intelligent interconnection system and applications of e-commerce data and electrical equipment in the energy industry cloud network.
In the prior art, distributed computation is realized based on the MapReduce model, which is a programming model used for parallel operation on large-scale data sets (larger than 1 TB). It greatly facilitates programmers in running programs on a distributed system without knowledge of distributed parallel programming, but the MapReduce model has poor scalability and cannot be extended.
Disclosure of Invention
In view of this, the present invention provides a distributed computing method and apparatus based on graph data, so as to solve the problem in the prior art that distributed computing is implemented based on the MapReduce model. The MapReduce model is a programming model used for parallel operation on large-scale data sets (greater than 1 TB); it greatly facilitates programmers in running programs on a distributed system without knowledge of distributed parallel programming, but it has poor scalability and cannot be extended. The specific scheme is as follows:
a graph data-based distributed computing method, comprising:
when a calculation request for current graph data is received, transmitting the current graph data to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model;
the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode.
Optionally, the constructing the preset graph data calculation model based on the BSP calculation model and the GAS calculation model includes:
obtaining each superstep in the BSP calculation model, wherein each superstep comprises: a calculation process, a communication process and a barrier synchronization process;
and dividing the calculation process, according to the GAS calculation model, into an information collection stage responsible for extracting messages, an application stage responsible for local processing based on the collected messages, and a distribution stage responsible for sending new messages.
Optionally, in the method, when the target calculation mode is a synchronous calculation mode, the performing, by the preset graph data calculation model, distributed calculation on the current graph data based on the target calculation mode includes:
executing the current graph data iteratively using synchronized control and data flows;
calculating the vertex value of the current vertex based on the adjacent vertex values from the previous round;
synchronizing the vertex value to a backup node based on a message.
Optionally, in the method, when the target calculation mode is a hybrid calculation mode, the performing, by the preset graph data calculation model, distributed calculation on the current graph data based on the target calculation mode includes:
traversing activated vertexes in the preset calculation model, and calculating the activated vertexes based on vertex scheduling of the synchronous mode;
and employing asynchronous message passing in the calculation process, so that the currently activated vertex performs distributed calculation based on the adjacent vertex values.
Optionally, in the method, when the target calculation mode is an asynchronous calculation mode, the performing, by the preset graph data calculation model, distributed calculation on the current graph data based on the target calculation mode includes:
asynchronous transmission is carried out on the current graph data by adopting asynchronous control and data flow;
storing the current graph data received by the activated vertex of the preset graph data calculation model into a distributed scheduling queue;
and performing distributed computation based on the distributed scheduling queue.
A graph data-based distributed computing apparatus, comprising:
the transmission module is used for transmitting the current graph data to a preset graph data calculation model when a calculation request for the current graph data is received, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model;
and the calculation module is used for performing distributed calculation on the current graph data by the preset graph data calculation model based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode.
Optionally, in the apparatus described above, the construction of the preset graph data calculation model in the transfer module based on the BSP calculation model and the GAS calculation model includes:
an obtaining unit, configured to obtain each superstep in the BSP calculation model, where each superstep includes: a calculation process, a communication process and a barrier synchronization process;
and a dividing unit, configured to divide the calculation process, according to the GAS calculation model, into an information collection stage responsible for extracting messages, an application stage responsible for local processing based on the collected messages, and a distribution stage responsible for sending new messages.
Optionally, in the apparatus described above, when the target computing mode is a synchronous computing mode, the computing module includes:
an iteration unit, configured to execute the current graph data iteratively using synchronized control and data flows;
a first calculation unit, configured to calculate the vertex value of the current vertex based on the adjacent vertex values from the previous round;
a synchronization unit, configured to synchronize the vertex value to the backup node based on a message.
Optionally, in the apparatus described above, when the target computing mode is a hybrid computing mode, the computing module includes:
a second calculation unit, configured to traverse the activated vertexes in the preset calculation model and calculate the activated vertexes based on vertex scheduling of the synchronous mode;
and a third calculation unit, configured to employ asynchronous message passing in the calculation process, so that the currently activated vertex performs distributed calculation based on the adjacent vertex values.
Optionally, in the apparatus described above, when the target calculation mode is an asynchronous calculation mode, the calculation module includes:
the transmission unit is used for asynchronously transmitting the current graph data by adopting asynchronous control and data flow;
the storage unit is used for storing the current graph data received by the activated vertex of the preset graph data calculation model into a distributed scheduling queue;
and the fourth calculation unit is used for performing distributed calculation based on the distributed scheduling queue.
Compared with the prior art, the invention has the following advantages:
the invention discloses a distributed computing method and a distributed computing device based on graph data, wherein the distributed computing method comprises the following steps: when a calculation request for current map data is received, transmitting the current map data to a preset map data calculation model, wherein the preset map data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model; the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode. In the process, the calculation of each super step is realized in the preset graph calculation model based on the BSP model, the further subdivision of the calculation process in each super step is realized based on the GAS model, the concurrency and the expandability of the model calculation are increased, the calculation is carried out based on the target calculation mode, the target calculation mode can be selected differently, and the expandability of the calculation process is further increased.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a distributed computing method based on graph data according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of an execution flow of a master and a slave disclosed in an embodiment of the present application;
fig. 3 is a block diagram of a device structure of a distributed computing method based on graph data according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a distributed computing method and device based on graph data, which are applied to the parallel computing of massive graph data. In the prior art, distributed computing is realized based on the MapReduce model, which is used for parallel operation on large-scale data sets (larger than 1 TB). The concepts "Map" and "Reduce" are its main ideas; both are borrowed from functional programming languages, along with features borrowed from vector programming languages. The MapReduce model is a data flow model: it divides a graph calculation job into a plurality of map tasks and reduce tasks, abstracts the execution of the job into the two stages of map and reduce, and distributes different calculation tasks to a plurality of nodes in parallel to complete the job. The MapReduce model comprises a JobTracker and a TaskTracker. The JobTracker is a background service process; after being started, it continuously monitors and receives the information sent by each TaskTracker, including resource usage and task running status. The main functions of the JobTracker are: (1) job control: each application program is expressed as one job in Hadoop, each job is divided into a plurality of tasks, and the job control module of the JobTracker is responsible for the decomposition and state monitoring of the job; (2) resource management.
The TaskTracker is a bridge between the JobTracker and the Tasks: on one hand, it receives and executes various commands from the JobTracker, such as running, submitting and killing tasks; on the other hand, it periodically reports the state of each task on the local node to the JobTracker through heartbeats. The TaskTracker communicates with the JobTracker and the Tasks using the RPC protocol. The functions of the TaskTracker are as follows: (1) reporting heartbeats: the TaskTracker periodically reports various information about the node to the JobTracker through a heartbeat mechanism; (2) executing commands: the JobTracker issues various commands to the TaskTracker, mainly including: launch task (LaunchTaskAction), commit task (CommitTaskAction), kill task (KillTaskAction), kill job (KillJobAction) and reinitialize (ReinitTrackerAction). The MapReduce model greatly facilitates programmers in running programs on a distributed system without knowledge of distributed parallel programming, but it has poor scalability and cannot be extended. In order to solve the above problem, the present invention provides a distributed computing method based on graph data; the execution flow of the method is shown in Fig. 1, and the method includes the following steps:
s101, when a calculation request for current graph data is received, transmitting the current graph data to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model;
in the embodiment of the present invention, the bsp (bulksyncchronous parallelll) calculation model: proposed by Leslie Valiant, Harvard university and BillMcColl, Oxford university. In BSP, a calculation process consists of a series of global supersteps, each of which consists of three steps of concurrent calculation process, communication process and fence synchronization. Synchronization is complete, marking the completion of the one over-step, and the start of the next over-step. The most famous of large-scale distributed graph computing systems based on the BSP model is Pregel. Pregel adopts a BSP calculation model, abstracts a graph calculation task into a series of super steps by taking a vertex as a center, and iteratively executes calculation until all the vertices are converged to obtain a calculation result. It provides a feasible solution in the aspects of graph partitioning, computing processing, synchronous control, communication optimization, fault tolerance management and the like. Relatively speaking, for large-scale graph computation requiring multiple iterative processes, the base BSP model has higher computational efficiency than a large-scale graph computation system based on the MapReduce model. Compared with the Map and Reduce phases of the MapReduce model, the BSP model realizes the iterative computation of the graph data based on the super steps, realizes the network transmission of the data in each super step through a message communication mechanism, and performs data synchronization between each super step, thereby ensuring the convergence of the task and the correctness of the result.
The GAS calculation model can be viewed as a refinement of the vertex-centric graph computation programming model, which increases computational concurrency by further subdividing the calculation process. The GAS calculation model divides the vertex update function into three successive processing phases: an information collection phase (Gather) responsible for extracting messages, an application phase (Apply) responsible for local processing based on the collected messages, and a distribution phase (Scatter) responsible for sending new messages. Through this division of the calculation stages, an originally monolithic calculation process can be subdivided, so that each sub-stage can be executed concurrently, further increasing the concurrent processing capacity of the system. Furthermore, the GAS calculation model incorporates the idea of distributed collaboration on massive data, which is a concrete embodiment of optimizing scalability and improving usability.
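As a hedged illustration, the three GAS phases can be expressed as separate callbacks of a vertex program, as in the Python sketch below; the PageRank-like update and all names (GVertex, gather, apply_phase, scatter) are assumptions for exposition, not text from the patent.

```python
from dataclasses import dataclass, field
from functools import reduce

@dataclass(eq=False)                 # eq=False keeps instances hashable for the active set
class GVertex:
    vid: int
    rank: float = 1.0
    in_neighbors: list = field(default_factory=list)   # vertices with edges into this one
    out_neighbors: list = field(default_factory=list)  # vertices this one points to

def gather(v, neighbor):
    # Gather: extract one partial message per in-edge.
    return neighbor.rank / max(len(neighbor.out_neighbors), 1)

def apply_phase(v, acc):
    # Apply: local processing based on the combined gathered value (PageRank-like).
    v.rank = 0.15 + 0.85 * acc

def scatter(v, activate):
    # Scatter: send new messages / activate out-neighbors for the next round.
    for nb in v.out_neighbors:
        activate(nb)

def run_one_gas_step(active_vertices):
    next_active = set()
    for v in active_vertices:
        partials = [gather(v, nb) for nb in v.in_neighbors]   # may run concurrently
        acc = reduce(lambda a, b: a + b, partials, 0.0)       # pairwise combination
        apply_phase(v, acc)
        scatter(v, next_active.add)
    return next_active
```

Because the gathered partial results combine pairwise, the Gather phase of a single vertex can itself be parallelized, which is the source of the extra concurrency described above.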
In the embodiment of the invention, Pregel adopts the BSP calculation model and designs, with the superstep as the granularity, a calculation process, a message communication process and a barrier synchronization process; PowerGraph abstracts the Gather-Apply-Scatter vertex program model and provides a node calculation model for executing graph calculation tasks, which can be regarded as further subdividing the calculation to increase concurrency and distribution. Based on the BSP calculation model of Pregel and the GAS calculation model of PowerGraph, graph calculation models based on a synchronous iterative mode are designed, such as graph traversal, shortest path and PageRank calculation. The BSP model of Pregel follows a calculation-update-synchronization iteration mode, and message communication is performed by adopting vertex-based Push and Pull modes.
In the embodiment of the invention, the BSP calculation model of Pregel and the GAS calculation model of PowerGraph are adopted, and a graph calculation model based on synchronous iteration is designed. The BSP model of Pregel follows a calculation-update-synchronization iteration mode, and message communication is performed by adopting vertex-based Push and Pull modes. In an edge-cut system that is only suitable for the Pregel class, if a vertex-cut scenario is considered, then because vertex cutting distributes a given vertex over different physical nodes, the Push and combine mechanisms do not reduce the number of message transmissions while the transmission range increases.
Graphs in real life mostly follow a power-law distribution, and using edge cutting causes the storage overhead and the communication overhead to increase sharply. The GAS calculation model needs to contain the vertex activities in the calculation process: receiving and sending messages, executing calculation logic, and updating vertex values. It can be seen that the GAS model abstracted in PowerGraph solves the calculation and communication problems of power-law graphs well. After a vertex is cut, one of its copies is automatically selected as the Master, and the other copies become Mirror replicas.
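A minimal sketch of the vertex-cut replication just described is given below; the rule for choosing the Master copy (lexicographically smallest machine name) and all names are illustrative assumptions, since the patent does not fix a particular selection rule.

```python
# Sketch: after a vertex cut, copies of a vertex live on several machines; one copy
# is designated Master and the other copies become Mirror replicas that receive the
# Master's updated value via messages.

def assign_master_and_mirrors(vertex_id, machines_holding_copies):
    # An assumed, deterministic rule; a real system would also consider load balance.
    master = sorted(machines_holding_copies)[0]
    mirrors = [m for m in machines_holding_copies if m != master]
    return master, mirrors

master, mirrors = assign_master_and_mirrors("v42", ["node-3", "node-1", "node-7"])
# master == "node-1", mirrors == ["node-3", "node-7"]
```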
S102, the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode.
In the embodiment of the present invention, the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, where the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode. The target calculation mode may be selected based on experience or on a specifically configured selection rule, which is not specifically limited in the embodiment of the present invention.
When the target calculation mode is the synchronous calculation mode, the synchronous calculation mode (synchronous mode for short) adopts synchronized control and data flows, and the whole calculation process is divided into multiple rounds of iteration. All live vertexes in each round are calculated using the adjacent vertex values of the previous round, and global synchronization between rounds ensures that the vertex calculation of the previous round and the updating of the vertex values (each vertex synchronizes its data to its backup vertexes through messages) have been completed on all machines. The scheduling overhead is small, because each round in the synchronous mode only needs to traverse the live vertexes and record the vertexes activated in that round, and access to vertex data during calculation does not need to be protected (the vertex values of the previous round are always used). However, the fact that only the values of the previous round can be used in each round also makes the whole calculation converge slowly, increases the number of vertex calculations, and prolongs the number of rounds. As the number of converged vertexes increases, the proportion of effective calculation time decreases rapidly, and the global synchronization overhead occupies most of the execution time of each round, resulting in severe performance loss.
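As an illustration of the synchronous mode, the sketch below computes each round strictly from the previous round's values and then switches all values together, which stands in for the global barrier; the function names and the convergence threshold are hypothetical.

```python
# Synchronous mode sketch: round i reads only the values produced in round i-1.

def synchronous_rounds(vertices, neighbors, update, max_rounds=20):
    values = {v: 1.0 for v in vertices}           # round 0 values
    live = set(vertices)
    for _ in range(max_rounds):
        if not live:
            break
        new_values = dict(values)                 # next round's values
        next_live = set()
        for v in live:
            prev = [values[u] for u in neighbors[v]]   # previous-round values only
            new_values[v] = update(v, prev)
            if abs(new_values[v] - values[v]) > 1e-6:
                next_live.update(neighbors[v])    # activate neighbors for next round
        values, live = new_values, next_live      # barrier: everyone switches together
    return values
```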
In the case that the target calculation mode is the asynchronous calculation mode, the asynchronous calculation mode (asynchronous mode for short) employs asynchronous control and data flows: the current graph data received by the activated vertexes of the preset graph data calculation model is stored into a distributed scheduling queue, and distributed calculation is performed based on the distributed scheduling queue. The synchronous and asynchronous calculation modes differ mainly in whether there is a (global) coordination mechanism that synchronizes task execution.
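The asynchronous mode can be sketched with a scheduling queue of activated vertices, as below; the single-process deque stands in for the distributed scheduling queue and, together with the names and threshold, is an illustrative assumption.

```python
from collections import deque

# Asynchronous mode sketch: activated vertices are pushed onto a scheduling queue and
# processed as soon as possible, always reading the latest neighbor values.

def asynchronous_run(vertices, neighbors, update, max_steps=10000):
    values = {v: 1.0 for v in vertices}
    queue = deque(vertices)                       # stand-in for the distributed queue
    queued = set(vertices)
    steps = 0
    while queue and steps < max_steps:
        v = queue.popleft()
        queued.discard(v)
        new_value = update(v, [values[u] for u in neighbors[v]])  # latest values
        if abs(new_value - values[v]) > 1e-6:
            values[v] = new_value
            for u in neighbors[v]:                # activate affected neighbors
                if u not in queued:
                    queue.append(u)
                    queued.add(u)
        steps += 1
    return values
```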
In the case that the target calculation mode is the hybrid calculation mode: based on analyzing and comparing the characteristics of the synchronous and asynchronous calculation modes, a hybrid calculation mode (hybrid mode for short) is provided, that is, asynchronous data flow (message passing) is used on the basis of a synchronous control flow (task scheduling). By combining the advantages of the two existing modes, it obtains a relatively fast convergence speed while reducing the scheduling overhead, thereby improving the performance of the distributed graph computing system.
In contrast to the synchronous computation mode, the hybrid computation mode employs asynchronous message passing to enable the vertices to compute using newer neighboring vertex values.
The hybrid computing mode combines the advantages of the synchronous computing mode and the asynchronous computing mode, overcomes the defects of the synchronous computing mode and the asynchronous computing mode, and is an ideal mode of the graph computing engine in a distributed environment.
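A hedged sketch of the hybrid mode follows: the round structure (synchronous control flow) is kept, but within a round each vertex reads whatever neighbor values have already been updated earlier in that round (asynchronous data flow). The in-round scheduling order and names are assumptions.

```python
# Hybrid mode sketch: synchronous scheduling of rounds, asynchronous data flow inside
# a round (updates become visible to later vertices of the same round immediately).

def hybrid_rounds(vertices, neighbors, update, max_rounds=20):
    values = {v: 1.0 for v in vertices}
    live = set(vertices)
    for _ in range(max_rounds):
        if not live:
            break
        next_live = set()
        for v in sorted(live):                    # synchronous per-round scheduling
            new_value = update(v, [values[u] for u in neighbors[v]])  # newest values
            if abs(new_value - values[v]) > 1e-6:
                values[v] = new_value             # visible within the same round
                next_live.update(neighbors[v])
        live = next_live                          # round boundary (control-flow barrier)
    return values
```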
The invention discloses a distributed computing method based on graph data, which comprises the following steps: when a calculation request for current graph data is received, transmitting the current graph data to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model; and performing, by the preset graph data calculation model, distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode. In this process, the calculation of each superstep in the preset graph calculation model is realized based on the BSP model, and the calculation process within each superstep is further subdivided based on the GAS model, which increases the concurrency and scalability of the model calculation. The calculation is performed based on the target calculation mode, and different target calculation modes can be selected, which further increases the scalability of the calculation process.
In the embodiment of the invention, in a distributed system, the partitioning and indexing of nodes is an important research problem. Depending on the network partitioning method, the nodes of a network may be placed on different slaves, so how to quickly find the data structure of a node in the system according to the node ID is an important research topic. The connections among the nodes of the network structure are dense; if the network is divided into a plurality of subgraphs, keeping the connections within each subgraph dense and the connections between subgraphs sparse becomes an important criterion for evaluating the partitioning algorithm. In distributed graph algorithm research, graph partitioning is an important research branch: if the network is partitioned well, the communication inside each slave increases, the communication between the slaves decreases, and the computation speed improves. A point that is uploaded from a slave file to the master and exceeds a certain threshold is an active point.
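The node-to-slave indexing problem mentioned above can be illustrated with a simple hash-based partitioning rule, sketched below; both the rule and the names are assumptions, since the patent does not fix a particular partitioning algorithm.

```python
# Sketch: map a node ID to the slave that stores it, and keep a per-slave index so
# a node's data structure can be found quickly from its ID.

def slave_for_node(node_id, num_slaves):
    # Simple hash partitioning (assumed rule; a real partitioner would also try to
    # minimize cut edges and would use a hash that is stable across processes).
    return hash(node_id) % num_slaves

class Slave:
    def __init__(self, slave_id):
        self.slave_id = slave_id
        self.local_index = {}             # node ID -> node data structure

    def add_node(self, node_id, node_data):
        self.local_index[node_id] = node_data

    def lookup(self, node_id):
        return self.local_index.get(node_id)

slaves = [Slave(i) for i in range(4)]
for nid in ["a", "b", "c", "d"]:
    slaves[slave_for_node(nid, len(slaves))].add_node(nid, {"value": 0.0})
```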
As shown in Fig. 2, which is an execution flow chart based on a master and slaves: the master process is started, the network is initialized, and it is determined whether an active node exists. The determination rule for an active node is that a point uploaded from a slave file to the master that exceeds a certain threshold is an active point, where the threshold may be set based on experience or specific conditions; the embodiment of the present invention does not limit its specific value. If no active node exists, the process ends directly; if one exists, superstep S is executed and the slave processes are monitored. The monitoring process is as follows: a slave process is started, the network is stored, the nodes are traversed, and it is determined whether each node is active (the specific determination is the same as described above). If the node is inactive, the flow returns to traverse the next node, and ends directly if there is no next node. If the node is active, the node's compute function is executed; after execution is finished, an end signal is issued. After the end signal is received, it is determined whether there is a next node: if there is, processing continues with that node; if not, the flow ends directly.
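A compressed sketch of the Fig. 2 flow is given below, under the assumption that "active" means a node's uploaded value exceeds a threshold; the function names and threshold value are illustrative.

```python
# Sketch of the master/slave execution flow of Fig. 2 (names and threshold assumed).

ACTIVE_THRESHOLD = 0.5

def is_active(node):
    # A node uploaded from a slave whose value exceeds the threshold is active.
    return node["uploaded_value"] > ACTIVE_THRESHOLD

def slave_superstep(nodes, compute):
    for node in nodes:                       # traverse the nodes on this slave
        if is_active(node):
            compute(node)                    # execute the node's compute function
    # an end signal would be reported back to the master here

def master_loop(slaves, compute, max_supersteps=100):
    for superstep in range(max_supersteps):
        if not any(is_active(n) for nodes in slaves for n in nodes):
            return superstep                 # no active node anywhere: finish
        for nodes in slaves:
            slave_superstep(nodes, compute)  # superstep S on every slave
    return max_supersteps
```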
In the embodiment of the invention, the core idea of the Pregel model is derived from the BSP calculation model proposed by Leslie Valiant in the 1980s. Graph computation involves constant updates of the same data and extensive message passing; if implemented with MapReduce, this results in significant unnecessary serialization and deserialization overhead. In the traditional high-performance computing field, such computation is usually done with MPI, but MPI only provides a series of communication interfaces, the development difficulty is high, and its fault tolerance is insufficient for a cluster that Google constructs from commodity PCs. Some existing graph computing systems do not fit the Google scenario.
A Pregel calculation consists of a series of supersteps. In each superstep, the framework invokes a user-defined function for each vertex; this function defines the calculation to be performed in the superstep. It can read the messages sent to the vertex in the previous superstep, generate messages for other vertices for the next iteration, and at the same time modify the state of the vertex and of its edges.
A vertex can send any number of messages in one superstep. All messages sent to a vertex V in superstep S are available, through an iterator, when the vertex executes its user-defined compute function in superstep S+1.
There is no deterministic message order in the iterative process, but it is guaranteed that every message is delivered and is not duplicated.
One common application pattern is to iterate over the out-edges of a vertex and send a message to the destination vertex of each edge. However, the destination vertex of a message is not necessarily a neighbor of the sending vertex: a vertex may have obtained the identifier of a non-neighbor vertex through a received message, or the identifier may be inferable. For example, in a fully connected graph, vertex i is connected with the other n-1 vertices, so all destination vertices are known without accessing the out-edges.
When the destination vertex of a message does not exist, a user-defined handler function can be executed, for example to create the missing vertex or to remove the dangling out-edge from the source vertex.
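The vertex-centric compute function described above can be sketched as follows; the single-source shortest-path logic and all names (send_message, vote_to_halt, the "source" vertex id) are illustrative assumptions, not the patent's code.

```python
# Sketch of a Pregel-style compute() for single-source shortest paths (illustrative).

INF = float("inf")

def compute(vertex, superstep, messages, send_message, vote_to_halt):
    # messages: the values sent to this vertex in the previous superstep (iterator).
    candidate = min(messages, default=INF)
    if superstep == 0 and vertex["id"] == "source":
        candidate = 0.0
    if candidate < vertex["distance"]:
        vertex["distance"] = candidate
        for target, weight in vertex["out_edges"]:
            send_message(target, candidate + weight)   # target need not be a neighbor
    vote_to_halt(vertex)                               # inactive until messaged again
```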
In the BSP model, each superstep comprises the following three phases:
(1) a local calculation phase, in which each processor only performs local calculation on the data stored in its local memory;
(2) a global communication phase, which operates on any non-local data;
(3) a barrier synchronization phase, which waits for all communication actions to end.
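A minimal sketch of this three-phase structure with an explicit barrier is shown below, using Python threads to stand in for the processors; the update formulas, worker count and message pattern are assumptions made only to keep the example self-contained.

```python
import threading

# Sketch: each worker does local computation, exchanges data, then waits at a barrier.

NUM_WORKERS = 4
barrier = threading.Barrier(NUM_WORKERS)
shared_inbox = {w: [] for w in range(NUM_WORKERS)}
lock = threading.Lock()

def worker(wid, rounds=3):
    local_value = float(wid)
    for _ in range(rounds):
        local_value = local_value * 0.5 + 1.0            # (1) local computation
        with lock:                                       # (2) global communication
            shared_inbox[(wid + 1) % NUM_WORKERS].append(local_value)
        barrier.wait()                                   # (3) barrier synchronization
        with lock:
            received = shared_inbox[wid][:]
            shared_inbox[wid].clear()
        local_value += sum(received) * 0.1
        barrier.wait()                                   # keep the rounds aligned

threads = [threading.Thread(target=worker, args=(w,)) for w in range(NUM_WORKERS)]
for t in threads: t.start()
for t in threads: t.join()
```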
In the embodiment of the invention, existing distributed computing platform systems are fused and improved to construct a suitable computing system, and the main problems faced by the system are modeled, thereby enhancing the scalability and usability of the system: the BSP and GAS calculation models are fused, suitable synchronous, asynchronous and hybrid calculation modes are constructed, the distributed nodes are indexed, and a suitable message transmission model is constructed, so that the construction of a massive-data distributed computing system is realized.
Based on the foregoing distributed computing method based on graph data, an embodiment of the present invention further provides a distributed computing apparatus based on graph data. A structural block diagram of the computing apparatus is shown in Fig. 3, and the apparatus includes:
a transfer module 201 and a calculation module 202.
Wherein:
the transfer module 201 is configured to transfer the current graph data to a preset graph data calculation model when a calculation request for the current graph data is received, where the preset graph data calculation model is constructed based on a BSP calculation model and a GAS calculation model;
the calculation module 202 is configured to perform distributed calculation on the current graph data based on a target calculation mode by the preset graph data calculation model, where the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode, and a hybrid calculation mode.
The invention discloses a distributed computing device based on graph data, which operates as follows: when a calculation request for current graph data is received, the current graph data is transmitted to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model; and the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode. In this process, the calculation of each superstep in the preset graph calculation model is realized based on the BSP model, and the calculation process within each superstep is further subdivided based on the GAS model, which increases the concurrency and scalability of the model calculation. The calculation is performed based on the target calculation mode, and different target calculation modes can be selected, which further increases the scalability of the calculation process.
In the embodiment of the present invention, the construction of the preset graph data calculation model in the transmission module 201 based on the BSP calculation model and the GAS calculation model involves:
an acquisition unit 203 and a dividing unit 204.
Wherein:
the obtaining unit 203 is configured to obtain each superstep in the BSP calculation model, where each superstep includes: a calculation process, a communication process and a barrier synchronization process;
the dividing unit 204 is configured to divide the calculation process, according to the GAS calculation model, into an information collection stage responsible for extracting messages, an application stage responsible for local processing based on the collected messages, and a distribution stage responsible for sending new messages.
In this embodiment of the present invention, when the target computing mode is a synchronous computing mode, the computing module 202 includes:
an iteration unit 205, a first calculation unit 206 and a synchronization unit 207.
Wherein:
the iteration unit 205 is configured to execute the current graph data iteratively using synchronized control and data flows;
the first calculating unit 206 is configured to calculate the vertex value of the current vertex based on the adjacent vertex values from the previous round;
the synchronization unit 207 is configured to synchronize the vertex value to the backup node based on a message.
In this embodiment of the present invention, when the target computing mode is a hybrid computing mode, the computing module 202 includes:
a second calculation unit 208 and a third calculation unit 209.
Wherein:
the second calculating unit 208 is configured to traverse activated vertices in the preset calculation model, and calculate the activated vertices based on vertex scheduling in a synchronous mode;
the third computing unit 209 is configured to employ asynchronous message passing in the computing process, so that the currently activated vertex performs distributed computing based on the adjacent vertex values.
In this embodiment of the present invention, when the target calculation mode is an asynchronous calculation mode, the calculation module 202 includes:
a transfer unit 210, a storage unit 211, and a fourth calculation unit 212.
Wherein:
the transmission unit 210 is configured to asynchronously transmit the current graph data by using asynchronous control and data streams;
the storage unit 211 is configured to store the current graph data received by the activated vertex of the preset graph data calculation model into a distributed scheduling queue;
the fourth calculating unit 212 is configured to perform distributed calculation based on the distributed scheduling queue.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory computer readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A distributed computing method based on graph data is characterized by comprising the following steps:
when a calculation request for current graph data is received, transmitting the current graph data to a preset graph data calculation model, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model;
the preset graph data calculation model performs distributed calculation on the current graph data based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode.
2. The method of claim 1, wherein the constructing the preset graph data calculation model based on a BSP calculation model and a GAS calculation model comprises:
obtaining each superstep in the BSP calculation model, wherein each superstep comprises: a calculation process, a communication process and a barrier synchronization process;
and dividing the calculation process, according to the GAS calculation model, into an information collection stage responsible for extracting messages, an application stage responsible for local processing based on the collected messages, and a distribution stage responsible for sending new messages.
3. The method according to claim 1, wherein in a case that the target computing mode is a synchronous computing mode, the preset graph data computing model performs distributed computing on the current graph data based on the target computing mode, including:
executing the current graph data iteratively using synchronized control and data flows;
calculating the vertex value of the current vertex based on the adjacent vertex values from the previous round;
synchronizing the vertex value to a backup node based on a message.
4. The method according to claim 1, wherein in a case that the target computing mode is a hybrid computing mode, the preset graph data computing model performs distributed computing on the current graph data based on the target computing mode, including:
traversing activated vertexes in the preset calculation model, and calculating the activated vertexes based on vertex scheduling of the synchronous mode;
and employing asynchronous message passing in the calculation process, so that the currently activated vertex performs distributed calculation based on the adjacent vertex values.
5. The method according to claim 1, wherein in a case that the target calculation mode is an asynchronous calculation mode, the preset graph data calculation model performs distributed calculation on the current graph data based on the target calculation mode, including:
asynchronous transmission is carried out on the current graph data by adopting asynchronous control and data flow;
storing the current graph data received by the activated vertex of the preset graph data calculation model into a distributed scheduling queue;
and performing distributed computation based on the distributed scheduling queue.
6. A graph data-based distributed computing apparatus, comprising:
the transmission module is used for transmitting the current graph data to a preset graph data calculation model when a calculation request for the current graph data is received, wherein the preset graph data calculation model is constructed on the basis of a BSP calculation model and a GAS calculation model;
and the calculation module is used for performing distributed calculation on the current graph data by the preset graph data calculation model based on a target calculation mode, wherein the target calculation mode is one of a synchronous calculation mode, an asynchronous calculation mode and a hybrid calculation mode.
7. The apparatus of claim 6, wherein the predetermined graph data calculation model in the transfer module is constructed based on a BSP calculation model and a GAS calculation model, and comprises:
an obtaining unit, configured to obtain each superstep in the BSP calculation model, where each superstep includes: a calculation process, a communication process and a barrier synchronization process;
and a dividing unit, configured to divide the calculation process, according to the GAS calculation model, into an information collection stage responsible for extracting messages, an application stage responsible for local processing based on the collected messages, and a distribution stage responsible for sending new messages.
8. The apparatus of claim 6, wherein in the case that the target computing mode is a synchronous computing mode, the computing module comprises:
an iteration unit, configured to execute the current graph data iteratively using synchronized control and data flows;
a first calculation unit, configured to calculate the vertex value of the current vertex based on the adjacent vertex values from the previous round;
a synchronization unit for synchronizing the vertex values to the backup nodes based on the message.
9. The apparatus of claim 6, wherein in the case that the target computing mode is a hybrid computing mode, the computing module comprises:
a second calculation unit, configured to traverse the activated vertexes in the preset calculation model and calculate the activated vertexes based on vertex scheduling of the synchronous mode;
and a third calculation unit, configured to employ asynchronous message passing in the calculation process, so that the currently activated vertex performs distributed calculation based on the adjacent vertex values.
10. The apparatus of claim 6, wherein in the case that the target computing mode is an asynchronous computing mode, the computing module comprises:
the transmission unit is used for asynchronously transmitting the current graph data by adopting asynchronous control and data flow;
the storage unit is used for storing the current graph data received by the activated vertex of the preset graph data calculation model into a distributed scheduling queue;
and the fourth calculation unit is used for performing distributed calculation based on the distributed scheduling queue.
CN202111183041.5A 2021-10-11 2021-10-11 Distributed computing method and device based on graph data Pending CN113900786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111183041.5A CN113900786A (en) 2021-10-11 2021-10-11 Distributed computing method and device based on graph data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111183041.5A CN113900786A (en) 2021-10-11 2021-10-11 Distributed computing method and device based on graph data

Publications (1)

Publication Number Publication Date
CN113900786A (en) 2022-01-07

Family

ID=79191276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111183041.5A Pending CN113900786A (en) 2021-10-11 2021-10-11 Distributed computing method and device based on graph data

Country Status (1)

Country Link
CN (1) CN113900786A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184834A1 (en) * 2022-03-31 2023-10-05 深圳清华大学研究院 Collective communication optimization method for global high-degree vertices, and application

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336808A (en) * 2013-06-25 2013-10-02 中国科学院信息工程研究所 System and method for real-time graph data processing based on BSP (Board Support Package) model
US20180039710A1 (en) * 2016-08-05 2018-02-08 International Business Machines Corporation Distributed graph databases that facilitate streaming data insertion and queries by efficient throughput edge addition
CN111859027A (en) * 2019-04-24 2020-10-30 华为技术有限公司 Graph calculation method and device

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103336808A (en) * 2013-06-25 2013-10-02 中国科学院信息工程研究所 System and method for real-time graph data processing based on BSP (Board Support Package) model
US20180039710A1 (en) * 2016-08-05 2018-02-08 International Business Machines Corporation Distributed graph databases that facilitate streaming data insertion and queries by efficient throughput edge addition
CN111859027A (en) * 2019-04-24 2020-10-30 华为技术有限公司 Graph calculation method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CHEN, Junhang: "Research on Adaptive Optimization Methods for Distributed Graph Computing Systems", China Master's Theses Full-text Database, Basic Sciences, 15 March 2021 (2021-03-15), pages 9 - 11 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023184834A1 (en) * 2022-03-31 2023-10-05 深圳清华大学研究院 Collective communication optimization method for global high-degree vertices, and application

Similar Documents

Publication Publication Date Title
US8521782B2 (en) Methods and systems for processing large graphs using density-based processes using map-reduce
CN108632365A (en) Service Source method of adjustment, relevant apparatus and equipment
CN109933631A (en) Distributed parallel database system and data processing method based on Infiniband network
CN107729138B (en) Method and device for analyzing high-performance distributed vector space data
US10498817B1 (en) Performance tuning in distributed computing systems
US11314694B2 (en) Facilitating access to data in distributed storage system
Nguyen et al. High performance peer-to-peer distributed computing with application to obstacle problem
CN113778615B (en) Rapid and stable network shooting range virtual machine construction system
CN113900786A (en) Distributed computing method and device based on graph data
CN115794373A (en) Calculation force resource hierarchical scheduling method, system, electronic equipment and storage medium
Liu et al. DCNSim: A data center network simulator
CN117311975A (en) Large model parallel training method, system and readable storage medium
CN115495056B (en) Distributed graph computing system and method
CN114884830A (en) Distributed parallel simulation deduction system based on wide area network
Herlicq et al. Nextgenemo: an efficient provisioning of edge-native applications
EP4285222A1 (en) Systems and methods for automated network state and network inventory tracking
Wang et al. A new mobile agent-based middleware system design for Wireless Sensor Network
Hui et al. Epsilon: A microservices based distributed scheduler for kubernetes cluster
Balteanu et al. Near real-time scheduling in cloud-edge platforms
Cicirelli et al. An agent framework for high performance simulations over multi-core clusters
CN114095356B (en) Method and device for configuring node task strategy in real time
CN110955731A (en) Multi-source remote sensing big data processing method and device based on Chord ring
Carlini et al. Distributed graph processing: an approach based on overlay composition
CN115361388B (en) Resource scheduling method and device in edge cloud computing system
CN114884817A (en) Data interaction method and system for power transmission and transformation equipment internet of things

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination