CN110955497B

CN110955497B - Distributed graph computing system based on task segmentation

Info

Publication number: CN110955497B
Application number: CN201911063615.8A
Authority: CN
Inventors: 俞山青; 周嘉俊; 王甬琪; 崔文豪; 孟栎均; 王永恒
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2023-03-31
Anticipated expiration: 2039-11-04
Also published as: CN110955497A

Abstract

A distributed graph computing system based on task segmentation, comprising: a client, which is responsible for dividing and uploading the computing tasks, processing and uploading the graph data, periodically checking whether all subtasks have finished executing, and completing some simple computing tasks itself; a server, which receives and manages the subtasks uploaded by the client and distributes them to the working end for execution; a working end, which computes the subtasks and uploads the calculation results; and a data center, which manages the graph data processed by the client, the task state table and the calculation results. The server uses the Gearman task distribution framework, the client and the working end are designed according to the computing task, and the data center uses a MongoDB database. In the distributed graph computing system provided by the invention, the graph data is stored centrally in the data center. In this way, the overhead caused by a large amount of inter-subgraph communication is avoided, and the computing efficiency of the system is improved.

Description

Distributed graph computing system based on task segmentation

Technical Field

The invention belongs to the field of distributed graph computing, and particularly relates to a distributed graph computing system based on task segmentation.

Background

The graph data structure can concisely represent many relational models in the real world, and graph computation can simplify many problems. With the emergence of technologies and concepts such as the Internet, the Internet of Things and the Internet of Everything, the objects under study have become increasingly complex. The graph data abstracted from them has become more complex in structure and larger in scale, and graph computation consumes ever more time and space, so traditional centralized graph computation methods struggle to process such data quickly and effectively. Distributed graph computation offers a feasible way to process large-scale graph data efficiently.

Current distributed graph computing systems mainly include systems based on the MapReduce model, systems based on the Bulk Synchronous Parallel (BSP) model, systems based on the Gather-Apply-Scatter (GAS) model, and others. MapReduce-based systems are poorly suited to large-scale graph computing tasks because of disk access overhead and the difficulty of expressing iteration. BSP-based systems, typified by Pregel and Giraph, lose efficiency because computing nodes sit idle while waiting in the model's barrier synchronization phase. GAS-based systems, typified by GraphLab and PowerGraph, have no barrier synchronization phase and so avoid this waiting time, but when the nodes execute many computing rounds, errors in the calculation results can arise. These models, and others, each have advantages and disadvantages, but all of them distribute the computation by partitioning the graph data into subgraphs and then executing an iterative update function. When a computing task requires a large number of parameters to be passed between subgraphs, the data communication overhead of this approach usually increases the computing time substantially and degrades system performance.

Disclosure of Invention

To overcome the greatly increased computing time and poor system performance of existing distributed methods, the invention provides a distributed graph computing system based on task segmentation, in which the graph data is stored centrally in the data center. In this way, the overhead caused by a large amount of inter-subgraph communication is avoided, and the computing efficiency of the system is improved.

To solve the above technical problem, the invention provides the following technical solution:

a distributed graph computing system based on task segmentation, the system comprising:

a client, which is responsible for dividing and uploading the computing tasks, processing and uploading the graph data, periodically checking whether all subtasks have finished executing, and completing some simple computing tasks itself;

a server, which receives and manages the subtasks uploaded by the client and distributes them to the working end for execution;

a working end, which computes the subtasks and uploads the calculation results;

a data center, which manages the graph data processed by the client, the task state table and the calculation results;

the server uses the Gearman task distribution framework, the client and the working end are designed according to the computing task, and the data center uses a MongoDB database.

Task segmentation in this system differs from the approach of existing distributed graph computing frameworks, which are based on graph segmentation (the graph is partitioned, and the computing task on the graph is converted into computing tasks on subgraphs).

Further, the processing procedure of the client is as follows:

1.1 Obtain the graph data, convert it into adjacency list form, and upload it to the data center. The adjacency list is compressed with a compression tool and then transferred as a GridFS file. Transferring the graph data as a whole, rather than record by record, reduces the transfer time overhead, and file compression reduces the amount of data transferred;

1.2 A uniform division strategy is used when dividing the computing task, so that the computing time of each subtask on the computing nodes tends to be the same. The computing nodes therefore finish their tasks at nearly the same time, which avoids some nodes sitting idle while others execute more time-consuming tasks;

1.3 After task segmentation is completed, generate a task list and upload it to the server, and generate a task state table and upload it to the data center;

1.4 Complete some simple computing tasks locally and upload the results directly to the data center, which reduces the data transfer time that distributed computation would otherwise require;

1.5 After the above steps are completed, periodically check the task state table to detect whether all tasks have finished; once all subtasks are complete, integrate all calculation results and upload them to the data center for storage.
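
Steps 1.2 and 1.3 can be illustrated with a minimal Python sketch. It assumes that a subtask is simply a contiguous chunk of vertex ids and that the task state table maps a task id to a status string; the names `make_task_list`, `make_state_table` and the status value `"pending"` are hypothetical and do not come from the patent.

```python
from typing import Dict, List


def make_task_list(vertex_ids: List[int], num_subtasks: int) -> List[dict]:
    """Split the vertex ids into num_subtasks contiguous chunks of nearly equal size."""
    base, extra = divmod(len(vertex_ids), num_subtasks)
    tasks = []
    start = 0
    for task_id in range(num_subtasks):
        # The first `extra` subtasks take one additional vertex each.
        size = base + (1 if task_id < extra else 0)
        tasks.append({"task_id": task_id, "vertices": vertex_ids[start:start + size]})
        start += size
    return tasks


def make_state_table(tasks: List[dict]) -> Dict[int, str]:
    """Create one state entry per subtask; "pending" is an assumed initial status."""
    return {task["task_id"]: "pending" for task in tasks}
```

With this layout, subtask sizes differ by at most one vertex, which is one simple reading of the uniform division strategy; it does not by itself guarantee equal computing time per subtask.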

Further, the server uses the Gearman task distribution framework and, through the interfaces Gearman provides, exchanges data and related information with the client and with the working end, including receiving, managing and distributing the computing subtasks.

Further, the processing procedure of the working end is as follows:

3.1 The working end is responsible for receiving the tasks distributed by the server. After obtaining the relevant graph data from the database, it runs the graph computation algorithm to complete each subtask and uploads the calculation results to the data center when the computation is finished;

3.2 Multithreading is used when executing computing tasks to improve system performance, and data that could cause conflicts between threads is stored in a two-dimensional array to avoid such conflicts.
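
The patent does not say how the two-dimensional array is indexed. The Python sketch below assumes one row per thread and one column per vertex, so that each thread writes only to its own row and the rows are merged afterwards; the function name `accumulate` and the `(vertex, amount)` input format are hypothetical.

```python
import threading
from typing import List, Tuple


def accumulate(contributions: List[Tuple[int, float]],
               num_vertices: int, num_threads: int) -> List[float]:
    """Sum per-vertex contributions using one private row per thread."""
    # partial[t][v] is written only by thread t, so no lock is required.
    partial = [[0.0] * num_vertices for _ in range(num_threads)]

    def work(t: int) -> None:
        # Thread t handles every num_threads-th contribution, starting at index t.
        for vertex, amount in contributions[t::num_threads]:
            partial[t][vertex] += amount

    threads = [threading.Thread(target=work, args=(t,)) for t in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Merge the per-thread rows column by column once all threads have finished.
    return [sum(row[v] for row in partial) for v in range(num_vertices)]
```

The sketch shows only the conflict-free storage layout. In CPython the global interpreter lock limits CPU parallelism of threads, whereas the embodiment described in this document uses Java.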

The processing procedure of the data center is as follows:

4.1 A MongoDB database is selected as the data center; it is mainly responsible for receiving and managing the graph data, storing and managing the task state table, and storing and managing the calculation results;

4.2 During data transfer, the GridFS file format of the MongoDB database is used to transfer data as a whole, which gives a higher transfer rate than transferring records individually; a compression tool compresses the files stored in GridFS to reduce the amount of data transferred, further improving the transfer rate.

The beneficial effects of the invention are as follows: the overhead caused by a large amount of inter-subgraph communication is avoided, and the computing efficiency of the system is thereby improved.

Drawings

FIG. 1 is a block diagram of a distributed graph computing system based on task partitioning according to the present invention.

FIG. 2 shows the deployment and node relationship of the components of the system according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating execution of a client program according to an embodiment of the present invention.

FIG. 4 is a flowchart illustrating the execution of a working end program according to an embodiment of the present invention.

FIG. 5 is a graph comparing the performance of the embodiment of the present invention with that of common graph computation tools.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described below with reference to specific embodiments and the related drawings. It is to be understood that the following description covers only some embodiments of the present invention and is not to be taken as limiting the scope of the invention.

Referring to FIG. 1 to FIG. 5, in a distributed graph computing system based on task segmentation, the client and the working end are first designed according to the computing task, including the design and implementation of the graph computation algorithm and the related system functions. Once this design is complete, the distributed graph computing system can be built as described here and used to carry out the corresponding graph computation.

In one embodiment, based on the distributed graph computing system provided by the invention (shown in FIG. 1), an example system is built with graph centrality computation as the graph computing task; the components of the system and the relationships between its nodes are shown in FIG. 2. The system consists of 5 nodes. One node acts as the client and is responsible for dividing and uploading tasks, uploading the graph data, computing the simple parts of the task, and periodic checking. One node acts as the server and is responsible for receiving, managing and distributing tasks; the data center, a MongoDB database responsible for managing all data, is deployed on the same node. Because the working ends are the computing nodes, the more of them the better, so the working end program is deployed on all 5 nodes to compute subtasks and upload the calculation results. The five nodes have identical hardware: each has an Intel(R) Xeon(R) E5-2650 v4 CPU with a clock frequency of 2.20 GHz and 8 cores, and 8 GB of memory. Gearman and MongoDB are installed on node 2, which acts as the server and data center. In this embodiment, Java is the development language of the client and the working end, so a Java environment must be installed on the client and working end nodes. Gearman provides interfaces for multiple languages, so other languages can also be used to implement the computing tasks, in which case the corresponding development environment must be installed.

Before the workflow of the embodiment deployed and configured as above is described in detail, the graph centrality computation task is briefly explained so that the computation process is easier to follow. The task computes three important graph centrality indexes: degree centrality, closeness centrality and betweenness centrality. Degree centrality is simple to compute, so in this embodiment the client computes it directly, reducing the cost of data transfer. Closeness centrality and betweenness centrality both require shortest paths. The closeness centrality of a vertex needs only the shortest paths between that vertex and all other vertices, so it can be computed entirely within a subtask. The betweenness centrality of a vertex needs the shortest paths between every pair of vertices, so a subtask can compute only a component of the betweenness centrality, and the final value is obtained by integrating the components after all tasks have finished. Because the subject of the invention is the distributed graph computing system, the centrality algorithms are not described in detail here.
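
Since the patent leaves the centrality algorithms unspecified, the following Python sketch swaps in a Brandes-style accumulation on an unweighted graph as one well-known way to obtain per-source components that can be summed later. It assumes every vertex appears as a key of the adjacency dictionary, and it uses one common closeness definition (reached vertices minus one, divided by the sum of distances). All function names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Tuple

Graph = Dict[int, List[int]]


def single_source(graph: Graph, s: int) -> Tuple[Dict[int, int], Dict[int, float]]:
    """BFS from s; return the distances and the betweenness component contributed by s."""
    dist = {s: 0}
    sigma = {v: 0 for v in graph}   # number of shortest paths from s to v
    sigma[s] = 1
    preds: Dict[int, List[int]] = {v: [] for v in graph}
    order = []
    queue = deque([s])
    while queue:
        v = queue.popleft()
        order.append(v)
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
            if dist[w] == dist[v] + 1:
                sigma[w] += sigma[v]
                preds[w].append(v)
    # Back-propagate dependencies from the farthest vertices towards s.
    delta = {v: 0.0 for v in graph}
    for w in reversed(order):
        for v in preds[w]:
            delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
    delta[s] = 0.0  # the source itself is not an intermediary on its own paths
    return dist, delta


def closeness(dist: Dict[int, int]) -> float:
    """Closeness of the BFS source: (reached vertices - 1) / sum of distances."""
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def integrate(components: List[Dict[int, float]]) -> Dict[int, float]:
    """Sum the per-source components into final betweenness values."""
    result: Dict[int, float] = {}
    for component in components:
        for v, value in component.items():
            result[v] = result.get(v, 0.0) + value
    return result
```

In this sketch a subtask would call `single_source` for each vertex assigned to it, keep the closeness value, and hand its betweenness components to the final `integrate` step. The summed values count ordered vertex pairs, so an undirected graph would need them halved.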

In an embodiment, the execution flow of the client program is shown in FIG. 3, and the processing procedure is as follows: 1.1 Obtain the graph data, convert it into adjacency list form, and upload it to the data center, the adjacency list being compressed with a compression tool and then transferred as a GridFS file; 1.2 A uniform division strategy is used when dividing the computing task, so that the computation of each subtask on the computing nodes tends to be the same; 1.3 After task segmentation is completed, generate a task list and upload it to the server, and generate a task state table and upload it to the data center; 1.4 Complete some simple computing tasks and upload the results directly to the data center; 1.5 After the above steps are completed, periodically check the task state table to detect whether all computing tasks have finished, and once all subtasks are complete, integrate all calculation results and upload them to the data center for storage.

Specifically, the client reads the graph data from the current path, using the graph name given as a startup parameter. While reading the data record by record, it builds the adjacency list form of the graph data and at the same time runs the degree centrality algorithm to compute the degree centrality of each vertex. After reading is finished, vertices with no successor vertex have no entry in the adjacency list and vertices with a degree centrality of 0 have no entry in the degree centrality results; these missing entries are filled in when the vertices are traversed for task division. Using the startup parameters (the number of task partitions and the number of working end threads) together with the graph data, the client determines the number of vertices in each subtask by even division, thereby completing the task division and, at the same time, filling in the adjacency list and the degree centrality results.
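
The reading and fill-in pass can be sketched as follows in Python. The sketch assumes directed edges given as `(u, v)` pairs, vertex ids numbered from 0 to `num_vertices - 1`, and an unnormalized degree count (in-degree plus out-degree); none of these details is stated in the patent, and `build_adjacency` is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple


def build_adjacency(edges: Iterable[Tuple[int, int]],
                    num_vertices: int) -> Tuple[Dict[int, List[int]], Dict[int, int]]:
    """Build the adjacency list and degree counts in a single pass over the edges."""
    adjacency: Dict[int, List[int]] = {}
    degree: Dict[int, int] = {}
    for u, v in edges:  # one directed edge u -> v per record
        adjacency.setdefault(u, []).append(v)
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Fill-in pass: vertices without successors get an empty list,
    # and vertices that never appeared in an edge get degree 0.
    for vertex in range(num_vertices):
        adjacency.setdefault(vertex, [])
        degree.setdefault(vertex, 0)
    return adjacency, degree
```

For example, `build_adjacency([(0, 1), (0, 2), (1, 2)], 4)` yields an empty successor list for vertices 2 and 3 and a degree of 0 for the isolated vertex 3.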

The client connects to the database using the MongoDB IP address and port given as startup parameters, builds the task state table from the task division result, and uploads it to the database. The adjacency list representation of the graph data is compressed with Snappy and uploaded to the database as a GridFS file. The degree centrality results are uploaded directly to the database as documents for storage. The client connects to the server using the IP address and port of the Gearman server, compresses the divided task list with Snappy, and submits it to the server.
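
The idea of serializing the whole adjacency list into one compressed blob before upload can be sketched in Python. The standard-library `zlib` module stands in for Snappy here, JSON is an assumed serialization format, and the GridFS upload itself is omitted; `pack_adjacency` and `unpack_adjacency` are hypothetical names.

```python
import json
import zlib
from typing import Dict, List


def pack_adjacency(adjacency: Dict[int, List[int]]) -> bytes:
    """Serialize the whole adjacency list to JSON and compress it into one blob."""
    text = json.dumps({str(k): v for k, v in adjacency.items()}, sort_keys=True)
    return zlib.compress(text.encode("utf-8"))  # zlib is a stand-in for Snappy


def unpack_adjacency(blob: bytes) -> Dict[int, List[int]]:
    """Reverse pack_adjacency: decompress, parse, and restore integer vertex ids."""
    data = json.loads(zlib.decompress(blob).decode("utf-8"))
    return {int(k): v for k, v in data.items()}
```

A working end would call `unpack_adjacency` on the downloaded blob to recover the same structure that the client packed.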

The client periodically checks the task state table in the database at the interval given by the detection-interval startup parameter. Once the query shows that all computing subtasks are complete, it downloads the betweenness centrality components from the database and decompresses them. After integrating them, it creates the list of final betweenness centrality result documents and uploads it to the database for storage, which ends the computing task.
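
The periodic check can be sketched as a simple polling loop in Python. The callable `read_state_table` is a hypothetical stand-in for the database query, the status string `"completed"` is an assumed value, and `max_checks` is an added safeguard that the patent does not mention.

```python
import time
from typing import Callable, Dict, Optional


def wait_until_done(read_state_table: Callable[[], Dict[int, str]],
                    interval_seconds: float, max_checks: int) -> Optional[int]:
    """Poll the task state table; return the number of checks needed, or None on give-up."""
    for check in range(1, max_checks + 1):
        states = read_state_table()
        if all(state == "completed" for state in states.values()):
            return check
        time.sleep(interval_seconds)  # wait for the configured detection interval
    return None
```

Once the function returns a number, the client would go on to download and integrate the components as described above.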

In an embodiment, the execution flow of the working end program is shown in FIG. 4, and the processing procedure is as follows: 3.1 Receive the computing subtask from the server and decompress it; using the data in the task, obtain the adjacency list, the task number and other information, run the graph centrality algorithm, and update the task state table when the computation finishes; upload the closeness centrality results and the betweenness centrality components to the database, and update the task state table; 3.2 Multithreading is used when executing computing tasks to improve system performance, and data that could cause conflicts between threads is stored in a two-dimensional array to avoid such conflicts.

Specifically, the working end connects to the server using the Gearman server IP address and port given as startup parameters, and whenever it is idle it sends the server a request for a task. After obtaining a task, the working end decompresses it to obtain the MongoDB IP address and port, the task number, the thread count, the vertex set, the graph type and other information, and then carries out the subsequent computation and upload.

The working end connects to the database using the MongoDB IP address and port, and updates the task's state in the task state table to "calculation started". It downloads and decompresses the adjacency list needed for the computation, and selects the centrality algorithm to run according to the graph type (weighted or unweighted). The centrality computation is multithreaded, with the thread count taken from the corresponding parameter in the task; because the task is compute-intensive, the thread count chosen in this embodiment equals the number of CPU cores, namely 8. After the centrality computation is complete, the task's state in the task state table is updated to "calculation completed".

After the computation finishes, the closeness centrality results are uploaded directly to the database for storage as a list of centrality documents, while the betweenness centrality components are compressed with Snappy and uploaded to the database through GridFS, to be integrated once all tasks have finished. After the centrality results have been uploaded, the task's state in the task state table is updated to "calculation completed". The working end then sends another task request to the server and repeats the same operations until all computing tasks are complete.
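
The request, compute, upload and repeat cycle of a working end can be sketched in Python with in-memory stand-ins: a `queue.Queue` replaces the Gearman server, and two dictionaries replace the task state table and the result store in MongoDB. The function name `run_worker`, the task fields and the status strings are hypothetical.

```python
import queue
from typing import Callable, Dict


def run_worker(tasks: queue.Queue, state_table: Dict[int, str],
               results: Dict[int, object],
               compute: Callable[[dict], object]) -> int:
    """Take tasks until none remain; return how many tasks this worker processed."""
    processed = 0
    while True:
        try:
            task = tasks.get_nowait()   # stand-in for requesting a task from the server
        except queue.Empty:
            return processed            # no tasks left: the worker stops
        task_id = task["task_id"]
        state_table[task_id] = "started"
        results[task_id] = compute(task)   # stand-in for the centrality computation and upload
        state_table[task_id] = "completed"
        processed += 1
```

In a real deployment the worker would block waiting for new jobs rather than exit when the queue is empty; the early return only keeps the sketch finite.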

The description above, by walking through the operation of the client and the working end, details the functions they implement and the information they exchange with each other, with the server and with the data center. The embodiment covers essentially the whole process and embodies the features of the system of the invention, including but not limited to: the system is based on task segmentation; the system consists of a client, a server, a working end and a data center; the server uses the Gearman task distribution framework and exchanges information with the client and the working end through its interfaces; the tasks for the working end are divided using a uniform division strategy; the working end uses multithreading and stores part of its data in a two-dimensional array to avoid the data conflicts that can arise between threads; the data center uses a MongoDB database, and data that must be transferred frequently is compressed with a compression tool and transferred as GridFS files, reducing the amount of data transferred.

In this embodiment, the open-source data sets email-Enron and Epinions are used as the computation objects for graph centrality computation. The calculation results are the same as those of common graph computation tools, and the comparison is shown in FIG. 5. GraphX, which is also a distributed graph computation method, performs far worse than the embodiment of the invention: because it distributes the graph data to the computing nodes by graph partitioning, data must be transferred continually during the centrality computation, which costs a large amount of time, whereas the system of the invention avoids most of this data transfer through task segmentation. The system therefore performs particularly well on computing tasks with a high degree of data dependency between subgraphs.

The above examples only illustrate the system proposed by the present invention and should not be construed as limiting the invention. Those of ordinary skill in the art will appreciate that the system is applicable to many graph computing tasks beyond those described above. Modifications and equivalents of the embodiments described in the examples are possible without departing from the spirit of the invention and fall within the scope of its claims.

Claims

1. A distributed graph computing system based on task segmentation, the system comprising:

the data center is used for managing the graph data, the task state table and the calculation result processed by the client;

the server side uses a Gearman task distribution framework, the client side and the working side are designed according to a calculation task, and the data center adopts a MongoDB database;

the processing process of the client side comprises the following steps:

1.1 Obtaining graph data, converting the graph data into adjacency list form, and uploading the graph data to the data center, wherein in the process the adjacency list is compressed by a compression tool and then transferred as a GridFS file;

1.2 Using the startup parameters, namely the number of task partitions and the number of working end threads, together with the graph data, the client determines by even division how the vertices are divided among the subtasks, thereby completing the division of the tasks and at the same time filling in the adjacency list and the degree centrality calculation results, so that the computation of each subtask on the computing nodes tends to be the same;

1.3 After the task segmentation is completed, generating a task list and uploading it to the server, and generating a task state table and uploading it to the data center;

1.4 Completing some simple computing tasks and uploading the calculation results directly to the data center;

1.5 After the above steps are completed, periodically detecting through the task state table whether all computing tasks are completed; when the query shows that all computing subtasks are completed, downloading the betweenness centrality components from the database and decompressing them, integrating the betweenness centrality components, creating a list of final betweenness centrality result documents and uploading it to the database for storage, integrating all calculation results and uploading them to the data center for storage, whereupon the computing task is completed.

2. The distributed graph computing system based on task segmentation as claimed in claim 1, wherein the server uses a Gearman task distribution framework to complete interaction of data and related information between the client and the working end through interfaces provided by Gearman, including receiving, managing and distributing computing subtasks.

3. The distributed graph computing system based on task segmentation as claimed in claim 1, wherein the processing procedure of the working end is:

4. The distributed graph computing system based on task segmentation as claimed in claim 1, wherein the processing procedure of the data center is:

4.1 Selecting a MongoDB database as the data center, which is mainly responsible for receiving and managing the graph data, storing and managing the task state table, and storing and managing the calculation results;

4.2 During data transfer, the GridFS file format of the MongoDB database is used to transfer data as a whole, which gives a higher transfer rate than transferring records individually, and a compression tool is used to compress the files stored in GridFS.