CN107247628B - Data flow program task dividing and scheduling method for multi-core system


Info

Publication number
CN107247628B
CN107247628B
Authority
CN
China
Prior art keywords
node, subgraph, nodes, workload, sub
Prior art date
Legal status
Active
Application number
CN201710480622.2A
Other languages
Chinese (zh)
Other versions
CN107247628A (en)
Inventor
于俊清
汪亮
何云峰
唐九飞
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date: 2017-06-22
Filing date: 2017-06-22
Publication date: 2019-12-20
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710480622.2A
Publication of CN107247628A
Application granted
Publication of CN107247628B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/505: Allocation of resources to service a request, the resource being a machine, considering the load
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication

Abstract

The invention discloses a data flow program task partitioning and scheduling method for a multi-core system, which mainly comprises a data flow graph node-splitting algorithm, a GAP task-partitioning algorithm, a software pipeline scheduling model, and a double-buffer mechanism for data flow graph nodes. The method maximizes program parallelism by exploiting the data parallelism, task parallelism, and software pipeline parallelism inherent in the data flow programming model, and schedules the data flow program according to the characteristics of the multi-core architecture, so that the performance of the multi-core processor is fully exploited.

Description

Data flow program task dividing and scheduling method for multi-core system
Technical Field
The invention belongs to the technical field of compiler technology, and in particular relates to a data flow program task partitioning and scheduling method for a multi-core system.
Background
With the popularization of intelligent terminals, streaming media such as text, images, audio, and video have made data grow explosively, and the spread of technologies such as big data and cloud computing places ever higher demands on the processing speed of computers. Simply raising the clock frequency of a CPU now runs into problems such as manufacturing difficulty and high power consumption, and Moore's law no longer applies. Major chip manufacturers have instead turned to integrating multiple cores on a single CPU to improve processor performance; with advantages such as high speed and low power consumption, multi-core processors have become the mainstream. Multi-core processors have greatly improved computing capability, but their multi-core advantage is still not fully utilized. Researchers are therefore actively looking for new parallel programming models that can extract higher performance from multi-core processors.
Typical parallel programming models, such as Pthreads, MPI, and OpenMP, require the programmer to determine the parallel execution order of a program and then construct the parallel program statically. When a static parallel program executes, many complications may arise that burden the program, such as data races, memory-consistency issues, and deadlocks; if they are not handled, the program may crash or produce erroneous results. Although much research over the past decade has addressed the shortcomings of these models, they still require programmers to master this systems knowledge and to carry out task partitioning and scheduling, data communication, and synchronization design according to the underlying architecture, which greatly increases the programmers' difficulty and workload.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, an object of the present invention is to provide a data flow program task partitioning and scheduling method for a multi-core system, so as to solve the technical problem of the low execution performance of data flow programs in the prior art.
To achieve the above object, according to an aspect of the present invention, there is provided a data flow program task partitioning and scheduling method for a multi-core system, including the following steps:
(1) performing workload statistics on the nodes of a data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler;
(2) initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph;
(3) performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph;
(4) moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself;
(5) distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
Preferably, splitting stateless target nodes horizontally and splitting stateful target nodes vertically comprises:
for a stateless target node, if its number of runs exceeds a first preset value, splitting the target node horizontally into a plurality of identical nodes, each split node processing one task, the split nodes being in a parallel relationship;
and for a stateful target node, if its number of runs exceeds a second preset value, splitting the target node vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
Preferably, step (2) specifically comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi (i ≠ k), randomly selecting a node from Vk and adding it to a candidate set;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if any empty subgraph remains unconstructed, jumping back to step (2.2).
Preferably, in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
Preferably, performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
Preferably, the method further comprises:
transforming the buffers between nodes from a single buffer into a double-buffer mechanism, wherein under the double-buffer mechanism read and write operations alternate: while one buffer is being read, the other buffer is being written;
and aligning the memory layout of the double buffers to the cache-line size.
In general, compared with the prior art, the above technical solution contemplated by the present invention achieves the following beneficial effects:
(1) Task granularity is reduced and program parallelism is improved. The invention fully exploits the three forms of parallelism of a data flow program, namely software pipeline parallelism, task parallelism, and data parallelism, splitting the nodes of the data flow graph horizontally and vertically in preparation for the balanced partitioning of the data flow program.
(2) Task partitioning is more balanced. Through initial partitioning, load-balancing optimization, and communication-edge optimization, the task partitioning of the data flow graph becomes more balanced and the communication traffic between subgraphs becomes smaller.
(3) The efficiency of the inter-node buffers is improved. The invention improves the buffers between nodes, using double buffers to avoid false sharing, and aligns the buffers to the cache-line size to improve the cache hit rate.
Drawings
Fig. 1 is a schematic flowchart of a data flow program task partitioning and scheduling method for a multi-core system according to an embodiment of the present invention;
Fig. 2 illustrates the node-splitting algorithm of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 2(a) is a schematic diagram of horizontal task splitting and Fig. 2(b) is a schematic diagram of vertical task splitting;
Fig. 3 illustrates the task-partitioning algorithm of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 3(a) is a schematic diagram of the initial partition, Fig. 3(b) shows load-balancing optimization that moves nodes to the subgraph with the minimum workload, and Fig. 3(c) shows load-balancing optimization that moves nodes to an adjacent subgraph;
Fig. 4 is a schematic diagram of the software-pipelined execution of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 4(a) shows the software pipeline schedule and Fig. 4(b) shows inter-node cache optimization.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The data flow programming model uses a data flow graph for its logical representation and is an efficient parallel programming model that has proven well suited to big-data computation. Because a data flow program contains software pipeline parallelism, task parallelism, and data parallelism, it lends itself to efficient parallel computing. Multi-core processors are the main execution platform of data flow programs, and the separation of computation from communication that characterizes data flow programs makes them especially suitable for multi-core systems. How to use the data flow programming model to perform task partitioning, scheduling, and data communication for a data flow program, by combining the parallelism of the program with the characteristics of the multi-core system, is a major problem to be solved at present.
The invention combines the task partitioning and scheduling of the data flow program with the characteristics of the multi-core architecture and realizes a three-stage optimization process for the data flow program, namely node splitting, task partitioning, and task scheduling, improving the execution performance of the data flow program on the target platform.
Fig. 1 is a schematic flowchart of the data flow program task partitioning and scheduling method for a multi-core system according to an embodiment of the present invention. The optimization method adopted by the invention takes the synchronous data flow graph generated by the front end of the data flow compiler as input, performs the three stages of node splitting, task partitioning, and task scheduling on it in sequence, and finally generates executable code. The specific steps are as follows:
(1) Node splitting: performing workload statistics on the nodes of the data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler.
Splitting a node comprises: for a stateless target node, if its number of runs exceeds a first preset value, splitting it horizontally into a plurality of identical nodes, each split node processing one task, the split nodes running in parallel; and for a stateful target node, if its number of runs exceeds a second preset value, splitting it vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
Specifically, large stateless nodes are split horizontally using a split-join structure, and large stateful nodes are split vertically using a pipeline structure. This reduces the task granularity of the data flow program on the multi-core system and increases its parallelism. The specific steps are as follows:
(1.1) Horizontal splitting. For a stateless node whose number of runs is too large and whose task amount is too heavy, the node can be split horizontally into a plurality of identical nodes, each of which processes the task once, thereby reducing the task granularity. In engineering practice this can be implemented with a split-join structure: as shown in Fig. 2(a), the number of runs of bigOperator is N, so it can be split by its number of runs into N identical child nodes, each child node processing one task. A split node of the form roundrobin(pop) is used upstream to distribute data to all the child nodes in turn, handing out pop data items at a time, where pop equals the pop value of bigOperator. A join node of the form roundrobin(push) is used downstream to receive in turn the data output by all the child nodes, receiving push data items at a time, where push equals the push value of bigOperator. After horizontal splitting, the multiple tasks that originally belonged to one node are distributed over multiple functionally identical and mutually independent nodes, which reduces the task granularity and facilitates the later task partitioning. Since the split nodes are independent of one another, task parallelism is achieved between them.
(1.2) Vertical splitting. As shown in Fig. 2(b), if a node is stateful and needs to run multiple times, each run of the node depends on the previous run. The node is still split into multiple functionally identical nodes, each of the split child nodes processing one task, and the child nodes are then connected in series using a pipeline structure. Because there are data dependences between these nodes, the nodes must be modified: the definitions of the dependent variables are moved into the enclosing composite, so that these dependent variables are shared by all the child nodes.
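The node-splitting step can be illustrated with a minimal Python sketch; the node representation and every name in it (Node, split_horizontal, split_vertical, the run-count threshold) are assumptions made for this example rather than the patent's actual implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    workload: int                      # estimated work per activation
    runs: int                          # activations per steady-state iteration
    stateful: bool = False
    children: List["Node"] = field(default_factory=list)

def split_horizontal(node: Node, threshold: int) -> List[Node]:
    """Split a stateless node into `runs` identical parallel copies framed by
    roundrobin-style split and join nodes, as in Fig. 2(a)."""
    if node.stateful or node.runs <= threshold:
        return [node]
    split = Node(f"{node.name}_split", workload=0, runs=node.runs)
    join = Node(f"{node.name}_join", workload=0, runs=node.runs)
    copies = [Node(f"{node.name}_{i}", node.workload, runs=1)
              for i in range(node.runs)]
    split.children = copies            # the split node feeds every copy in turn
    for c in copies:
        c.children = [join]            # copies are independent and run in parallel
    return [split, *copies, join]

def split_vertical(node: Node, threshold: int) -> List[Node]:
    """Split a stateful node into a pipeline of functionally identical stages;
    the shared state variables would be hoisted into the enclosing composite."""
    if not node.stateful or node.runs <= threshold:
        return [node]
    stages = [Node(f"{node.name}_s{i}", node.workload, runs=1, stateful=True)
              for i in range(node.runs)]
    for prev, nxt in zip(stages, stages[1:]):
        prev.children = [nxt]          # stages are connected in series
    return stages
```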
(2) Initial partitioning: initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph.
Step (2) comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi (i ≠ k), randomly selecting a node from Vk and adding it to a candidate set;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if any empty subgraph remains unconstructed, jumping back to step (2.2).
Specifically, the initial partitioning is implemented as follows: as shown in Fig. 3(a), assume V1, V2, ..., Vk are the subgraphs to be constructed by the initial partitioning. First the average weight of all subgraphs, we = Wsum/k, is obtained, where Wsum represents the total workload of the k subgraphs, all nodes are moved into one subgraph Vk, and the remaining subgraphs are left empty. A number of nodes are selected from all the nodes to form a candidate set, and the remaining empty subgraphs are then constructed one by one with a greedy strategy: at each step the node with the maximum gain-function value is selected from the candidate set and moved, the gain function being
gain(v) = Σ_{u ∈ Vi} w(u, v) − Σ_{u′ ∈ Vk} w(u′, v),
where v represents a node in the candidate set, u represents a node in the empty subgraph Vi, u′ represents a node in subgraph Vk, w(u, v) represents the weight of the edge connecting u and v, and w(u′, v) represents the weight of the edge connecting u′ and v. Each subgraph Vi is constructed as follows: subgraph Vi is initialized to be empty; a node is first randomly selected from subgraph Vk and added to the candidate set, the randomness serving to ensure diverse solutions; the node v with the maximum gain value is then selected from the candidate set and added to subgraph Vi, and all nodes connected to v that still belong to Vk are added to the candidate set. Whenever a node enters the candidate set, the gain-function values of the nodes in the candidate set are updated synchronously, so that they can be read directly at the next move. Subgraph Vi is complete once its workload is no longer less than we. The remaining empty subgraphs are finally constructed by the same method.
The node vmax with the maximum gain-function value is obtained as follows: first, for each node, the sum of the weights of all its edges into subgraph Vi is calculated; then the sum of the weights of all its edges into Vk is calculated; finally the difference between the two sums is taken, and the node with the largest difference is the node to be moved.
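The greedy construction of step (2) and its gain function can be sketched as follows; the graph representation (an adjacency map w of edge weights and a workload map) and the function names are illustrative assumptions, not the patent's code:

```python
import random

def gain(v, V_i, V_k, w):
    """gain(v) = (edge weight from v into V_i) - (edge weight from v into V_k)."""
    edges = w.get(v, {})
    return (sum(wt for u, wt in edges.items() if u in V_i)
            - sum(wt for u, wt in edges.items() if u in V_k))

def initial_partition(nodes, w, workload, k):
    Wsum = sum(workload[v] for v in nodes)
    we = Wsum / k                              # average subgraph weight
    V_k = set(nodes)                           # all nodes start in one subgraph
    parts = []
    for _ in range(k - 1):                     # greedily build the k-1 empty subgraphs
        if not V_k:
            break
        V_i = set()
        cand = {random.choice(sorted(V_k))}    # random seed keeps solutions diverse
        while cand and sum(workload[v] for v in V_i) < we:
            v = max(cand, key=lambda x: gain(x, V_i, V_k, w))
            cand.discard(v)
            V_i.add(v)
            V_k.discard(v)
            cand |= {u for u in w.get(v, {}) if u in V_k}  # neighbours still in V_k
        parts.append(V_i)
    parts.append(V_k)                          # leftover nodes form the last subgraph
    return parts
```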
(3) Load-balancing optimization: performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph.
Performing load-balancing optimization on the k constructed subgraphs by moving nodes from the maximum-workload subgraph to the minimum-workload subgraph comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
Specifically, the load-balancing optimization is implemented as follows:
As shown in Fig. 3(b), the load-balancing optimization algorithm that moves nodes to the minimum subgraph first traverses all nodes of the maximum subgraph and, for each node, tries moving it to the minimum subgraph; if the balance factor, i.e. the workload of the maximum subgraph divided by we, decreases after the move, the move is kept, and otherwise the move is undone. After a move, the loads of the maximum and minimum subgraphs change: the load of the maximum subgraph decreases by the workload of the moved node, and the load of the minimum subgraph increases by the same amount. After each successful move, a new maximum subgraph and a new minimum subgraph are selected and the same move operation is applied; the optimization algorithm terminates once all nodes of the maximum subgraph have been traversed and the balance factor no longer decreases, that is, once every attempted move of a node of the new maximum subgraph is undone.
As shown in Fig. 3(c), the load-balancing optimization algorithm that moves nodes to adjacent subgraphs likewise first traverses all nodes of the maximum subgraph and, for each node, tries moving it to the subgraphs adjacent to it; if there are several adjacent subgraphs, several moves are tried, yielding several temporary solutions. For each temporary solution, if both its balance factor and its communication traffic decrease, the current partition is replaced by that temporary solution; the move operation then continues from the temporary solutions, and a moved node is not moved back to its original subgraph. If the balance factor does not decrease after a move, the move is cancelled. As before, the maximum subgraph may change after a move, so it is re-selected in the next round, until the maximum subgraph no longer changes and all its nodes have been traversed.
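A sketch of the move-to-minimum pass follows, under the same illustrative representation; estimating the new balance factor from only the two subgraphs that change is a simplification made in this sketch, not a detail stated above:

```python
def balance_to_minimum(parts, workload, we):
    """Repeatedly try to move nodes from the heaviest to the lightest subgraph,
    keeping a move only if the balance factor (max-subgraph load / we) drops."""
    def load(p):
        return sum(workload[v] for v in p)
    improved = True
    while improved:
        improved = False
        big = max(parts, key=load)
        small = min(parts, key=load)
        for v in list(big):
            old_factor = load(big) / we
            # after the move, the heavier of the two changed subgraphs bounds
            # the new balance factor (a simplification for this sketch)
            new_factor = max(load(big) - workload[v],
                             load(small) + workload[v]) / we
            if new_factor < old_factor:
                big.remove(v)
                small.add(v)
                improved = True
                break                  # re-select the heaviest and lightest subgraphs
    return parts
```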
(4) Inter-subgraph communication-edge optimization: moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself.
Performing load-balancing optimization on the k constructed subgraphs by moving nodes from the maximum-workload subgraph to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
Specifically, the inter-subgraph communication-edge optimization is implemented as follows:
In the data flow graph, the communication traffic between subgraphs is the total weight of the edges between adjacent subgraphs. According to where all its neighbouring nodes lie, a node can be classified into one of three types:
1. Internal node: if all the neighbouring nodes of a node are in the same subgraph as the node, it is called an internal node; internal nodes are to be protected and must not be moved.
2. Boundary node: if some of the neighbouring nodes of a node are in the same subgraph as the node and the others are in other subgraphs, it is called a boundary node; boundary nodes may be considered for moving.
3. Orphan node: if none of the neighbouring nodes of a node is in the same subgraph as the node, it is called an orphan node; orphan nodes are the first to be considered for moving.
For orphan nodes, a merging strategy is provided: on the premise of not affecting load balance, moving an orphan node into the subgraph where one of its neighbouring nodes is located effectively reduces the inter-subgraph communication traffic. If its neighbouring nodes lie in several subgraphs, several moves are tried and the move that minimizes the communication traffic is chosen.
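The three-way node classification and the orphan-merging strategy can be sketched as follows; the concrete balance test used here (the destination's load may not exceed the average weight we) is one plausible reading of "not affecting load balance", not a rule stated in the text:

```python
def classify(v, part, w):
    """Classify a node by where its neighbours live: 'internal' nodes are
    protected, 'boundary' nodes may move, 'orphan' nodes move first."""
    neigh = set(w.get(v, {}))
    inside = neigh & part
    if inside == neigh:
        return "internal"
    if not inside:
        return "orphan"
    return "boundary"

def merge_orphans(parts, w, workload, we):
    """Move each orphan node into the neighbouring subgraph that absorbs the
    most edge weight, subject to the (assumed) balance test."""
    for part in parts:
        for v in list(part):
            if classify(v, part, w) != "orphan":
                continue
            best, best_cut = None, 0
            for other in parts:
                if other is part:
                    continue
                cut = sum(wt for u, wt in w.get(v, {}).items() if u in other)
                balanced = sum(workload[x] for x in other) + workload[v] <= we
                if cut > best_cut and balanced:
                    best, best_cut = other, cut
            if best is not None:
                part.remove(v)          # merge the orphan into the subgraph
                best.add(v)             # that cuts the most communication
    return parts
```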
(5) Task scheduling: distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
In another embodiment, software pipeline scheduling and cache optimization are performed on the task-partitioning result. The software pipeline model in a multi-core environment is essentially the cooperative operation of the processor cores, the computing nodes, and the buffers between the computing nodes. To maximize the performance of the software pipeline, load balance between the pipeline stages is the key factor. Since load balance has already been ensured during the task partitioning of the data flow program, the scheduling mainly concerns the scheduling order of the nodes and the buffers between them. The specific steps are as follows:
(A) Software pipeline scheduling
Stage assignment is performed on the partitioned nodes to obtain the execution order of the nodes.
The task subgraphs are distributed to the corresponding processor cores according to the partitioning result, and the schedule is divided into three phases: a filling phase, a steady-state phase, and a draining phase. The filling phase is the start-up phase of program execution, in which data is gradually accumulated for the steady-state phase and only some of the nodes run. The steady-state phase is entered once enough data has accumulated for steady-state execution; in this phase, the computing tasks within the same pipeline cycle have no data dependence on one another and can run fully in parallel, because the data each task needs was placed into its buffer during the previous pipeline cycle. The draining phase is the finishing phase of program execution, in which the computing nodes stop one after another.
On the left of Fig. 4(a) is a simple data flow graph in which each downstream node depends on the data generated by its upstream node, so within the same pipeline cycle these nodes cannot run at the same scheduling time. To let these nodes run in parallel, they are isolated in space, by placing them on different processor cores, and staggered in time, so that they run at different scheduling times within the same pipeline cycle.
The right side of Fig. 4(a) shows the node schedule. In the first pipeline cycle, node A starts to run and, when it finishes, puts the result it generates into the buffer between A and B; the buffer transmits the data to computing node B through DMA (Direct Memory Access). In the second pipeline cycle, node A proceeds to its next scheduling, and because the data node B needs was placed into the buffer by node A in the previous cycle, node B runs at the same time as node A. Likewise, after A finishes, it overwrites the A-B buffer with new data, and the data generated by B is written into the buffer between B and C. In the third pipeline cycle, node C reads the data written by node B in the second cycle and starts to run, so from the third pipeline cycle on all three nodes run on their cores, achieving software pipeline parallelism. Since the program is scheduled 6 times, the three nodes run together up to the sixth pipeline cycle. From the seventh pipeline cycle, node A has finished executing and no new data enters the A-B buffer, but nodes B and C still run, because they use the data generated in the previous cycle. In the eighth pipeline cycle, node B finds no data in its upstream buffer and therefore ends its run, so only node C runs; node C also stops after the eighth pipeline cycle.
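The fill, steady-state, and drain behaviour described above can be reproduced with a toy schedule simulator; the firing rule (stage s first fires in pipeline cycle s+1) is the one implied by Fig. 4(a), and the code is illustrative only:

```python
def pipeline_schedule(chain, schedules):
    """Print which nodes of a linear chain run in each pipeline cycle."""
    depth = len(chain)
    total = schedules + depth - 1            # fill + steady-state + drain cycles
    for cycle in range(1, total + 1):
        active = [node for stage, node in enumerate(chain)
                  if 1 <= cycle - stage <= schedules]  # stage s first fires at s+1
        print(f"pipeline cycle {cycle}: {' '.join(active)}")

pipeline_schedule(["A", "B", "C"], schedules=6)
# cycle 1: A            (filling)
# cycle 2: A B          (filling)
# cycles 3-6: A B C     (steady state, fully parallel on different cores)
# cycle 7: B C          (draining)
# cycle 8: C            (draining)
```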
(B) Buffer optimization
The buffers of the current data flow programming model adopt a shared-memory single-buffer mechanism with a fixed buffer length. During software pipeline scheduling, to keep the data consistent, the upstream node writes the buffer while the downstream node reads it; the difference is that the upstream node operates on the data of the current cycle while the downstream node operates on the data the upstream node produced in the previous cycle. When the upstream and downstream nodes are located on different processor cores, they communicate through shared memory. If the two nodes frequently read and write the same cache line of the buffer, false sharing occurs, which greatly increases the threads' memory access time and hurts the execution efficiency of the program.
To counter false sharing, the buffer is first restructured into a double-buffer mechanism, which prevents the upstream and downstream nodes from reading and writing the same buffer. As shown in Fig. 4(b), node A is the upstream node: when A writes data, it writes Buffer1, and after each write its pointer flips to Buffer2, which it then starts writing. Node B is the downstream node: while A writes Buffer1, B's pointer points to Buffer2; if Buffer2 holds data, B reads it, and if not, B enters a synchronous waiting state. After A finishes writing Buffer1 and switches to Buffer2, B's pointer likewise flips to Buffer1, which now holds the data A wrote; thus A writes data to Buffer2 while B reads data from Buffer1. This guarantees that nodes A and B always operate on data in different buffers.
Second, the memory layout of the double buffers is aligned to the cache-line size, which increases the cache hit rate. Because the multi-core processor reads and writes the cache markedly faster than main memory, if the computing nodes' reads and writes can be served directly from the buffer in cache, the communication speed between the computing nodes is greatly increased.
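A minimal ping-pong sketch of the double-buffer mechanism follows. Python cannot control allocation alignment, so the 64-byte cache line and the rounding helper merely model what a C implementation would obtain with, e.g., posix_memalign; the synchronous-wait handshake between producer and consumer is also omitted:

```python
CACHE_LINE = 64          # assumed cache-line size in bytes

def aligned_len(nbytes, line=CACHE_LINE):
    """Round a buffer length up to a whole number of cache lines."""
    return (nbytes + line - 1) // line * line

class DoubleBuffer:
    def __init__(self, nbytes):
        size = aligned_len(nbytes)
        self.bufs = [bytearray(size), bytearray(size)]
        self.write_idx = 0                   # the producer writes here ...

    def produce(self, data):
        buf = self.bufs[self.write_idx]
        buf[:len(data)] = data               # upstream node fills one buffer
        self.write_idx ^= 1                  # then flips to the other buffer

    def consume(self):
        # ... while the consumer reads the buffer the producer is NOT writing,
        # so the two nodes never touch the same cache line (no false sharing)
        return bytes(self.bufs[self.write_idx ^ 1])
```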
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A data flow program task dividing and scheduling method for a multi-core system, characterized by comprising the following steps:
(1) performing workload statistics on the nodes of a data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler;
(2) initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph;
(3) performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph;
(4) moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself;
(5) distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
2. The method of claim 1, wherein splitting stateless target nodes horizontally and splitting stateful target nodes vertically comprises:
for a stateless target node, if its number of runs exceeds a first preset value, splitting the target node horizontally into a plurality of identical nodes, each split node processing one task, the split nodes being in a parallel relationship;
and for a stateful target node, if its number of runs exceeds a second preset value, splitting the target node vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
3. The method according to claim 1 or 2, characterized in that step (2) specifically comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi, randomly selecting a node from Vk and adding it to a candidate set, wherein i ≠ k;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if so, ending step (2); otherwise, if any empty subgraph remains unconstructed, jumping back to step (2.2).
4. The method of claim 3, wherein in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
5. The method of claim 3, wherein in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
transforming the buffers between nodes from a single buffer into a double-buffer mechanism, wherein under the double-buffer mechanism read and write operations alternate: while one buffer is being read, the other buffer is being written;
and aligning the memory layout of the double buffers to the cache-line size.
Application CN201710480622.2A, priority date 2017-06-22, filed 2017-06-22: Data flow program task dividing and scheduling method for multi-core system. Granted as CN107247628B (Active).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710480622.2A | 2017-06-22 | 2017-06-22 | Data flow program task dividing and scheduling method for multi-core system


Publications (2)

Publication Number | Publication Date
CN107247628A (en) | 2017-10-13
CN107247628B (en) | 2019-12-20

Family

ID=60019400

Family Applications (1)

Application Number | Title | Status
CN201710480622.2A (granted as CN107247628B) | Data flow program task dividing and scheduling method for multi-core system | Active

Country Status (1)

Country | Link
CN | CN107247628B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107888697B * 2017-11-24 2020-07-14 Beijing Aerospace Automatic Control Institute Node locking method in load balancing algorithm
CN108932172B * 2018-06-27 2021-01-19 Xi'an Jiaotong University Fine-grained shared memory communication synchronization method based on OpenMP/MPI mixed parallel CFD calculation
CN110347617A * 2019-07-03 2019-10-18 Nanjing University Function verification method of a DMA module in a multi-core SoC
CN112988367A * 2019-12-12 2021-06-18 Cambricon Technologies Corporation Limited Resource allocation method and device, computer equipment and readable storage medium
CN111598036B * 2020-05-22 2021-01-01 Guangzhou Institute of Geography Urban group geographic environment knowledge base construction method and system of distributed architecture
CN111858055B * 2020-07-23 2023-02-03 Ping An Puhui Enterprise Management Co., Ltd. Task processing method, server and storage medium
CN114168340B * 2021-12-14 2023-04-18 University of Electronic Science and Technology of China Multi-core system synchronous data flow graph instantiation concurrent scheduling method


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104662566A * 2011-10-31 2015-05-27 Applied Materials, Inc. Method and system for splitting scheduling problems into sub-problems
US9438490B2 * 2014-03-07 2016-09-06 International Business Machines Corporation Allocating operators of a streaming application to virtual machines based on monitored performance
CN104965761A * 2015-07-21 2015-10-07 Huazhong University of Science and Technology Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Gordon, M. I., et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM, 2006: 1-12. *
Wei, H., et al. Minimizing communication in rate-optimal software pipelining for stream programs. Proceedings of CGO 2010, the 8th International Symposium on Code Generation and Optimization, Toronto, Ontario, Canada, April 24-28, 2010: 1-14. *
Chen Wenbin, et al. Multi-granularity partitioning and scheduling of stream programs on a GPU/CPU hybrid architecture. Computer Engineering & Science, January 2017, 39(1): 15-24. *
Yang Qiuji, et al. A dataflow programming model and compiler optimization method for Storm. Computer Engineering & Science, December 2016, 38(12): 2409-2418. *
Yu Junqing, et al. Hierarchical pipeline parallel optimization of dataflow programs for multi-core clusters. Chinese Journal of Computers, October 2014, 37(10): 2071-2082. *

Also Published As

Publication number | Publication date
CN107247628A (en) | 2017-10-13

Similar Documents

Publication Publication Date Title
CN107247628B (en) Data flow program task dividing and scheduling method for multi-core system
JP6525286B2 (en) Processor core and processor system
WO2021057720A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
Xiao et al. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
CN105159654B (en) Integrity measurement hashing algorithm optimization method based on multi-threaded parallel
CN106055311A (en) Multi-threading Map Reduce task parallelizing method based on assembly line
CN101807144A (en) Prospective multi-threaded parallel execution optimization method
WO2020083050A1 (en) Data stream processing method and related device
CN102981807A (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103970602A (en) Data flow program scheduling method oriented to multi-core processor X86
JP2014216021A (en) Processor for batch thread processing, code generation apparatus and batch thread processing method
CN101655783B (en) Forward-looking multithreading partitioning method
CN101840329A (en) Data parallel processing method based on graph topological structure
Boechat et al. Representing and scheduling procedural generation using operator graphs
WO2018076979A1 (en) Detection method and apparatus for data dependency between instructions
CN102968295A (en) Speculation thread partitioning method based on weighting control flow diagram
Neelima et al. Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU
Jiang et al. An optimized resource scheduling strategy for Hadoop speculative execution based on non-cooperative game schemes
Huang et al. Partial flattening: a compilation technique for irregular nested parallelism on GPGPUs
Wang et al. A new scheme for cache optimization based on cluster computing framework spark
Abdolrashidi Improving Data-Dependent Parallelism in GPUs Through Programmer-Transparent Architectural Support
Gilray et al. Toward parallel cfa with datalog, mpi, and cuda
Li et al. An Optimization Method for Embarrassingly Parallel under MIC Architecture
Kuhrt et al. iGPU-Accelerated Pattern Matching on Event Streams

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant