CN107247628B - Data flow program task dividing and scheduling method for multi-core system


Info

Publication number
CN107247628B
CN107247628B
Authority
CN
China
Prior art keywords
node, subgraph, nodes, workload, sub
Prior art date
Legal status
Active
Application number
CN201710480622.2A
Other languages
Chinese (zh)
Other versions
CN107247628A (en)
Inventor
于俊清
汪亮
何云峰
唐九飞
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date: 2017-06-22
Filing date: 2017-06-22
Publication date: 2019-12-20
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201710480622.2A
Publication of CN107247628A
Application granted
Publication of CN107247628B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/505: Allocation of resources to service a request, the resource being a machine, considering the load
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/54: Interprogram communication

Abstract

The invention discloses a data flow program task partitioning and scheduling method for a multi-core system, which mainly comprises a data flow graph node-splitting algorithm, a GAP task-partitioning algorithm, a software pipeline scheduling model, and a double-buffer mechanism for data flow graph nodes. The method maximizes program parallelism by exploiting the data parallelism, task parallelism, and software pipeline parallelism inherent in the data flow programming model, and schedules the data flow program according to the characteristics of the multi-core architecture, so that the performance of the multi-core processor is fully exploited.

Description

Data flow program task dividing and scheduling method for multi-core system
Technical Field
The invention belongs to the technical field of compiler technology, and in particular relates to a data flow program task partitioning and scheduling method for a multi-core system.
Background
With the popularization of intelligent terminals, streaming media such as text, images, audio, and video have made data grow explosively, and the spread of technologies such as big data and cloud computing places ever higher demands on the processing speed of computers. Simply raising the clock frequency of a CPU now runs into problems such as manufacturing difficulty and high power consumption, and Moore's law no longer applies. Major chip manufacturers have instead turned to integrating multiple cores on a single CPU to improve processor performance; with advantages such as high speed and low power consumption, multi-core processors have become the mainstream. Multi-core processors have greatly improved computing capability, but their multi-core advantage is still not fully utilized. Researchers are therefore actively looking for new parallel programming models that can extract higher performance from multi-core processors.
Typical parallel programming models, such as Pthreads, MPI, and OpenMP, require the programmer to determine the parallel execution order of a program and then construct the parallel program statically. When a static parallel program executes, many complications may arise that burden the program, such as data races, memory-consistency issues, and deadlocks; if they are not handled, the program may crash or produce erroneous results. Although much research over the past decade has addressed the shortcomings of these models, they still require programmers to master this systems knowledge and to carry out task partitioning and scheduling, data communication, and synchronization design according to the underlying architecture, which greatly increases the programmers' difficulty and workload.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, an object of the present invention is to provide a data flow program task partitioning and scheduling method for a multi-core system, so as to solve the technical problem of the low execution performance of data flow programs in the prior art.
To achieve the above object, according to an aspect of the present invention, there is provided a data flow program task partitioning and scheduling method for a multi-core system, including the following steps:
(1) performing workload statistics on the nodes of a data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler;
(2) initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph;
(3) performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph;
(4) moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself;
(5) distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
Preferably, splitting stateless target nodes horizontally and splitting stateful target nodes vertically comprises:
for a stateless target node, if its number of runs exceeds a first preset value, splitting the target node horizontally into a plurality of identical nodes, each split node processing one task, the split nodes being in a parallel relationship;
and for a stateful target node, if its number of runs exceeds a second preset value, splitting the target node vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
Preferably, step (2) specifically comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi (i ≠ k), randomly selecting a node from Vk and adding it to a candidate set;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if any empty subgraph remains unconstructed, jumping back to step (2.2).
Preferably, in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
Preferably, performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
Preferably, the method further comprises:
transforming the buffers between nodes from a single buffer into a double-buffer mechanism, wherein under the double-buffer mechanism read and write operations alternate: while one buffer is being read, the other buffer is being written;
and aligning the memory layout of the double buffers to the cache-line size.
In general, compared with the prior art, the above technical solution contemplated by the present invention achieves the following beneficial effects:
(1) Task granularity is reduced and program parallelism is improved. The invention fully exploits the three forms of parallelism of a data flow program, namely software pipeline parallelism, task parallelism, and data parallelism, splitting the nodes of the data flow graph horizontally and vertically in preparation for the balanced partitioning of the data flow program.
(2) Task partitioning is more balanced. Through initial partitioning, load-balancing optimization, and communication-edge optimization, the task partitioning of the data flow graph becomes more balanced and the communication traffic between subgraphs becomes smaller.
(3) The efficiency of the inter-node buffers is improved. The invention improves the buffers between nodes, using double buffers to avoid false sharing, and aligns the buffers to the cache-line size to improve the cache hit rate.
Drawings
Fig. 1 is a schematic flowchart of a data flow program task partitioning and scheduling method for a multi-core system according to an embodiment of the present invention;
Fig. 2 illustrates the node-splitting algorithm of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 2(a) is a schematic diagram of horizontal task splitting and Fig. 2(b) is a schematic diagram of vertical task splitting;
Fig. 3 illustrates the task-partitioning algorithm of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 3(a) is a schematic diagram of the initial partition, Fig. 3(b) shows load-balancing optimization that moves nodes to the subgraph with the minimum workload, and Fig. 3(c) shows load-balancing optimization that moves nodes to an adjacent subgraph;
Fig. 4 is a schematic diagram of the software-pipelined execution of a data flow program on a multi-core platform according to an embodiment of the present invention, where Fig. 4(a) shows the software pipeline schedule and Fig. 4(b) shows inter-node cache optimization.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The data flow programming model uses a data flow graph for its logical representation and is an efficient parallel programming model that has proven well suited to big-data computation. Because a data flow program contains software pipeline parallelism, task parallelism, and data parallelism, it lends itself to efficient parallel computing. Multi-core processors are the main execution platform of data flow programs, and the separation of computation from communication that characterizes data flow programs makes them especially suitable for multi-core systems. How to use the data flow programming model to perform task partitioning, scheduling, and data communication for a data flow program, by combining the parallelism of the program with the characteristics of the multi-core system, is a major problem to be solved at present.
The invention combines the task partitioning and scheduling of the data flow program with the characteristics of the multi-core architecture and realizes a three-stage optimization process for the data flow program, namely node splitting, task partitioning, and task scheduling, improving the execution performance of the data flow program on the target platform.
Fig. 1 is a schematic flowchart of the data flow program task partitioning and scheduling method for a multi-core system according to an embodiment of the present invention. The optimization method adopted by the invention takes the synchronous data flow graph generated by the front end of the data flow compiler as input, performs the three stages of node splitting, task partitioning, and task scheduling on it in sequence, and finally generates executable code. The specific steps are as follows:
(1) Node splitting: performing workload statistics on the nodes of the data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler.
Splitting a node comprises: for a stateless target node, if its number of runs exceeds a first preset value, splitting it horizontally into a plurality of identical nodes, each split node processing one task, the split nodes running in parallel; and for a stateful target node, if its number of runs exceeds a second preset value, splitting it vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
Specifically, large stateless nodes are split horizontally using a split-join structure, and large stateful nodes are split vertically using a pipeline structure. This reduces the task granularity of the data flow program on the multi-core system and increases its parallelism. The specific steps are as follows:
(1.1) Horizontal splitting. For a stateless node whose number of runs is too large and whose task amount is too heavy, the node can be split horizontally into a plurality of identical nodes, each of which processes the task once, thereby reducing the task granularity. In engineering practice this can be implemented with a split-join structure: as shown in Fig. 2(a), the number of runs of bigOperator is N, so it can be split by its number of runs into N identical child nodes, each child node processing one task. A split node of the form roundrobin(pop) is used upstream to distribute data to all the child nodes in turn, handing out pop data items at a time, where pop equals the pop value of bigOperator. A join node of the form roundrobin(push) is used downstream to receive in turn the data output by all the child nodes, receiving push data items at a time, where push equals the push value of bigOperator. After horizontal splitting, the multiple tasks that originally belonged to one node are distributed over multiple functionally identical and mutually independent nodes, which reduces the task granularity and facilitates the later task partitioning. Since the split nodes are independent of one another, task parallelism is achieved between them.
(1.2) Vertical splitting. As shown in Fig. 2(b), if a node is stateful and needs to run multiple times, each run of the node depends on the previous run. The node is still split into multiple functionally identical nodes, each of the split child nodes processing one task, and the child nodes are then connected in series using a pipeline structure. Because there are data dependences between these nodes, the nodes must be modified: the definitions of the dependent variables are moved into the enclosing composite, so that these dependent variables are shared by all the child nodes.
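The node-splitting step can be illustrated with a minimal Python sketch; the node representation and every name in it (Node, split_horizontal, split_vertical, the run-count threshold) are assumptions made for this example rather than the patent's actual implementation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    name: str
    workload: int                      # estimated work per activation
    runs: int                          # activations per steady-state iteration
    stateful: bool = False
    children: List["Node"] = field(default_factory=list)

def split_horizontal(node: Node, threshold: int) -> List[Node]:
    """Split a stateless node into `runs` identical parallel copies framed by
    roundrobin-style split and join nodes, as in Fig. 2(a)."""
    if node.stateful or node.runs <= threshold:
        return [node]
    split = Node(f"{node.name}_split", workload=0, runs=node.runs)
    join = Node(f"{node.name}_join", workload=0, runs=node.runs)
    copies = [Node(f"{node.name}_{i}", node.workload, runs=1)
              for i in range(node.runs)]
    split.children = copies            # the split node feeds every copy in turn
    for c in copies:
        c.children = [join]            # copies are independent and run in parallel
    return [split, *copies, join]

def split_vertical(node: Node, threshold: int) -> List[Node]:
    """Split a stateful node into a pipeline of functionally identical stages;
    the shared state variables would be hoisted into the enclosing composite."""
    if not node.stateful or node.runs <= threshold:
        return [node]
    stages = [Node(f"{node.name}_s{i}", node.workload, runs=1, stateful=True)
              for i in range(node.runs)]
    for prev, nxt in zip(stages, stages[1:]):
        prev.children = [nxt]          # stages are connected in series
    return stages
```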
(2) Initial partitioning: initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph.
Step (2) comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi (i ≠ k), randomly selecting a node from Vk and adding it to a candidate set;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if any empty subgraph remains unconstructed, jumping back to step (2.2).
Specifically, the initial partitioning is implemented as follows: as shown in Fig. 3(a), assume V1, V2, ..., Vk are the subgraphs to be constructed by the initial partitioning. First the average weight of all subgraphs, we = Wsum/k, is obtained, where Wsum represents the total workload of the k subgraphs, all nodes are moved into one subgraph Vk, and the remaining subgraphs are left empty. A number of nodes are selected from all the nodes to form a candidate set, and the remaining empty subgraphs are then constructed one by one with a greedy strategy: at each step the node with the maximum gain-function value is selected from the candidate set and moved, the gain function being
gain(v) = Σ_{u ∈ Vi} w(u, v) − Σ_{u′ ∈ Vk} w(u′, v),
where v represents a node in the candidate set, u represents a node in the empty subgraph Vi, u′ represents a node in subgraph Vk, w(u, v) represents the weight of the edge connecting u and v, and w(u′, v) represents the weight of the edge connecting u′ and v. Each subgraph Vi is constructed as follows: subgraph Vi is initialized to be empty; a node is first randomly selected from subgraph Vk and added to the candidate set, the randomness serving to ensure diverse solutions; the node v with the maximum gain value is then selected from the candidate set and added to subgraph Vi, and all nodes connected to v that still belong to Vk are added to the candidate set. Whenever a node enters the candidate set, the gain-function values of the nodes in the candidate set are updated synchronously, so that they can be read directly at the next move. Subgraph Vi is complete once its workload is no longer less than we. The remaining empty subgraphs are finally constructed by the same method.
The node vmax with the maximum gain-function value is obtained as follows: first, for each node, the sum of the weights of all its edges into subgraph Vi is calculated; then the sum of the weights of all its edges into Vk is calculated; finally the difference between the two sums is taken, and the node with the largest difference is the node to be moved.
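The greedy construction of step (2) and its gain function can be sketched as follows; the graph representation (an adjacency map w of edge weights and a workload map) and the function names are illustrative assumptions, not the patent's code:

```python
import random

def gain(v, V_i, V_k, w):
    """gain(v) = (edge weight from v into V_i) - (edge weight from v into V_k)."""
    edges = w.get(v, {})
    return (sum(wt for u, wt in edges.items() if u in V_i)
            - sum(wt for u, wt in edges.items() if u in V_k))

def initial_partition(nodes, w, workload, k):
    Wsum = sum(workload[v] for v in nodes)
    we = Wsum / k                              # average subgraph weight
    V_k = set(nodes)                           # all nodes start in one subgraph
    parts = []
    for _ in range(k - 1):                     # greedily build the k-1 empty subgraphs
        if not V_k:
            break
        V_i = set()
        cand = {random.choice(sorted(V_k))}    # random seed keeps solutions diverse
        while cand and sum(workload[v] for v in V_i) < we:
            v = max(cand, key=lambda x: gain(x, V_i, V_k, w))
            cand.discard(v)
            V_i.add(v)
            V_k.discard(v)
            cand |= {u for u in w.get(v, {}) if u in V_k}  # neighbours still in V_k
        parts.append(V_i)
    parts.append(V_k)                          # leftover nodes form the last subgraph
    return parts
```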
(3) Load-balancing optimization: performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph.
Performing load-balancing optimization on the k constructed subgraphs by moving nodes from the maximum-workload subgraph to the minimum-workload subgraph comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
Specifically, the load-balancing optimization is implemented as follows:
As shown in Fig. 3(b), the load-balancing optimization algorithm that moves nodes to the minimum subgraph first traverses all nodes of the maximum subgraph and, for each node, tries moving it to the minimum subgraph; if the balance factor, i.e. the workload of the maximum subgraph divided by we, decreases after the move, the move is kept, and otherwise the move is undone. After a move, the loads of the maximum and minimum subgraphs change: the load of the maximum subgraph decreases by the workload of the moved node, and the load of the minimum subgraph increases by the same amount. After each successful move, a new maximum subgraph and a new minimum subgraph are selected and the same move operation is applied; the optimization algorithm terminates once all nodes of the maximum subgraph have been traversed and the balance factor no longer decreases, that is, once every attempted move of a node of the new maximum subgraph is undone.
As shown in Fig. 3(c), the load-balancing optimization algorithm that moves nodes to adjacent subgraphs likewise first traverses all nodes of the maximum subgraph and, for each node, tries moving it to the subgraphs adjacent to it; if there are several adjacent subgraphs, several moves are tried, yielding several temporary solutions. For each temporary solution, if both its balance factor and its communication traffic decrease, the current partition is replaced by that temporary solution; the move operation then continues from the temporary solutions, and a moved node is not moved back to its original subgraph. If the balance factor does not decrease after a move, the move is cancelled. As before, the maximum subgraph may change after a move, so it is re-selected in the next round, until the maximum subgraph no longer changes and all its nodes have been traversed.
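A sketch of the move-to-minimum pass follows, under the same illustrative representation; estimating the new balance factor from only the two subgraphs that change is a simplification made in this sketch, not a detail stated above:

```python
def balance_to_minimum(parts, workload, we):
    """Repeatedly try to move nodes from the heaviest to the lightest subgraph,
    keeping a move only if the balance factor (max-subgraph load / we) drops."""
    def load(p):
        return sum(workload[v] for v in p)
    improved = True
    while improved:
        improved = False
        big = max(parts, key=load)
        small = min(parts, key=load)
        for v in list(big):
            old_factor = load(big) / we
            # after the move, the heavier of the two changed subgraphs bounds
            # the new balance factor (a simplification for this sketch)
            new_factor = max(load(big) - workload[v],
                             load(small) + workload[v]) / we
            if new_factor < old_factor:
                big.remove(v)
                small.add(v)
                improved = True
                break                  # re-select the heaviest and lightest subgraphs
    return parts
```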
(4) Inter-subgraph communication-edge optimization: moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself.
Performing load-balancing optimization on the k constructed subgraphs by moving nodes from the maximum-workload subgraph to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
Specifically, the inter-subgraph communication-edge optimization is implemented as follows:
In the data flow graph, the communication traffic between subgraphs is the total weight of the edges between adjacent subgraphs. According to where all its neighbouring nodes lie, a node can be classified into one of three types:
1. Internal node: if all the neighbouring nodes of a node are in the same subgraph as the node, it is called an internal node; internal nodes are to be protected and must not be moved.
2. Boundary node: if some of the neighbouring nodes of a node are in the same subgraph as the node and the others are in other subgraphs, it is called a boundary node; boundary nodes may be considered for moving.
3. Orphan node: if none of the neighbouring nodes of a node is in the same subgraph as the node, it is called an orphan node; orphan nodes are the first to be considered for moving.
For orphan nodes, a merging strategy is provided: on the premise of not affecting load balance, moving an orphan node into the subgraph where one of its neighbouring nodes is located effectively reduces the inter-subgraph communication traffic. If its neighbouring nodes lie in several subgraphs, several moves are tried and the move that minimizes the communication traffic is chosen.
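The three-way node classification and the orphan-merging strategy can be sketched as follows; the concrete balance test used here (the destination's load may not exceed the average weight we) is one plausible reading of "not affecting load balance", not a rule stated in the text:

```python
def classify(v, part, w):
    """Classify a node by where its neighbours live: 'internal' nodes are
    protected, 'boundary' nodes may move, 'orphan' nodes move first."""
    neigh = set(w.get(v, {}))
    inside = neigh & part
    if inside == neigh:
        return "internal"
    if not inside:
        return "orphan"
    return "boundary"

def merge_orphans(parts, w, workload, we):
    """Move each orphan node into the neighbouring subgraph that absorbs the
    most edge weight, subject to the (assumed) balance test."""
    for part in parts:
        for v in list(part):
            if classify(v, part, w) != "orphan":
                continue
            best, best_cut = None, 0
            for other in parts:
                if other is part:
                    continue
                cut = sum(wt for u, wt in w.get(v, {}).items() if u in other)
                balanced = sum(workload[x] for x in other) + workload[v] <= we
                if cut > best_cut and balanced:
                    best, best_cut = other, cut
            if best is not None:
                part.remove(v)          # merge the orphan into the subgraph
                best.add(v)             # that cuts the most communication
    return parts
```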
(5) Task scheduling: distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
In another embodiment, software pipeline scheduling and cache optimization are performed on the task-partitioning result. The software pipeline model in a multi-core environment is essentially the cooperative operation of the processor cores, the computing nodes, and the buffers between the computing nodes. To maximize the performance of the software pipeline, load balance between the pipeline stages is the key factor. Since load balance has already been ensured during the task partitioning of the data flow program, the scheduling mainly concerns the scheduling order of the nodes and the buffers between them. The specific steps are as follows:
(A) Software pipeline scheduling
Stage assignment is performed on the partitioned nodes to obtain the execution order of the nodes.
The task subgraphs are distributed to the corresponding processor cores according to the partitioning result, and the schedule is divided into three phases: a filling phase, a steady-state phase, and a draining phase. The filling phase is the start-up phase of program execution, in which data is gradually accumulated for the steady-state phase and only some of the nodes run. The steady-state phase is entered once enough data has accumulated for steady-state execution; in this phase, the computing tasks within the same pipeline cycle have no data dependence on one another and can run fully in parallel, because the data each task needs was placed into its buffer during the previous pipeline cycle. The draining phase is the finishing phase of program execution, in which the computing nodes stop one after another.
On the left of Fig. 4(a) is a simple data flow graph in which each downstream node depends on the data generated by its upstream node, so within the same pipeline cycle these nodes cannot run at the same scheduling time. To let these nodes run in parallel, they are isolated in space, by placing them on different processor cores, and staggered in time, so that they run at different scheduling times within the same pipeline cycle.
The right side of Fig. 4(a) shows the node schedule. In the first pipeline cycle, node A starts to run and, when it finishes, puts the result it generates into the buffer between A and B; the buffer transmits the data to computing node B through DMA (Direct Memory Access). In the second pipeline cycle, node A proceeds to its next scheduling, and because the data node B needs was placed into the buffer by node A in the previous cycle, node B runs at the same time as node A. Likewise, after A finishes, it overwrites the A-B buffer with new data, and the data generated by B is written into the buffer between B and C. In the third pipeline cycle, node C reads the data written by node B in the second cycle and starts to run, so from the third pipeline cycle on all three nodes run on their cores, achieving software pipeline parallelism. Since the program is scheduled 6 times, the three nodes run together up to the sixth pipeline cycle. From the seventh pipeline cycle, node A has finished executing and no new data enters the A-B buffer, but nodes B and C still run, because they use the data generated in the previous cycle. In the eighth pipeline cycle, node B finds no data in its upstream buffer and therefore ends its run, so only node C runs; node C also stops after the eighth pipeline cycle.
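The fill, steady-state, and drain behaviour described above can be reproduced with a toy schedule simulator; the firing rule (stage s first fires in pipeline cycle s+1) is the one implied by Fig. 4(a), and the code is illustrative only:

```python
def pipeline_schedule(chain, schedules):
    """Print which nodes of a linear chain run in each pipeline cycle."""
    depth = len(chain)
    total = schedules + depth - 1            # fill + steady-state + drain cycles
    for cycle in range(1, total + 1):
        active = [node for stage, node in enumerate(chain)
                  if 1 <= cycle - stage <= schedules]  # stage s first fires at s+1
        print(f"pipeline cycle {cycle}: {' '.join(active)}")

pipeline_schedule(["A", "B", "C"], schedules=6)
# cycle 1: A            (filling)
# cycle 2: A B          (filling)
# cycles 3-6: A B C     (steady state, fully parallel on different cores)
# cycle 7: B C          (draining)
# cycle 8: C            (draining)
```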
(B) Buffer optimization
The buffers of the current data flow programming model adopt a shared-memory single-buffer mechanism with a fixed buffer length. During software pipeline scheduling, to keep the data consistent, the upstream node writes the buffer while the downstream node reads it; the difference is that the upstream node operates on the data of the current cycle while the downstream node operates on the data the upstream node produced in the previous cycle. When the upstream and downstream nodes are located on different processor cores, they communicate through shared memory. If the two nodes frequently read and write the same cache line of the buffer, false sharing occurs, which greatly increases the threads' memory access time and hurts the execution efficiency of the program.
To counter false sharing, the buffer is first restructured into a double-buffer mechanism, which prevents the upstream and downstream nodes from reading and writing the same buffer. As shown in Fig. 4(b), node A is the upstream node: when A writes data, it writes Buffer1, and after each write its pointer flips to Buffer2, which it then starts writing. Node B is the downstream node: while A writes Buffer1, B's pointer points to Buffer2; if Buffer2 holds data, B reads it, and if not, B enters a synchronous waiting state. After A finishes writing Buffer1 and switches to Buffer2, B's pointer likewise flips to Buffer1, which now holds the data A wrote; thus A writes data to Buffer2 while B reads data from Buffer1. This guarantees that nodes A and B always operate on data in different buffers.
Second, the memory layout of the double buffers is aligned to the cache-line size, which increases the cache hit rate. Because the multi-core processor reads and writes the cache markedly faster than main memory, if the computing nodes' reads and writes can be served directly from the buffer in cache, the communication speed between the computing nodes is greatly increased.
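A minimal ping-pong sketch of the double-buffer mechanism follows. Python cannot control allocation alignment, so the 64-byte cache line and the rounding helper merely model what a C implementation would obtain with, e.g., posix_memalign; the synchronous-wait handshake between producer and consumer is also omitted:

```python
CACHE_LINE = 64          # assumed cache-line size in bytes

def aligned_len(nbytes, line=CACHE_LINE):
    """Round a buffer length up to a whole number of cache lines."""
    return (nbytes + line - 1) // line * line

class DoubleBuffer:
    def __init__(self, nbytes):
        size = aligned_len(nbytes)
        self.bufs = [bytearray(size), bytearray(size)]
        self.write_idx = 0                   # the producer writes here ...

    def produce(self, data):
        buf = self.bufs[self.write_idx]
        buf[:len(data)] = data               # upstream node fills one buffer
        self.write_idx ^= 1                  # then flips to the other buffer

    def consume(self):
        # ... while the consumer reads the buffer the producer is NOT writing,
        # so the two nodes never touch the same cache line (no false sharing)
        return bytes(self.bufs[self.write_idx ^ 1])
```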
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A data flow program task dividing and scheduling method for a multi-core system, characterized by comprising the following steps:
(1) performing workload statistics on the nodes of a data flow graph, selecting target nodes that can be split, splitting stateless target nodes horizontally, and splitting stateful target nodes vertically, wherein the data flow graph is generated from the data flow program by the front end of a data flow compiler;
(2) initializing k subgraphs, moving all nodes of the data flow graph after node splitting into one subgraph Vk, and constructing the remaining k-1 subgraphs from the nodes of subgraph Vk, wherein, under the condition of ensuring load balance, inter-subgraph communication is reduced by maximizing the sum of the edge weights within each subgraph;
(3) performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload, or by moving nodes from the subgraph with the maximum workload to an adjacent subgraph;
(4) moving each orphan node of the data flow graph after load-balancing optimization into the subgraph where one of its neighbouring nodes is located, so as to reduce inter-subgraph communication traffic, wherein an orphan node is a node none of whose neighbouring nodes is in the same subgraph as the node itself;
(5) distributing the nodes with dependency relationships in the communication-optimized data flow graph onto different processor cores, and making the dependent nodes run at different scheduling times within the same pipeline cycle.
2. The method of claim 1, wherein splitting stateless target nodes horizontally and splitting stateful target nodes vertically comprises:
for a stateless target node, if its number of runs exceeds a first preset value, splitting the target node horizontally into a plurality of identical nodes, each split node processing one task, the split nodes being in a parallel relationship;
and for a stateful target node, if its number of runs exceeds a second preset value, splitting the target node vertically into a plurality of functionally identical nodes, each split node processing one task, the split nodes being connected in series.
3. The method according to claim 1 or 2, characterized in that step (2) specifically comprises the following sub-steps:
(2.1) setting V1, V2, ..., Vk as the k subgraphs to be constructed, obtaining the average subgraph weight we = Wsum/k, and moving all nodes of the data flow graph after node splitting into subgraph Vk, the remaining subgraphs being left empty, wherein Wsum represents the total workload of the k subgraphs;
(2.2) for any empty subgraph Vi, randomly selecting a node from Vk and adding it to a candidate set, wherein i ≠ k;
(2.3) selecting the node v with the maximum gain value from the candidate set and adding it to subgraph Vi; if the workload of subgraph Vi is less than we, executing step (2.4); otherwise, subgraph Vi is complete and step (2.5) is executed;
(2.4) adding into the candidate set all nodes that are connected to node v and still belong to Vk, and executing step (2.3);
(2.5) judging whether all empty subgraphs have been constructed; if so, ending step (2); otherwise, if any empty subgraph remains unconstructed, jumping back to step (2.2).
4. The method of claim 3, wherein in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to the subgraph with the minimum workload comprises:
counting the workload of each subgraph, and obtaining the subgraph with the maximum workload and the subgraph with the minimum workload;
traversing each node of the maximum-workload subgraph, and if the balance factor decreases after the currently traversed node is moved to the minimum-workload subgraph, moving the currently traversed node to the minimum-workload subgraph, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload and the subgraph with the minimum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until all its nodes have been traversed and the balance factor no longer decreases.
5. The method of claim 3, wherein in step (3), performing load-balancing optimization on the k constructed subgraphs by moving nodes from the subgraph with the maximum workload to an adjacent subgraph comprises:
counting the workload of each subgraph to obtain the subgraph with the maximum workload;
traversing each node of the maximum-workload subgraph, finding all subgraphs adjacent to the currently traversed node, determining the temporary solution obtained by moving the currently traversed node into each adjacent subgraph, and, for each temporary solution, if both the balance factor and the communication traffic of the temporary solution decrease, moving the currently traversed node into the subgraph corresponding to that temporary solution, wherein the balance factor is the workload of the maximum-workload subgraph divided by we;
and updating the subgraph with the maximum workload, and repeating the traversal and move operations on the nodes of the maximum-workload subgraph until the maximum subgraph no longer changes and all its nodes have been traversed.
6. The method according to claim 4 or 5, characterized in that the method further comprises:
transforming the buffers between nodes from a single buffer into a double-buffer mechanism, wherein under the double-buffer mechanism read and write operations alternate: while one buffer is being read, the other buffer is being written;
and aligning the memory layout of the double buffers to the cache-line size.
Application CN201710480622.2A, priority date 2017-06-22, filed 2017-06-22: Data flow program task dividing and scheduling method for multi-core system. Granted as CN107247628B (Active).

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201710480622.2A | 2017-06-22 | 2017-06-22 | Data flow program task dividing and scheduling method for multi-core system


Publications (2)

Publication Number | Publication Date
CN107247628A (en) | 2017-10-13
CN107247628B (en) | 2019-12-20

Family

ID=60019400

Family Applications (1)

Application Number | Title | Status
CN201710480622.2A (granted as CN107247628B) | Data flow program task dividing and scheduling method for multi-core system | Active

Country Status (1)

Country | Link
CN | CN107247628B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107888697B * 2017-11-24 2020-07-14 Beijing Aerospace Automatic Control Institute Node locking method in load balancing algorithm
CN108932172B * 2018-06-27 2021-01-19 Xi'an Jiaotong University Fine-grained shared memory communication synchronization method based on OpenMP/MPI mixed parallel CFD calculation
CN110347617A * 2019-07-03 2019-10-18 Nanjing University Function verification method of a DMA module in a multi-core SoC
CN112988367A * 2019-12-12 2021-06-18 Cambricon Technologies Corporation Limited Resource allocation method and device, computer equipment and readable storage medium
CN111598036B * 2020-05-22 2021-01-01 Guangzhou Institute of Geography Urban group geographic environment knowledge base construction method and system of distributed architecture
CN111858055B * 2020-07-23 2023-02-03 Ping An Puhui Enterprise Management Co., Ltd. Task processing method, server and storage medium
CN114168340B * 2021-12-14 2023-04-18 University of Electronic Science and Technology of China Multi-core system synchronous data flow graph instantiation concurrent scheduling method


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104662566A * 2011-10-31 2015-05-27 Applied Materials, Inc. Method and system for splitting scheduling problems into sub-problems
US9438490B2 * 2014-03-07 2016-09-06 International Business Machines Corporation Allocating operators of a streaming application to virtual machines based on monitored performance
CN104965761A * 2015-07-21 2015-10-07 Huazhong University of Science and Technology Flow program multi-granularity division and scheduling method based on GPU/CPU hybrid architecture

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Gordon, M. I., et al. Exploiting coarse-grained task, data, and pipeline parallelism in stream programs. International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), ACM, 2006: 1-12. *
Wei, H., et al. Minimizing communication in rate-optimal software pipelining for stream programs. Proceedings of CGO 2010, the 8th International Symposium on Code Generation and Optimization, Toronto, Ontario, Canada, April 24-28, 2010: 1-14. *
Chen Wenbin, et al. Multi-granularity partitioning and scheduling of stream programs on a GPU/CPU hybrid architecture. Computer Engineering & Science, January 2017, 39(1): 15-24. *
Yang Qiuji, et al. A dataflow programming model and compiler optimization method for Storm. Computer Engineering & Science, December 2016, 38(12): 2409-2418. *
Yu Junqing, et al. Hierarchical pipeline parallel optimization of dataflow programs for multi-core clusters. Chinese Journal of Computers, October 2014, 37(10): 2071-2082. *

Also Published As

Publication number | Publication date
CN107247628A (en) | 2017-10-13

Similar Documents

Publication Publication Date Title
CN107247628B (en) Data flow program task dividing and scheduling method for multi-core system
JP6525286B2 (en) Processor core and processor system
WO2021057720A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
Xiao et al. A load balancing inspired optimization framework for exascale multicore systems: A complex networks approach
US9953003B2 (en) Systems and methods for in-line stream processing of distributed dataflow based computations
CN105159654B (en) Integrity measurement hashing algorithm optimization method based on multi-threaded parallel
CN106055311A (en) Multi-threading Map Reduce task parallelizing method based on assembly line
CN101807144A (en) Prospective multi-threaded parallel execution optimization method
WO2020083050A1 (en) Data stream processing method and related device
CN102981807A (en) Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103970602A (en) Data flow program scheduling method oriented to multi-core processor X86
JP2014216021A (en) Processor for batch thread processing, code generation apparatus and batch thread processing method
CN101655783B (en) Forward-looking multithreading partitioning method
CN101840329A (en) Data parallel processing method based on graph topological structure
Boechat et al. Representing and scheduling procedural generation using operator graphs
WO2018076979A1 (en) Detection method and apparatus for data dependency between instructions
CN102968295A (en) Speculation thread partitioning method based on weighting control flow diagram
Neelima et al. Communication and computation optimization of concurrent kernels using kernel coalesce on a GPU
Jiang et al. An optimized resource scheduling strategy for Hadoop speculative execution based on non-cooperative game schemes
Huang et al. Partial flattening: a compilation technique for irregular nested parallelism on GPGPUs
Wang et al. A new scheme for cache optimization based on cluster computing framework spark
Abdolrashidi Improving Data-Dependent Parallelism in GPUs Through Programmer-Transparent Architectural Support
Gilray et al. Toward parallel cfa with datalog, mpi, and cuda
Li et al. An Optimization Method for Embarrassingly Parallel under MIC Architecture
Kuhrt et al. iGPU-Accelerated Pattern Matching on Event Streams

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant