CN103970580A - Data flow compilation optimization method oriented to multi-core cluster - Google Patents

Data flow compilation optimization method oriented to multi-core cluster

Info

Publication number
CN103970580A
Authority
CN
China
Prior art keywords
node
cluster
stage
division
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410185945.5A
Other languages
Chinese (zh)
Other versions
CN103970580B (en)
Inventor
于俊清
张维维
唐九飞
何云峰
管涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201410185945.5A priority Critical patent/CN103970580B/en
Publication of CN103970580A publication Critical patent/CN103970580A/en
Application granted granted Critical
Publication of CN103970580B publication Critical patent/CN103970580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a data flow compilation optimization method oriented to a multi-core cluster system. The method comprises the following steps: determining the task partitioning and scheduling that maps computation tasks onto processing cores; constructing, from the task partitioning and scheduling results, a hierarchical pipeline schedule comprising pipeline schedule tables between cluster nodes and among the cores within each cluster node; and performing cache optimization according to the structural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor. The method combines the data flow program with optimization techniques specific to the system architecture, fully exploits the load balance and high parallelism of synchronous-asynchronous hybrid pipelining code on a multi-core cluster, and optimizes the program's cache accesses and communication transfers according to the cache and communication patterns of the multi-core cluster, thereby improving the execution performance of the program and shortening its execution time.

Description

A data flow compilation optimization method oriented to multi-core clusters
Technical field
The invention belongs to the field of computer compilation technology, and more specifically relates to a data flow compilation optimization method oriented to multi-core clusters.
Background art
With the development of semiconductor technology, the multi-core processor has proven to be a feasible platform for exploiting parallelism. The multi-core cluster parallel system, with its powerful computing capability and good scalability, has become an important parallel computing platform. While a multi-core cluster system provides powerful computing capability, it also places a heavier burden on compilers and programmers to effectively exploit coarse-grained parallelism between cores. Data flow programming provides a feasible way to exploit the parallelism of multi-core architectures. In this model, each node represents a computation task and each edge represents the data flowing between computation tasks. Each computation task is an independent computing unit with its own instruction stream and address space, and the data flowing between computation tasks is realized through first-in-first-out (FIFO) communication queues. The data flow programming model takes the data flow model as its basis and a data flow programming language as its implementation. Data flow compilation is the compilation process that translates a data flow programming language into an executable program for the underlying target, and its optimizations play a decisive role in the runtime performance of the data flow program on the target processing cores.
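To make the model concrete, the following is a minimal sketch, not taken from the patent, of two computation tasks connected by a FIFO edge; the names FifoEdge and fire_double are illustrative.

```cpp
#include <queue>
#include <cstdio>

// A FIFO communication queue between two computation tasks (actors).
struct FifoEdge {
    std::queue<int> q;
    void push(int v) { q.push(v); }
    int pop() { int v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// An actor fires by consuming (popping) inputs and producing (pushing) outputs.
void fire_double(FifoEdge& in, FifoEdge& out) {
    out.push(2 * in.pop());
}

int main() {
    FifoEdge ab, bc;                                    // edges A->B and B->C
    for (int i = 0; i < 4; ++i) ab.push(i);             // actor A: source
    while (!ab.empty()) fire_double(ab, bc);            // actor B: transformer
    while (!bc.empty()) std::printf("%d\n", bc.pop());  // actor C: sink
    return 0;
}
```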
The compiler laboratory of the Massachusetts Institute of Technology has published a stream programming language, StreamIt. Based on Java, StreamIt extends it with streaming constructs and introduces the concept of the Filter. A Filter is the most basic computing unit: a single-input, single-output program block. The processing of each Filter is described by a Work function, and Work functions communicate with one another in FIFO fashion through Push, Pop and Peek operations. For the next-generation high-performance computer Raw, a stream optimization approach was proposed: first, the compiler combines splitting and fusion to partition and merge the computation nodes, increasing the ratio of computation to communication overhead; then the processed computation nodes are mapped onto the processing cores to achieve load balance, each processing core executes in pipelined fashion, and explicit communication between cores realizes the data transfers.
The stream optimization of StreamIt offers one solution to the scheduling problem of the stream programming model on multi-core processors: by distributing computation tasks across the processing cores, it achieves load balance and guarantees the parallel execution of computation tasks on the cores. It has, however, the following defects: (1) the computation and the communication scheduled onto each processing core are separated, and the pipeline allocates independent communication time for them, which increases communication overhead; (2) it does not consider the low-level storage allocation optimization and communication optimization of the processing cores; (3) the compilation optimization is not tailored to the architectural characteristics of the underlying multi-core cluster system. In short, while a multi-core cluster system offers the programmer powerful computing capability, it also exposes its hierarchical storage organization and software communication mechanisms. Existing stream compilation optimization methods do not take the underlying architecture into account and do not make full use of system hardware resources, such as storage resources, to improve the execution efficiency of the program.
Summary of the invention
The object of the present invention is to provide a data flow compilation optimization method oriented to multi-core clusters which, for the architecture of a multi-core cluster system, optimizes the data flow program and substantially improves its execution performance.
The optimization method adopted by the present invention takes as input the intermediate representation produced by the data flow compiler front end, the synchronous data flow graph, and subjects it in turn to three levels of processing: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally generating executable code. The concrete steps are as follows:
(1) A task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster.
Nodes in the data flow graph represent computation tasks, and edges represent the communication between computation tasks. First, the synchronous data flow graph is partitioned at process level according to the number of nodes in the cluster. This sub-step adopts a Group-based multi-task partitioning strategy whose target is to minimize inter-node communication overhead and maximize program execution performance; the partitioning must weigh load balance against minimizing communication overhead, assigning each computation task to a corresponding cluster node. Second, according to the computation tasks on each cluster node, a thread-level task partitioning assigns each computation task to a processing core of its cluster node. This sub-step adopts the replication-fission method, splitting heavily loaded computation tasks, with the target of achieving load balance across the processing cores within a cluster node.
(2) A hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node.
Synchronous pipelining uses a global synchronization clock to guarantee that the tasks executing in each pipeline stage complete simultaneously, while the subtasks of an asynchronous software pipeline execute in a data-driven manner. First, the synchronous data flow graph is scheduled as an asynchronous pipeline to determine the task execution process between cluster nodes; this step maps the computation tasks of each process as a whole onto the cluster's computing nodes, completing the mapping between processes and cluster nodes. Second, according to the dependences between the computation tasks within a cluster node, each computation task (node) is assigned its stage number in the pipeline, completing the synchronous pipeline construction. Finally, these two kinds of information are used to construct the hierarchical pipeline schedule table.
(3) A cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
When computation tasks (nodes) execute, false sharing can arise in the cache usage of the processing cores on which they run, which has a considerable impact on the execution performance of the program.
For general multi-core processors of the X86 architecture, a combination of the cache line filling mechanism and the steady-state expansion technique is adopted to eliminate the false sharing present in the program and optimize cache usage.
The present invention integrates data flow scheduling with optimizations specific to the structure of the multi-core cluster system, realizing a three-level optimization process for data flow programs comprising task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, and improving the execution performance of data flow programs on the target platform. In particular, the present invention has the following advantages:
(1) Improved program parallelism. Through a formal description of the problem, the present invention abstracts the scheduling of the data flow graph onto the processing cores of the multi-core cluster system as a partitioning problem solved with greedy heuristics, and thereby constructs a hierarchical pipeline schedule model for the data flow program; all tasks are mapped onto the processing cores, realizing low communication overhead and load balance and improving the parallelism of the program.
(2) Reduced overhead. The hierarchical pipeline schedule model of synchronous-asynchronous hybrid pipelining proposed by the present invention makes full use of the computing and communication resources of the system; at the same time, the cache usage inside the cluster nodes is optimized, improving the locality of data accesses and cache utilization and enhancing the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a structural framework diagram of the method of the invention within a data flow compilation system;
Fig. 2 is a flowchart of the replication-fission method applied to a data flow program inside a cluster node in an embodiment of the present invention;
Fig. 3 is an example diagram of the asynchronous pipelined execution of a data flow program on a cluster in an embodiment of the present invention;
Fig. 4 (a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the present invention;
Fig. 4 (b) is a diagram of the software pipeline execution corresponding to Fig. 4 (a);
Fig. 5 (a) is a schematic diagram of task execution in which the steady-state expansion technique eliminates false sharing in an embodiment of the present invention;
Fig. 5 (b) is a schematic diagram of the tasks in Fig. 5 (a) before false sharing is eliminated;
Fig. 5 (c) is a schematic diagram of the tasks in Fig. 5 (a) after false sharing is eliminated.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with one another as long as they do not conflict.
Fig. 1 shows the structural framework of the present embodiment within the stream compilation system. After being parsed by the data flow compiler front end, a data flow program yields an intermediate representation, the synchronous data flow graph (Synchronous Data Flow, SDF). It then passes in turn through the three optimization levels of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, and finally generates target code encapsulated with the Message Passing Interface (MPI), completing compilation.
(1) A task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster.
This step comprises two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, which is costly, whereas communication within a node is machine-internal and cheap; the task partitioning of the data flow program must therefore distinguish between inter-node and intra-node (inter-core) communication. The task partitioning at the different levels of the cluster is as follows: process-level task partitioning minimizes inter-node communication overhead under the premise of guaranteeing load balance between nodes, and no cycle may appear among the partitioning results; thread-level task partitioning minimizes synchronization overhead under the premise of guaranteeing load balance, and preserves data locality as far as possible. The concrete steps are as follows:
(1.1) Process-level task partitioning. Process-level task partitioning determines the mapping between computing units and cluster nodes. To amortize the communication overhead per unit of data when the data flow program executes, inter-process data communication adopts a block communication mechanism: a message transfer is triggered only when a buffer is filled or the buffer is forcibly flushed. To prevent deadlock during program execution, the data dependences between partitions must not form a cycle. For the process-level task partitioning of the synchronous data flow graph on a multi-core cluster, a Group-based multi-task partitioning strategy is proposed, implemented with a greedy algorithm. Group task partitioning introduces the group structure: a group represents a set of one or more computing units of the synchronous data flow graph. Initially, every computing unit of the synchronous data flow graph is treated as a group of its own, and the dependences between groups coincide with those between computing units. Group task partitioning consists of four stages:
(1.1.1) Preprocessing stage. This stage is designed for the multi-input multi-output computing units in the synchronous data flow graph: it fuses several computing units into one group, reducing the number of communication edges between a single computing unit within a group and the computing units in other groups.
(1.1.2) Group coarsening stage. This stage coarsens the preprocessed group graph, fusing several adjacent groups into one while avoiding the appearance of cycles in the group graph. The gain produced by merging a pair of groups is called the coarsening gain, computed as follows:
gain = comm(srcGroup, snkGroup) / (workload(srcGroup) + workload(snkGroup))
where workload(srcGroup) and workload(snkGroup) denote the respective loads of srcGroup and snkGroup, and comm(srcGroup, snkGroup) denotes the communication overhead between srcGroup and snkGroup; the communication overhead comprises both data sending and data receiving.
Coarsening follows a greedy heuristic. First the coarsening gains of all adjacent pairs of groups are computed and the results are stored in a priority queue. The pair of groups with the maximum gain is selected from the priority queue and merged; the fusion is valid if the load of the newly formed group does not exceed the theoretical mean load per partition and no cycle appears in the group graph after the merge. The groups consumed by a valid fusion are deleted from the group graph, the new group obtained by the fusion is inserted into the graph and the dependences between groups are updated, and the gains in the priority queue are updated according to the new group. This process iterates; the algorithm terminates when no pair of groups can produce a positive gain by merging, or the number of groups in the group graph falls below a threshold.
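The following is a minimal C++ sketch of this greedy coarsening loop, under simplifying assumptions not made by the patent: groups are reduced to their loads, communication volumes are kept in a symmetric matrix, the acyclicity test on the group graph is omitted, and stale queue entries are tolerated by lazy deletion.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A group of computing units, reduced here to its total load.
struct Group { double workload; bool alive = true; };

struct Candidate {
    double gain;
    std::size_t src, snk;
    bool operator<(const Candidate& o) const { return gain < o.gain; }
};

// Coarsening gain of merging two groups, per the formula above.
double coarseningGain(double comm, const Group& s, const Group& t) {
    return comm / (s.workload + t.workload);
}

// Greedy coarsening: repeatedly merge the pair with the largest positive
// gain, rejecting fusions whose load would exceed the mean per partition.
void coarsen(std::vector<Group>& g,
             std::vector<std::vector<double>>& comm,  // comm[i][j] = volume
             double meanLoad, std::size_t minGroups) {
    std::priority_queue<Candidate> pq;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = i + 1; j < g.size(); ++j)
            if (comm[i][j] > 0)
                pq.push({coarseningGain(comm[i][j], g[i], g[j]), i, j});

    std::size_t live = g.size();
    while (!pq.empty() && live > minGroups) {
        Candidate c = pq.top(); pq.pop();
        if (c.gain <= 0) break;                           // termination condition
        if (!g[c.src].alive || !g[c.snk].alive) continue; // stale entry
        double fused = g[c.src].workload + g[c.snk].workload;
        if (fused > meanLoad) continue;                   // would unbalance load
        g[c.src].workload = fused;                        // fuse snk into src
        g[c.snk].alive = false;
        --live;
        for (std::size_t k = 0; k < g.size(); ++k) {      // inherit snk's edges
            if (!g[k].alive || k == c.src) continue;
            comm[c.src][k] += comm[c.snk][k];
            comm[k][c.src] = comm[c.src][k];
            if (comm[c.src][k] > 0)
                pq.push({coarseningGain(comm[c.src][k], g[c.src], g[k]),
                         c.src, k});
        }
    }
}
```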
(1.1.3) Initial partitioning stage. This stage preliminarily determines the mapping between the groups of the coarsened group graph and the cluster nodes. Initial partitioning balances the load of each partition while guaranteeing, as far as possible, that the communication between partitions is minimal. It adopts a deadlock-prevention strategy, avoiding cycles in the partitioning result from the very start of the partitioning. After coarsening, the group graph is a directed acyclic graph (Directed Acyclic Graph, DAG); topologically sorting a DAG uses the partial order between its nodes to obtain a topological sequence, and during initial partitioning the group nodes of the group graph are examined one by one in topological order to determine the concrete cluster node number of each group.
(1.1.4) Fine-grained adjustment stage. This stage further tunes the boundary computing units of the partitions, i.e. those that communicate with computing units on other cluster nodes, according to their communication, to reduce inter-node communication overhead. For a boundary computing unit, the partition containing it is called its source partition (srcPartition), and a partition containing a computing unit with which it has a dependence is a target partition (objPartition); a computing unit has exactly one srcPartition and may have several objPartitions. The traffic between the computing unit and the other computing units in its srcPartition is internalData, and the traffic between the computing unit and the computing units in its i-th objPartition is externalData[i]. During fine-grained adjustment a priority queue is maintained whose weights are externalData[i] − internalData. The adjustment repeatedly selects the element with the maximum weight for processing; whether a computing unit may be moved to an objPartition is decided by two factors: first, the move must not introduce a cycle into the partitioning; second, it must not, to any significant extent, destroy the load balance of the overall partitioning. After a computing unit has been adjusted, the priority queue is updated according to the result, but a computing unit that has already been adjusted is not taken as an adjustment object again.
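A compact sketch of this adjustment loop follows; the cycle and balance tests are stubbed out, and all names (Move, introducesCycle, keepsBalance) are illustrative rather than taken from the patent.

```cpp
#include <queue>
#include <vector>

// A candidate migration of a boundary computing unit to a target partition,
// weighted by externalData[i] - internalData as described above.
struct Move {
    double weight;
    int actor, objPartition;
    bool operator<(const Move& o) const { return weight < o.weight; }
};

// Stubs for the two acceptance tests; a real implementation would check
// the partition DAG for cycles and the overall load balance.
bool introducesCycle(int /*actor*/, int /*objPartition*/) { return false; }
bool keepsBalance(int /*actor*/, int /*objPartition*/) { return true; }

void fineGrainAdjust(std::priority_queue<Move> pq,
                     std::vector<int>& partitionOf,
                     std::vector<bool>& adjusted) {
    while (!pq.empty()) {
        Move m = pq.top(); pq.pop();
        if (adjusted[m.actor]) continue;  // each unit is adjusted at most once
        if (introducesCycle(m.actor, m.objPartition) ||
            !keepsBalance(m.actor, m.objPartition)) continue;
        partitionOf[m.actor] = m.objPartition;  // migrate to target partition
        adjusted[m.actor] = true;
        // ... recompute the weights of affected neighbours and refresh pq ...
    }
}
```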
(1.2) Thread-level task partitioning. Thread-level task partitioning determines the mapping between the computing units on a cluster node and the processing cores inside that node. Task execution within a node adopts synchronous pipeline scheduling, and thread-level task partitioning adopts an allocation strategy whose target is load balance with minimum synchronization overhead; the main factors considered in the inter-thread partitioning are load balance and locality. The thread-level task partitioning steps are: first, a multilevel K-way graph partitioning algorithm produces an initial partition of the computing units inside each cluster node; second, the replication-fission method splits heavily loaded computing units to reduce the granularity of the computing units; Fig. 2 shows the flowchart of the replication-fission method applied to a data flow program inside a multi-core cluster node, sketched in code after this paragraph. The algorithm proceeds as follows: taking the result of the K-way graph partitioning as input, compute the computational load of each partition and sort the partitions by load; find the partition number MaxPartition and workload maxWeight of the most heavily loaded partition that contains a splittable actor (a basic computing unit), and the partition number MinPartition and workload minWeight of the most lightly loaded partition; test the inequality maxWeight < minWeight * balanceFactor (balanceFactor is a balance factor); if the result is true, the algorithm terminates; if false, find the splittable actor with the largest workload in MaxPartition, compute its fission factor repFactor, set repFactor = max(repFactor, 2), split the actor horizontally into repFactor copies, place one copy in MinPartition and the remaining repFactor − 1 copies in MaxPartition, remove the split actor from MaxPartition, and return to the start of the procedure (computing and sorting the loads of the partitions), looping until the exit condition is met. Finally, the multilevel K-way graph partitioning algorithm is applied once more to the graph after fission, guaranteeing load balance across the processing cores and good locality.
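The balancing loop of Fig. 2 can be sketched as follows; the computation of the fission factor repFactor is simplified here to a load-ratio guess, and an iteration cap is added for safety, neither of which is specified by the patent.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Actor { double load; bool splittable; };
struct Partition { std::vector<Actor> actors; };

double loadOf(const Partition& p) {
    double s = 0;
    for (const Actor& a : p.actors) s += a.load;
    return s;
}

// Replication-fission loop: while the heaviest partition with a splittable
// actor violates maxWeight < minWeight * balanceFactor, split its heaviest
// splittable actor into repFactor pieces, sending one piece to MinPartition.
void replicationFission(std::vector<Partition>& parts, double balanceFactor) {
    for (int iter = 0; iter < 1000; ++iter) {       // safety cap (sketch only)
        std::size_t maxP = parts.size(), minP = 0;
        for (std::size_t i = 0; i < parts.size(); ++i) {
            if (loadOf(parts[i]) < loadOf(parts[minP])) minP = i;
            bool hasSplittable = std::any_of(
                parts[i].actors.begin(), parts[i].actors.end(),
                [](const Actor& a) { return a.splittable; });
            if (hasSplittable && (maxP == parts.size() ||
                                  loadOf(parts[i]) > loadOf(parts[maxP])))
                maxP = i;
        }
        if (maxP == parts.size()) return;           // no splittable actor left
        double maxW = loadOf(parts[maxP]), minW = loadOf(parts[minP]);
        if (maxW < minW * balanceFactor) return;    // exit condition met

        auto heaviest = std::max_element(           // heaviest splittable actor
            parts[maxP].actors.begin(), parts[maxP].actors.end(),
            [](const Actor& a, const Actor& b) {
                return (a.splittable ? a.load : -1.0) <
                       (b.splittable ? b.load : -1.0);
            });
        int repFactor = std::max(2, (int)(maxW / std::max(minW, 1.0)));
        Actor piece{heaviest->load / repFactor, true};
        parts[maxP].actors.erase(heaviest);         // remove the split actor
        parts[minP].actors.push_back(piece);        // one copy to MinPartition
        for (int k = 0; k < repFactor - 1; ++k)     // rest stay in MaxPartition
            parts[maxP].actors.push_back(piece);
    }
}
```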
(2) A hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node.
Based on the task partitioning results of step (1), this step determines the pipelined execution of the process-level and thread-level tasks so that the program executes with as little delay as possible. It comprises two steps: asynchronous pipeline scheduling between cluster nodes and synchronous software pipeline scheduling among the cores within a cluster node. Synchronous pipelining uses a global synchronization clock to guarantee that the tasks executing in each pipeline stage complete simultaneously, every execution stage having an equal execution delay. The subtasks of the asynchronous software pipeline execute in a data-driven manner: the data produced by one subtask's execution is sent to another subtask that has a dependence on it, and a subtask may start executing once it has received data and its other conditions are satisfied; in an asynchronous pipeline the execution of the whole pipeline needs no global synchronization, and computation is separated from communication. To balance computation time against data transfer time, data transfers between asynchronous pipeline subtasks usually adopt a block transfer mechanism: a message transfer is triggered as soon as the communication buffer between tasks is filled, without waiting for the subtask to finish its current phase. The concrete steps are as follows:
(2.1) Asynchronous pipeline scheduling between cluster nodes
Process-level partitioning, in assigning subtasks to nodes, also determines the dependences between subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution satisfies the data-driven property, and the execution between subtasks follows the producer-consumer pattern. Fig. 3 shows the execution of a data flow program on a cluster composed of three multi-core machines, each corresponding to one of the three subtasks I, II and III into which the compiler's process-level task partitioning divides the program. The execution of the actors within a machine depends on the machine's internal parallel architecture and scheduling scheme; on a shared-memory multi-core platform, synchronous pipeline scheduling is adopted inside the node. Between nodes, to amortize the per-unit cost of data transmission, the asynchronous pipeline uses block communication between nodes: the producer triggers the message-passing mechanism when a communication block is full, and the consumer starts executing after receiving the message. Taking I and II in Fig. 3 as an example: after actor C has executed for some time, the communication buffer between actor C and actor F is filled and C sends its data to F; after F receives the data produced by C, F starts executing while C continues to execute and produce new data. This asynchronous pipelined execution guarantees the execution of the data flow program on the cluster.
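A minimal MPI sketch of this block communication follows, mirroring the C-to-F example above; the ranks, tag, block size and the actor computations are illustrative assumptions, not values from the patent.

```cpp
#include <mpi.h>
#include <vector>

// Producer-consumer block communication between two pipeline subtasks:
// the producer (rank 0, like actor C) fills a communication block and only
// then triggers the message; the consumer (rank 1, like actor F) starts
// executing after the block arrives, while the producer keeps computing.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int BLOCK = 1024;                // communication block size (assumed)
    const int ROUNDS = 4;
    std::vector<int> buf(BLOCK);

    if (rank == 0) {                       // producer
        for (int r = 0; r < ROUNDS; ++r) {
            for (int i = 0; i < BLOCK; ++i) buf[i] = r * BLOCK + i; // compute
            // buffer is now full: trigger message passing
            MPI_Send(buf.data(), BLOCK, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {                // consumer
        for (int r = 0; r < ROUNDS; ++r) {
            MPI_Recv(buf.data(), BLOCK, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // ... consume the received block ...
        }
    }
    MPI_Finalize();
    return 0;
}
```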
(2.2) Synchronous pipeline scheduling inside a cluster node
Thread-level synchronous pipeline scheduling comprises two steps: stage assignment and construction of the pipeline schedule table. After thread-level task partitioning completes, stage assignment is performed and the synchronous software pipeline is built. The concrete steps are as follows:
(2.2.1) Stage assignment. First, the computation nodes of the data flow graph inside the cluster node are topologically sorted to form a topological sequence. Next, the stage number of each computation node in the topological sequence is initialized to 0. Then, for each node, it is judged whether it lies on the same cluster node as its predecessor: if so, it is judged whether it lies on the same processing core as the predecessor; if they share a processing core, its stage is identical to the predecessor's; if not, its stage number is one greater than the predecessor's stage number; if it does not lie on the same cluster node, its stage number is independent of that predecessor. By traversing the topological sequence of computation nodes, all nodes are assigned their stage numbers.
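A sketch of this stage assignment is given below; for simplicity it assumes each node records a single predecessor in the topological order (-1 for a start node), whereas a real data flow graph may have several.

```cpp
#include <cstddef>
#include <vector>

// A computation node: the cluster node and core it was partitioned onto,
// plus the index of its predecessor in the topological sequence (or -1).
struct Node { int clusterNode, core, pred; };

// Assign pipeline stage numbers over a topological sequence, following the
// three cases described above.
std::vector<int> assignStages(const std::vector<Node>& topo) {
    std::vector<int> stage(topo.size(), 0);        // all stages start at 0
    for (std::size_t i = 0; i < topo.size(); ++i) {
        int p = topo[i].pred;
        if (p < 0) continue;                       // start node keeps stage 0
        if (topo[i].clusterNode != topo[p].clusterNode)
            continue;    // different cluster node: stage independent of pred
        if (topo[i].core == topo[p].core)
            stage[i] = stage[p];                   // same core: same stage
        else
            stage[i] = stage[p] + 1;               // same node, other core: +1
    }
    return stage;
}
```

Applied to the Fig. 4 (a) example below, P and Q on Core0 receive stage 0, R receives 1, S receives 2, T receives 3, and U and V receive 4.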
(2.2.2) Constructing the pipeline schedule. The synchronous pipeline schedule table is built from the results of task partitioning and stage assignment. As shown in Fig. 4, the abscissa represents the resources, i.e. the processing cores, and the ordinate represents the stage numbers. In Fig. 4 (a), P, Q and S are partitioned onto the same core Core0, R and T onto the same core Core1, and U and V onto the same core Core2. P is the start node with stage number 0; Q and its parent node P are on the same core, so its stage number is also 0; the stage number of R is 1, that of S is 2, that of T is 3, and that of U and V is 4. As Fig. 4 (b) shows, the software pipeline execution passes through a fill phase, a full phase and a drain phase.
(3) A cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
Because multiple threads share cached data and the cache stores data in units of cache lines, false sharing (False Sharing) arises when several threads modify mutually independent variables that reside on the same cache line, affecting the execution performance of the program. This step targets the false sharing present in cache accesses and optimizes it in two respects:
(3.1) Cache line filling eliminates the false sharing produced by the synchronization between pipeline stages. The filling mechanism prevents the variables of different threads from sharing the same cache line, eliminating false sharing.
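A minimal sketch of the cache line filling mechanism, assuming a 64-byte cache line (a common size on X86, though not stated in the patent): each thread's counter is padded to occupy its own line.

```cpp
#include <thread>
#include <vector>

// Pad each per-thread variable to a full (assumed) 64-byte cache line so
// that two threads never write to the same line: no false sharing.
struct alignas(64) PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];   // fill the remainder of the cache line
};

int main() {
    const int THREADS = 4;
    std::vector<PaddedCounter> counters(THREADS);
    std::vector<std::thread> workers;
    for (int t = 0; t < THREADS; ++t)
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value++;   // each thread touches only its own line
        });
    for (auto& w : workers) w.join();
    return 0;
}
```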
(3.2) The steady-state expansion technique eliminates the false sharing produced by data transfers between computing units. As shown in Fig. 5 (a), when a producer-consumer chain P, C executes in parallel on different cores and the memory accessed by P and C lies on the same cache line, false sharing also occurs, as shown in Fig. 5 (b). A complex data flow graph may contain many inter-core communication edges; if the cache line filling mechanism were still used, a great deal of space would inevitably be wasted, lowering space utilization and producing higher communication delay. To eliminate the false sharing of communication buffers while improving cache utilization as far as possible, the steady-state expansion technique is adopted; Fig. 5 (c) shows the cache usage after false sharing has been eliminated. The steady-state expansion algorithm follows a greedy idea: it first computes, for one steady-state execution of the data flow program, the expansion coefficient each involved computing unit would need for all of its output edges to be free of false sharing, and then, among all these expansion coefficients, finds the largest coefficient that, after all computing units are expanded, does not cause the L1 data cache to overflow during execution, taking it as the final expansion coefficient. To let the cache be even more effective, the search for the expansion coefficient need not keep every computing unit free of data cache overflow: following the 90/10 principle, 10% of the computing units may be allowed to overflow the L1 data cache during execution as long as they do not overflow the L2/L3 cache, which still yields good performance.
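The greedy choice of the final expansion coefficient can be sketched as follows; the per-unit working-set sizes and the candidate coefficients are assumed inputs, and only the L1 test with the 10% allowance is shown (a full implementation would also check L2/L3).

```cpp
#include <algorithm>
#include <vector>

// Pick the largest expansion coefficient such that, after all computing
// units are expanded, at most 10% of them overflow the L1 data cache,
// following the 90/10 allowance described above.
int chooseExpansion(const std::vector<int>& candidateFactors, // per-unit coefficients
                    const std::vector<long>& workingSet,      // bytes per unit
                    long l1Bytes) {
    int best = 1;
    for (int f : candidateFactors) {
        long overflowing = 0;
        for (long ws : workingSet)
            if (ws * f > l1Bytes) ++overflowing;          // unit spills L1
        if (overflowing * 10 <= (long)workingSet.size())  // at most 10% spill
            best = std::max(best, f);
    }
    return best;
}
```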
Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A data flow compilation optimization method oriented to multi-core clusters, characterized by comprising the following steps:
a task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster;
a hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node;
a cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
2. The data flow compilation optimization method oriented to multi-core clusters according to claim 1, characterized in that the task partitioning and scheduling step is specifically:
first, performing process-level task partitioning on the synchronous data flow graph to determine the cluster node to which each computation task is assigned;
second, performing thread-level task partitioning on the tasks of the synchronous data flow graph within a cluster node to determine the processing core of that cluster node to which each computation task is assigned.
3. The data flow compilation optimization method oriented to multi-core clusters according to claim 2, characterized in that the task partitioning is obtained by converting it into a graph partitioning problem and, according to the different targets of process-level and thread-level task partitioning, solving it with the Group partitioning strategy and the replication-fission strategy respectively.
4. The data flow compilation optimization method oriented to multi-core clusters according to claim 3, characterized in that the Group partitioning strategy adopted by the process-level task partitioning is specifically:
a preprocessing stage, which fuses several computing units into one group, reducing the number of communication edges between a single computing unit within a group and the computing units in other groups;
a coarsening stage, which fuses several adjacent groups into one;
an initial partitioning stage, which maps the groups onto the cluster's computing nodes, at the same time determining the mapping between computation nodes and cluster nodes;
a fine-grained adjustment stage, which tunes the boundary nodes of each partition produced by the initial partitioning to reduce communication overhead.
5. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 2 to 4, characterized in that the thread-level task partitioning step is specifically:
first, using a multilevel K-way graph partitioning algorithm to produce an initial partition of the computing units inside each cluster node;
second, using the replication-fission method to split heavily loaded computing units, reducing the granularity of the computing units;
finally, applying the multilevel K-way graph partitioning algorithm once more to the graph after fission, guaranteeing load balance across the processing cores and good locality.
6. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 1 to 5, characterized in that the hierarchical pipeline scheduling step is specifically:
first, adopting asynchronous pipeline scheduling between cluster nodes;
second, adopting synchronous pipeline scheduling inside each cluster node.
7. The data flow compilation optimization method oriented to multi-core clusters according to claim 6, characterized in that the asynchronous pipeline scheduling adopts the producer-consumer model and assigns the results of the process-level partitioning randomly onto the nodes of the cluster.
8. The data flow compilation optimization method oriented to multi-core clusters according to claim 6 or 7, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:
first, topologically sorting the computation nodes of the data flow graph inside a process to form a topological sequence;
second, initializing the stage number of each computation node in the topological sequence to 0; then judging whether the node lies on the same cluster node as its predecessor: if so, judging whether it lies on the same processing core as the predecessor; if they share a processing core, its stage is identical to the predecessor's; if not, its stage number is one greater than the predecessor's stage number; if it does not lie on the same cluster node, its stage number is independent of that predecessor; assigning stage numbers to all nodes by traversing the topological sequence of computation nodes, and constructing the synchronous pipeline schedule table inside the cluster node.
9. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 6 to 8, characterized in that the detailed process of the cache optimization is:
first, using the cache line filling mechanism to eliminate the false sharing caused by the synchronization between the stages of the synchronous software pipeline inside a cluster node;
second, using the steady-state expansion technique to eliminate the false sharing caused by data transfers between computing units.
CN201410185945.5A 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters Active CN103970580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Publications (2)

Publication Number Publication Date
CN103970580A true CN103970580A (en) 2014-08-06
CN103970580B CN103970580B (en) 2017-09-15

Family

ID=51240117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185945.5A Active CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Country Status (1)

Country Link
CN (1) CN103970580B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105242909A (en) * 2015-11-24 2016-01-13 无锡江南计算技术研究所 Method for many-core circulation partitioning based on multi-version code generation
CN105892996A (en) * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Assembly line work method and apparatus for batch data processing
CN106610860A (en) * 2015-10-26 2017-05-03 三星电子株式会社 Operating method of semiconductor device and semiconductor system
CN106909343A (en) * 2017-02-23 2017-06-30 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN107179956A (en) * 2017-05-17 2017-09-19 北京计算机技术及应用研究所 It is layered the internuclear reliable communication method of polycaryon processor
CN107391136A (en) * 2017-07-21 2017-11-24 众安信息技术服务有限公司 A kind of programing system and method based on streaming
CN107851040A (en) * 2015-07-23 2018-03-27 高通股份有限公司 For the system and method using cache requirements monitoring scheduler task in heterogeneous processor cluster framework
CN109426574A (en) * 2017-08-31 2019-03-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
CN109815617A (en) * 2019-02-15 2019-05-28 湖南高至科技有限公司 A kind of simulation model driving method
CN109857562A (en) * 2019-02-13 2019-06-07 北京理工大学 A kind of method of memory access distance optimization on many-core processor
CN111090464A (en) * 2018-10-23 2020-05-01 华为技术有限公司 Data stream processing method and related equipment
CN111367665A (en) * 2020-02-28 2020-07-03 清华大学 Parallel communication route establishing method and system
CN111817894A (en) * 2020-07-13 2020-10-23 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918A (en) * 2020-07-28 2020-11-03 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585A (en) * 2020-12-16 2021-04-06 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN113254021A (en) * 2021-04-16 2021-08-13 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406A (en) * 2022-05-18 2022-08-05 南京安元科技有限公司 Distributed compiling and packaging system and method based on Docker
CN115617917A (en) * 2022-12-16 2023-01-17 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009021539A1 (en) * 2007-08-16 2009-02-19 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009021539A1 (en) * 2007-08-16 2009-02-19 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张维维 et al.: "COStream: a data-flow-oriented programming language and compiler implementation", Chinese Journal of Computers (《计算机学报》) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107851040A (en) * 2015-07-23 2018-03-27 高通股份有限公司 For the system and method using cache requirements monitoring scheduler task in heterogeneous processor cluster framework
CN106610860A (en) * 2015-10-26 2017-05-03 三星电子株式会社 Operating method of semiconductor device and semiconductor system
CN105242909A (en) * 2015-11-24 2016-01-13 无锡江南计算技术研究所 Method for many-core circulation partitioning based on multi-version code generation
CN105892996A (en) * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Assembly line work method and apparatus for batch data processing
CN106909343A (en) * 2017-02-23 2017-06-30 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN106909343B (en) * 2017-02-23 2019-01-29 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN107179956A (en) * 2017-05-17 2017-09-19 北京计算机技术及应用研究所 It is layered the internuclear reliable communication method of polycaryon processor
CN107179956B (en) * 2017-05-17 2020-05-19 北京计算机技术及应用研究所 Reliable communication method among cores of layered multi-core processor
CN107391136A (en) * 2017-07-21 2017-11-24 众安信息技术服务有限公司 A kind of programing system and method based on streaming
CN107391136B (en) * 2017-07-21 2020-11-06 众安信息技术服务有限公司 Programming system and method based on stream
CN109426574A (en) * 2017-08-31 2019-03-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
WO2019042312A1 (en) * 2017-08-31 2019-03-07 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
US11010681B2 (en) 2017-08-31 2021-05-18 Huawei Technologies Co., Ltd. Distributed computing system, and data transmission method and apparatus in distributed computing system
CN109426574B (en) * 2017-08-31 2022-04-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
US11900113B2 (en) 2018-10-23 2024-02-13 Huawei Technologies Co., Ltd. Data flow processing method and related device
CN111090464B (en) * 2018-10-23 2023-09-22 华为技术有限公司 Data stream processing method and related equipment
CN111090464A (en) * 2018-10-23 2020-05-01 华为技术有限公司 Data stream processing method and related equipment
CN109857562A (en) * 2019-02-13 2019-06-07 北京理工大学 A kind of method of memory access distance optimization on many-core processor
CN109815617A (en) * 2019-02-15 2019-05-28 湖南高至科技有限公司 A kind of simulation model driving method
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN111367665A (en) * 2020-02-28 2020-07-03 清华大学 Parallel communication route establishing method and system
WO2021169393A1 (en) * 2020-02-28 2021-09-02 清华大学 Parallel communication routing setup method and system
CN111817894A (en) * 2020-07-13 2020-10-23 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918A (en) * 2020-07-28 2020-11-03 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585A (en) * 2020-12-16 2021-04-06 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN112612585B (en) * 2020-12-16 2022-07-29 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113254021A (en) * 2021-04-16 2021-08-13 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN113254021B (en) * 2021-04-16 2022-04-29 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406A (en) * 2022-05-18 2022-08-05 南京安元科技有限公司 Distributed compiling and packaging system and method based on Docker
CN114860406B (en) * 2022-05-18 2024-02-20 安元科技股份有限公司 Distributed compiling and packing system and method based on Docker
CN115617917A (en) * 2022-12-16 2023-01-17 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Also Published As

Publication number Publication date
CN103970580B (en) 2017-09-15

Similar Documents

Publication Publication Date Title
CN103970580A (en) Data flow compilation optimization method oriented to multi-core cluster
CN107329828B (en) A kind of data flow programmed method and system towards CPU/GPU isomeric group
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
CN104965761B (en) A kind of more granularity divisions of string routine based on GPU/CPU mixed architectures and dispatching method
CN100449478C (en) Method and apparatus for real-time multithreading
CN103970602B (en) Data flow program scheduling method oriented to multi-core processor X86
CN104781786B (en) Use the selection logic of delay reconstruction program order
CN103809936A (en) System and method for allocating memory of differing properties to shared data objects
US7694290B2 (en) System and method for partitioning an application utilizing a throughput-driven aggregation and mapping approach
CN110413391A (en) Deep learning task service method for ensuring quality and system based on container cluster
Gent et al. A preliminary review of literature on parallel constraint solving
CN102855153B (en) Towards the stream compile optimization method of chip polycaryon processor
CN116401055B (en) Resource efficiency optimization-oriented server non-perception computing workflow arrangement method
CN111158790B (en) FPGA virtualization method for cloud deep learning reasoning
CN107247628A (en) A kind of data flow sequence task towards multiple nucleus system is divided and dispatching method
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
CN107329822A (en) Towards the multi-core dispatching method based on super Task Network of multi-source multiple nucleus system
CN112905317A (en) Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform
Roth et al. Adaptive algorithm and tool flow for accelerating systemc on many-core architectures
Varisteas et al. Resource management for task-based parallel programs over a multi-kernel.: Bias: Barrelfish inter-core adaptive scheduling
CN106844024B (en) GPU/CPU scheduling method and system of self-learning running time prediction model
Massari et al. Predictive resource management for next-generation high-performance computing heterogeneous platforms
Kumar et al. Overflowing emerging neural network inference tasks from the GPU to the CPU on heterogeneous servers
CN108205465A (en) The task-dynamic dispatching method and device of streaming applications
Ha et al. Decidable signal processing dataflow graphs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant