CN103970580B - Data flow compilation optimization method for multi-core clusters - Google Patents

Data flow compilation optimization method for multi-core clusters

Info

Publication number
CN103970580B
CN103970580B CN201410185945.5A CN201410185945A
Authority
CN
China
Prior art keywords
node
stage
task
data flow
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410185945.5A
Other languages
Chinese (zh)
Other versions
CN103970580A (en)
Inventor
于俊清
张维维
唐九飞
何云峰
管涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201410185945.5A
Publication of CN103970580A
Application granted
Publication of CN103970580B
Active legal status (current)
Anticipated expiration legal status

Links

Abstract

The invention discloses a data flow compilation optimization method for multi-core cluster systems, comprising: a task partitioning and scheduling step that determines the mapping of computing tasks to processing cores; a hierarchical pipeline scheduling step that constructs, from the task partitioning result, an inter-node pipeline schedule between cluster nodes and an inter-core pipeline schedule within each cluster node; and a cache optimization step performed according to the architectural characteristics of the multi-core processors, the communication among cluster nodes, and the execution behavior of the data flow program on the multi-core processors. The method of the present invention combines data flow program optimization techniques with the system architecture, fully exploits load balancing and the high concurrency of mixed synchronous-asynchronous pipelined code on a multi-core cluster, and optimizes the cache accesses and communication of the program for the caching and communication mechanisms of the multi-core cluster, further improving the execution performance of the program and yielding a smaller execution time.

Description

Data flow compilation optimization method for multi-core clusters
Technical field
The invention belongs to the field of computer compilation techniques, and more particularly relates to a data flow compilation optimization method for multi-core clusters.
Background art
With the development of semiconductor technology, multi-core processors have proven to be a feasible platform for exploiting parallelism. Multi-core cluster parallel systems, with their powerful computing capability and good scalability, have become an important parallel computing platform design. While a multi-core cluster system provides powerful processing capability, it also shifts more of the burden onto compilers and programmers, who must effectively exploit coarse-grained parallelism across cores. Data flow programming provides a feasible way to exploit the parallelism of multi-core architectures. In this model, each node represents a computing task and each edge represents the data flow between computing tasks. Each computing task is an independent computing unit with its own instruction stream and address space, and data between computing tasks flows through first-in-first-out (FIFO) communication queues. The data flow programming model takes the data flow model as its basis and a data flow programming language as its implementation. Data flow compilation covers the compilation techniques involved in converting a data flow program into executable code for the underlying target. Among them, compilation optimization plays a decisive role in the runtime performance of a data flow program on the target processing cores.
The compiler laboratory of the Massachusetts Institute of Technology has published a stream programming language, StreamIt. The language is based on Java, extends Java with stream constructs, and introduces the concept of a Filter. A Filter is the most basic computing unit: a single-input, single-output program block. The processing of each Filter is described by a Work function, and Work functions communicate with each other in FIFO fashion through Push, Pop and Peek operations. Meanwhile, a stream compilation optimization technique was proposed for the next-generation high-performance computer architecture Raw: first, the compiler applies a combination of splitting and fusion to the computing nodes to increase the ratio of computation to communication overhead; then the resulting computing nodes are mapped onto the processing cores to achieve load balancing. Each processing core executes in pipelined fashion, and inter-core communication uses explicit data transfers.
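For illustration, the following minimal C++ sketch (not StreamIt itself; all names are illustrative) shows the filter model described above: a single-input single-output computing unit whose Work step communicates through FIFO queues via push, pop and peek operations.

#include <cstdio>
#include <deque>

// Minimal sketch of the filter/FIFO model described above (illustrative C++,
// not actual StreamIt): each filter has a work() step that pops items from
// its input queue and pushes results to its output queue.
struct Fifo {
    std::deque<int> q;
    void push(int v) { q.push_back(v); }
    int  pop()       { int v = q.front(); q.pop_front(); return v; }
    int  peek(int i) const { return q[i]; }             // look ahead without consuming
    bool ready(int n) const { return q.size() >= (size_t)n; }
};

// A filter that consumes one item per firing and produces one item.
struct ScaleFilter {
    Fifo *in, *out;
    void work() { out->push(2 * in->pop()); }           // one "Work" execution
};

int main() {
    Fifo a, b;
    ScaleFilter f{&a, &b};
    for (int i = 0; i < 4; ++i) a.push(i);              // source fills the FIFO
    while (a.ready(1)) f.work();                        // fire while data is available
    while (b.ready(1)) std::printf("%d ", b.pop());     // sink: prints 0 2 4 6
}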
The StreamIt stream optimization offers one solution to the scheduling problem of the stream programming model on multi-core processors. By distributing computing tasks to the processing cores it achieves load balancing and ensures the parallel execution of computing tasks on the processing cores. However, it has the following defects: (1) the computation and communication scheduled on each processing core are separated; each occupies its own pipeline stage and is assigned independent communication time, which increases the communication overhead; (2) it does not consider the low-level memory allocation optimization and communication optimization of the processing cores; (3) the compilation optimization method does not optimize for the architectural characteristics of the underlying multi-core cluster system. In short, a multi-core cluster system, while providing powerful computing capability, also exposes its hierarchical storage structure and software communication mechanisms to the programmer. Existing stream compilation optimization methods do not take the underlying architecture into account and do not make full use of system hardware resources, such as storage resources, to improve program execution efficiency.
Summary of the invention
The object of the present invention is to provide a data flow compilation optimization method for multi-core clusters that optimizes data flow programs for the architecture of a multi-core cluster system and largely improves the execution performance of data flow programs.
The optimization method of the present invention takes as input the intermediate representation produced by the data flow compiler front end, the synchronous data flow graph, and applies to it, in sequence, three levels of processing: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally producing executable code. The specific steps are as follows:
(1) Task partitioning and scheduling step: determining the mapping of computing tasks to multi-core cluster nodes and processing cores
A node in the data flow graph represents a computing task, and an edge represents the communication between computing tasks. First, process-level task partitioning is applied to the synchronous data flow graph according to the number of nodes in the cluster. This sub-step uses the Group task partitioning strategy; its goal is to minimize inter-node communication overhead and maximize program execution performance. The partitioning must consider both load balancing and the minimization of communication overhead, assigning each computing task to its corresponding cluster node. Secondly, according to the computing tasks on each cluster node, thread-level task partitioning assigns each computing task to a processing core of the cluster node. This sub-step uses the replication splitting algorithm, which splits heavily loaded computing tasks; its goal is to achieve load balancing across the processing cores within a cluster node.
(2) Hierarchical pipeline scheduling step: constructing, from the task partitioning result, the inter-node pipeline schedule between cluster nodes and the inter-core pipeline schedule within each cluster node
The synchronous pipeline uses one global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously; the asynchronous software pipeline executes its subtasks in a data-driven manner. First, the synchronous data flow graph is scheduled with an asynchronous pipeline, which determines how tasks execute across cluster nodes; this step randomly maps the computing tasks of each process, as a whole, onto the cluster nodes, completing the mapping between processes and cluster nodes. Secondly, according to the dependences between computing tasks within a cluster node, each computing task (node) is assigned its stage number in the pipeline, completing the synchronous pipeline construction. Finally, using both pieces of information, the pipeline schedule table is constructed.
(3) Cache optimization step: optimizing according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors
When computing tasks (nodes) execute, the processing cores on which they run may exhibit false sharing in their use of the cache, which has a large impact on the execution performance of the program.
General-purpose x86 multi-core processors are analyzed, and the false sharing present during program execution is eliminated by combining a cache line filling (padding) mechanism with the steady-state expansion technique, optimizing the use of the cache.
The present invention integrates data flow scheduling optimization with the structure of the multi-core cluster system and realizes a three-level optimization of the data flow program, specifically comprising task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, improving the execution performance of the data flow program on the target platform. Specifically, the present invention has the following advantages:
(1) The parallelism of the program is improved. Through a formal description of the problem, the present invention abstracts the scheduling of the data flow graph onto the processing cores of a multi-core cluster system as a graph partitioning problem, constructs a hierarchical pipeline scheduling model for the data flow program, and maps the tasks onto the processing cores, achieving low communication overhead and load balancing and improving the parallelism of the program.
(2) Overhead is reduced. The present invention proposes a hierarchical pipeline scheduling model mixing synchronous and asynchronous pipelines, making full use of the computing and communication resources of the system; meanwhile, the use of the cache inside each cluster node is optimized, improving the locality of data accesses and the cache utilization, and enhancing the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a structural framework diagram of the method of the invention within the data flow compilation system;
Fig. 2 is a flow chart of the replication splitting algorithm applied inside a cluster node to a data flow program in an embodiment of the present invention;
Fig. 3 is an example diagram of the asynchronous pipelined execution of a data flow program on a cluster in an embodiment of the present invention;
Fig. 4 (a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the present invention;
Fig. 4 (b) is an example diagram of the software pipeline execution process corresponding to Fig. 4 (a);
Fig. 5 (a) is a schematic diagram of task execution in which the steady-state expansion technique eliminates false sharing for a data flow program in an embodiment of the present invention;
Fig. 5 (b) is a schematic diagram of the tasks in Fig. 5 (a) before false sharing elimination;
Fig. 5 (c) is a schematic diagram of the tasks in Fig. 5 (a) after false sharing elimination.
Detailed description of the embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described here merely illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with each other as long as they do not conflict.
Fig. 1 shows the structural framework of this embodiment within the stream compilation system. After a data flow program is parsed by the data flow compiler front end, an intermediate representation is generated: the synchronous data flow graph (Synchronous Data Flow, SDF). It then passes, in order, through the three-level optimization process of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, finally producing object code encapsulated with the Message Passing Interface (MPI), completing the compilation.
(1) Task partitioning and scheduling step: determining the mapping of computing tasks to multi-core cluster nodes and processing cores
This step includes two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, so their communication cost is large, whereas communication within a node is intra-machine communication, whose cost is small. Task partitioning of a data flow program must therefore distinguish these differences between nodes and between the cores within a node. Task partitioning at the different levels of the cluster is described as follows: process-level task partitioning minimizes inter-node communication overhead on the premise of load balancing across nodes, and no cycles may appear between the resulting partitions; thread-level task partitioning minimizes synchronization overhead on the premise of load balancing within a node and preserves data locality as far as possible. The specific steps are as follows:
(1.1) Process-level task partitioning. Process-level task partitioning determines the mapping between computing units and cluster nodes. To amortize the per-unit-data communication overhead of the data flow program during execution, inter-process data communication uses a block communication mechanism: a message transmission is triggered only when a buffer is filled or the buffer is forcibly flushed. To prevent deadlock during program execution, process-level partitioning must avoid cycles in the data dependences between partitions. For process-level task partitioning of the synchronous data flow graph on a multi-core cluster, the Group partitioning strategy is proposed, implemented with a greedy algorithm. Group task partitioning introduces the Group structure; a group represents a set of one or more computing units in the synchronous data flow graph. Initially, every computing unit of the synchronous data flow graph is treated as one group, and the dependences between groups are consistent with those between computing units. Group task partitioning mainly consists of four stages:
(1.1.1) Preprocessing stage. This stage targets the multi-input multi-output computing units in the synchronous data flow graph. It fuses multiple computing units into one group, reducing the number of communication edges between single computing units in the group and computing units in other groups.
(1.1.2) Group coarsening stage. This stage coarsens the preprocessed group graph, fusing multiple adjacent groups into one while avoiding the appearance of cycles in the coarsened group graph. The gain produced by fusing a pair of groups is called the coarsening gain; it is computed from workload(srcGroup) and workload(snkGroup), the respective loads of srcGroup and snkGroup, and from comm(srcGroup, snkGroup), the communication overhead between srcGroup and snkGroup, which includes both data sending and data receiving.
The coarsening uses a greedy heuristic. First, the coarsening gains of all adjacent group pairs are computed and stored in a priority queue. The pair of groups with the maximum gain is taken from the priority queue and fused; the fusion is valid if the load of the new group formed does not exceed the theoretical mean load per partition and no cycle appears in the group graph after the fusion. The groups consumed by a valid fusion are deleted from the group graph, the new group obtained by the fusion is inserted into the graph, the dependences between groups are updated, and the gains in the priority queue are updated according to the new group; this process iterates. The termination condition of the algorithm is that no fusion between any pair of groups yields a positive gain, or the number of groups in the group graph falls below a threshold.
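For illustration, the following simplified C++ sketch shows the greedy coarsening loop just described. The gain() function is an assumed stand-in (the patent defines the coarsening gain in terms of workload(srcGroup), workload(snkGroup) and comm(srcGroup, snkGroup) but its exact formula is not reproduced here), and cycle detection and edge rebuilding are elided.

#include <cstdio>
#include <queue>
#include <vector>

// Simplified sketch of the greedy coarsening loop described above.
struct Group { double workload; };
struct Edge  { int src, snk; double comm; };

double gain(const std::vector<Group>& g, const Edge& e) {
    // Assumed form: favor merging pairs with heavy communication relative to
    // their combined load (illustrative, not the patent's exact formula).
    return e.comm / (g[e.src].workload + g[e.snk].workload);
}

int main() {
    std::vector<Group> groups = {{4.0}, {3.0}, {5.0}};
    std::vector<Edge>  edges  = {{0, 1, 6.0}, {1, 2, 2.0}};
    double meanLoad = 8.0;                      // theoretical mean load per partition

    auto cmp = [&](const Edge& a, const Edge& b) {
        return gain(groups, a) < gain(groups, b);
    };
    std::priority_queue<Edge, std::vector<Edge>, decltype(cmp)> pq(cmp);
    for (const Edge& e : edges) pq.push(e);

    while (!pq.empty()) {
        Edge e = pq.top(); pq.pop();
        double merged = groups[e.src].workload + groups[e.snk].workload;
        if (gain(groups, e) <= 0 || merged > meanLoad) continue;  // invalid fusion
        // A full implementation would also reject fusions that create a cycle
        // in the group graph, then rebuild the edges and requeue the gains.
        groups[e.src].workload = merged;
        std::printf("fused %d and %d, new load %.1f\n", e.src, e.snk, merged);
    }
}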
(1.1.3) Initial partitioning stage. This stage makes a preliminary decision on the mapping between the groups of the coarsened group graph and the cluster nodes. The initial partitioning balances the load of each partition while keeping the communication between partitions as small as possible. It uses a deadlock-prevention strategy so that cycles are avoided in the partitioning result from the start. The coarsened group graph is a directed acyclic graph (Directed Acyclic Graph, DAG), and a topological sort of the DAG yields a topological sequence from the partial order among its nodes. During initial partitioning, the group nodes of the group graph are examined one by one according to this topological sequence, and the specific partition number of each group is determined.
(1.1.4) Fine-grained adjustment stage. This stage tunes the boundary computing units of the partitions, i.e., the computing units that communicate with computing units on other cluster nodes, further adjusting them according to the communication situation to reduce inter-node communication overhead. For a boundary computing unit, the partition where the unit resides is called its source partition (srcPartition), and a partition holding a computing unit that has a dependence with it is called a target partition (objPartition); a computing unit has exactly one srcPartition and may have several objPartition. The traffic between the computing unit and the other computing units in srcPartition is internalData, and the traffic between the computing unit and the computing units in the i-th objPartition is externalData[i]. During fine-grained adjustment, a priority queue is maintained whose weights are externalData[i] - internalData. The adjustment repeatedly selects and processes the maximum weight; whether a computing unit can be moved to an objPartition is decided by two factors: first, the move must not introduce a cycle into the partitioning; secondly, it must not destroy, beyond a certain extent, the load balance among all partitions. After a computing unit has been adjusted, the priority queue is updated according to the result, but an already-adjusted computing unit is not considered as an adjustment object again.
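For illustration, the following C++ sketch shows the priority queue used by the fine-grained adjustment just described, with externalData[i] - internalData as the move weight; the unit and partition data are illustrative, and the cycle and load-balance checks are only indicated in comments.

#include <cstdio>
#include <queue>
#include <vector>

// Sketch of the fine-grained adjustment priority described above: each move
// candidate weighs the traffic toward a target partition against the traffic
// kept inside the source partition (unit/partition data illustrative).
struct MoveCandidate {
    int unit;                // boundary computing unit
    int objPartition;        // candidate target partition
    double externalData;     // traffic to computing units in objPartition
    double internalData;     // traffic to units remaining in srcPartition
    double weight() const { return externalData - internalData; }
};

int main() {
    auto cmp = [](const MoveCandidate& a, const MoveCandidate& b) {
        return a.weight() < b.weight();
    };
    std::priority_queue<MoveCandidate, std::vector<MoveCandidate>, decltype(cmp)> pq(cmp);
    pq.push({0, 1, 9.0, 4.0});   // moving unit 0 to partition 1 saves 5.0
    pq.push({2, 1, 3.0, 6.0});   // moving unit 2 would cost more than it saves

    while (!pq.empty()) {
        MoveCandidate m = pq.top(); pq.pop();
        if (m.weight() <= 0) break;          // no profitable move remains
        // A full implementation would also reject moves that introduce a
        // cycle between partitions or destroy the overall load balance.
        std::printf("move unit %d to partition %d (gain %.1f)\n",
                    m.unit, m.objPartition, m.weight());
    }
}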
(1.2) Thread-level task partitioning. Thread-level task partitioning determines the mapping between the processing cores of a cluster node and the computing units on that node. Task execution within a node uses synchronous pipeline scheduling, and thread-level task partitioning uses an allocation strategy whose goal is load balancing while minimizing synchronization overhead. The main factors considered when partitioning across threads are load balancing and locality. The thread-level task partitioning step is specifically: first, the computing units inside each cluster node are initially partitioned with a multilevel K-way graph partitioning algorithm; secondly, heavily loaded computing units are split with the replication splitting algorithm, reducing the granularity of the computing units. Fig. 2 shows the flow chart of the replication splitting algorithm applied to a data flow program inside a multi-core cluster node; a sketch of its core loop is given after this paragraph. The steps of the algorithm are as follows: taking the result of the K-way graph partitioning of the previous stage as input, compute the computational load of each partition in turn and sort the partitions by load; find the partition number MaxPartition with the maximum load that contains a splittable actor (basic computing unit) and its load maxWeight; then find the partition number MinPartition with the minimum load and its load minWeight; then evaluate the inequality maxWeight < minWeight * balanceFactor (balanceFactor is the balance factor). If the result is true, the algorithm terminates; if false, continue: find the splittable actor with the maximum workload in MaxPartition, compute the splitting factor repFactor of that actor with repFactor = Max(repFactor, 2), then horizontally split the actor into repFactor parts, place one part in MinPartition and the remaining repFactor - 1 parts in MaxPartition, and remove the already-split actor from MaxPartition; then return to the start of the procedure (computing and sorting the load of each partition) and loop until the exit condition is satisfied. Finally, the multilevel K-way graph partitioning algorithm is applied again to the split graph, ensuring load balance and good locality across the processing cores.
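For illustration, the following simplified C++ sketch captures the core loop of the replication splitting algorithm described above. Actor splittability and the repFactor heuristic are assumptions made for the sketch; the patent only fixes repFactor = Max(repFactor, 2).

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified sketch of the replication-splitting loop described above.
struct Actor { double load; bool splittable; };
using Partition = std::vector<Actor>;

double loadOf(const Partition& p) {
    double s = 0;
    for (const Actor& a : p) s += a.load;
    return s;
}

void replicationSplit(std::vector<Partition>& parts, double balanceFactor) {
    for (;;) {
        // Find the most and least loaded partitions.
        auto maxIt = std::max_element(parts.begin(), parts.end(),
            [](const Partition& a, const Partition& b) { return loadOf(a) < loadOf(b); });
        auto minIt = std::min_element(parts.begin(), parts.end(),
            [](const Partition& a, const Partition& b) { return loadOf(a) < loadOf(b); });
        double maxWeight = loadOf(*maxIt), minWeight = loadOf(*minIt);
        if (maxWeight < minWeight * balanceFactor) return;  // balanced: exit condition

        // Find the heaviest splittable actor in the most loaded partition.
        auto act = std::max_element(maxIt->begin(), maxIt->end(),
            [](const Actor& a, const Actor& b) {
                if (a.splittable != b.splittable) return !a.splittable;
                return a.load < b.load;
            });
        if (act == maxIt->end() || !act->splittable) return;

        // Assumed repFactor heuristic; the patent only requires repFactor >= 2.
        int repFactor = std::max(2, (int)(maxWeight / minWeight));
        Actor piece{act->load / repFactor, true};
        maxIt->erase(act);                                  // remove the split actor
        minIt->push_back(piece);                            // one part to MinPartition
        for (int i = 0; i < repFactor - 1; ++i) maxIt->push_back(piece);
    }
}

int main() {
    std::vector<Partition> parts = {{{8.0, true}}, {{2.0, false}}};
    replicationSplit(parts, 1.5);
    std::printf("loads: %.1f %.1f\n", loadOf(parts[0]), loadOf(parts[1]));
}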
(2) Hierarchical pipeline scheduling step: constructing, from the task partitioning result, the inter-node pipeline schedule between cluster nodes and the inter-core pipeline schedule within each cluster node
Based mainly on the task partitioning result of step (1), this step determines the pipelined execution of the process-level and thread-level tasks so that the execution delay of the program is as small as possible. It includes two steps: asynchronous pipeline scheduling between cluster nodes and synchronous software pipeline scheduling between the cores inside a cluster node. The synchronous pipeline uses one global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously, and every execution stage has an equal execution delay. The asynchronous software pipeline executes its subtasks in a data-driven manner: when the data produced by one subtask's execution has been sent to another subtask that depends on it, the receiving subtask can start to execute once the data is received and its other conditions are met. In the asynchronous pipeline, the execution of the whole pipeline needs no global synchronization, and computation is separated from communication. To balance computation time against data transmission time, data transfers between the subtasks of the asynchronous pipeline generally use a block transmission mechanism: a message transmission is triggered as soon as the communication buffer between tasks is filled, without waiting for the subtask to finish executing its current stage before the data can be transmitted. The specific steps are as follows:
(2.1) Asynchronous pipeline scheduling between cluster nodes
Process-level partitioning assigns subtasks to nodes and, at the same time, determines the dependences between the subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution satisfies the data-driven property, and the execution between subtasks follows the producer-consumer pattern. Fig. 3 shows a schematic of the execution of a data flow program on a cluster composed of 3 machines. The figure contains 3 multi-core machines, corresponding to the compiler dividing the data flow program, through process-level task partitioning, into three subtasks I, II and III. The execution of the actors inside a machine depends on the parallel architecture and the intra-machine scheduling mode; inside a node on a shared-memory multi-core platform, synchronous pipeline scheduling is used. To amortize the per-unit-data transmission overhead between nodes, the asynchronous pipeline uses block communication between nodes: when the producer fills a communication block, the message passing mechanism is triggered, and the consumer starts to execute after receiving the message. Taking I and II in Fig. 3 as an example, after actor C executes for some time, the communication buffer between actor C and actor F fills up and C sends its data to F; F starts to execute after receiving the data produced by C, while C can continue to execute and generate new data. The asynchronous pipelined execution mode thus carries the execution of the data flow program on the cluster.
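For illustration, the following minimal MPI sketch in C++ shows the block communication described above between a producer actor and a consumer actor on two ranks; the block size and the roles of the ranks are illustrative assumptions.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int BLOCK  = 1024;   // communication block size (items), illustrative
    const int BLOCKS = 4;      // number of blocks in this toy run
    int buf[BLOCK];

    if (rank == 0) {           // producer actor: fill a block, then send it
        for (int b = 0; b < BLOCKS; ++b) {
            for (int i = 0; i < BLOCK; ++i) buf[i] = b * BLOCK + i;  // "compute"
            MPI_Send(buf, BLOCK, MPI_INT, 1, 0, MPI_COMM_WORLD);     // block full: transmit
        }
    } else if (rank == 1) {    // consumer actor: starts work once a block arrives
        for (int b = 0; b < BLOCKS; ++b) {
            MPI_Recv(buf, BLOCK, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            long sum = 0;
            for (int i = 0; i < BLOCK; ++i) sum += buf[i];           // consume the block
            std::printf("rank 1 consumed block %d, sum %ld\n", b, sum);
        }
    }
    MPI_Finalize();
    return 0;
}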
(2.2) Synchronous pipeline scheduling inside a cluster node
Thread-level synchronous pipeline scheduling includes two steps: stage assignment and construction of the pipeline schedule table. After thread-level task partitioning is completed, stage assignment is performed and the synchronous software pipeline is built. The specific steps are as follows:
(2.2.1) Stage assignment. First, the computing nodes in the data flow graph inside a cluster node are topologically sorted, forming a topological sequence. Secondly, the stage number of each computing node in the topological sequence is initialized to 0. Then, for each node, judge whether it is on the same cluster node as its predecessor node: if it is on the same cluster node, judge whether it is on the same processing core as the predecessor; if it is on the same processing core, its stage number is the same as the predecessor's; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor; if it is not on the same cluster node, its stage number is unrelated to that predecessor. By traversing the topological sequence of computing nodes, stage numbers are assigned to all nodes.
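For illustration, the following C++ sketch implements the stage assignment rule just described, assuming the nodes are already given in topological order; the node, core and cluster layout in main() reproduces the example of Fig. 4 (a).

#include <algorithm>
#include <cstdio>
#include <vector>

// Sketch of the stage assignment described above, assuming the nodes are
// already in topological order (node/core/cluster layout illustrative).
struct Node {
    int core;                  // processing core hosting the node
    int cluster;               // cluster node hosting the core
    std::vector<int> preds;    // indices of predecessor nodes
    int stage = 0;
};

void assignStages(std::vector<Node>& topo) {
    for (size_t i = 0; i < topo.size(); ++i) {
        Node& n = topo[i];
        n.stage = 0;
        for (int p : n.preds) {
            if (topo[p].cluster != n.cluster) continue;    // other cluster node: ignore
            int s = (topo[p].core == n.core)
                        ? topo[p].stage                    // same core: same stage
                        : topo[p].stage + 1;               // other core: one stage later
            n.stage = std::max(n.stage, s);                // respect all predecessors
        }
    }
}

int main() {
    // P,Q,S on core 0; R,T on core 1; U,V on core 2 (as in Fig. 4(a)).
    std::vector<Node> g = {
        {0, 0, {}},      // P
        {0, 0, {0}},     // Q
        {1, 0, {1}},     // R
        {0, 0, {2}},     // S
        {1, 0, {3}},     // T
        {2, 0, {4}},     // U
        {2, 0, {4}},     // V
    };
    assignStages(g);
    for (const Node& n : g) std::printf("%d ", n.stage);   // prints 0 0 1 2 3 4 4
}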
(2.2.2) Construct the pipeline schedule table. The results of task partitioning and stage assignment are assembled into the synchronous pipeline schedule table. As shown in Fig. 4 (a), the abscissa represents the resources, i.e., the processing cores, and the ordinate represents the stage numbers. In Fig. 4 (a), P, Q and S are assigned to the same core Core0, R and T to the same core Core1, and U and V to the same core Core2. P is the start node with stage number 0; Q is on the same core as its parent node P, so its stage number is also 0; R's stage number is 1, S's stage number is 2, T's stage number is 3, and the stage number of U and V is 4. Fig. 4 (b) shows the software pipeline execution process, which goes through the pipeline filling stage, the full stage and the draining stage.
(3) Cache optimization step: optimizing according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors
Because multiple threads share cached data and the cache uses the cache line as its storage unit, false sharing (False Sharing) occurs when variables that are modified independently by different threads share the same cache line, affecting the execution performance of the program. This step mainly optimizes the false sharing present in cache accesses in two respects:
(3.1) Cache line filling eliminates the false sharing produced by pipeline stage synchronization. With the line filling (padding) mechanism, the variables of different threads no longer share the same cache line, eliminating the false sharing.
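For illustration, the following C++ sketch shows the line filling (padding) idea, assuming the 64-byte cache line size typical of x86 processors: each per-thread variable is padded to its own cache line so that writes by different threads never touch the same line.

#include <cstdio>

// Sketch of the cache line filling (padding) idea described above: per-thread
// counters are padded to the (assumed) 64-byte x86 cache line size so that two
// threads never write to the same line.
constexpr int CACHE_LINE = 64;

struct PaddedCounter {
    alignas(CACHE_LINE) long value;   // each counter occupies its own cache line
};

PaddedCounter counters[4];            // one slot per pipeline-stage thread

int main() {
    // sizeof(PaddedCounter) is a multiple of 64, so counters[i] and
    // counters[i+1] are guaranteed to reside on different cache lines.
    std::printf("sizeof(PaddedCounter) = %zu\n", sizeof(PaddedCounter));
}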
(3.2) The steady-state expansion technique eliminates the false sharing produced by data transfers between computing units. In the producer-consumer chain shown in Fig. 5 (a), when P and C execute in parallel on different cores, false sharing also occurs when the memory accessed by P and the memory accessed by C fall on the same cache line, as shown in Fig. 5 (b). A complex data flow graph can contain many inter-core communication edges; if the cache line filling mechanism were still used for them, a large amount of space would inevitably be wasted, reducing space utilization and causing higher communication delay. To eliminate the false sharing of the communication buffers while keeping cache utilization as high as possible, the steady-state expansion technique is employed. Fig. 5 (c) shows the use of the cache after the false sharing has been eliminated. The steady-state expansion algorithm uses a greedy idea: first compute, for one steady-state execution of the data flow program, the expansion coefficient that each related computing unit should use to eliminate the false sharing on all of its output edges; then, among all these expansion coefficients, find the largest coefficient such that, after expansion, no computing unit overflows the L1 data cache during execution, and use it as the final expansion coefficient. To let the cache play its role better, when searching for the expansion coefficient it is not necessary that no computing unit ever overflows the data cache; following the "90/10 rule", it is acceptable that 10% of the computing units overflow the L1 data cache during execution as long as they do not overflow the L2/L3 cache, which still yields good performance.
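For illustration, the following C++ sketch shows a greedy search for the final expansion coefficient in the spirit of the algorithm described above; the candidate coefficients and per-execution footprints are illustrative inputs, and the L2/L3 check of the "90/10 rule" is elided.

#include <algorithm>
#include <cstdio>
#include <vector>

// Sketch of the greedy expansion-coefficient search described above.
// Per-unit candidate coefficients and footprints are illustrative inputs;
// the patent derives them from one steady-state execution of the program.
struct Unit {
    int    candidateCoeff;   // coefficient that removes false sharing on its outputs
    size_t bytesPerExec;     // working-set bytes for one execution
};

int chooseCoefficient(const std::vector<Unit>& units,
                      size_t l1Bytes, double allowedOverflow /* e.g. 0.10 */) {
    std::vector<int> cands;
    for (const Unit& u : units) cands.push_back(u.candidateCoeff);
    std::sort(cands.rbegin(), cands.rend());           // try the largest coefficient first

    for (int c : cands) {
        size_t overflowing = 0;
        for (const Unit& u : units)
            if (u.bytesPerExec * c > l1Bytes) ++overflowing;
        // Accept the largest coefficient that keeps L1 overflows within the
        // tolerated fraction ("90/10 rule"); L2/L3 checks omitted in this sketch.
        if (overflowing <= allowedOverflow * units.size()) return c;
    }
    return 1;                                          // no expansion possible
}

int main() {
    std::vector<Unit> units = {{8, 2048}, {4, 8192}, {6, 1024}};
    int c = chooseCoefficient(units, 32 * 1024, 0.10);
    std::printf("chosen expansion coefficient: %d\n", c);
}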
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (7)

1. A data flow compilation optimization method for multi-core clusters, characterized in that it comprises the following steps:

a task partitioning and scheduling step of determining the mapping of computing tasks to multi-core cluster nodes and processing cores, wherein the task partitioning is: first, performing process-level task partitioning on the synchronous data flow graph to determine the cluster node to which each computing task is assigned; secondly, performing thread-level task partitioning on the tasks of the synchronous data flow graph within each cluster node to determine the processing core, in the corresponding cluster node, to which each computing task is assigned;

a hierarchical pipeline scheduling step of constructing, from the task partitioning result, an inter-node pipeline schedule between cluster nodes and an inter-core pipeline schedule within each cluster node, wherein the hierarchical pipeline scheduling step specifically is: first, scheduling between cluster nodes with an asynchronous pipeline; secondly, scheduling inside each cluster node with a synchronous pipeline;

a cache optimization step performed according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors, wherein the detailed process of the cache optimization step is: first, eliminating, with the cache line filling mechanism, the false sharing caused by the synchronization between the stages of the synchronous software pipeline within a cluster node; secondly, eliminating, with the steady-state expansion technique, the false sharing caused by data transfers between computing units.
2. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the task partitioning is converted into a graph partitioning problem and, according to the different goals of process-level task partitioning and thread-level task partitioning, is solved with the Group partitioning strategy and the replication splitting strategy, respectively.
3. The data flow compilation optimization method for multi-core clusters according to claim 2, characterized in that the process-level task partitioning using the Group partitioning strategy specifically comprises:

a preprocessing stage, in which multiple computing units are fused into one group, reducing the number of communication edges between single computing units in the group and computing units in other groups;

a coarsening stage, in which multiple adjacent groups are fused into one;

an initial partitioning stage, in which the groups are mapped onto cluster nodes, thereby determining the mapping between computing nodes and cluster nodes;

a fine-grained adjustment stage, in which the boundary nodes of each partition produced by the initial partitioning are tuned to reduce communication overhead.
4. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the thread-level task partitioning step specifically is:

first, performing an initial partitioning of the computing units inside each cluster node with a multilevel K-way graph partitioning algorithm;

secondly, splitting heavily loaded computing units with the replication splitting algorithm to reduce the granularity of the computing units;

finally, applying the multilevel K-way graph partitioning algorithm again to the split graph, ensuring load balance and good locality across the processing cores.
5. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the asynchronous pipeline scheduling randomly assigns the results of the process-level partitioning to the nodes of the cluster using the producer-consumer model.
6. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:

first, performing a topological sort of the computing nodes in the data flow graph inside a process to form a topological sequence;

secondly, initializing the stage number of each computing node in the topological sequence to 0; then judging whether the node is on the same cluster node as its predecessor node: if it is on the same cluster node, judging whether it is on the same processing core as the predecessor node; if it is on the same processing core, its stage number is the same as that of the predecessor node; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor node; if it is not on the same cluster node, its stage number is unrelated to that predecessor node; and, by traversing the topological sequence of computing nodes, assigning stage numbers to all nodes and constructing the synchronous pipeline schedule table inside the cluster node.
7. The data flow compilation optimization method for multi-core clusters according to claim 5, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:

first, performing a topological sort of the computing nodes in the data flow graph inside a process to form a topological sequence;

secondly, initializing the stage number of each computing node in the topological sequence to 0; then judging whether the node is on the same cluster node as its predecessor node: if it is on the same cluster node, judging whether it is on the same processing core as the predecessor node; if it is on the same processing core, its stage number is the same as that of the predecessor node; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor node; if it is not on the same cluster node, its stage number is unrelated to that predecessor node; and, by traversing the topological sequence of computing nodes, assigning stage numbers to all nodes and constructing the synchronous pipeline schedule table inside the cluster node.
CN201410185945.5A 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters Active CN103970580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters

Publications (2)

Publication Number Publication Date
CN103970580A CN103970580A (en) 2014-08-06
CN103970580B true CN103970580B (en) 2017-09-15

Family

ID=51240117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185945.5A Active CN103970580B (en) Data flow compilation optimization method for multi-core clusters

Country Status (1)

Country Link
CN (1) CN103970580B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626295B2 (en) * 2015-07-23 2017-04-18 Qualcomm Incorporated Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring
KR20170047957A (en) * 2015-10-26 2017-05-08 삼성전자주식회사 Method for operating semiconductor device and semiconductor system
CN105242909B * 2015-11-24 2017-08-11 无锡江南计算技术研究所 Many-core loop blocking method based on multi-version code generation
CN105892996A * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Pipelined working method and apparatus for batch data processing
CN106909343B * 2017-02-23 2019-01-29 北京中科睿芯科技有限公司 Instruction dispatching method and device based on data flow
CN107179956B (en) * 2017-05-17 2020-05-19 北京计算机技术及应用研究所 Reliable communication method among cores of layered multi-core processor
CN107391136B (en) * 2017-07-21 2020-11-06 众安信息技术服务有限公司 Programming system and method based on stream
CN114880133A (en) * 2017-08-31 2022-08-09 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
CN111090464B (en) * 2018-10-23 2023-09-22 华为技术有限公司 Data stream processing method and related equipment
CN109857562A * 2019-02-13 2019-06-07 北京理工大学 Method for optimizing memory access distance on a many-core processor
CN109815617A * 2019-02-15 2019-05-28 湖南高至科技有限公司 Simulation model driving method
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN111367665B (en) * 2020-02-28 2020-12-18 清华大学 Parallel communication route establishing method and system
CN111817894B (en) * 2020-07-13 2022-12-30 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918B (en) * 2020-07-28 2021-05-18 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585B (en) * 2020-12-16 2022-07-29 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113254021B (en) * 2021-04-16 2022-04-29 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406B (en) * 2022-05-18 2024-02-20 安元科技股份有限公司 Distributed compiling and packing system and method based on Docker
CN115617917B (en) * 2022-12-16 2023-03-10 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2179356A1 (en) * 2007-08-16 2010-04-28 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
COStream: a data-flow-oriented programming language and compiler implementation; 张维维 et al.; Chinese Journal of Computers; 31 Oct. 2013; Vol. 36, No. 10; pp. 1993-2006 *

Also Published As

Publication number Publication date
CN103970580A (en) 2014-08-06

Similar Documents

Publication Publication Date Title
CN103970580B Data flow compilation optimization method for multi-core clusters
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
Xie et al. Sync or async: Time to fuse for distributed graph-parallel computation
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
CN107329828A Data flow programming method and system for CPU/GPU heterogeneous clusters
CN103970602B Data flow program scheduling method for x86 multi-core processors
CN105094751B Memory management method for parallel processing of stream data
CN110753107B (en) Resource scheduling system, method and storage medium under space-based cloud computing architecture
CN102541640A (en) Cluster GPU (graphic processing unit) resource scheduling system and method
CN102855153B Stream compilation optimization method for chip multi-core processors
Gent et al. A preliminary review of literature on parallel constraint solving
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN116401055B Resource-efficiency-oriented serverless computing workflow orchestration method
CN107247628A Data flow program task partitioning and scheduling method for multi-core systems
CN1326567A (en) Job-parallel processor
CN111404818B (en) Routing protocol optimization method for general multi-core network processor
CN112114951A (en) Bottom-up distributed scheduling system and method
CN116996941A Computing power offloading method, device and system based on cloud-edge-end collaboration in distribution networks
CN107133099B Cloud computing method
Xu et al. Parallel artificial bee colony algorithm for the traveling salesman problem
CN108205465A Dynamic task scheduling method and device for streaming applications
Li et al. HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning
Melot Algorithms and framework for energy efficient parallel stream computing on many-core architectures
Das Algorithmic Foundation of Parallel Paging and Scheduling under Memory Constraints
CN110262896A Data processing acceleration method for the Spark system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant