CN103970580B - Data flow compilation optimization method for multi-core clusters - Google Patents
- Publication number: CN103970580B (application CN201410185945.5A)
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a data flow compilation optimization method for multi-core cluster systems, comprising: a task partitioning and scheduling step, which determines the mapping of computing tasks to cluster nodes and processing cores; a hierarchical pipeline scheduling step, which constructs the inter-node and intra-node pipeline schedules from the task partitioning result; and a cache optimization step, performed according to the architectural characteristics of the multi-core processors, the communication between cluster nodes, and the execution behavior of the data flow program on the multi-core processors. The method combines data flow program optimization techniques with the system architecture, fully exploiting load balance and the high concurrency of mixed synchronous/asynchronous pipelined code on multi-core clusters; it further optimizes the program's cache accesses and communication for the caching and communication mechanisms of the multi-core cluster, further improving execution performance and shortening execution time.
Description
Technical field
The invention belongs to the field of compilation technology, and more particularly relates to a data flow compilation optimization method for multi-core clusters.
Background technology
With the development of semiconductor technology, multi-core processors have proven to be a feasible platform for exploiting parallelism. Multi-core cluster parallel systems, with their powerful computing capability and good scalability, have become an important parallel computing platform design. While multi-core cluster systems provide powerful computing power, they also place a heavier burden on compilers and programmers to effectively exploit coarse-grained parallelism across cores. Data flow programming provides a feasible way to exploit the parallelism of multi-core architectures. In this model, each node represents a computing task and each edge represents the data flow between computing tasks. Each computing task is an independent computing unit with its own instruction stream and address space, and data between computing tasks flows through first-in-first-out (FIFO) communication queues. The data flow programming model is based on the data flow model and is realized through data flow programming languages. Data flow compilation refers to the compilation techniques that convert a data flow programming language into an executable program for the underlying target. Among them, compilation optimization plays a decisive role in the runtime performance of the data flow program on the target processing cores.
The Massachusetts Institute of Technology compiler laboratory has disclosed a stream programming language, StreamIt. The language is based on Java, extends Java with streaming constructs, and introduces the concept of a Filter. A Filter is the most basic computing unit: a single-input, single-output program block. The processing performed by each Filter is described by a work function, and work functions communicate with one another in FIFO fashion through push, pop and peek operations. A stream compilation optimization technique has also been proposed for the next-generation high-performance computer Raw: first, the compiler splits and fuses computing nodes, using a combination of data splitting and fusion, to increase the ratio of computation to communication overhead; the processed computing nodes are then mapped onto the processing cores to achieve load balance, each processing core executes in pipelined fashion, and explicit inter-core communication is used to realize data transfer.
StreamIt's stream optimization offers one solution to the scheduling problem of the stream programming model on multi-core processors. By distributing computing tasks to the processing cores, it achieves load balance and ensures the parallel execution of computing tasks on the cores. However, it has the following defects: (1) the computation and communication scheduled on each processing core are separated; each occupies its own pipeline stage and is assigned independent communication time, which increases communication overhead; (2) it does not consider the low-level memory allocation and communication optimization problems of the processing cores; (3) the compilation optimization method is not tuned to the architectural characteristics of the underlying multi-core cluster system. In short, a multi-core cluster system, while providing powerful computing capability, also exposes its hierarchical storage organization and software communication mechanisms to the programmer. Existing stream compilation optimization methods do not take the underlying architecture into account and do not make full use of system hardware resources, such as storage resources, to improve program execution efficiency.
Summary of the invention
It is an object of the invention to provide a data flow compilation optimization method for multi-core clusters which, for the architecture of a multi-core cluster system, optimizes the data flow program and largely improves its execution performance.
The optimization method of the invention takes as input the intermediate representation produced by the data flow compiler front end, the synchronous data flow graph, and applies to it, in turn, three levels of processing: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally producing executable code. The specific steps are as follows:
(1) Task partitioning and scheduling step, which determines the mapping of computing tasks to multi-core cluster nodes and processing cores.
A node in the data flow graph represents a computing task, and an edge represents communication between computing tasks. First, the synchronous data flow graph is partitioned into process-level tasks according to the number of nodes in the cluster. This sub-step uses a group task partitioning strategy whose goal is to minimize inter-node communication overhead and maximize program execution performance; the partitioning must consider both load balance and the minimization of communication overhead, assigning each computing task to a corresponding cluster node. Secondly, according to the computing tasks on each cluster node, thread-level task partitioning is performed over the processing cores of the cluster node. This sub-step uses a replication splitting algorithm that splits heavily loaded computing tasks; its goal is to achieve load balance across the processing cores within a cluster node.
(2) Hierarchical pipeline scheduling step, which constructs the inter-node and intra-node pipeline schedules from the task partitioning result.
A synchronous pipeline uses a global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously; an asynchronous software pipeline executes its subtasks in a data-driven manner. First, asynchronous pipeline scheduling is applied to the synchronous data flow graph to determine the task execution process between cluster nodes; this step randomly maps the computing tasks of each process onto the cluster nodes as a whole, completing the mapping of processes to cluster nodes. Secondly, according to the dependences between computing tasks within a cluster node, each computing task (node) is assigned its stage number in the pipeline, completing the construction of the synchronous pipeline. Finally, using both pieces of information, the pipeline schedule table is constructed.
(3) Cache optimization step, performed according to the architectural characteristics of the multi-core processors, the communication between cluster nodes, and the execution behavior of the data flow program on the multi-core processors.
When a computing task (node) executes, false sharing may occur in the use of the cache by the processing core on which the task resides, which has a large impact on execution performance. For general x86 multi-core processors, the method combines a cache line padding mechanism with a steady-state expansion technique to eliminate the false sharing that arises during program execution, optimizing the use of the cache.
The invention integrates data flow scheduling optimization with the structure of the multi-core cluster system and realizes a three-level optimization process for the data flow program, specifically comprising task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, improving the execution performance of the data flow program on the target platform. Specifically, the invention has the following advantages:
(1) It improves the concurrency of the program. Through a formal description of the problem, the invention abstracts the scheduling of the data flow graph onto the processing cores of the multi-core cluster system as a partitioning problem, and thereby constructs a hierarchical pipeline scheduling model for the data flow program. Tasks are mapped onto the processing cores in a way that achieves low communication overhead and load balance, improving the concurrency of the program.
(2) It reduces overhead. The invention proposes a hierarchical pipeline scheduling model mixing synchronous and asynchronous pipelines that makes full use of the computing and communication resources of the system; at the same time, the use of the caches inside the cluster nodes is optimized, improving the locality of data accesses and cache utilization and enhancing the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a structural framework diagram of the method of the invention within the data flow compilation system;
Fig. 2 is a flow chart of the replication splitting algorithm inside a cluster node for a data flow program in an embodiment of the invention;
Fig. 3 is an example diagram of the asynchronous pipelined execution of a data flow program on a cluster in an embodiment of the invention;
Fig. 4(a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the invention;
Fig. 4(b) is an example diagram of the software pipeline execution process corresponding to Fig. 4(a);
Fig. 5(a) is a schematic diagram of task execution in which the steady-state expansion technique eliminates false sharing, for a data flow program in an embodiment of the invention;
Fig. 5(b) is a schematic diagram of the tasks in Fig. 5(a) before false sharing is eliminated;
Fig. 5(c) is a schematic diagram of the tasks in Fig. 5(a) after false sharing is eliminated.
Embodiment
In order to make the objects, technical solutions and advantages of the invention clearer, the invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below may be combined with each other as long as they do not conflict.
Fig. 1 shows the structural framework of this embodiment within the stream compilation system. After a data flow program is parsed by the data flow compiler front end, an intermediate representation is generated, the synchronous data flow graph (Synchronous Data Flow, SDF), which then passes in turn through the three-level optimization process of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, finally producing object code encapsulated with the Message Passing Interface (MPI) and completing the compilation.
(1) Task partitioning and scheduling step, which determines the mapping of computing tasks to multi-core cluster nodes and processing cores.
This step includes two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, so inter-node communication cost is large, whereas communication within a node is intra-machine communication with small cost. Task partitioning of the data flow program must therefore distinguish between inter-node and intra-node (inter-core) communication. Task partitioning at the different levels of the cluster is described as follows: process-level task partitioning minimizes inter-node communication overhead on the premise of load balance between nodes, and no cycles may appear between the partitions; thread-level task partitioning minimizes synchronization overhead on the premise of load balance within a node, and preserves data locality as far as possible. The specific steps are as follows:
(1.1) Process-level task partitioning. Process-level task partitioning determines the mapping between computing units and cluster nodes. In order to amortize the per-unit-data communication overhead of the data flow program during execution, inter-process data communication uses a block communication mechanism: a message transfer is triggered only when the buffer is filled or when the buffer is forcibly flushed. To prevent deadlock during program execution, process-level partitioning must avoid cycles in the data dependences between partitions. For the process-level task partitioning of the synchronous data flow graph on a multi-core cluster, a group task partitioning strategy is proposed, realized with a greedy algorithm. Group task partitioning introduces the group structure; a group represents a set of one or more computing units of the synchronous data flow graph. Initially, each computing unit of the synchronous data flow graph is treated as a group, and the dependences between groups are consistent with those between computing units. Group task partitioning mainly consists of four stages:
(1.1.1) Pre-processing stage. This stage is designed for multi-input, multi-output computing units in the synchronous data flow graph: it fuses several computing units into one group, reducing the number of communication edges between a single computing unit in the group and the computing units in other groups.
(1.1.2) Group coarsening stage. This stage coarsens the pre-processed group graph by fusing several adjacent groups into one, while avoiding the creation of cycles in the coarsened group graph. The income produced by fusing a pair of groups is called the coarsening income; it is computed from the following quantities (the formula itself appears only as an image in the original publication): workload(srcGroup) and workload(snkGroup) denote the respective loads of srcGroup and snkGroup, and comm(srcGroup, snkGroup) denotes the communication overhead between srcGroup and snkGroup, which includes both data sending and data receiving.
The coarsening stage uses a greedy heuristic. First, the coarsening incomes of all adjacent groups are computed and stored in a priority queue; the pair of groups with maximum income is selected from the queue and fused. If the load of the new group formed by the fusion does not exceed the theoretical mean load per partition, and the fusion introduces no cycle in the group graph, then the fusion is valid: the groups consumed by the fusion are deleted from the group graph, the new group obtained by the fusion is inserted into the graph, the dependences between groups are updated, and the incomes in the priority queue are updated according to the new group. This process is iterated. The termination condition of the algorithm is that no fusion of any pair of groups produces positive income, or that the number of groups in the group graph falls below a threshold.
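The coarsening loop above can be sketched as follows. Because the income formula is reproduced only as an image in the original, this sketch makes the simplifying assumption that the income of fusing a pair of groups equals comm(srcGroup, snkGroup), the communication the fusion eliminates; the cycle check and the group-count threshold are omitted, and all names are illustrative.

```python
def coarsen(loads, comm, mean_load):
    """Greedily fuse adjacent groups while the income is positive and
    the fused load does not exceed the theoretical per-partition mean.
    loads: {group: load}; comm: {(a, b): volume} with a < b."""
    loads = dict(loads)
    comm = dict(comm)
    while True:
        # candidate fusions whose merged load stays within the mean
        candidates = [(vol, pair) for pair, vol in comm.items()
                      if loads[pair[0]] + loads[pair[1]] <= mean_load]
        if not candidates:
            break
        vol, (a, b) = max(candidates)   # pair with maximum income
        if vol <= 0:
            break                        # no positive income remains
        loads[a] += loads.pop(b)         # fuse b into a
        del comm[(a, b)]
        for (x, y), v in list(comm.items()):
            if b in (x, y):              # redirect b's edges onto a
                other = y if x == b else x
                del comm[(x, y)]
                if other != a:
                    key = (min(a, other), max(a, other))
                    comm[key] = comm.get(key, 0) + v
    return loads
```

With groups 1, 2, 3 of loads 2, 2, 3 and edges (1,2): 5, (2,3): 1 under a mean load of 5, only the high-income pair (1, 2) is fused; fusing the result with group 3 would exceed the mean, so the loop stops.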
(1.1.3) Initial partitioning stage. This stage makes a preliminary decision on the mapping between the groups of the coarsened group graph and the cluster nodes. The initial partitioning balances the load of each partition while keeping the communication between partitions as small as possible. It uses a deadlock-prevention strategy, avoiding cycles in the partitioning result from the start. The coarsened group graph is a directed acyclic graph (Directed Acyclic Graph, DAG); a topological sort of the DAG yields a topological sequence from the partial order of its nodes. During initial partitioning, the groups of the group graph are examined one by one in topological order, and the specific partition number of each group is determined.
(1.1.4) Fine-grained adjustment stage. This stage takes the boundary computing units of the partitions, i.e., the computing units that communicate with computing units on other cluster nodes, and performs further tuning according to the communication situation, reducing inter-node communication overhead. For a boundary computing unit, the partition in which the unit resides is called its source partition (srcPartition), and a partition containing a computing unit with which it has a dependence is called a target partition (objPartition); a computing unit has exactly one srcPartition and possibly several objPartitions. The communication volume between the unit and the other computing units in srcPartition is internalData, and the communication volume between the unit and the computing units in the i-th objPartition is externalData[i]. During fine-grained adjustment a priority queue is maintained whose weights are externalData[i] - internalData. The adjustment repeatedly selects the maximum weight and processes it; whether a computing unit may be moved to an objPartition is decided by two factors: first, the move must not introduce a cycle between partitions; secondly, it must not destroy, beyond a certain degree, the load balance between the partitions. After a computing unit has been moved, the priority queue is updated according to the result of the adjustment, but a computing unit that has already been moved is not used again as an object of adjustment.
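A minimal sketch of one round of the fine-grained adjustment follows, assuming a simple balance test (a move is accepted only if the target partition stays within a `balance` factor of the mean load) in place of the unspecified criterion, and omitting the inter-partition cycle check; all names are illustrative.

```python
import heapq

def fine_tune(units, part_of, comm, balance=1.5):
    """units: {unit: load}; part_of: {unit: partition number};
    comm: {(u, v): volume}, undirected communication volumes."""
    part_of = dict(part_of)
    loads = {}
    for u, w in units.items():
        loads[part_of[u]] = loads.get(part_of[u], 0) + w
    mean = sum(units.values()) / len(loads)

    def weights(u):
        """(externalData[i] - internalData, objPartition i) pairs."""
        internal, external = 0, {}
        for (a, b), v in comm.items():
            if u not in (a, b):
                continue
            other = b if a == u else a
            p = part_of[other]
            if p == part_of[u]:
                internal += v
            else:
                external[p] = external.get(p, 0) + v
        return [(ext - internal, p) for p, ext in external.items()]

    heap = []
    for u in units:
        for w, p in weights(u):
            heapq.heappush(heap, (-w, u, p))   # max-heap via negation
    moved = set()
    while heap:
        negw, u, p = heapq.heappop(heap)
        if -negw <= 0:
            break                              # no positive gain left
        if u in moved or p == part_of[u]:
            continue
        # accept only if the partition loads stay balanced
        if loads[p] + units[u] <= balance * mean:
            loads[part_of[u]] -= units[u]
            loads[p] += units[u]
            part_of[u] = p
            moved.add(u)                       # never readjusted
            heap = [(-w2, v, q) for v in units if v not in moved
                    for w2, q in weights(v)]
            heapq.heapify(heap)
    return part_of
```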
(1.2) Thread-level task partitioning. Thread-level task partitioning determines the mapping between the processing cores of a cluster node and the computing units on that node. Task execution within a node uses synchronous pipeline scheduling, and thread-level task partitioning uses an allocation strategy whose goal is load balance while minimizing synchronization overhead. The factors mainly considered by the inter-thread partitioning are load balance and locality. The thread-level task partitioning steps are specifically: first, the computing units inside each cluster node are initially partitioned with a multilevel K-way graph partitioning algorithm; secondly, heavily loaded computing units are split with the replication splitting algorithm, reducing the granularity of the computing units; Fig. 2 shows the flow chart of the replication splitting algorithm inside a multi-core cluster node for a data flow program. The steps of the algorithm are as follows. With the result of the K-way graph partitioning of the previous stage as input, the computational load of each partition is computed and the partitions are sorted by load; the partition number MaxPartition with the largest load that contains a splittable actor (basic computing unit) and its load maxWeight are found, then the partition number MinPartition with the smallest load and its load minWeight are found, and the inequality maxWeight < minWeight * balanceFactor (balanceFactor is the balance factor) is evaluated. If the result is true, the algorithm terminates; if the result is false, the algorithm continues: it finds the splittable actor with the largest load in MaxPartition, computes the splitting fraction repFactor of that actor, sets repFactor = max(repFactor, 2), splits the actor horizontally into repFactor parts, puts one part into MinPartition and the remaining repFactor - 1 parts into MaxPartition, and removes the split actor from MaxPartition; the algorithm then returns to its beginning (computing and sorting the load of each partition) and loops until the exit condition is satisfied. Finally, the multilevel K-way graph partitioning algorithm is applied again to the split graph, ensuring load balance and good locality across the processing cores.
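The replication splitting loop can be sketched as below. The splitting fraction is specified only up to repFactor = max(repFactor, 2); this sketch assumes repFactor = max(2, ceil(maxWeight / minWeight)), and the iteration guard is illustrative.

```python
from math import ceil

def replication_split(partitions, balance_factor=1.5):
    """Each partition is a list of (name, load, splittable) actors.
    Splits the heaviest splittable actor of the most loaded splittable
    partition until maxWeight < minWeight * balanceFactor holds."""
    partitions = [list(p) for p in partitions]
    for _ in range(100):                  # guard against non-termination
        loads = [sum(w for _, w, _ in p) for p in partitions]
        splittable = [i for i, p in enumerate(partitions)
                      if any(s for _, _, s in p)]
        if not splittable:
            break
        imax = max(splittable, key=lambda i: loads[i])
        imin = min(range(len(partitions)), key=lambda i: loads[i])
        max_w, min_w = loads[imax], loads[imin]
        if max_w < min_w * balance_factor:
            break                          # exit condition satisfied
        # heaviest splittable actor in MaxPartition
        name, w, _ = max((a for a in partitions[imax] if a[2]),
                         key=lambda a: a[1])
        rep = max(2, ceil(max_w / max(min_w, 1)))   # assumed formula
        partitions[imax].remove((name, w, True))
        part = (name, w / rep, True)
        partitions[imin].append(part)               # one part moves
        partitions[imax] += [part] * (rep - 1)      # the rest stay
    return partitions
```

Starting from one partition holding a splittable actor of load 8 and one holding an unsplittable actor of load 2, two rounds of splitting balance the loads at 5 and 5.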
(2) Hierarchical pipeline scheduling step, which constructs the inter-node and intra-node pipeline schedules from the task partitioning result.
Based mainly on the task partitioning result of step (1), this step determines the pipelined execution process of the process-level and thread-level tasks so that the program's execution delay is as small as possible. It includes two steps: asynchronous pipeline scheduling between cluster nodes, and synchronous software pipeline scheduling between the cores inside a cluster node. A synchronous pipeline uses a global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously, each execution stage having an equal execution delay. An asynchronous software pipeline executes its subtasks in a data-driven manner: when the data produced by one subtask has been sent to another subtask that depends on it, the receiving subtask may start executing once the data has arrived and its other conditions are met. In an asynchronous pipeline, the execution of the whole pipeline needs no global synchronization, and computation is separated from communication. To balance computation time against data transfer time, data transfer between the subtasks of the asynchronous pipeline generally uses a block transfer mechanism: a message transfer is triggered as soon as the communication buffer between tasks is filled, without waiting for the current iteration of the subtask to finish before data can be transferred. The specific steps are as follows.
(2.1) Asynchronous pipeline scheduling between cluster nodes.
Process-level partitioning assigns subtasks to nodes and at the same time determines the dependences between subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution satisfies the data-driven property, and execution between subtasks follows the producer-consumer pattern. Fig. 3 shows the execution of a data flow program on a cluster of three machines: the three multi-core machines respectively correspond to the three subtasks I, II and III into which the compiler has divided the data flow program by process-level task partitioning. The execution of actors inside a machine depends on the parallel architecture and the intra-machine scheduling mode; on a shared-memory multi-core platform, synchronous pipeline scheduling is used inside a node. To amortize the per-unit-data transfer overhead between nodes, the asynchronous pipeline uses block communication between nodes: when the producer fills a communication block, the message passing mechanism is triggered, and the consumer starts executing after receiving the message. Taking I and II in Fig. 3 as an example: after actor C has executed for some time, the communication buffer between C and actor F is filled and C sends its data to F; F starts executing after receiving the data produced by C, while C can continue executing and generate new data. The asynchronous pipelined execution mode thus ensures the execution of the data flow program on the cluster.
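The block communication mechanism can be illustrated with a minimal in-process sketch; a deque stands in for the MPI transport, and the class and its names are illustrative, not part of the patent.

```python
from collections import deque

class BlockChannel:
    """Producer-consumer edge with block communication: pushed items
    accumulate in a local buffer, and a message (the whole block) is
    handed to the consumer only when the buffer fills or is flushed
    explicitly, amortizing the per-item transfer overhead."""
    def __init__(self, block_size):
        self.block_size = block_size
        self.buffer = []
        self.inbox = deque()   # delivered blocks awaiting the consumer
        self.sends = 0         # number of triggered message transfers

    def push(self, item):       # producer side
        self.buffer.append(item)
        if len(self.buffer) == self.block_size:
            self.flush()        # buffer filled: automatic block send

    def flush(self):            # forced or automatic block send
        if self.buffer:
            self.inbox.append(self.buffer)
            self.buffer = []
            self.sends += 1

    def pop_block(self):        # consumer side
        return self.inbox.popleft() if self.inbox else None
```

With a block size of 4, pushing ten items triggers two automatic sends; a final forced flush delivers the remainder as a third message.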
(2.2) Synchronous pipeline scheduling inside a cluster node.
Thread-level synchronous pipeline scheduling includes two steps: stage assignment and construction of the pipeline schedule table. After thread-level task partitioning is completed, stage assignment is carried out and the synchronous software pipeline is built. The specific steps are as follows.
(2.2.1) Stage assignment. First, the computing nodes of the data flow graph inside the cluster node are topologically sorted to form a topological sequence. Secondly, the stage number of each computing node in the topological sequence is initialized to 0; then, for each node, it is judged whether it is on the same cluster node as its predecessor. If it is on the same cluster node, it is judged whether it is on the same processing core as its predecessor: if on the same processing core, its stage number equals that of the predecessor; if not on the same processing core, its stage number is larger by 1 than the stage number of the predecessor computing node. If it is not on the same cluster node, its stage number is unrelated to the predecessor computing node. By traversing the topological sequence of the computing nodes, all nodes are assigned stage numbers.
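The stage assignment rule can be sketched directly; where a node has several predecessors on the same cluster node, the sketch assumes the maximum of the computed stage numbers applies. The test reproduces the placement of Fig. 4(a).

```python
def assign_stages(order, preds, core_of, node_of):
    """Assign pipeline stage numbers by walking a topological order.
    order: computing nodes in topological sequence;
    preds: {node: [predecessors]}; core_of / node_of: placements."""
    stage = {}
    for v in order:
        stage[v] = 0
        for p in preds.get(v, []):
            if node_of[p] != node_of[v]:
                continue          # other cluster node: stage unrelated
            if core_of[p] == core_of[v]:
                stage[v] = max(stage[v], stage[p])      # same core
            else:
                stage[v] = max(stage[v], stage[p] + 1)  # next stage
    return stage
```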
(2.2.2) Construction of the pipeline schedule. The results of task partitioning and stage assignment are assembled into the synchronous pipeline schedule table. As shown in Fig. 4(a), the abscissa represents resources, i.e. processing cores, and the ordinate represents stage numbers. In Fig. 4(a), P, Q and S are assigned to the same core Core0, R and T to the same core Core1, and U and V to the same core Core2. P is the start node, so its stage number is 0; Q and its parent node P are on the same core, so its stage number is also 0; R's stage number is 1, S's stage number is 2, T's stage number is 3, and U's and V's stage numbers are 4. Fig. 4(b) shows the software pipeline execution process, which passes through the filling stage, the full stage and the draining stage of the software pipeline.
(3) Cache optimization step, performed according to the architectural characteristics of the multi-core processors, the communication between cluster nodes, and the execution behavior of the data flow program on the multi-core processors.
Because multiple threads share cached data and the cache uses the cache line as its storage unit, false sharing occurs when several threads modify mutually independent variables that happen to lie on the same cache line, affecting the execution performance of the program. This step optimizes for the false sharing present in cache accesses in two respects:
(3.1) Cache line padding eliminates the false sharing produced by the synchronization between pipeline stages. Through the padding mechanism, the variables of different threads do not share the same cache line, eliminating false sharing.
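The effect of cache line padding can be sketched as a layout computation: each thread's slot is rounded up to a whole number of cache lines, so no line is ever written by two threads. The 64-byte line size is a typical x86 value assumed here.

```python
CACHE_LINE = 64  # bytes; typical x86 cache line size, assumed here

def pad_to_line(nbytes, line=CACHE_LINE):
    """Round a slot size up to a whole number of cache lines."""
    return ((nbytes + line - 1) // line) * line

def padded_offsets(slot_sizes, line=CACHE_LINE):
    """Lay per-thread slots out so each starts on its own cache line:
    threads then never touch a line owned by another thread."""
    offsets, at = [], 0
    for n in slot_sizes:
        offsets.append(at)
        at += pad_to_line(n, line)
    return offsets, at   # slot offsets and the total padded size
```

Three threads needing 8, 8 and 40 bytes would share a single 64-byte line unpadded; padding places them at offsets 0, 64 and 128 at the cost of 192 bytes in total.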
(3.2) The steady-state expansion technique eliminates the false sharing produced by data transfer between computing units. In the producer-consumer chain shown in Fig. 5(a), if P and C execute in parallel on different cores, false sharing also occurs when the space accessed by P and that accessed by C lie on the same cache line, as shown in Fig. 5(b). A complex data flow graph may contain many inter-core communication edges; if the cache line padding mechanism were still used for them, a large amount of space would inevitably be wasted, reducing space utilization and producing higher communication delay. To eliminate the false sharing of communication buffers while keeping cache utilization as high as possible, the steady-state expansion technique is employed. Fig. 5(c) shows the use of the cache after false sharing has been eliminated. The steady-state expansion algorithm uses a greedy idea: first, for each computing unit, it computes the expansion coefficient that one steady-state execution of the data flow program requires to eliminate the false sharing on all of the unit's output edges; then, among all these expansion coefficients, it looks for the largest coefficient under which no computing unit, after expansion, overflows the L1 data cache during execution, and takes it as the final expansion coefficient. In order for the cache to be used to best effect, it is not necessary, when searching for the expansion coefficient, to require that no computing unit overflows the data cache; following the "90/10 principle", up to 10% of the computing units may be allowed to overflow the L1 data cache during execution, as long as they do not overflow the L2/L3 cache, and good performance can still be obtained.
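The greedy coefficient search with the 90/10 relaxation can be sketched as follows, under the assumption that each computing unit's buffer footprint grows linearly with the expansion coefficient; the candidate coefficients and cache sizes in the usage are illustrative.

```python
def expansion_coefficient(bytes_per_exec, candidates,
                          l1_bytes, l2_bytes, allow_frac=0.10):
    """Pick the largest candidate coefficient c such that, per the
    90/10 principle, at most `allow_frac` of the computing units
    overflow the L1 data cache at c executions per steady state, and
    no unit overflows L2/L3. bytes_per_exec: per-unit buffer bytes
    for one execution (assumed to scale linearly with c)."""
    best = 1
    for c in sorted(candidates):
        over_l1 = sum(1 for b in bytes_per_exec if c * b > l1_bytes)
        over_l2 = sum(1 for b in bytes_per_exec if c * b > l2_bytes)
        if over_l2 == 0 and over_l1 <= allow_frac * len(bytes_per_exec):
            best = c
    return best
```

With nine small units and one large one against a 32 KiB L1 and 256 KiB L2, the search accepts a coefficient that lets the single large unit (10% of the units) spill out of L1 but rejects any coefficient that would spill it out of L2.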
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall be included within the protection scope of the invention.
Claims (7)
1. A data flow compilation optimization method for multi-core clusters, characterized by comprising the following steps:
a task partitioning and scheduling step, which determines the mapping of computing tasks to multi-core cluster nodes and processing cores; said task partitioning is specifically: first, process-level task partitioning is performed on the synchronous data flow graph, determining the cluster node to which each computing task is assigned; secondly, thread-level task partitioning is performed on the tasks of the synchronous data flow graph within each cluster node, determining the processing core of the cluster node to which each computing task is assigned;
a hierarchical pipeline scheduling step, which constructs the inter-node and intra-node pipeline schedules from the task partitioning result; said hierarchical pipeline scheduling step is specifically: first, the cluster nodes are scheduled with an asynchronous pipeline; secondly, the interior of each cluster node is scheduled with a synchronous pipeline;
a cache optimization step, performed according to the architectural characteristics of the multi-core processors, the communication between cluster nodes, and the execution behavior of the data flow program on the multi-core processors; the detailed process of the cache optimization step is: first, a cache line padding mechanism eliminates the false sharing caused by the synchronization between the stages of the synchronous software pipeline within a cluster node; secondly, the steady-state expansion technique eliminates the false sharing caused by data transfer between computing units.
2. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the task partitioning is translated into a graph partitioning problem, which is solved respectively with the group partitioning strategy and the replication splitting strategy according to the different goals of process-level task partitioning and thread-level task partitioning.
3. The data flow compilation optimization method for multi-core clusters according to claim 2, characterized in that the process-level task partitioning using the group partitioning strategy comprises the following steps:
a preprocessing stage, in which multiple computing units are fused into a group, reducing the number of communication edges between each computing unit of the group and the computing units of other groups;
a coarsening stage, in which multiple adjacent groups are fused into one;
an initial partitioning stage, in which the groups are mapped onto the cluster nodes, thereby determining the mapping between computing nodes and cluster nodes;
a fine-grained refinement stage, in which the boundary nodes of each initial partition are tuned to reduce the communication overhead.
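The coarsening and initial-partitioning stages above can be sketched as follows. This is a toy illustration under stated assumptions: the function names, the heaviest-edge merge rule, and the greedy lightest-node placement are mine, and the fine-grained refinement stage is omitted for brevity.

```python
def coarsen(weights, edges, target):
    """Coarsening stage: repeatedly fuse the two groups joined by the
    heaviest communication edge until only `target` groups remain."""
    members = {v: [v] for v in weights}          # group id -> its computing units
    w = dict(weights)                            # group id -> total work
    e = {frozenset(k): c for k, c in edges.items()}
    while len(members) > target and e:
        a, b = sorted(max(e, key=e.get))         # heaviest communication edge
        members[a] += members.pop(b)             # fuse group b into group a
        w[a] += w.pop(b)
        merged = {}                              # re-wire b's edges onto a
        for pair, cost in e.items():
            q = frozenset(a if x == b else x for x in pair)
            if len(q) == 2:                      # drop the collapsed self-edge
                merged[q] = merged.get(q, 0) + cost
        e = merged
    return members, w

def initial_partition(w, nodes):
    """Initial partitioning stage: map groups to cluster nodes,
    always placing the next-heaviest group on the lightest node."""
    load = {n: 0 for n in range(nodes)}
    place = {}
    for g in sorted(w, key=w.get, reverse=True):
        n = min(load, key=load.get)
        place[g] = n
        load[n] += w[g]
    return place

groups, weights = coarsen({1: 3, 2: 3, 3: 2, 4: 2},
                          {(1, 2): 5, (2, 3): 1, (3, 4): 4}, target=2)
print(sorted(map(sorted, groups.values())))  # → [[1, 2], [3, 4]]
```

A production compiler would follow this with the boundary-node tuning of the refinement stage, moving groups across the cut when that lowers communication cost.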
4. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the thread-level task partitioning step is specifically:
first, an initial partition of the computing units inside each cluster node is produced with a multilevel K-way graph partitioning algorithm;
second, heavily loaded computing units are split with a replication-splitting algorithm, reducing the granularity of the computing units;
finally, the multilevel K-way graph partitioning algorithm is applied again to the split graph, ensuring load balance across the processing cores and good locality.
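The replication-splitting step can be sketched as follows. This is a minimal sketch, assuming a simple policy of my own choosing: any computing unit whose work exceeds the per-core average is split into equal replicas, so no single unit can dominate a core in the subsequent K-way partition. The function name and the `unit#i` replica naming are illustrative, not from the patent.

```python
import math

def replication_split(work, cores):
    """Split any computing unit heavier than the per-core average into
    replicas, so the later K-way partition can balance the load."""
    avg = sum(work.values()) / cores
    split = {}
    for unit, w in work.items():
        k = max(1, math.ceil(w / avg))      # number of replicas needed
        for i in range(k):                  # light units keep k == 1
            split[f"{unit}#{i}"] = w / k
    return split

# Unit A (work 9) dwarfs the per-core average of 4, so it is split in three:
print(replication_split({"A": 9, "B": 1, "C": 2}, cores=3))
# → {'A#0': 3.0, 'A#1': 3.0, 'A#2': 3.0, 'B#0': 1.0, 'C#0': 2.0}
```

After splitting, the largest task is no bigger than the average core load, which is exactly the precondition the second K-way partition needs to balance well.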
5. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the asynchronous pipeline scheduling randomly assigns the results of the process-level partition to the nodes of the cluster using a producer-consumer model.
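The producer-consumer distribution of partitions to cluster nodes can be sketched as follows. This is a toy model under stated assumptions: the shared work queue, the thread-per-node layout, the `distribute` function, and the poison-pill shutdown are all mine; real nodes would of course be separate machines, not threads.

```python
import queue
import threading

def distribute(partitions, node_count):
    """Scheduler (producer) pushes process-level partitions into a shared
    queue; one consumer thread per cluster node pulls whatever comes next,
    so faster nodes naturally take more work."""
    q = queue.Queue()
    assignment = {n: [] for n in range(node_count)}

    def node_worker(n):
        while True:
            p = q.get()
            if p is None:               # poison pill: this node is done
                return
            assignment[n].append(p)     # "execute" the partition on node n

    workers = [threading.Thread(target=node_worker, args=(n,))
               for n in range(node_count)]
    for t in workers:
        t.start()
    for p in partitions:                # producer side
        q.put(p)
    for _ in workers:                   # one pill per consumer
        q.put(None)
    for t in workers:
        t.join()
    return assignment

work = distribute(list(range(10)), 3)
print(sum(len(v) for v in work.values()))  # → 10
```

Which node receives which partition is nondeterministic, which matches the claim's random assignment: the queue decouples the producer from the nodes, giving the asynchronous pipeline its slack.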
6. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:
first, a topological sort is performed on the computing nodes of the data flow graph inside a process, forming a topological sequence;
second, for each computing node in the topological sequence, its stage number is initialized to 0; it is then judged whether the node is on the same cluster node as its predecessor: if so, it is further judged whether it is in the same processing core as the predecessor; if it is in the same processing core, its stage number is the same as the predecessor's; if it is not in the same processing core, its stage number is greater than the predecessor's stage number by 1; if it is not on the same cluster node, its stage number is unrelated to that predecessor; by traversing the topological sequence and assigning stage numbers to all nodes, the synchronous pipeline schedule table inside the cluster node is constructed.
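The stage-numbering rule of this claim can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name, the graph encoding as predecessor lists, and the use of a maximum over multiple predecessors (the claim speaks only of "the predecessor") are mine.

```python
from collections import deque

def assign_stages(preds, node_of, core_of):
    """Assign software-pipeline stage numbers in topological order:
    same cluster node and same core as a predecessor -> same stage;
    same cluster node, different core -> predecessor's stage + 1;
    different cluster node -> that predecessor imposes no constraint."""
    # Kahn's topological sort over the predecessor lists
    succs = {v: [] for v in preds}
    indeg = {v: len(ps) for v, ps in preds.items()}
    for v, ps in preds.items():
        for p in ps:
            succs[p].append(v)
    ready, topo = deque(v for v, d in indeg.items() if d == 0), []
    while ready:
        v = ready.popleft()
        topo.append(v)
        for s in succs[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    stage = {}
    for v in topo:
        stage[v] = 0                               # initialized to 0
        for p in preds[v]:
            if node_of[p] != node_of[v]:
                continue                           # other cluster node: unrelated
            need = stage[p] if core_of[p] == core_of[v] else stage[p] + 1
            stage[v] = max(stage[v], need)
    return stage

print(assign_stages({"a": [], "b": ["a"], "c": ["b"]},
                    {"a": 0, "b": 0, "c": 0},      # all on one cluster node
                    {"a": 0, "b": 1, "c": 1}))     # b moves to another core
# → {'a': 0, 'b': 1, 'c': 1}
```

The resulting `stage` map is precisely the synchronous pipeline schedule table of the claim: all computing nodes sharing a stage number execute in the same pipeline step.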
7. The data flow compilation optimization method for multi-core clusters according to claim 5, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:
first, a topological sort is performed on the computing nodes of the data flow graph inside a process, forming a topological sequence;
second, for each computing node in the topological sequence, its stage number is initialized to 0; it is then judged whether the node is on the same cluster node as its predecessor: if so, it is further judged whether it is in the same processing core as the predecessor; if it is in the same processing core, its stage number is the same as the predecessor's; if it is not in the same processing core, its stage number is greater than the predecessor's stage number by 1; if it is not on the same cluster node, its stage number is unrelated to that predecessor; by traversing the topological sequence and assigning stage numbers to all nodes, the synchronous pipeline schedule table inside the cluster node is constructed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410185945.5A CN103970580B (en) | 2014-05-05 | 2014-05-05 | A kind of data flow towards multinuclear cluster compiles optimization method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103970580A CN103970580A (en) | 2014-08-06 |
CN103970580B true CN103970580B (en) | 2017-09-15 |
Family
ID=51240117
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410185945.5A Active CN103970580B (en) | 2014-05-05 | 2014-05-05 | A kind of data flow towards multinuclear cluster compiles optimization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970580B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9626295B2 (en) * | 2015-07-23 | 2017-04-18 | Qualcomm Incorporated | Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring |
KR20170047957A (en) * | 2015-10-26 | 2017-05-08 | Samsung Electronics Co., Ltd. | Method for operating semiconductor device and semiconductor system |
CN105242909B (en) * | 2015-11-24 | 2017-08-11 | Wuxi Jiangnan Institute of Computing Technology | A many-core loop tiling method based on multi-version code generation |
CN105892996A (en) * | 2015-12-14 | 2016-08-24 | Leshi Internet Information & Technology Corp. (Beijing) | Pipeline working method and apparatus for batch data processing |
CN106909343B (en) * | 2017-02-23 | 2019-01-29 | Beijing Zhongke Ruixin Technology Co., Ltd. | An instruction scheduling method and device based on data flow |
CN107179956B (en) * | 2017-05-17 | 2020-05-19 | Beijing Institute of Computer Technology and Application | Reliable inter-core communication method for a layered multi-core processor |
CN107391136B (en) * | 2017-07-21 | 2020-11-06 | ZhongAn Information Technology Service Co., Ltd. | Stream-based programming system and method |
CN114880133A (en) * | 2017-08-31 | 2022-08-09 | Huawei Technologies Co., Ltd. | Distributed computing system, and data transmission method and device in a distributed computing system |
CN111090464B (en) * | 2018-10-23 | 2023-09-22 | Huawei Technologies Co., Ltd. | Data stream processing method and related equipment |
CN109857562A (en) * | 2019-02-13 | 2019-06-07 | Beijing Institute of Technology | A memory-access distance optimization method on many-core processors |
CN109815617A (en) * | 2019-02-15 | 2019-05-28 | Hunan Gaozhi Technology Co., Ltd. | A simulation model driving method |
CN113160545A (en) * | 2020-01-22 | 2021-07-23 | Alibaba Group Holding Limited | Road network data processing method, device and equipment |
CN111367665B (en) * | 2020-02-28 | 2020-12-18 | Tsinghua University | Parallel communication route establishing method and system |
CN111817894B (en) * | 2020-07-13 | 2022-12-30 | Jinan Inspur Data Technology Co., Ltd. | Cluster node configuration method and system, and readable storage medium |
CN111880918B (en) * | 2020-07-28 | 2021-05-18 | Nanjing Urban & Transport Planning & Design Institute Co., Ltd. | Road network front-end rendering method and device, and electronic equipment |
CN112612585B (en) * | 2020-12-16 | 2022-07-29 | Hygon Information Technology Co., Ltd. | Thread scheduling method, configuration method, microprocessor, device and storage medium |
CN113254021B (en) * | 2021-04-16 | 2022-04-29 | Yunnan University | Compiler-assisted reinforcement-learning multi-core task allocation algorithm |
CN114860406B (en) * | 2022-05-18 | 2024-02-20 | Anyuan Technology Co., Ltd. | Docker-based distributed compiling and packaging system and method |
CN115617917B (en) * | 2022-12-16 | 2023-03-10 | China Xi'an Satellite Control Center | Method, device, system and equipment for multi-active control of a database cluster |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102855153A (en) * | 2012-07-27 | 2013-01-02 | Huazhong University of Science and Technology | Flow compilation optimization method oriented to chip multi-core processor |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2179356A1 (en) * | 2007-08-16 | 2010-04-28 | Siemens Aktiengesellschaft | Compilation of computer programs for multicore processes and the execution thereof |
- 2014-05-05: application CN201410185945.5A filed in China; granted as patent CN103970580B (status: Active)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102855153A (en) * | 2012-07-27 | 2013-01-02 | Huazhong University of Science and Technology | Flow compilation optimization method oriented to chip multi-core processor |
Non-Patent Citations (1)
Title |
---|
COStream: a data-flow-oriented programming language and its compiler implementation; Zhang Weiwei et al.; Chinese Journal of Computers; 2013-10-31; Vol. 36, No. 10; pp. 1993-2006 *
Also Published As
Publication number | Publication date |
---|---|
CN103970580A (en) | 2014-08-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103970580B (en) | A data flow compilation optimization method for multi-core clusters | |
CN110619595B (en) | Graph calculation optimization method based on interconnection of multiple FPGA accelerators | |
Xie et al. | Sync or async: Time to fuse for distributed graph-parallel computation | |
US20220129302A1 (en) | Data processing system and method for heterogeneous architecture | |
CN107329828A (en) | A data flow programming method and system for CPU/GPU heterogeneous clusters | |
CN103970602B (en) | Data flow program scheduling method oriented to multi-core processor X86 | |
CN105094751B (en) | A memory management method for parallel processing of streaming data | |
CN110753107B (en) | Resource scheduling system, method and storage medium under space-based cloud computing architecture | |
CN102541640A (en) | Cluster GPU (graphic processing unit) resource scheduling system and method | |
CN102855153B (en) | Stream compilation optimization method for chip multi-core processors | |
Gent et al. | A preliminary review of literature on parallel constraint solving | |
CN111274036A (en) | Deep learning task scheduling method based on speed prediction | |
CN116401055B (en) | Serverless computing workflow orchestration method oriented to resource-efficiency optimization | |
CN107247628A (en) | Data flow sequential-task partitioning and scheduling method for multi-core systems | |
CN1326567A (en) | Job-parallel processor | |
CN111404818B (en) | Routing protocol optimization method for general multi-core network processor | |
CN112114951A (en) | Bottom-up distributed scheduling system and method | |
CN116996941A (en) | Computing-power offloading method, device and system based on cloud-edge-end collaboration in a distribution network | |
CN107133099B (en) | A cloud computing method | |
Xu et al. | Parallel artificial bee colony algorithm for the traveling salesman problem | |
CN108205465A (en) | Dynamic task scheduling method and device for streaming applications | |
Li et al. | HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning | |
Melot | Algorithms and framework for energy efficient parallel stream computing on many-core architectures | |
Das | Algorithmic Foundation of Parallel Paging and Scheduling under Memory Constraints | |
CN110262896A (en) | A data processing acceleration method for the Spark system | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |