CN103970580B - Data flow compilation optimization method for multi-core clusters - Google Patents

Data flow compilation optimization method for multi-core clusters

Info

Publication number
CN103970580B
CN103970580B CN201410185945.5A CN201410185945A
Authority
CN
China
Prior art keywords
node
stage
task
data flow
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410185945.5A
Other languages
Chinese (zh)
Other versions
CN103970580A (en)
Inventor
于俊清
张维维
唐九飞
何云峰
管涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201410185945.5A
Publication of CN103970580A
Application granted
Publication of CN103970580B
Active legal status (current)
Anticipated expiration legal status

Links

Abstract

The invention discloses a data flow compilation optimization method for multi-core cluster systems, comprising: a task partitioning and scheduling step that determines the mapping of computing tasks to processing cores; a hierarchical pipeline scheduling step that constructs, from the task partitioning result, an inter-node pipeline schedule between cluster nodes and an inter-core pipeline schedule within each cluster node; and a cache optimization step performed according to the architectural characteristics of the multi-core processors, the communication among cluster nodes, and the execution behavior of the data flow program on the multi-core processors. The method of the present invention combines data flow program optimization techniques with the system architecture, fully exploits load balancing and the high concurrency of mixed synchronous-asynchronous pipelined code on a multi-core cluster, and optimizes the cache accesses and communication of the program for the caching and communication mechanisms of the multi-core cluster, further improving the execution performance of the program and yielding a smaller execution time.

Description

Data flow compilation optimization method for multi-core clusters
Technical field
The invention belongs to the field of computer compilation techniques, and more particularly relates to a data flow compilation optimization method for multi-core clusters.
Background art
With the development of semiconductor technology, multi-core processors have proven to be a feasible platform for exploiting parallelism. Multi-core cluster parallel systems, with their powerful computing capability and good scalability, have become an important parallel computing platform design. While a multi-core cluster system provides powerful processing capability, it also shifts more of the burden onto compilers and programmers, who must effectively exploit coarse-grained parallelism across cores. Data flow programming provides a feasible way to exploit the parallelism of multi-core architectures. In this model, each node represents a computing task and each edge represents the data flow between computing tasks. Each computing task is an independent computing unit with its own instruction stream and address space, and data between computing tasks flows through first-in-first-out (FIFO) communication queues. The data flow programming model takes the data flow model as its basis and a data flow programming language as its implementation. Data flow compilation covers the compilation techniques involved in converting a data flow program into executable code for the underlying target. Among them, compilation optimization plays a decisive role in the runtime performance of a data flow program on the target processing cores.
The compiler laboratory of the Massachusetts Institute of Technology has published a stream programming language, StreamIt. The language is based on Java, extends Java with stream constructs, and introduces the concept of a Filter. A Filter is the most basic computing unit: a single-input, single-output program block. The processing of each Filter is described by a Work function, and Work functions communicate with each other in FIFO fashion through Push, Pop and Peek operations. Meanwhile, a stream compilation optimization technique was proposed for the next-generation high-performance computer architecture Raw: first, the compiler applies a combination of splitting and fusion to the computing nodes to increase the ratio of computation to communication overhead; then the resulting computing nodes are mapped onto the processing cores to achieve load balancing. Each processing core executes in pipelined fashion, and inter-core communication uses explicit data transfers.
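For illustration, the following minimal C++ sketch (not StreamIt itself; all names are illustrative) shows the filter model described above: a single-input single-output computing unit whose Work step communicates through FIFO queues via push, pop and peek operations.

#include <cstdio>
#include <deque>

// Minimal sketch of the filter/FIFO model described above (illustrative C++,
// not actual StreamIt): each filter has a work() step that pops items from
// its input queue and pushes results to its output queue.
struct Fifo {
    std::deque<int> q;
    void push(int v) { q.push_back(v); }
    int  pop()       { int v = q.front(); q.pop_front(); return v; }
    int  peek(int i) const { return q[i]; }             // look ahead without consuming
    bool ready(int n) const { return q.size() >= (size_t)n; }
};

// A filter that consumes one item per firing and produces one item.
struct ScaleFilter {
    Fifo *in, *out;
    void work() { out->push(2 * in->pop()); }           // one "Work" execution
};

int main() {
    Fifo a, b;
    ScaleFilter f{&a, &b};
    for (int i = 0; i < 4; ++i) a.push(i);              // source fills the FIFO
    while (a.ready(1)) f.work();                        // fire while data is available
    while (b.ready(1)) std::printf("%d ", b.pop());     // sink: prints 0 2 4 6
}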
The StreamIt stream optimization offers one solution to the scheduling problem of the stream programming model on multi-core processors. By distributing computing tasks to the processing cores it achieves load balancing and ensures the parallel execution of computing tasks on the processing cores. However, it has the following defects: (1) the computation and communication scheduled on each processing core are separated; each occupies its own pipeline stage and is assigned independent communication time, which increases the communication overhead; (2) it does not consider the low-level memory allocation optimization and communication optimization of the processing cores; (3) the compilation optimization method does not optimize for the architectural characteristics of the underlying multi-core cluster system. In short, a multi-core cluster system, while providing powerful computing capability, also exposes its hierarchical storage structure and software communication mechanisms to the programmer. Existing stream compilation optimization methods do not take the underlying architecture into account and do not make full use of system hardware resources, such as storage resources, to improve program execution efficiency.
Summary of the invention
The object of the present invention is to provide a data flow compilation optimization method for multi-core clusters that optimizes data flow programs for the architecture of a multi-core cluster system and largely improves the execution performance of data flow programs.
The optimization method of the present invention takes as input the intermediate representation produced by the data flow compiler front end, the synchronous data flow graph, and applies to it, in sequence, three levels of processing: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally producing executable code. The specific steps are as follows:
(1) Task partitioning and scheduling step: determining the mapping of computing tasks to multi-core cluster nodes and processing cores
A node in the data flow graph represents a computing task, and an edge represents the communication between computing tasks. First, process-level task partitioning is applied to the synchronous data flow graph according to the number of nodes in the cluster. This sub-step uses the Group task partitioning strategy; its goal is to minimize inter-node communication overhead and maximize program execution performance. The partitioning must consider both load balancing and the minimization of communication overhead, assigning each computing task to its corresponding cluster node. Secondly, according to the computing tasks on each cluster node, thread-level task partitioning assigns each computing task to a processing core of the cluster node. This sub-step uses the replication splitting algorithm, which splits heavily loaded computing tasks; its goal is to achieve load balancing across the processing cores within a cluster node.
(2) Hierarchical pipeline scheduling step: constructing, from the task partitioning result, the inter-node pipeline schedule between cluster nodes and the inter-core pipeline schedule within each cluster node
The synchronous pipeline uses one global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously; the asynchronous software pipeline executes its subtasks in a data-driven manner. First, the synchronous data flow graph is scheduled with an asynchronous pipeline, which determines how tasks execute across cluster nodes; this step randomly maps the computing tasks of each process, as a whole, onto the cluster nodes, completing the mapping between processes and cluster nodes. Secondly, according to the dependences between computing tasks within a cluster node, each computing task (node) is assigned its stage number in the pipeline, completing the synchronous pipeline construction. Finally, using both pieces of information, the pipeline schedule table is constructed.
(3) Cache optimization step: optimizing according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors
When computing tasks (nodes) execute, the processing cores on which they run may exhibit false sharing in their use of the cache, which has a large impact on the execution performance of the program.
General-purpose x86 multi-core processors are analyzed, and the false sharing present during program execution is eliminated by combining a cache line filling (padding) mechanism with the steady-state expansion technique, optimizing the use of the cache.
The present invention integrates data flow scheduling optimization with the structure of the multi-core cluster system and realizes a three-level optimization of the data flow program, specifically comprising task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, improving the execution performance of the data flow program on the target platform. Specifically, the present invention has the following advantages:
(1) The parallelism of the program is improved. Through a formal description of the problem, the present invention abstracts the scheduling of the data flow graph onto the processing cores of a multi-core cluster system as a graph partitioning problem, constructs a hierarchical pipeline scheduling model for the data flow program, and maps the tasks onto the processing cores, achieving low communication overhead and load balancing and improving the parallelism of the program.
(2) Overhead is reduced. The present invention proposes a hierarchical pipeline scheduling model mixing synchronous and asynchronous pipelines, making full use of the computing and communication resources of the system; meanwhile, the use of the cache inside each cluster node is optimized, improving the locality of data accesses and the cache utilization, and enhancing the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a structural framework diagram of the method of the invention within the data flow compilation system;
Fig. 2 is a flow chart of the replication splitting algorithm applied inside a cluster node to a data flow program in an embodiment of the present invention;
Fig. 3 is an example diagram of the asynchronous pipelined execution of a data flow program on a cluster in an embodiment of the present invention;
Fig. 4 (a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the present invention;
Fig. 4 (b) is an example diagram of the software pipeline execution process corresponding to Fig. 4 (a);
Fig. 5 (a) is a schematic diagram of task execution in which the steady-state expansion technique eliminates false sharing for a data flow program in an embodiment of the present invention;
Fig. 5 (b) is a schematic diagram of the tasks in Fig. 5 (a) before false sharing elimination;
Fig. 5 (c) is a schematic diagram of the tasks in Fig. 5 (a) after false sharing elimination.
Detailed description of the embodiments
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be appreciated that the specific embodiments described here merely illustrate the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the invention described below can be combined with each other as long as they do not conflict.
Fig. 1 shows the structural framework of this embodiment within the stream compilation system. After a data flow program is parsed by the data flow compiler front end, an intermediate representation is generated: the synchronous data flow graph (Synchronous Data Flow, SDF). It then passes, in order, through the three-level optimization process of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, finally producing object code encapsulated with the Message Passing Interface (MPI), completing the compilation.
(1) Task partitioning and scheduling step: determining the mapping of computing tasks to multi-core cluster nodes and processing cores
This step includes two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, so their communication cost is large, whereas communication within a node is intra-machine communication, whose cost is small. Task partitioning of a data flow program must therefore distinguish these differences between nodes and between the cores within a node. Task partitioning at the different levels of the cluster is described as follows: process-level task partitioning minimizes inter-node communication overhead on the premise of load balancing across nodes, and no cycles may appear between the resulting partitions; thread-level task partitioning minimizes synchronization overhead on the premise of load balancing within a node and preserves data locality as far as possible. The specific steps are as follows:
(1.1) Process-level task partitioning. Process-level task partitioning determines the mapping between computing units and cluster nodes. To amortize the per-unit-data communication overhead of the data flow program during execution, inter-process data communication uses a block communication mechanism: a message transmission is triggered only when a buffer is filled or the buffer is forcibly flushed. To prevent deadlock during program execution, process-level partitioning must avoid cycles in the data dependences between partitions. For process-level task partitioning of the synchronous data flow graph on a multi-core cluster, the Group partitioning strategy is proposed, implemented with a greedy algorithm. Group task partitioning introduces the Group structure; a group represents a set of one or more computing units in the synchronous data flow graph. Initially, every computing unit of the synchronous data flow graph is treated as one group, and the dependences between groups are consistent with those between computing units. Group task partitioning mainly consists of four stages:
(1.1.1) Preprocessing stage. This stage targets the multi-input multi-output computing units in the synchronous data flow graph. It fuses multiple computing units into one group, reducing the number of communication edges between single computing units in the group and computing units in other groups.
(1.1.2) Group coarsening stage. This stage coarsens the preprocessed group graph, fusing multiple adjacent groups into one while avoiding the appearance of cycles in the coarsened group graph. The gain produced by fusing a pair of groups is called the coarsening gain; it is computed from workload(srcGroup) and workload(snkGroup), the respective loads of srcGroup and snkGroup, and from comm(srcGroup, snkGroup), the communication overhead between srcGroup and snkGroup, which includes both data sending and data receiving.
The coarsening uses a greedy heuristic. First, the coarsening gains of all adjacent group pairs are computed and stored in a priority queue. The pair of groups with the maximum gain is taken from the priority queue and fused; the fusion is valid if the load of the new group formed does not exceed the theoretical mean load per partition and no cycle appears in the group graph after the fusion. The groups consumed by a valid fusion are deleted from the group graph, the new group obtained by the fusion is inserted into the graph, the dependences between groups are updated, and the gains in the priority queue are updated according to the new group; this process iterates. The termination condition of the algorithm is that no fusion between any pair of groups yields a positive gain, or the number of groups in the group graph falls below a threshold.
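For illustration, the following simplified C++ sketch shows the greedy coarsening loop just described. The gain() function is an assumed stand-in (the patent defines the coarsening gain in terms of workload(srcGroup), workload(snkGroup) and comm(srcGroup, snkGroup) but its exact formula is not reproduced here), and cycle detection and edge rebuilding are elided.

#include <cstdio>
#include <queue>
#include <vector>

// Simplified sketch of the greedy coarsening loop described above.
struct Group { double workload; };
struct Edge  { int src, snk; double comm; };

double gain(const std::vector<Group>& g, const Edge& e) {
    // Assumed form: favor merging pairs with heavy communication relative to
    // their combined load (illustrative, not the patent's exact formula).
    return e.comm / (g[e.src].workload + g[e.snk].workload);
}

int main() {
    std::vector<Group> groups = {{4.0}, {3.0}, {5.0}};
    std::vector<Edge>  edges  = {{0, 1, 6.0}, {1, 2, 2.0}};
    double meanLoad = 8.0;                      // theoretical mean load per partition

    auto cmp = [&](const Edge& a, const Edge& b) {
        return gain(groups, a) < gain(groups, b);
    };
    std::priority_queue<Edge, std::vector<Edge>, decltype(cmp)> pq(cmp);
    for (const Edge& e : edges) pq.push(e);

    while (!pq.empty()) {
        Edge e = pq.top(); pq.pop();
        double merged = groups[e.src].workload + groups[e.snk].workload;
        if (gain(groups, e) <= 0 || merged > meanLoad) continue;  // invalid fusion
        // A full implementation would also reject fusions that create a cycle
        // in the group graph, then rebuild the edges and requeue the gains.
        groups[e.src].workload = merged;
        std::printf("fused %d and %d, new load %.1f\n", e.src, e.snk, merged);
    }
}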
(1.1.3) Initial partitioning stage. This stage makes a preliminary decision on the mapping between the groups of the coarsened group graph and the cluster nodes. The initial partitioning balances the load of each partition while keeping the communication between partitions as small as possible. It uses a deadlock-prevention strategy so that cycles are avoided in the partitioning result from the start. The coarsened group graph is a directed acyclic graph (Directed Acyclic Graph, DAG), and a topological sort of the DAG yields a topological sequence from the partial order among its nodes. During initial partitioning, the group nodes of the group graph are examined one by one according to this topological sequence, and the specific partition number of each group is determined.
(1.1.4) Fine-grained adjustment stage. This stage tunes the boundary computing units of the partitions, i.e., the computing units that communicate with computing units on other cluster nodes, further adjusting them according to the communication situation to reduce inter-node communication overhead. For a boundary computing unit, the partition where the unit resides is called its source partition (srcPartition), and a partition holding a computing unit that has a dependence with it is called a target partition (objPartition); a computing unit has exactly one srcPartition and may have several objPartition. The traffic between the computing unit and the other computing units in srcPartition is internalData, and the traffic between the computing unit and the computing units in the i-th objPartition is externalData[i]. During fine-grained adjustment, a priority queue is maintained whose weights are externalData[i] - internalData. The adjustment repeatedly selects and processes the maximum weight; whether a computing unit can be moved to an objPartition is decided by two factors: first, the move must not introduce a cycle into the partitioning; secondly, it must not destroy, beyond a certain extent, the load balance among all partitions. After a computing unit has been adjusted, the priority queue is updated according to the result, but an already-adjusted computing unit is not considered as an adjustment object again.
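For illustration, the following C++ sketch shows the priority queue used by the fine-grained adjustment just described, with externalData[i] - internalData as the move weight; the unit and partition data are illustrative, and the cycle and load-balance checks are only indicated in comments.

#include <cstdio>
#include <queue>
#include <vector>

// Sketch of the fine-grained adjustment priority described above: each move
// candidate weighs the traffic toward a target partition against the traffic
// kept inside the source partition (unit/partition data illustrative).
struct MoveCandidate {
    int unit;                // boundary computing unit
    int objPartition;        // candidate target partition
    double externalData;     // traffic to computing units in objPartition
    double internalData;     // traffic to units remaining in srcPartition
    double weight() const { return externalData - internalData; }
};

int main() {
    auto cmp = [](const MoveCandidate& a, const MoveCandidate& b) {
        return a.weight() < b.weight();
    };
    std::priority_queue<MoveCandidate, std::vector<MoveCandidate>, decltype(cmp)> pq(cmp);
    pq.push({0, 1, 9.0, 4.0});   // moving unit 0 to partition 1 saves 5.0
    pq.push({2, 1, 3.0, 6.0});   // moving unit 2 would cost more than it saves

    while (!pq.empty()) {
        MoveCandidate m = pq.top(); pq.pop();
        if (m.weight() <= 0) break;          // no profitable move remains
        // A full implementation would also reject moves that introduce a
        // cycle between partitions or destroy the overall load balance.
        std::printf("move unit %d to partition %d (gain %.1f)\n",
                    m.unit, m.objPartition, m.weight());
    }
}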
(1.2) Thread-level task partitioning. Thread-level task partitioning determines the mapping between the processing cores of a cluster node and the computing units on that node. Task execution within a node uses synchronous pipeline scheduling, and thread-level task partitioning uses an allocation strategy whose goal is load balancing while minimizing synchronization overhead. The main factors considered when partitioning across threads are load balancing and locality. The thread-level task partitioning step is specifically: first, the computing units inside each cluster node are initially partitioned with a multilevel K-way graph partitioning algorithm; secondly, heavily loaded computing units are split with the replication splitting algorithm, reducing the granularity of the computing units. Fig. 2 shows the flow chart of the replication splitting algorithm applied to a data flow program inside a multi-core cluster node; a sketch of its core loop is given after this paragraph. The steps of the algorithm are as follows: taking the result of the K-way graph partitioning of the previous stage as input, compute the computational load of each partition in turn and sort the partitions by load; find the partition number MaxPartition with the maximum load that contains a splittable actor (basic computing unit) and its load maxWeight; then find the partition number MinPartition with the minimum load and its load minWeight; then evaluate the inequality maxWeight < minWeight * balanceFactor (balanceFactor is the balance factor). If the result is true, the algorithm terminates; if false, continue: find the splittable actor with the maximum workload in MaxPartition, compute the splitting factor repFactor of that actor with repFactor = Max(repFactor, 2), then horizontally split the actor into repFactor parts, place one part in MinPartition and the remaining repFactor - 1 parts in MaxPartition, and remove the already-split actor from MaxPartition; then return to the start of the procedure (computing and sorting the load of each partition) and loop until the exit condition is satisfied. Finally, the multilevel K-way graph partitioning algorithm is applied again to the split graph, ensuring load balance and good locality across the processing cores.
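For illustration, the following simplified C++ sketch captures the core loop of the replication splitting algorithm described above. Actor splittability and the repFactor heuristic are assumptions made for the sketch; the patent only fixes repFactor = Max(repFactor, 2).

#include <algorithm>
#include <cstdio>
#include <vector>

// Simplified sketch of the replication-splitting loop described above.
struct Actor { double load; bool splittable; };
using Partition = std::vector<Actor>;

double loadOf(const Partition& p) {
    double s = 0;
    for (const Actor& a : p) s += a.load;
    return s;
}

void replicationSplit(std::vector<Partition>& parts, double balanceFactor) {
    for (;;) {
        // Find the most and least loaded partitions.
        auto maxIt = std::max_element(parts.begin(), parts.end(),
            [](const Partition& a, const Partition& b) { return loadOf(a) < loadOf(b); });
        auto minIt = std::min_element(parts.begin(), parts.end(),
            [](const Partition& a, const Partition& b) { return loadOf(a) < loadOf(b); });
        double maxWeight = loadOf(*maxIt), minWeight = loadOf(*minIt);
        if (maxWeight < minWeight * balanceFactor) return;  // balanced: exit condition

        // Find the heaviest splittable actor in the most loaded partition.
        auto act = std::max_element(maxIt->begin(), maxIt->end(),
            [](const Actor& a, const Actor& b) {
                if (a.splittable != b.splittable) return !a.splittable;
                return a.load < b.load;
            });
        if (act == maxIt->end() || !act->splittable) return;

        // Assumed repFactor heuristic; the patent only requires repFactor >= 2.
        int repFactor = std::max(2, (int)(maxWeight / minWeight));
        Actor piece{act->load / repFactor, true};
        maxIt->erase(act);                                  // remove the split actor
        minIt->push_back(piece);                            // one part to MinPartition
        for (int i = 0; i < repFactor - 1; ++i) maxIt->push_back(piece);
    }
}

int main() {
    std::vector<Partition> parts = {{{8.0, true}}, {{2.0, false}}};
    replicationSplit(parts, 1.5);
    std::printf("loads: %.1f %.1f\n", loadOf(parts[0]), loadOf(parts[1]));
}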
(2) Hierarchical pipeline scheduling step: constructing, from the task partitioning result, the inter-node pipeline schedule between cluster nodes and the inter-core pipeline schedule within each cluster node
Based mainly on the task partitioning result of step (1), this step determines the pipelined execution of the process-level and thread-level tasks so that the execution delay of the program is as small as possible. It includes two steps: asynchronous pipeline scheduling between cluster nodes and synchronous software pipeline scheduling between the cores inside a cluster node. The synchronous pipeline uses one global synchronization clock to ensure that the tasks on every pipeline stage complete simultaneously, and every execution stage has an equal execution delay. The asynchronous software pipeline executes its subtasks in a data-driven manner: when the data produced by one subtask's execution has been sent to another subtask that depends on it, the receiving subtask can start to execute once the data is received and its other conditions are met. In the asynchronous pipeline, the execution of the whole pipeline needs no global synchronization, and computation is separated from communication. To balance computation time against data transmission time, data transfers between the subtasks of the asynchronous pipeline generally use a block transmission mechanism: a message transmission is triggered as soon as the communication buffer between tasks is filled, without waiting for the subtask to finish executing its current stage before the data can be transmitted. The specific steps are as follows:
(2.1) Asynchronous pipeline scheduling between cluster nodes
Process-level partitioning assigns subtasks to nodes and, at the same time, determines the dependences between the subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution satisfies the data-driven property, and the execution between subtasks follows the producer-consumer pattern. Fig. 3 shows a schematic of the execution of a data flow program on a cluster composed of 3 machines. The figure contains 3 multi-core machines, corresponding to the compiler dividing the data flow program, through process-level task partitioning, into three subtasks I, II and III. The execution of the actors inside a machine depends on the parallel architecture and the intra-machine scheduling mode; inside a node on a shared-memory multi-core platform, synchronous pipeline scheduling is used. To amortize the per-unit-data transmission overhead between nodes, the asynchronous pipeline uses block communication between nodes: when the producer fills a communication block, the message passing mechanism is triggered, and the consumer starts to execute after receiving the message. Taking I and II in Fig. 3 as an example, after actor C executes for some time, the communication buffer between actor C and actor F fills up and C sends its data to F; F starts to execute after receiving the data produced by C, while C can continue to execute and generate new data. The asynchronous pipelined execution mode thus carries the execution of the data flow program on the cluster.
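For illustration, the following minimal MPI sketch in C++ shows the block communication described above between a producer actor and a consumer actor on two ranks; the block size and the roles of the ranks are illustrative assumptions.

#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int BLOCK  = 1024;   // communication block size (items), illustrative
    const int BLOCKS = 4;      // number of blocks in this toy run
    int buf[BLOCK];

    if (rank == 0) {           // producer actor: fill a block, then send it
        for (int b = 0; b < BLOCKS; ++b) {
            for (int i = 0; i < BLOCK; ++i) buf[i] = b * BLOCK + i;  // "compute"
            MPI_Send(buf, BLOCK, MPI_INT, 1, 0, MPI_COMM_WORLD);     // block full: transmit
        }
    } else if (rank == 1) {    // consumer actor: starts work once a block arrives
        for (int b = 0; b < BLOCKS; ++b) {
            MPI_Recv(buf, BLOCK, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            long sum = 0;
            for (int i = 0; i < BLOCK; ++i) sum += buf[i];           // consume the block
            std::printf("rank 1 consumed block %d, sum %ld\n", b, sum);
        }
    }
    MPI_Finalize();
    return 0;
}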
(2.2) Synchronous pipeline scheduling inside a cluster node
Thread-level synchronous pipeline scheduling includes two steps: stage assignment and construction of the pipeline schedule table. After thread-level task partitioning is completed, stage assignment is performed and the synchronous software pipeline is built. The specific steps are as follows:
(2.2.1) Stage assignment. First, the computing nodes in the data flow graph inside a cluster node are topologically sorted, forming a topological sequence. Secondly, the stage number of each computing node in the topological sequence is initialized to 0. Then, for each node, judge whether it is on the same cluster node as its predecessor node: if it is on the same cluster node, judge whether it is on the same processing core as the predecessor; if it is on the same processing core, its stage number is the same as the predecessor's; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor; if it is not on the same cluster node, its stage number is unrelated to that predecessor. By traversing the topological sequence of computing nodes, stage numbers are assigned to all nodes.
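For illustration, the following C++ sketch implements the stage assignment rule just described, assuming the nodes are already given in topological order; the node, core and cluster layout in main() reproduces the example of Fig. 4 (a).

#include <algorithm>
#include <cstdio>
#include <vector>

// Sketch of the stage assignment described above, assuming the nodes are
// already in topological order (node/core/cluster layout illustrative).
struct Node {
    int core;                  // processing core hosting the node
    int cluster;               // cluster node hosting the core
    std::vector<int> preds;    // indices of predecessor nodes
    int stage = 0;
};

void assignStages(std::vector<Node>& topo) {
    for (size_t i = 0; i < topo.size(); ++i) {
        Node& n = topo[i];
        n.stage = 0;
        for (int p : n.preds) {
            if (topo[p].cluster != n.cluster) continue;    // other cluster node: ignore
            int s = (topo[p].core == n.core)
                        ? topo[p].stage                    // same core: same stage
                        : topo[p].stage + 1;               // other core: one stage later
            n.stage = std::max(n.stage, s);                // respect all predecessors
        }
    }
}

int main() {
    // P,Q,S on core 0; R,T on core 1; U,V on core 2 (as in Fig. 4(a)).
    std::vector<Node> g = {
        {0, 0, {}},      // P
        {0, 0, {0}},     // Q
        {1, 0, {1}},     // R
        {0, 0, {2}},     // S
        {1, 0, {3}},     // T
        {2, 0, {4}},     // U
        {2, 0, {4}},     // V
    };
    assignStages(g);
    for (const Node& n : g) std::printf("%d ", n.stage);   // prints 0 0 1 2 3 4 4
}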
(2.2.2) Construct the pipeline schedule table. The results of task partitioning and stage assignment are assembled into the synchronous pipeline schedule table. As shown in Fig. 4 (a), the abscissa represents the resources, i.e., the processing cores, and the ordinate represents the stage numbers. In Fig. 4 (a), P, Q and S are assigned to the same core Core0, R and T to the same core Core1, and U and V to the same core Core2. P is the start node with stage number 0; Q is on the same core as its parent node P, so its stage number is also 0; R's stage number is 1, S's stage number is 2, T's stage number is 3, and the stage number of U and V is 4. Fig. 4 (b) shows the software pipeline execution process, which goes through the pipeline filling stage, the full stage and the draining stage.
(3) Cache optimization step: optimizing according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors
Because multiple threads share cached data and the cache uses the cache line as its storage unit, false sharing (False Sharing) occurs when variables that are modified independently by different threads share the same cache line, affecting the execution performance of the program. This step mainly optimizes the false sharing present in cache accesses in two respects:
(3.1) Cache line filling eliminates the false sharing produced by pipeline stage synchronization. With the line filling (padding) mechanism, the variables of different threads no longer share the same cache line, eliminating the false sharing.
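For illustration, the following C++ sketch shows the line filling (padding) idea, assuming the 64-byte cache line size typical of x86 processors: each per-thread variable is padded to its own cache line so that writes by different threads never touch the same line.

#include <cstdio>

// Sketch of the cache line filling (padding) idea described above: per-thread
// counters are padded to the (assumed) 64-byte x86 cache line size so that two
// threads never write to the same line.
constexpr int CACHE_LINE = 64;

struct PaddedCounter {
    alignas(CACHE_LINE) long value;   // each counter occupies its own cache line
};

PaddedCounter counters[4];            // one slot per pipeline-stage thread

int main() {
    // sizeof(PaddedCounter) is a multiple of 64, so counters[i] and
    // counters[i+1] are guaranteed to reside on different cache lines.
    std::printf("sizeof(PaddedCounter) = %zu\n", sizeof(PaddedCounter));
}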
(3.2) The steady-state expansion technique eliminates the false sharing produced by data transfers between computing units. In the producer-consumer chain shown in Fig. 5 (a), when P and C execute in parallel on different cores, false sharing also occurs when the memory accessed by P and the memory accessed by C fall on the same cache line, as shown in Fig. 5 (b). A complex data flow graph can contain many inter-core communication edges; if the cache line filling mechanism were still used for them, a large amount of space would inevitably be wasted, reducing space utilization and causing higher communication delay. To eliminate the false sharing of the communication buffers while keeping cache utilization as high as possible, the steady-state expansion technique is employed. Fig. 5 (c) shows the use of the cache after the false sharing has been eliminated. The steady-state expansion algorithm uses a greedy idea: first compute, for one steady-state execution of the data flow program, the expansion coefficient that each related computing unit should use to eliminate the false sharing on all of its output edges; then, among all these expansion coefficients, find the largest coefficient such that, after expansion, no computing unit overflows the L1 data cache during execution, and use it as the final expansion coefficient. To let the cache play its role better, when searching for the expansion coefficient it is not necessary that no computing unit ever overflows the data cache; following the "90/10 rule", it is acceptable that 10% of the computing units overflow the L1 data cache during execution as long as they do not overflow the L2/L3 cache, which still yields good performance.
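For illustration, the following C++ sketch shows a greedy search for the final expansion coefficient in the spirit of the algorithm described above; the candidate coefficients and per-execution footprints are illustrative inputs, and the L2/L3 check of the "90/10 rule" is elided.

#include <algorithm>
#include <cstdio>
#include <vector>

// Sketch of the greedy expansion-coefficient search described above.
// Per-unit candidate coefficients and footprints are illustrative inputs;
// the patent derives them from one steady-state execution of the program.
struct Unit {
    int    candidateCoeff;   // coefficient that removes false sharing on its outputs
    size_t bytesPerExec;     // working-set bytes for one execution
};

int chooseCoefficient(const std::vector<Unit>& units,
                      size_t l1Bytes, double allowedOverflow /* e.g. 0.10 */) {
    std::vector<int> cands;
    for (const Unit& u : units) cands.push_back(u.candidateCoeff);
    std::sort(cands.rbegin(), cands.rend());           // try the largest coefficient first

    for (int c : cands) {
        size_t overflowing = 0;
        for (const Unit& u : units)
            if (u.bytesPerExec * c > l1Bytes) ++overflowing;
        // Accept the largest coefficient that keeps L1 overflows within the
        // tolerated fraction ("90/10 rule"); L2/L3 checks omitted in this sketch.
        if (overflowing <= allowedOverflow * units.size()) return c;
    }
    return 1;                                          // no expansion possible
}

int main() {
    std::vector<Unit> units = {{8, 2048}, {4, 8192}, {6, 1024}};
    int c = chooseCoefficient(units, 32 * 1024, 0.10);
    std::printf("chosen expansion coefficient: %d\n", c);
}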
As will be readily appreciated by those skilled in the art, the foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions and improvements made within the spirit and principles of the invention shall be included within the scope of protection of the invention.

Claims (7)

1. A data flow compilation optimization method for multi-core clusters, characterized in that it comprises the following steps:

a task partitioning and scheduling step of determining the mapping of computing tasks to multi-core cluster nodes and processing cores, wherein the task partitioning is: first, performing process-level task partitioning on the synchronous data flow graph to determine the cluster node to which each computing task is assigned; secondly, performing thread-level task partitioning on the tasks of the synchronous data flow graph within each cluster node to determine the processing core, in the corresponding cluster node, to which each computing task is assigned;

a hierarchical pipeline scheduling step of constructing, from the task partitioning result, an inter-node pipeline schedule between cluster nodes and an inter-core pipeline schedule within each cluster node, wherein the hierarchical pipeline scheduling step specifically is: first, scheduling between cluster nodes with an asynchronous pipeline; secondly, scheduling inside each cluster node with a synchronous pipeline;

a cache optimization step performed according to the architectural characteristics of the multi-core processors, the communication among cluster nodes and the execution behavior of the data flow program on the multi-core processors, wherein the detailed process of the cache optimization step is: first, eliminating, with the cache line filling mechanism, the false sharing caused by the synchronization between the stages of the synchronous software pipeline within a cluster node; secondly, eliminating, with the steady-state expansion technique, the false sharing caused by data transfers between computing units.
2. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the task partitioning is converted into a graph partitioning problem and, according to the different goals of process-level task partitioning and thread-level task partitioning, is solved with the Group partitioning strategy and the replication splitting strategy, respectively.
3. The data flow compilation optimization method for multi-core clusters according to claim 2, characterized in that the process-level task partitioning using the Group partitioning strategy specifically comprises:

a preprocessing stage, in which multiple computing units are fused into one group, reducing the number of communication edges between single computing units in the group and computing units in other groups;

a coarsening stage, in which multiple adjacent groups are fused into one;

an initial partitioning stage, in which the groups are mapped onto cluster nodes, thereby determining the mapping between computing nodes and cluster nodes;

a fine-grained adjustment stage, in which the boundary nodes of each partition produced by the initial partitioning are tuned to reduce communication overhead.
4. The data flow compilation optimization method for multi-core clusters according to claim 1, characterized in that the thread-level task partitioning step specifically is:

first, performing an initial partitioning of the computing units inside each cluster node with a multilevel K-way graph partitioning algorithm;

secondly, splitting heavily loaded computing units with the replication splitting algorithm to reduce the granularity of the computing units;

finally, applying the multilevel K-way graph partitioning algorithm again to the split graph, ensuring load balance and good locality across the processing cores.
5. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the asynchronous pipeline scheduling randomly assigns the results of the process-level partitioning to the nodes of the cluster using the producer-consumer model.
6. The data flow compilation optimization method for multi-core clusters according to any one of claims 1 to 4, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:

first, performing a topological sort of the computing nodes in the data flow graph inside a process to form a topological sequence;

secondly, initializing the stage number of each computing node in the topological sequence to 0; then judging whether the node is on the same cluster node as its predecessor node: if it is on the same cluster node, judging whether it is on the same processing core as the predecessor node; if it is on the same processing core, its stage number is the same as that of the predecessor node; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor node; if it is not on the same cluster node, its stage number is unrelated to that predecessor node; and, by traversing the topological sequence of computing nodes, assigning stage numbers to all nodes and constructing the synchronous pipeline schedule table inside the cluster node.
7. The data flow compilation optimization method for multi-core clusters according to claim 5, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:

first, performing a topological sort of the computing nodes in the data flow graph inside a process to form a topological sequence;

secondly, initializing the stage number of each computing node in the topological sequence to 0; then judging whether the node is on the same cluster node as its predecessor node: if it is on the same cluster node, judging whether it is on the same processing core as the predecessor node; if it is on the same processing core, its stage number is the same as that of the predecessor node; if it is not on the same processing core, its stage number is greater by 1 than the stage number of the predecessor node; if it is not on the same cluster node, its stage number is unrelated to that predecessor node; and, by traversing the topological sequence of computing nodes, assigning stage numbers to all nodes and constructing the synchronous pipeline schedule table inside the cluster node.
CN201410185945.5A 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters Active CN103970580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method for multi-core clusters

Publications (2)

Publication Number Publication Date
CN103970580A CN103970580A (en) 2014-08-06
CN103970580B true CN103970580B (en) 2017-09-15

Family

ID=51240117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185945.5A Active CN103970580B (en) Data flow compilation optimization method for multi-core clusters

Country Status (1)

Country Link
CN (1) CN103970580B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9626295B2 (en) * 2015-07-23 2017-04-18 Qualcomm Incorporated Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring
KR20170047957A (en) * 2015-10-26 2017-05-08 삼성전자주식회사 Method for operating semiconductor device and semiconductor system
CN105242909B * 2015-11-24 2017-08-11 无锡江南计算技术研究所 Many-core loop blocking method based on multi-version code generation
CN105892996A * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Pipelined working method and apparatus for batch data processing
CN106909343B * 2017-02-23 2019-01-29 北京中科睿芯科技有限公司 Instruction dispatching method and device based on data flow
CN107179956B (en) * 2017-05-17 2020-05-19 北京计算机技术及应用研究所 Reliable communication method among cores of layered multi-core processor
CN107391136B (en) * 2017-07-21 2020-11-06 众安信息技术服务有限公司 Programming system and method based on stream
CN114880133A (en) * 2017-08-31 2022-08-09 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
CN111090464B (en) * 2018-10-23 2023-09-22 华为技术有限公司 Data stream processing method and related equipment
CN109857562A * 2019-02-13 2019-06-07 北京理工大学 Method for optimizing memory access distance on a many-core processor
CN109815617A * 2019-02-15 2019-05-28 湖南高至科技有限公司 Simulation model driving method
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN111367665B (en) * 2020-02-28 2020-12-18 清华大学 Parallel communication route establishing method and system
CN111817894B (en) * 2020-07-13 2022-12-30 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918B (en) * 2020-07-28 2021-05-18 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585B (en) * 2020-12-16 2022-07-29 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113254021B (en) * 2021-04-16 2022-04-29 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406B (en) * 2022-05-18 2024-02-20 安元科技股份有限公司 Distributed compiling and packing system and method based on Docker
CN115617917B (en) * 2022-12-16 2023-03-10 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2179356A1 (en) * 2007-08-16 2010-04-28 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
COStream: a data-flow-oriented programming language and compiler implementation; 张维维 et al.; Chinese Journal of Computers; 31 Oct. 2013; Vol. 36, No. 10; pp. 1993-2006 *

Also Published As

Publication number Publication date
CN103970580A (en) 2014-08-06

Similar Documents

Publication Publication Date Title
CN103970580B Data flow compilation optimization method for multi-core clusters
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
Xie et al. Sync or async: Time to fuse for distributed graph-parallel computation
US20220129302A1 (en) Data processing system and method for heterogeneous architecture
CN107329828A Data flow programming method and system for CPU/GPU heterogeneous clusters
CN103970602B Data flow program scheduling method for x86 multi-core processors
CN105094751B Memory management method for parallel processing of stream data
CN110753107B (en) Resource scheduling system, method and storage medium under space-based cloud computing architecture
CN102541640A (en) Cluster GPU (graphic processing unit) resource scheduling system and method
CN102855153B Stream compilation optimization method for chip multi-core processors
Gent et al. A preliminary review of literature on parallel constraint solving
CN111274036A (en) Deep learning task scheduling method based on speed prediction
CN116401055B Resource-efficiency-oriented serverless computing workflow orchestration method
CN107247628A Data flow program task partitioning and scheduling method for multi-core systems
CN1326567A (en) Job-parallel processor
CN111404818B (en) Routing protocol optimization method for general multi-core network processor
CN112114951A (en) Bottom-up distributed scheduling system and method
CN116996941A Computing power offloading method, device and system based on cloud-edge-end collaboration in distribution networks
CN107133099B Cloud computing method
Xu et al. Parallel artificial bee colony algorithm for the traveling salesman problem
CN108205465A Dynamic task scheduling method and device for streaming applications
Li et al. HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning
Melot Algorithms and framework for energy efficient parallel stream computing on many-core architectures
Das Algorithmic Foundation of Parallel Paging and Scheduling under Memory Constraints
CN110262896A Data processing acceleration method for the Spark system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant