CN103970580A - Data flow compilation optimization method oriented to multi-core cluster - Google Patents

Data flow compilation optimization method oriented to multi-core cluster

Info

Publication number
CN103970580A
Authority
CN
China
Prior art keywords
node
cluster
stage
division
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410185945.5A
Other languages
Chinese (zh)
Other versions
CN103970580B (en)
Inventor
于俊清
张维维
唐九飞
何云峰
管涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CN201410185945.5A priority Critical patent/CN103970580B/en
Publication of CN103970580A publication Critical patent/CN103970580A/en
Application granted granted Critical
Publication of CN103970580B publication Critical patent/CN103970580B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a data flow compilation optimization method oriented to a multi-core cluster system. The method comprises the following steps: determining the task partitioning and scheduling that maps computation tasks onto processing cores; constructing, from the task partitioning and scheduling results, a hierarchical pipeline schedule comprising pipeline schedule tables between cluster nodes and among the cores within each cluster node; and performing cache optimization according to the structural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor. The method combines the data flow program with optimization techniques specific to the system architecture, fully exploits the load balance and high parallelism of synchronous-asynchronous hybrid pipelining code on a multi-core cluster, and optimizes the program's cache accesses and communication transfers according to the cache and communication patterns of the multi-core cluster, thereby improving the execution performance of the program and shortening its execution time.

Description

A data flow compilation optimization method oriented to multi-core clusters
Technical field
The invention belongs to the field of computer compilation technology, and more specifically relates to a data flow compilation optimization method oriented to multi-core clusters.
Background art
With the development of semiconductor technology, the multi-core processor has proven to be a feasible platform for exploiting parallelism. The multi-core cluster parallel system, with its powerful computing capability and good scalability, has become an important parallel computing platform. While a multi-core cluster system provides powerful computing capability, it also places a heavier burden on compilers and programmers to effectively exploit coarse-grained parallelism between cores. Data flow programming provides a feasible way to exploit the parallelism of multi-core architectures. In this model, each node represents a computation task and each edge represents the data flowing between computation tasks. Each computation task is an independent computing unit with its own instruction stream and address space, and the data flowing between computation tasks is realized through first-in-first-out (FIFO) communication queues. The data flow programming model takes the data flow model as its basis and a data flow programming language as its implementation. Data flow compilation is the compilation process that translates a data flow programming language into an executable program for the underlying target, and its optimizations play a decisive role in the runtime performance of the data flow program on the target processing cores.
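To make the model concrete, the following is a minimal sketch, not taken from the patent, of two computation tasks connected by a FIFO edge; the names FifoEdge and fire_double are illustrative.

```cpp
#include <queue>
#include <cstdio>

// A FIFO communication queue between two computation tasks (actors).
struct FifoEdge {
    std::queue<int> q;
    void push(int v) { q.push(v); }
    int pop() { int v = q.front(); q.pop(); return v; }
    bool empty() const { return q.empty(); }
};

// An actor fires by consuming (popping) inputs and producing (pushing) outputs.
void fire_double(FifoEdge& in, FifoEdge& out) {
    out.push(2 * in.pop());
}

int main() {
    FifoEdge ab, bc;                                    // edges A->B and B->C
    for (int i = 0; i < 4; ++i) ab.push(i);             // actor A: source
    while (!ab.empty()) fire_double(ab, bc);            // actor B: transformer
    while (!bc.empty()) std::printf("%d\n", bc.pop());  // actor C: sink
    return 0;
}
```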
The compiler laboratory of the Massachusetts Institute of Technology has published a stream programming language, StreamIt. Based on Java, StreamIt extends it with streaming constructs and introduces the concept of the Filter. A Filter is the most basic computing unit: a single-input, single-output program block. The processing of each Filter is described by a Work function, and Work functions communicate with one another in FIFO fashion through Push, Pop and Peek operations. For the next-generation high-performance computer Raw, a stream optimization approach was proposed: first, the compiler combines splitting and fusion to partition and merge the computation nodes, increasing the ratio of computation to communication overhead; then the processed computation nodes are mapped onto the processing cores to achieve load balance, each processing core executes in pipelined fashion, and explicit communication between cores realizes the data transfers.
The stream optimization of StreamIt offers one solution to the scheduling problem of the stream programming model on multi-core processors: by distributing computation tasks across the processing cores, it achieves load balance and guarantees the parallel execution of computation tasks on the cores. It has, however, the following defects: (1) the computation and the communication scheduled onto each processing core are separated, and the pipeline allocates independent communication time for them, which increases communication overhead; (2) it does not consider the low-level storage allocation optimization and communication optimization of the processing cores; (3) the compilation optimization is not tailored to the architectural characteristics of the underlying multi-core cluster system. In short, while a multi-core cluster system offers the programmer powerful computing capability, it also exposes its hierarchical storage organization and software communication mechanisms. Existing stream compilation optimization methods do not take the underlying architecture into account and do not make full use of system hardware resources, such as storage resources, to improve the execution efficiency of the program.
Summary of the invention
The object of the present invention is to provide a data flow compilation optimization method oriented to multi-core clusters which, for the architecture of a multi-core cluster system, optimizes the data flow program and substantially improves its execution performance.
The optimization method adopted by the present invention takes as input the intermediate representation produced by the data flow compiler front end, the synchronous data flow graph, and subjects it in turn to three levels of processing: task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, finally generating executable code. The concrete steps are as follows:
(1) A task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster.
Nodes in the data flow graph represent computation tasks, and edges represent the communication between computation tasks. First, the synchronous data flow graph is partitioned at process level according to the number of nodes in the cluster. This sub-step adopts a Group-based multi-task partitioning strategy whose target is to minimize inter-node communication overhead and maximize program execution performance; the partitioning must weigh load balance against minimizing communication overhead, assigning each computation task to a corresponding cluster node. Second, according to the computation tasks on each cluster node, a thread-level task partitioning assigns each computation task to a processing core of its cluster node. This sub-step adopts the replication-fission method, splitting heavily loaded computation tasks, with the target of achieving load balance across the processing cores within a cluster node.
(2) A hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node.
Synchronous pipelining uses a global synchronization clock to guarantee that the tasks executing in each pipeline stage complete simultaneously, while the subtasks of an asynchronous software pipeline execute in a data-driven manner. First, the synchronous data flow graph is scheduled as an asynchronous pipeline to determine the task execution process between cluster nodes; this step maps the computation tasks of each process as a whole onto the cluster's computing nodes, completing the mapping between processes and cluster nodes. Second, according to the dependences between the computation tasks within a cluster node, each computation task (node) is assigned its stage number in the pipeline, completing the synchronous pipeline construction. Finally, these two kinds of information are used to construct the hierarchical pipeline schedule table.
(3) A cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
When computation tasks (nodes) execute, false sharing can arise in the cache usage of the processing cores on which they run, which has a considerable impact on the execution performance of the program.
For general multi-core processors of the X86 architecture, a combination of the cache line filling mechanism and the steady-state expansion technique is adopted to eliminate the false sharing present in the program and optimize cache usage.
The present invention integrates data flow scheduling with optimizations specific to the structure of the multi-core cluster system, realizing a three-level optimization process for data flow programs comprising task partitioning and scheduling, hierarchical pipeline scheduling, and cache optimization, and improving the execution performance of data flow programs on the target platform. In particular, the present invention has the following advantages:
(1) Improved program parallelism. Through a formal description of the problem, the present invention abstracts the scheduling of the data flow graph onto the processing cores of the multi-core cluster system as a partitioning problem solved with greedy heuristics, and thereby constructs a hierarchical pipeline schedule model for the data flow program; all tasks are mapped onto the processing cores, realizing low communication overhead and load balance and improving the parallelism of the program.
(2) Reduced overhead. The hierarchical pipeline schedule model of synchronous-asynchronous hybrid pipelining proposed by the present invention makes full use of the computing and communication resources of the system; at the same time, the cache usage inside the cluster nodes is optimized, improving the locality of data accesses and cache utilization and enhancing the running efficiency of the program.
Brief description of the drawings
Fig. 1 is a structural framework diagram of the method of the invention within a data flow compilation system;
Fig. 2 is a flowchart of the replication-fission method applied to a data flow program inside a cluster node in an embodiment of the present invention;
Fig. 3 is an example diagram of the asynchronous pipelined execution of a data flow program on a cluster in an embodiment of the present invention;
Fig. 4 (a) is an example diagram of task partitioning and stage assignment in synchronous software pipeline scheduling in an embodiment of the present invention;
Fig. 4 (b) is a diagram of the software pipeline execution corresponding to Fig. 4 (a);
Fig. 5 (a) is a schematic diagram of task execution in which the steady-state expansion technique eliminates false sharing in an embodiment of the present invention;
Fig. 5 (b) is a schematic diagram of the tasks in Fig. 5 (a) before false sharing is eliminated;
Fig. 5 (c) is a schematic diagram of the tasks in Fig. 5 (a) after false sharing is eliminated.
Detailed description of the embodiments
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with one another as long as they do not conflict.
Fig. 1 shows the structural framework of the present embodiment within the stream compilation system. After being parsed by the data flow compiler front end, a data flow program yields an intermediate representation, the synchronous data flow graph (Synchronous Data Flow, SDF). It then passes in turn through the three optimization levels of task partitioning and scheduling, hierarchical pipeline scheduling, and cache and communication optimization, and finally generates target code encapsulated with the Message Passing Interface (MPI), completing compilation.
(1) A task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster.
This step comprises two sub-steps: process-level task partitioning and thread-level task partitioning. In a multi-core cluster system, different nodes have different network addresses and must communicate over the network, which is costly, whereas communication within a node is machine-internal and cheap; the task partitioning of the data flow program must therefore distinguish between inter-node and intra-node (inter-core) communication. The task partitioning at the different levels of the cluster is as follows: process-level task partitioning minimizes inter-node communication overhead under the premise of guaranteeing load balance between nodes, and no cycle may appear among the partitioning results; thread-level task partitioning minimizes synchronization overhead under the premise of guaranteeing load balance, and preserves data locality as far as possible. The concrete steps are as follows:
(1.1) Process-level task partitioning. Process-level task partitioning determines the mapping between computing units and cluster nodes. To amortize the communication overhead per unit of data when the data flow program executes, inter-process data communication adopts a block communication mechanism: a message transfer is triggered only when a buffer is filled or the buffer is forcibly flushed. To prevent deadlock during program execution, the data dependences between partitions must not form a cycle. For the process-level task partitioning of the synchronous data flow graph on a multi-core cluster, a Group-based multi-task partitioning strategy is proposed, implemented with a greedy algorithm. Group task partitioning introduces the group structure: a group represents a set of one or more computing units of the synchronous data flow graph. Initially, every computing unit of the synchronous data flow graph is treated as a group of its own, and the dependences between groups coincide with those between computing units. Group task partitioning consists of four stages:
(1.1.1) Preprocessing stage. This stage is designed for the multi-input multi-output computing units in the synchronous data flow graph: it fuses several computing units into one group, reducing the number of communication edges between a single computing unit within a group and the computing units in other groups.
(1.1.2) Group coarsening stage. This stage coarsens the preprocessed group graph, fusing several adjacent groups into one while avoiding the appearance of cycles in the group graph. The gain produced by merging a pair of groups is called the coarsening gain, computed as follows:
gain = comm(srcGroup, snkGroup) / (workload(srcGroup) + workload(snkGroup))
where workload(srcGroup) and workload(snkGroup) denote the respective loads of srcGroup and snkGroup, and comm(srcGroup, snkGroup) denotes the communication overhead between srcGroup and snkGroup; the communication overhead comprises both data sending and data receiving.
Coarsening follows a greedy heuristic. First the coarsening gains of all adjacent pairs of groups are computed and the results are stored in a priority queue. The pair of groups with the maximum gain is selected from the priority queue and merged; the fusion is valid if the load of the newly formed group does not exceed the theoretical mean load per partition and no cycle appears in the group graph after the merge. The groups consumed by a valid fusion are deleted from the group graph, the new group obtained by the fusion is inserted into the graph and the dependences between groups are updated, and the gains in the priority queue are updated according to the new group. This process iterates; the algorithm terminates when no pair of groups can produce a positive gain by merging, or the number of groups in the group graph falls below a threshold.
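The following is a minimal C++ sketch of this greedy coarsening loop, under simplifying assumptions not made by the patent: groups are reduced to their loads, communication volumes are kept in a symmetric matrix, the acyclicity test on the group graph is omitted, and stale queue entries are tolerated by lazy deletion.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A group of computing units, reduced here to its total load.
struct Group { double workload; bool alive = true; };

struct Candidate {
    double gain;
    std::size_t src, snk;
    bool operator<(const Candidate& o) const { return gain < o.gain; }
};

// Coarsening gain of merging two groups, per the formula above.
double coarseningGain(double comm, const Group& s, const Group& t) {
    return comm / (s.workload + t.workload);
}

// Greedy coarsening: repeatedly merge the pair with the largest positive
// gain, rejecting fusions whose load would exceed the mean per partition.
void coarsen(std::vector<Group>& g,
             std::vector<std::vector<double>>& comm,  // comm[i][j] = volume
             double meanLoad, std::size_t minGroups) {
    std::priority_queue<Candidate> pq;
    for (std::size_t i = 0; i < g.size(); ++i)
        for (std::size_t j = i + 1; j < g.size(); ++j)
            if (comm[i][j] > 0)
                pq.push({coarseningGain(comm[i][j], g[i], g[j]), i, j});

    std::size_t live = g.size();
    while (!pq.empty() && live > minGroups) {
        Candidate c = pq.top(); pq.pop();
        if (c.gain <= 0) break;                           // termination condition
        if (!g[c.src].alive || !g[c.snk].alive) continue; // stale entry
        double fused = g[c.src].workload + g[c.snk].workload;
        if (fused > meanLoad) continue;                   // would unbalance load
        g[c.src].workload = fused;                        // fuse snk into src
        g[c.snk].alive = false;
        --live;
        for (std::size_t k = 0; k < g.size(); ++k) {      // inherit snk's edges
            if (!g[k].alive || k == c.src) continue;
            comm[c.src][k] += comm[c.snk][k];
            comm[k][c.src] = comm[c.src][k];
            if (comm[c.src][k] > 0)
                pq.push({coarseningGain(comm[c.src][k], g[c.src], g[k]),
                         c.src, k});
        }
    }
}
```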
(1.1.3) Initial partitioning stage. This stage preliminarily determines the mapping between the groups of the coarsened group graph and the cluster nodes. Initial partitioning balances the load of each partition while guaranteeing, as far as possible, that the communication between partitions is minimal. It adopts a deadlock-prevention strategy, avoiding cycles in the partitioning result from the very start of the partitioning. After coarsening, the group graph is a directed acyclic graph (Directed Acyclic Graph, DAG); topologically sorting a DAG uses the partial order between its nodes to obtain a topological sequence, and during initial partitioning the group nodes of the group graph are examined one by one in topological order to determine the concrete cluster node number of each group.
(1.1.4) Fine-grained adjustment stage. This stage further tunes the boundary computing units of the partitions, i.e. those that communicate with computing units on other cluster nodes, according to their communication, to reduce inter-node communication overhead. For a boundary computing unit, the partition containing it is called its source partition (srcPartition), and a partition containing a computing unit with which it has a dependence is a target partition (objPartition); a computing unit has exactly one srcPartition and may have several objPartitions. The traffic between the computing unit and the other computing units in its srcPartition is internalData, and the traffic between the computing unit and the computing units in its i-th objPartition is externalData[i]. During fine-grained adjustment a priority queue is maintained whose weights are externalData[i] − internalData. The adjustment repeatedly selects the element with the maximum weight for processing; whether a computing unit may be moved to an objPartition is decided by two factors: first, the move must not introduce a cycle into the partitioning; second, it must not, to any significant extent, destroy the load balance of the overall partitioning. After a computing unit has been adjusted, the priority queue is updated according to the result, but a computing unit that has already been adjusted is not taken as an adjustment object again.
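A compact sketch of this adjustment loop follows; the cycle and balance tests are stubbed out, and all names (Move, introducesCycle, keepsBalance) are illustrative rather than taken from the patent.

```cpp
#include <queue>
#include <vector>

// A candidate migration of a boundary computing unit to a target partition,
// weighted by externalData[i] - internalData as described above.
struct Move {
    double weight;
    int actor, objPartition;
    bool operator<(const Move& o) const { return weight < o.weight; }
};

// Stubs for the two acceptance tests; a real implementation would check
// the partition DAG for cycles and the overall load balance.
bool introducesCycle(int /*actor*/, int /*objPartition*/) { return false; }
bool keepsBalance(int /*actor*/, int /*objPartition*/) { return true; }

void fineGrainAdjust(std::priority_queue<Move> pq,
                     std::vector<int>& partitionOf,
                     std::vector<bool>& adjusted) {
    while (!pq.empty()) {
        Move m = pq.top(); pq.pop();
        if (adjusted[m.actor]) continue;  // each unit is adjusted at most once
        if (introducesCycle(m.actor, m.objPartition) ||
            !keepsBalance(m.actor, m.objPartition)) continue;
        partitionOf[m.actor] = m.objPartition;  // migrate to target partition
        adjusted[m.actor] = true;
        // ... recompute the weights of affected neighbours and refresh pq ...
    }
}
```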
(1.2) Thread-level task partitioning. Thread-level task partitioning determines the mapping between the computing units on a cluster node and the processing cores inside that node. Task execution within a node adopts synchronous pipeline scheduling, and thread-level task partitioning adopts an allocation strategy whose target is load balance with minimum synchronization overhead; the main factors considered in the inter-thread partitioning are load balance and locality. The thread-level task partitioning steps are: first, a multilevel K-way graph partitioning algorithm produces an initial partition of the computing units inside each cluster node; second, the replication-fission method splits heavily loaded computing units to reduce the granularity of the computing units; Fig. 2 shows the flowchart of the replication-fission method applied to a data flow program inside a multi-core cluster node, sketched in code after this paragraph. The algorithm proceeds as follows: taking the result of the K-way graph partitioning as input, compute the computational load of each partition and sort the partitions by load; find the partition number MaxPartition and workload maxWeight of the most heavily loaded partition that contains a splittable actor (a basic computing unit), and the partition number MinPartition and workload minWeight of the most lightly loaded partition; test the inequality maxWeight < minWeight * balanceFactor (balanceFactor is a balance factor); if the result is true, the algorithm terminates; if false, find the splittable actor with the largest workload in MaxPartition, compute its fission factor repFactor, set repFactor = max(repFactor, 2), split the actor horizontally into repFactor copies, place one copy in MinPartition and the remaining repFactor − 1 copies in MaxPartition, remove the split actor from MaxPartition, and return to the start of the procedure (computing and sorting the loads of the partitions), looping until the exit condition is met. Finally, the multilevel K-way graph partitioning algorithm is applied once more to the graph after fission, guaranteeing load balance across the processing cores and good locality.
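The balancing loop of Fig. 2 can be sketched as follows; the computation of the fission factor repFactor is simplified here to a load-ratio guess, and an iteration cap is added for safety, neither of which is specified by the patent.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Actor { double load; bool splittable; };
struct Partition { std::vector<Actor> actors; };

double loadOf(const Partition& p) {
    double s = 0;
    for (const Actor& a : p.actors) s += a.load;
    return s;
}

// Replication-fission loop: while the heaviest partition with a splittable
// actor violates maxWeight < minWeight * balanceFactor, split its heaviest
// splittable actor into repFactor pieces, sending one piece to MinPartition.
void replicationFission(std::vector<Partition>& parts, double balanceFactor) {
    for (int iter = 0; iter < 1000; ++iter) {       // safety cap (sketch only)
        std::size_t maxP = parts.size(), minP = 0;
        for (std::size_t i = 0; i < parts.size(); ++i) {
            if (loadOf(parts[i]) < loadOf(parts[minP])) minP = i;
            bool hasSplittable = std::any_of(
                parts[i].actors.begin(), parts[i].actors.end(),
                [](const Actor& a) { return a.splittable; });
            if (hasSplittable && (maxP == parts.size() ||
                                  loadOf(parts[i]) > loadOf(parts[maxP])))
                maxP = i;
        }
        if (maxP == parts.size()) return;           // no splittable actor left
        double maxW = loadOf(parts[maxP]), minW = loadOf(parts[minP]);
        if (maxW < minW * balanceFactor) return;    // exit condition met

        auto heaviest = std::max_element(           // heaviest splittable actor
            parts[maxP].actors.begin(), parts[maxP].actors.end(),
            [](const Actor& a, const Actor& b) {
                return (a.splittable ? a.load : -1.0) <
                       (b.splittable ? b.load : -1.0);
            });
        int repFactor = std::max(2, (int)(maxW / std::max(minW, 1.0)));
        Actor piece{heaviest->load / repFactor, true};
        parts[maxP].actors.erase(heaviest);         // remove the split actor
        parts[minP].actors.push_back(piece);        // one copy to MinPartition
        for (int k = 0; k < repFactor - 1; ++k)     // rest stay in MaxPartition
            parts[maxP].actors.push_back(piece);
    }
}
```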
(2) A hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node.
Based on the task partitioning results of step (1), this step determines the pipelined execution of the process-level and thread-level tasks so that the program executes with as little delay as possible. It comprises two steps: asynchronous pipeline scheduling between cluster nodes and synchronous software pipeline scheduling among the cores within a cluster node. Synchronous pipelining uses a global synchronization clock to guarantee that the tasks executing in each pipeline stage complete simultaneously, every execution stage having an equal execution delay. The subtasks of the asynchronous software pipeline execute in a data-driven manner: the data produced by one subtask's execution is sent to another subtask that has a dependence on it, and a subtask may start executing once it has received data and its other conditions are satisfied; in an asynchronous pipeline the execution of the whole pipeline needs no global synchronization, and computation is separated from communication. To balance computation time against data transfer time, data transfers between asynchronous pipeline subtasks usually adopt a block transfer mechanism: a message transfer is triggered as soon as the communication buffer between tasks is filled, without waiting for the subtask to finish its current phase. The concrete steps are as follows:
(2.1) Asynchronous pipeline scheduling between cluster nodes
Process-level partitioning, in assigning subtasks to nodes, also determines the dependences between subtasks. Asynchronous pipeline scheduling has no global synchronization clock; subtask execution satisfies the data-driven property, and the execution between subtasks follows the producer-consumer pattern. Fig. 3 shows the execution of a data flow program on a cluster composed of three multi-core machines, each corresponding to one of the three subtasks I, II and III into which the compiler's process-level task partitioning divides the program. The execution of the actors within a machine depends on the machine's internal parallel architecture and scheduling scheme; on a shared-memory multi-core platform, synchronous pipeline scheduling is adopted inside the node. Between nodes, to amortize the per-unit cost of data transmission, the asynchronous pipeline uses block communication between nodes: the producer triggers the message-passing mechanism when a communication block is full, and the consumer starts executing after receiving the message. Taking I and II in Fig. 3 as an example: after actor C has executed for some time, the communication buffer between actor C and actor F is filled and C sends its data to F; after F receives the data produced by C, F starts executing while C continues to execute and produce new data. This asynchronous pipelined execution guarantees the execution of the data flow program on the cluster.
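A minimal MPI sketch of this block communication follows, mirroring the C-to-F example above; the ranks, tag, block size and the actor computations are illustrative assumptions, not values from the patent.

```cpp
#include <mpi.h>
#include <vector>

// Producer-consumer block communication between two pipeline subtasks:
// the producer (rank 0, like actor C) fills a communication block and only
// then triggers the message; the consumer (rank 1, like actor F) starts
// executing after the block arrives, while the producer keeps computing.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int BLOCK = 1024;                // communication block size (assumed)
    const int ROUNDS = 4;
    std::vector<int> buf(BLOCK);

    if (rank == 0) {                       // producer
        for (int r = 0; r < ROUNDS; ++r) {
            for (int i = 0; i < BLOCK; ++i) buf[i] = r * BLOCK + i; // compute
            // buffer is now full: trigger message passing
            MPI_Send(buf.data(), BLOCK, MPI_INT, 1, 0, MPI_COMM_WORLD);
        }
    } else if (rank == 1) {                // consumer
        for (int r = 0; r < ROUNDS; ++r) {
            MPI_Recv(buf.data(), BLOCK, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            // ... consume the received block ...
        }
    }
    MPI_Finalize();
    return 0;
}
```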
(2.2) Synchronous pipeline scheduling inside a cluster node
Thread-level synchronous pipeline scheduling comprises two steps: stage assignment and construction of the pipeline schedule table. After thread-level task partitioning completes, stage assignment is performed and the synchronous software pipeline is built. The concrete steps are as follows:
(2.2.1) Stage assignment. First, the computation nodes of the data flow graph inside the cluster node are topologically sorted to form a topological sequence. Next, the stage number of each computation node in the topological sequence is initialized to 0. Then, for each node, it is judged whether it lies on the same cluster node as its predecessor: if so, it is judged whether it lies on the same processing core as the predecessor; if they share a processing core, its stage is identical to the predecessor's; if not, its stage number is one greater than the predecessor's stage number; if it does not lie on the same cluster node, its stage number is independent of that predecessor. By traversing the topological sequence of computation nodes, all nodes are assigned their stage numbers.
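A sketch of this stage assignment is given below; for simplicity it assumes each node records a single predecessor in the topological order (-1 for a start node), whereas a real data flow graph may have several.

```cpp
#include <cstddef>
#include <vector>

// A computation node: the cluster node and core it was partitioned onto,
// plus the index of its predecessor in the topological sequence (or -1).
struct Node { int clusterNode, core, pred; };

// Assign pipeline stage numbers over a topological sequence, following the
// three cases described above.
std::vector<int> assignStages(const std::vector<Node>& topo) {
    std::vector<int> stage(topo.size(), 0);        // all stages start at 0
    for (std::size_t i = 0; i < topo.size(); ++i) {
        int p = topo[i].pred;
        if (p < 0) continue;                       // start node keeps stage 0
        if (topo[i].clusterNode != topo[p].clusterNode)
            continue;    // different cluster node: stage independent of pred
        if (topo[i].core == topo[p].core)
            stage[i] = stage[p];                   // same core: same stage
        else
            stage[i] = stage[p] + 1;               // same node, other core: +1
    }
    return stage;
}
```

Applied to the Fig. 4 (a) example below, P and Q on Core0 receive stage 0, R receives 1, S receives 2, T receives 3, and U and V receive 4.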
(2.2.2) Constructing the pipeline schedule. The synchronous pipeline schedule table is built from the results of task partitioning and stage assignment. As shown in Fig. 4, the abscissa represents the resources, i.e. the processing cores, and the ordinate represents the stage numbers. In Fig. 4 (a), P, Q and S are partitioned onto the same core Core0, R and T onto the same core Core1, and U and V onto the same core Core2. P is the start node with stage number 0; Q and its parent node P are on the same core, so its stage number is also 0; the stage number of R is 1, that of S is 2, that of T is 3, and that of U and V is 4. As Fig. 4 (b) shows, the software pipeline execution passes through a fill phase, a full phase and a drain phase.
(3) A cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
Because multiple threads share cached data and the cache stores data in units of cache lines, false sharing (False Sharing) arises when several threads modify mutually independent variables that reside on the same cache line, affecting the execution performance of the program. This step targets the false sharing present in cache accesses and optimizes it in two respects:
(3.1) Cache line filling eliminates the false sharing produced by the synchronization between pipeline stages. The filling mechanism prevents the variables of different threads from sharing the same cache line, eliminating false sharing.
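A minimal sketch of the cache line filling mechanism, assuming a 64-byte cache line (a common size on X86, though not stated in the patent): each thread's counter is padded to occupy its own line.

```cpp
#include <thread>
#include <vector>

// Pad each per-thread variable to a full (assumed) 64-byte cache line so
// that two threads never write to the same line: no false sharing.
struct alignas(64) PaddedCounter {
    long value = 0;
    char pad[64 - sizeof(long)];   // fill the remainder of the cache line
};

int main() {
    const int THREADS = 4;
    std::vector<PaddedCounter> counters(THREADS);
    std::vector<std::thread> workers;
    for (int t = 0; t < THREADS; ++t)
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1000000; ++i)
                counters[t].value++;   // each thread touches only its own line
        });
    for (auto& w : workers) w.join();
    return 0;
}
```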
(3.2) The steady-state expansion technique eliminates the false sharing produced by data transfers between computing units. As shown in Fig. 5 (a), when a producer-consumer chain P, C executes in parallel on different cores and the memory accessed by P and C lies on the same cache line, false sharing also occurs, as shown in Fig. 5 (b). A complex data flow graph may contain many inter-core communication edges; if the cache line filling mechanism were still used, a great deal of space would inevitably be wasted, lowering space utilization and producing higher communication delay. To eliminate the false sharing of communication buffers while improving cache utilization as far as possible, the steady-state expansion technique is adopted; Fig. 5 (c) shows the cache usage after false sharing has been eliminated. The steady-state expansion algorithm follows a greedy idea: it first computes, for one steady-state execution of the data flow program, the expansion coefficient each involved computing unit would need for all of its output edges to be free of false sharing, and then, among all these expansion coefficients, finds the largest coefficient that, after all computing units are expanded, does not cause the L1 data cache to overflow during execution, taking it as the final expansion coefficient. To let the cache be even more effective, the search for the expansion coefficient need not keep every computing unit free of data cache overflow: following the 90/10 principle, 10% of the computing units may be allowed to overflow the L1 data cache during execution as long as they do not overflow the L2/L3 cache, which still yields good performance.
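The greedy choice of the final expansion coefficient can be sketched as follows; the per-unit working-set sizes and the candidate coefficients are assumed inputs, and only the L1 test with the 10% allowance is shown (a full implementation would also check L2/L3).

```cpp
#include <algorithm>
#include <vector>

// Pick the largest expansion coefficient such that, after all computing
// units are expanded, at most 10% of them overflow the L1 data cache,
// following the 90/10 allowance described above.
int chooseExpansion(const std::vector<int>& candidateFactors, // per-unit coefficients
                    const std::vector<long>& workingSet,      // bytes per unit
                    long l1Bytes) {
    int best = 1;
    for (int f : candidateFactors) {
        long overflowing = 0;
        for (long ws : workingSet)
            if (ws * f > l1Bytes) ++overflowing;          // unit spills L1
        if (overflowing * 10 <= (long)workingSet.size())  // at most 10% spill
            best = std::max(best, f);
    }
    return best;
}
```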
Those skilled in the art will readily understand that the foregoing is only a preferred embodiment of the present invention and is not intended to limit it; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A data flow compilation optimization method oriented to multi-core clusters, characterized by comprising the following steps:
a task partitioning and scheduling step that determines the mapping of computation tasks onto the computing nodes and processing cores of the multi-core cluster;
a hierarchical pipeline scheduling step that constructs, from the task partitioning and scheduling results, the pipeline schedule tables between cluster nodes and among the cores within each cluster node;
a cache optimization step performed according to the architectural characteristics of the multi-core processor, the communication between cluster nodes, and the execution of the data flow program on the multi-core processor.
2. The data flow compilation optimization method oriented to multi-core clusters according to claim 1, characterized in that the task partitioning and scheduling step is specifically:
first, performing process-level task partitioning on the synchronous data flow graph to determine the cluster node to which each computation task is assigned;
second, performing thread-level task partitioning on the tasks of the synchronous data flow graph within a cluster node to determine the processing core of that cluster node to which each computation task is assigned.
3. The data flow compilation optimization method oriented to multi-core clusters according to claim 2, characterized in that the task partitioning is obtained by converting it into a graph partitioning problem and, according to the different targets of process-level and thread-level task partitioning, solving it with the Group partitioning strategy and the replication-fission strategy respectively.
4. The data flow compilation optimization method oriented to multi-core clusters according to claim 3, characterized in that the Group partitioning strategy adopted by the process-level task partitioning is specifically:
a preprocessing stage, which fuses several computing units into one group, reducing the number of communication edges between a single computing unit within a group and the computing units in other groups;
a coarsening stage, which fuses several adjacent groups into one;
an initial partitioning stage, which maps the groups onto the cluster's computing nodes, at the same time determining the mapping between computation nodes and cluster nodes;
a fine-grained adjustment stage, which tunes the boundary nodes of each partition produced by the initial partitioning to reduce communication overhead.
5. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 2 to 4, characterized in that the thread-level task partitioning step is specifically:
first, using a multilevel K-way graph partitioning algorithm to produce an initial partition of the computing units inside each cluster node;
second, using the replication-fission method to split heavily loaded computing units, reducing the granularity of the computing units;
finally, applying the multilevel K-way graph partitioning algorithm once more to the graph after fission, guaranteeing load balance across the processing cores and good locality.
6. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 1 to 5, characterized in that the hierarchical pipeline scheduling step is specifically:
first, adopting asynchronous pipeline scheduling between cluster nodes;
second, adopting synchronous pipeline scheduling inside each cluster node.
7. The data flow compilation optimization method oriented to multi-core clusters according to claim 6, characterized in that the asynchronous pipeline scheduling adopts the producer-consumer model and assigns the results of the process-level partitioning randomly onto the nodes of the cluster.
8. The data flow compilation optimization method oriented to multi-core clusters according to claim 6 or 7, characterized in that the detailed process of the synchronous pipeline scheduling is as follows:
first, topologically sorting the computation nodes of the data flow graph inside a process to form a topological sequence;
second, initializing the stage number of each computation node in the topological sequence to 0; then judging whether the node lies on the same cluster node as its predecessor: if so, judging whether it lies on the same processing core as the predecessor; if they share a processing core, its stage is identical to the predecessor's; if not, its stage number is one greater than the predecessor's stage number; if it does not lie on the same cluster node, its stage number is independent of that predecessor; assigning stage numbers to all nodes by traversing the topological sequence of computation nodes, and constructing the synchronous pipeline schedule table inside the cluster node.
9. The data flow compilation optimization method oriented to multi-core clusters according to any one of claims 6 to 8, characterized in that the detailed process of the cache optimization is:
first, using the cache line filling mechanism to eliminate the false sharing caused by the synchronization between the stages of the synchronous software pipeline inside a cluster node;
second, using the steady-state expansion technique to eliminate the false sharing caused by data transfers between computing units.
CN201410185945.5A 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters Active CN103970580B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410185945.5A CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Publications (2)

Publication Number Publication Date
CN103970580A true CN103970580A (en) 2014-08-06
CN103970580B CN103970580B (en) 2017-09-15

Family

ID=51240117

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410185945.5A Active CN103970580B (en) 2014-05-05 2014-05-05 Data flow compilation optimization method oriented to multi-core clusters

Country Status (1)

Country Link
CN (1) CN103970580B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105242909A (en) * 2015-11-24 2016-01-13 无锡江南计算技术研究所 Method for many-core circulation partitioning based on multi-version code generation
CN105892996A (en) * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Assembly line work method and apparatus for batch data processing
CN106610860A (en) * 2015-10-26 2017-05-03 三星电子株式会社 Operating method of semiconductor device and semiconductor system
CN106909343A (en) * 2017-02-23 2017-06-30 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN107179956A (en) * 2017-05-17 2017-09-19 北京计算机技术及应用研究所 It is layered the internuclear reliable communication method of polycaryon processor
CN107391136A (en) * 2017-07-21 2017-11-24 众安信息技术服务有限公司 A kind of programing system and method based on streaming
CN107851040A (en) * 2015-07-23 2018-03-27 高通股份有限公司 For the system and method using cache requirements monitoring scheduler task in heterogeneous processor cluster framework
CN109426574A (en) * 2017-08-31 2019-03-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
CN109815617A (en) * 2019-02-15 2019-05-28 湖南高至科技有限公司 A kind of simulation model driving method
CN109857562A (en) * 2019-02-13 2019-06-07 北京理工大学 A kind of method of memory access distance optimization on many-core processor
CN111090464A (en) * 2018-10-23 2020-05-01 华为技术有限公司 Data stream processing method and related equipment
CN111367665A (en) * 2020-02-28 2020-07-03 清华大学 Parallel communication route establishing method and system
CN111817894A (en) * 2020-07-13 2020-10-23 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918A (en) * 2020-07-28 2020-11-03 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585A (en) * 2020-12-16 2021-04-06 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN113254021A (en) * 2021-04-16 2021-08-13 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406A (en) * 2022-05-18 2022-08-05 南京安元科技有限公司 Distributed compiling and packaging system and method based on Docker
CN115617917A (en) * 2022-12-16 2023-01-17 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009021539A1 (en) * 2007-08-16 2009-02-19 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009021539A1 (en) * 2007-08-16 2009-02-19 Siemens Aktiengesellschaft Compilation of computer programs for multicore processes and the execution thereof
CN102855153A (en) * 2012-07-27 2013-01-02 华中科技大学 Flow compilation optimization method oriented to chip multi-core processor

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张维维 et al.: "COStream: a data-flow-oriented programming language and compiler implementation", Chinese Journal of Computers (《计算机学报》) *

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107851040A (en) * 2015-07-23 2018-03-27 高通股份有限公司 For the system and method using cache requirements monitoring scheduler task in heterogeneous processor cluster framework
CN106610860A (en) * 2015-10-26 2017-05-03 三星电子株式会社 Operating method of semiconductor device and semiconductor system
CN105242909A (en) * 2015-11-24 2016-01-13 无锡江南计算技术研究所 Method for many-core circulation partitioning based on multi-version code generation
CN105892996A (en) * 2015-12-14 2016-08-24 乐视网信息技术(北京)股份有限公司 Assembly line work method and apparatus for batch data processing
CN106909343A (en) * 2017-02-23 2017-06-30 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN106909343B (en) * 2017-02-23 2019-01-29 北京中科睿芯科技有限公司 A kind of instruction dispatching method and device based on data flow
CN107179956A (en) * 2017-05-17 2017-09-19 北京计算机技术及应用研究所 It is layered the internuclear reliable communication method of polycaryon processor
CN107179956B (en) * 2017-05-17 2020-05-19 北京计算机技术及应用研究所 Reliable communication method among cores of layered multi-core processor
CN107391136A (en) * 2017-07-21 2017-11-24 众安信息技术服务有限公司 A kind of programing system and method based on streaming
CN107391136B (en) * 2017-07-21 2020-11-06 众安信息技术服务有限公司 Programming system and method based on stream
CN109426574A (en) * 2017-08-31 2019-03-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
WO2019042312A1 (en) * 2017-08-31 2019-03-07 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
US11010681B2 (en) 2017-08-31 2021-05-18 Huawei Technologies Co., Ltd. Distributed computing system, and data transmission method and apparatus in distributed computing system
CN109426574B (en) * 2017-08-31 2022-04-05 华为技术有限公司 Distributed computing system, data transmission method and device in distributed computing system
US11900113B2 (en) 2018-10-23 2024-02-13 Huawei Technologies Co., Ltd. Data flow processing method and related device
CN111090464B (en) * 2018-10-23 2023-09-22 华为技术有限公司 Data stream processing method and related equipment
CN111090464A (en) * 2018-10-23 2020-05-01 华为技术有限公司 Data stream processing method and related equipment
CN109857562A (en) * 2019-02-13 2019-06-07 北京理工大学 A kind of method of memory access distance optimization on many-core processor
CN109815617A (en) * 2019-02-15 2019-05-28 湖南高至科技有限公司 A kind of simulation model driving method
CN113160545A (en) * 2020-01-22 2021-07-23 阿里巴巴集团控股有限公司 Road network data processing method, device and equipment
CN111367665A (en) * 2020-02-28 2020-07-03 清华大学 Parallel communication route establishing method and system
WO2021169393A1 (en) * 2020-02-28 2021-09-02 清华大学 Parallel communication routing setup method and system
CN111817894A (en) * 2020-07-13 2020-10-23 济南浪潮数据技术有限公司 Cluster node configuration method and system and readable storage medium
CN111880918A (en) * 2020-07-28 2020-11-03 南京市城市与交通规划设计研究院股份有限公司 Road network front end rendering method and device and electronic equipment
CN112612585A (en) * 2020-12-16 2021-04-06 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN112612585B (en) * 2020-12-16 2022-07-29 海光信息技术股份有限公司 Thread scheduling method, configuration method, microprocessor, device and storage medium
CN113254021A (en) * 2021-04-16 2021-08-13 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN113254021B (en) * 2021-04-16 2022-04-29 云南大学 Compiler-assisted reinforcement learning multi-core task allocation algorithm
CN114860406A (en) * 2022-05-18 2022-08-05 南京安元科技有限公司 Distributed compiling and packaging system and method based on Docker
CN114860406B (en) * 2022-05-18 2024-02-20 安元科技股份有限公司 Distributed compiling and packing system and method based on Docker
CN115617917A (en) * 2022-12-16 2023-01-17 中国西安卫星测控中心 Method, device, system and equipment for controlling multiple activities of database cluster

Also Published As

Publication number Publication date
CN103970580B (en) 2017-09-15

Similar Documents

Publication Publication Date Title
CN103970580A (en) Data flow compilation optimization method oriented to multi-core cluster
CN107329828B (en) A kind of data flow programmed method and system towards CPU/GPU isomeric group
CN110619595B (en) Graph calculation optimization method based on interconnection of multiple FPGA accelerators
CN104965761B (en) A kind of more granularity divisions of string routine based on GPU/CPU mixed architectures and dispatching method
CN100449478C (en) Method and apparatus for real-time multithreading
CN103970602B (en) Data flow program scheduling method oriented to multi-core processor X86
CN104781786B (en) Use the selection logic of delay reconstruction program order
CN103809936A (en) System and method for allocating memory of differing properties to shared data objects
US7694290B2 (en) System and method for partitioning an application utilizing a throughput-driven aggregation and mapping approach
CN110413391A (en) Deep learning task service method for ensuring quality and system based on container cluster
Gent et al. A preliminary review of literature on parallel constraint solving
CN102855153B (en) Towards the stream compile optimization method of chip polycaryon processor
CN116401055B (en) Resource efficiency optimization-oriented server non-perception computing workflow arrangement method
CN111158790B (en) FPGA virtualization method for cloud deep learning reasoning
CN107247628A (en) A kind of data flow sequence task towards multiple nucleus system is divided and dispatching method
Song et al. Bridging the semantic gaps of GPU acceleration for scale-out CNN-based big data processing: Think big, see small
CN107329822A (en) Towards the multi-core dispatching method based on super Task Network of multi-source multiple nucleus system
CN112905317A (en) Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform
Roth et al. Adaptive algorithm and tool flow for accelerating systemc on many-core architectures
Varisteas et al. Resource management for task-based parallel programs over a multi-kernel.: Bias: Barrelfish inter-core adaptive scheduling
CN106844024B (en) GPU/CPU scheduling method and system of self-learning running time prediction model
Massari et al. Predictive resource management for next-generation high-performance computing heterogeneous platforms
Kumar et al. Overflowing emerging neural network inference tasks from the GPU to the CPU on heterogeneous servers
CN108205465A (en) The task-dynamic dispatching method and device of streaming applications
Ha et al. Decidable signal processing dataflow graphs

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant