CN102270114B

CN102270114B - Method and device for inserting inter-cluster data transmission operation

Info

Publication number: CN102270114B
Application number: CN 201110120058
Authority: CN
Inventors: 苏振宇
Original assignee: BEIJING BEIYANG ELECTRONIC TECHNOLOGY Co Ltd; Sunplus Technology Co Ltd; Sunplus Core Technology Co Ltd
Current assignee: BEIJING SUNPLUS-EHUE TECHNOLOGY CO., LTD.
Priority date: 2011-05-06
Filing date: 2011-05-06
Publication date: 2013-08-14
Anticipated expiration: 2031-05-06
Also published as: CN102270114A

Abstract

The invention provides a method for inserting inter-cluster data transfer operations. The method comprises: inputting a source program to be allocated; obtaining, according to a pre-established system of fully linear constraint equations, the execution times of the instructions in the source program under different allocation schemes across the clusters of a processor, and finding the shortest of the execution times obtained; and determining, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether an inter-cluster data transfer operation is to be inserted between dependent instructions. With this scheme, the clusters execute as many instructions simultaneously as possible while as few inter-cluster data transfer operations as possible are inserted.

Description

Method and device for inserting inter-cluster data transfer operations

Technical field

The present invention relates to the field of processor technology, and in particular to a method and device for inserting inter-cluster data transfer operations.

Background technology

Processors that adopt a very long instruction word (VLIW, Very Long Instruction Word) architecture generally have abundant hardware resources, and these resources usually include several clusters with identical functions. To distinguish them from ordinary processors, such processors are called clustered processors below. A cluster is a group of execution units with the same functions inside a clustered processor; the clusters are independent of one another and can execute instructions simultaneously. The TI TMS320C6x processor family, for example, has two clusters, called cluster A and cluster B, and the processor can issue 8 instructions at each time point. A hardware data path exists between cluster A and cluster B, called the inter-cluster data path (CrossBar). Referring to Fig. 1, L1, S1, M1, D1, L2, S2, M2 and D2 denote hardware functional units. When cluster A or cluster B uses the inter-cluster data path (CrossBar) for an inter-cluster data transfer at a given time point (cycle), the other issue slots of the chip can still issue other instructions at that time point. This mode of the TI TMS320C6x processor, in which a data transfer operation is carried out without the programmer writing it by hand, is the implicit transfer mode.

By adding a hardware inter-cluster data path (CrossBar), the TMS320C6x family simplifies software programming and compiler optimization, but it also increases the complexity of the hardware design. In particular, the wiring between the register file and the functional units grows considerably, chip power consumption and area grow with it, and the hardware cost of each chip rises as a result.

To reduce hardware cost, designs have appeared that eliminate the hardware inter-cluster data path. Instead, when data must be transferred between different clusters, an inter-cluster data transfer operation is inserted between the clusters that need to exchange data, and this operation is generally carried out by explicit transfer. Explicit transfer means that the programmer must transfer the data explicitly by entering several instructions, namely an instruction that sends the data and an instruction that receives it. Some of the methods commonly used for explicit transfer involve nonlinear operations such as intersection (∧) and union (∨). These nonlinear operations usually cannot be implemented in pure C code and have to be realized through third-party software such as Matlab.

Other methods can only obtain a locally optimal solution. A greedy algorithm, for example, considers only the instructions that can currently be scheduled. It first selects some instructions from the program and assigns them to one cluster, then selects other instructions from the remaining program and assigns them to other clusters, and checks which instructions in different clusters require inter-cluster transfer operations. If instructions that could have been placed in the same cluster were placed in different clusters, their cluster assignment is adjusted so that they end up in the same cluster. This is repeated until the best inter-cluster instruction allocation scheme the algorithm can find is reached. Because the greedy algorithm searches only a local region, it obtains a suboptimal (sub-optimal) solution rather than the globally optimal solution.

A clustered processor that transfers data between clusters through inter-cluster data transfer operations pays a high execution cost, because executing such an operation interrupts instruction execution in all clusters. To improve instruction execution efficiency, as many clusters as possible should execute instructions simultaneously while as few inter-cluster data transfer operations as possible are introduced. Existing instruction allocation methods either have to perform nonlinear operations indirectly through a numerical computation library such as Matlab, or search only a local region and cannot obtain the best inter-cluster instruction allocation scheme. In both cases the instruction execution efficiency of the clustered processor cannot be fully exploited.

Summary of the invention

The invention provides a method and device for inserting inter-cluster data transfer operations that avoid nonlinear operations on the one hand and obtain a globally optimal solution on the other. The best inter-cluster instruction allocation scheme is obtained, so that as many clusters as possible execute instructions simultaneously while as few inter-cluster data transfer operations as possible are introduced, which exploits the instruction execution efficiency of the clustered processor to the greatest extent.

An embodiment of the invention provides a method for inserting inter-cluster data transfer operations, comprising the following steps:

inputting a source program to be allocated;

obtaining, according to a pre-established system of fully linear constraint equations, the execution times of the instructions in the source program under different allocation schemes across the clusters of the processor, and finding the shortest execution time among the execution times obtained;

determining, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether an inter-cluster data transfer operation is to be inserted between dependent instructions.

An embodiment of the invention also provides a device for inserting inter-cluster data transfer operations, comprising:

an input module, configured to input a source program to be allocated;

an execution time processing module, configured to obtain, according to a pre-established system of fully linear constraint equations, the execution times of the instructions in the source program under different allocation schemes across the clusters of the processor, and to find the shortest execution time among the execution times obtained;

an instruction allocation module, configured to determine, according to the dependences between the instructions of the source program and the allocation scheme corresponding to the shortest execution time found by the execution time processing module, whether an inter-cluster data transfer operation is to be inserted between dependent instructions.

As can be seen from the above technical solutions, the shortest execution time of the instructions in the source program under different allocation schemes across the clusters of the processor is obtained according to the pre-established system of fully linear constraint equations, and the allocation scheme corresponding to that shortest execution time determines whether an inter-cluster data transfer operation is inserted between dependent instructions. Compared with the prior art, the scheme of the invention relies entirely on linear operations to allocate instructions; no nonlinear operation is introduced, and therefore no third-party software is needed. In addition, because the invention works on the inter-cluster allocation of all instructions of the source program, the best inter-cluster instruction allocation scheme is obtained from a global perspective, which exploits the instruction execution efficiency of the clustered processor to the greatest extent.

Description of drawings

Fig. 1 is a structural diagram of the TI TMS320C6x processor;

Fig. 2 is a flowchart of embodiment one of the inter-cluster data transfer operation insertion method of the invention;

Fig. 3 is a schematic diagram of the original data dependency graph and the simplified data dependency graph;

Fig. 4 is a flowchart of embodiment two of the inter-cluster data transfer operation insertion method of the invention;

Fig. 5 is a flowchart of embodiment three of the inter-cluster data transfer operation insertion method of the invention;

Fig. 6 is a schematic diagram of the relation between the first auxiliary variable A_k and the second auxiliary variable F_k;

Fig. 7 is a block diagram of the inter-cluster data transfer operation insertion device of the invention.

Embodiment

For convenience, several notation symbols are introduced in the following discussion. Unless otherwise stated, the same symbol appearing below has the meaning given in Table 1.

Table 1

To make the purpose, technical solutions and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.

Fig. 2 is a flowchart of embodiment one of the inter-cluster data transfer operation insertion method of the invention. As shown in Fig. 2, this embodiment comprises the following steps:

Step 201: input a source program to be allocated.

The source program may comprise at least one pair of instructions with a dependence.

Step 202: establish the system of fully linear constraint equations according to how dependent instructions are allocated among the clusters of the processor.

In this embodiment, to avoid the nonlinear operations of the prior art, the system can be established, for example but not exclusively, by combining the linear constraints that may arise when every two dependent instructions are assigned to the same cluster or to different clusters. The system comprises several linear constraint equations that reflect the relations that may exist between two instructions.

Step 203: obtain, according to the system of fully linear constraint equations, the execution times of the instructions in the source program under different allocation schemes across the clusters of the processor, and find the shortest execution time among the execution times obtained.

To obtain the shortest execution time of the source program quickly, an initial estimated execution time for running the whole source program can be obtained in advance, and the system of fully linear constraint equations for allocating the instructions among the clusters is then solved within this initial estimate. Specifically, in this embodiment, step 203 may comprise the following substeps:

Substep 203-1: obtain the initial estimated execution time of the source program.

The initial estimated execution time, that is, a rough value of the time needed to run the whole source program, can be obtained with the bottom-up algorithm (Bottom Up Close), list scheduling (List Schedule), or the unified cluster assignment and scheduling algorithm (UAS, Unified Assign and Schedule). This preliminary estimate is denoted MaxT. It is usually not optimal and has to be converged to the shortest time by repeatedly solving the system of fully linear constraint equations.

Substep 203-2: solve the pre-established system of fully linear constraint equations for the estimated execution time; if the system has a solution, shorten the estimated execution time by a predetermined step and solve the system again, until the system has no solution.

The preliminary estimate may not be optimal: it may be longer than the time needed when the instructions of the source program are combined in the optimal order. The shortest execution time of the source program is therefore obtained by iteratively solving the system of fully linear constraint equations and checking whether it has a solution. As long as the system still has a solution, the corresponding execution time is not yet optimal, and the estimate MaxT is updated by the predetermined step to obtain a new estimate MaxT.

Substep 203-3: take, as the shortest execution time, the estimate immediately preceding the estimated execution time at which the system of fully linear constraint equations has no solution.

The first time the system of fully linear constraint equations has no solution, the instruction sequence of the source program has reached its optimal allocation, so the estimate immediately preceding the one at which the system first has no solution can be taken as the shortest execution time.
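Substeps 203-1 to 203-3 can be sketched as a simple tightening loop. This is only an illustrative sketch: the function name `shortest_execution_time` and the callback `is_feasible` are hypothetical, and `is_feasible(max_t)` stands in for whatever solver decides whether the constraint system has a solution for a given MaxT.

```python
from typing import Callable


def shortest_execution_time(initial_max_t: int,
                            is_feasible: Callable[[int], bool],
                            step: int = 1) -> int:
    """Shrink the estimate MaxT until the constraint system has no solution.

    is_feasible is a hypothetical stand-in for a solver call that reports
    whether the constraint system is solvable for a given MaxT.
    """
    if not is_feasible(initial_max_t):
        raise ValueError("the initial estimate must admit a solution")
    max_t = initial_max_t
    # Keep shortening while the shorter estimate is still solvable.
    while max_t - step > 0 and is_feasible(max_t - step):
        max_t -= step
    # max_t is the last estimate before the system became unsolvable.
    return max_t
```

With a step larger than 1 the loop may stop above the true minimum, so the step size trades solver calls against precision.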

In this embodiment, when the shortest execution time is obtained and the effect of cycles in the data dependency graph G is not considered, the execution time of the instruction that is executed last can be used as the objective function; that is, the initial estimated execution time is determined from the execution time of the instruction that is executed last.

Step 204: determine, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether an inter-cluster data transfer operation is to be inserted between dependent instructions.

The preceding steps yield the shortest execution time of the source program, which corresponds to the optimal solution of the system of fully linear constraint equations. Because this system is built from the linear constraints that arise when dependent instructions are allocated to clusters, its optimal solution corresponds to the best allocation of the instructions among the clusters. Combining this best allocation scheme with the dependences between the instructions of the source program determines whether an inter-cluster data transfer operation has to be inserted between two instructions. For example, if two dependent instructions are assigned to different clusters, an inter-cluster data transfer operation has to be inserted between them; otherwise it is not needed.
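The decision rule of step 204 can be illustrated with a short sketch. The names `edges_needing_transfer`, `dep_edges` and `cluster_of` are assumptions made for this example; `cluster_of` represents an allocation scheme that has already been found.

```python
def edges_needing_transfer(dep_edges, cluster_of):
    """Return the dependence edges whose two instructions sit in different clusters.

    dep_edges:  iterable of (producer, consumer) instruction indices
    cluster_of: mapping from instruction index to its assigned cluster
    """
    return [(src, dst) for src, dst in dep_edges
            if cluster_of[src] != cluster_of[dst]]
```

For instance, with edges 0 → 1 and 1 → 2, instructions 0 and 1 in cluster 0 and instruction 2 in cluster 1, only the edge 1 → 2 requires an inter-cluster data transfer operation.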

In this embodiment, the dependences between instructions can be represented as a data dependency graph. Before step 204, the method may further comprise: inputting a data dependency graph that describes the dependences between the instructions of the source program. Inputting the data dependency graph may specifically comprise: simplifying the data dependency graph, and inputting the simplified data dependency graph. Because a constraint equation of the invention has to be set up for every dependence edge of the graph, the number of dependence edges in the data dependency graph strongly affects the efficiency of the algorithm: the more constraint equations, the lower the efficiency. By eliminating redundant dependence edges to obtain a simplified data dependency graph, the invention can further improve processing efficiency. A redundant dependence edge is defined as follows. In the original data dependency graph, let OPx -> OPy be a dependence edge, where OPx and OPy each denote a node of the graph, corresponding to the x-th and the y-th instruction respectively. If the graph contains another path from OPx to OPy that passes through at least one node other than x and y, the dependence edge OPx -> OPy is redundant. As shown in Fig. 3, from node 2 to node 5 there are the path 2-5 and the path 2-3-5, so the dependence edge 2-5 is redundant. The difference between the data dependency graphs before and after simplification can be seen directly by comparing the corresponding dependence matrices. Because the data dependency graph is simplified, the number of constraint equations to be set up decreases accordingly. In Fig. 3, for example, three redundant edges are eliminated by the simplification, so at least 3 constraint equations are saved.
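The elimination of redundant dependence edges can be sketched as follows. This is a minimal illustration that assumes the dependency graph is acyclic; the function name `remove_redundant_edges` is hypothetical, and the patent does not prescribe a particular search procedure.

```python
def remove_redundant_edges(edges):
    """Drop every edge (u, v) for which another path u -> ... -> v exists.

    Assumes an acyclic dependency graph given as a list of (u, v) pairs.
    """
    succ = {}
    for u, v in edges:
        succ.setdefault(u, set()).add(v)

    def has_indirect_path(u, v):
        # Depth-first search from the successors of u other than v itself,
        # so any path found passes through at least one intermediate node.
        stack = [w for w in succ.get(u, ()) if w != v]
        seen = {u}
        while stack:
            node = stack.pop()
            if node == v:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(succ.get(node, ()))
        return False

    return [(u, v) for u, v in edges if not has_indirect_path(u, v)]
```

Applied to the edges 2 → 3, 3 → 5 and 2 → 5 mentioned for Fig. 3, the sketch removes 2 → 5 and keeps the other two.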

In this embodiment, an inter-cluster data transfer operation can be a simple data exchange between two instructions assigned to different clusters, or a call to a library routine, and so on. An inter-cluster data transfer operation can be carried over the bus of the processor or through a cache.

In the technical solution of this embodiment, the system of fully linear constraint equations is established from the linear constraints that may arise when dependent instructions are allocated to the same cluster or to different clusters of the processor, and repeatedly solving this system determines whether an inter-cluster data transfer operation is inserted between dependent instructions. Compared with the prior art, no nonlinear operation is introduced and hence no third-party software is needed; the method can be implemented entirely in C code. In addition, in this embodiment the execution time of the source program is converged step by step by solving the system of fully linear constraint equations until the shortest execution time of the source program is found. The globally optimal solution is thereby found, and with it the optimal allocation of the instructions among the clusters, which exploits the instruction execution efficiency of the clustered processor to the greatest extent.

Fig. 4 is a flowchart of embodiment two of the inter-cluster data transfer operation insertion method of the invention. As shown in Fig. 4, this embodiment may comprise the following steps:

Step 401: input a source program to be allocated.

Step 402: establish the system of fully linear constraint equations according to how dependent instructions are allocated among the clusters of the processor.

An existence variable, denoted X(i, j, k), is defined to indicate whether instruction i is present in cluster k at time j, and it takes different values accordingly. If two dependent instructions are assigned to different clusters, the sum of their corresponding existence variables X(i, j, k) must satisfy a certain linear constraint; for example, the sum can only take certain specific integer values. This embodiment can thus set up fully linear constraint equations between every two dependent instructions and thereby establish the system of fully linear constraint equations. Accordingly, step 402 may specifically comprise:

Substep 402-1: define, for each single instruction, the existence variables that describe its allocation among the clusters within the estimated execution time of the source program;

Substep 402-2: obtain the existence variables of a pair of dependent instructions in the same cluster or in different clusters; and

Substep 402-3: establish the system of fully linear constraint equations according to the sums of the existence variables.

Step 403: obtain, according to the system of fully linear constraint equations, the execution times of the instructions in the source program under different allocation schemes across the clusters of the processor, and find the shortest execution time among the execution times obtained.

Step 404: determine, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether an inter-cluster data transfer operation is to be inserted between dependent instructions.

For the specific implementation of steps 403 and 404, refer to the relevant parts of embodiment one; they are not repeated here.

This embodiment defines an existence variable that describes whether an instruction is allocated to a given cluster at a given time. From the sums of the existence variables of two dependent instructions, assigned either to the same cluster or to different clusters, fully linear constraints are set up; a fully linear constraint equation is established for every pair of dependent instructions, and thereby the complete linear system. By repeatedly solving this system, the insertion of inter-cluster data transfer operations is realized and the optimal allocation of the instructions among the clusters is found, which exploits the instruction execution efficiency of the clustered processor to the greatest extent.

Fig. 5 is a flowchart of embodiment three of the inter-cluster data transfer operation insertion method of the invention. As shown in Fig. 5, this embodiment may comprise the following steps:

Step 501: input a source program to be allocated.

The source program comprises at least one pair of instructions with a dependence.

Step 502: establish the system of fully linear constraint equations according to how dependent instructions are allocated among the clusters of the processor.

In this embodiment, the system of fully linear constraint equations may comprise a first, a second and a third system of linear constraint equations. The first system can be established from the existence variables of embodiment two and describes the cluster presence of two dependent instructions. The second system can describe the time point of the inter-cluster data transfer operation between two dependent instructions. The third system can express that the time point at which an inter-cluster data transfer operation takes place excludes the execution of other instructions.

In this embodiment, X(i, j, k) = 1 or 0 indicates whether instruction i is present in cluster k at time j: X(i, j, k) = 1 means that instruction i is present in cluster k at time j, and X(i, j, k) = 0 means that it is not. On this basis, the first system of linear constraint equations may comprise the following linear equations:

$$A_k = \sum_{j=E_{i1}}^{L_{i1}} X_{(i1,\,j,\,k)} + \sum_{j=E_{i2}}^{L_{i2}} X_{(i2,\,j,\,k)}, \qquad A_k \in \{0, 1, 2\}, \quad k = 0, 1, \ldots, C-1 \tag{1}$$

$$\begin{cases} 4F_k - A_k \le 2 \\ F_k - A_k \ge -1 \\ F_k \in \{0, 1\} \\ k = 0, 1, \ldots, C-1 \end{cases} \tag{2}$$

$$S_i = \sum_{k=0}^{C-1} F_k, \qquad S_i \in \{0, 1\} \tag{3}$$

In formula (1), X(i1, j, k) and X(i2, j, k) indicate whether instructions i1 and i2, respectively, are present in cluster k at time j; C is the number of clusters of the clustered processor; E_i1 and L_i1 are the earliest and the latest execution time of instruction i1, and E_i2 and L_i2 are the earliest and the latest execution time of instruction i2. The earliest and latest execution times can be calculated, for example, from empirical values and processor performance parameters combined with the topological order of the nodes of the activity-on-vertex (AOV, Activity On Vertex) network; this reduces the time needed to solve the fully linear constraint equations and improves the efficiency of the algorithm. Formula (1) therefore means the following: if neither instruction i1 nor instruction i2 is in cluster k, the sum of their existence variables in cluster k over the windows [E_i1, L_i1] and [E_i2, L_i2] is 0. In this embodiment, this sum of existence variables in cluster k is called the first auxiliary variable, an integer variable denoted A_k. Thus A_k = 0 if neither i1 nor i2 is in cluster k, A_k = 1 if exactly one of them is, and A_k = 2 if both are.
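One possible way to derive the windows [E_i, L_i] from a topological order is sketched below. This is an assumption-laden illustration rather than the patented procedure: it assumes every instruction takes one time point, ignores the extra time point an inter-cluster transfer may need, and uses the hypothetical function name `time_windows`. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def time_windows(num_instr, edges, max_t):
    """Earliest/latest time point per instruction, assuming unit latency.

    edges: (producer, consumer) pairs of an acyclic dependency graph
    max_t: estimated execution time MaxT; time points run 0 .. max_t - 1
    """
    preds = {i: set() for i in range(num_instr)}
    succs = {i: set() for i in range(num_instr)}
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    order = list(TopologicalSorter(preds).static_order())

    earliest = {}
    for v in order:  # producers are visited before consumers
        earliest[v] = max((earliest[u] + 1 for u in preds[v]), default=0)

    latest = {}
    for v in reversed(order):  # consumers are visited before producers
        latest[v] = min((latest[w] - 1 for w in succs[v]), default=max_t - 1)

    return earliest, latest
```

Narrower windows mean fewer existence variables X(i, j, k) in formula (1), which is the efficiency gain the paragraph above refers to.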

For example, suppose a clustered processor has 3 clusters (labeled cluster 0, cluster 1 and cluster 2) and there are two instructions i0 and i1 with the dependence i0 → i1. In this example, the symbols of Table 1 take the following values: data dependency graph G, E=2, N=2, MaxT=4 (ms), C=3. The distribution of the existence variables X(i, j, k) is shown in Table 2.

Table 2

Formula (1) then becomes:

A_0 = (X(0,0,0) + X(0,1,0) + X(0,2,0) + X(0,3,0)) + (X(1,0,0) + X(1,1,0) + X(1,2,0) + X(1,3,0))

A_1 = (X(0,0,1) + X(0,1,1) + X(0,2,1) + X(0,3,1)) + (X(1,0,1) + X(1,1,1) + X(1,2,1) + X(1,3,1))

A_2 = (X(0,0,2) + X(0,1,2) + X(0,2,2) + X(0,3,2)) + (X(1,0,2) + X(1,1,2) + X(1,2,2) + X(1,3,2))

Formula (2) defines a second auxiliary variable, an integer variable denoted F_k, and establishes a fully linear constraint between F_k and the first auxiliary variable A_k such that F_k = 1 when A_k = 2 and F_k = 0 otherwise. When instructions i1 and i2 are not both in cluster k, A_k can only equal 0 or 1; only when both i1 and i2 are in cluster k does A_k equal 2. Formula (2) is built on this property of A_k.

As shown in Fig. 6, formula (2) can be established as follows. Taking (2, 1) as the endpoint, draw a ray W1 through the points (2, 1) and (0, 0.5); taking (2, 1) as the endpoint again, draw a ray W2 through the points (2, 1) and (1, 0). W1: F_k - 0.25 × A_k = 0.5; W2: F_k = A_k - 1. The region between these two rays along the F axis is exactly what formula (2) describes.
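The link between the two rays and the inequalities of formula (2) can be written out as a short worked derivation. It assumes, as the figure description suggests, that the admissible region lies on or below W1 and on or above W2.

```latex
\begin{aligned}
\text{on or below } W_1:&\quad F_k \le 0.25\,A_k + 0.5 \;\Longleftrightarrow\; 4F_k - A_k \le 2,\\
\text{on or above } W_2:&\quad F_k \ge A_k - 1 \;\Longleftrightarrow\; F_k - A_k \ge -1.\\[4pt]
A_k = 0:&\quad F_k \le 0.5,\; F_k \ge -1 \;\Rightarrow\; F_k = 0,\\
A_k = 1:&\quad F_k \le 0.75,\; F_k \ge 0 \;\Rightarrow\; F_k = 0,\\
A_k = 2:&\quad F_k \le 1,\; F_k \ge 1 \;\Rightarrow\; F_k = 1.
\end{aligned}
```

With F_k restricted to {0, 1}, each value of A_k therefore leaves exactly one admissible F_k, and F_k = 1 only when A_k = 2.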

These constraints yield the values of F_0, F_1 and F_2. When i0 and i1 are in the same cluster, exactly one of A_0, A_1 and A_2 has the value 2; the corresponding F_k is 1 and all the remaining F_k are 0.

In this embodiment, the specific parameters of formula (2) are not unique: formula (2) can be constructed by taking (2, 1) as one endpoint and any point in [0, 1) on the F axis as the other endpoint.

Formula (3) uses a homogeneity variable S_i to describe whether an inter-cluster data transfer operation is needed. Each dependence edge ei in the data dependency graph has a corresponding homogeneity variable S_i. When the two instructions (i1, i2) are in the same cluster k, the corresponding second auxiliary variable F_k = 1, so S_i = 1 and no data transfer operation is needed between the two instructions (i1 → i2) of dependence edge ei; otherwise S_i = 0, which indicates that a data transfer operation is needed.
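The chain from A_k through F_k to S_i can be checked numerically for one dependence edge. The sketch below is illustrative only; the function names are hypothetical, and the cluster of each instruction is passed in directly instead of being read from existence variables X(i, j, k).

```python
def second_auxiliary(a_k):
    """Return the unique F_k in {0, 1} allowed by formula (2) for a given A_k."""
    allowed = [f for f in (0, 1) if 4 * f - a_k <= 2 and f - a_k >= -1]
    if len(allowed) != 1:
        raise ValueError("A_k must be 0, 1 or 2")
    return allowed[0]


def homogeneity(cluster_i1, cluster_i2, num_clusters):
    """Compute (A, F, S_i) for one dependence edge i1 -> i2.

    A[k] counts how many of the two instructions sit in cluster k,
    which is what formula (1) yields once each instruction is issued once.
    """
    a = [int(cluster_i1 == k) + int(cluster_i2 == k) for k in range(num_clusters)]
    f = [second_auxiliary(a_k) for a_k in a]
    return a, f, sum(f)  # sum(f) is S_i of formula (3)
```

For the 3-cluster example, placing both instructions in cluster 1 gives A = [0, 2, 0], F = [0, 1, 0] and S_i = 1, while placing them in clusters 0 and 2 gives A = [1, 0, 1], F = [0, 0, 0] and S_i = 0.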

In this embodiment, the second system of linear constraint equations may comprise the following linear equations:

$$\begin{cases} \displaystyle\sum_{j=0}^{MaxT-1} j \times Y_{(j,\,ei)} > \sum_{j=E_{i1}}^{L_{i1}} \sum_{k=0}^{C-1} j \times X_{(i1,\,j,\,k)} \\[8pt] \displaystyle\sum_{j=0}^{MaxT-1} j \times Y_{(j,\,ei)} < \sum_{j=E_{i2}}^{L_{i2}} \sum_{k=0}^{C-1} j \times X_{(i2,\,j,\,k)} \end{cases} \tag{4}$$

$$\sum_{j=0}^{MaxT-1} Y_{(j,\,ei)} = 1 - S_i, \qquad Y_{(j,\,ei)} \in \{0, 1\} \tag{5}$$

In formulas (4) and (5), Y_(j, ei) is a timing variable: if a data transfer operation is needed between the two instructions of dependence edge ei, Y_(j, ei) describes the time point j at which that operation takes place. The timing variable Y_(j, ei) and the homogeneity variable S_i are linked by the linear constraint of formula (5), and Y_(j, ei) takes the value 1 or 0.

Formula (4) states: for each dependence edge ei: i1 → i2, if an inter-cluster data transfer operation takes place at time point j, then the time point j of that operation must be later than the time point at which instruction i1 executes and earlier than the time point at which instruction i2 executes.

Formula (5) states: if dependence edge ei requires an inter-cluster data transfer operation, then exactly one Y_(j, ei) equals 1 within the time points 0 to MaxT-1; otherwise all Y_(j, ei) equal 0 within the time points 0 to MaxT-1.
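Formulas (4) and (5) can be checked for a single dependence edge with a small sketch. The function name `transfer_slot_ok` is hypothetical, and one interpretation is made explicit: following the wording "if an inter-cluster data transfer operation takes place", the ordering test of formula (4) is applied only when a transfer is actually required (S_i = 0).

```python
def transfer_slot_ok(t_i1, t_i2, y, s_i):
    """Check formulas (4) and (5) for one dependence edge i1 -> i2.

    t_i1, t_i2: time points at which i1 and i2 are issued
    y:          list of 0/1 timing variables Y(j, ei) for j = 0 .. MaxT-1
    s_i:        homogeneity variable (1 = same cluster, 0 = transfer needed)
    """
    # Formula (5): exactly one transfer slot if needed, none otherwise.
    if any(v not in (0, 1) for v in y) or sum(y) != 1 - s_i:
        return False
    if s_i == 1:
        return True  # assumption: no ordering test when no transfer exists
    # Formula (4): the transfer lies strictly between producer and consumer.
    t_transfer = sum(j * v for j, v in enumerate(y))
    return t_i1 < t_transfer < t_i2
```

With MaxT = 4, issuing i1 at time 0, the transfer at time 1 and i2 at time 2 satisfies both formulas; issuing i2 at time 1 instead leaves no slot for the transfer.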

For a clustered processor with C = 3 clusters, the corresponding timing variables are shown in Table 3.

Table 3

Cycle	e0	e1	e2
T0	Y(0,0)	Y(1,0)	Y(2,0)
T1	Y(0,1)	Y(1,1)	Y(2,1)
T2	Y(0,2)	Y(1,2)	Y(2,2)
T3	Y(0,3)	Y(1,3)	Y(2,3)

The third system of linear constraint equations may comprise the following linear equation:

$$\sum_{i=0}^{N} \sum_{k=0}^{C-1} X_{(i,\,j,\,k)} + C \times \sum_{ei=0}^{E-1} Y_{(j,\,ei)} \le C, \qquad j = 0, 1, \ldots, MaxT-1 \tag{6}$$

Since the sub-clustering processor bunch between data transmission be explicit carrying out, and every data transfer operation can interrupt the execution of instruction in all bunches when carrying out.So data transfer operation when carrying out, need determine not have other instructions to carry out at identical time point.In like manner, if when having other instruction to carry out at time point j place, then data transfer operation can not be carried out at time point j.Therefore, make up linear restriction formula (6).

Linear constraint formula (6) expresses the following: at any time point within the range 0 to MaxT-1, if an inter-cluster data transfer operation exists at time point j, no other instruction may be executed at that time point. The formula counts the existence of every instruction in every cluster at time point j. If the count is at least 1, other instructions execute at time point j and the inter-cluster data transfer operation cannot be carried out; if the count is 0, no other instruction executes at time point j and the data transfer operation can be carried out.
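The exclusion expressed by formula (6) can be checked mechanically. The sketch below is a minimal Python illustration under stated assumptions: the dictionary encodings of X_{(i, j, k)} and Y_{(j, ei)} and the function name are hypothetical choices for this example, and missing keys are treated as 0.

```python
def check_exclusive_transfer(x, y, num_clusters):
    """Return the time points j at which formula (6) is violated.

    x : dict mapping (i, j, k) -> 0/1, existence variables X_(i, j, k)
    y : dict mapping (j, ei) -> 0/1, sequential variables Y_(j, ei)
    num_clusters : the number of clusters C
    """
    # Collect every time point mentioned by either set of variables.
    times = {j for (_, j, _) in x} | {j for (j, _) in y}
    violations = []
    for j in sorted(times):
        # Number of instructions present in any cluster at time point j.
        issued = sum(v for (_, jj, _), v in x.items() if jj == j)
        # Number of inter-cluster transfers placed at time point j.
        transfers = sum(v for (jj, _), v in y.items() if jj == j)
        # Formula (6): issued + C * transfers <= C.
        if issued + num_clusters * transfers > num_clusters:
            violations.append(j)
    return violations
```

Because each transfer contributes C to the left-hand side, a single transfer at time point j already uses the whole budget C, so no instruction can share that time point.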

Step 503: according to the complete linear constraint system, obtain the execution times of the instructions in the source program under the different allocation schemes across the clusters of the processor, and find the shortest execution time among the execution times obtained.

The shortest execution time is determined by minimizing, over the different allocation schemes, the execution time of the instruction that is executed last. This can be expressed as the following constraint relation:

$$\min \left( \sum_{j = 0}^{MaxT - 1} \sum_{k = 0}^{C - 1} X_{(maxlen, j, k)} \right) \tag{7}$$

In formula (7), maxlen denotes the instruction that is executed last.

Step 504: obtain, from the existence variables, a homogeneity variable that describes whether an inter-cluster data transfer operation is needed;

Step 505: determine, according to the homogeneity variable, whether an inter-cluster data transfer operation is needed between the instructions that have a dependence.
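Steps 504 and 505 can be sketched as follows. This is a minimal Python illustration, not the patented implementation: it assumes the cluster of each instruction is already known, derives the auxiliary variables A_k and F_k from the constraints 4F_k - A_k ≤ 2 and F_k - A_k ≥ -1 stated for the first linear constraint system, and sums F_k to obtain the homogeneity variable S_i. The function names and arguments are hypothetical.

```python
def homogeneity_variable(cluster_i1, cluster_i2, num_clusters):
    """Derive S_i for a dependence edge i1 -> i2 from the clusters used.

    A_k counts how many of the two instructions are placed in cluster k.
    F_k is the 0/1 value satisfying 4*F_k - A_k <= 2 and F_k - A_k >= -1.
    S_i is the sum of F_k over all clusters.
    """
    s = 0
    for k in range(num_clusters):
        a_k = int(cluster_i1 == k) + int(cluster_i2 == k)
        feasible = [f for f in (0, 1)
                    if 4 * f - a_k <= 2 and f - a_k >= -1]
        # For a_k in {0, 1, 2} exactly one value of F_k is feasible:
        # F_k = 1 only when both instructions sit in cluster k.
        assert len(feasible) == 1
        s += feasible[0]
    return s


def needs_transfer(cluster_i1, cluster_i2, num_clusters):
    """Step 505: a transfer is needed exactly when S_i == 0."""
    return homogeneity_variable(cluster_i1, cluster_i2, num_clusters) == 0
```

With C = 3, two dependent instructions both placed in cluster 1 give S_i = 1 and need no transfer, while placing them in clusters 0 and 2 gives S_i = 0, so a transfer must be inserted.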

In the above embodiment, when the complete linear constraint system is constructed, two further conditions may also be taken into account: every instruction of the source program must be issued once and only once, and two instructions that have a dependence must be executed in the proper order. Specifically, formulas (8) and (9) below may be established to constrain these conditions.

$$\sum_{j = 0}^{MaxT} \sum_{k = 0}^{C} X_{(i, j, k)} = 1, \quad i \in N \tag{8}$$

$$\sum_{j = E_{i1}}^{L_{i1}} \sum_{k = 0}^{C - 1} j \times X_{(i1, j, k)} + L_{i1 \to i2} \leq \sum_{j = E_{i2}}^{L_{i2}} \sum_{k = 0}^{C - 1} j \times X_{(i2, j, k)} \tag{9}$$

$$L_{i1 \to i2} \geq 1$$

In the formula, L_{i1 → i2} denotes the latency, that is, the number of delay time points. Take the data read instruction load as an example: its form is r1 = [memory], where r1 is a register and memory is a memory address. If its latency is 3, the instruction needs 3 time points to read the data into r1; if the next instruction is to use the value of r1, it can do so only at the 4th time point. Depending on how the scheme of the present invention is applied, the latency may be chosen as any number greater than or equal to 1.

As shown in formula (8), within the range up to MaxT-1 the existence variable X_{(i, j, k)} is used to count the existence of each instruction i at each time point j in each cluster k. For example, if the second instruction executes in the 2nd cluster at the 2nd time point, its existence value is 1 for the 2nd cluster at the 2nd time point and 0 for every other time point and cluster. The existence values are summed over all time points and all clusters; a sum of 1 shows that the instruction has been issued, and issued only once.

As shown in formula (9), the existence of each instruction in each cluster is counted within the time interval [E_i, L_i] in which the instruction may be executed. An instruction must execute before any other instruction that depends on its result. For example, if i1 and i2 have a dependence and the execution of i2 depends on the result of i1, then i1 must execute before i2. Therefore the product of the time point at which i1 executes and its corresponding existence value, plus the latency, must not exceed the product of the time point at which i2 executes and its corresponding existence value.
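Formulas (8) and (9) can be checked for a given schedule with a few lines of Python. The sketch below is illustrative only; the dictionary encoding of X_{(i, j, k)} and both function names are assumptions made for this example.

```python
def issued_exactly_once(x, instr):
    """Formula (8): the existence values of one instruction sum to 1.

    x : dict mapping (i, j, k) -> 0/1, existence variables X_(i, j, k)
    """
    return sum(v for (i, _, _), v in x.items() if i == instr) == 1


def respects_latency(x, i1, i2, latency):
    """Formula (9): time(i1) + L_(i1 -> i2) <= time(i2), with L >= 1."""
    if latency < 1:
        raise ValueError("latency must be at least 1")
    # Summing j * X over all time points and clusters yields the time
    # point at which the instruction executes (given formula (8) holds).
    t1 = sum(j * v for (i, j, _), v in x.items() if i == i1)
    t2 = sum(j * v for (i, j, _), v in x.items() if i == i2)
    return t1 + latency <= t2
```

For a producer at time point 0 with latency 3, a consumer placed at time point 3 satisfies formula (9), whereas a consumer placed at time point 2 does not.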

In the above embodiment, the complete linear constraint system is established from the existence variables that describe how instructions are allocated to the clusters, the shortest execution time of the source program is obtained by solving this complete linear constraint system, and how to insert inter-cluster data transfer operations can then be determined on the basis of this shortest execution time. From a geometric point of view, the equality and inequality constraints together form a closed or semi-closed polyhedron, and the globally optimal solution lies at a topmost or bottommost vertex of this polyhedron; that vertex corresponds to the shortest execution time.

An embodiment of the invention also provides a device for inserting inter-cluster data transfer operations. As shown in Figure 7, the device 700 comprises:

An input module 701, configured to input the source program to be allocated;

An execution time processing module 702, configured to obtain, according to the complete linear constraint system established in advance, the execution times of the instructions in the source program under the different allocation schemes across the clusters of the processor, and to find the shortest execution time among the execution times obtained;

An instruction allocation module 703, configured to determine, according to the dependences between the instructions of the source program and the allocation scheme corresponding to the shortest execution time found by the execution time processing module 702, whether to insert an inter-cluster data transfer operation between instructions that have a dependence.

The execution time processing module 702 may comprise an initial estimation unit, a solving unit, and an iterative processing unit. The initial estimation unit obtains the initial estimated execution time of the source program. The solving unit solves the complete linear constraint system established in advance according to the estimated execution time supplied to it. If a solution exists, the iterative processing unit shortens the estimated execution time by a predetermined step and passes the shortened estimate to the solving unit. If no solution exists, the iterative processing unit outputs, as the shortest execution time, the estimated execution time immediately preceding the one for which the complete linear constraint system has no solution.
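The cooperation of the three units can be sketched as a simple loop. The following Python example is a minimal illustration under stated assumptions: the callable has_solution stands in for the solving unit and is hypothetical, as are the function name and the choice to return None when even the initial estimate has no solution.

```python
def shortest_execution_time(initial_estimate, step, has_solution):
    """Shrink the estimated execution time until no solution exists.

    initial_estimate : initial estimated execution time (MaxT)
    step             : predetermined positive step used for shortening
    has_solution     : callable taking an estimate and returning True when
                       the complete linear constraint system is solvable
                       for it (a hypothetical stand-in for a real solver)
    Returns the last estimate that had a solution, or None if even the
    initial estimate has none.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    best = None
    estimate = initial_estimate
    # Keep shortening while the constraint system remains solvable.
    while estimate > 0 and has_solution(estimate):
        best = estimate
        estimate -= step
    return best
```

With a step larger than 1, the loop can stop above the true minimum by up to step - 1 time points, so a step of 1 is the choice that guarantees the returned estimate is the shortest feasible one.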

The device may further comprise a linear constraint module. The linear constraint module establishes the complete linear constraint system according to how instructions that have a dependence are allocated among the clusters of the processor.

The linear constraint module may specifically comprise a first linear constraint unit, a second linear constraint unit, and a third linear constraint unit. The first linear constraint unit establishes a first linear constraint system describing whether two instructions that have a dependence are present in the same cluster. The second linear constraint unit establishes a second linear constraint system describing the time point of the inter-cluster data transfer operation between two instructions that have a dependence. The third linear constraint unit establishes a third linear constraint system expressing that the time point at which an inter-cluster data transfer operation exists excludes the execution of other instructions.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for inserting inter-cluster data transfer operations, characterized by comprising the following steps:

Inputting a source program to be allocated;

Obtaining an initial estimated execution time of the source program;

Solving, according to the estimated execution time, a complete linear constraint system established in advance; if the complete linear constraint system has a solution, shortening the estimated execution time by a predetermined step and solving the complete linear constraint system again, until the complete linear constraint system has no solution;

Taking, as the shortest execution time, the estimated execution time immediately preceding the one for which the complete linear constraint system has no solution;

2. The method according to claim 1, characterized in that, before obtaining the initial estimated execution time of the source program, the method further comprises:

Establishing the complete linear constraint system according to how instructions that have a dependence are allocated among the clusters of the processor.

3. The method according to claim 2, characterized in that establishing the complete linear constraint system according to how instructions that have a dependence are allocated among the clusters comprises:

Defining existence variables describing the allocation of a single instruction among the clusters within the estimated execution time;

Obtaining the existence variables corresponding to a pair of instructions that have a dependence when they are allocated to the clusters;

Establishing the complete linear constraint system according to the summation relationships of the existence variables.

4. The method according to claim 3, characterized in that determining, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether to insert an inter-cluster data transfer operation between instructions that have a dependence comprises:

Obtaining, from the existence variables, a homogeneity variable that describes whether an inter-cluster data transfer operation is needed;

Determining, according to the homogeneity variable, whether an inter-cluster data transfer operation is needed between the instructions that have a dependence.

5. The method according to claim 4, characterized in that the existence variable takes the value 0 or 1.

6. The method according to claim 1, characterized in that, before determining whether to insert an inter-cluster data transfer operation between the instructions that have a dependence, the method further comprises: inputting a data dependency graph that describes the dependences between the instructions of the source program.

7. The method according to claim 6, characterized in that inputting the data dependency graph that describes the dependences between the instructions of the source program comprises:

Simplifying the data dependency graph;

Inputting the simplified data dependency graph.

8. The method according to claim 2, characterized in that the complete linear constraint system comprises:

A first linear constraint system describing the presence in a cluster of two instructions that have a dependence;

A second linear constraint system describing the time point of the inter-cluster data transfer operation between two instructions that have a dependence;

A third linear constraint system expressing that the time point at which an inter-cluster data transfer operation exists excludes the execution of other instructions.

9. The method according to claim 8, characterized in that the first linear constraint system comprises:

$$A_{k} = \sum_{j = E_{i1}}^{L_{i1}} X_{(i1, j, k)} + \sum_{j = E_{i2}}^{L_{i2}} X_{(i2, j, k)}, \quad A_{k} \in \{0, 1, 2\}, \quad k = 0, 1, \ldots, C - 1;$$

$$\begin{cases} 4 F_{k} - A_{k} \leq 2 \\ F_{k} - A_{k} \geq -1 \\ F_{k} \in \{0, 1\} \\ k = 0, 1, \ldots, C - 1 \end{cases};$$

and

$$S_{i} = \sum_{k = 0}^{C - 1} F_{k}, \quad S_{i} \in \{0, 1\};$$

where i1 and i2 denote two instructions that have a dependence; X_{(i, j, k)} = 1 indicates that instruction i is present in cluster k at time j, and X_{(i, j, k)} = 0 indicates that instruction i is not present in cluster k at time j; C is the number of clusters of the clustered processor; E_{i1} and E_{i2} denote the earliest execution times of instructions i1 and i2 respectively; L_{i1} and L_{i2} denote the latest execution times of instructions i1 and i2 respectively; A_k denotes a first auxiliary variable; F_k denotes a second auxiliary variable; S_i denotes the homogeneity variable; and k denotes the k-th cluster.

10. The method according to claim 9, characterized in that the second linear constraint system comprises:

$$\begin{cases} \sum_{j = 0}^{MaxT - 1} j \times Y_{(j, ei)} > \sum_{j = E_{i1}}^{L_{i1}} \sum_{k = 0}^{C - 1} j \times X_{(i1, j, k)} \\ \sum_{j = 0}^{MaxT - 1} j \times Y_{(j, ei)} < \sum_{j = E_{i2}}^{L_{i2}} \sum_{k = 0}^{C - 1} j \times X_{(i2, j, k)} \end{cases}$$

and

$$\sum_{j = 0}^{MaxT - 1} Y_{(j, ei)} = 1 - S_{i}, \quad Y_{(j, ei)} \in \{0, 1\},$$

where Y_{(j, ei)} is the sequential variable, j denotes time j, ei denotes the i-th dependence edge, and MaxT is the estimated execution time.

11. The method according to claim 10, characterized in that the third linear constraint system comprises:

$$\sum_{i = 0}^{N} \sum_{k = 0}^{C - 1} X_{(i, j, k)} + C \times \sum_{ei = 0}^{E - 1} Y_{(j, ei)} \leq C, \quad j = 0, 1, \ldots, MaxT - 1,$$

where MaxT is the estimated execution time, N denotes the number of instructions in the source program, E denotes the number of dependence edges in the data dependency graph, and ei denotes the i-th dependence edge.

12. A device for inserting inter-cluster data transfer operations, characterized by comprising:

An input module, configured to input a source program to be allocated;

An execution time processing module, configured to obtain, according to a complete linear constraint system established in advance, the execution times of the instructions in the source program under the different allocation schemes across the clusters of the processor, and to find the shortest execution time among the execution times obtained;

An instruction allocation module, configured to determine, according to the dependences between the instructions in the source program and the allocation scheme corresponding to the shortest execution time, whether to insert an inter-cluster data transfer operation between instructions that have a dependence;

The execution time processing module comprises:

An initial estimation unit, configured to obtain an initial estimated execution time of the source program;

A solving unit, configured to solve, according to the estimated execution time, the complete linear constraint system established in advance;

An iterative processing unit, configured to: if the complete linear constraint system has a solution, shorten the estimated execution time by a predetermined step and have the complete linear constraint system solved again, until the complete linear constraint system has no solution; and take, as the shortest execution time, the estimated execution time immediately preceding the one for which the complete linear constraint system has no solution.

13. The device according to claim 12, characterized in that the device further comprises:

A linear constraint module, configured to establish the complete linear constraint system according to how instructions that have a dependence are allocated among the clusters of the processor.

14. The device according to claim 13, characterized in that the linear constraint module comprises:

A first linear constraint unit, configured to establish a first linear constraint system describing the presence in a cluster of two instructions that have a dependence;

A second linear constraint unit, configured to establish a second linear constraint system describing the time point of the inter-cluster data transfer operation between two instructions that have a dependence;

A third linear constraint unit, configured to establish a third linear constraint system expressing that the time point at which an inter-cluster data transfer operation exists excludes the execution of other instructions.