Embodiment
For convenience, the following discussion introduces several mnemonic symbols. Unless otherwise specified, every mnemonic symbol that appears below has the meaning given in Table 1.
Table 1
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described below with reference to the drawings and specific embodiments.
Fig. 2 is a flowchart of Embodiment 1 of the inter-cluster data transfer operation insertion method of the present invention. As shown in Fig. 2, this embodiment comprises the following steps:
Step 201: input the source program to be allocated.
The source program may comprise at least one pair of instructions that have a dependence.
Step 202: according to how the dependent instructions are allocated among the clusters of the processor, set up a complete system of linear constraint equations.
In this embodiment, to avoid the nonlinear operations of the prior art, the complete system of linear constraint equations may be set up from (but is not limited to) the linear constraint relations that can arise when each pair of dependent instructions is assigned to the same cluster or to different clusters. The complete system comprises several linear constraint equations that reflect the relations that may exist between two instructions.
Step 203: according to the complete system of linear constraint equations, obtain the execution times that correspond to the different schemes for allocating the instructions of the source program among the clusters of the processor, and find the shortest execution time among the execution times obtained.
To obtain the shortest execution time of the source program quickly, an initial estimate of the time needed to execute the source program can be obtained in advance, and the complete system of linear constraint equations describing the allocation of the instructions among the clusters is then solved within this initial estimated execution time. Specifically, in this embodiment, step 203 may comprise the following substeps:
Substep 203-1: obtain the initial estimated execution time of the source program.
The initial estimated execution time can be obtained roughly with a bottom-up algorithm (Bottom Up Close), list scheduling (List Schedule), or a unified cluster assignment and scheduling algorithm (UAS, Unified Assign and Schedule). The resulting time is a preliminary estimate of the time needed to execute the source program and is denoted MaxT. This preliminary estimate MaxT is usually not optimal; it must be driven toward the shortest value by repeatedly solving the complete system of linear constraint equations.
Substep 203-2: using the estimated execution time, solve the complete system of linear constraint equations set up in advance. If the system has a solution, shorten the estimated execution time by a predetermined step and solve the system again, until the system has no solution.
Because the preliminary estimated execution time may not be optimal, it may be longer than the time required by the optimal ordering of the instructions of the source program. The shortest execution time of the source program is therefore obtained by iteratively solving the complete system and checking whether it has a solution. Whenever the system has a solution during this iteration, the corresponding execution time is not yet known to be optimal, so the estimated execution time MaxT is updated by the predetermined step to obtain a new estimate.
Substep 203-3: take the estimated execution time immediately preceding the one for which the complete system of linear constraint equations has no solution as the shortest execution time.
When the complete system has no solution for the first time, the allocation of the instructions of the source program has already reached the optimum, so the estimated execution time immediately preceding the first unsolvable one can be taken as the shortest execution time.
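Substeps 203-1 to 203-3 can be sketched as a simple loop. This is a minimal illustration, not the patent's implementation: the function name `shortest_execution_time` and the callback `has_solution` (which stands in for building and solving the complete system for a given MaxT) are hypothetical.

```python
from typing import Callable


def shortest_execution_time(initial_max_t: int,
                            has_solution: Callable[[int], bool],
                            step: int = 1) -> int:
    """Shrink the estimate MaxT by `step` until the system becomes unsolvable.

    `has_solution(max_t)` is a hypothetical oracle that reports whether the
    complete constraint system is feasible for the estimate `max_t`.
    Returns the last estimate for which a solution existed (substep 203-3).
    """
    if not has_solution(initial_max_t):
        raise ValueError("the initial estimate must be feasible")
    max_t = initial_max_t
    # Substep 203-2: keep shortening while the shortened estimate is feasible.
    while max_t - step > 0 and has_solution(max_t - step):
        max_t -= step
    return max_t


def toy_oracle(max_t: int) -> bool:
    """Toy stand-in: pretend the program cannot finish in fewer than 7 cycles."""
    return max_t >= 7


best = shortest_execution_time(12, toy_oracle, step=1)
```

With a step larger than 1 the loop can stop above the true minimum (a step of 2 starting from 12 stops at 8 in the toy case), which is the usual trade-off between the number of solver calls and the precision of the result.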
In this embodiment, when the shortest execution time is sought, the execution time of the instruction executed last, ignoring the effect of cycles in the data dependence graph G, can serve as the objective function. That is, the initial estimated execution time is determined from the execution time of the instruction that is executed last.
Step 204: according to the dependences between the instructions of the source program and the allocation scheme corresponding to the shortest execution time, determine whether an inter-cluster data transfer operation is to be inserted between instructions that have a dependence.
The preceding steps yield the shortest execution time of the source program, which corresponds to the optimal solution of the complete system of linear constraint equations. Because that system is built from the linear constraint relations that arise when dependent instructions are allocated among the clusters, its optimal solution also corresponds to the best scheme for allocating the instructions among the clusters. Combining this best allocation scheme with the dependences between the instructions of the source program determines whether an inter-cluster data transfer operation must be inserted between two instructions. For example, if two dependent instructions are assigned to different clusters, an inter-cluster data transfer operation must be inserted between them; otherwise none is needed.
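The decision rule of step 204 can be sketched as follows. The names `edges_needing_transfer`, `dep_edges`, and `cluster_of` are illustrative assumptions; the allocation would in practice come from the solved constraint system.

```python
def edges_needing_transfer(dep_edges, cluster_of):
    """Return the dependence edges whose endpoints sit in different clusters.

    dep_edges  : iterable of (producer, consumer) instruction ids
    cluster_of : dict mapping instruction id -> cluster id (the allocation)
    """
    return [(src, dst) for src, dst in dep_edges
            if cluster_of[src] != cluster_of[dst]]


# Hypothetical allocation: i0 and i1 share cluster 0, i2 is placed in cluster 1.
transfers = edges_needing_transfer([(0, 1), (1, 2)], {0: 0, 1: 0, 2: 1})
```

Only the edge (1, 2) crosses a cluster boundary here, so it is the only one that would receive an inter-cluster data transfer operation.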
In this embodiment, the dependences between the instructions can be represented as a data dependence graph. Before step 204, the method may further comprise: inputting a data dependence graph that describes the dependences between the instructions of the source program. Inputting the data dependence graph may specifically comprise: simplifying the data dependence graph, and inputting the simplified graph. Because a constraint equation of the present invention must be set up for every dependence edge of the graph, the number of dependence edges strongly affects the efficiency of the algorithm: the more constraint equations, the lower the efficiency. The present invention can further improve processing efficiency by eliminating redundant dependence edges to obtain a simplified data dependence graph. A redundant dependence edge is defined as follows. Let OPx->OPy be a dependence edge of the original data dependence graph, where OPx and OPy each denote a node of the graph and correspond to the x-th and y-th instructions respectively. If the graph contains another path from OPx to OPy that passes through at least one node other than x and y, the dependence edge OPx->OPy is redundant. As shown in Fig. 3, from node 2 to node 5 there are the path 2-5 and the path 2-3-5, so the dependence edge 2-5 is redundant. The difference between the data dependence graphs before and after simplification can be seen directly by comparing the corresponding dependence matrices. Because the graph is simplified, the number of constraint equations to be set up also decreases. For example, three redundant edges are eliminated in Fig. 3, so the number of constraint equations decreases by at least 3.
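The elimination of redundant dependence edges can be sketched as below. This is an illustrative version that assumes the dependence graph is acyclic; `remove_redundant_edges` is a hypothetical name, and the three-edge input only reproduces the 2-3-5 fragment mentioned for Fig. 3, not the whole figure.

```python
from collections import defaultdict


def remove_redundant_edges(edges):
    """Drop every edge (u, v) for which another u -> v path exists.

    Assumes an acyclic graph. `edges` is a list of (u, v) pairs.
    """
    succ = defaultdict(set)
    for u, v in edges:
        succ[u].add(v)

    def has_indirect_path(u, v):
        # Depth-first search from u to v that never uses the direct edge u -> v,
        # so any path found passes through at least one intermediate node.
        stack = [w for w in succ[u] if w != v]
        seen = set(stack)
        while stack:
            node = stack.pop()
            if node == v:
                return True
            for nxt in succ[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return False

    return [(u, v) for u, v in edges if not has_indirect_path(u, v)]


reduced = remove_redundant_edges([(2, 3), (3, 5), (2, 5)])
```

The edge (2, 5) is dropped because the path 2-3-5 already implies it, which removes one constraint equation from the system.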
In this embodiment, an inter-cluster data transfer operation can be a simple exchange of data between two instructions assigned to different clusters, or it can be a call to a library routine, and so on. Inter-cluster data can be transferred over the bus of the processor or through a cache.
In the technical solution of this embodiment, the complete system of linear constraint equations is set up from the linear constraint relations that can arise when dependent instructions are allocated to the same cluster or to different clusters of the processor, and whether an inter-cluster data transfer operation is inserted between dependent instructions is determined by repeatedly solving this system. Compared with the prior art, no nonlinear operation is introduced, so no third-party software is needed and the method can be implemented entirely in C code. In addition, in this embodiment the execution time of the source program is made to converge by repeatedly solving the complete system, so the shortest execution time of the source program is found. The globally optimal solution is thereby found, and with it the optimal scheme for allocating the instructions among the clusters, which maximizes the instruction execution efficiency of the clustered processor.
Fig. 4 is a flowchart of Embodiment 2 of the inter-cluster data transfer operation insertion method of the present invention. As shown in Fig. 4, this embodiment may comprise the following steps:
Step 401: input the source program to be allocated.
Step 402: according to how the dependent instructions are allocated among the clusters of the processor, set up the complete system of linear constraint equations.
An existence variable, assigned different values, characterizes whether instruction i is present in cluster k at time j; this existence variable is denoted X(i, j, k). For example, if two dependent instructions are assigned to different clusters, the sums of their corresponding existence variables X(i, j, k) satisfy a certain linear constraint relation; for instance, the sums can only take a few specific integer values. In this way, this embodiment can set up a complete linear constraint equation between every two dependent instructions, and thereby the complete system of linear constraint equations. Step 402 may therefore specifically comprise:
Substep 402-1: define the existence variable describing the allocation of a single instruction among the clusters within the estimated execution time of the source program;
Substep 402-2: obtain the existence variables corresponding to a pair of dependent instructions in the same cluster or in different clusters; and
Substep 402-3: set up the complete system of linear constraint equations according to the sum relations of the existence variables.
Step 403: according to the complete system of linear constraint equations, obtain the execution times that correspond to the different schemes for allocating the instructions of the source program among the clusters of the processor, and find the shortest execution time among the execution times obtained.
Step 404: according to the dependences between the instructions of the source program and the allocation scheme corresponding to the shortest execution time, determine whether an inter-cluster data transfer operation is to be inserted between instructions that have a dependence.
For the specific implementation of steps 403 and 404, refer to the relevant parts of Embodiment 1 above; the details are not repeated here.
In this embodiment, an existence variable is defined to describe whether an instruction is assigned to a given cluster at a given time. From the sums of the existence variables that arise when two dependent instructions are assigned to the same cluster or to different clusters, a complete linear constraint equation is set up for every pair of dependent instructions, and from these the complete system of linear constraint equations. By repeatedly solving this system, inter-cluster data transfer operations are inserted and, correspondingly, the optimal scheme for allocating the instructions among the clusters is found, which maximizes the instruction execution efficiency of the clustered processor.
Fig. 5 is a flowchart of Embodiment 3 of the inter-cluster data transfer operation insertion method of the present invention. As shown in Fig. 5, this embodiment may comprise the following steps:
Step 501: input the source program to be allocated.
The source program comprises at least one pair of instructions that have a dependence.
Step 502: according to how the dependent instructions are allocated among the clusters of the processor, set up the complete system of linear constraint equations.
In this embodiment, the complete system of linear constraint equations may comprise a first, a second, and a third system of linear constraint equations. The first system can be set up from the existence variable of Embodiment 2 and describes whether two dependent instructions are present in the same cluster. The second system can describe the time point of the inter-cluster data transfer operation that exists between two dependent instructions. The third system can express that a time point at which an inter-cluster data transfer operation exists excludes the execution of other instructions.
In this embodiment, X(i, j, k) = 1 or 0 indicates whether instruction i is present in cluster k at time j. If X(i, j, k) = 1, instruction i is present in cluster k at time j; if X(i, j, k) = 0, instruction i is not present in cluster k at time j. On this basis, the first system of linear constraint equations may comprise the following linear equations:
In formula (1), X(i1, j, k) and X(i2, j, k) indicate whether instructions i1 and i2, respectively, are present in cluster k at time j; C is the number of clusters of the clustered processor; $E_{i1}$ is the earliest execution time of instruction i1 and $L_{i1}$ its latest execution time; $E_{i2}$ is the earliest execution time of instruction i2 and $L_{i2}$ its latest execution time. The earliest and latest execution times can be calculated, for example, from empirical values and processor performance parameters combined with a topological ordering of the nodes of the activity-on-vertex (AOV) network, which reduces the time needed to solve the complete system of linear constraint equations and improves the efficiency of the algorithm. The meaning of formula (1) is then as follows: if neither instruction i1 nor instruction i2 is in cluster k, the sum of the existence variables of the two instructions in cluster k over their respective intervals $[E_{i1}, L_{i1}]$ and $[E_{i2}, L_{i2}]$ is 0. In this embodiment, this sum of existence variables in cluster k is called the first auxiliary variable, an integer variable denoted $A_k$. Thus $A_k = 0$ if neither i1 nor i2 is in cluster k, and $A_k = 1$ if exactly one of them is; $A_k = 2$ occurs only when both are in cluster k, which is the property used by formula (2).
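One common way to obtain such earliest and latest execution windows from a topological order is an as-soon-as-possible / as-late-as-possible pass. The sketch below is an assumption about how this could be done, not the patent's stated procedure; `earliest_latest` and the `(src, dst, latency)` edge format are hypothetical, and the empirical values and processor parameters mentioned in the text are not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def earliest_latest(num_instr, edges, max_t):
    """Compute [E_i, L_i] windows for instructions 0..num_instr-1.

    edges : list of (src, dst, latency); max_t : estimated execution time MaxT,
    so valid time points are 0 .. max_t - 1.
    """
    preds = {i: set() for i in range(num_instr)}
    for u, v, _ in edges:
        preds[v].add(u)
    # static_order() yields every node after all of its predecessors.
    order = list(TopologicalSorter(preds).static_order())

    earliest = [0] * num_instr
    for v in order:  # forward pass: as soon as possible
        for u, w, lat in edges:
            if w == v:
                earliest[v] = max(earliest[v], earliest[u] + lat)

    latest = [max_t - 1] * num_instr
    for u in reversed(order):  # backward pass: as late as possible
        for a, b, lat in edges:
            if a == u:
                latest[u] = min(latest[u], latest[b] - lat)
    return earliest, latest


# Chain i0 -> i1 -> i2 with unit latencies and MaxT = 4.
E, L = earliest_latest(3, [(0, 1, 1), (1, 2, 1)], max_t=4)
```

Narrower windows mean fewer existence variables X(i, j, k) per instruction, which is where the reduction in solving time comes from.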
For example, suppose a clustered processor has 3 clusters (labeled cluster 0, cluster 1, and cluster 2) and there are two instructions i0 and i1 with the dependence i0 → i1. In this example the mnemonic symbols of Table 1 take the following values: data dependence graph G, E=2, N=2, MaxT=4 (ms), C=3. The distribution of the existence variables X(i, j, k) is shown in Table 2.
Table 2
Formula (1) then takes the specific form:
$A_0 = (X(0,0,0) + X(0,1,0) + X(0,2,0) + X(0,3,0)) + (X(1,0,0) + X(1,1,0) + X(1,2,0) + X(1,3,0))$
$A_1 = (X(0,0,1) + X(0,1,1) + X(0,2,1) + X(0,3,1)) + (X(1,0,1) + X(1,1,1) + X(1,2,1) + X(1,3,1))$
$A_2 = (X(0,0,2) + X(0,1,2) + X(0,2,2) + X(0,3,2)) + (X(1,0,2) + X(1,1,2) + X(1,2,2) + X(1,3,2))$
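The three sums above follow one pattern: for each cluster k, add the existence variables of both instructions over their execution windows. A minimal sketch of that computation is shown below. The function name `first_auxiliary`, the dictionary representation of X, and the two sample allocations are assumptions made for illustration; they are not the contents of Table 2.

```python
def first_auxiliary(x, i1, i2, window1, window2, num_clusters):
    """Compute A_k for k = 0 .. num_clusters - 1.

    x        : dict mapping (i, j, k) -> 0 or 1; missing keys count as 0
    window1  : inclusive (E_i1, L_i1) interval for instruction i1
    window2  : inclusive (E_i2, L_i2) interval for instruction i2
    """
    a = []
    for k in range(num_clusters):
        s1 = sum(x.get((i1, j, k), 0) for j in range(window1[0], window1[1] + 1))
        s2 = sum(x.get((i2, j, k), 0) for j in range(window2[0], window2[1] + 1))
        a.append(s1 + s2)
    return a


# Hypothetical allocation 1: i0 at time 0 and i1 at time 2, both in cluster 1.
same = first_auxiliary({(0, 0, 1): 1, (1, 2, 1): 1}, 0, 1, (0, 3), (0, 3), 3)
# Hypothetical allocation 2: i0 in cluster 0, i1 in cluster 2.
split = first_auxiliary({(0, 0, 0): 1, (1, 1, 2): 1}, 0, 1, (0, 3), (0, 3), 3)
```

In the first case exactly one $A_k$ equals 2; in the second, no $A_k$ exceeds 1. That difference is what the second auxiliary variable detects.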
In formula (2), a second auxiliary variable, an integer variable denoted $F_k$, is defined, and a complete linear constraint relation is set up between this second auxiliary variable $F_k$ and the first auxiliary variable $A_k$ so that $F_k = 1$ when $A_k = 2$ and $F_k = 0$ otherwise. When instructions i1 and i2 are not both in cluster k, the first auxiliary variable $A_k$ can only equal 0 or 1; only when both i1 and i2 are in cluster k can $A_k$ equal 2. Formula (2) is set up using this property of $A_k$.
As shown in Fig. 6, formula (2) can be set up as follows. Taking (2, 1) as the endpoint, draw a ray W1 through the two points (2, 1) and (0, 0.5); again taking (2, 1) as the endpoint, draw a ray W2 through (2, 1) and (1, 0). W1: $F_k - 0.25 \times A_k = 0.5$; W2: $F_k = A_k - 1$. The region on the F-axis side between these two rays is exactly what formula (2) describes.
The values of $F_0$, $F_1$, and $F_2$ follow from this constraint. When i0 and i1 are in the same cluster, exactly one of $A_0$, $A_1$, $A_2$ equals 2; the corresponding $F_k$ is 1, and all remaining $F_k$ are 0.
In this embodiment, the specific parameters of formula (2) are not unique: formula (2) can be constructed by taking (2, 1) as one endpoint and choosing any point in [0, 1) on the F axis as the other endpoint.
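The region between the two rays can be read as the pair of inequalities $A_k - 1 \le F_k \le b + \frac{1-b}{2} A_k$, where $b \in [0, 1)$ is the chosen point on the F axis (b = 0.5 gives W1 above). This reading, and the restriction of $F_k$ to the values 0 and 1, are assumptions, since the original formula (2) is not reproduced in this text. The sketch below enumerates the admissible values to check that the linear bounds force the intended result; `allowed_f` is a hypothetical helper.

```python
from fractions import Fraction


def allowed_f(a_k, b=Fraction(1, 2)):
    """Values F in {0, 1} satisfying a_k - 1 <= F <= b + (1 - b) / 2 * a_k.

    b is the F-axis intercept of the upper ray through (2, 1); exact fractions
    avoid floating-point rounding at the boundary point (2, 1).
    """
    b = Fraction(b)
    upper = b + (1 - b) / 2 * a_k
    return [f for f in (0, 1) if a_k - 1 <= f <= upper]


# For every admissible A_k the bounds leave exactly one value of F_k.
table = {a: allowed_f(a) for a in (0, 1, 2)}
```

The same check passes for other intercepts in [0, 1), which matches the remark that the parameters of formula (2) are not unique.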
In formula (3), the homogeneity variable $S_i$ describes whether an inter-cluster data transfer operation is needed. Each dependence edge ei of the data dependence graph has a corresponding homogeneity variable $S_i$. When the two instructions (i1, i2) are in the same cluster k, the corresponding second auxiliary variable $F_k = 1$, so $S_i = 1$ and no data transfer operation is needed between the two instructions (i1 → i2) of dependence edge ei; otherwise $S_i = 0$, which indicates that a data transfer operation is needed.
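Since the original formula (3) is not reproduced in this text, the relation below is only one linear form consistent with the description, offered as an assumption: because at most one $F_k$ can equal 1 for a given dependence edge, summing the second auxiliary variables over all C clusters yields the homogeneity variable directly.

```latex
S_i \;=\; \sum_{k=0}^{C-1} F_k , \qquad S_i \in \{0, 1\}
```

Under this reading, $S_i = 1$ exactly when some cluster holds both instructions of edge ei, and $S_i = 0$ when they are split across clusters.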
In this embodiment, the second system of linear constraint equations may comprise the following linear equations:
In formulas (4) and (5), $Y_{(j, ei)}$ is the timing variable. If a data transfer operation is needed between the two instructions of dependence edge ei, the variable $Y_{(j, ei)}$ describes the time point j at which the data transfer operation takes place. The timing variable $Y_{(j, ei)}$ and the homogeneity variable $S_i$ are linked by the linear constraint relation of formula (5); its value is correspondingly 1 or 0.
Formula (4) expresses the following: for each dependence edge ei: i1 → i2, if an inter-cluster data transfer operation exists at time point j, the time point j at which the transfer is carried out must be greater than the time point at which instruction i1 executes and less than the time point at which instruction i2 executes.
Formula (5) expresses the following: if dependence edge ei requires the insertion of an inter-cluster data transfer operation, then within the range of time points 0 to MaxT-1 there is one and only one $Y_{(j, ei)} = 1$; otherwise all $Y_{(j, ei)} = 0$ within time points 0 to MaxT-1.
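The two conditions just described can be checked for a candidate schedule with a few lines of code. This is a sketch under assumptions: formula (5) is read as "the $Y_{(j, ei)}$ of one edge sum to $1 - S_i$", which matches the prose but is not taken from the original formula, and `check_transfer_timing` with its parameters is a hypothetical helper.

```python
def check_transfer_timing(y, s, t_src, t_dst):
    """Validate the timing variables of one dependence edge ei: i1 -> i2.

    y     : list of 0/1 values, y[j] = Y(j, ei) for j = 0 .. MaxT - 1
    s     : homogeneity variable S_i (1 = same cluster, 0 = transfer needed)
    t_src : time point at which i1 executes
    t_dst : time point at which i2 executes
    """
    # Formula (5), as read here: exactly one transfer slot iff S_i = 0.
    if sum(y) != 1 - s:
        return False
    # Formula (4): a transfer slot lies strictly between the two instructions.
    return all(t_src < j < t_dst for j, flag in enumerate(y) if flag == 1)


ok = check_transfer_timing([0, 0, 1, 0], s=0, t_src=1, t_dst=3)
too_early = check_transfer_timing([0, 1, 0, 0], s=0, t_src=1, t_dst=3)
```

The first call describes a transfer at time point 2 between instructions at time points 1 and 3, which is valid; the second places the transfer at the producer's own time point and is rejected.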
For a clustered processor with C = 3 clusters, the corresponding timing variables are listed in Table 3.
Table 3
Cycle | e0 | e1 | e2
T0 | Y(0,0) | Y(1,0) | Y(2,0)
T1 | Y(0,1) | Y(1,1) | Y(2,1)
T2 | Y(0,2) | Y(1,2) | Y(2,2)
T3 | Y(0,3) | Y(1,3) | Y(2,3)
The third system of linear constraint equations may comprise the following linear equation:
Data transfer between the clusters of a clustered processor is carried out explicitly, and every data transfer operation interrupts the execution of instructions in all clusters while it is carried out. A data transfer operation therefore requires that no other instruction executes at the same time point. Likewise, if another instruction executes at time point j, a data transfer operation cannot be carried out at time point j. The linear constraint of formula (6) is constructed accordingly.
The linear constraint of formula (6) expresses the following: at any time point within the range 0 to MaxT-1, if an inter-cluster data transfer operation exists at time point j, no other instruction may be executed at that time point. The existence of every instruction in every cluster at time point j is counted. If the count is 1, another instruction executes at time point j, and an inter-cluster data transfer operation cannot be carried out; if the count is 0, no other instruction executes at time point j, and a data transfer operation can be carried out.
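The exclusivity condition described for formula (6) can be checked on a candidate schedule as sketched below. The helper `transfer_conflicts` and the dictionary layouts for X and Y are assumptions for illustration; the sketch verifies a given assignment rather than reproducing the linear form of formula (6), which is not available in this text.

```python
def transfer_conflicts(x, y):
    """Return the time points at which a transfer overlaps an instruction.

    x : dict mapping (i, j, k) -> 0/1, the existence variables
    y : dict mapping (j, e) -> 0/1, the timing variables per dependence edge e
    An empty result means the exclusivity condition holds.
    """
    busy = {j for (_i, j, _k), value in x.items() if value}
    moving = {j for (j, _e), value in y.items() if value}
    return sorted(busy & moving)


# Hypothetical schedule: i0 at time 0 in cluster 0, i1 at time 2 in cluster 1.
x = {(0, 0, 0): 1, (1, 2, 1): 1}
clean = transfer_conflicts(x, {(1, 0): 1})   # transfer for edge e0 at time 1
clash = transfer_conflicts(x, {(2, 0): 1})   # transfer for edge e0 at time 2
```

A transfer at time point 1 is free of conflicts, whereas a transfer at time point 2 collides with i1.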
Step 503: according to the complete system of linear constraint equations, obtain the execution times that correspond to the different schemes for allocating the instructions of the source program among the clusters of the processor, and find the shortest execution time among the execution times obtained.
The shortest execution time is determined by minimizing, over the different allocation schemes, the execution time of the instruction that is executed last. The following constraint relation can be constructed:
In formula (7), maxlen denotes the instruction that is executed last.
Step 504: obtain, from the existence variables, a homogeneity variable that describes whether an inter-cluster data transfer operation is needed.
Step 505: determine, according to the homogeneity variable, whether an inter-cluster data transfer operation is needed between the instructions that have a dependence.
In the above embodiments, when the complete system of linear constraint equations is constructed, it can also be taken into account that every instruction of the source program must be issued, and issued only once, and that two dependent instructions must execute in the proper order. Specifically, formulas (8) and (9) below can be set up to constrain these conditions.
$L_{i1 \to i2} \ge 1$
In the formulas, $L_{i1 \to i2}$ denotes the latency. Take the data read instruction load as an example: its form is r1=[memory], where r1 denotes a register and memory denotes a memory address. If its latency is 3, the instruction needs 3 time points before the data is read into r1; if the next instruction is to use the value of r1, it can do so only at the 4th time point. Depending on how the solution of the present invention is applied, the latency can be chosen as any number greater than or equal to 1.
As formula (8) shows, within the range up to MaxT-1, the existence variable X(i, j, k) is used to count the existence of each instruction i at each time point j in each cluster k. For example, if the second instruction executes in cluster 2 at time point 2, its existence at time point 2 in cluster 2 is 1 and its existence at all other time points and in all other clusters is 0. The existence values are summed over all time points and all clusters; a sum of 1 shows that the instruction has been issued, and issued only once.
As formula (9) shows, the existence of each instruction in each cluster is summed over the interval $[E_i, L_i]$ in which the instruction can be executed. An instruction must execute before any other instruction that depends on its result. For example, if i1 and i2 have a dependence such that the execution of i2 depends on the result of i1, then i1 should execute before i2. Hence the product of the time point at which i1 executes and its corresponding existence value, plus the latency, should be less than the product of the time point at which i2 executes and its corresponding existence value.
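The two conditions described for formulas (8) and (9) can be checked on a candidate assignment as sketched below. The helpers `issue_time` and `respects_dependence` are hypothetical. The text says "less than", but the load example (latency 3, result usable at the 4th time point) suggests an inclusive bound, so this sketch assumes `t1 + latency <= t2`; treat that choice as an assumption.

```python
def issue_time(x, i):
    """Return the single time point at which instruction i is issued.

    x maps (i, j, k) -> 0/1. Returns None unless the existence variables of
    instruction i sum to exactly 1 over all time points and clusters, which
    is the "issued once and only once" condition described for formula (8).
    """
    slots = [j for (instr, j, _k), value in x.items() if instr == i and value]
    return slots[0] if len(slots) == 1 else None


def respects_dependence(x, i1, i2, latency):
    """Check the ordering condition described for formula (9) for i1 -> i2."""
    t1, t2 = issue_time(x, i1), issue_time(x, i2)
    if t1 is None or t2 is None:
        return False
    # Assumed inclusive bound: the consumer may start `latency` points later.
    return t1 + latency <= t2


# Hypothetical assignment: load i0 at time 0 (cluster 0), user i1 at time 3.
x = {(0, 0, 0): 1, (1, 3, 1): 1}
fits = respects_dependence(x, 0, 1, latency=3)
late = respects_dependence(x, 0, 1, latency=4)
```

With a latency of 3 the consumer at time point 3 is acceptable; with a latency of 4 the same placement violates the ordering.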
In the above embodiments, the complete system of linear constraint equations is set up from the existence variables that describe how the instructions are allocated among the clusters, the shortest execution time of the source program is obtained by solving this system, and how to insert inter-cluster data transfer operations is determined on the basis of this shortest execution time. Geometrically, the equality and inequality constraints form a closed or half-closed polyhedron, and the globally optimal solution lies at a topmost or bottommost vertex of this polyhedron; this vertex corresponds to the shortest execution time.
An embodiment of the present invention further provides an inter-cluster data transfer operation insertion device. As shown in Fig. 7, the device 700 comprises:
an input module 701, configured to input the source program to be allocated;
an execution time processing module 702, configured to obtain, according to a complete system of linear constraint equations set up in advance, the execution times that correspond to the different schemes for allocating the instructions of the source program among the clusters of the processor, and to find the shortest execution time among the execution times obtained;
an instruction allocation module 703, configured to determine, according to the dependences between the instructions of the source program and the allocation scheme corresponding to the shortest execution time found by the execution time processing module 702, whether an inter-cluster data transfer operation is to be inserted between instructions that have a dependence.
The execution time processing module 702 may comprise an initial estimation unit, a solving unit, and an iterative processing unit. The initial estimation unit obtains the initial estimated execution time of the source program. The solving unit solves the complete system of linear constraint equations set up in advance according to the estimated execution time supplied to it. If there is a solution, the iterative processing unit shortens the estimated execution time by a predetermined step and passes the shortened estimate to the solving unit. If there is no solution, the iterative processing unit outputs, as the shortest execution time, the estimated execution time immediately preceding the one for which the complete system has no solution.
The device may further comprise a linear constraint module, configured to set up the complete system of linear constraint equations according to how the dependent instructions are allocated among the clusters of the processor.
The linear constraint module may specifically comprise a first, a second, and a third linear constraint unit. The first linear constraint unit sets up the first system of linear constraint equations, which describes whether two dependent instructions are present in the same cluster. The second linear constraint unit sets up the second system of linear constraint equations, which describes the time point of the inter-cluster data transfer operation that exists between two dependent instructions. The third linear constraint unit sets up the third system of linear constraint equations, which expresses that a time point at which an inter-cluster data transfer operation exists excludes the execution of other instructions.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.