A micro-scheduling method supporting directed cyclic graphs
Technical field
The present invention relates to the technical field of instruction-level parallel compilation and micro-scheduling, and in particular to a micro-scheduling method that supports "directed cyclic graphs" with "back edges", together with the corresponding compiler optimization techniques.
Background technology
Instruction scheduling is an important machine-dependent optimization phase. A successful instruction scheduler must satisfy data dependences, control dependences, structural hazards and other constraints, and it improves resource utilization and instruction-level parallelism by reordering instructions. A structural hazard is a resource conflict that may arise when instructions execute in parallel. Resolving such hazards requires querying and reserving resource state, querying machine state, obtaining the latencies between instructions, and so on, so the machine model must be consulted frequently [Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997].
In addition, some machines impose special requirements on the instruction issue order, for example Intel's first- and second-generation processors based on the IA-64 architecture, Itanium and McKinley (i.e. Itanium 2). Both require instructions to be issued according to special templates such as MII, where MII means that the first slot issues a Memory instruction and the following two must be Integer instructions. Instruction scheduling therefore has to take more complex factors into account, which is also very unfavorable for porting code.
In view of the dual requirements of performance and portability, we separate from instruction scheduling the part that deals with structural hazards and other restrictions and turn it into a new module, called micro-scheduling [Dong-Yuan Chen, Lixia Liu, Chen Fu, Shuxin Yang, Chengyong Wu, Roy Ju. Efficient Resource Management during Instruction Scheduling for the EPIC Architecture. Parallel Architectures and Compilation Techniques. 2003, 9: 36-45]. It encapsulates the details of the target machine's microarchitecture. On the one hand, it provides the other machine-dependent optimizations with an interface for accessing machine parameters and machine state, which makes the compiler machine-independent to some extent and lets it adapt quickly to a modified or replaced target machine. On the other hand, it improves the flexibility and portability of instruction scheduling.
First, the scope of micro-scheduling is small. For machine state transitions we only need to track the state changes of the current cycle; when special restrictions are added, we need to consider at most the states of the preceding few cycles. The scheduling scope is therefore only a few cycles. Second, we must assign a functional unit to each instruction and consider reordering the instructions, which requires work similar to scheduling within a cycle.
While resolving structural hazards, micro-scheduling must also know which resources the machine currently occupies; once a suitable instruction has been selected for issue, the state changes accordingly. This process of machine state change is similar to the state transitions of a finite-state automaton.
Instruction scheduling usually resolves structural hazards in one of two ways: backward detection or forward detection. The backward method records the instructions already issued in the current cycle and, for each candidate instruction, determines whether it causes a resource conflict with those instructions or violates some other restriction. If there is a conflict or violation, the instruction may not be issued in the current cycle; otherwise it may. The forward method records the machine state after each issued instruction and, for a new candidate instruction, determines whether a legal next state can be reached from the current machine state. If not, the instruction may not be issued; otherwise it may. One method looks backward at the instructions already issued; the other looks forward to judge whether the next state is legal [T. Muller. Employing finite automata for resource scheduling. In the 26th Annual International Symposium on Microarchitecture, Austin, Dec 1993: 12-20]. The backward method needs complex comparisons and checks for every instruction, so it is slow and algorithmically complex, but it can handle special cases and uses little space. The forward method works by state transitions, so it is fast and algorithmically simple, but it must model all states, uses a lot of space, and has difficulty handling special cases. Micro-scheduling combines the advantages of both methods and implements a processor-specific hybrid method.
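The contrast between the two detection methods can be sketched as follows. This is only an illustration under stated assumptions: the machine with two M units and two I units, the `CAPACITY` table and all function names are hypothetical and are not taken from ORC or from the cited papers.

```python
from itertools import product

# Hypothetical machine: per-cycle capacity of each functional-unit class.
CAPACITY = {"M": 2, "I": 2}

def backward_can_issue(issued, unit):
    """Backward detection: rescan the instructions already issued this cycle."""
    return sum(1 for u in issued if u == unit) < CAPACITY[unit]

def build_fsa():
    """Forward detection: enumerate every legal state and transition up front."""
    table = {}
    for m, i in product(range(CAPACITY["M"] + 1), range(CAPACITY["I"] + 1)):
        if m < CAPACITY["M"]:
            table[((m, i), "M")] = (m + 1, i)
        if i < CAPACITY["I"]:
            table[((m, i), "I")] = (m, i + 1)
    return table

FSA = build_fsa()

def forward_step(state, unit):
    """Return the next state, or None when no legal transition exists."""
    return FSA.get((state, unit))
```

The backward check repeats work for every candidate but stores nothing beyond the issued list, whereas the forward check is a single table lookup at the cost of storing every state, which mirrors the speed and space trade-off described above.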
At present the idea of micro-scheduling has been applied successfully in ORC [Open Research Compiler (ORC). http://ipf-orc.sourceforge.net. 2002]. However, it only implements scheduling for "directed acyclic graphs" and does not yet support "directed cyclic graphs", so it cannot be applied in software pipelining to schedule instructions with "back edges". On the other hand, the original instruction bundling part of the software pipelining (SWP: Software Pipeline) module in ORC uses a simple algorithm without micro-scheduling [Richard A. Huff. Lifetime-sensitive modulo scheduling. ACM SIGPLAN Notices, 1993, 28(6): 258-267]. Besides being inflexible, it misses many potential opportunities to improve performance.
As a loop scheduling method, software pipelining (SWP) effectively exploits the instruction-level parallelism (ILP) hidden in a program. It gains speed by overlapping the execution of different loop iterations: before one iteration has finished, the next can be started, and the interval between the two starts is called the initiation interval (II). Software pipelining currently has two main families of techniques: Move-then-schedule (also called code motion) and Schedule-then-move. The former moves instructions one by one across the loop back edge (back-edge). The latter builds the final schedule directly from scratch. The Schedule-then-move family in turn has two major members: unroll-based scheduling and modulo scheduling. The former performs both loop unrolling (unroll) and instruction scheduling; the latter schedules only one iteration of the loop, such that repeating the same schedule at a fixed interval causes neither resource conflicts nor dependence violations. Modulo scheduling is currently the most popular way to implement software pipelining [Josep M. Codina, Josep Llosa, Antonio González. A Comparative Study of Modulo Scheduling Techniques. Proceedings of the 16th International Conference on Supercomputing. 2002, 7: 97-106]. ORC adopts the Slack Modulo Scheduling proposed by Richard A. Huff [Richard A. Huff. Lifetime-sensitive modulo scheduling. ACM SIGPLAN Notices, 1993, 28(6): 258-267]. Instruction bundling immediately follows modulo scheduling. The Itanium architecture allows two instruction bundles to be issued per cycle; each bundle holds three instructions, which must be issued according to a template. Itanium defines 16 instruction templates (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB, RESERVED_A, RESERVED_D, RESERVED_3, RESERVED_F). We call the three instruction positions of a template the three slots (slot0, slot1, slot2). The bundling part of SWP in ORC derives from Pro64. Its basic idea is as follows:
SWP_Pack_A_Bundle(ops_list)    // ops_list: the instruction set that modulo scheduling considers able to fit in one cycle
{
    while (!ops_list.empty()) {                        // while ops_list is not empty
        for (each slot in current bundle) {            // for each slot of the current bundle
            for (every slot_kind on each slot) {       // for each instruction type the slot may hold
                for (each op in ops_list)              // for each op in ops_list
                    if (!SWP_Violates_Dependencies(ops_list.first_op(), op)) {
                                                       // op has no dependence on the unplaced ops before it
                        SWP_Bundle_Next_In_Group(op, slot, slot_kind);
                                                       // place op, as this slot_kind, into slot
                        ops_list.erase(op);            // delete the op that has been placed
                    }
            }    // end for (every slot_kind ...)
        }    // end for (each slot ...)
    }    // while
}
In the algorithm, ops_list is the instruction set that modulo scheduling considers able to fit in one clock cycle. When the algorithm looks for the instruction type that each slot may hold, it follows only a linear priority order. For example, when selecting the instruction type for slot1 of a bundle, it assigns the slot kind (slot_kind) in the order F > I > M > B. Moreover, once an earlier slot has been filled, a later instruction can never evict the instruction placed there, even when the later instruction then cannot find a suitable template. In fact, in many cases evicting an already placed instruction is exactly what uncovers a potential optimization. For example, the six instructions adds, shr_i.u, add, lfetch, xor, ld8_i (of types M/I, I, M/I, M, M/I, M respectively) could be placed in one cycle using MII, MMI (ld8_i, shr_i.u, add; lfetch, xor, adds), but the linear priority algorithm in SWP bundling arranges them as MII, MMF + MFB (adds, shr_i.u, add; lfetch, xor, nop.f + ld8_i, nop.f, nop.b), which takes two cycles. In other words, modulo scheduling assumed the instructions would fit in one cycle, but in fact they did not. We call this situation the split issue problem (Split issue) [Intel Corp. Itanium Processor Microarchitecture Reference. http://developer.intel.com/design/itanium/arch spec.htm, 2000].
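The claim that the six instructions fit the MII and MMI templates by slot type can be checked with a small brute-force sketch. It is only an assumption-laden illustration: the function `assign` is hypothetical, it treats each template letter as a slot type, and it ignores stop bits, functional-unit counts and the processor's dispersal rules, all of which a real bundler must respect.

```python
from itertools import permutations

def assign(ops, templates):
    """Try to place ops into the slots of the given templates.

    ops:       list of (name, allowed_slot_types), e.g. ("add", "MI").
    templates: tuple of template names such as ("MII", "MMI");
               each letter is taken to be the type of one slot.
    Returns the instruction names in slot order, or None if nothing fits.
    """
    slots = "".join(templates)
    if len(ops) > len(slots):
        return None
    # Pad with no-ops that may occupy any slot type.
    padded = list(ops) + [("nop", "MIFB")] * (len(slots) - len(ops))
    for order in permutations(padded):
        if all(slot in kinds for slot, (_, kinds) in zip(slots, order)):
            return [name for name, _ in order]
    return None

# The six instructions from the example, with their allowed slot types.
OPS = [("adds", "MI"), ("shr_i.u", "I"), ("add", "MI"),
       ("lfetch", "M"), ("xor", "MI"), ("ld8_i", "M")]

placement = assign(OPS, ("MII", "MMI"))
```

Under these assumptions a placement exists for ("MII", "MMI"), while `assign(OPS, ("MMF", "MFB"))` finds none, because the F and B slots cannot hold any of the six instructions. That is consistent with the linear-priority result needing a third bundle and hence a second cycle.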
On the other hand, this algorithm not only adds many no-op instructions (NOP), which raises the likelihood of instruction cache misses (I-Cache miss); when padding with NOPs it also prefers nop.f to nop.i, which increases the likelihood of stalls (stall) during execution [Intel Corp. Intel Itanium2 Processor Reference Manual. 2002, 7: 53-54]. When the basic block (BB: Basic Block) containing these instructions is executed very frequently, and the instructions recur many times in that basic block because SWP has performed loop unrolling (Loop unrolling), the performance of the optimized code is noticeably affected. This very case directly caused bzip2 from SPEC2000 (ISET=ref), compiled with the ORC peak options, to lose about 2% in run-time performance.
In summary, in the modulo scheduling of prior-art software pipelining, the split issue problem can occur and the likelihood of instruction cache misses increases, which lowers compilation efficiency and hence the performance obtained from compiler optimization.
Summary of the invention
The technical problem to be solved by the present invention is to provide a micro-scheduling method supporting directed cyclic graphs that avoids the split issue problem (Split issue) arising in modulo scheduling for software pipelining, reduces the likelihood of instruction cache misses (I-Cache miss), and improves the efficiency of parallel compilation, thereby further improving compiler optimization performance.
To solve the above technical problem, the present invention provides a micro-scheduling method supporting directed cyclic graphs, comprising the following steps:
a) compute the stage number of every instruction in the instruction set;
b) determine whether the instruction set is empty; if so, go to step l); if not, go to the next step;
c) determine whether the current machine state space is full or all instructions have already been selected; if so, go to the next step; if not, go to step e);
d) complete the template assignment for the machine state space of the previous cycle, update the absolute slot values of the instructions in the machine state space of the previous cycle, assign the machine state space of the current cycle to the state space of the previous cycle, and clear the current machine state space;
e) select an instruction from the instruction set, assign it a functional unit, and assign the current machine state space to the test space of the current cycle;
f) according to the data dependence graph, check the dependence between each instruction in the current machine state space and the instruction selected in step e), that is, determine whether one of the following four cases holds; if so, go to step h); if not, go to the next step. The four cases are: there is no data dependence between the instruction in the machine state space of the current cycle and the selected instruction; the latency on the arc between that instruction and the selected instruction in the data dependence graph is 0; the loop iteration distance on the arc from that instruction to the selected instruction in the data dependence graph is not 0; the stage of that instruction in the modulo schedule differs from the stage of the selected instruction;
g) set feasible to false and go to step k), where feasible is the logical variable that records whether an intra-cycle dependence exists;
h) add the instruction op to the test space of the current cycle, set feasible to TRUE, call the template matching function to find a template for the instructions in the test space of the current cycle, and perform the state transition of the finite-state automaton;
i) determine whether the state transition of the finite-state automaton succeeded; if so, go to the next step; if not, go to step k);
j) the test succeeded: assign the value of the test space of the current cycle to the machine state space of the current cycle, update the absolute slot value of the selected instruction, delete the selected instruction from the instruction set, and go to step b);
k) mark the selected instruction as already selected in the machine state space of the current cycle, and go to step c);
l) micro-scheduling ends.
In the above technical solution, the instruction set is the set of instructions that modulo scheduling considers issuable in the same cycle.
In the above technical solution, the stage number of each instruction in step a) is the position of the instruction in the flat schedule divided by the initiation interval.
In the above technical solution, the absolute slot value denotes the absolute slot value of an instruction after the method has finished.
As can be seen from the above, in the micro-scheduling method supporting directed cyclic graphs of the present invention, when the reorder (Reorder) and negotiate (Negotiate) techniques are used to arrange the instruction set that modulo scheduling considers issuable in the same cycle, the method considers not only the dependences between instructions but also the loop iteration distance on each arc and the stage of each instruction, and thereby supports "back edges". It avoids the split issue problem (Split issue) arising in modulo scheduling for software pipelining, reduces the likelihood of instruction cache misses (I-Cache miss), and improves the efficiency of parallel compilation, thereby further improving compiler optimization performance.
Description of drawings
Fig. 1a is a schematic diagram of the pseudocode of the loop body in an embodiment of the invention;
Fig. 1b is a schematic diagram of the first three iterations after the loop body is unrolled in an embodiment of the invention;
Fig. 1c is the data dependence graph of one iteration in an embodiment of the invention;
Fig. 1d shows the execution schedule of the loop in an embodiment of the invention;
Fig. 2a is a schematic diagram of the flat schedule of one iteration (II=2) in an embodiment of the invention;
Fig. 2b is a schematic diagram of pipelined execution in an embodiment of the invention;
Fig. 3a is a schematic diagram of the code before and after register allocation when the loop iteration distance is non-zero in an embodiment of the invention;
Fig. 3b is a schematic diagram of the code before and after register allocation when the instruction stages differ in an embodiment of the invention;
Fig. 4 is a flowchart of the micro-scheduling method supporting directed cyclic graphs in an embodiment of the invention.
Notes on the drawings:
In Fig. 1d, latency=1, omega=1, II=1; I_1 to I_7 denote the 1st to the 7th cycle respectively.
In Fig. 3, TN denotes a temporary variable (Temporary Name); GTN denotes a global temporary variable (Global Temporary Name).
In Fig. 4, the names used in the flowchart have the following meanings. ops_list is the set of instructions that modulo scheduling considers issuable in the same cycle;
Stage is the stage in which an instruction lies in the modulo schedule (the stage of op = position (location) of op in the flat schedule / initiation interval (II));
NULL denotes the empty set;
prev_state and cur_state denote the machine state spaces of the previous cycle and the current cycle respectively, and temp_state denotes the test space of the machine state of the current cycle; if the test succeeds, the value of temp_state is assigned to cur_state;
Tried is a Boolean variable; the value true means that this op has already been selected once for cur_state;
The absolute slot value denotes the absolute slot value of an instruction after the MSMDDG algorithm has finished;
op denotes an instruction selected from ops_list;
inst is any instruction in cur_state;
Arc_latency(inst, op) denotes the latency on the arc from instruction inst to instruction op in the DDG;
Omega(inst, op) denotes the number of loop iterations spanned by the dependence from instruction inst to instruction op in the DDG;
Stage(inst) denotes the stage of instruction inst in the modulo schedule (i.e. location/II); this value is computed once at the very start of the procedure;
feasible is the logical variable that records whether an intra-cycle dependence exists;
Bundle_Helper is the existing template matching function of micro-scheduling; within the current cycle it attempts a state transition through the finite-state automaton (FSA: Finite State Automata) in order to find a suitable template for the instructions in the temp_state state space.
Embodiment
As mentioned above, modulo scheduling is a software pipelining technique, and software pipelining must account for instructions that interact across the loop back edge (back-edge). The "data dependence graph" (DDG: Data Dependence Graph) that it builds is therefore not the traditional "directed acyclic graph" (DAG: Directed Acyclic Graph) but a "directed cyclic graph" (DCG: Directed Cyclic Graph).
It contains not only the dependences between instructions of the same iteration (iteration) of the loop, but also the loop-carried dependences (loop-carried dependence) between different iterations. It is the latter that, through loop back edges, turn the traditional DAG into a "directed cyclic graph" [Vicki H. Allan, Reese B. Jones, Randall M. Lee, Stephen J. Allan. Software pipelining. ACM Computing Surveys (CSUR). 1995, 9(27): 376-432]. (If every path from the start node of the DDG to node a passes through node b, node b is said to dominate node a, written b dom a. If b dom a, the edge a → b is called a "back edge"; an example is the edge from node 1 to itself in Fig. 1c.) We denote this "directed cyclic graph" as DDG(N, A), where N is the set of nodes representing all instructions and A is the set of arcs representing all dependences. Every arc is labeled with an ordered pair (omega, latency). The iteration distance (omega) is the number of loop iterations spanned by the dependence, and the latency (latency) is the time that must elapse before the result produced by the first instruction can be used by the second instruction.
To illustrate how micro-scheduling in ORC supports the DDG, consider Fig. 1. The loop body shown in Fig. 1a becomes Fig. 1b after being unrolled three times, and we can see that there are dependences not only within the same iteration but also between different iterations. These dependences can be represented completely by the DDG of Fig. 1c. The ordered pair (0, 1) on the arc from node 1 to node 2 denotes a dependence with latency=1 within the same iteration. The ordered pair on the back edge from node 1 to itself is (1, 1), which means that the first instruction of the current iteration (iteration) and the first instruction of the next iteration also have a dependence with latency=1. Because modulo scheduling guarantees II*omega>=latency when it computes the initiation interval (II), once modulo scheduling has finished scheduling with the chosen II, all dependences have been taken into account, both within an iteration and between iterations.
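The relation II*omega>=latency can be made concrete with a short sketch over the two arcs of Fig. 1c. The tuple encoding of the arcs and both function names are assumptions made for illustration. The sketch applies the relation arc by arc to loop-carried arcs only (omega > 0), since arcs with omega = 0 are handled inside the flat schedule; a complete recurrence bound would examine whole dependence cycles and the resource limits as well.

```python
from math import ceil

# Arcs of the DDG in Fig. 1c, encoded as (source, target, omega, latency).
ARCS = [
    (1, 2, 0, 1),  # same-iteration dependence with latency 1
    (1, 1, 1, 1),  # back edge: node 1 depends on itself one iteration earlier
]

def min_ii_per_arc(arcs):
    """Smallest II satisfying II * omega >= latency on every loop-carried arc."""
    bounds = [ceil(lat / om) for _, _, om, lat in arcs if om > 0]
    return max(bounds, default=1)

def respects_loop_carried(arcs, ii):
    """Check II * omega >= latency on every loop-carried arc."""
    return all(ii * om >= lat for _, _, om, lat in arcs if om > 0)
```

For the arcs above, `min_ii_per_arc(ARCS)` is 1, which agrees with the value II=1 noted for Fig. 1d.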
In other words, on the one hand, when modulo scheduling performs the "flat schedule" (Flat schedule: the schedule of a single iteration), it considers the dependences between successive instructions within the iteration (instructions 2 and 3 in Fig. 2a are placed in the same cycle by the flat schedule, which shows that there is no dependence between them). On the other hand, because the data dependence graph contains cycles, when modulo scheduling schedules an instruction it must also consider the dependences between this instruction and all instructions in the flat schedule whose positions have the same remainder modulo II, because they execute simultaneously. In Fig. 2b, the instructions in I_1, I_3, I_5 execute simultaneously (I_i: i mod 2 = 1, where i is the position, location, of the instruction in the flat schedule), and the instructions in I_2, I_4, I_6 execute simultaneously (I_i: i mod 2 = 0). The instructions in the box are the loop instructions once the pipeline is in steady state; each group is called a stage (stage), and its length equals II. The stage of an instruction is stage = location/II. In cycle I_5, instruction 6 and instruction 1 come from stage 2 and stage 0 respectively.
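The grouping by remainder modulo II and the stage formula can be illustrated as follows. The flat-schedule positions in `LOCATION` are hypothetical (they are not read from Fig. 2a) and are numbered from 0, so the remainders differ by one from the 1-based cycle labels used in the figure; the division is taken to be integer division.

```python
from collections import defaultdict

II = 2
# Hypothetical 0-based flat-schedule positions for six instructions.
LOCATION = {1: 0, 2: 2, 3: 2, 4: 3, 5: 3, 6: 4}

def stage(location, ii):
    """Stage of an instruction: its flat-schedule position divided by II."""
    return location // ii

def kernel_rows(location, ii):
    """Group instructions whose positions share the same remainder modulo II."""
    rows = defaultdict(list)
    for op, loc in sorted(location.items()):
        rows[loc % ii].append(op)
    return dict(rows)
```

With these assumed positions, instructions 1, 2, 3 and 6 share remainder 0 and therefore issue in the same steady-state cycle, while instruction 6 lies in stage 2 and instruction 1 in stage 0, matching the situation described for cycle I_5.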
After modulo scheduling has finished, micro-scheduling takes over the bundling. To support directed cyclic graphs, micro-scheduling must consider two factors: Omega and Stage.
As shown in Fig. 2b, the positions of instructions 6, 2, 3 and 1 of cycle I_5 in the flat schedule have the same remainder modulo II, so in the EPIC architecture they are issued together in the same cycle. Suppose there is a read-after-write (RAW: Read After Write) dependence between instruction 6 and instruction 2, as in the example of Fig. 3a. According to ordinary scheduling methods we could not put them in the same cycle, because the later adds uses the result of the earlier adds and must wait one cycle (latency=1). However, the directed cyclic graph tells us that omega=1 in the weight on the arc between them, that is, instruction 2 comes from the next iteration. When omega is not 0, the register allocation (Register Allocation) that SWP performs after bundling can easily eliminate this RAW dependence by register renaming, see [B.R. Rau, M. Lee, P.P. Tirumalai, M.S. Schlansker. Register Allocation for Software Pipelined Loops. Proc. of the SIGPLAN '92 Conf. on Programming Language Design and Implementation. 1992, 6: 283-299]. The original software pipelining part of ORC in fact does the same. So when omega>=1, we can actually put these two instructions in the same cycle. Fig. 3a shows instruction 6 and instruction 2 before and after software pipelining performs Register Allocation.
In cycle I_5, instruction 2 and instruction 3 were already placed in the same cycle by the flat schedule, which shows that there is no dependence between them. Instruction 6 and instruction 1 may be in the following situation. Instruction 1 also has a read-after-write (RAW) dependence on instruction 6: as shown in Fig. 3b, the source operand of st1 uses the result operand of adds. By ordinary scheduling methods we could not put them in the same cycle. However, the information provided by modulo scheduling tells us that instruction 6 and instruction 1 are not in the same stage: the former comes from stage 2 and the latter from stage 0. In this situation, too, SWP can eliminate the dependence between them by register renaming in the Register Allocation performed after bundling, see [B.R. Rau, M. Lee, P.P. Tirumalai, M.S. Schlansker. Register Allocation for Software Pipelined Loops. Proc. of the SIGPLAN '92 Conf. on Programming Language Design and Implementation. 1992, 6: 283-299]. So when their stages differ, they can in fact also be put in the same cycle. Fig. 3b shows instruction 6 and instruction 1 before and after software pipelining performs Register Allocation.
In summary, to support "back edges", when micro-scheduling uses the reorder (Reorder) and negotiate (Negotiate) techniques to arrange the instruction set that modulo scheduling considers issuable in the same cycle, it must consider not only the dependences between instructions but also the omega value on each arc and the stage of each instruction. Only then can it cooperate with the Register Allocation performed after bundling and obtain better optimized performance.
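The combined test on dependence, latency, omega and stage can be sketched as a single predicate. The `Arc` class, the dictionary encoding of the DDG and the name `may_share_cycle` are assumptions for illustration; the four conditions themselves are the ones the text describes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    omega: int    # number of loop iterations the dependence spans
    latency: int  # cycles before the result can be used

def may_share_cycle(inst, op, ddg, stage):
    """True if inst (already placed) and op may be issued in the same cycle."""
    arc = ddg.get((inst, op))
    if arc is None:           # no data dependence between the two
        return True
    if arc.latency == 0:      # zero-latency dependence
        return True
    if arc.omega != 0:        # loop-carried; left to register renaming
        return True
    return stage[inst] != stage[op]  # different stages; also renamed later

# Hypothetical data mirroring the discussion of cycle I_5.
DDG = {(6, 2): Arc(1, 1), (6, 1): Arc(0, 1), (2, 3): Arc(0, 1)}
STAGE = {1: 0, 2: 1, 3: 1, 6: 2}
```

Here the pairs (6, 2) and (6, 1) are accepted, the first because omega is 1 and the second because the stages differ, while the invented arc (2, 3) with omega 0 inside one stage is rejected.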
The micro-scheduling method supporting directed cyclic graphs in the embodiment of the invention is described in detail below with reference to Fig. 4. As shown in Fig. 4, the method comprises the following steps:
Step 101, compute the stage number (Stage) of every instruction in the instruction set (ops_list), where stage=location/II; ops_list is the set of instructions that modulo scheduling considers issuable in the same cycle, location is the position in the flat schedule, and II is the initiation interval;
Step 103, determine whether the instruction set (ops_list) is empty; if so, go to step 123; if not, go to the next step;
Step 105, determine whether the current machine state space (cur_state) is full or all instructions (op) carry the tried mark, i.e. have already been selected; if so, go to the next step; if not, go to step 109;
Step 107, complete the template assignment for the machine state space prev_state of the previous cycle, update the absolute slot values of the instructions in prev_state, assign the machine state space of the current cycle to the state space of the previous cycle, and clear the current machine state space, i.e. prev_state=cur_state, cur_state=NULL;
Step 109, select an instruction op from the instruction set ops_list, assign it a functional unit, and assign the current machine state space cur_state to the test space temp_state of the current cycle;
Step 111, according to the data dependence graph, check the dependence between each instruction inst in the current machine state space and the instruction op, that is, decide whether a dependence prevents placement by determining whether one of the following four cases holds; if so, go to step 115; if not, go to the next step;
In step 111, the four cases are: there is no data dependence between the instruction inst in the machine state space of the current cycle and the instruction op; the latency on the arc between inst and op in the data dependence graph is 0, i.e. Arc_latency(inst, op)==0; the Omega on the arc from inst to op in the data dependence graph is not 0, i.e. Omega(inst, op)!=0; the stage of inst in the modulo schedule differs from the stage of op, i.e. Stage(inst)!=Stage(op);
Step 113, set feasible to false and go to step 121, where feasible is the logical variable that records whether an intra-cycle dependence exists;
Step 115, add the instruction op to the test space temp_state of the current cycle, set feasible to TRUE, call Bundle_Helper to find a template for the instructions in temp_state, and perform the FSA state transition, where Bundle_Helper is the existing template matching function of micro-scheduling; within the current cycle it attempts a state transition through the finite-state automaton (FSA: Finite State Automata) in order to find a suitable template for the instructions in the temp_state state space;
Step 117, determine whether the FSA state transition succeeded; if so, go to the next step; if not, go to step 121;
Step 119, the test succeeded: assign the value of temp_state to cur_state, i.e. cur_state=temp_state, update the absolute slot value of the instruction op, delete op from the instruction set ops_list, and go to step 103;
Step 121, mark the instruction op as tried in cur_state, and go to step 105;
Step 123, micro-scheduling ends.
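The control flow of steps 101 to 123 can be sketched as the following loop. This is one reading of the flowchart under explicit assumptions rather than the ORC implementation: `fits` is a hypothetical stand-in for Bundle_Helper and the FSA transition, the DDG is encoded as a dictionary of (omega, latency) pairs, `capacity` models a full state space as six slots, the tried marks are assumed to be cleared when a new cycle starts, and a guard not present in the flowchart is added so the sketch cannot loop forever. Functional-unit assignment, template assignment and absolute slot values are omitted.

```python
def _independent(inst, op, ddg, stage):
    """The four cases of step 111 for one already-placed instruction."""
    arc = ddg.get((inst, op))
    if arc is None:
        return True
    omega, latency = arc
    return latency == 0 or omega != 0 or stage[inst] != stage[op]

def micro_schedule(ops_list, ddg, stage, fits, capacity=6):
    """Return a list of cycles, each a list of ops in placement order.

    stage is assumed to be precomputed as location // II (step 101).
    """
    pending = list(ops_list)
    cycles, cur_state, tried = [], [], set()
    while pending:                                           # step 103
        if len(cur_state) >= capacity or tried >= set(pending):  # step 105
            if not cur_state:  # added guard, not part of the flowchart
                raise ValueError("no pending op fits an empty cycle")
            cycles.append(cur_state)                         # step 107
            cur_state, tried = [], set()
        op = next(o for o in pending if o not in tried)      # step 109
        temp_state = list(cur_state)
        feasible = all(_independent(inst, op, ddg, stage)
                       for inst in cur_state)                # steps 111, 113
        if feasible:
            temp_state.append(op)                            # step 115
            if fits(temp_state):                             # step 117
                cur_state = temp_state                       # step 119
                pending.remove(op)
                continue
        tried.add(op)                                        # step 121
    if cur_state:
        cycles.append(cur_state)
    return cycles                                            # step 123

# Hypothetical input: instruction 7 is invented to force a second cycle.
DDG = {(6, 2): (1, 1), (6, 1): (0, 1), (2, 7): (0, 1)}
STAGE = {6: 2, 2: 1, 3: 1, 1: 0, 7: 1}
result = micro_schedule([6, 2, 3, 1, 7], DDG, STAGE, lambda s: len(s) <= 6)
```

With this input, instructions 6, 2, 3 and 1 end up in the first cycle, and instruction 7 is pushed to a second cycle because it depends on instruction 2 with omega 0 inside the same stage.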
In summary, to support "back edges", when micro-scheduling in the present invention uses the reorder (Reorder) and negotiate (Negotiate) techniques to arrange the instruction set that modulo scheduling considers issuable in the same cycle, it considers not only the dependences between instructions but also the omega value on each arc and the stage of each instruction. Only then can it cooperate with the Register Allocation performed after bundling and obtain better optimized performance. The method avoids the split issue problem (Split issue) arising in modulo scheduling for software pipelining, reduces the likelihood of instruction cache misses (I-Cache miss), and improves the efficiency of parallel compilation, thereby further improving compiler optimization performance.