CN1670699A - A micro-dispatching method supporting directed cyclic graph - Google Patents

Publication number
CN1670699A
Authority
CN
China
Prior art keywords
instruction
state
machine
state space
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410029453
Other languages
Chinese (zh)
Other versions
CN1306401C (en)
Inventor
文严治
连瑞琦
刘章林
吴承勇
张兆庆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hezhong Data Technology Co ltd
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CNB2004100294533A priority Critical patent/CN1306401C/en
Publication of CN1670699A publication Critical patent/CN1670699A/en
Application granted granted Critical
Publication of CN1306401C publication Critical patent/CN1306401C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

The invention relates to a micro-scheduling method that supports directed cyclic graphs. It uses the reorder and negotiate techniques to arrange the set of instructions that modulo scheduling considers issuable in the same cycle, taking into account not only the dependences between instructions but also the values on the dependence arcs and the stage of each instruction, and thereby supports loop back edges.

Description

A micro-scheduling method supporting directed cyclic graphs
Technical field
The present invention relates to instruction-level parallel compilation and micro-scheduling, and in particular to a micro-scheduling method that supports a "directed cyclic graph" containing "back edges", together with the corresponding compiler optimization techniques.
Background technology
Instruction scheduling is an important phase of machine-dependent optimization. Successful instruction scheduling must satisfy data dependences, control dependences, structural hazards, and other constraints, and it improves resource utilization and instruction-level parallelism by reordering instructions. A structural hazard is a resource conflict that may occur when instructions execute in parallel. Resolving such hazards requires querying and occupying resource states, querying the machine state, obtaining the latency between instructions, and so on, so the machine model must be accessed frequently [Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997].
In addition, some machines impose special requirements on the order in which instructions are issued, for example Intel's first- and second-generation processors based on the IA-64 architecture, Itanium and McKinley (that is, Itanium 2). Both require instructions to be issued according to special templates such as MII, where MII means that the first slot holds a Memory instruction and the following two must be Integer instructions. Instruction scheduling therefore has to take more complicated factors into account, which also makes the code hard to port.
Given the dual requirements of performance and portability, we separate the part of instruction scheduling that deals with structural hazards and other restrictions into a new module, called micro-scheduling [Dong-Yuan Chen, Lixia Liu, Chen Fu, Shuxin Yang, Chengyong Wu, Roy Ju. Efficient Resource Management during Instruction Scheduling for the EPIC Architecture. Parallel Architectures and Compilation Techniques. 2003, 9: 36-45]. It encapsulates the details of the target machine's microarchitecture. On the one hand, it gives other machine-dependent optimizations an interface for accessing machine parameters and machine state, which makes the compiler machine-independent to a certain extent and lets it adapt quickly when the target hardware is modified or replaced. On the other hand, it also improves the flexibility and portability of instruction scheduling.
First, the scope of micro-scheduling is small. For machine state transitions, only the state changes of the current cycle need to be tracked; when special restrictions are added, at most the states of the few preceding cycles need to be considered. The scheduling scope is therefore just these few cycles. Second, functional units must be assigned to instructions and instruction reordering must be considered, which requires work similar to scheduling within a single cycle.
To satisfy structural-hazard constraints, micro-scheduling must also know which resources the machine currently occupies; once a suitable instruction has been selected for issue, the state changes accordingly. This process of machine state change resembles the state transitions of a finite-state automaton.
Instruction scheduling usually resolves structural hazards in one of two ways: backward detection or forward detection. The backward method records the instructions already issued in the current cycle and, for each instruction waiting to issue, checks whether it conflicts over resources with those instructions or violates some other restriction. If so, the instruction may not issue in the current cycle; otherwise it may. The forward method records the machine state after each issued instruction and, for a new candidate instruction, checks whether a legal next state can be reached from the current machine state. If not, the instruction may not issue; otherwise it may. One method looks backward at the instructions already issued; the other looks forward and judges whether the next state is legal [T. Muller. Employing finite automata for resource scheduling. In the 26th Annual International Symposium on Microarchitecture, Austin, Dec 1993: 12-20]. The backward method needs complicated comparisons and checks for every instruction, so it is slow and algorithmically complex, but it can handle special cases and uses little space. The forward method works by state transitions, so it is fast and simple, but it has to model every state, uses a lot of space, and has difficulty handling special cases. Micro-scheduling combines the advantages of both in a hybrid method that depends on the processor structure.
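The contrast between backward and forward detection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the machine with two memory ports and one integer unit, and the names CAPACITY, backward_check, build_fsa and forward_check, are hypothetical and do not come from ORC.

```python
from itertools import product

# Hypothetical machine: two memory ports and one integer unit per cycle.
CAPACITY = {"M": 2, "I": 1}

def backward_check(issued, candidate):
    """Backward detection: re-scan the units already issued in this cycle."""
    used = sum(1 for unit in issued if unit == candidate)
    return used < CAPACITY[candidate]

def build_fsa():
    """Forward detection: precompute every legal state and transition."""
    transitions = {}
    for m, i in product(range(CAPACITY["M"] + 1), range(CAPACITY["I"] + 1)):
        state = (m, i)  # (memory ports in use, integer units in use)
        if m < CAPACITY["M"]:
            transitions[(state, "M")] = (m + 1, i)
        if i < CAPACITY["I"]:
            transitions[(state, "I")] = (m, i + 1)
    return transitions

FSA = build_fsa()

def forward_check(state, candidate):
    """Return the next legal state, or None if issuing is illegal."""
    return FSA.get((state, candidate))
```

The backward check costs a scan per candidate but stores nothing, while the forward check is a single table lookup paid for by enumerating all states in advance, which mirrors the speed and space trade-off described above.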
The idea of micro-scheduling has already been applied successfully in ORC [Open Research Compiler (ORC). http://ipf-orc.sourceforge.net. 2002]. However, it only implements scheduling for a "directed acyclic graph" and does not yet support a "directed cyclic graph", so it cannot be applied in software pipelining to schedule instructions with "back edges". On the other hand, the part of ORC's software pipelining (SWP: Software Pipeline) module that arranges instruction bundles (bundling) uses a simple algorithm that is not micro-scheduling [Richard A. Huff. Lifetime-sensitive modulo scheduling. ACM SIGPLAN Notices, 1993, 28(6): 258-267]; besides being inflexible, it misses many potential opportunities to improve performance.
As a loop scheduling method, software pipelining (SWP) effectively exploits the instruction-level parallelism (Instruction Level Parallelism, ILP) hidden in a program. It gains speed by overlapping the execution of different loop iterations: the next iteration can start before the current one has finished, and the interval between the two starts is called the initiation interval (II). Software pipelining currently has two main techniques: move-then-schedule (also called code motion) and schedule-then-move. The former moves instructions one by one across the loop back edge (back-edge). The latter builds the final schedule directly from scratch. The schedule-then-move family in turn has two major members: unroll-based scheduling and modulo scheduling. The former performs both loop unrolling (unroll) and instruction scheduling; the latter schedules only one iteration of the loop so that repeating the same schedule at a fixed interval causes neither resource conflicts nor dependence conflicts. Modulo scheduling is currently the most popular way to implement software pipelining [Josep M. Codina, Josep Llosa, Antonio González. A Comparative Study of Modulo Scheduling Techniques. Proceedings of the 16th International Conference on Supercomputing. 2002, 7: 97-106]. ORC adopts the Slack Modulo Scheduling proposed by Richard A. Huff [Richard A. Huff. Lifetime-sensitive modulo scheduling. ACM SIGPLAN Notices, 1993, 28(6): 258-267]. Bundling follows immediately after modulo scheduling. The Itanium architecture can issue two instruction bundles (bundle) per cycle; each bundle holds three instructions, and they must be issued according to a template. Itanium specifies 16 instruction templates (MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB, RESERVED_A, RESERVED_D, RESERVED_3, RESERVED_F). We call the three instruction positions of a template the three slots (slot0, slot1, slot2). The bundling part of SWP in ORC comes from Pro64. Its basic idea is as follows:
SWP_Pack_A_Bundle(ops_list)  // ops_list: the instruction set that modulo scheduling considers placeable in one cycle
{
    while (!ops_list.empty()) {                      // ops_list is not empty
        for (each slot in current bundle) {          // for each slot of the current bundle
            for (every slot_kind on each slot) {     // for each instruction type that may be placed in the slot
                for (each op in ops_list) {          // for each op in ops_list
                    if (!SWP_Violates_Dependencies(ops_list.first_op(), op)) {
                        // op has no dependence on the ops before it that are not yet placed
                        SWP_Bundle_Next_In_Group(op, slot, slot_kind);  // place op, as type slot_kind, into slot
                        ops_list.erase(op);          // delete the op that has been placed
                    }
                }
            }  // end for (every slot_kind ...)
        }      // end for (each slot ...)
    }          // while
}
In the algorithm, ops_list is the instruction set that modulo scheduling considers placeable in one clock cycle. When looking for the instruction type to place in each instruction slot, the algorithm follows only a linear priority order. For example, when choosing what can be placed in slot1 of a bundle, it assigns the slot kind (slot_kind) in the reference order F>I>M>B. Moreover, once an earlier slot has been filled, a later instruction can never evict it, even if the later instruction then cannot find a suitable template. In practice, evicting an already placed instruction often uncovers potential optimizations. For example, the six instructions adds, shr_i.u, add, lfetch, xor, ld8_i (of types M/I, I, M/I, M, M/I, M respectively) could be placed in one cycle using MII, MMI (ld8_i, shr_i.u, add; lfetch, xor, adds), but the linear priority algorithm in SWP bundling arranges them as MII, MMF+MFB (adds, shr_i.u, add; lfetch, xor, nop.f + ld8_i, nop.f, nop.b), which takes two cycles. In other words, modulo scheduling assumed the instructions would fit in one cycle, but in fact they did not. We call this situation the split issue (Split issue) [Intel Corp. Itanium Processor Microarchitecture Reference. http://developer.intel.com/design/itanium/arch spec.htm, 2000].
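The six-instruction example can be checked with a small exhaustive search. The sketch below is illustrative only: it matches slot types against template letters and nothing else, so it ignores dependences, stops, and the processor's dispersal rules, and the names OPS, fits and pack are hypothetical.

```python
from itertools import permutations

# Slot types each instruction may occupy, as listed in the text.
OPS = {
    "adds": "MI", "shr_i.u": "I", "add": "MI",
    "lfetch": "M", "xor": "MI", "ld8_i": "M",
}

def fits(order, templates):
    """True if the ops, in this order, fill every template slot without NOPs."""
    slots = "".join(templates)  # ("MII", "MMI") -> "MIIMMI"
    return len(order) == len(slots) and all(
        slot in OPS[name] for name, slot in zip(order, slots)
    )

def pack(templates):
    """Try every ordering; return the first one that fits, else None."""
    for order in permutations(OPS):
        if fits(order, templates):
            return order
    return None

packing = pack(("MII", "MMI"))
```

Because the search is free to reorder the instructions, it finds a NOP-free packing for MII plus MMI, whereas a fixed linear priority that never revisits a filled slot can miss it.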
On the other hand, this algorithm not only adds many no-op instructions (NOP), which raises the likelihood of instruction cache misses (I-Cache miss); when filling in NOPs it also prefers nop.f over nop.i, which raises the likelihood of stalls (stall) at run time [Intel Corp. Intel Itanium2 Processor Reference Manual. 2002, 7: 53-54]. When the basic block (BB: Basic Block) containing these instructions is executed very frequently, and the instructions appear repeatedly in that block because SWP has performed loop unrolling (Loop unrolling), the performance of the optimized code suffers noticeably. This very example directly caused bzip2 from SPEC2000 (ISET=ref), compiled with the ORC peak options, to lose about 2% in run-time performance.
In summary, modulo scheduling in prior-art software pipelining suffers from the split issue and raises the likelihood of instruction cache misses, which lowers compilation efficiency and thus the performance of compiler optimization.
Summary of the invention
The technical problem to be solved by the present invention is to provide a micro-scheduling method supporting directed cyclic graphs that avoids the split issue (Split issue) arising in software-pipelining modulo scheduling, reduces the likelihood of instruction cache misses (I-Cache miss), and improves parallel compilation efficiency, thereby further improving compiler optimization performance.
To solve the above technical problem, the invention provides a micro-scheduling method supporting directed cyclic graphs, characterized by comprising the following steps:
a) Compute the stage number of every instruction in the instruction set;
b) Determine whether the instruction set is empty; if so, go to step l); if not, go to the next step;
c) Determine whether the current machine state space is full or every instruction has already been selected; if so, go to the next step; if not, go to step e);
d) Complete the template assignment for the machine state space of the previous cycle, update the absolute slot values of the instructions in the machine state space of the previous cycle, assign the machine state space of the current cycle to the state space of the previous cycle, and clear the current machine state space;
e) Select an instruction from the instruction set, assign it a functional unit, and copy the current machine state space to the test space of the current cycle;
f) Using the data dependence graph, check the dependence between each instruction in the current machine state space and the instruction selected in step e), that is, determine whether one of the following four cases holds; if so, go to step h); if not, go to the next step. The four cases are: there is no data dependence between the instruction in the machine state space of the current cycle and the selected instruction; the latency on the arc between the instruction and the selected instruction in the data dependence graph is 0; the loop iteration distance on the arc from the instruction to the selected instruction in the data dependence graph is not 0; the stage of the instruction in the modulo schedule differs from the stage of the selected instruction;
g) Set feasible to false, then go to step k), where feasible is a logical variable indicating whether a dependence exists within the cycle;
h) Add the instruction op to the test space of the current cycle, set feasible to TRUE, call the template-matching function to find a template for the instructions in the test space of the current cycle, and perform the state transition of the finite-state automaton;
i) Determine whether the state transition of the finite-state automaton succeeded; if so, go to the next step; if not, go to step k);
j) The test has succeeded: assign the test space of the current cycle to the machine state space of the current cycle, update the absolute slot value of the selected instruction, delete the selected instruction from the instruction set, then go to step b);
k) Mark the selected instruction as selected in the machine state space of the current cycle, then go to step c);
l) Micro-scheduling ends.
In the above technical solution, the instruction set is the set of instructions that modulo scheduling considers issuable in the same cycle.
In the above technical solution, the stage number of each instruction in step a) is the position of that instruction in the flat schedule divided by the initiation interval.
In the above technical solution, the absolute slot value denotes the absolute slot position of an instruction after the method has finished.
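The four cases of step f) can be read as a single predicate on one pair of instructions. The following Python sketch assumes a hypothetical Arc record holding the two arc labels and a hypothetical function name may_share_cycle; it illustrates the rule as stated in the text and is not the ORC implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Arc:
    latency: int  # latency on the arc in the data dependence graph
    omega: int    # number of loop iterations the dependence spans

def may_share_cycle(arc, stage_inst, stage_op):
    """True if op may be placed in the same cycle as inst (step f)."""
    if arc is None:                # case 1: no data dependence at all
        return True
    if arc.latency == 0:           # case 2: zero-latency dependence
        return True
    if arc.omega != 0:             # case 3: the dependence is loop-carried
        return True
    return stage_inst != stage_op  # case 4: the two come from different stages
```

Only a dependence that has nonzero latency, stays within one iteration, and links two instructions of the same stage forces the pair into different cycles.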
As can be seen from the above, the micro-scheduling method supporting directed cyclic graphs of the present invention uses the reorder (Reorder) and negotiate (Negotiate) techniques to arrange the set of instructions that modulo scheduling considers issuable in the same cycle. Besides the dependences between instructions, it also takes into account the values on the dependence arcs and the stage of each instruction, and thereby supports "back edges". It avoids the split issue (Split issue) arising in software-pipelining modulo scheduling, reduces the likelihood of instruction cache misses (I-Cache miss), and improves parallel compilation efficiency, thereby further improving compiler optimization performance.
Description of drawings
Fig. 1a is a pseudocode diagram of the loop body in an embodiment of the invention;
Fig. 1b shows the first three iterations after the loop body is unrolled in an embodiment of the invention;
Fig. 1c is the data dependence graph within one iteration in an embodiment of the invention;
Fig. 1d shows the execution schedule of the loop in an embodiment of the invention;
Fig. 2a is a diagram of the flat schedule of one iteration (II=2) in an embodiment of the invention;
Fig. 2b is a diagram of pipelined execution in an embodiment of the invention;
Fig. 3a shows the situation before and after register allocation when the loop iteration distance is nonzero in an embodiment of the invention;
Fig. 3b shows the situation before and after register allocation when the instruction stages differ in an embodiment of the invention;
Fig. 4 is a flowchart of the micro-scheduling method supporting directed cyclic graphs in an embodiment of the invention.
Explanation of the drawings:
In Fig. 1d, latency=1, omega=1, II=1; I1 to I7 denote the 1st through 7th cycles, respectively.
In Fig. 3, TN denotes a temporary variable (Temporary Name); GTN denotes a global temporary variable (Global Temporary Name).
In Fig. 4, the symbols in the flowchart have the following meanings: ops_list is the set of instructions that modulo scheduling considers issuable in the same cycle;
Stage is the stage of an instruction in the modulo schedule (stage of op = position (location) of op in the flat schedule / initiation interval (II));
NULL denotes the empty set;
prev_state and cur_state denote the machine state spaces of the previous cycle and the current cycle respectively, and temp_state denotes the test space for the machine state of the current cycle; if the test succeeds, the value of temp_state is assigned to cur_state;
Tried is a Boolean variable; a value of true indicates that this op has already been selected once in cur_state;
The absolute slot value denotes the absolute slot position of an instruction after the MSMDDG algorithm finishes;
op denotes an instruction selected from ops_list;
inst is any instruction in cur_state;
Arc_latency(inst, op) denotes the latency on the arc from instruction inst to instruction op in the DDG;
Omega(inst, op) denotes the number of loop iterations spanned by the dependence from instruction inst to instruction op in the DDG;
Stage(inst) denotes the stage of instruction inst in the modulo schedule (that is, location/II); this value is computed once at the very beginning of the program;
feasible is a logical variable indicating whether a dependence exists within the cycle;
Bundle_Helper is the existing template-matching function in micro-scheduling; within the current cycle it attempts a state transition through the finite-state automaton (FSA: Finite State Automata) to find a suitable template for the instructions in the temp_state state space.
Embodiment
As mentioned above, modulo scheduling belongs to the schedule-then-move family of software pipelining. Because it overlaps instructions from different iterations across the loop back edge (back-edge), the "data dependence graph" (DDG: Data Dependence Graph) it builds is not a traditional "directed acyclic graph" (DAG: Directed Acyclic Graph) but a "directed cyclic graph" (DCG: Directed Cyclic Graph).
It contains not only the dependences between instructions of the same iteration (iteration) of the loop, but also the loop-carried dependences (loop-carried dependence) between different iterations. It is the latter that, through loop back edges, turn the traditional DAG into a "directed cyclic graph" [Vicki H. Allan, Reese B. Jones, Randall M. Lee, Stephen J. Allan. Software pipelining. ACM Computing Surveys (CSUR). 1995, 9 (27): 376-432]. (If every path from the start node of the DDG to node a passes through node b, node b is said to dominate node a (b dom a). If b dom a, the edge a → b is called a "back edge"; an example is the edge from node 1 to itself in Fig. 1c.) We denote this "directed cyclic graph" as DDG(N, A), where N is the set of nodes representing all instructions and A is the set of arcs representing all dependences. Every arc is labeled with an ordered pair (omega, latency). The iteration distance (omega) is the number of loop iterations the dependence spans; the latency is the time that must elapse before the result produced by the first instruction can be used by the second.
To illustrate how micro-scheduling in ORC supports the DDG, consider Fig. 1: the loop body in Fig. 1a becomes Fig. 1b after being unrolled three times, and we see dependences not only within one iteration but also between different iterations. These dependences are fully captured by the DDG in Fig. 1c. The ordered pair (0, 1) on the arc from node 1 to node 2 denotes a dependence with latency=1 within the same iteration; the ordered pair on the back edge from node 1 to itself is (1, 1), meaning that the first instruction of the current iteration (iteration) and the first instruction of the next iteration also have a dependence with latency=1. Because modulo scheduling guarantees II*omega>=latency when it computes the initiation interval (II), all dependences, both within an iteration and between iterations, have been taken into account once scheduling with the chosen II is complete.
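The inequality II*omega>=latency can be checked arc by arc. The sketch below is a simplification under stated assumptions: it looks only at individual loop-carried arcs (omega > 0), whereas a full recurrence bound would sum latency and omega around every cycle of the graph, and it ignores resource limits. The arc list and function names are hypothetical; the two arcs mirror the ordered pairs (0, 1) and (1, 1) described above.

```python
import math

# DDG arcs as (src, dst, omega, latency).
ARCS = [
    (1, 2, 0, 1),  # dependence within one iteration, ordered pair (0, 1)
    (1, 1, 1, 1),  # back edge on node 1, ordered pair (1, 1)
]

def min_ii_for_carried_arcs(arcs):
    """Smallest II with II * omega >= latency on every loop-carried arc."""
    bounds = [math.ceil(lat / omega) for _, _, omega, lat in arcs if omega > 0]
    return max(bounds, default=1)

def satisfies(ii, arcs):
    """Check the inequality on every loop-carried arc for a given II."""
    return all(ii * omega >= lat for _, _, omega, lat in arcs if omega > 0)
```

Arcs with omega = 0 are left out on purpose: they are ordered by the flat schedule of a single iteration rather than by the choice of II.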
In other words, on the one hand, when modulo scheduling builds the "flat schedule" (Flat schedule: the schedule of a single iteration), it considers the dependences between successive instructions within the iteration (instructions 2 and 3 in Fig. 2a are placed in the same cycle by the flat schedule, which shows that there is no dependence between them). On the other hand, because the data dependence graph has cycles, when modulo scheduling schedules an instruction it must also consider the dependences between that instruction and all instructions in the flat schedule whose position has the same remainder modulo II, because they execute at the same time. In Fig. 2b, the instructions in I1, I3, I5 execute simultaneously (I_i: i mod 2 = 1, where i is the position (location) of the instruction in the flat schedule), and the instructions in I2, I4, I6 execute simultaneously (I_i: i mod 2 = 0). The instructions in a box are the instructions in the loop when the pipeline is in its steady state; each box is called a stage, and its length equals II. The stage of an instruction is stage=location/II. Instruction 6 and instruction 1 in cycle I5 come from stage2 and stage0 respectively.
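The relation between a flat-schedule position, its remainder modulo II, and its stage can be made concrete with a short sketch. The flat schedule below is hypothetical (0-based cycle numbers chosen for illustration, not read off Fig. 2), as is the function name rows_and_stages.

```python
from collections import defaultdict

def rows_and_stages(flat_schedule, ii):
    """Group instructions by location mod II and record each one's stage."""
    rows = defaultdict(list)
    stages = {}
    for name, location in flat_schedule.items():
        rows[location % ii].append(name)  # same remainder: executes together
        stages[name] = location // ii     # stage = location / II, rounded down
    return dict(rows), stages

# Hypothetical flat schedule of one iteration with II = 2.
FLAT = {"op1": 0, "op2": 2, "op3": 2, "op4": 3, "op5": 3, "op6": 4}
rows, stages = rows_and_stages(FLAT, 2)
```

With these assumed positions, op1, op2, op3 and op6 share remainder 0 and so meet in the same steady-state cycle, even though they belong to stages 0, 1, 1 and 2.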
After modulo scheduling finishes, micro-scheduling takes over the bundling. To support directed cyclic graphs, micro-scheduling must consider two factors: omega and stage.
Shown in Fig. 2 b, Cycle I 5In the position mould II remainder of instruction 6,2,3,1 in flat scheduling identical, they send in same cycle in the EPIC structure together.Presumptive instruction 6 and instruct write-after-read is arranged between 2 dependence (example) as follows of (RAW:Read After Write), according to common dispatching method, we can not be put into them within the same cycle, because a back adds will use the result of previous adds, must wait for a cycle (latency=1).But can learn that from directed cyclic graph omega=1 in the weights on the arc between them also promptly instructs 2 from the next iteration.=0 situation, the its registers (Register Allocation) of SWP after bundling can be eliminated the correlativity of this RAW easily by register renaming, referring to [B.R.Rau, M.Lee, P.P.Tirumalai, M.S.Schlansker.Register Allocation for Software PipelinedLoops.Proc.of the SIGPLAN ' 92 Conf.on Programming Language Design andImplementation.1992,6:283 ∽ 299].In fact the original software flow part of ORC is also so done.So when omega>=1, we are actually and can be put into same cycle to these two instructions, Fig. 3 a shows instruction 6 and the situation of instruction 2 before and after software flow is RegisterAllocation.
Instruction 2 and instruction 3 in cycle I5 were already placed in the same cycle by flat scheduling, which shows that there is no dependence between them. Instruction 6 and instruction 1 may be in the following situation. The dependence between instruction 6 and instruction 1 is again read-after-write (RAW): as shown in Fig. 3b, the source operand of st1 uses the result operand of adds. Under an ordinary scheduling method we could not put them in the same cycle. The information supplied by modulo scheduling, however, tells us that instruction 6 and instruction 1 are not in the same stage: the former comes from stage2 and the latter from stage0. In this situation, too, SWP can eliminate the dependence between them through register renaming during the Register Allocation that follows bundling, see [B. R. Rau, M. Lee, P. P. Tirumalai, M. S. Schlansker. Register Allocation for Software Pipelined Loops. Proc. of the SIGPLAN '92 Conf. on Programming Language Design and Implementation. 1992, 6: 283-299]. So when their stages differ, they can in fact also be put in the same cycle. Fig. 3b shows instruction 6 and instruction 1 before and after the Register Allocation of software pipelining.
In summary, to support "back edges", micro-scheduling uses the reorder (Reorder) and negotiate (Negotiate) techniques to arrange the set of instructions that modulo scheduling considers issuable in the same cycle; besides the dependences between instructions, it also considers the omega value on each arc and the stage of each instruction. Only then can it cooperate with the Register Allocation that follows bundling and obtain better optimized performance.
The micro-scheduling method supporting directed cyclic graphs in an embodiment of the invention is described in detail with reference to Fig. 4. As shown in Fig. 4, the method comprises the following steps:
Step 101, the value of series (Stage) of every instruction in the computations collection (ops_list); Wherein, stage=location/II; Ops_list is that modulo scheduling is thought and can be placed on the instruction set that sends among the same cycle, and location is the position in flat scheduling, and II is for starting spacing;
Are step 103, decision instruction collection (ops_list) empty? if, execution in step 123; If not, carry out next step;
Step 105 judges that whether full machine current state space (cur_state) or all instructions (op) all have the tried sign, i.e. whether instruction selected mistake? if, carry out next step, if not, execution in step 109;
Step 107: finalize the template assignment for the previous cycle's machine state space prev_state and update the absolute slot values of the instructions in prev_state; then assign the current cycle's machine state space to the previous cycle's state space and clear the current machine state space, i.e., prev_state = cur_state, cur_state = NULL;
Step 109: select an instruction op from the instruction set ops_list, assign it a functional unit, and copy the current machine state space cur_state to the current cycle's test space temp_state;
Step 111: according to the data dependence graph, check the dependence between each instruction inst in the current machine state space and the instruction op, that is, determine whether one of the following four cases holds. If so, execute step 115; if not, proceed to the next step;
In step 111, the four cases are: no data dependence exists between any instruction inst in the current cycle's machine state space and the instruction op; the latency on the arc from inst to op in the data dependence graph is 0, i.e., Arc_latency(inst, op) == 0; the omega on the arc from inst to op in the data dependence graph is not 0, i.e., Omega(inst, op) != 0; the stage of inst in the modulo schedule differs from the stage of op, i.e., Stage(inst) != Stage(op);
Step 113: set feasible to FALSE, then execute step 121, where feasible is the logical variable used to judge whether an intra-cycle dependence exists;
Step 115: add the instruction op to the current cycle's test space temp_state, set feasible to TRUE, and call Bundle_Helper to find a template for the instructions in temp_state and perform the FSA state transition. Bundle_Helper is the existing template-matching function in micro-scheduling; within the current cycle it attempts a state transition through the finite state automaton (FSA) to find a suitable template for the instructions in temp_state;
Step 117: determine whether the FSA state transition succeeded. If so, proceed to the next step; if not, execute step 121;
Step 119: the test succeeded; assign temp_state to cur_state, i.e., cur_state = temp_state, update the absolute slot value of the instruction op, delete op from the instruction set ops_list, then execute step 103;
Step 121: mark the instruction op as tried in cur_state, then execute step 105;
Step 123: micro-scheduling ends.
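The loop formed by steps 107 to 121 can be illustrated with a short Python sketch. It is only one possible reading of the steps, not the patented implementation. The names micro_schedule, can_share_cycle, arcs, stage and fits_template are hypothetical; fits_template is a stand-in for Bundle_Helper and the FSA transition; the "state space is full" test is folded into the "all instructions tried" test; and absolute slot bookkeeping is omitted. The sketch also assumes that op may join the cycle only if one of the four cases holds for every inst already in cur_state.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical arc table: (inst, op) -> (latency, omega) for the arc inst -> op.
Arcs = Dict[Tuple[str, str], Tuple[int, int]]


def can_share_cycle(inst: str, op: str, arcs: Arcs, stage: Dict[str, int]) -> bool:
    """Return True if one of the four cases of step 111 holds for (inst, op)."""
    arc = arcs.get((inst, op))
    if arc is None:
        return True                      # case 1: no data dependence
    latency, omega = arc
    return (latency == 0                 # case 2: Arc_latency(inst, op) == 0
            or omega != 0                # case 3: Omega(inst, op) != 0
            or stage[inst] != stage[op])  # case 4: different stages


def micro_schedule(ops_list: List[str], arcs: Arcs, stage: Dict[str, int],
                   fits_template: Callable[[List[str]], bool]) -> List[List[str]]:
    """Group instructions into issue cycles, following steps 107 to 121."""
    ops = list(ops_list)
    cycles: List[List[str]] = []         # finished cycles (role of prev_state)
    cur_state: List[str] = []
    tried: set = set()                   # "tried" marks kept for cur_state
    while ops:
        candidates = [o for o in ops if o not in tried]
        if not candidates:               # step 105: everything has been tried
            if not cur_state:
                # Assumption: guard against looping forever on an
                # instruction that fits no template even on its own.
                raise ValueError("instruction fits no template on its own")
            cycles.append(cur_state)     # step 107: close the current cycle
            cur_state, tried = [], set()
            continue
        op = candidates[0]               # step 109: pick an instruction
        temp_state = cur_state + [op]
        feasible = all(can_share_cycle(i, op, arcs, stage) for i in cur_state)
        if feasible and fits_template(temp_state):   # steps 115 and 117
            cur_state = temp_state       # step 119: accept the trial
            ops.remove(op)
        else:
            tried.add(op)                # step 121: mark op as tried
    if cur_state:
        cycles.append(cur_state)
    return cycles
```

For example, with a plain dependence a -> b (latency 1, omega 0) and a loop-carried arc a -> c (latency 1, omega 1), a two-slot template check places a and c in one cycle and defers b to the next, because the nonzero omega on a -> c means the dependence does not bind within the same iteration.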
In summary, to support "back edges", the micro-scheduler of the present invention uses the reorder technique and the negotiate technique to arrange the set of instructions that modulo scheduling considers issuable in the same cycle. In addition to the dependences between instructions, it also considers the omega value on each arc and the stage in which each instruction resides. Only in this way can it cooperate with register allocation after bundling and obtain better optimization performance. The method avoids the split issue that arises in software-pipelining modulo scheduling, reduces the likelihood of instruction cache (I-Cache) misses, improves the efficiency of compiling for parallelism, and thereby further improves compiler optimization performance.

Claims (4)

1. A micro-scheduling method supporting directed cyclic graphs, characterized by comprising the following steps:
a) compute the stage value of each instruction in the instruction set;
b) determine whether the instruction set is empty; if so, execute step l); if not, proceed to the next step;
c) determine whether the current machine state space is full or all instructions have been tried; if so, proceed to the next step; if not, execute step e);
d) finalize the template assignment for the previous cycle's machine state space, update the absolute slot values of the instructions in the previous cycle's machine state space, assign the current cycle's machine state space to the previous cycle's state space, and clear the current machine state space;
e) select an instruction from the instruction set, assign it a functional unit, and copy the current machine state space to the current cycle's test space;
f) according to the data dependence graph, check the dependence between each instruction in the current machine state space and the instruction selected in step e), that is, determine whether one of the following four cases holds; if so, execute step h); if not, proceed to the next step; the four cases are: no data dependence exists between any instruction in the current cycle's machine state space and the selected instruction; the latency on the arc from any such instruction to the selected instruction in the data dependence graph is 0; the loop iteration count difference on the arc from any such instruction to the selected instruction in the data dependence graph is not 0; the stage of any such instruction in the modulo schedule differs from the stage of the selected instruction;
g) set feasible to FALSE, then execute step k), where feasible is the logical variable used to judge whether an intra-cycle dependence exists;
h) add the instruction op to the current cycle's test space, set feasible to TRUE, call the template-matching function to find a template for the instructions in the current cycle's test space, and perform the state transition of the finite state automaton;
i) determine whether the state transition of the finite state automaton succeeded; if so, proceed to the next step; if not, execute step l);
j) the test succeeded; assign the current cycle's test space to the current cycle's machine state space, update the absolute slot value of the selected instruction, delete the selected instruction from the instruction set, then execute step b);
k) mark the selected instruction as tried in the current cycle's machine state space, then execute step c);
l) micro-scheduling ends.
2. The micro-scheduling method supporting directed cyclic graphs as claimed in claim 1, characterized in that the instruction set is the set of instructions that modulo scheduling considers can be issued in the same cycle.
3. The micro-scheduling method supporting directed cyclic graphs as claimed in claim 1, characterized in that the stage value of each instruction in step a) is the position of the instruction in the flat schedule divided by the initiation interval.
4. The micro-scheduling method supporting directed cyclic graphs as claimed in claim 1, characterized in that the absolute slot value denotes the absolute slot value of the instruction after the method finishes.
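The stage value of step a) and claim 3 can be illustrated with a one-line computation. This is a minimal sketch under the assumption that "divided by" means integer (floor) division and that the position in the flat schedule is a cycle number counted from 0; the function name stage_of and its parameter names are hypothetical.

```python
def stage_of(flat_cycle: int, initiation_interval: int) -> int:
    """Stage of an instruction: its flat-schedule cycle divided by the
    initiation interval, using floor division (an assumption)."""
    if initiation_interval <= 0:
        raise ValueError("initiation interval must be positive")
    if flat_cycle < 0:
        raise ValueError("flat-schedule cycle must be non-negative")
    return flat_cycle // initiation_interval


# With an initiation interval of 3, flat cycles 0..2 fall in stage 0,
# cycles 3..5 in stage 1, and cycles 6..8 in stage 2.
stages = [stage_of(c, 3) for c in range(9)]
```

Two instructions with different stage values belong to different iterations of the original loop even when they share a kernel cycle, which is why the fourth case of step f) allows them to be placed together.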
CNB2004100294533A 2004-03-19 2004-03-19 A micro-dispatching method supporting directed cyclic graph Expired - Lifetime CN1306401C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2004100294533A CN1306401C (en) 2004-03-19 2004-03-19 A micro-dispatching method supporting directed cyclic graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2004100294533A CN1306401C (en) 2004-03-19 2004-03-19 A micro-dispatching method supporting directed cyclic graph

Publications (2)

Publication Number Publication Date
CN1670699A true CN1670699A (en) 2005-09-21
CN1306401C CN1306401C (en) 2007-03-21

Family

ID=35041973

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2004100294533A Expired - Lifetime CN1306401C (en) 2004-03-19 2004-03-19 A micro-dispatching method supporting directed cyclic graph

Country Status (1)

Country Link
CN (1) CN1306401C (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5918035A (en) * 1995-05-15 1999-06-29 Imec Vzw Method for processor modeling in code generation and instruction set simulation
US6321377B1 (en) * 1998-12-03 2001-11-20 International Business Machines Corporation Method and apparatus automatic service of JIT compiler generated errors
US6631460B1 (en) * 2000-04-27 2003-10-07 Institute For The Development Of Emerging Architectures, L.L.C. Advanced load address table entry invalidation based on register address wraparound

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8612957B2 (en) 2006-01-26 2013-12-17 Intel Corporation Scheduling multithreaded programming instructions based on dependency graph
WO2007085121A1 (en) * 2006-01-26 2007-08-02 Intel Corporation Scheduling multithreaded programming instructions based on dependency graph
US8037466B2 (en) 2006-12-29 2011-10-11 Intel Corporation Method and apparatus for merging critical sections
CN102063288A (en) * 2011-01-07 2011-05-18 四川九洲电器集团有限责任公司 DSP (Digital Signal Processing) chip-oriented instruction scheduling method
CN102200924B (en) * 2011-05-17 2014-07-16 北京北大众志微系统科技有限责任公司 Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling
WO2012155442A1 (en) * 2011-05-17 2012-11-22 北京北大众志微系统科技有限责任公司 Compiling method and device for realizing loop instruction scheduling based on modulo scheduling
CN102200924A (en) * 2011-05-17 2011-09-28 北京北大众志微系统科技有限责任公司 Modulus-scheduling-based compiling method and device for realizing circular instruction scheduling
CN104361183A (en) * 2014-11-21 2015-02-18 中国人民解放军国防科学技术大学 Microprocessor micro system structure parameter optimizing method based on simulator
CN104361183B (en) * 2014-11-21 2017-09-01 中国人民解放军国防科学技术大学 Microprocessor microarchitecture parameter optimization method based on simulator
CN105843660A (en) * 2016-03-21 2016-08-10 同济大学 Code optimization scheduling method for encoder
CN105843660B (en) * 2016-03-21 2019-04-02 同济大学 A kind of code optimization dispatching method of compiler
WO2024131071A1 (en) * 2022-12-19 2024-06-27 苏州元脑智能科技有限公司 Instruction processing method and system, device, and non-volatile readable storage medium
CN116302114A (en) * 2023-02-24 2023-06-23 进迭时空(珠海)科技有限公司 Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU
CN116302114B (en) * 2023-02-24 2024-01-23 进迭时空(珠海)科技有限公司 Compiler instruction scheduling optimization method for supporting instruction macro fusion CPU

Also Published As

Publication number Publication date
CN1306401C (en) 2007-03-21

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HANGZHOU UNIMAS INFORMATION TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES

Effective date: 20130104

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 100080 HAIDIAN, BEIJING TO: 310052 HANGZHOU, ZHEJIANG PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20130104

Address after: Hangzhou City, Zhejiang province Binjiang District 310052 shore road 1180 building 3 layer 1-3

Patentee after: HANGZHOU UNIMAS INFORMATION ENGINEERING Co.,Ltd.

Address before: 100080 Haidian District, Zhongguancun Academy of Sciences, South Road, No. 6, No.

Patentee before: Institute of Computing Technology, Chinese Academy of Sciences

C56 Change in the name or address of the patentee

Owner name: HANGZHOU UNIMASSYSTEM DATA TECHNOLOGY CO., LTD.

Free format text: FORMER NAME: HANGZHOU UNIMAS INFORMATION TECHNOLOGY CO., LTD.

CP01 Change in the name or title of a patent holder

Address after: Hangzhou City, Zhejiang province Binjiang District 310052 shore road 1180 building 3 layer 1-3

Patentee after: HANGZHOU HEZHONG DATA TECHNOLOGY Co.,Ltd.

Address before: Hangzhou City, Zhejiang province Binjiang District 310052 shore road 1180 building 3 layer 1-3

Patentee before: HANGZHOU UNIMAS INFORMATION ENGINEERING Co.,Ltd.

CP02 Change in the address of a patent holder

Address after: 310052 floors 5-8, building 3, No. 399, Danfeng Road, Xixing street, Binjiang District, Hangzhou City, Zhejiang Province (self declaration)

Patentee after: HANGZHOU HEZHONG DATA TECHNOLOGY Co.,Ltd.

Address before: 310052 1-3 / F, building 3, 1180 Bin'an Road, Binjiang District, Hangzhou City, Zhejiang Province

Patentee before: HANGZHOU HEZHONG DATA TECHNOLOGY Co.,Ltd.

CX01 Expiry of patent term

Granted publication date: 20070321