CN102880449B - Method and system for scheduling delay slot in very-long instruction word structure - Google Patents

Info

Publication number
CN102880449B
Authority
CN
China
Prior art keywords
instruction
scheduling
alternative
local
overall
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201210347706.6A
Other languages
Chinese (zh)
Other versions
CN102880449A (en)
Inventor
朱浩
彭楚
王东辉
洪缨
侯朝焕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Acoustics CAS
Original Assignee
Institute of Acoustics CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Acoustics CAS
Priority to CN201210347706.6A
Publication of CN102880449A
Application granted
Publication of CN102880449B
Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a method and a system for scheduling delay slots in a very long instruction word structure. The method comprises: locally scheduling the instructions in the current basic block; after the local scheduling is finished, determining whether any instruction delay slots remain and, if none remain, ending the scheduling, otherwise placing instructions that can be filled into an instruction delay slot but have a high overhead into a local candidate instruction buffer; globally scheduling the instructions in the basic block of the branch target, selecting the instructions that can be filled into an instruction delay slot and placing them in a global candidate instruction buffer; and selecting instructions from the local candidate instruction buffer and/or the global candidate instruction buffer and filling them into the remaining instruction delay slots. The system comprises a local scheduling unit, a global scheduling unit and a balance scheduling unit. By balancing delay slot scheduling against program parallelism, and local scheduling against global scheduling, the disclosed method and system allow programs to achieve high execution efficiency.

Description

Delay slot scheduling method and system under a very long instruction word structure
Technical field
The present invention relates to instruction scheduling technology, and in particular to a delay slot scheduling method and system under a very long instruction word structure.
Background art
A very long instruction word (VLIW) is a very long combination of instructions: it joins multiple instructions together, which increases computation speed. VLIW technology is one of the main techniques for improving the instruction-level parallelism of a processor, and it exploits processor parallelism through hardware-software cooperation. The long instructions are assembled by the compiler rather than by the hardware-based dynamic scheduling used in superscalar processors, which significantly reduces hardware complexity and chip power consumption.
In digital signal processor (DSP) architecture design, research is usually carried out on a reduced instruction set computer (RISC) architecture combined with VLIW technology. In a pipelined processor with this architecture, branch instructions are a major obstacle to higher performance, because the instruction prefetch address that follows a branch can only be produced several pipeline stages after the branch starts executing. The intervening delay stalls the instruction pipeline while it waits for the branch outcome, which directly affects the execution of the subsequent instruction sequence and reduces instruction parallelism. The delay slot structure exists to reduce the performance cost of this kind of pipeline control hazard. Its principle is that if instructions independent of the branch are inserted after the branch instruction, the processor pipeline stays busy and the delay introduced by the branch is put to use by these instructions; the delay slot is the structure that holds them. To work well in real programs, a delay slot structure needs the support of a software scheduling algorithm. If the delay slots are filled with useful instructions, processor performance can improve; if no suitable instruction is found and the delay slots have to be filled with no-operation (nop) instructions, the performance loss caused by control hazards remains. How to make the fullest possible use of instruction delay slots in software is therefore worth investigating.
Delay slot scheduling algorithms fall into two main categories: local scheduling and global scheduling. Conventional compilers use only the simpler scheduling within a local function segment, i.e. local scheduling, in which the instructions filled into delay slots are chosen from the basic block. A basic block is a sequence of statements executed in order; execution can only enter at its entry statement and leave at its exit statement (a basic block has exactly one entry statement and one exit statement), and a basic block ends with a branch instruction. If the basic block offers no suitable instruction, nop instructions are filled in. Global scheduling goes further: when no suitable instruction is found in the basic block, suitable instructions may be selected from other basic blocks under certain constraint rules. In other words, global scheduling allows code motion across basic block boundaries; if this also fails, nop instructions are filled in. However, implementing global scheduling in a compiler is costly and directly affects compilation speed.
In traditional DSP architecture design, usually only local scheduling is used, or, out of concern for compilation speed, a compromise is made between global scheduling and compiler efficiency. Local scheduling is usually limited by the small number of candidate instructions in a local function segment; when the compiler optimizes thoroughly, the tighter logical coupling between the instructions of the segment leads to low delay slot utilization. Moreover, traditional DSP design schemes are not suited to architectures based on VLIW technology: they cannot assess the impact on instruction parallelism, and their use of delay slots may damage instruction parallelism and actually lower program efficiency.
Summary of the invention
The object of the present invention is to provide a delay slot scheduling method under a very long instruction word structure that balances delay slot scheduling against program parallelism, and local scheduling against global scheduling, so that the program achieves higher execution efficiency.
To achieve the above object, in one aspect the invention provides a delay slot scheduling method under a very long instruction word structure, comprising the following steps:
Performing local scheduling on the instructions in the current basic block; after the local scheduling is complete, determining whether any instruction delay slots remain; if none remain, ending the scheduling; otherwise, placing instructions that can be filled into an instruction delay slot but have a high overhead into a local candidate instruction buffer;
Performing global scheduling on the instructions in the branch target basic block, selecting the instructions that can be filled into an instruction delay slot and placing them in a global candidate instruction buffer;
Selecting instructions from the local candidate instruction buffer and/or the global candidate instruction buffer and filling them into the remaining instruction delay slots.
In another aspect, the invention also provides a delay slot scheduling system under a very long instruction word structure, comprising:
A local scheduling unit, configured to perform local scheduling on the instructions in the current basic block and, after the local scheduling is complete, to determine whether any instruction delay slots remain; if none remain, the scheduling ends; otherwise, instructions that can be filled into an instruction delay slot but have a high overhead are placed in a local candidate instruction buffer;
A global scheduling unit, configured to perform global scheduling on the instructions in the branch target basic block, to select the instructions that can be filled into an instruction delay slot and to place them in a global candidate instruction buffer;
A balance scheduling unit, configured to select instructions from the local candidate instruction buffer and/or the global candidate instruction buffer and fill them into the remaining instruction delay slots.
The delay slot scheduling method under a very long instruction word structure provided by the embodiment of the invention performs the instruction filling of delay slots at the assembly level. It combines a local scheduling strategy and a global scheduling strategy in the delay slot scheduling method so as to balance delay slot scheduling against program parallelism, and local scheduling against global scheduling, so that the program achieves higher execution efficiency.
Brief description of the drawings
Other features and advantages of the invention will become more apparent from the following detailed description of embodiments, given by way of example with reference to the accompanying drawings.
Fig. 1 is a structural diagram of the delay slot scheduling system under a very long instruction word structure according to the embodiment of the invention;
Fig. 2 is a flow chart of the delay slot scheduling method under a very long instruction word structure according to the embodiment of the invention;
Fig. 3 is the directed instruction dependency graph constructed before local scheduling;
Fig. 4 is the directed instruction dependency graph constructed before global scheduling.
Detailed description of the embodiments
The technical solution of the application is described in further detail below with reference to the drawings and embodiments.
The delay slot scheduling method under a VLIW structure proposed by the embodiment of the invention performs the instruction filling of delay slots at the assembly level. The method incorporates a local scheduling strategy and a global scheduling strategy into the proposed delay slot scheduling algorithm and, in view of the instruction-parallelism requirements of the VLIW structure, designs a balance scheduling strategy that balances instruction delay slot scheduling against program parallelism, and local scheduling against global scheduling, so as to obtain the highest possible instruction pipeline performance.
The delay slot scheduling method and system under a VLIW structure according to the embodiment are described in detail below with reference to Fig. 1 and Fig. 2. As shown in Fig. 1, the system comprises a processing unit 11, a local scheduling unit 21, a global scheduling unit 22, a balance scheduling unit 31, a statistics unit 32, and a code generation unit 41.
The processing unit 11 parses the input assembly file sequentially and obtains from it the specific information contained in each assembly instruction. This information includes register information, target address information, the instruction name, the class of hardware functional unit the instruction uses, and the number of cycles the instruction takes to execute. The processing unit 11 splits the received assembly file into function segments, stores the function labels and builds a hash index table over them to ease lookup.
The processing unit 11 organizes the assembly instructions packet by packet in a doubly linked list. For example, consider the following assembly file, in which the number in parentheses is the instruction number used in the discussion:
test:
pr0 add d0,d1,d2 |      (1)
pr0 sub d3,d4,d5 |      (2)
pr0 add d6,d7,d8 |      (3)
pr0 sub d9,d10,d11 .    (4)
pr0 j test1 .           (5)
test1:
pr0 addia a10,1 .       (6)
pr0 addia a8,1 |        (7)
pr0 j test1 .           (8)
pr0 addia a5,1 |        (9)
pr0 sub d9,d10,d11 .    (10)
pr0 j test .            (11)
The assembly file above is divided into two function segments, test and test1. Instructions 1 to 4 in function segment test form instruction packet pak1: the end symbol "|" on instructions 1 to 3 indicates that the instruction can be issued in parallel with the next one, and the end symbol "." on instruction 4 indicates that it is the last instruction of the parallel packet. The sub-list inside pak1 is organized in the logical order of instructions 1 to 4, and instruction 5 forms instruction packet pak2. Function segment test therefore consists of pak1 and pak2. The object set for local scheduling of function segment test is {pak1, pak2}. The object set for global scheduling is determined from the branch instruction, i.e. by analyzing the predicate register pr0 of instruction 5: it consists of all instruction packets from instruction 6 of test1 up to the next branch instruction, or between two branch instructions within the function segment or sub-segment, such as instructions 9 to 11 in the assembly file above.
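The grouping of instructions into packets by their end symbols can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Instr`, `Packet` and `parse_segments` are hypothetical, a plain list stands in for the doubly linked list, a dictionary stands in for the hash index table, and the trailing instruction numbers are assigned by position rather than read from the file.

```python
from dataclasses import dataclass, field


@dataclass
class Instr:
    number: int      # running instruction number, as used in the text
    predicate: str   # guarding pr register, e.g. "pr0"
    opcode: str
    operands: list


@dataclass
class Packet:
    instrs: list = field(default_factory=list)


def parse_segments(lines):
    """Return {label: [Packet, ...]} for lines ending in '|' or '.'."""
    segments = {}    # index from function label to its packets
    label, packet, number = None, Packet(), 0
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":"):       # a function label opens a new segment
            label = line[:-1]
            segments[label] = []
            continue
        end, fields = line[-1], line[:-1].split()
        number += 1
        operands = fields[2].split(",") if len(fields) > 2 else []
        packet.instrs.append(Instr(number, fields[0], fields[1], operands))
        if end == ".":               # '.' closes the current packet
            segments[label].append(packet)
            packet = Packet()
    return segments


SOURCE = """
test:
pr0 add d0,d1,d2 |
pr0 sub d3,d4,d5 |
pr0 add d6,d7,d8 |
pr0 sub d9,d10,d11 .
pr0 j test1 .
test1:
pr0 addia a10,1 .
pr0 addia a8,1 |
pr0 j test1 .
pr0 addia a5,1 |
pr0 sub d9,d10,d11 .
pr0 j test .
"""

for name, packets in parse_segments(SOURCE.splitlines()).items():
    print(name, [[i.number for i in p.instrs] for p in packets])
```

For segment test this yields the two packets described in the text, one holding instructions 1 to 4 and one holding instruction 5.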
The statistics unit 32 counts the static instruction cycles of the target function segment and the change in instruction cycles brought about by local and global scheduling. The cycle count of the function segment obtained by the statistics unit 32 is the basis for assessing instruction pipeline performance.
Specifically, in the delay slot scheduling system under the VLIW structure, each instruction delay slot represents a potential saving of one instruction cycle. Before local and global scheduling, the statistics unit 32 traverses the sub-instruction linked list of each element of the instruction packet set {pak1, pak2, ..., pakn} of the function segment, computes the maximum instruction cycle count of each packet to form the set {c1, c2, ..., cn}, and sums all elements of this set to obtain the instruction cycle count of the function segment.
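The cycle statistic described above amounts to summing, over all packets, the cycle count of the slowest instruction in each packet. A minimal Python sketch, in which the function name `segment_cycles` and the sample cycle counts are assumptions made for illustration:

```python
def segment_cycles(packets):
    """Static cycle count of a function segment.

    `packets` is a list of non-empty lists; each inner list holds the
    execution cycles of the instructions in one packet. A packet costs
    as many cycles as its slowest instruction (c_i = max over the packet).
    """
    return sum(max(packet) for packet in packets)


# Hypothetical cycle counts: pak1 has four instructions, pak2 has one.
pak1 = [1, 1, 2, 1]
pak2 = [1]
print(segment_cycles([pak1, pak2]))  # c1 + c2 = 2 + 1
```

Running the same function before and after scheduling gives the change in cycle count that the statistics unit reports.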
When the delay slot scheduling system of the embodiment starts local scheduling, it must consider, on the one hand, the instruction cycles that the delay slots save and, on the other hand, the effect on the instruction cycle count of splitting and regrouping instruction packets under the VLIW structure.
The local scheduling unit 21 performs local scheduling on the instructions in the current basic block and, after the local scheduling is complete, determines whether any instruction delay slots remain; if none remain, the scheduling ends; otherwise, instructions that can be filled into an instruction delay slot but have a high overhead are placed in the local candidate instruction buffer.
Specifically, the local scheduling unit 21 implements local scheduling of the instructions in a function segment: from the target sub-segment it selects instructions that are independent of the branch instruction and of the instructions below it, and fills them into the delay slots. This is a bottom-up process. Before local scheduling, the dependences between the instructions in the current basic block are analyzed and a local directed instruction dependency graph is constructed, with the target branch instruction as its entry node. From this graph, all instructions in the basic block that conflict with the branch instruction can be found, forming the local related instruction set, and all instructions that can be placed in an instruction delay slot can be found, forming the local candidate instruction set. Take the function segment in the left part of Fig. 3 and suppose that the call in branch instruction 9 uses address register a11 by default under this architecture. In the right part of Fig. 3, the solid curves mark the instructions that have a data dependence with the call instruction, namely a read-after-write dependence; the related instruction set {1, 2, 6, 7, 8} cannot be placed in the instruction delay slots. The dashed curve indicates that instruction 5 has a write-after-read anti-dependence with instruction 7 of the related instruction set, so instruction 5 cannot be placed in an instruction delay slot either. The final related instruction set is therefore {1, 2, 5, 6, 7, 8, 9}, and the local candidate instruction set that may be filled into instruction delay slots in the function segment of Fig. 3 is {3, 4}.
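The split of a basic block into a related set and a candidate set can be sketched in Python as a backward walk from the branch. The register usage in `BLOCK` is hypothetical: Fig. 3 is not reproduced here, so the reads and writes were chosen only so that the result matches the sets given in the text. The function names are likewise illustrative.

```python
def depends(earlier, later):
    """True if `later` must stay after `earlier` in program order."""
    return bool(
        earlier["writes"] & later["reads"]       # read after write
        or earlier["reads"] & later["writes"]    # write after read
        or earlier["writes"] & later["writes"]   # write after write
    )


def split_block(block):
    """Return (related, candidates); the highest-numbered entry is the branch."""
    numbers = sorted(block)
    related = {numbers[-1]}                      # the branch is the entry node
    for i in reversed(numbers[:-1]):             # bottom-up walk
        if any(depends(block[i], block[j]) for j in related):
            related.add(i)
    return related, set(numbers) - related


# Hypothetical register usage; instruction 9 is a call through a11.
BLOCK = {
    1: {"reads": set(), "writes": {"a1"}},
    2: {"reads": {"a1"}, "writes": {"a2"}},
    3: {"reads": {"d0", "d1"}, "writes": {"d2"}},
    4: {"reads": {"d3", "d4"}, "writes": {"d5"}},
    5: {"reads": {"a6"}, "writes": {"d6"}},      # reads a6 before 7 overwrites it
    6: {"reads": {"a2"}, "writes": {"a7"}},
    7: {"reads": {"a7"}, "writes": {"a6"}},
    8: {"reads": {"a6"}, "writes": {"a11"}},
    9: {"reads": {"a11"}, "writes": set()},
}

related, candidates = split_block(BLOCK)
print(sorted(related), sorted(candidates))
```

An instruction joins the related set as soon as any later related instruction depends on it, which is how instruction 5 is excluded through its anti-dependence with instruction 7.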
A traditional local scheduling algorithm would add as many instructions of the candidate set {3, 4} to the delay slots as the number of delay slots in the target architecture allows. On a target architecture that uses VLIW technology, however, filling in this traditional way has little or no effect. The end symbols of the assembly instructions in the left part of Fig. 3 show that instruction set {3, 4, 5} can form one parallel instruction packet; instruction {1}, instruction {2}, instruction set {6, 7} and instruction set {8, 9} likewise form independent parallel instruction packets, and {5} forms the related instruction subset.
The balance scheduling that the balance scheduling unit 31 performs while the local scheduling unit 21 carries out local scheduling is described below. The balance scheduling algorithm in local scheduling is discussed case by case, according to two criteria: how the number of instruction delay slots in the target architecture compares with the number of instructions in the local candidate instruction set; and whether the local related instruction subset, i.e. those instructions of the local related instruction set that share a parallel instruction packet with the local candidate instructions about to be filled into delay slots, can form a new parallel instruction packet with the other elements of the related instruction set:
First case: the number of instruction delay slots is greater than or equal to the number of elements in the local candidate instruction set {3, 4}, and the local related instruction subset of the function segment, i.e. instruction 5, cannot form a new parallel instruction packet with the other instructions of the related instruction set, i.e. instruction {1}, instruction {2}, instruction set {6, 7} or instruction set {8, 9}.
In the first case, scheduling the local candidate instruction set {3, 4} occupies two instruction delay slots, while instruction {5}, which could have formed a parallel instruction packet with {3, 4}, is left as an independent instruction packet. On the one hand, because instructions {3}, {4} and {5} can already be merged into one instruction packet, scheduling {3, 4} does not change the pipeline occupancy of instruction set {3, 4, 5}; on the other hand, scheduling {3, 4} wastes instruction delay slots. In the first case, therefore, the local candidate instruction set {3, 4} is not scheduled, and the process moves on to global scheduling.
Second case: the number of instruction delay slots is greater than or equal to the number of elements in the local candidate instruction set, and the local related instruction subset can form a new parallel instruction packet with the local related instruction set.
For example, the number of instruction delay slots is greater than or equal to the number of elements in the local candidate instruction set {3, 4}, and instruction {5} can form a new parallel instruction packet with instruction {1}, instruction {2}, instruction set {6, 7} or instruction set {8, 9}.
In the second case, scheduling the local candidate instruction set {3, 4} likewise occupies two instruction delay slots, and merging instruction {5} into another instruction packet leads to two sub-cases. Case A: if introducing instruction {5} requires an extra nop (no-operation) instruction to be inserted between two instruction packets, the local candidate instruction set {3, 4} is not scheduled. Case B: if introducing instruction {5} does not introduce an extra nop instruction, the decision must be weighed against the result of global scheduling.
Third case: the number of instruction delay slots is less than the number of elements in the local candidate instruction set, and the local related instruction subset, together with any one or more instructions of the local candidate instruction set, can form a new parallel instruction packet with the local related instruction set.
For example, the number of instruction delay slots is less than the number of elements in the local candidate instruction set {3, 4}, and instruction {5}, together with instruction {3} or instruction {4} of the local candidate instruction set, can form a new parallel instruction packet with instruction {1}, instruction {2}, instruction set {6, 7} or instruction set {8, 9}.
In the third case, local scheduling selects one instruction from the local candidate instruction set {3, 4} and places it in the instruction delay slot, saving one instruction cycle. This result must still be weighed against the result of global scheduling.
Fourth case: the number of instruction delay slots is less than the number of elements in the local candidate instruction set, and the local related instruction subset, together with any one or more instructions of the local candidate instruction set, cannot in every combination form a new parallel instruction packet with the local related instruction set.
For example, the number of instruction delay slots is less than the number of elements in the local candidate instruction set {3, 4}, and instruction {5} cannot form a new parallel instruction packet with instruction {1}, instruction {2}, instruction set {6, 7} or instruction set {8, 9} together with both instruction {3} and instruction {4} of the local candidate instruction set.
In the fourth case, using local scheduling to place one instruction of the local candidate instruction set {3, 4} in an instruction delay slot does not improve instruction pipeline performance, so only global scheduling is performed.
Fifth case: the number of instruction delay slots is less than the number of elements in the local candidate instruction set, and the local related instruction subset, together with any one or more instructions of the local candidate instruction set, can in no combination form a new parallel instruction packet with the local related instruction set.
For example, the number of instruction delay slots is less than the number of elements in the local candidate instruction set {3, 4}, and neither instruction {5} with instruction {3} nor instruction {5} with instruction {4} can form a new parallel instruction packet with instruction {1}, instruction {2}, instruction set {6, 7} or instruction set {8, 9}.
In the fifth case, if instruction {3} and instruction {4} are in the same parallel instruction packet, using local scheduling to place one instruction of the local candidate instruction set {3, 4} in an instruction delay slot likewise does not improve instruction pipeline performance, so only global scheduling is performed. If instruction {3} and instruction {4} are not in the same parallel instruction packet, local scheduling can fill the delay slots with instructions from the local candidate instruction set {3, 4} and save as many cycles as there are delay slots, so only local scheduling is needed.
Note that local scheduling decides whether to fill the delay slots according to the current number of instruction delay slots and the overhead of the local candidate instruction set. Local scheduling needs a local candidate buffer to store those instructions of the local candidate instruction set that can be scheduled but have a high overhead. After the global scheduling in the next step is complete, instructions are selected from the candidates stored in the local scheduling buffer and the global scheduling buffer and filled into the remaining instruction delay slots.
Preferably, before the instructions of the local candidate instruction set that can be filled into an instruction delay slot but have a high overhead are placed in the local candidate buffer, the instructions that cannot improve performance are first removed from the local candidate instruction set.
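The reasoning behind the first and second cases can be checked by counting how many packets remain in the block once candidate instructions are moved to delay slots. The Python sketch below does this under simplifying assumptions that are not from the source: every packet costs one cycle, `merges` is a hypothetical map from a leftover instruction to the index of the packet it is allowed to join, and the legality of such a merge is taken as given.

```python
def packets_after_move(packets, moved, merges=None):
    """Packets left in the block after `moved` instructions go to delay slots."""
    remaining = [set(p) - set(moved) for p in packets]
    for instr, target in (merges or {}).items():
        for p in remaining:
            p.discard(instr)          # take the leftover out of its old packet
        remaining[target].add(instr)  # and place it in the packet it may join
    return [p for p in remaining if p]


def cycles_saved(packets, moved, merges=None):
    """Packets (one cycle each, by assumption) removed from the block."""
    return len(packets) - len(packets_after_move(packets, moved, merges))


# Packets of the Fig. 3 segment as listed in the text.
FIG3 = [{1}, {2}, {3, 4, 5}, {6, 7}, {8, 9}]

# First case: 5 cannot join another packet, so moving {3, 4} saves nothing.
print(cycles_saved(FIG3, {3, 4}))
# Case B of the second case: 5 is assumed to join packet {6, 7} without a nop.
print(cycles_saved(FIG3, {3, 4}, merges={5: 3}))
```

In the first call the packet count is unchanged because instruction 5 still occupies its own packet, which is why the two delay slots would be wasted; in the second call one packet disappears.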
Global scheduling moves part of the code of the branch target function segment into the delay slots. However, calls between program functions are usually many-to-one rather than one-to-one, so a global scheduling algorithm is naturally implemented together with branch prediction, which usually increases code size. Under a very long instruction word structure, the widely used predicate registers allow global scheduling to be implemented better. Taking one processor architecture as an example, instructions guarded by the reserved predicate register pr0 are always executed. Whether a branch instruction is certain to execute can therefore be determined from the index of its pr register.
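The check on the guarding pr register can be illustrated with a few lines of Python. The sketch assumes the assembly layout of the earlier example, where the pr register is the first field of each line; the register name pr1 and the function name are hypothetical.

```python
# Per the text, instructions guarded by the reserved register pr0 always execute.
ALWAYS_TRUE_PREDICATES = {"pr0"}


def always_executes(asm_line):
    """True if the instruction's guarding pr register is always true."""
    predicate = asm_line.split()[0]
    return predicate in ALWAYS_TRUE_PREDICATES


print(always_executes("pr0 j test1 ."))   # branch guarded by pr0
print(always_executes("pr1 j test1 ."))   # pr1 is a hypothetical conditional guard
```

A branch that always executes has a single known target segment, so code can be moved from that segment without the code growth that prediction-based schemes incur.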
Global scheduling is a top-down heuristic process. Compared with local scheduling, it has to adjust the instruction parallelism of the target code segment, and it works on a first-come, first-served basis (the instruction that obtains the scheduling right first is processed; later ones are not). Before global scheduling, a global directed instruction dependency graph must likewise be constructed, again with the target branch instruction as its entry node. From this graph, all instructions in the basic block that conflict with the branch instruction are found, forming the global related instruction set, and all instructions that can be placed in an instruction delay slot are found, forming the global candidate instruction set. The instructions of the global related instruction set that share a parallel instruction packet with instructions of the global candidate instruction set form the global related instruction subset.
Take the function segment shown in Fig. 4 and suppose that the call in branch instruction {9} uses address register a11 by default under this architecture. In the right part of Fig. 4, the solid curves mark the instructions that have a data dependence with the call instruction, namely instruction set {14, 15}, which cannot be placed in instruction delay slots. The dashed curves mark the data dependences among the instructions of the target function segment itself: instruction 11 cannot execute before instruction 10, and instruction 15 cannot execute before instruction 11. The global candidate instruction set that may finally be filled into instruction delay slots is therefore {10, 11, 12, 13}.
In the traditional approach, as many instructions as the remaining delay slots allow are filled in, and the parallelism inherent in the program segment is usually ignored.
The function segment in the left part of Fig. 4 shows that the subset {11, 12, 13} of the global candidate instruction set {10, 11, 12, 13} lies in a single parallel instruction packet. How to use the instruction delay slots most efficiently can again be discussed case by case.
For overall scheduling unit 22 balance that balance scheduling unit 31 carries out when carrying out overall scheduling, scheduling describes below.Below in conjunction with Fig. 3, and according to the comparable situation that remains instruction delay groove number and alternative instruction set element number under target architecture, the balance Scheduling Algorithms in overall scheduling is discussed:
The first situation, remaining command postpones groove number and is more than or equal to overall alternative instruction set { 10,11,12,13 } element number.
In the first situation, overall alternative instruction set { 10,11,12,13 } can all be inserted in instruction delay groove.Due to the alternative instruction set { 10 of the overall situation, 11,12,13 } subclass { 11,12 in, 13 } and instruction 14 be in same parallel instruction bag, when instruction { 14 } can not be added into the parallel instruction bag at instruction { 15 } place, overall alternative instruction set { 10,11,12,13 } all insert the periodicity that the effect of bringing in instruction delay groove has only reduced the parallel instruction bag consisting of instruction { 10 }; When instruction { 14 } can be added into the parallel instruction bag at instruction { 15 } place, overall alternative instruction set { 10,11,12,13 } all inserting the effect of bringing in instruction delay groove has only reduced by instruction { 10 } and instruction set { 11,12,13 } form respectively the periodicity of parallel instruction bag.
Therefore, when the first case of local scheduling holds, or when case A of the second case of local scheduling holds, the result of global scheduling is the final delay slot scheduling result. When case B of the second case of local scheduling holds and the number of remaining delay slots equals the number of elements in the global candidate instruction set {10, 11, 12, 13}, the final scheduling result is as follows. 1: If the number of remaining delay slots is greater than or equal to the sum of the numbers of elements in the local candidate instruction set {3, 4} and the global candidate instruction set {10, 11, 12, 13}, all of these instructions are added to the instruction delay slots. 2: If the number of remaining delay slots is less than that sum, all the elements of either the local candidate instruction set or the global candidate instruction set are added to the instruction delay slots.
Second case: the number of remaining instruction delay slots is less than the number of elements in the global candidate instruction set {10, 11, 12, 13}, and greater than or equal to the number of elements in the local candidate instruction set obtained by local scheduling.
In the second case, the global candidate instruction set {10, 11, 12, 13} cannot be filled into the instruction delay slots in its entirety. Under the traditional global scheduling algorithm, the subset {10, 11} of the global candidate instruction set is put into the delay slots. When instruction set {12, 13, 14, 15} cannot form a packet on its own, two delay slots are occupied and the effect is to remove the cycles of the parallel instruction packet formed by instruction {10}; when instruction set {12, 13, 14, 15} can form a packet on its own, two delay slots are occupied and the effect is to remove the cycles of two parallel instruction packets. Alternatively, under the traditional global scheduling scheme, the subset {10, 11, 12} of the candidate instruction set {10, 11, 12, 13} is put into the delay slots. When instruction set {13, 14, 15} can form a packet on its own, three delay slots are occupied and the effect is to remove the cycles of two parallel instruction packets; when instruction set {13, 14, 15} cannot form a packet on its own, three delay slots are occupied and the effect is to remove the cycles of the parallel instruction packet formed by instruction 10.
The effect obtained in the second case of global scheduling is better than that of both the first case and the second case of local scheduling. Therefore, the global scheduling result is the final scheduling result for the instruction delay slots.
Third case: the number of remaining instruction delay slots is less than 2.
In the third case, only one instruction can be chosen from the global candidate instruction set {10, 11, 12, 13} and filled into the instruction delay slot. Because instruction {10} in the global candidate instruction set forms a single-instruction packet, filling instruction 10 into the delay slot saves one instruction cycle, and the delay slot is used very efficiently.
When any of the first through fifth cases of local scheduling holds, the result of the third case of global scheduling is the final scheduling result for the instruction delay slots.
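The three cases above are distinguished purely by comparing slot and set counts, which can be sketched as a small classifier. This is a minimal sketch under stated assumptions, not the patent's implementation: the function name `classify_global_case` is hypothetical, the third case is checked first because the text does not say which case takes precedence when conditions overlap, and combinations the text does not cover are reported as "unspecified".

```python
def classify_global_case(remaining_slots, n_global, n_local):
    """Map the slot/set comparison to the case names used in the text.

    remaining_slots: delay slots left after local scheduling.
    n_global: size of the global candidate instruction set.
    n_local: size of the local candidate instruction set.
    """
    if remaining_slots < 2:
        return "third"        # fewer than 2 slots remain
    if remaining_slots >= n_global:
        return "first"        # the whole global candidate set fits
    if remaining_slots >= n_local:
        return "second"       # global set does not fit, local set does
    return "unspecified"      # not discussed in the text


# Global set {10, 11, 12, 13} has 4 elements, local set {3, 4} has 2.
for slots in (4, 3, 2, 1):
    print(slots, classify_global_case(slots, n_global=4, n_local=2))
```

With these sizes, 4 slots fall into the first case, 3 or 2 slots into the second, and 1 slot into the third.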
It should likewise be noted that the global scheduling scheme still needs to provide a global standby instruction cache, which stores the global candidate instructions whose overhead is relatively high during global scheduling. In the balanced scheduling scheme, it is necessary to determine whether the local standby instruction cache is empty. If it is, instructions are chosen from the global standby instruction cache and filled into the instruction delay slots; otherwise, the best-performing instructions are chosen from the local standby instruction cache and the global standby instruction cache and filled into the remaining instruction delay slots.
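The selection rule in the balanced scheduling scheme can be sketched as follows. This is an illustration under assumptions rather than the patent's implementation: each cache entry is modelled as a pair `(instruction_id, gain)`, where `gain` is a hypothetical numeric estimate of the performance benefit (the text only says "best-performing"), and the function name `fill_remaining_slots` is invented for this example. Ranking the global cache by the same `gain` value when the local cache is empty is also an assumption.

```python
def fill_remaining_slots(local_cache, global_cache, n_slots):
    """Pick up to n_slots instructions for the remaining delay slots.

    local_cache, global_cache: lists of (instruction_id, gain) pairs.
    Returns the chosen instruction ids, best gain first.
    """
    if not local_cache:
        # Local standby cache is empty: choose from the global cache only.
        pool = list(global_cache)
    else:
        # Otherwise compete across both caches.
        pool = list(local_cache) + list(global_cache)
    # sorted() is stable, so ties keep local entries ahead of global ones.
    ranked = sorted(pool, key=lambda entry: entry[1], reverse=True)
    return [ins for ins, _gain in ranked[:n_slots]]


# Hypothetical gains for local candidates {3, 4} and global candidates {10, 11}.
local_cache = [(3, 0.5), (4, 0.2)]
global_cache = [(10, 1.0), (11, 0.3)]
print(fill_remaining_slots(local_cache, global_cache, n_slots=2))
print(fill_remaining_slots([], global_cache, n_slots=1))
```

A single ranked pool keeps the rule simple; a real scheduler would also have to re-check dependences after each pick.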
Preferably, before the instructions in the global candidate instruction set that can be filled into an instruction delay slot but have a relatively high overhead are put into the local standby cache, the instructions in the local instruction set that cannot improve performance are deleted first.
In addition, when the total number of instructions in the global scheduling cache and the local scheduling cache exceeds the number of instruction delay slots, the dependences among the instructions in the delay slots must be analyzed. Likewise, whenever an instruction is added to a delay slot, the dependences between instructions must be taken into account.
In one example, three instructions have been filled into the instruction delay slots, as follows:
Delayslot
Pr0 mul d0,d1,d2    .1
Pr0 nop    .2
Pr0 nop sub d3,d4,d0    .3
Here, instruction {1} and instruction {3} have a dependence through data register d0, and the mul instruction in instruction {1} takes 2 cycles to execute. The traditional approach is to insert a nop instruction into instruction delay slot 2, but a nop wastes the delay slot. Therefore, another independent instruction needs to be selected and filled into the position of instruction delay slot 2, so that the delay slot is not wasted on a nop.
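Replacing the nop with an independent instruction amounts to a register dependence check, which can be sketched as below. This is a simplified illustration and not the patent's algorithm: the `Instr` class, the helper names and the two candidate `add` instructions are hypothetical, the first operand is assumed to be the destination register (so `mul d0,d1,d2` writes d0 and `sub d3,d4,d0` reads it), and multi-cycle latencies are ignored beyond requiring full register independence.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str] = None      # register written, if any
    srcs: Tuple[str, ...] = ()      # registers read


def independent(a, b):
    """True when a and b share no RAW, WAR or WAW register dependence."""
    if a.dest is not None and (a.dest == b.dest or a.dest in b.srcs):
        return False
    if b.dest is not None and b.dest in a.srcs:
        return False
    return True


def replace_nops(slots, candidates):
    """Replace each nop with the first candidate independent of all other slots."""
    filled = list(slots)
    pool = list(candidates)
    for pos, ins in enumerate(filled):
        if ins.op != "nop":
            continue
        others = filled[:pos] + filled[pos + 1:]
        for cand in pool:
            if all(independent(cand, other) for other in others):
                filled[pos] = cand
                pool.remove(cand)   # safe: we leave the loop right away
                break
    return filled


slots = [
    Instr("mul", "d0", ("d1", "d2")),   # instruction {1}
    Instr("nop"),                       # instruction {2}
    Instr("sub", "d3", ("d4", "d0")),   # instruction {3}
]
# Hypothetical candidates taken from the standby caches.
candidates = [Instr("add", "d0", ("d5", "d6")), Instr("add", "d7", ("d5", "d6"))]
print([i.op + " " + str(i.dest) for i in replace_nops(slots, candidates)])
```

The first candidate is rejected because it writes d0, which instruction {1} writes and instruction {3} reads; the second touches only unrelated registers and takes the place of the nop.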
The delay slot algorithm for a very long instruction word (VLIW) structure provided by the embodiment of the present invention combines the traditional local scheduling algorithm and global scheduling algorithm, and proposes a balanced scheduling algorithm for the very long instruction word structure. By balancing delay slot scheduling against program parallelism, and local scheduling against global scheduling, it allows the program to achieve higher execution efficiency. In addition, its computational complexity is lower than that of the global scheduling algorithms used in compilers, and it is more flexible, so it can exploit the execution efficiency of the target program more fully.
Obviously, the invention described herein may have many variations without departing from its true spirit and scope. Therefore, all changes that are obvious to those skilled in the art shall be included within the scope covered by the claims. The scope of protection claimed by the present invention is limited only by the claims.

Claims (6)

1. A delay slot scheduling method under a very long instruction word structure, characterized by comprising the following steps:
performing local scheduling on the instructions in the current basic block; after the local scheduling is finished, judging whether any remaining instruction delay slot exists; if not, ending the scheduling; otherwise, putting the instructions that can be filled into an instruction delay slot but have a relatively high overhead into a local standby instruction cache;
performing global scheduling on the instructions in the branch target basic block, selecting the instructions that can be filled into an instruction delay slot, and putting them into a global standby instruction cache;
selecting instructions from the local standby instruction cache and/or the global standby instruction cache and filling them into the remaining instruction delay slots;
wherein the local scheduling comprises:
obtaining a local candidate instruction set and a local related instruction set according to the dependences between the instructions in the current basic block;
searching, according to the local related instruction set, for the instruction elements that are in the same parallel instruction packet as each instruction element of the local candidate instruction set, to form a local related instruction subset;
selecting instructions from the local candidate instruction set and filling them into the instruction delay slots according to the number of instruction delay slots under the target architecture, the local candidate instruction set, the number of instructions in the local candidate instruction set, and the local related instruction subset.
2. The scheduling method according to claim 1, characterized in that, before the instructions that can be filled into an instruction delay slot but have a relatively high overhead are put into the local standby instruction cache, the method further comprises:
deleting from the local candidate instruction set the instructions that cannot improve performance.
3. The scheduling method according to claim 1, characterized in that performing global scheduling on the instructions in the branch target basic block, selecting the instructions that can be filled into an instruction delay slot, and putting them into the global standby instruction cache further comprises:
obtaining a global candidate instruction set and a global related instruction set according to the dependences between the instructions in the branch target basic block;
searching, according to the global related instruction set, for the instruction elements that are in the same parallel instruction packet as the instruction elements of the global candidate instruction set, to form a global related instruction subset;
selecting from the global candidate instruction set the instructions that can be filled into an instruction delay slot, and putting them into the global standby instruction cache, according to the number of currently remaining delay slots, the global candidate instruction set, the number of instructions in the global candidate instruction set, and the global related instruction subset.
4. The scheduling method according to claim 3, characterized in that, before the instructions that can be filled into an instruction delay slot are selected from the global candidate instruction set and put into the global standby instruction cache, the method further comprises:
deleting from the global candidate instruction set the instructions that cannot improve performance.
5. The scheduling method according to claim 1, characterized in that selecting instructions from the local standby instruction cache and/or the global standby instruction cache and filling them into the remaining instruction delay slots further comprises:
judging whether the local standby instruction cache is empty; if it is, selecting instructions from the global standby instruction cache and filling them into the instruction delay slots; otherwise, selecting the best-performing instructions from the local standby instruction cache and the global standby instruction cache and filling them into the remaining instruction delay slots.
6. The scheduling method according to any one of claims 1 to 5, characterized in that the instructions are assembly instructions.
CN201210347706.6A 2012-09-18 2012-09-18 Method and system for scheduling delay slot in very-long instruction word structure Expired - Fee Related CN102880449B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210347706.6A CN102880449B (en) 2012-09-18 2012-09-18 Method and system for scheduling delay slot in very-long instruction word structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210347706.6A CN102880449B (en) 2012-09-18 2012-09-18 Method and system for scheduling delay slot in very-long instruction word structure

Publications (2)

Publication Number Publication Date
CN102880449A CN102880449A (en) 2013-01-16
CN102880449B true CN102880449B (en) 2014-11-05

Family

ID=47481790

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210347706.6A Expired - Fee Related CN102880449B (en) 2012-09-18 2012-09-18 Method and system for scheduling delay slot in very-long instruction word structure

Country Status (1)

Country Link
CN (1) CN102880449B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104424026B (en) * 2013-08-21 2017-11-17 华为技术有限公司 One kind instruction dispatching method and device
CN103577242B (en) * 2013-11-14 2016-11-02 中国科学院声学研究所 Controlling stream graph reconstructing method for scheduled assembly code
GB2544994A (en) * 2015-12-02 2017-06-07 Swarm64 As Data processing
CN110659119B (en) 2019-09-12 2022-08-02 浪潮电子信息产业股份有限公司 Picture processing method, device and system
CN114327643B (en) * 2022-03-11 2022-06-21 上海聪链信息科技有限公司 Machine instruction preprocessing method, electronic device and computer-readable storage medium
CN114816532B (en) * 2022-04-20 2023-04-04 湖南卡姆派乐信息科技有限公司 Register allocation method for improving RISC-V binary code density

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604255A (en) * 2009-07-23 2009-12-16 上海交通大学 The method that the binary translation by delayed skip instruction of intermediate language is realized

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004062427A (en) * 2002-07-26 2004-02-26 Renesas Technology Corp Microprocessor
EP2106584A1 (en) * 2006-12-11 2009-10-07 Nxp B.V. Pipelined processor and compiler/scheduler for variable number branch delay slots

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101604255A (en) * 2009-07-23 2009-12-16 上海交通大学 The method that the binary translation by delayed skip instruction of intermediate language is realized

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
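A global delay slot scheduling strategy for RISC architectures; Jiang Yi (蒋奕) et al.; Computer Science (《计算机科学》); 20041225; Vol. 31, No. 12; p. 192, right column, paragraph 3, and p. 194, left column, paragraph 3 *
Shi Lei (时磊) et al. Branch scheduling optimization algorithm for VLIW processors. Computer Engineering and Applications (《计算机工程与应用》). 2012, Vol. 48, No. 21, p. 41, right column, paragraph 2 to p. 43, left column, second-to-last paragraph. *
Jiang Yi (蒋奕) et al. A global delay slot scheduling strategy for RISC architectures. Computer Science (《计算机科学》). 2004, Vol. 31, No. 12, p. 192, right column, paragraph 3, and p. 194, left column, paragraph 3. *
Branch scheduling optimization algorithm for VLIW processors; Shi Lei (时磊) et al.; Computer Engineering and Applications (《计算机工程与应用》); 20120721; Vol. 48, No. 21; p. 41, right column, paragraph 2 to p. 43, left column, second-to-last paragraph *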

Also Published As

Publication number Publication date
CN102880449A (en) 2013-01-16

Similar Documents

Publication Publication Date Title
Schoeberl et al. Towards a time-predictable dual-issue microprocessor: The Patmos approach
CN102880449B (en) Method and system for scheduling delay slot in very-long instruction word structure
US9639371B2 (en) Solution to divergent branches in a SIMD core using hardware pointers
WO2012003997A1 (en) Data processing device and method
US9830164B2 (en) Hardware and software solutions to divergent branches in a parallel pipeline
CN103329132A (en) Architecture optimizer
CN101620526B (en) Method for reducing resource consumption of instruction memory on stream processor chip
CN103116485A (en) Assembler designing method based on specific instruction set processor for very long instruction words
Reshadi et al. A cycle-accurate compilation algorithm for custom pipelined datapaths
CN114416045A (en) Method and device for automatically generating operator
Ramirez et al. Trace cache redundancy: Red and blue traces
US20180246847A1 (en) Highly efficient scheduler for a fine grained graph processor
Lakshminarayana et al. Incorporating speculative execution into scheduling of control-flow-intensive designs
US20050257200A1 (en) Generating code for a configurable microprocessor
Prokesch et al. A generator for time-predictable code
US7415689B2 (en) Automatic configuration of a microprocessor influenced by an input program
CN115878188B (en) High-performance realization method of pooling layer function based on SVE instruction set
CN101727513A (en) Method for designing and optimizing very-long instruction word processor
WO2016028410A1 (en) Execution and scheduling of software pipelined loops
Whitham et al. Time-predictable out-of-order execution for hard real-time systems
Huang et al. WCET-aware re-scheduling register allocation for real-time embedded systems with clustered VLIW architecture
Gomez et al. Optimizing the memory bandwidth with loop morphing
Dos Santos et al. A code-motion pruning technique for global scheduling
Fernandes A clustered VLIW architecture based on queue register files
US6430682B1 (en) Reliable branch predictions for real-time applications

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20141105

Termination date: 20190918
