CN1266592C

CN1266592C - Dynamic VLIW instruction scheduling method based on deterministic delays

Info

Publication number: CN1266592C
Application number: CN 200310110566
Authority: CN
Inventors: 王志英; 沈立; 戴葵; 张春元; 鲁建壮; 李云照; 陆洪毅; 王蕾; 王进
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2003-11-26
Filing date: 2003-11-26
Publication date: 2006-07-26
Anticipated expiration: 2023-11-26
Also published as: CN1545026A

Abstract

The present invention discloses a dynamic VLIW instruction scheduling method based on deterministic delays. Its goal is to solve two problems of VLIW microprocessors: their inability to eliminate dynamic delays and their lack of binary code compatibility. In the technical scheme, the pipeline core is divided into a front end and a back end; the front end comprises an instruction fetch module, an instruction decode module and an instruction dispatch module, and the back end comprises functional units (FU) and a reorder buffer (ROB). A deterministic execution delay is assigned to each type of operation in a VLIW instruction according to statistical information, and the earliest available time of each register is recorded. At run time, the execution time of each operation is determined from the earliest available times of its registers, and its execution end time is determined from the operation delay. Meanwhile, the optimization result of the VLIW compiler guarantees that no dependence exists among the operations in the same VLIW instruction. The invention thus exploits the parallelism extracted by the VLIW compiler and uses a simple instruction scheduling mechanism to determine the execution time of operations dynamically, which improves microprocessor performance and solves the binary code compatibility problem.

Description

Dynamic VLIW instruction scheduling method based on deterministic delays

Technical field: the present invention relates to instruction scheduling methods in microprocessor design, and in particular to a dynamic instruction scheduling method for VLIW (Very Long Instruction Word) microprocessors.

Background: in microprocessor design, instruction scheduling can be performed either statically by the compiler at compile time or dynamically by hardware at run time, and each approach has advantages and drawbacks. In general, with the static approach all scheduling work is done by the compiler and little or no extra hardware is needed; the drawback is that dynamic delays arising at run time (such as delays caused by branches or memory accesses) cannot be handled effectively, which limits performance. Its typical representative is the VLIW microprocessor, for example the MultiFlow microprocessor from MultiFlow. The dynamic approach can make full use of run-time information to schedule instruction execution and better reduces the delays caused by branches and memory accesses; the drawback is that it requires complex hardware support and has a high hardware cost. Its representative is the superscalar architecture, for example the x86 series microprocessors from Intel.

Studies show that the performance of superscalar microprocessors is approaching its limit, whereas VLIW microprocessors show good performance advantages and development prospects and have become a focus of current architecture research and microprocessor design. The VLIW architecture packs the operations of several instructions into one very long instruction, and the operations within the same VLIW instruction can execute in parallel, hence the name. A VLIW architecture uses multiple independent functional units to execute the operations of an instruction in parallel, and the task of selecting operations that can be issued together is performed by the compiler. However, current VLIW microprocessors still cannot make full use of run-time information to eliminate dynamic delays, and they also suffer from a serious binary code compatibility problem. Prior techniques such as Split-Issue and DISVLIW (Dynamically Instruction Scheduled VLIW) do not integrate the hardware instruction scheduling mechanism well with the VLIW architecture: although they solve the binary code compatibility problem to some extent, the hardware largely repeats the work of the compiler, the design complexity is very high, and performance is not effectively improved. How to implement dynamic scheduling of VLIW instructions with a simple hardware mechanism is a key problem that urgently needs to be solved.

Summary of the invention: the object of the invention is to solve the problems that VLIW microprocessors cannot effectively use run-time information to eliminate dynamic delays and that they lack binary code compatibility, and to overcome the shortcomings of the prior art, in which the hardware largely repeats compiler work and the design complexity is high while performance is not effectively improved. The invention integrates a hardware instruction scheduling mechanism into the VLIW architecture to improve its tolerance of dynamic delays, and uses the parallelism extracted by the VLIW compiler to reduce hardware complexity, thereby improving VLIW microprocessor performance, eliminating dynamic delays and solving the binary code compatibility problem.

The technical scheme of the invention is as follows: a deterministic execution delay is assigned to each type of operation in a VLIW instruction according to statistical information, and the earliest available time of each register is recorded. At run time, the execution time of each operation is determined from the earliest available times of its registers, and its execution end time is determined from the operation delay. In addition, the optimization result of the VLIW compiler guarantees that no dependence exists among the operations in the same VLIW instruction, so the execution times of several operations can be determined simultaneously without checking for dependences among them, which reduces hardware complexity. No use of this method for dynamically scheduling VLIW instructions has been reported so far, either in China or abroad.
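The timing rule in the technical scheme can be illustrated with a minimal Python sketch. It is not the patented hardware; the names `DELAY`, `schedule_word` and the tuple layout `(kind, srcs, dst)` are assumptions made for illustration, and the `alu` and `mul` delays are invented values (only the one-cycle Load and Store delays come from the text).

```python
# Hypothetical per-type delays in cycles. Load and Store are one cycle as in
# the description; the "alu" and "mul" values are illustrative assumptions.
DELAY = {"alu": 1, "load": 1, "store": 1, "mul": 3}


def schedule_word(word, rat, now):
    """Compute (start, end) for every operation of one VLIW instruction.

    word: list of (kind, srcs, dst) tuples; dst is None when there is no
          destination register (for example a Store).
    rat:  dict mapping register number -> earliest available time.
    now:  cycle in which the instruction is dispatched.
    """
    times = []
    updates = {}
    for kind, srcs, dst in word:
        # An operation can start once all of its source registers are available.
        start = max([now] + [rat.get(r, 0) for r in srcs])
        end = start + DELAY[kind]
        times.append((start, end))
        if dst is not None:
            updates[dst] = end
    # Operations in one instruction are independent (compiler guarantee), so
    # the table is updated only after all start times have been computed.
    rat.update(updates)
    return times
```

With an empty table, scheduling a word that contains a Load into r2 and a `mul` into r5 at cycle 0 gives end times 1 and 3; a following `alu` operation that reads r2 and r5 then starts at cycle 3, without any comparison between operations of the same word.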

Specifically, the VLIW microprocessor system comprises four parts: a pipeline core, a Cache system, a memory controller and a memory. The pipeline core executes instructions and writes execution results back to memory; it is connected to the Cache system through an instruction/data bus and an address bus. The Cache system caches instructions and data and consists of an instruction Cache and a data Cache. The memory controller provides the interface between the memory and the Cache system; when an instruction or data item required by the pipeline core is not in the Cache, the memory controller reads it from memory into the Cache. The memory stores instructions and data. The Cache system, the memory controller and the memory are connected to one another by an instruction bus, a data bus and an address bus.

The VLIW microprocessor system uses the pipeline core to perform dynamic instruction scheduling. The pipeline core is designed as follows: it is divided into a front end and a back end. The front end comprises three modules, the instruction fetch module, the instruction decode module and the instruction dispatch module; it fetches instructions from memory, decodes them, and determines the execution time of each operation in an instruction from the decode result. The back end comprises two modules, the functional units (Function Unit, FU) that execute operations and the reorder buffer (Reorder Buffer, ROB); it executes operations and confirms their execution results. The instruction fetch module is connected to the instruction Cache through the instruction bus and the address bus and fetches instructions from the instruction Cache. The instruction decode module contains the VLIW decoder and is connected to the instruction fetch module through the instruction bus; it decodes instructions, determines the delay of each operation from the operation type obtained by decoding, and determines the earliest possible execution time of each operation from the information recorded in the RAT. The instruction dispatch module comprises the register available table (Register Available Table, RAT) and the functional unit selector (Function Unit Selector, FUS). The RAT records the earliest available time of each register. The FUS assigns each operation to a functional unit according to the decode result, determines the earliest execution time of the operation, and updates the earliest available time of the corresponding register (if any) in the RAT according to the operation delay. The functional units of the back end include ALU units, memory access units, floating-point units and branch units; their exact number depends on the particular microprocessor design, and they can execute several operations simultaneously. So as not to block the decoding of other instructions, each functional unit has an instruction queue that holds all operations waiting to execute on that unit. The ROB holds the execution results of operations and confirms them in instruction fetch order; confirmed results are written back to registers or memory, the remaining results are discarded, and all operations in the instruction queues that are waiting to use a discarded result are cancelled.

The specific steps for dynamic instruction scheduling with the pipeline core are:

1. The instruction fetch module fetches a VLIW instruction from the instruction Cache, allocates for it an entry in the reorder buffer ROB to hold its execution results, and records the sequence number I of that entry; I uniquely identifies each instruction.

2. The instruction decode module decodes the VLIW instruction, determines the delay of each operation from its type, determines the earliest execution time of each operation from the register available times recorded in the RAT, and assigns each operation two numbers I and p: I identifies the instruction containing the operation and p is the position of the operation within instruction I, so that the pair ⟨I, p⟩ uniquely identifies an operation. Operation delays are determined as follows:

(a) Load operations: current microprocessor designs all use a memory hierarchy to build a complex storage system, and a miss at any level of the hierarchy affects the delay, so the run-time delay of a Load operation is not a fixed value. Simulation statistics show that the best results are obtained when the delay of a Load operation is set to the delay of a data Cache hit; the invention therefore adopts this assumption and sets the delay of a Load operation to the data Cache hit delay, which is one clock cycle.

(b) Store operations: the run-time delay is likewise not fixed, but a Store operation has no destination register and therefore does not affect the execution of other types of instructions. Current microprocessors all use write buffers, which shorten the delay of a Store operation to one clock cycle, so the invention assumes a delay of one clock cycle.

(c) All other types of operations: the delay is a fixed value that is known once the microprocessor pipeline design is fixed.

3. For each operation in the VLIW instruction, the instruction dispatch module of the front end uses the operation type obtained by the decode module to determine all functional units that can execute the operation, determines the earliest execution time of the operation on each of those units, assigns the operation to the functional unit that can execute it earliest, and at the same time updates the earliest available time of the operation's destination register (if any) in the RAT. The earliest execution time on a functional unit is determined as follows: the earliest execution time obtained by decoding determines the operation's position in that unit's instruction queue; if that position is already occupied by another operation, the first empty slot after it is used instead. To avoid pipeline stalls caused by waiting for long-delay instructions to finish, the length of the instruction queue is set to the longest delay among all operations.

4. In each clock cycle, every functional unit of the pipeline back end checks whether all source operands required by the operation at the head of its instruction queue are ready; if so, it executes the operation, otherwise it waits.

5. The reorder buffer confirms the execution results of the operations of each instruction one by one in instruction fetch order and writes confirmed results back to registers or memory; if the result of an operation is cancelled, all operations in the instruction queues that are waiting for that result are flushed.
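The per-cycle check of step 4 can be sketched in Python as follows. This is a simplified model under stated assumptions: every operation is a hypothetical `(name, srcs, dst)` tuple, each functional unit's instruction queue is a `deque`, and a result is assumed to become visible to other units in the cycle after the operation executes. The function name `run_cycle` is invented for illustration.

```python
from collections import deque


def make_queues(*op_lists):
    """Build one instruction queue (a deque) per functional unit."""
    return [deque(ops) for ops in op_lists]


def run_cycle(queues, ready):
    """Step every functional unit once and return the names of executed ops.

    queues: list of deques of (name, srcs, dst) tuples, one per unit.
    ready:  set of register numbers whose values are available.
    """
    executed = []
    produced = set()
    for queue in queues:
        # Only the operation at the head of the queue is examined.
        if queue and all(r in ready for r in queue[0][1]):
            name, _srcs, dst = queue.popleft()
            executed.append(name)
            if dst is not None:
                produced.add(dst)
        # Otherwise the unit waits this cycle.
    # Results produced in this cycle are visible from the next cycle on.
    ready |= produced
    return executed
```

Because only the head of each queue is inspected, no associative search over waiting operations is needed, which matches the low-complexity goal stated in the description.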

As the above process shows, once decoding is complete the invention can dynamically adjust the execution time of each operation in a VLIW instruction according to run-time information and determine its execution end time, thereby achieving dynamic instruction scheduling.

The invention achieves the following technical effects:

(1) Low hardware complexity and low power consumption. Compared with conventional dynamic instruction scheduling techniques, the invention uses the RAT to record the earliest available time of each register; for each operation, the instruction dispatch unit determines its earliest execution time from the information in the RAT and places it in the waiting queue of the functional unit that can execute it earliest. This avoids complex dependence-detection mechanisms such as an instruction window. In addition, the VLIW compiler guarantees that no dependence exists among the operations in the same VLIW instruction, so several operations can be processed simultaneously without checking for dependences among them. Complex dependence-detection hardware is thus avoided, the execution time of an operation is determined entirely by its type, and the difficulty and complexity of the hardware implementation are greatly reduced.

(2) Substantially improved VLIW microprocessor performance. With the dynamic instruction scheduling method of the invention, a VLIW microprocessor can re-determine the execution order of instructions according to the pipeline delays that arise dynamically at run time, which effectively eliminates the adverse effect of these delays and improves the run-time performance of the VLIW microprocessor.

(3) Binary code compatibility. At run time the instruction dispatch unit can re-determine the execution time of each operation according to the actual number and delays of the functional units in the microprocessor and assign it to a functional unit, which solves the binary code compatibility problem faced by the VLIW architecture.

(4) Each functional unit has an instruction queue that holds the operations waiting to execute on that unit, so the decoding of other instructions is not blocked. The queue length is set to the longest delay among all operations, which avoids pipeline stalls caused by waiting for long-delay instructions to finish.

The invention makes full use of the parallelism extracted by the VLIW compiler and uses a simple instruction scheduling mechanism to determine the execution time of operations dynamically, which improves VLIW microprocessor performance and effectively solves the binary code compatibility problem faced by the VLIW architecture.

Tests were run with the SPECint95 benchmark suite provided by the System Performance Evaluation Cooperative Consortium and with a Unix core benchmark suite. When the invention is simulated in a VLIW simulator in which each instruction contains 4 operations, an average IPC (Instructions per Cycle, the average number of instructions executed per cycle) of 2.217 is obtained.

Description of drawings:

Fig. 1 is a logic block diagram of the dynamic VLIW microprocessor system of the invention.

Fig. 2 is a logic diagram of the pipeline core of the dynamic VLIW microprocessor of the invention.

Fig. 3 is a logic diagram of the dynamic instruction scheduling mechanism of the invention.

Fig. 4 shows performance test results for the dynamic instruction scheduling mechanism of the invention.

Embodiment:

Fig. 1 is a logic block diagram of the dynamic VLIW microprocessor system of the invention. The VLIW microprocessor system comprises four parts: a pipeline core, a Cache system, a memory controller and a memory. The pipeline core executes instructions and writes execution results back to memory; it is connected to the Cache system through an instruction/data bus and an address bus. The Cache system caches instructions and data and consists of an instruction Cache and a data Cache. The memory controller provides the interface between the memory and the Cache system; when an instruction or data item required by the pipeline core is not in the Cache, the memory controller reads it from memory into the Cache. The memory stores instructions and data. The Cache system, the memory controller and the memory are connected to one another by an instruction bus, a data bus and an address bus.

Fig. 2 is a logic diagram of the pipeline core of the dynamic VLIW microprocessor of the invention. The pipeline core is divided into a front end and a back end (each enclosed by a dashed line). The front end comprises the instruction fetch module, the instruction decode module and the instruction dispatch module, which are connected by an internal bus. The instruction fetch module fetches instructions and sends them to the instruction decode module over the internal bus. The decoder in the instruction decode module decodes each instruction and sends the decode result to the instruction dispatch module over the internal bus; the decode result comprises the operation type, the operation delay, and the source and destination registers of each operation. According to the decode result, the instruction dispatch module assigns the operations of the instruction to the waiting queues IQ of the back-end functional units FU_1 to FU_n. The back end uses multiple independent functional units and can execute several operations simultaneously. Execution results are held temporarily in the reorder buffer ROB and confirmed by it; confirmed results are written back to memory or registers, and the remaining results are cancelled.

Fig. 3 is a schematic diagram of the dynamic instruction scheduling logic in the pipeline core of the VLIW microprocessor of the invention. The dynamic instruction scheduling logic consists of three modules: the instruction decode module, the instruction dispatch module and the instruction execution module:

The instruction decode module contains the VLIW decoder and is connected to the instruction fetch module by the internal bus. It decodes instructions, determines the delay of each operation in a VLIW instruction from the operation type obtained by decoding, and determines the earliest execution time of each operation from the register available times recorded in the RAT. So that the execution results of operations can be confirmed in order, it also assigns each operation two numbers I and p: I identifies the instruction and p is the position of the operation within the instruction, so I and p together uniquely identify each operation.
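As a rough illustration of what the decode module produces, the following Python sketch tags every operation of a VLIW instruction with its identifiers I and p and looks up its delay by type. The names `FIXED_DELAY`, `DecodedOp` and `decode` are hypothetical, and the `add` and `fmul` delays are invented values; only the one-cycle Load and Store delays are taken from the description.

```python
from typing import List, NamedTuple, Optional, Tuple

# Delay per operation type in cycles. Load and Store are one cycle as in the
# description; "add" and "fmul" are illustrative assumptions.
FIXED_DELAY = {"load": 1, "store": 1, "add": 1, "fmul": 4}


class DecodedOp(NamedTuple):
    I: int                  # sequence number of the enclosing instruction
    p: int                  # position of the operation inside instruction I
    kind: str               # operation type
    delay: int              # deterministic delay assigned by type
    srcs: Tuple[int, ...]   # source registers
    dst: Optional[int]      # destination register, None for e.g. Store


def decode(I: int, word) -> List[DecodedOp]:
    """Decode one VLIW instruction given as a list of (kind, srcs, dst)."""
    return [
        DecodedOp(I, p, kind, FIXED_DELAY[kind], tuple(srcs), dst)
        for p, (kind, srcs, dst) in enumerate(word)
    ]
```

The pair `(op.I, op.p)` is what a later stage would use to match an execution result with its operation when results are confirmed in order.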

The instruction dispatch module comprises the register available table RAT and the functional unit selector FUS. Based on the decode result, it determines the earliest execution time of each operation in a VLIW instruction and the earliest available time of its destination register, and places the operation in the waiting queue of the corresponding functional unit. The RAT has the same structure as the register file; each entry records the earliest available time of one physical register. The FUS is connected to the RAT by the internal bus and writes the earliest available times of registers into the RAT.
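The slot selection performed by the functional unit selector can be sketched as below. This is a minimal model, assuming each instruction queue is a fixed-length Python list whose index is a cycle offset and whose empty slots hold `None`; the function names and the tie-breaking rule (the first unit wins) are assumptions, not details given in the patent.

```python
def first_free_slot(queue, earliest):
    """Return the first empty slot at or after `earliest`, or None if none."""
    for slot in range(earliest, len(queue)):
        if queue[slot] is None:
            return slot
    return None


def dispatch(op, earliest, queues):
    """Place `op` in the queue that can execute it soonest.

    queues: one fixed-length list per functional unit able to run `op`.
    Returns (unit_index, slot), or None when every candidate queue is full
    from `earliest` onward.
    """
    best = None
    for unit, queue in enumerate(queues):
        slot = first_free_slot(queue, earliest)
        # Keep the unit offering the earliest slot; ties go to the first unit.
        if slot is not None and (best is None or slot < best[1]):
            best = (unit, slot)
    if best is None:
        return None
    queues[best[0]][best[1]] = op
    return best
```

For example, with two empty queues of length 4, three operations that all become ready at offset 1 land in unit 0 slot 1, unit 1 slot 1 and unit 0 slot 2 respectively.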

The instruction execution module comprises the functional units and the reorder buffer ROB, which are interconnected by an internal data bus. It executes operations and writes confirmed execution results back to registers or memory: the functional units execute operations, and the reorder buffer confirms results. The ROB has the same structure as the register file; each entry records the operation identifiers I and p and the execution result of the operation.
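In-order confirmation by the ROB can be sketched in Python as follows. The entry layout (dictionaries with `dst`, `value`, `done` and `cancelled` fields) and the function name `retire` are assumptions for illustration, and waiting operations are matched to a cancelled result by destination register, which is a simplification of whatever tagging the actual hardware uses.

```python
def retire(rob, regs, queues):
    """Confirm finished results in fetch order and squash dependent ops.

    rob:    list of entries in fetch order; each entry is a dict with keys
            "dst", "value", "done" and "cancelled".
    regs:   dict acting as the register file.
    queues: list of instruction queues; each queued op is a dict with "srcs".
    """
    # Stop at the first unfinished entry so that results are confirmed
    # strictly in instruction fetch order.
    while rob and rob[0]["done"]:
        entry = rob.pop(0)
        if entry["cancelled"]:
            # Discard the result and flush every queued op that reads it.
            for queue in queues:
                queue[:] = [op for op in queue if entry["dst"] not in op["srcs"]]
        elif entry["dst"] is not None:
            regs[entry["dst"]] = entry["value"]
```

An entry that has not finished blocks everything behind it, so a younger result is never written back before an older one.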

Fig. 4 shows performance test results for the dynamic instruction scheduling mechanism of the invention, comparing it with two other dynamic instruction scheduling mechanisms, DL1 and DL6, which assume a Load operation delay of 1 and 6 respectively. The vertical axis is IPC (Instructions per Cycle, the average number of instructions executed per cycle), and the horizontal axis is the instruction window size used by the other dynamic instruction scheduling mechanisms. The tests used the SPECint95 benchmark suite provided by the System Performance Evaluation Cooperative Consortium and a Unix core benchmark suite; the selected benchmarks are li, compress and m88ksim from SPECint95, and lex.c, wc.c and grep.c from Unix. When the invention is simulated in a VLIW simulator in which each instruction contains 4 operations, an average IPC of 2.217 is obtained.

Claims

1. A dynamic VLIW instruction scheduling method based on deterministic delays, wherein the VLIW (Very Long Instruction Word) microprocessor system comprises a pipeline core, a Cache system, a memory controller and a memory; the pipeline core executes instructions and writes execution results back to memory, and is connected to the Cache system through an instruction/data bus and an address bus; the Cache system caches instructions and data and consists of an instruction Cache and a data Cache; the memory controller provides the interface between the memory and the Cache system, and when an instruction or data item required by the pipeline core is not in the Cache, the memory controller reads it from memory into the Cache; the memory stores instructions and data; the Cache system, the memory controller and the memory are connected to one another by an instruction bus, a data bus and an address bus; characterized in that the pipeline core is used to perform dynamic instruction scheduling, and the pipeline core is designed as follows: the pipeline core is divided into a front end and a back end; the front end comprises an instruction fetch module, an instruction decode module and an instruction dispatch module, and fetches instructions from memory, decodes them, and determines the execution time of each operation in an instruction from the decode result; the back end comprises two modules, the functional units FU that execute operations and the reorder buffer ROB, and executes operations and confirms their execution results; the instruction fetch module is connected to the instruction Cache through the instruction bus and the address bus and fetches instructions from the instruction Cache; the instruction decode module contains the VLIW decoder, is connected to the instruction fetch module through the instruction bus, decodes instructions, determines the delay of each operation from the operation type obtained by decoding, and determines the earliest possible execution time of each operation from the information recorded in the register available table RAT; the instruction dispatch module comprises the register available table RAT (Register Available Table) and the functional unit selector FUS; the RAT records the earliest available time of each register; the FUS assigns each operation to a functional unit according to the decode result, determines the earliest execution time of the operation, and updates the earliest available time of the corresponding register in the RAT according to the operation delay; the functional units of the back end include ALU units, memory access units, floating-point units and branch units, and each functional unit has an instruction queue that holds all operations waiting to execute on that unit; the ROB holds the execution results of operations and confirms them in instruction fetch order, confirmed results are written back to registers or memory, the remaining results are discarded, and all operations in the instruction queues that are waiting to use a discarded result are cancelled; the specific method of dynamic instruction scheduling with the pipeline core is:

1.1 the instruction fetch module fetches a VLIW instruction from the instruction Cache, allocates for it an entry in the reorder buffer ROB to hold its execution results, and records the sequence number I of that entry; I uniquely identifies each instruction;

1.2 the instruction decode module decodes the VLIW instruction, determines the delay of each operation from its type, determines the earliest execution time of each operation from the register available times recorded in the RAT, and assigns each operation two numbers I and p, where I identifies the instruction containing the operation and p is the position of the operation within instruction I, so that the pair ⟨I, p⟩ uniquely identifies an operation;

1.3 for each operation in the VLIW instruction, the instruction dispatch module uses the operation type obtained by the decode module to determine all functional units that can execute the operation, determines the earliest execution time of the operation on each of those units, assigns the operation to the functional unit that can execute it earliest, and at the same time updates the earliest available time of the operation's destination register in the RAT;

1.4 in each clock cycle, every functional unit of the pipeline back end checks whether all source operands required by the operation at the head of its instruction queue are ready; if so, it executes the operation, otherwise it waits;

1.5 the reorder buffer ROB confirms the execution results of the operations of each instruction one by one in instruction fetch order and writes confirmed results back to registers or memory; if the result of an operation is cancelled, all operations in the instruction queues that are waiting for that result are flushed.

2. The dynamic VLIW instruction scheduling method based on deterministic delays according to claim 1, characterized in that the operation delay is determined from the type of each operation in the instruction as follows:

2.1 the delay of a Load operation is set to the delay of a data Cache hit, which is one clock cycle;

2.2 the delay of a Store operation is assumed to be one clock cycle;

2.3 for all other types of operations, the delay is a fixed value that is known once the microprocessor pipeline design is fixed.

3 foundations according to claim 1 and 2 are determined the dynamic VLIW instruction scheduling method of delay, it is characterized in that the described method of determining to operate in the execution time the earliest on each functional unit is: for each functional unit, this that obtains according to decoding operates that the execution time is determined its position in this functional unit instruction queue the earliest, if this position is taken by other operation, then seek first room backward and put into this operation; And the length setting of instruction queue is that the long delay number of all operations is to avoid finishing the pipeline stall that causes because of waiting for that some long delay instruction is carried out.