CN1142485C

CN1142485C - Correlation delay eliminating method for streamline control

Info

Publication number: CN1142485C
Application number: CNB011315695A
Authority: CN
Inventors: 葵戴; 戴葵; 王志英; 沈立; 王蓉晖; 王蕾; 张春元; 王明仕; 王进
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2001-11-28
Filing date: 2001-11-28
Publication date: 2004-03-17
Anticipated expiration: 2021-11-28
Also published as: CN1349160A

Abstract

The present invention discloses a method for eliminating control correlation delays in pipelines. Its goal is to eliminate these delays effectively and improve microprocessor performance while keeping the hardware simple and its power consumption low. In the technical scheme, a compiler determines all possible branch target addresses of each branch instruction and inserts a prefetch instruction; two instruction fetch units read all possible successor instructions of the current branch instruction in advance; and a selector, acting on the decoding result of the current branch instruction, chooses the instructions supplied by one of the fetch units for decoding and execution. There are three prefetch instructions, fetch addr1, addr2; fetch addr; and fetch stack, which correspond to different kinds of branch instruction and are executed by the instruction fetch units. Prefetching is performed in units of basic blocks. The invention offers low hardware complexity, low power consumption, a high control correlation delay elimination rate, and few useless prefetched instructions. Microprocessors designed with the invention have a high performance-to-price ratio.

Description

Method for eliminating pipeline control correlation delay

Technical field: The present invention relates to methods for eliminating pipeline control correlation delay in microprocessor design, and in particular to such a method for embedded microprocessor designs that require low power consumption and simple hardware.

Background technology: At present, methods for eliminating pipeline control correlation delay in microprocessor design fall roughly into two classes: branch prediction and delayed branching. Branch prediction exploits the locality of running programs and uses statistics on the outcomes of earlier branch executions to predict whether the next branch will be taken. Its effectiveness depends not only on prediction accuracy but also on the overhead of making the prediction. The resulting control correlation delay depends on the structure of the pipeline, the prediction method, and the recovery strategy adopted after a misprediction. The weakness of this approach is that prediction needs substantial hardware support, such as prediction tables and misprediction recovery logic, so the implementation cost and power consumption are high. The main idea of delayed branching is to execute instructions that do not depend on the branch instruction during the control-related stall cycles, thereby hiding the stall. Its weakness is that not every branch delay slot can be filled, and there is no guarantee that a scheduled instruction is one that must be executed anyway; if it is not, performance does not really improve.

Embedded microprocessors are mainly used in household appliances, mobile phones, microcontrollers, and similar fields, which demand low power consumption. The methods above either cannot meet the requirements of low power consumption and low complexity, or eliminate too little of the control correlation delay to improve performance fully. Even in general-purpose microprocessor design, these methods cannot eliminate pipeline control correlation delay efficiently.

Summary of the invention: The technical problem to be solved by the invention is to eliminate pipeline control correlation delay efficiently and improve microprocessor performance, on the premise that the embedded microprocessor hardware remains simple to implement and low in power consumption.

The technical scheme of the invention is as follows. The compiler in the compilation module determines all possible branch target addresses of each branch instruction and inserts a prefetch instruction. Two instruction fetch units in the instruction fetch module read in all possible successor instructions of the current branch instruction in advance. A selector in the instruction decode and execution module, acting on the decoding result of the current branch instruction, selects the instructions supplied by one of the fetch units and passes them to the instruction decode unit and the instruction execution unit for decoding and execution, thereby eliminating the pipeline control correlation delay. No use of this method to eliminate pipeline control correlation delay has yet been reported in China or abroad.

The present invention involves eight terms: pipeline, branch instruction, pipeline correlation, control correlation, control correlation delay, instruction prefetch, prefetch instruction, and basic block. They are defined as follows:

(1) Pipeline: the execution of an instruction in a microprocessor is decomposed into several subprocesses, each of which runs on its own dedicated functional stage and can proceed at the same time as the others. This is the pipelining technique of instruction execution, and a pipeline is its concrete realization. The number of stages differs between microprocessor designs; a typical pipeline has five stages: instruction fetch, decode, execute, memory access, and write back.

(2) Branch instruction: any instruction that changes the value of the program counter. There are four classes: conditional branch instructions, direct unconditional branch instructions, indirect unconditional branch instructions, and function or procedure return instructions (for example, a return statement).

(3) Pipeline correlation: a pipeline stall caused by dependences between instructions is called a pipeline correlation (a pipeline hazard).

(4) Control correlation: a pipeline correlation caused by a branch instruction is called a control correlation.

(5) Control correlation delay: the number of clock cycles for which the pipeline stalls because of a control correlation.

(6) Instruction prefetch: the operation of fetching an instruction from memory ahead of time, before it is executed.

(7) Prefetch instruction: an instruction responsible for carrying out a prefetch.

(8) Basic block: the basic unit from which a program is built. It has exactly one entry (its first statement) and one exit (its last statement), contains at least two instructions, and its exit is a branch instruction. A program can always be divided into a number of basic blocks.

Implementation of the present invention is:

1. The compiler in the compilation module determines all possible branch target addresses of each branch instruction and inserts a prefetch instruction.

2. When the program starts running, one instruction fetch unit in the instruction fetch module reads in the first basic block, and the other instruction fetch unit is idle.

For each basic block in the program:

(a) The instruction fetch unit responsible for this basic block reads in its instructions one by one and checks whether each is a prefetch instruction. If the instruction just read is a prefetch instruction, it is sent to the other instruction fetch unit, which executes it; otherwise, the instruction is sent to the decode unit in the instruction decode and execution module for decoding.

(b) When the last instruction of the basic block, which is a branch instruction, has been decoded, the selector in the instruction decode and execution module uses the decoding result to select the basic block held in one of the instruction fetch units as the successor of the current basic block:

1) Let the two instruction fetch units be IF₀ and IF₁, and suppose the pipeline is currently executing instructions from IF₀ while IF₁ executes the prefetch instruction and prefetches the successor target instructions of the current basic block. When execution continues sequentially, that is, when the instruction just decoded is not a branch instruction or is a branch instruction whose branch condition is False, the instructions in IF₀ are selected for decoding. When the branch is taken, that is, when the instruction just decoded is a branch instruction whose branch condition is True, the instructions in IF₁ are selected for decoding.

2) If the pipeline is currently executing instructions from IF₁, the selection policy is exactly the opposite: when execution continues sequentially, that is, when the instruction just decoded is not a branch instruction or is a branch instruction whose branch condition is False, the instructions in IF₁ are selected for decoding; when the branch is taken, that is, when the instruction just decoded is a branch instruction whose branch condition is True, the instructions in IF₀ are selected for decoding.

Thus, while the program runs, one instruction fetch unit supplies instructions to the instruction execution unit and the other performs instruction prefetch. The two units work in parallel, so that by the time a branch instruction has been decoded, all of its possible branch target instructions are already held in the two instruction fetch units.
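The selection rule in step (b) amounts to a two-state toggle, and the following minimal Python sketch illustrates it. The function name `select_next_unit` and the trace format are hypothetical and are not taken from the patent; the sketch models only which fetch unit supplies the next instruction, not the hardware itself.

```python
def select_next_unit(active: int, is_branch: bool, taken: bool) -> int:
    """Return the index (0 for IF0, 1 for IF1) of the unit that supplies the next instruction.

    The active unit keeps supplying instructions unless the instruction
    just decoded is a taken branch, in which case the two units swap roles.
    """
    if is_branch and taken:
        return 1 - active
    return active


# Hypothetical decode trace: (is_branch, taken) for each decoded instruction.
trace = [(False, False), (True, False), (True, True), (False, False), (True, True)]

active = 0  # execution starts with IF0 supplying instructions
history = []
for is_branch, taken in trace:
    active = select_next_unit(active, is_branch, taken)
    history.append(active)

print(history)
```

Only a taken branch changes the active unit; non-branch instructions and untaken branches leave the roles as they are, which matches cases 1) and 2) above.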

Compared with an ordinary compiler, the compiler of the invention has two additional functions: it determines all possible branch target addresses of each branch instruction, and it inserts a prefetch instruction into the basic block according to the type of the branch instruction. The insertion flow is as follows: the compiler reads the instructions of the program code one by one; a branch instruction marks the end of the current basic block, and when one is encountered the compiler inserts the prefetch instruction matching its type after the first instruction of the basic block that contains the branch.
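As a rough illustration of this scan, the Python sketch below walks a list of assembly-like strings, treats each branch as the end of a basic block, and places a placeholder prefetch after the block's first instruction. The opcode names in `BRANCH_OPCODES`, the instruction syntax, and the placeholder text are all assumptions made for the example; the patent does not specify them. The sketch also relies on the patent's definition that a basic block contains at least two instructions.

```python
BRANCH_OPCODES = {"beq", "bne", "j", "jr", "ret"}  # hypothetical opcode names


def insert_prefetch_markers(program):
    """Scan instructions in order; a branch instruction closes the current basic block.

    For every block that ends in a branch, a placeholder prefetch is
    inserted after the block's first instruction.
    """
    output, block = [], []
    for instr in program:
        block.append(instr)
        opcode = instr.split()[0]
        if opcode in BRANCH_OPCODES:
            # Emit the finished block with the prefetch in second position.
            marker = f"fetch <targets of: {instr}>"
            output.extend([block[0], marker] + block[1:])
            block = []
    output.extend(block)  # trailing instructions with no closing branch
    return output


program = [
    "add r1, r2, r3",
    "sub r4, r1, r5",
    "beq r4, r0, L1",
    "lw r6, 0(r1)",
    "ret",
]
result = insert_prefetch_markers(program)
for line in result:
    print(line)
```

Placing the prefetch right after the first instruction of the block gives the second fetch unit the rest of the block's execution time to complete the prefetch.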

Branch instructions fall into four classes, and the invention provides three prefetch instructions to cover their different cases: fetch addr1, addr2; fetch addr; and fetch stack.

(1) Conditional branch instruction: the branch may be taken or not taken, so the current basic block has two successor basic blocks and both should be prefetched. There are two possible target addresses: one is encoded in the instruction, and the other is the address of the instruction that follows it. In this case the prefetch instruction fetch addr1, addr2 is inserted after the first instruction of the basic block containing the branch, and the two basic blocks starting at addr1 and addr2 are prefetched. addr1 is obtained by decoding the branch instruction; addr2 is the address of the next instruction, equal to the address of the branch instruction plus the instruction length (in bytes).

(2) Direct unconditional branch instruction: the branch is always taken and there is only one successor basic block. The branch target address, which is encoded in the instruction, is available at compile time, so prefetching the target basic block is sufficient. In this case the prefetch instruction fetch addr is inserted after the first instruction of the basic block containing the branch, and the basic block starting at addr is prefetched. addr is obtained by decoding the branch instruction.

(3) Indirect unconditional branch instruction: the branch is always taken, but the target address is held in a register and usually cannot be obtained at compile time, so branch instructions of this class are not handled.

(4) Procedure return statement: the branch is always taken and there is only one successor basic block. Such statements normally appear when a procedure call returns. Because procedure calls may be nested, the invention uses a stack to hold the return addresses of procedure calls (which are the prefetch addresses): on each procedure call the return address is saved at the top of the stack, and a prefetch obtains its address from the top of the stack. In this case the prefetch instruction fetch stack is inserted after the first instruction of the basic block containing the procedure return instruction, and the basic block starting at the address held at the top of the stack is prefetched.
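The four cases above can be summarized as a small mapping from branch class to prefetch instruction. The Python sketch below is illustrative only: `BranchKind`, `prefetch_for`, and the 4-byte `INSTR_LEN` are hypothetical names and values chosen for the example (a fixed-length encoding is assumed), and the textual output format is not an encoding defined by the patent.

```python
from enum import Enum, auto
from typing import Optional


class BranchKind(Enum):
    CONDITIONAL = auto()
    DIRECT_UNCONDITIONAL = auto()
    INDIRECT_UNCONDITIONAL = auto()
    RETURN = auto()


INSTR_LEN = 4  # bytes; assumed fixed-length RISC encoding


def prefetch_for(kind: BranchKind, branch_addr: int,
                 target: Optional[int] = None) -> Optional[str]:
    """Return the prefetch instruction text for a branch, or None if none is inserted.

    `target` is the address encoded in the branch instruction; it must be
    given for conditional and direct unconditional branches.
    """
    if kind is BranchKind.CONDITIONAL:
        # addr1 is the encoded target, addr2 the fall-through address.
        return f"fetch {target:#x}, {branch_addr + INSTR_LEN:#x}"
    if kind is BranchKind.DIRECT_UNCONDITIONAL:
        return f"fetch {target:#x}"
    if kind is BranchKind.RETURN:
        # The address comes from the return-address stack at run time.
        return "fetch stack"
    # Indirect branch: the target is in a register and unknown at compile time.
    return None


print(prefetch_for(BranchKind.CONDITIONAL, 0x100, 0x40))
print(prefetch_for(BranchKind.DIRECT_UNCONDITIONAL, 0x100, 0x40))
print(prefetch_for(BranchKind.RETURN, 0x100))
print(prefetch_for(BranchKind.INDIRECT_UNCONDITIONAL, 0x100))
```

Returning None for the indirect case mirrors the text: no prefetch instruction is inserted, so those branches still incur the control correlation delay.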

Unlike other instructions, prefetch instructions are executed by the instruction fetch units. The encodings of the prefetch instructions may differ between RISC (reduced instruction set computing) instruction sets, but any encoding that implements the same functions as the invention falls within its scope of protection. Prefetching is performed in units of basic blocks. Apart from some of the successor instructions prefetched when a conditional branch is not taken, every prefetched instruction will be executed.

If all possible target instructions have been read in before the branch instruction finishes decoding, the correct successor instruction can be selected for decoding and execution according to the branch decoding result. The method was implemented in the FastDLX simulator (a standard CPU simulator) and tested with the SPECint95 benchmark suite provided by the System Performance Evaluation Cooperative Consortium. Leaving aside indirect unconditional branch instructions (which make up a small share of the instructions in a program), the probability that the branch target instruction has already been read in when branch decoding finishes is 99.3%; with indirect unconditional branch instructions included, the probability is 93%.

The present invention has the following advantages:

(1) Low hardware complexity and low power consumption. Compared with branch prediction techniques, the invention dispenses with complex branch prediction hardware and misprediction recovery hardware and uses only simple control logic, greatly reducing the difficulty and complexity of the hardware implementation.

(2) High control correlation delay elimination rate. The invention selects the branch target instruction directly from the decoding result of the branch instruction, and instruction prefetch ensures that in most cases the candidate instructions have already been read in when the branch instruction finishes decoding, so most of the control correlation delay is eliminated.

(3) Prefetching is performed in units of basic blocks, and all successor basic blocks of the current basic block are prefetched while it executes. Prefetch operations are therefore issued early, leaving ample time for them to complete. In addition, apart from some of the successor instructions prefetched when a conditional branch is not taken, every prefetched instruction will be executed, which effectively reduces useless prefetching.

On the premise that the embedded microprocessor hardware remains simple to implement and low in power consumption, the invention achieves the goal of efficiently eliminating pipeline control correlation delay and improving microprocessor performance. The invention can also be applied in general-purpose microprocessor design.

Description of drawings:

Fig. 1 is a flowchart of prefetch instruction insertion by the compiler of the invention;

Fig. 2 is the overall logical structure diagram of the invention;

Fig. 3 is a space-time diagram of a non-branch instruction in the 5-stage pipeline of an ordinary microprocessor;

Fig. 4 is a space-time diagram of a branch instruction in the 5-stage pipeline of an ordinary microprocessor;

Fig. 5 is a space-time diagram of a branch instruction in a 5-stage pipeline after the invention is adopted;

Fig. 6 shows test results for the SPECint95 benchmark programs with the invention;

Fig. 7 compares the performance of the invention with that of other control correlation delay elimination methods.

Embodiment:

Fig. 1 is a flowchart of prefetch instruction insertion by the compiler of the invention. The compiler reads the instructions of the program code one by one; a branch instruction marks the end of the current basic block, and when one is encountered the compiler inserts the prefetch instruction matching its type after the first instruction of the basic block that contains the branch. Because a program always ends with a procedure return statement, the compiler is certain to encounter a branch instruction.

Fig. 2 is the overall logical structure diagram of the invention. It consists of a compilation module, an instruction fetch module, and an instruction decode and execution module:

The compilation module is mainly responsible for determining all possible branch target addresses of each branch instruction and inserting a prefetch instruction according to the branch type. The compiler reads the instructions of the source program one by one; if an instruction is a branch instruction, the compiler inserts the matching prefetch instruction after the first instruction of the basic block that contains it. Specifically: a conditional branch instruction has two successor basic blocks and therefore two prefetch addresses, and the prefetch instruction fetch addr1, addr2 is inserted, where addr1 is obtained by decoding the branch instruction and addr2 is the address of the next instruction, equal to the branch instruction address plus the instruction length (in bytes); a direct unconditional branch instruction has one successor basic block and one prefetch address, and the prefetch instruction fetch addr is inserted, where addr is obtained by decoding the instruction; a procedure return statement has one successor basic block and one prefetch address, held at the top of the stack, and the prefetch instruction fetch stack is inserted. The compiled program code is stored in memory.

The instruction fetch module is mainly responsible for supplying instructions to the instruction decode and execution module and for performing instruction prefetch; these functions are carried out by two instruction fetch units. Instruction fetch unit IF₀ reads instructions from the instruction cache through port 0, and instruction fetch unit IF₁ reads instructions from the instruction cache through port 1. When the program starts running, the instructions supplied by IF₀ are selected for decoding and execution and IF₁ is idle; when IF₀ reads in a prefetch instruction, it forwards it to IF₁, which performs the prefetch. While the program runs, which unit performs instruction fetch and which performs instruction prefetch is determined by the decoding result of the branch instruction. If IF₀ is responsible for instruction fetch and IF₁ for instruction prefetch, and the instruction just decoded is not a branch instruction or is a branch that is not taken, the two roles stay unchanged; otherwise IF₁ takes over instruction fetch and IF₀ takes over instruction prefetch. If IF₁ is responsible for instruction fetch and IF₀ for instruction prefetch, and the instruction just decoded is not a branch instruction or is a branch that is not taken, the two roles stay unchanged; otherwise IF₀ takes over instruction fetch and IF₁ takes over instruction prefetch.

The instruction decode and execution module is mainly responsible for decoding and executing instructions. The instructions to be decoded and executed are supplied by one of the instruction fetch units in the instruction fetch module. The decoding result of a branch instruction is sent to the selector and to both instruction fetch units, and the decoding result of a non-branch instruction is sent to the instruction execution unit for execution. The selector uses the decoding result of the branch instruction to select the instructions supplied by one instruction fetch unit for decoding and execution. When the program starts running, it selects the instructions in IF₀. While the program runs, it chooses according to the branch decoding result: if the instructions in IF₀ are in use and the instruction just decoded is not a branch instruction or is a branch that is not taken, it continues to use the instructions in IF₀, and otherwise it switches to the instructions in IF₁; if the instructions in IF₁ are in use and the instruction just decoded is not a branch instruction or is a branch that is not taken, it continues to use the instructions in IF₁, and otherwise it switches to the instructions in IF₀. During instruction execution, if a procedure call occurs, the procedure's return address is saved at the top of the stack for use in prefetching.
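The return-address stack used by fetch stack can be pictured with the short Python sketch below. The class and method names are hypothetical. The text states only that each call saves its return address at the top of the stack and that a prefetch reads the top; popping the entry when the return executes is an assumption added here so that nested calls unwind correctly.

```python
class ReturnAddressStack:
    """Stack of procedure return addresses consulted by the fetch stack prefetch."""

    def __init__(self):
        self._stack = []

    def on_call(self, return_addr: int) -> None:
        # A procedure call saves its return address at the top of the stack.
        self._stack.append(return_addr)

    def prefetch_address(self) -> int:
        # fetch stack reads, but does not remove, the top entry.
        return self._stack[-1]

    def on_return(self) -> int:
        # Assumption: the entry is removed once the return is executed.
        return self._stack.pop()


ras = ReturnAddressStack()
ras.on_call(0x104)   # outer call
ras.on_call(0x208)   # nested call
inner_target = ras.prefetch_address()
ras.on_return()
outer_target = ras.prefetch_address()
print(hex(inner_target), hex(outer_target))
```

With nested calls, the innermost return address is always on top, so the prefetch issued in the callee's final basic block reads the correct address.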

As shown in Fig. 3, suppose the pipeline is divided into five stages: instruction fetch (IF), decode (ID), execute (EX), memory access (MEM), and write back (WB). Instruction p is the first instruction executed after instruction i, instruction p+1 is the second, and so on. Because i is not a branch instruction, the instruction fetch unit reads instruction p while i is being decoded. The ID stage of instruction i therefore overlaps the IF stage of instruction p, and the EX stage of instruction i overlaps the ID stage of instruction p; there is no control correlation delay.

As shown in Fig. 4, instruction i is a branch instruction in the same pipeline as in Fig. 3. The address of instruction p can be determined only after instruction i has been decoded, so the EX stage of instruction i overlaps the IF stage of instruction p, and the pipeline incurs a control correlation delay of one clock cycle.

As shown in Fig. 5, suppose instruction i is a branch instruction, instruction i₁ is the instruction executed when the branch is not taken, and instruction i₂ is the instruction executed when the branch is taken; instruction q is the second instruction executed after the branch instruction, instruction q+1 is the third, and so on. With the invention, while instruction i is being decoded the two instruction fetch units read instructions i₁ and i₂ respectively. Once instruction i has been decoded, i₁ or i₂ is selected for decoding according to the decoding result, so the EX stage of instruction i overlaps the ID stage of i₁ or i₂, and the original control correlation delay is eliminated.
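The timing relationships of Figs. 3 to 5 can be checked with a small cycle-counting sketch in Python. It assumes the idealized 5-stage pipeline described above, with one cycle per stage and the branch target known at the end of ID; the function names are hypothetical and the model ignores cache misses and other stalls.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def stage_cycles(fetch_cycle: int) -> dict:
    """Map each stage name to the cycle in which an instruction occupies it."""
    return {stage: fetch_cycle + k for k, stage in enumerate(STAGES)}


def successor_fetch_cycle(i_fetch_cycle: int, is_branch: bool, dual_fetch: bool) -> int:
    """Cycle in which the instruction that follows instruction i is fetched."""
    i_stages = stage_cycles(i_fetch_cycle)
    if is_branch and not dual_fetch:
        # Fig. 4: the target is known only after ID, so the fetch overlaps EX of i.
        return i_stages["EX"]
    # Fig. 3 (non-branch) and Fig. 5 (both candidates prefetched): fetch overlaps ID of i.
    return i_stages["ID"]


def stall_cycles(is_branch: bool, dual_fetch: bool) -> int:
    """Control correlation delay relative to the ideal, non-branch case."""
    ideal = successor_fetch_cycle(1, False, False)
    return successor_fetch_cycle(1, is_branch, dual_fetch) - ideal


print(stall_cycles(False, False))  # Fig. 3
print(stall_cycles(True, False))   # Fig. 4
print(stall_cycles(True, True))    # Fig. 5
```

In this model the one-cycle delay of Fig. 4 disappears in Fig. 5 because both i₁ and i₂ are fetched while i is in ID, so the selected one can enter ID in the next cycle.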

The invention has been implemented successfully in the pipeline of the Milky Way TS-1 embedded microprocessor IP core, where it effectively eliminates pipeline control correlation delay. Fig. 6 shows the results obtained with the SPECint95 benchmark suite. In the figure, the vertical axis lists the benchmark programs in the suite and the horizontal axis gives the control correlation delay elimination rate: gcc 88.80%, ijpeg 83.32%, compress 88.55%, perl 88.69%, m88ksim 92.99%, li 85.09%, vortex 87.47%, and go 86.16%. The average elimination rate of pipeline control correlation delay reaches 87.8%.
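For reference, the per-benchmark figures can be tabulated and averaged as in the Python sketch below. The unweighted mean of the eight listed values comes to about 87.6%, slightly below the reported 87.8%. The text does not say how its average was computed, so the difference may reflect weighting (for example by branch count) or rounding; that explanation is an assumption, not something the source states.

```python
# Elimination rates (percent) as listed for Fig. 6.
rates = {
    "gcc": 88.80,
    "ijpeg": 83.32,
    "compress": 88.55,
    "perl": 88.69,
    "m88ksim": 92.99,
    "li": 85.09,
    "vortex": 87.47,
    "go": 86.16,
}

unweighted_mean = sum(rates.values()) / len(rates)
lowest = min(rates, key=rates.get)
highest = max(rates, key=rates.get)

print(f"unweighted mean: {unweighted_mean:.2f}%")
print(f"lowest: {lowest}, highest: {highest}")
```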

Fig. 7 lists the performance obtained with different control correlation delay elimination methods. CPI (cycles per instruction) is the average number of clock cycles needed to execute one instruction, and conditional branch delay, unconditional branch delay, and average branch delay are all measured in clock cycles. The pipeline stall entry represents the case in which no branch delay elimination technique is used, and the two instruction fetch units entry represents the invention. With the invention, the conditional branch delay falls to 0.05, the unconditional branch delay to 0.09, and the average branch delay to 0.06, and the effective CPI falls to 1.01, all well below the results obtained with the other methods.
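One way to relate the reported branch delay to the reported CPI is the usual textbook model, effective CPI = base CPI + branch fraction × average branch delay. The Python sketch below applies it under two assumptions that the source does not state: an ideal base CPI of 1, and branch stalls as the only source of lost cycles. Because the reported values are rounded, the implied branch fraction is only a rough estimate.

```python
def effective_cpi(base_cpi: float, branch_fraction: float, avg_branch_delay: float) -> float:
    """Textbook model: every branch adds its average delay to the base CPI."""
    return base_cpi + branch_fraction * avg_branch_delay


def implied_branch_fraction(cpi: float, base_cpi: float, avg_branch_delay: float) -> float:
    """Invert the model to estimate the share of instructions that are branches."""
    return (cpi - base_cpi) / avg_branch_delay


# Reported values: effective CPI 1.01, average branch delay 0.06 cycles.
# Assumed value: base CPI of 1.0 for an ideal single-issue pipeline.
fraction = implied_branch_fraction(cpi=1.01, base_cpi=1.0, avg_branch_delay=0.06)
print(f"implied branch fraction: {fraction:.3f}")
print(f"model CPI: {effective_cpi(1.0, fraction, 0.06):.2f}")
```

Under these assumptions the reported numbers are mutually consistent with roughly one instruction in six being a branch.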

Claims

1. A method for eliminating pipeline control correlation delay, whose overall logical structure comprises a compilation module, an instruction fetch module, and an instruction decode and execution module, characterized in that: the compiler in the compilation module is responsible for determining all possible branch target addresses of each branch instruction and inserting a prefetch instruction according to the type of the branch instruction; the instruction fetch module, besides supplying instructions to the instruction decode and execution module, also performs instruction prefetch, these functions being carried out by two instruction fetch units IF₀ and IF₁; the instruction decode and execution module contains a selector which, according to the decoding result of the current branch instruction, selects the instructions supplied by one instruction fetch unit and passes them to the instruction decode unit and the instruction execution unit for decoding and execution; and the method is implemented as follows: 1) the compiler in the compilation module determines all possible branch target addresses of each branch instruction and inserts a prefetch instruction; 2) when the program starts running, one instruction fetch unit in the instruction fetch module reads in the first basic block and the other instruction fetch unit is idle; for each basic block in the program:

(a) the instruction fetch unit responsible for this basic block reads in its instructions one by one and checks whether each is a prefetch instruction; if the instruction just read is a prefetch instruction, it is sent to the other instruction fetch unit, which executes it; otherwise the instruction is sent to the decode unit for decoding;

(b) when the last instruction of the basic block, which is a branch instruction, has been decoded, the selector in the instruction decode and execution module uses the branch decoding result to select the basic block held in one of the instruction fetch units as the successor of the current basic block:

I. if the pipeline is currently executing instructions from IF₀, IF₁ executes the prefetch instruction and prefetches the successor target instructions of the current basic block; when execution continues sequentially, that is, when the instruction just decoded is not a branch instruction or is a branch instruction whose branch condition is False, the instructions in IF₀ are selected for decoding; when the branch is taken, that is, when the instruction just decoded is a branch instruction whose branch condition is True, the instructions in IF₁ are selected for decoding;

II. if the pipeline is currently executing instructions from IF₁, the selection policy is exactly the opposite: when execution continues sequentially, that is, when the instruction just decoded is not a branch instruction or is a branch instruction whose branch condition is False, the instructions in IF₁ are selected for decoding; when the branch is taken, that is, when the instruction just decoded is a branch instruction whose branch condition is True, the instructions in IF₀ are selected for decoding.

2. The method for eliminating pipeline control correlation delay according to claim 1, characterized in that the compiler inserts prefetch instructions as follows: the compiler reads the instructions of the program code one by one; a branch instruction marks the end of the current basic block, and when one is encountered the compiler inserts the prefetch instruction matching its type after the first instruction of the basic block that contains the branch.

3. The method for eliminating pipeline control correlation delay according to claim 1, characterized in that the prefetch instructions are designed to correspond to the different branch instructions and are of three kinds, fetch addr1, addr2; fetch addr; and fetch stack, different branch instructions using different prefetch instructions:

(1) conditional branch instruction: the branch may be taken or not taken, so there are two successor basic blocks, and both successor basic blocks of the current basic block should be prefetched; there are two possible target addresses, one encoded in the instruction and the other the address of the instruction that follows it; in this case the prefetch instruction fetch addr1, addr2 is inserted after the first instruction of the basic block containing the branch instruction, and the two basic blocks starting at addr1 and addr2 are prefetched; addr1 is obtained by decoding the branch instruction, and addr2 is the address of the next instruction, equal to the branch instruction address plus the instruction length;

(2) direct unconditional branch instruction: the branch is always taken and there is only one successor basic block; the branch target address is available at compile time, so prefetching the target basic block is sufficient; there is only one possible target address, encoded in the instruction; in this case the prefetch instruction fetch addr is inserted after the first instruction of the basic block containing the branch instruction, and the basic block starting at addr is prefetched; addr is obtained by decoding the branch instruction;

(3) indirect unconditional branch instruction: the branch is always taken, but the target address is held in a register and usually cannot be obtained at compile time, so branch instructions of this class are not handled;

(4) procedure return statement: the branch is always taken and there is only one successor basic block; such statements normally appear when a procedure call returns; because procedure calls may be nested, the invention uses a stack to hold the return addresses of procedure calls, which are the prefetch addresses; on each procedure call the return address is saved at the top of the stack, and a prefetch obtains its address from the top of the stack; in this case the prefetch instruction fetch stack is inserted after the first instruction of the basic block containing the procedure return instruction, and the basic block starting at the address held at the top of the stack is prefetched.

4. The method for eliminating pipeline control correlation delay according to claim 1, characterized in that instruction prefetch is performed in units of basic blocks, and that, apart from some of the successor instructions prefetched when a conditional branch is not taken, every prefetched instruction will be executed.