CN1558325A

CN1558325A - Device and method for invalidating redundant items in branch target address cache

Info

Publication number: CN1558325A
Application number: CNA2004100032719A
Authority: CN
Inventors: 汤玛斯·麦当劳
Original assignee: INTELLIGENCE FIRST CO
Current assignee: INTELLIGENCE FIRST CO
Priority date: 2004-02-03
Filing date: 2004-02-03
Publication date: 2004-12-29
Anticipated expiration: 2024-02-03
Also published as: CN100339825C

Abstract

An apparatus and method are disclosed for invalidating redundant entries stored for the same branch instruction in an N-way set-associative branch target address cache (BTAC). The index portion of an instruction cache fetch address is supplied to the BTAC to select one of its N-way sets. Control logic detects whether more than one way in the selected set holds a valid tag that matches the tag portion of the fetch address. If so, a flag is set to indicate the condition and the fetch address is saved in a register. The control logic then retains one of the valid matching ways and invalidates the others. Eliminating redundant target addresses for the same branch instruction lets the BTAC cache target addresses for more branch instructions, which improves BTAC effectiveness.

Description

Apparatus and method for invalidating redundant branch target address cache entries

Technical field

The present invention relates to branch prediction in microprocessors, and in particular to branch prediction using a speculative branch target address cache, namely an apparatus and method for invalidating redundant branch target address cache entries.

Background technology

Modern microprocessors are pipelined. That is, several instructions can be processed at the same time in different blocks, or pipeline stages, of the microprocessor. Hennessy and Patterson define pipelining as an implementation technique whereby multiple instructions are overlapped in execution. See Computer Architecture: A Quantitative Approach (2nd edition), by John L. Hennessy and David A. Patterson, Morgan Kaufmann Publishers, San Francisco, California, 1996. They go on to give the following excellent description of pipelining:

A pipeline is like an assembly line. In an automobile assembly line, there are many steps, each contributing something to the construction of the car. Each step operates in parallel with the other steps, though on a different car. In a computer pipeline, each step completes a part of an instruction. Like the assembly line, different steps complete different parts of different instructions in parallel. Each of these steps is called a pipe stage or a pipe segment. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end, just as cars would in an assembly line.

Synchronous microprocessors operate according to clock cycles. Typically, an instruction passes from one pipeline stage to the next each clock cycle. In an automobile assembly line, if the workers at one stage sit idle because they have no car to work on, the throughput, or performance, of the line decreases. Similarly, if a microprocessor stage sits idle during a clock cycle because it has no instruction to operate on (a situation commonly called a pipeline bubble), the performance of the processor decreases.

One potential cause of pipeline bubbles is branch instructions. When the processor encounters a branch instruction, it must determine the target address of the branch and begin fetching instructions at the target address rather than at the next sequential address after the branch. Furthermore, if the branch is a conditional branch (that is, whether it is taken depends on whether a specified condition holds), the processor must decide whether the branch will be taken in addition to determining the target address. Because the pipeline stages that finally resolve the target address and/or the branch outcome (whether the branch is taken) typically lie well after the stage that fetches instructions, bubbles may result.

To address this problem, modern microprocessors typically employ branch prediction mechanisms to predict the target address and branch outcome early in the pipeline. One example is a branch target address cache (BTAC), which predicts the branch outcome and target address while the instruction is being fetched from the microprocessor's instruction cache. When the microprocessor executes a branch instruction and conclusively determines that the branch is taken and what its target address is, the address of the branch instruction and its target address are written into the BTAC. The next time the branch instruction is fetched from the instruction cache, its address hits in the BTAC, and the BTAC supplies the target address early in the pipeline.
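The write-on-resolve, hit-on-refetch behavior of a BTAC can be illustrated with a minimal Python sketch. It is a toy model, not the patented design: the class name `SimpleBTAC`, the direct-mapped organization, and the 16-entry size are all assumptions chosen for brevity.

```python
class SimpleBTAC:
    """Toy direct-mapped BTAC: maps a branch instruction address to its target.

    Hypothetical model for illustration; the organization and size are assumed.
    """

    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        # Each entry is None (invalid) or a (tag, target_address) pair.
        self.entries = [None] * num_entries

    def _split(self, address):
        # Low bits select the entry; the remaining high bits form the tag.
        return address % self.num_entries, address // self.num_entries

    def update(self, branch_address, target_address):
        """Record a resolved, taken branch and its target address."""
        index, tag = self._split(branch_address)
        self.entries[index] = (tag, target_address)

    def predict(self, fetch_address):
        """Return the predicted target on a hit, or None on a miss."""
        index, tag = self._split(fetch_address)
        entry = self.entries[index]
        if entry is not None and entry[0] == tag:
            return entry[1]
        return None


btac = SimpleBTAC()
first = btac.predict(0x1234)   # miss: the branch has not executed yet
btac.update(0x1234, 0x2000)    # branch resolved as taken, target written
second = btac.predict(0x1234)  # hit: target is available at fetch time
```

The first lookup misses and the second hits, which is the case in which the BTAC removes the bubbles that would otherwise follow the branch.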

An effective BTAC improves processor performance by eliminating or reducing the number of bubbles the processor would otherwise incur while waiting for branch instructions to be resolved. However, when the BTAC mispredicts, the stages of the pipeline that hold the incorrectly fetched instructions must be flushed so that the correct instructions can be fetched, and the flush and refetch introduce bubbles into the pipeline. As microprocessor pipelines get deeper, BTAC effectiveness becomes critical to performance.

The effectiveness of the BTAC depends largely on its hit rate. One factor that affects the hit rate is the number of distinct branch instructions for which the BTAC stores target addresses: the more it stores, the more effective it is. However, microprocessor die area is always limited, so functional blocks such as the BTAC are under pressure to be as small as possible. One factor that affects the physical size of the BTAC is the size of the storage cells that hold the target addresses and related information. In particular, a single-ported cell is generally smaller than a multi-ported cell. A BTAC built from single-ported cells can be either read or written in a given clock cycle, but not both, whereas a BTAC built from multi-ported cells can be read and written simultaneously. A multi-ported BTAC is physically larger than a single-ported one, however, so for a given area it stores fewer target addresses, which reduces its effectiveness. In this respect, a single-ported BTAC is preferable.

However, the fact that a single-ported BTAC can only be read or written in a given clock cycle, not both, can itself reduce BTAC effectiveness through false misses. A false miss occurs if the single-ported BTAC is being written (for example, to update it with a new target address or to invalidate a target address) in a cycle in which it needs to be read. In that case the BTAC must signal a miss on the read because it cannot supply the target address, even though the target address may well be present in the BTAC, since the BTAC is busy being written at that moment.
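A false miss in a single-ported structure can be sketched as follows. The class `SinglePortBTAC`, its `end_cycle` method, and the dictionary storage are hypothetical simplifications; the only behavior taken from the text is that a read in the same cycle as a write cannot return a target address.

```python
class SinglePortBTAC:
    """Toy single-ported BTAC: one access (read or write) per clock cycle."""

    def __init__(self):
        self.targets = {}          # branch address -> target address
        self.busy_writing = False  # True while this cycle is used by a write

    def write(self, branch_address, target_address):
        self.targets[branch_address] = target_address
        self.busy_writing = True   # the single port is consumed this cycle

    def end_cycle(self):
        self.busy_writing = False  # the port is free again next cycle

    def read(self, fetch_address):
        if self.busy_writing:
            return None            # false miss: the port is unavailable
        return self.targets.get(fetch_address)


btac = SinglePortBTAC()
btac.write(0x100, 0x900)
btac.end_cycle()

btac.write(0x200, 0xA00)          # a write occupies the port this cycle...
same_cycle = btac.read(0x100)     # ...so this read misses although 0x100 is cached
btac.end_cycle()
next_cycle = btac.read(0x100)     # the same read hits once the port is free
```

The entry for `0x100` is present the whole time; the miss in the write cycle is purely a port conflict.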

Therefore, what is needed is a method and apparatus for reducing false misses in a single-ported BTAC.

Another phenomenon that can reduce BTAC effectiveness is the BTAC redundantly storing the target address of the same branch instruction. This can occur in a multi-way set-associative BTAC. Because BTAC space is limited, the entry occupied by a redundant copy could otherwise have held the target address of a different branch instruction, so BTAC effectiveness is reduced. The longer the pipeline, that is, the more stages it has, the more likely the BTAC is to store redundant target addresses.

A branch instruction is most commonly cached redundantly in the BTAC when a program executes a tight loop. For example, a branch instruction executes for the first time, and its target address is to be written into way 2 of the BTAC because way 2 is the least recently used way. However, the branch instruction is encountered again before the target address has been written into the BTAC; that is, the BTAC lookup for the instruction cache fetch address misses because the target address has not yet been written. Consequently, the target address is written into the BTAC a second time. If, in between, a read of the BTAC for a different branch instruction causes way 2 to no longer be the least recently used way, a different way, such as way 1, is chosen for the second write. The target address of the same branch instruction now appears twice in the BTAC. This wastes BTAC space and reduces BTAC effectiveness, because the second write very likely replaced a valid target address of another branch instruction.
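The tight-loop scenario can be reproduced with a small simulation of one 4-way set. This is a sketch under stated assumptions: the helper names, the initial replacement order `[2, 1, 0, 3]`, and the rule that each write takes the least recently used way at the time it retires are illustrative choices, not details from the source.

```python
NUM_WAYS = 4
TAG, TARGET = 0x48D1, 0x2000      # hypothetical loop branch and its target


def lookup(ways, tag):
    """Return the indices of all valid ways whose tag matches."""
    return [i for i, entry in enumerate(ways)
            if entry is not None and entry[0] == tag]


def write_lru(ways, lru_order, tag, target):
    """Write into the least recently used way and mark it most recently used."""
    way = lru_order.pop(0)        # front of the list is least recently used
    ways[way] = (tag, target)
    lru_order.append(way)
    return way


ways = [None] * NUM_WAYS          # one selected set, initially empty
lru_order = [2, 1, 0, 3]          # assumed state: way 2 is least recently used

pending = []
for _ in range(2):                # the loop branch is fetched twice
    if not lookup(ways, TAG):     # both lookups miss: nothing is written yet
        pending.append((TAG, TARGET))

# The delayed writes now retire, each taking the current LRU way.
used_ways = [write_lru(ways, lru_order, t, a) for t, a in pending]
duplicates = lookup(ways, TAG)
```

Both writes retire into different ways, so the same tag ends up valid in two ways of the set, which is the redundancy the invention detects and removes.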

Therefore, what is needed is a method and apparatus that avoid wasting valuable BTAC space on redundantly cached target addresses of the same branch instruction.

Furthermore, a particular combination of conditions related to the speculative nature of the BTAC can cause the microprocessor to deadlock. BTAC speculative branch prediction, a branch instruction that wraps across different instruction cache lines, and a processor bus transaction that fetches speculative instructions and generates an error can, in combination, cause a deadlock in certain cases.

Summary of the invention

An object of the present invention is to provide a method and apparatus for preventing deadlock in a microprocessor that uses a speculative BTAC.

The present invention provides a method and apparatus for invalidating redundant entries stored in a BTAC for the same branch instruction, so as to avoid wasting BTAC space. To achieve this object, the invention provides an apparatus for invalidating redundant entries stored for the same branch instruction in a set-associative branch target address cache (BTAC). The apparatus includes a status indicator, which indicates whether at least two ways in the set of the BTAC selected by an instruction cache fetch address contain a valid branch target address for the same branch instruction. The apparatus also includes control logic, coupled to the status indicator, which invalidates one of those at least two ways when the status indicator indicates that at least two ways in the selected set contain a valid branch target address for the same branch instruction.

In another aspect, the invention provides an apparatus for invalidating redundant entries stored for the same branch instruction in a branch target address cache (BTAC). The apparatus includes detection logic, which detects whether more than one valid way among the plurality of ways of a selected set of the BTAC stores a target address for the same branch instruction. The apparatus also includes invalidation logic, coupled to the detection logic, which retains one of the more than one valid ways in the selected set and invalidates all the others.

In another aspect, the invention provides a pipelined microprocessor. The microprocessor includes an instruction cache having an address input for receiving an address that selects a cache line containing a branch instruction. The microprocessor also includes a branch target address cache (BTAC), coupled to the instruction cache, which generates a plurality of indicator signals in response to the address. Each indicator signal indicates whether a corresponding way in the set of the BTAC selected by the address stores a valid target address of the branch instruction. The microprocessor also includes logic, coupled to the BTAC, configured to invalidate one or more ways of the selected set when the indicator signals indicate that two or more of the ways store a valid target address of the branch instruction.

In another aspect, the invention provides a method for invalidating redundant entries stored for the same branch instruction in a set-associative branch target address cache (BTAC). The method includes determining whether more than one way in the BTAC set selected by the index portion of an instruction cache fetch address holds a tag that matches the tag portion of the fetch address and is valid. The method also includes, if more than one way in the selected set is valid and matching, retaining one of those ways and invalidating all the others.

In another aspect, the invention provides a method for invalidating redundant entries stored for the same branch instruction in the same set of an N-way set-associative branch target address cache (BTAC). The method includes selecting an N-way set of the BTAC according to a lower portion of an instruction fetch address. The method also includes comparing the N address tags of the N corresponding ways of the set with an upper portion of the instruction fetch address. The method also includes determining whether two or more of the N address tags match the upper portion and are valid. The method also includes, if two or more of the N address tags match the upper portion and are valid, invalidating one or more of the ways corresponding to those two or more valid matching tags.
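The detect-then-invalidate method can be sketched in a few lines of Python. The function names and the dictionary layout of a way are hypothetical, and keeping the lowest-numbered matching way is an assumption of this sketch; the text only requires that one valid matching way be kept and the others invalidated.

```python
def matching_valid_ways(ways, fetch_tag):
    """Return indices of ways that are valid and whose tag matches fetch_tag."""
    return [i for i, way in enumerate(ways)
            if way["valid"] and way["tag"] == fetch_tag]


def invalidate_redundant(ways, fetch_tag):
    """Keep one valid matching way and invalidate the rest.

    Returns the indices that were invalidated. Keeping the lowest-numbered
    match is an assumption made for this illustration.
    """
    matches = matching_valid_ways(ways, fetch_tag)
    redundant = matches[1:]       # empty unless two or more ways match
    for i in redundant:
        ways[i]["valid"] = False
    return redundant


# One selected 4-way set in which ways 1 and 2 both cache the same branch.
selected_set = [
    {"tag": 0x1111, "valid": True,  "target": 0x4000},
    {"tag": 0x48D1, "valid": True,  "target": 0x2000},
    {"tag": 0x48D1, "valid": True,  "target": 0x2000},
    {"tag": 0x48D1, "valid": False, "target": 0x2000},  # already invalid
]
invalidated = invalidate_redundant(selected_set, 0x48D1)
valid_bits = [way["valid"] for way in selected_set]
```

Way 3 is ignored because it is not valid, way 1 is kept, and way 2 is freed for the target address of some other branch instruction.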

In another aspect, the invention provides a computer data signal embodied in a transmission medium. The computer data signal includes computer-readable program code for providing a pipelined microprocessor. The program code includes first program code for providing an instruction cache having an address input for receiving an address that selects a cache line containing a branch instruction. The program code also includes second program code for providing a branch target address cache (BTAC), coupled to the instruction cache, which generates a plurality of indicator signals in response to the address. Each indicator signal indicates whether a corresponding way in the set of the BTAC selected by the address stores a valid target address of the branch instruction. The program code also includes third program code for providing logic, coupled to the BTAC, configured to invalidate one or more ways of the selected set when the indicator signals indicate that two or more of the ways store a valid target address of the branch instruction.

An advantage of the present invention is that, by eliminating redundant target addresses for the same branch instruction, it allows the BTAC to cache target addresses for more branch instructions, which improves BTAC effectiveness.

Other features and advantages of the present invention will become apparent from the following description and the accompanying drawings.

Description of drawings

Fig. 1 is a block diagram of a microprocessor according to the present invention.

Fig. 2 is a block diagram showing part of the microprocessor of Fig. 1 in more detail.

Fig. 3 is a block diagram showing the BTAC of Fig. 1 in more detail.

Fig. 4 is a block diagram of the contents of a target address array entry of Fig. 3.

Fig. 5 is a block diagram of the contents of a tag array entry of Fig. 3.

Fig. 6 is a block diagram of the contents of a counter array entry of Fig. 3.

Fig. 7 is a block diagram of the contents of a BTAC write request of Fig. 1.

Fig. 8 is a block diagram of the BTAC write queue of Fig. 1.

Fig. 9 is a flowchart of the operation of the BTAC write queue of Fig. 1.

Fig. 10 is a block diagram of the logic in the microprocessor that invalidates redundant target addresses in the BTAC of Fig. 1.

Fig. 11 is a flowchart of the operation of the redundant target address apparatus of Fig. 10.

Fig. 12 is a block diagram of the deadlock avoidance logic in the microprocessor of Fig. 1.

Fig. 13 is a flowchart of the operation of the deadlock avoidance logic of Fig. 12.

The reference numerals are as follows:

100: microprocessor 102: instruction fetcher

104: instruction cache 106: instruction buffer

108: instruction formatter 112: formatted instruction queue

114: instruction translator 116: translated instruction queue

118: register stage 122: address stage

124: data stage 126: execute stage

128: store stage 132: write-back stage

134: adder 136: multiplexer

138: instructions

142: branch target address cache (BTAC)

144: BTAC write queue (BWQ) 146: queue depth

148: multiplexer 152: branch misprediction signal

154: prediction override signal 156: instruction buffer full signal

158: instruction cache idle signal 162: current fetch address

164: BTAC predicted target address 166: next sequential fetch address

168: current instruction pointer 172: correct address

174: override predicted target address 176: BTAC write request

178: BTAC write queue address 182: address signal

202: arbiter 206: multiplexer

212: BTAC read request signal 214: redundant TA request signal

216: deadlock request signal 218: BWQ not-empty signal

222: BWQ full signal 234: redundant TA address

236: deadlock address

244: redundant TA data signal 246: deadlock data signal

248: BWQ data signal 252: control signal

256: data signal 258: control signal

262: control signal

302: target address array 304: tag array

306: counter array 312: target address array entry

314: tag array entry 316: counter array entry

402: target address (TA) 404: start field

406: wrap bit

502: tag 504: A valid bit

506: B valid bit 508: lru field

602: prediction state A counter 604: prediction state B counter

606: A/B lru bit

702: branch instruction address field 706: target address

708: start field 712: wrap bit

714: write-enable-A field 716: write-enable-B field

718: invalidate-A field 722: invalidate-B field

724: way field

802: BTAC write request entry 804: valid bit

806: control logic

902-922: steps of the BTAC write queue operation flow

1002A: tag0 1002B: tag1

1002C: tag2 1002D: tag3

1004: valid[7:0] 1006A: match0

1006B: match1 1006C: match2

1006D: match3 1012: comparators

1014: control logic

1022: redundant TA invalidate data register

1024: redundant TA flag register

1026: redundant TA address register

1102-1112: steps of the redundant target address apparatus operation flow

1202: F_wrap signal 1204: control signal

1206: error signal 1208: speculative signal

1212: T/NT signal 1214: F_wrap signal

1222: deadlock invalidate data register 1224: deadlock flag register

1226: deadlock address register

1302-1322: steps of the deadlock avoidance logic operation flow

Detailed description

Referring now to Fig. 1, a block diagram of a microprocessor 100 according to the present invention is shown. Microprocessor 100 is a pipelined processor.

Microprocessor 100 includes an instruction fetcher 102. Instruction fetcher 102 fetches instructions 138 from a memory coupled to microprocessor 100, such as system memory. In one embodiment, instruction fetcher 102 fetches instructions 138 from memory in units that are a multiple of the cache line size. In one embodiment, instructions 138 are variable-length instructions; that is, not all instructions in the instruction set of microprocessor 100 have the same length. In one embodiment, microprocessor 100 is a microprocessor whose instruction set substantially conforms to the x86 architecture instruction set, whose instructions are of variable length.

Microprocessor 100 also includes an instruction cache 104, coupled to instruction fetcher 102. Instruction cache 104 receives cache lines of instruction bytes from instruction fetcher 102 and caches them for subsequent use by microprocessor 100. In one embodiment, instruction cache 104 is a 64KB, four-way set-associative level-1 cache. When an instruction is not found in instruction cache 104, instruction cache 104 notifies instruction fetcher 102, which then fetches the cache line containing the instruction from memory. A current fetch address 162 is supplied to instruction cache 104 to select one of its cache lines. In one embodiment, a cache line of instruction cache 104 is 32 bytes. Instruction cache 104 also generates an instruction cache idle signal 158, which it drives true when it is idle. Instruction cache 104 is idle when it is not being read.

Microprocessor 100 also includes an instruction buffer 106, coupled to instruction cache 104. Instruction buffer 106 receives cache lines of instruction bytes from instruction cache 104 and buffers them until they can be formatted into individual instructions for execution by microprocessor 100. In one embodiment, instruction buffer 106 has four entries and can store up to four cache lines. Instruction buffer 106 also generates an instruction buffer full signal 156, which it drives true when it is full. In one embodiment, BTAC 142 is not read while instruction buffer 106 is full.
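The geometry of the instruction cache embodiment described above (64KB, four ways, 32-byte lines) fixes the number of sets and the widths of the address fields. The arithmetic is sketched below; the conventional layout in which the index sits directly above the line offset, and the function name `split_address`, are assumptions, since the text does not state which address bits form the index.

```python
CACHE_BYTES = 64 * 1024   # 64KB instruction cache (from the embodiment)
WAYS = 4                  # four-way set associative
LINE_BYTES = 32           # 32-byte cache lines

NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)   # 65536 / 128 = 512 sets
OFFSET_BITS = LINE_BYTES.bit_length() - 1       # log2(32)  = 5
INDEX_BITS = NUM_SETS.bit_length() - 1          # log2(512) = 9


def split_address(address):
    """Split a fetch address into (tag, index, offset).

    Assumes the usual layout: offset in the low bits, index above it,
    tag in the remaining high bits.
    """
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_SETS - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0x12345678)
```

For the example address `0x12345678`, the tag is `0x48D1`, the index selects set 179, and the byte offset within the line is 24.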

In one embodiment, microprocessor 100 also includes an instruction formatter 108, coupled to instruction buffer 106. Instruction formatter 108 receives instruction bytes from instruction buffer 106 and generates formatted instructions from them. That is, instruction formatter 108 examines the string of instruction bytes in instruction buffer 106, determines which bytes make up the next instruction and what its length is, and outputs the next instruction and its length. In one embodiment, the formatted instructions substantially conform to the x86 architecture instruction set.

Instruction formatter 108 also includes logic for generating a branch target address, referred to as the override predicted target address 174. In one embodiment, the branch target address generation logic includes an adder that adds the offset of a relative branch instruction to the branch instruction address to produce override predicted target address 174. In another embodiment, the logic includes a call/return stack for generating the target addresses of call and return instructions. Instruction formatter 108 also generates a prediction override signal 154. Instruction formatter 108 drives prediction override signal 154 true to override a branch prediction made by the branch target address cache (BTAC) 142 of microprocessor 100, described in detail below. That is, if the target address generated by the logic in instruction formatter 108 does not match the target address generated by BTAC 142, instruction formatter 108 drives prediction override signal 154 true so that the instructions fetched because of the BTAC 142 prediction are flushed and microprocessor 100 branches to override predicted target address 174. In one embodiment, BTAC 142 is not read during part of the time in which the instructions are flushed and microprocessor 100 branches to override predicted target address 174.
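The adder-based override check can be sketched as below. The function name `override_prediction` and the 32-bit address width are assumptions made for illustration; the sketch follows the text in adding the offset to the branch instruction address and does not model the call/return stack embodiment.

```python
ADDRESS_MASK = 0xFFFFFFFF   # assumed 32-bit addresses


def override_prediction(branch_address, offset, btac_target):
    """Return (override_signal, override_target).

    override_target models the adder output (branch address + offset);
    override_signal is True when it disagrees with the BTAC prediction.
    """
    override_target = (branch_address + offset) & ADDRESS_MASK
    return override_target != btac_target, override_target


agree = override_prediction(0x1000, 0x20, 0x1020)     # BTAC predicted correctly
disagree = override_prediction(0x1000, 0x20, 0x3000)  # BTAC target is stale
backward = override_prediction(0x1000, -0x10, 0x0FF0) # backward (loop) branch
```

Only the second case raises the override signal, which in the text causes the speculatively fetched instructions to be flushed and fetching to resume at the computed target.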

Microprocessor 100 also includes a formatted instruction queue 112, coupled to instruction formatter 108. Formatted instruction queue 112 receives formatted instructions from instruction formatter 108 and buffers them until they are translated into microinstructions. In one embodiment, formatted instruction queue 112 has entries for up to twelve formatted instructions, although Fig. 1 shows only four entries.

Microprocessor 100 also includes an instruction translator 114, coupled to formatted instruction queue 112. Instruction translator 114 translates the formatted macroinstructions stored in formatted instruction queue 112 into microinstructions. In one embodiment, microprocessor 100 includes a reduced instruction set computer (RISC) core that executes the native, or reduced instruction set, microinstructions.

Microprocessor 100 also includes a translated instruction queue 116, coupled to instruction translator 114. Translated instruction queue 116 receives translated microinstructions from instruction translator 114 and buffers them until they can be executed by the remainder of the microprocessor pipeline.

Microprocessor 100 also includes a register stage 118, coupled to translated instruction queue 116. Register stage 118 includes a plurality of registers for storing instruction operands and results, including a user-visible register file that holds the user-visible state of microprocessor 100.

Microprocessor 100 also includes an address stage 122, coupled to register stage 118. Address stage 122 includes address generation logic that generates memory addresses for instructions that access memory, such as load or store instructions and branch instructions.

Microprocessor 100 also includes a data stage 124, coupled to address stage 122. Data stage 124 includes logic for loading data from memory and one or more caches for caching the data loaded from memory.

Microprocessor 100 also includes an execute stage 126, coupled to data stage 124. Execute stage 126 includes execution units for executing instructions, such as arithmetic and logic units for executing arithmetic and logic instructions. In one embodiment, execute stage 126 includes an integer execution unit, a floating-point execution unit, an MMX execution unit, and an SSE execution unit. Execute stage 126 also includes logic for resolving branch instructions. In particular, execute stage 126 determines whether a branch instruction is taken and whether BTAC 142 previously mispredicted whether the branch would be taken. In addition, execute stage 126 determines whether the branch target address previously predicted by BTAC 142 was mispredicted, that is, incorrect. If execute stage 126 determines that a previous branch prediction was incorrect, it drives branch misprediction signal 152 true so that the instructions fetched because of the BTAC 142 misprediction are flushed and microprocessor 100 branches to correct address 172. In one embodiment, BTAC 142 is not read during part of the time in which the instructions are flushed and microprocessor 100 branches to correct address 172.

Microprocessor 100 also includes a store stage 128, coupled to execute stage 126. Store stage 128 includes logic for storing data to memory in response to store microinstructions. Store stage 128 generates a correct address 172, which is used to correct a previous incorrect branch prediction indicated by branch misprediction signal 152. Correct address 172 is the correct branch target address of a branch instruction; that is, it is the non-speculative target address of the branch instruction. Correct address 172 is also written into BTAC 142 when a branch instruction is executed and resolved, as described in detail below. Store stage 128 also generates a BTAC write request 176 for updating BTAC 142. BTAC write request 176 is described in more detail below with respect to Fig. 7.

Microprocessor 100 also includes a write-back stage 132, coupled to store stage 128. Write-back stage 132 includes logic for writing instruction results back to register stage 118.

Microprocessor 100 also includes BTAC 142. BTAC 142 is a cache for caching target addresses and other branch prediction information. BTAC 142 generates a predicted target address 164 in response to an address 182 received from a multiplexer 148. In one embodiment, BTAC 142 is a single-ported cache, so reads and writes of BTAC 142 must share access to it, which can cause false misses in BTAC 142. BTAC 142 and multiplexer 148 are described in more detail below.

Microprocessor 100 also comprises that one is coupled to the multiplexer 136 of BTAC 142.Multiplexer 136 is chosen some in its six inputs, exports as existing extraction address 162.One of them input is the next address 166 of extracting in proper order that is produced by a totalizer 134, and totalizer 134 is that existing extraction address 162 is increased progressively the size of a fast line taking, and produces the next address 166 of extracting in proper order.After the one fast line taking of instruction cache 104 normal extraction, multiplexer 136 is chosen the next address 166 of extracting in proper order, to export as existing extraction address 162.Another input is existing extraction address 162.Another input is a BTAC predicted target address 164, if BTAC 142 points out a branch instruction and is present in by existing extraction address 162 from the selected fast line taking of instruction cache 104, and BTAC 142 these branch instructions of prediction will be adopted, and then multiplexer 136 can be chosen this BTAC predicted target address 164.Another input is the correct address 172 that is received from storage stage 128, and multiplexer 136 is chosen this input to correct the branch prediction of a mistake.Another input is the replacement predicted target address 174 that receives from order format device 108, and multiplexer 136 is chosen this input to replace BTAC predicted target address 164.Another input is a current ordcurrent order pointer 168, and it is pointed out now just by the address of order format device 108 formative instructions.Multiplexer 136 can be chosen current ordcurrent order pointer 168, and is to avoid the situation of fast knot, as described below.

Microprocessor 100 also includes a BTAC write queue (BWQ) 144 coupled to the BTAC 142. The BTAC write queue 144 comprises a plurality of storage elements that buffer BTAC write requests 176 until they can be written into the BTAC 142. The BTAC write queue 144 receives the branch misprediction signal 152, the prediction override signal 154, the instruction buffer full signal 156, and the instruction cache idle signal 158. Preferably, the BTAC write queue 144 defers updating the BTAC 142 with the BTAC write requests 176 until an opportune time, namely when the BTAC 142 is not being read, as indicated by the input signals 152 to 158. This increases the efficiency of the BTAC 142, as described in more detail below.

The BTAC write queue 144 generates a BTAC write queue address 178 as one input to the multiplexer 148. The BTAC write queue 144 also includes a register that stores the current queue depth 146. The queue depth 146 indicates the number of valid BTAC write requests 176 currently stored in the BTAC write queue 144. The queue depth 146 is initialized to zero. Each time a BTAC write request 176 is loaded into the BTAC write queue 144, the queue depth 146 is incremented; each time a BTAC write request 176 is removed from the BTAC write queue 144, the queue depth 146 is decremented. The BTAC write queue 144 is described in more detail below.

Fig. 2 is a block diagram illustrating portions of the microprocessor 100 of Fig. 1 in more detail according to the present invention. Fig. 2 shows the BTAC write queue 144, the BTAC 142, and the multiplexer 148 of Fig. 1, together with an arbiter 202 and a three-input multiplexer 206 coupled between the BTAC write queue 144 and the BTAC 142. Although the multiplexer 148 is shown in Fig. 1 with only two inputs, it is in fact a four-input multiplexer, as shown in Fig. 2. In Fig. 2, the BTAC 142 includes a read/write input, an address input, and a data input.

As shown in Fig. 1, the multiplexer 148 receives the current fetch address 162 and the BWQ address 178. In addition, the multiplexer 148 receives a redundant TA address 234 and a deadlock address 236, which are described in more detail below with respect to Figures 10-11 and Figures 12-13, respectively. Based on a control signal 258 generated by the arbiter 202, the multiplexer 148 selects one of its four inputs to output as the address signal 182 of Fig. 1, which is provided to the address input of the BTAC 142.

The multiplexer 206 receives as inputs a redundant TA data signal 244 and a deadlock data signal 246, which are described in more detail below with respect to Figures 10-11 and Figures 12-13, respectively. The multiplexer 206 also receives as an input a BWQ data signal 248 from the BTAC write queue 144; this signal carries the data of the current BTAC write queue 144 request for updating the BTAC 142. Based on a control signal 262 generated by the arbiter 202, the multiplexer 206 selects one of its three inputs to output as a data signal 256, which is provided to the data input of the BTAC 142.

The arbiter 202 arbitrates among the multiple resources requesting access to the BTAC 142. The arbiter 202 generates a signal 252, provided to the read/write input of the BTAC 142, that controls whether the BTAC 142 is read or written. The arbiter 202 receives a BTAC read request signal 212, which indicates a request to read the BTAC 142 using the current fetch address 162, in parallel with a request to read the instruction cache 104 using the same current fetch address 162. The arbiter 202 also receives a redundant target address (TA) request signal 214, which indicates a request to invalidate redundant entries for the same branch instruction in the set of the BTAC 142 selected by the redundant TA address 234. The arbiter 202 also receives a deadlock request signal 216, which indicates a request to invalidate an entry, in the set of the BTAC 142 selected by the deadlock address 236, that mispredicted that a branch instruction did not wrap across a cache line boundary, as described below. The arbiter 202 also receives from the BTAC write queue 144 a BWQ not empty signal 218, which indicates that at least one request is pending to update an entry in the set of the BTAC 142 selected by the BWQ address 178, as described below. The arbiter 202 also receives from the BTAC write queue 144 a BWQ full signal 222, which indicates that the BTAC write queue 144 is full of pending requests to update entries in the sets of the BTAC 142 selected by the BWQ address 178, as described below.

In one embodiment, the arbiter 202 assigns priority as shown in Table 1 below, where 1 is the highest priority and 5 is the lowest.

1 - Deadlock request 216
2 - BWQ full 222
3 - BTAC read request 212
4 - Redundant TA request 214
5 - BWQ not empty 218

Table 1
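The priority scheme of Table 1 can be sketched as a fixed-priority selection. The following Python fragment is only an illustrative model, not the disclosed circuit; the request names are hypothetical labels chosen here for signals 216, 222, 212, 214, and 218.

```python
# Priority order taken from Table 1 (index 0 = highest priority).
# The string labels are illustrative names, not identifiers from the patent.
PRIORITY = [
    "deadlock_request",      # signal 216
    "bwq_full",              # signal 222
    "btac_read_request",     # signal 212
    "redundant_ta_request",  # signal 214
    "bwq_not_empty",         # signal 218
]


def arbitrate(active):
    """Return the highest-priority request present in `active`, or None."""
    for name in PRIORITY:
        if name in active:
            return name
    return None
```

Under this model, a pending read (212) wins over a queue that is merely not empty (218), but loses to a full queue (222), which is the case in which a read may observe a false miss.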

Fig. 3 is a block diagram illustrating the BTAC 142 of Fig. 1 in more detail according to the present invention. As shown in Fig. 3, the BTAC 142 includes a target address array 302, a tag array 304, and a counter array 306. Each array receives the address 182 of Fig. 1. The embodiment of Fig. 3 shows a four-way set-associative BTAC 142 cache. In another embodiment, the BTAC 142 comprises a two-way set-associative cache. In one embodiment, the target address array 302 and the tag array 304 are single-ported, whereas the counter array 306 is dual-ported, with one read port and one write port, because the counter array 306 must be updated more frequently than the target address array 302 and the tag array 304.

The target address array 302 comprises an array of storage elements that store target address array entries 312, which cache branch target addresses and related branch prediction information. The contents of a target address array entry 312 are described below with respect to Fig. 4. The tag array 304 comprises an array of storage elements that store tag array entries 314, which cache address tags and related branch prediction information. The contents of a tag array entry 314 are described below with respect to Fig. 5. The counter array 306 comprises an array of storage elements that store counter array entries 316, which hold branch outcome prediction information. The contents of a counter array entry 316 are described below with respect to Fig. 6.

Each of these arrays is organized as four ways, denoted way 0, way 1, way 2, and way 3. Preferably, each way of the target address array 302 stores two entries, or portions, denoted A and B, for caching a branch target address and speculative branch information, so that if two branch instructions are present in one cache line, the BTAC 142 can make a prediction for the appropriate branch instruction.

Each array is indexed by the address 182 of Fig. 1. The lower-order bits of the address 182 select a set in each array. In one embodiment, each array comprises 128 sets. With 128 sets of four ways each and two entries per way, the BTAC 142 can cache up to 1024 target addresses. Preferably, each array is indexed by bits [11:5] of the address 182 to select one four-way set of the BTAC 142.

Fig. 4 is a block diagram illustrating the contents of a target address array entry 312 of Fig. 3 according to the present invention.

A target address array entry 312 includes a branch target address (TA) 402. In one embodiment, the target address 402 is a 32-bit address that was cached after a previous execution of the branch instruction. The BTAC 142 provides the target address 402 on the predicted TA output 164.

A target address array entry 312 also includes a start field 404. The start field 404 specifies the byte offset of the first byte of the branch instruction within the cache line that the instruction cache 104 outputs in response to the current fetch address 162. In one embodiment, a cache line comprises 32 bytes; hence, the start field 404 comprises 5 bits.

A target address array entry 312 also includes a wrap bit 406. The wrap bit 406 is true if the predicted branch instruction wraps across two cache lines of the instruction cache 104. The BTAC 142 provides the wrap bit 406 on a B_wrap signal 1214, which is described below with respect to Figure 12.

Fig. 5 is a block diagram illustrating the contents of a tag array entry 314 of Fig. 3 according to the present invention.

A tag array entry 314 includes a tag 502. In one embodiment, the tag 502 comprises the upper 20 bits of the address of the branch instruction whose predicted target address 402 is stored in the corresponding entry of the target address array 302. The BTAC 142 compares the tag 502 with the upper 20 bits of the address 182 of Fig. 1 to determine whether the corresponding entry matches the address 182, that is, whether the address 182 hits in the BTAC 142, provided the entry is valid.
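The field widths given so far (a 32-byte line, index bits [11:5], and a 20-bit upper-address tag 502) imply a simple decomposition of a 32-bit address 182. The sketch below illustrates that arithmetic only; the function and constant names are assumptions introduced for this example.

```python
# Geometry of the described embodiment; the constant names are hypothetical.
LINE_BYTES = 32        # bytes per line, so the start field 404 needs 5 bits
NUM_SETS = 128         # selected by address bits [11:5]
NUM_WAYS = 4
ENTRIES_PER_WAY = 2    # the A and B portions


def split_address(addr):
    """Split a 32-bit address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)      # bits [4:0]
    index = (addr >> 5) & (NUM_SETS - 1)  # bits [11:5]
    tag = (addr >> 12) & 0xFFFFF          # bits [31:12], the upper 20 bits
    return tag, index, offset


# Total target addresses the BTAC can hold in this embodiment.
CAPACITY = NUM_SETS * NUM_WAYS * ENTRIES_PER_WAY
```

For example, `split_address(0x12345678)` yields tag `0x12345`, set index 51, and byte offset 24, and `CAPACITY` evaluates to 1024, matching the figure stated for this embodiment.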

A tag array entry 314 also includes an A valid bit 504, which is true if the target address 402 stored in the A portion of the corresponding entry of the target address array 302 is valid. A tag array entry 314 also includes a B valid bit 506, which is true if the target address 402 stored in the B portion of the corresponding entry of the target address array 302 is valid.

A tag array entry 314 also includes a 3-bit lru field 508 that indicates which of the four ways of the selected set is least recently used. In one embodiment, the BTAC 142 updates the lru field 508 only when a BTAC branch is performed. That is, the BTAC 142 updates the lru field 508 only when the BTAC 142 predicts that a branch instruction will be taken and the microprocessor 100 branches, based on that prediction, to the predicted target address 164 provided by the BTAC 142. The BTAC 142 updates the lru field 508 while the BTAC branch is being performed, a period during which the BTAC 142 is not being read, so the update does not require the BTAC write queue 144.

Fig. 6 is a block diagram illustrating the contents of a counter array entry 316 of Fig. 3 according to the present invention.

A counter array entry 316 includes a prediction state A counter 602. In one embodiment, the prediction state A counter 602 is a two-bit saturating counter that counts up each time the microprocessor 100 determines that the associated branch instruction is taken and counts down each time it determines that the associated branch instruction is not taken. The prediction state A counter 602 saturates at the binary value b'11 when counting up and at b'00 when counting down. In one embodiment, if the value of the prediction state A counter 602 is b'11 or b'10, the BTAC 142 predicts that the branch instruction associated with the A portion of the selected target address array entry 312 will be taken; otherwise, the BTAC 142 predicts that it will not be taken. A counter array entry 316 also includes a prediction state B counter 604, which operates like the prediction state A counter 602 but with respect to the B portion of the selected target address array entry 312.
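The behavior of counter 602 (and, by symmetry, counter 604) can be sketched as follows. This is a minimal illustrative model under the stated embodiment; the function names are assumptions, not names from the disclosure.

```python
def update_counter(value, taken):
    """Advance a two-bit saturating counter by one resolved branch outcome."""
    if taken:
        return min(value + 1, 0b11)  # saturate at b'11 when counting up
    return max(value - 1, 0b00)      # saturate at b'00 when counting down


def predict_taken(value):
    """Predict taken when the counter holds b'10 or b'11."""
    return value >= 0b10
```

Because the counter must cross from b'01 to b'10 before the prediction flips, a single anomalous outcome from a saturated state (b'11 or b'00) does not change the prediction.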

A counter array entry 316 also includes an A/B lru bit 606. A binary value of b'1 in the A/B lru bit 606 indicates that the A portion of the selected target address array entry 312 is least recently used; otherwise, the B portion of the selected target address array entry 312 is least recently used. In one embodiment, the A/B lru bit 606 is updated along with the prediction state A and B counters 602 and 604 when the branch instruction reaches the store stage 128, where the branch outcome (that is, whether the branch is taken) is determined. In one embodiment, updating a counter array entry 316 does not require the BTAC write queue 144, because the counter array 306 includes both a read port and a write port, as described above with respect to Fig. 3.

Fig. 7 is a block diagram illustrating the contents of a BTAC write request 176 of Fig. 1 according to the present invention. Fig. 7 shows the information generated by the store stage 128 for updating a BTAC 142 entry. This information is carried on the BTAC write request signal 176 to the BTAC write queue 144, and it is also the content of an entry of the BTAC write queue 144, as shown in Fig. 8.

A BTAC write request 176 includes a branch instruction address field 702, which holds the address of a previously executed branch instruction for which the BTAC 142 is to be updated. When the write request 176 subsequently updates the BTAC 142, the upper 20 bits of the branch instruction address 702 are stored in the tag field 502 of the tag array entry 314 of Fig. 5, and the lower 7 index bits [11:5] of the branch instruction address 702 are used as the index into the BTAC 142. In one embodiment, the branch instruction address 702 is a 32-bit field.

A BTAC write request 176 also includes a target address 706, to be stored in the target address field 402 of Fig. 4.

A BTAC write request 176 also includes a start field 708, to be stored in the start field 404 of Fig. 4. A BTAC write request 176 also includes a wrap bit 712, to be stored in the wrap bit 406 of Fig. 4.

A BTAC write request 176 also includes a write-enable-A field 714, which specifies whether the A portion of the selected target address array entry 312 is to be updated with the information specified in the BTAC write request 176. A BTAC write request 176 also includes a write-enable-B field 716, which specifies whether the B portion of the selected target address array entry 312 is to be updated with the information specified in the BTAC write request 176.

A BTAC write request 176 also includes an invalidate-A field 718, which specifies whether the A portion of the selected target address array entry 312 is to be invalidated. Invalidating the A portion of the selected target address array entry 312 comprises clearing the A valid bit 504 of Fig. 5. A BTAC write request 176 also includes an invalidate-B field 722, which specifies whether the B portion of the selected target address array entry 312 is to be invalidated. Invalidating the B portion of the selected target address array entry 312 comprises clearing the B valid bit 506 of Fig. 5.

A BTAC write request 176 also includes a 4-bit way field 724, which specifies which way of the selected set is to be updated. The way field 724 is fully decoded. In one embodiment, the microprocessor 100 determines the value of the way field 724 when it reads the BTAC 142 to obtain a branch prediction, and the value is passed down the pipeline stages to the store stage 128 for inclusion in the BTAC write request 176. If the microprocessor 100 is updating an existing entry of the BTAC 142, for example because the current fetch address 162 hit in the BTAC 142, the microprocessor 100 populates the way field 724 with the way of the existing entry. If the microprocessor 100 is writing a new entry into the BTAC 142, for example for a new branch instruction, the microprocessor 100 populates the way field 724 with the least recently used way of the selected BTAC 142 set. In one embodiment, the microprocessor 100 determines the least recently used way from the lru field 508 of Fig. 5 when it reads the BTAC 142 to obtain the branch prediction.

Fig. 8 is a block diagram illustrating the BTAC write queue 144 of Fig. 1 according to the present invention.

The BTAC write queue 144 includes a plurality of storage elements 802 that store the BTAC write requests 176 of Fig. 7. In one embodiment, the BTAC write queue 144 includes six storage elements 802 for storing six BTAC write requests 176.

For each BTAC write request entry 802, the BTAC write queue 144 also includes a corresponding valid bit 804, which is true when the corresponding entry is valid and false when it is invalid.

The BTAC write queue 144 also includes control logic 806 coupled to the storage elements 802 and the valid bits 804. The control logic 806 is also coupled to the queue depth register 146. The control logic 806 increments the queue depth 146 when a BTAC write request 176 is loaded into the BTAC write queue 144 and decrements the queue depth 146 when a BTAC write request 176 is shifted out of the BTAC write queue 144. The control logic 806 receives the BTAC write request signal 176 from the store stage 128 of Fig. 1 and stores the received request in an entry 802. The control logic 806 also receives the branch misprediction signal 152, the prediction override signal 154, the instruction buffer full signal 156, and the instruction cache idle signal 158 of Fig. 1. Whenever the queue depth 146 is greater than zero, the control logic 806 generates a true value on the BWQ not empty signal 218 of Fig. 2. Whenever the queue depth 146 equals the total number of entries 802 (six in the embodiment of Fig. 8), the control logic 806 generates a true value on the BWQ full signal 222 of Fig. 2. When the control logic 806 generates a true value on the BWQ not empty signal 218, it also provides on the BWQ address signal 178 of Fig. 1 the branch instruction address 702 of Fig. 7 of the oldest (bottom) entry 802 of the BTAC write queue 144. In addition, when the control logic 806 generates a true value on the BWQ not empty signal 218, it provides on the BWQ data signal 248 the fields 706 to 724 of Fig. 7 of the oldest (bottom) entry 802 of the BTAC write queue 144.
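The relationship between the queue depth 146 and the two status signals 218 and 222 can be modeled with a small software queue. The sketch below is an assumption-laden illustration, not the hardware design: the class and method names are hypothetical, and a `deque` stands in for the storage elements 802 and valid bits 804.

```python
from collections import deque


class BtacWriteQueue:
    """Toy model of the six-entry write queue 144 (names are illustrative)."""

    def __init__(self, capacity=6):
        self.capacity = capacity
        self.entries = deque()  # oldest request sits at the left end

    @property
    def depth(self):
        """Queue depth 146: number of valid pending requests."""
        return len(self.entries)

    @property
    def not_empty(self):
        """Models signal 218: true whenever depth is greater than zero."""
        return self.depth > 0

    @property
    def full(self):
        """Models signal 222: true when depth equals the entry count."""
        return self.depth == self.capacity

    def load(self, request):
        """Accept a new write request; depth increases by one."""
        if self.full:
            raise OverflowError("write queue is full")
        self.entries.append(request)

    def oldest(self):
        """Request that would drive signals 178 and 248, or None if empty."""
        return self.entries[0] if self.entries else None

    def retire(self):
        """Remove the oldest request after it updates the BTAC."""
        return self.entries.popleft()
```

In this model the depth is derived from the container length rather than kept in a separate register, which is a simplification relative to the described register 146.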

Fig. 9 is a flowchart illustrating operation of the BTAC write queue 144 of Fig. 1 according to the present invention. Flow begins at decision block 902.

In decision block 902, the BTAC write queue 144 determines whether it is full by checking whether the queue depth 146 of Fig. 1 equals the total number of entries of the BTAC write queue 144. If so, flow proceeds to block 918 to update the BTAC 142; otherwise, flow proceeds to decision block 904.

In decision block 904, the BTAC write queue 144 determines whether the instruction cache 104 of Fig. 1 is idle by examining the instruction cache idle signal 158. If so, flow proceeds to decision block 922 to update the BTAC 142 if necessary, since the BTAC 142 is most likely not being read at this time; otherwise, flow proceeds to decision block 906.

In decision block 906, the BTAC write queue 144 determines whether the instruction buffer 106 of Fig. 1 is full by examining the instruction buffer full signal 156. If so, flow proceeds to decision block 922 to update the BTAC 142 if necessary, since the BTAC 142 is most likely not being read at this time; otherwise, flow proceeds to decision block 908.

In decision block 908, the BTAC write queue 144 determines whether a BTAC 142 branch prediction has been overridden by examining the prediction override signal 154. If so, flow proceeds to decision block 922 to update the BTAC 142 if necessary, since the BTAC 142 is most likely not being read at this time; otherwise, flow proceeds to decision block 912.

In decision block 912, the BTAC write queue 144 determines whether a BTAC 142 branch prediction has been corrected by examining the branch misprediction signal 152. If so, flow proceeds to decision block 922 to update the BTAC 142 if necessary, since the BTAC 142 is most likely not being read at this time; otherwise, flow proceeds to decision block 914.

In decision block 914, the BTAC write queue 144 determines whether a BTAC write request 176 has been generated. If not, flow returns to decision block 902; otherwise, flow proceeds to block 916.

In block 916, the BTAC write queue 144 loads the BTAC write request 176 and increments the queue depth 146. The BTAC write request 176 is loaded into the topmost invalid entry of the BTAC write queue 144, and that entry is then marked valid. Flow returns to decision block 902.

In block 918, the BTAC write queue 144 updates the BTAC 142 with its oldest (bottom) entry and decrements the queue depth 146. The BTAC write queue 144 is then shifted down by one entry. The BTAC write queue 144 updates the BTAC 142 with the oldest entry by providing the value of the branch instruction address field 702 of Fig. 7 of the oldest entry on the BWQ address signal 178 and the remainder of the oldest BTAC write request 176 entry on the BWQ data signal 248. In addition, the BTAC write queue 144 asserts the BWQ not empty signal 218 to the arbiter 202 of Fig. 2. If block 918 was reached from decision block 902, the BTAC write queue 144 also asserts the BWQ full signal 222 to the arbiter 202 of Fig. 2. Flow proceeds from block 918 to decision block 914.

It should be noted that if, during a clock cycle in which a BTAC read request signal 212 is pending, the BTAC write queue 144 asserts the BWQ full signal 222 and the arbiter 202 grants the BTAC write queue 144 access to the BTAC 142, then the BTAC 142 signals a miss. That miss may be a false miss if the BTAC 142 in fact holds a valid target address for a branch instruction, in the cache line specified by the current fetch address 162, that the BTAC 142 predicts will be taken. Advantageously, however, because in most cases writes of the BTAC 142 can be deferred until the BTAC 142 is not being read, the BTAC write queue 144 reduces the likelihood that the BTAC 142 generates false misses.

In decision block 922, the BTAC write queue 144 determines whether it is empty by checking whether the queue depth 146 equals zero. If so, flow proceeds to decision block 914; otherwise, flow proceeds to block 918 to update the BTAC 142, since the BTAC 142 is most likely not being read at this time.
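The decisions of Fig. 9 that lead to block 918 can be condensed into a single predicate. The following sketch is an interpretation of the flowchart text, not code from the disclosure; the function and parameter names are hypothetical stand-ins for the queue depth 146 and signals 158, 156, 154, and 152.

```python
def should_write_btac(depth, capacity, icache_idle, ibuf_full,
                      prediction_overridden, mispredicted):
    """Return True if one pass through the Fig. 9 flow reaches block 918."""
    # Decision block 902: a full queue must drain regardless of BTAC reads.
    if depth == capacity:
        return True
    # Decision blocks 904 to 912: any of these suggests the BTAC is
    # most likely not being read in this cycle.
    btac_likely_unread = (icache_idle or ibuf_full
                          or prediction_overridden or mispredicted)
    # Decision block 922: there is something to write only if not empty.
    return btac_likely_unread and depth > 0
```

The first branch is the only path on which a write can collide with a pending read, which is why the text associates the full condition with possible false misses.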

Figure 10 is a block diagram illustrating logic within the microprocessor 100 for invalidating redundant target addresses in the BTAC 142 of Fig. 1 according to the present invention.

Figure 10 shows the tag array 304 of the BTAC 142 of Fig. 3 receiving the address 182 of Fig. 1 and, in response, generating four tags: tag0 1002A, tag1 1002B, tag2 1002C, and tag3 1002D, referred to collectively as tags 1002. The tags 1002 comprise the tag 502 of Fig. 5 from each of the four ways of the tag array 304. In addition, the tag array 304 responsively generates eight valid bits, denoted valid[7:0] 1004, which are the A valid bit 504 and the B valid bit 506 from each of the four ways of the tag array 304.

Microprocessor 100 also includes comparators 1012, coupled to the tag array 304, that receive the address 182. In the embodiment of Figure 10, the comparators 1012 comprise four 20-bit comparators, each of which compares the upper 20 bits of the address 182 with a corresponding one of the tags 1002 to produce a corresponding one of four match signals: match0 1006A, match1 1006B, match2 1006C, and match3 1006D, referred to collectively as match signals 1006. If the address 182 matches the corresponding one of the tags 1002, the respective comparator 1012 generates a true value on its match signal 1006.

Microprocessor 100 also includes control logic 1014, coupled to the comparators 1012, that receives the match signals 1006 and the valid bits 1004. If more than one way of the selected set of the tag array 304 has a true match signal 1006 and at least one true valid bit 1004, the control logic 1014 stores a true value in a redundant TA flag register 1024 to indicate that more than one valid target address for the same branch instruction is stored in the BTAC 142. In addition, the control logic 1014 causes the address 182 to be loaded into a redundant TA address register 1026. Finally, the control logic 1014 loads redundant TA invalidate data into a redundant TA invalidate data register 1022. In one embodiment, the data stored in the redundant TA invalidate data register 1022 is similar to a BTAC write request 176 of Fig. 7, except that the branch instruction address 702 is not stored, because the address of the branch instruction is stored in the redundant TA address register 1026, and the target address 706, start field 708, and wrap bit 712 are also not stored, because these fields are don't-cares when a BTAC 142 entry is invalidated. Hence, when a redundant TA invalidate is performed, the target address array 302 is not written; only the tag array 304 is updated to invalidate the redundant BTAC 142 entries. The output of the redundant TA invalidate data register 1022 comprises the redundant TA data signal 244 of Fig. 2. The output of the redundant TA flag register 1024 comprises the redundant TA request 214 of Fig. 2. The output of the redundant TA address register 1026 comprises the redundant TA address 234 of Fig. 2. In one embodiment, the equations for generating the way value 724 stored in the redundant TA invalidate data register 1022 and the value stored in the redundant TA flag register 1024 are as shown in Table 2 below. In Table 2, valid[3] is the logical OR of A valid[3] 504 and B valid[3] 506; valid[2] is the logical OR of A valid[2] 504 and B valid[2] 506; valid[1] is the logical OR of A valid[1] 504 and B valid[1] 506; and valid[0] is the logical OR of A valid[0] 504 and B valid[0] 506.

redundantInvalWay[3] = (valid[3] & match[3]) & ((valid[0] & match[0]) | (valid[1] & match[1]) | (valid[2] & match[2]));

redundantInvalWay[2] = (valid[2] & match[2]) & ((valid[0] & match[0]) | (valid[1] & match[1]));

redundantInvalWay[1] = (valid[1] & match[1]) & (valid[0] & match[0]);

redundantInvalWay[0] = 0; /* way 0 is never invalidated */

redundantTAFlag = ((valid[3] & match[3]) & (valid[2] & match[2])) |
                  ((valid[3] & match[3]) & (valid[1] & match[1])) |
                  ((valid[3] & match[3]) & (valid[0] & match[0])) |
                  ((valid[2] & match[2]) & (valid[1] & match[1])) |
                  ((valid[2] & match[2]) & (valid[0] & match[0])) |
                  ((valid[1] & match[1]) & (valid[0] & match[0]));

Table 2
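The equations of Table 2 can be checked with a direct transcription. The Python sketch below mirrors the table term by term; the function name and the list-based representation of the four per-way valid and match values are assumptions made for illustration.

```python
def redundant_invalidate(valid, match):
    """Evaluate the Table 2 equations for one four-way set.

    `valid` and `match` are 4-element sequences indexed by way number.
    Returns (redundantInvalWay as a list of four booleans, redundantTAFlag).
    """
    hit = [bool(v) and bool(m) for v, m in zip(valid, match)]
    inval = [False] * 4                          # way 0 is never invalidated
    inval[1] = hit[1] and hit[0]
    inval[2] = hit[2] and (hit[0] or hit[1])
    inval[3] = hit[3] and (hit[0] or hit[1] or hit[2])
    # The OR of all pairwise ANDs is true exactly when two or more ways hit.
    flag = sum(hit) >= 2
    return inval, flag
```

Read this way, the equations keep the lowest-numbered valid matching way and mark every higher-numbered valid matching way for invalidation. For instance, with all ways valid and ways 1 and 2 matching, only way 2 is marked and the flag is true.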

To aid understanding of the operation of the redundant target address invalidation logic of Figure 10, which is described below with respect to Figure 11, the following example describes an instruction execution sequence that can produce redundant target address entries for the same branch instruction in the BTAC 142.

A first current fetch address 162 of Fig. 1 is provided to the instruction cache 104 and the BTAC 142. The cache line selected by the first current fetch address 162 contains a branch instruction, referred to as branch-A. The first current fetch address 162 selects a set of the BTAC 142, referred to as set N. None of the tags 1002 in the ways of set N matches the first current fetch address 162; therefore, the BTAC 142 generates a miss. In this example, the lru value 508 indicates that the least recently used way is way 2. Consequently, the information for updating the BTAC 142 when branch-A is resolved, namely that way 2 is to be updated, travels down the pipeline along with branch-A.

Next, a second current fetch address 162 is provided to the instruction cache 104 and the BTAC 142. The cache line selected by the second current fetch address 162 contains a branch instruction, referred to as branch-B. The second current fetch address 162 also selects set N and hits in way 3 of set N; therefore, the BTAC 142 generates a hit. In addition, the BTAC 142 updates the lru value 508 of set N to way 1.

Next, because branch-A is part of a small program code loop, the first current fetch address 162 is again provided to the instruction cache 104 and the BTAC 142 and again selects set N. Because the first execution of branch-A has not yet reached the store stage 128 of Fig. 1, the BTAC 142 has not been updated with the target address of branch-A. Therefore, the BTAC 142 generates a miss again. This time, however, the lru value 508 indicates that the least recently used way is way 1, because the lru value 508 was updated as a result of the branch-B hit. Consequently, the information for updating the BTAC 142 when the second execution of branch-A is resolved, namely that way 1 is to be updated, travels down the pipeline along with the second branch-A.

Next, the first branch-A reaches the store stage 128 and generates a BTAC write request 176 to update way 2 of set N with the target address of branch-A; the write is subsequently performed.

Next, the second branch-A reaches the store stage 128 and generates a BTAC write request 176 to update way 1 of set N with the target address of branch-A; the write is subsequently performed. As a result, the BTAC 142 holds two valid entries for the same branch instruction, branch-A. One of these entries is redundant and causes the BTAC 142 to be used inefficiently, because the redundant entry could otherwise have been used for another branch instruction and/or may have displaced a valid target address of another branch instruction.
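The sequence above can be replayed in a few lines to show how the duplicate arises. This is a deliberately simplified sketch: the tag values are hypothetical, the set holds one tag per way, and the LRU values (way 2, then way 1) are taken directly from the example rather than computed by an LRU algorithm.

```python
def replay_example():
    """Replay the branch-A / branch-B sequence on one four-way set (set N)."""
    tag_a, tag_b = 0xAAAAA, 0xBBBBB    # hypothetical tags for the two branches
    ways = [None, None, None, tag_b]   # way 3 already caches branch-B
    lru_way = 2                        # LRU way at the first fetch
    in_flight = []                     # (tag, way) updates still in the pipeline

    # First fetch of branch-A misses; the update targets LRU way 2.
    if tag_a not in ways:
        in_flight.append((tag_a, lru_way))

    # Fetch of branch-B hits way 3; per the example, LRU becomes way 1.
    if tag_b in ways:
        lru_way = 1

    # Second fetch of branch-A misses again: the first update is unwritten.
    if tag_a not in ways:
        in_flight.append((tag_a, lru_way))

    # Both executions reach the store stage and their writes are performed.
    for tag, way in in_flight:
        ways[way] = tag
    return ways
```

After the replay, ways 1 and 2 both hold the tag of branch-A, which is the redundant-entry condition that the logic of Figure 10 detects.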

Figure 11 is a flowchart illustrating operation of the redundant target address invalidation apparatus of Figure 10 according to the present invention. Flow begins at block 1102.

In block 1102, the arbiter 202 grants the BTAC read request 212 of Fig. 2 access to the BTAC 142, causing the multiplexer 148 to select the current fetch address 162 for provision on the address signal 182 of Fig. 1, and generates the control signal 252 of Fig. 2 to indicate a read of the BTAC 142. Consequently, the lower-order bits of the current fetch address 162, via the address 182, serve as the index that selects a set of the BTAC 142. Flow proceeds to block 1104.

In block 1104, the comparators 1012 compare the tags 1002 of Figure 10 from all four ways of the selected BTAC 142 set with the upper-order bits of the current fetch address 162 provided on the address signal 182, to generate the match signals 1006 of Figure 10. The control logic 1014 receives the match signals 1006 and the valid bits 1004 of Figure 10. Flow proceeds to decision block 1106.

In decision block 1106, control logic 1014 determines whether more than one valid tag match occurred. That is, control logic 1014 determines from valid bits 1004 and match signals 1006 whether two or more ways of the BTAC 142 set selected by current fetch address 162 hold a valid, matching tag 1002. If so, flow proceeds to block 1108; otherwise, flow ends.

In block 1108, control logic 1014 stores a true value in redundant TA flag register 1024, stores address 182 in redundant TA address register 1026, and stores invalidation data in redundant TA invalidate data register 1022. In particular, control logic 1014 stores true values for we-A 714, we-B 716, inv-A 718, and inv-B 722 in redundant TA invalidate data register 1022. In addition, control logic 1014 stores in the way field 724 of redundant TA invalidate data register 1022 the value given by Table 2, described above with respect to Figure 10. Flow proceeds to block 1112.

In block 1112, arbiter 202 grants the redundant TA request 214 of Fig. 2 access to BTAC 142, causing multiplexer 148 to select redundant TA address 234 for provision on address signal 182, and generates control signal 252 of Fig. 2 to indicate a write of BTAC 142. Accordingly, the lower-order bits of redundant TA address 234 serve, via address 182, as an index to select a set of BTAC 142. BTAC 142 receives the data on redundant TA data signal 244 from redundant TA invalidate data register 1022 and invalidates the ways of the selected set specified by way field 724. Flow ends at block 1112.
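The flow of Figure 11 (blocks 1102 through 1112) can be sketched in software as follows. This is a simplified model under stated assumptions: the index width `INDEX_BITS` and the address split are hypothetical, and because Table 2 is not reproduced here, the choice of which way to keep (the lowest-numbered matching way) is an assumption made only for illustration.

```python
from dataclasses import dataclass

INDEX_BITS = 7  # assumption: number of low-order address bits used as set index


@dataclass
class Entry:
    valid: bool = False   # valid bit 1004
    tag: int = 0          # tag 1002


def split_address(addr, index_bits=INDEX_BITS):
    """Split a fetch address into (set index, tag); simplified, no line offset."""
    return addr & ((1 << index_bits) - 1), addr >> index_bits


def invalidate_redundant(btac, fetch_addr):
    """Detect more than one valid matching way and invalidate all but one.

    Returns the list of way numbers that were invalidated (empty if none).
    """
    index, tag = split_address(fetch_addr)           # block 1102: select the set
    ways = btac[index]
    hits = [i for i, e in enumerate(ways)            # block 1104: compare tags
            if e.valid and e.tag == tag]
    if len(hits) < 2:                                # decision block 1106
        return []
    victims = hits[1:]                               # assumption: keep lowest way
    for i in victims:                                # block 1112: invalidate
        ways[i].valid = False
    return victims


# Usage: a 4-way BTAC in which ways 1 and 2 of set 5 hold the same branch.
btac = [[Entry() for _ in range(4)] for _ in range(1 << INDEX_BITS)]
addr = (0x2A << INDEX_BITS) | 5
btac[5][1] = Entry(True, 0x2A)
btac[5][2] = Entry(True, 0x2A)
removed = invalidate_redundant(btac, addr)
```

In the described hardware the detection happens during a BTAC read and the invalidation is a separate, arbitrated write using the saved address and way field; the sketch collapses the two steps into one call.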

Figure 12 is a block diagram illustrating deadlock avoidance logic within microprocessor 100 of Fig. 1 according to the present invention.

Figure 12 shows BTAC 142, instruction cache 104, instruction buffer 106, instruction formatter 108, formatted instruction queue 112, and multiplexer 136 of Fig. 1, and control logic 1014 of Figure 10.

As shown in Figure 12, microprocessor 100 also includes a deadlock invalidate data register 1222, a deadlock flag register 1224, and a deadlock address register 1226.

Instruction formatter 108 decodes the instructions stored in instruction buffer 106 and generates a true value on F_wrap signal 1202 when it decodes a branch instruction that wraps across two cache lines. In particular, instruction formatter 108 generates a true value on F_wrap signal 1202 upon decoding the first portion of such a wrapping branch instruction, which is stored in a first cache line in instruction buffer 106, regardless of whether it has yet decoded the remainder of the wrapping branch instruction. The remainder is located in a second cache line, which may not yet be present in instruction buffer 106. F_wrap signal 1202 is provided to control logic 1014.

When current fetch address 162 misses in instruction cache 104, instruction cache 104 generates a true value on miss signal 1206. Miss signal 1206 is provided to control logic 1014.

When the current fetch address 162 supplied to instruction cache 104 is speculative, that is, when current fetch address 162 is a predicted address, as when multiplexer 136 selects BTAC predicted target address 164 as current fetch address 162, control logic 1014 generates a true value on speculative signal 1208. Speculative signal 1208 is provided to instruction cache 104. In one embodiment, instruction cache 104 forwards speculative signal 1208 to instruction fetcher 102 of Fig. 1, so that instruction fetcher 102 refrains from fetching from memory a cache line that misses in instruction cache 104 when the memory address is speculative, for reasons explained below with respect to Figure 13.

BTAC 142 generates a taken/not taken (T/NT) signal 1212, which is provided to control logic 1014. A true value on T/NT signal 1212 indicates that address 182 hit in BTAC 142; that BTAC 142 predicts that a branch instruction is contained in the cache line provided by instruction cache 104 in response to current fetch address 162 and that the branch instruction will be taken; and that BTAC 142 is providing a target address of the branch instruction on BTAC predicted target address signal 164. BTAC 142 generates T/NT signal 1212 from the value of prediction state A 602 or prediction state B 604 of Fig. 6, depending on whether BTAC 142 uses portion A or portion B to make the prediction.

BTAC 142 also generates a B_wrap signal 1214, which is provided to control logic 1014. B_wrap signal 1214 carries the value of wrap bit 406 of Fig. 4 of the selected BTAC target address array entry 312. Hence, a false value on B_wrap signal 1214 indicates that BTAC 142 predicts that the branch instruction does not wrap across two cache lines. In one embodiment, control logic 1014 registers B_wrap signal 1214 in order to retain the value of B_wrap 1214 from the previous access of BTAC 142.

Control logic 1014 also generates the current instruction pointer 168 of Fig. 1. Control logic 1014 also generates a control signal 1204, which is the input select signal of multiplexer 136.

If control logic 1014 detects a deadlock condition (described in detail below; namely, a false value on the registered B_wrap signal 1214 together with true values on F_wrap signal 1202, miss signal 1206, and speculative signal 1208), control logic 1014 stores a true value in deadlock flag register 1224 to indicate that a deadlock condition exists and that the BTAC 142 entry causing it must be invalidated. In addition, control logic 1014 causes address 182 to be loaded into deadlock address register 1226. Finally, control logic 1014 loads deadlock invalidate data into deadlock invalidate data register 1222. In one embodiment, the data stored in deadlock invalidate data register 1222 is similar to a BTAC write request 176 of Fig. 7, except that branch instruction address 702 is not stored, because the address of the branch instruction is stored in deadlock address register 1226, and that target address 706, start bits 708, and wrap bit 712 are likewise not stored, because these fields are don't-cares in an invalidated BTAC 142 entry. Accordingly, when a deadlock invalidation is performed, target address array 302 is not written; only tag array 304 is updated, to invalidate the mispredicting BTAC 142 entry. The output of deadlock invalidate data register 1222 comprises the deadlock data signal 246 of Fig. 2. The output of deadlock flag register 1224 comprises the deadlock request 216 of Fig. 2. The output of deadlock address register 1226 comprises the deadlock address 236 of Fig. 2. The way value 724 stored in deadlock invalidate data register 1222 is written with the way of the BTAC 142 entry that caused the deadlock condition.

If control logic 1014 detects a deadlock condition, then after the mispredicting entry has been invalidated, control logic 1014 also generates a value on control signal 1204 that causes multiplexer 136 to select current instruction pointer 168. Microprocessor 100 thereby branches to current instruction pointer 168, so that the cache line containing the mispredicted branch instruction is fetched again.

Figure 13 is a flowchart illustrating operation of the deadlock avoidance logic of Figure 12 according to the present invention. Flow begins at block 1302.

In block 1302, current fetch address 162 is supplied to instruction cache 104 and BTAC 142 via address signal 182. In Figure 13, this current fetch address 162 is referred to as fetch address A. Flow proceeds to block 1304.

In block 1304, the cache line specified by fetch address A, referred to as cache line A, is provided to instruction buffer 106. Cache line A contains a first portion of a branch instruction, but not the entire branch instruction. Flow proceeds to block 1306.

In block 1306, BTAC 142 responds to fetch address A by indicating on T/NT signal 1212 its prediction of whether the branch instruction in cache line A will be taken, generating a false value on B_wrap signal 1214, and providing a speculative target address on BTAC predicted target address 164. Flow proceeds to block 1308.

In block 1308, control logic 1014 controls multiplexer 136 to select BTAC predicted target address 164 as the next current fetch address 162, referred to as fetch address B. Control logic 1014 also generates a true value on speculative signal 1208, because BTAC predicted target address 164 is speculative. Flow proceeds to block 1312.

In block 1312, instruction cache 104 generates a true value on miss signal 1206 to indicate that fetch address B missed in instruction cache 104. Normally, instruction fetcher 102 would fetch the missing cache line from memory; however, because speculative signal 1208 is true, instruction fetcher 102 does not fetch the missing cache line from memory, for the reasons described below. Flow proceeds to block 1314.

In block 1314, instruction formatter 108 decodes cache line A in instruction buffer 106 and generates a true value on F_wrap signal 1202, because the branch instruction wraps across two cache lines. Instruction formatter 108 waits for the next cache line to be stored in instruction buffer 106 so that it can finish formatting the branch instruction for delivery to formatted instruction queue 112. Flow proceeds to decision block 1316.

In decision block 1316, control logic 1014 determines whether B_wrap signal 1214 is false, F_wrap signal 1202 is true, miss signal 1206 is true, and speculative signal 1208 is true. Together these constitute a deadlock condition, discussed below. If all four hold, flow proceeds to block 1318; otherwise, flow ends.
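The test of decision block 1316 is a four-input conjunction over signals 1214, 1202, 1206, and 1208. A minimal sketch follows; the function and parameter names are illustrative and do not come from the specification.

```python
from itertools import product


def deadlock_condition(b_wrap, f_wrap, miss, speculative):
    """True only when the BTAC predicted no wrap (b_wrap false) while the
    formatter saw a wrapping branch, the fetch missed in the instruction
    cache, and the fetch address was speculative."""
    return (not b_wrap) and f_wrap and miss and speculative


# Enumerate all 16 input combinations and keep those that trigger the condition.
triggering = [combo for combo in product([False, True], repeat=4)
              if deadlock_condition(*combo)]
```

Exactly one of the sixteen input combinations satisfies the condition, which is why the check is cheap to implement alongside the normal fetch path.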

In block 1318, control logic 1014 invalidates the BTAC 142 entry that caused the deadlock condition, as described above with respect to Figure 12. Consequently, the next time fetch address A is supplied to BTAC 142, BTAC 142 will miss, because the entry that caused the deadlock condition is now invalid. Flow proceeds to block 1322.

In block 1322, control logic 1014 controls multiplexer 136 to branch to current instruction pointer 168, as described above with respect to Figure 12. In addition, when control logic 1014 controls multiplexer 136 to select current instruction pointer 168, it generates a false value on speculative signal 1208, because current instruction pointer 168 is not a speculative memory address. Current instruction pointer 168 will very likely hit in instruction cache 104; if it does not, however, instruction fetcher 102 can still fetch the cache line specified by current instruction pointer 168 from memory, because speculative signal 1208 indicates that current instruction pointer 168 is not speculative. Flow ends at block 1322.
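Blocks 1316 through 1322 can be summarized as: on a deadlock condition, invalidate the offending entry and redirect the fetch to the non-speculative current instruction pointer 168. The sketch below models that recovery in software. The dictionary-based BTAC, the `FetchDecision` type, and all addresses are hypothetical and serve only to illustrate the control flow.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchDecision:
    next_fetch_address: int
    speculative: bool            # value driven on speculative signal 1208
    may_fetch_from_memory: bool  # a non-speculative miss may go to memory


def avoid_deadlock(btac, fetch_addr_a, current_ip,
                   b_wrap, f_wrap, miss, speculative) -> Optional[FetchDecision]:
    """Return a redirect decision if a deadlock condition exists, else None.

    btac is a hypothetical mapping: fetch address -> {"valid": bool, ...}.
    """
    if (not b_wrap) and f_wrap and miss and speculative:   # decision block 1316
        btac[fetch_addr_a]["valid"] = False                # block 1318
        # Block 1322: refetch from the current instruction pointer, which is
        # not speculative, so a cache miss may now be serviced from memory.
        return FetchDecision(current_ip, speculative=False,
                             may_fetch_from_memory=True)
    return None


# Usage with made-up addresses: the entry for fetch address A mispredicted
# "no wrap", and its predicted target missed in the instruction cache.
btac = {0x100: {"valid": True, "target": 0x900}}
decision = avoid_deadlock(btac, 0x100, 0x0FC,
                          b_wrap=False, f_wrap=True, miss=True, speculative=True)
```

After the redirect, a repeated lookup of fetch address A finds no valid entry, so the BTAC no longer supplies the mispredicted target and the wrapping branch is fetched sequentially.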

When decision block 1316 evaluates true, a deadlock condition exists because the conditions required to cause a deadlock are all met. The first condition is that a multi-byte branch instruction wraps across two different cache lines. That is, the first portion of the branch instruction's bytes is at the end of a first cache line, and the second portion is at the beginning of the next sequential cache line. Because wrapping branch instructions are possible, BTAC 142 must store information predicting whether a branch instruction wraps across cache lines, so that control logic 1014 knows whether the next sequential cache line must be fetched, to obtain the second portion of the branch instruction's bytes, before the cache line at target address 164 is fetched. If BTAC 142 has stored incorrect prediction information, BTAC 142 may incorrectly predict that the branch instruction does not wrap when in fact it does. In that case, instruction formatter 108 decodes the cache line containing the first portion of the branch instruction and detects that a branch instruction is present, but that not all of its bytes are available for decoding. Instruction formatter 108 therefore waits for the next cache line. The pipeline remains stalled, waiting for more instructions to be formatted for execution.

The second condition is that, because BTAC 142 predicted that the branch instruction does not wrap, control logic 1014 fetches the cache line at the target address 164 provided by BTAC 142 (rather than the next sequential cache line). However, target address 164 misses in instruction cache 104. Consequently, the next cache line that instruction formatter 108 is waiting for must be fetched from memory.

The third condition is that some microprocessor chipsets do not expect instructions to be fetched from certain memory address ranges; if the microprocessor fetches instructions from those ranges, the system may hang or enter some other undesirable state. A speculative address, such as the target address 164 provided by BTAC 142, may cause an instruction fetch from such a memory address range. Therefore, microprocessor 100 does not fetch a missing cache line from memory on the basis of the speculative BTAC predicted target address 164.

Consequently, instruction formatter 108 and the rest of the pipeline stall, waiting for another cache line. Meanwhile, instruction fetcher 102 also stalls, waiting for the pipeline to tell it to perform a non-speculative fetch. In a non-deadlock case, such as when target address 164 hits in instruction cache 104, instruction formatter 108 would format the branch instruction (albeit with the wrong bytes) and deliver the formatted branch instruction to the execution stages of the pipeline, which would detect the misprediction and correct the erroneous prediction of BTAC 142, thereby causing speculative signal 1208 to become false. In the deadlock case, however, instruction formatter 108 is waiting for the next cache line and cannot provide the branch instruction to the execution stages, so the execution stages cannot detect the misprediction. Hence a deadlock results. The deadlock avoidance logic of Figure 12, however, advantageously prevents the deadlock, as described with respect to Figures 12 and 13, so that microprocessor 100 operates properly.

Although the present invention and its objects, features, and advantages have been described in detail, other embodiments are encompassed by the invention. For example, although the write queue has been described with respect to a single-ported BTAC, false misses may also occur with a multi-ported BTAC in some microprocessor configurations, albeit less frequently. Therefore, a write queue may also be used to reduce the false miss rate of a multi-ported BTAC. In addition, situations other than those described here in which the BTAC is not being read may exist in some microprocessors, and during these situations write requests in the write queue may be written to the BTAC.

Furthermore, although the present invention and its objects, features, and advantages have been described in detail, other embodiments are encompassed by the invention. In addition to implementations using hardware, the invention may be embodied in computer readable code (e.g., computer readable program code, data, etc.) contained in a computer usable (e.g., readable) medium. The computer code enables the functions or the fabrication (or both) of the invention disclosed herein. For example, this can be accomplished through the use of general programming languages (e.g., C, C++, JAVA, and the like); GDSII databases; hardware description languages (HDL) including Verilog HDL, VHDL, Altera HDL (AHDL), and so on; or other programming and/or circuit capture tools available in the art. The computer code can be placed in any known computer usable (e.g., readable) medium, including semiconductor memory, magnetic disk, and optical disc (e.g., CD-ROM, DVD-ROM, and the like), and can be embodied as a computer data signal in a computer usable (e.g., readable) transmission medium (e.g., a carrier wave or any other medium, including digital, optical, or analog-based media). As such, the computer code can be transmitted over communication networks, including the Internet and intranets. It is understood that the invention can be embodied in computer code (e.g., as part of an intellectual property (IP) core, such as a microcontroller core, or as a system-level design, such as a system on chip (SOC)) and transformed into hardware as part of the production of integrated circuits. The invention may also be embodied as a combination of hardware and computer code.

In summary, the foregoing describes only preferred embodiments of the present invention and does not limit the scope in which the invention may be practiced. All equivalent changes and modifications made in accordance with the claims of the present invention fall within the scope covered by this patent.

Claims

1. An apparatus for invalidating redundant entries stored for a same branch instruction in a set associative branch target address cache, characterized in that it comprises:

a status indicator, for indicating whether at least two ways in a set of the branch target address cache selected by an instruction cache fetch address contain a valid branch target address of a same branch instruction; and

control logic, coupled to the status indicator, for invalidating one of the at least two ways of the selected set when the status indicator indicates that at least two ways in the selected set contain a valid branch target address of the same branch instruction.

2. The apparatus of claim 1, characterized in that it further comprises:

a register, coupled to the control logic, for storing the instruction cache fetch address for use by the control logic in invalidating the one of the at least two ways of the selected set.

3. The apparatus of claim 1, characterized in that the selected set is selected by an index portion of the instruction cache fetch address.

4. The apparatus of claim 1, characterized in that the control logic clears the status indicator after invalidating the one of the at least two ways of the selected set.

5. The apparatus of claim 1, characterized in that it further comprises:

a register, coupled to the control logic, for storing data that specifies which one of the at least two ways of the selected set is to be invalidated, for use in invalidating the one of the at least two ways of the selected set.

6. The apparatus of claim 1, characterized in that it further comprises:

at least two valid signals, coupled to the control logic, each valid signal indicating whether a corresponding one of the at least two ways of the selected set contains a valid branch target address.

7. The apparatus of claim 1, characterized in that it further comprises:

at least two match signals, coupled to the control logic, each match signal indicating whether a tag stored in a corresponding one of the at least two ways of the selected set matches a tag portion of the instruction cache fetch address.

8. The apparatus of claim 7, characterized in that it further comprises:

at least two comparators, coupled to the control logic, each comparator for comparing the tag portion of the instruction cache fetch address with the tag stored in the corresponding one of the at least two ways of the selected set and, in response to the comparison, generating a corresponding one of the at least two match signals.

9. An apparatus for invalidating redundant entries stored for a same branch instruction in a branch target address cache, characterized in that it comprises:

detection logic, for detecting whether more than one valid way among a plurality of ways of a selected set of the branch target address cache stores a target address of a same branch instruction; and

invalidation logic, coupled to the detection logic, for retaining one of the more than one valid ways of the selected set and invalidating the others.

10. The apparatus of claim 9, characterized in that it further comprises:

a register, coupled to the invalidation logic, for storing an instruction cache fetch address; wherein the selected set is selected by an index portion of the instruction cache fetch address, and wherein the invalidation logic uses the instruction cache fetch address stored in the register to retain the one of the more than one valid ways of the selected set and to invalidate the others.

11. The apparatus of claim 10, characterized in that the detection logic is configured to receive, for each of the ways of the selected set, a corresponding match signal that indicates whether an address tag of the corresponding way matches an address tag portion of the instruction cache fetch address.

12. The apparatus of claim 11, characterized in that the detection logic is further configured to receive, for each of the ways of the selected set, a corresponding valid indication signal that indicates whether the corresponding way is valid.

13. The apparatus of claim 12, characterized in that it comprises:

a flag, coupled to the detection logic, for indicating whether more than one valid way among the plurality of ways of the selected set of the branch target address cache stores a target address of a same branch instruction, wherein the detection logic sets the flag to a true value if the match signal and the valid indication signal are both true for more than one of the ways.

14. The apparatus of claim 13, characterized in that, if the flag is true, the invalidation logic performs the invalidation on the more than one valid ways of the selected set.

15. A pipelined microprocessor, characterized in that it comprises:

an instruction cache, having an address input for receiving an address that selects a cache line containing a branch instruction;

a branch target address cache, coupled to the instruction cache, for generating a plurality of indicator signals in response to the address, each of the indicator signals indicating whether a corresponding way of a set of the branch target address cache selected by the address stores an actual target address of the branch instruction; and

logic, coupled to the branch target address cache, configured to invalidate one or more of the ways of the selected set when the indicator signals indicate that two or more of the ways store an actual target address of the branch instruction.

16. The microprocessor of claim 15, characterized in that it further comprises:

a first pipeline stage, in which the branch target address cache indicates a miss of the address and specifies one of the ways for storing the target address; and

a second pipeline stage, following the first pipeline stage, for requesting the branch target address cache to write a resolved target address of the branch instruction into the one of the ways specified in the first pipeline stage.

17. The microprocessor of claim 16, characterized in that the first and second pipeline stages are separated by at least three pipeline stages.

18. The microprocessor of claim 16, characterized in that a subsequent fetch of the branch instruction from the instruction cache may reach the first pipeline stage before a previous fetch of the branch instruction reaches the second pipeline stage, so that the selected set of the branch target address cache stores an actual target address of the branch instruction in two or more of the ways.

19. A method for invalidating redundant entries stored for a same branch instruction in a set associative branch target address cache, characterized in that it comprises:

determining whether, in a set of the branch target address cache selected by an index portion of an instruction cache fetch address, more than one way has a tag that matches a tag portion of the instruction cache fetch address and is valid; and

if more than one way of the selected set is valid and matches, retaining one of those ways of the selected set and invalidating the others.

20. The method of claim 19, characterized in that it further comprises:

in response to the determining, storing an indication that, in the set of the branch target address cache selected by the index portion of the instruction cache fetch address, more than one way has a tag that matches the tag portion of the instruction cache fetch address and is valid.

21. The method of claim 19, characterized in that it further comprises:

in response to the determining, storing the instruction cache fetch address.

22. The method of claim 19, characterized in that it further comprises:

in response to the determining, storing an indication that, except for one way, the other ways of the selected set are to be invalidated.

23. A method for invalidating redundant entries stored for a same branch instruction in a same set of an N-way set associative branch target address cache, characterized in that it comprises:

selecting an N-way set of the branch target address cache according to a lower portion of an instruction fetch address;

comparing N address tags of N corresponding ways of the N-way set with an upper portion of the instruction fetch address;

determining whether two or more of the N address tags match the upper portion and are valid; and

if two or more of the N address tags match the upper portion and are valid, invalidating one or more of the N ways corresponding to the two or more valid address tags that match the upper portion.

24. A computer data signal embodied in a transmission medium, characterized in that it comprises:

computer readable program code for providing a pipelined microprocessor, the program code comprising:

first program code for providing an instruction cache having an address input for receiving an address that selects a cache line containing a branch instruction;

second program code for providing a branch target address cache, coupled to the instruction cache, for generating a plurality of indicator signals in response to the address, each of the indicator signals indicating whether a corresponding way of a set of the branch target address cache selected by the address stores an actual target address of the branch instruction; and

third program code for providing logic, coupled to the branch target address cache, configured to invalidate one or more of the ways of the selected set when the indicator signals indicate that two or more of the ways store an actual target address of the branch instruction.