CN101211257A

CN101211257A - Method and processor for solving access dependence based on local associative lookup

Info

Publication number: CN101211257A
Application number: CNA2006101715219A
Authority: CN
Inventors: 龙国平; 范东睿; 袁楠; 张浩
Original assignee: Institute of Computing Technology of CAS
Current assignee: G Cloud Technology Co Ltd
Priority date: 2006-12-30
Filing date: 2006-12-30
Publication date: 2008-07-02
Anticipated expiration: 2026-12-30
Also published as: CN100545806C

Abstract

The invention relates to a new method for resolving memory access dependences based on partial associative lookup. The method comprises two mechanisms. The first is a partial associative lookup mechanism: when a load instruction enters the memory access queue, only a subset of the entries ahead of it needs to be searched to determine whether the latest value can be obtained from one of the store instructions found; similarly, when a store instruction enters the memory access queue, only a subset of the entries behind it needs to be searched to determine whether a load instruction exists that was executed early and has already written back. The second is a memory dependence predictor: when a load instruction is renamed, the predictor is indexed to obtain a memory access distance, and if that distance is valid, the issue module must ensure that the earlier store instruction corresponding to the distance has finished executing before it issues the load instruction.

Description

Method and processor for resolving memory access dependences based on partial (local) associative lookup

Technical field

The present invention relates to processor architecture, and more specifically to a method, and a processor using it, for resolving address dependences between memory access operations in a processor.

Background art

All methods for resolving memory address dependences are built on a particular processor microarchitecture. Fig. 1 is a block diagram of a modern superscalar microprocessor.

Most current superscalar processors adopt a basic structure similar to Fig. 1. In addition, most existing microprocessors implement a memory access queue in the memory access unit to preserve program order among memory operations that execute out of order. To resolve address dependences between memory operations, every such design performs a fully associative lookup over the memory access queue.

The fully associative lookup works as follows. When a load instruction (load) is issued into the memory access queue, it must associatively search all store instructions (store). If a store ahead of the load has an address dependence with it, then all or part of the value the load needs resides in stores earlier in the queue, and after the lookup the value of the address-matching store must be passed to the newly entered load. Because current processors support stores of 8, 16, 32 and even 64 bits, a single load may need data forwarded (Forward) from several earlier stores at once, which further complicates control of the memory access queue. Stores are handled similarly: when a store enters the memory access queue, it must associatively search all load instructions in the queue. If a later load is found that has an address dependence with the store and has already written back early, that load must be marked with an exception flag, and all instructions from that load onward must be flushed.
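The multi-store forwarding case described above can be sketched in Python. This is only an illustration under stated assumptions, not the patent's hardware: the names `Store` and `forward_bytes` are hypothetical, stores are modelled as byte strings, and the whole list of older stores is scanned, which corresponds to the conventional fully associative lookup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Store:
    addr: int    # byte address of the first byte written
    data: bytes  # bytes written (1, 2, 4 or 8 bytes for 8- to 64-bit stores)


def forward_bytes(load_addr: int, load_size: int,
                  older_stores: List[Store]) -> List[Optional[int]]:
    """Return one entry per byte of the load: the forwarded byte value, or
    None when no older store covers that byte (it must come from the cache).
    older_stores is in program order, oldest first (an assumption)."""
    result: List[Optional[int]] = [None] * load_size
    # Scan every older store; a younger store overwrites an older one, so each
    # byte ends up with the value of the youngest store that covers it.
    for st in older_stores:
        for i, byte in enumerate(st.data):
            offset = st.addr + i - load_addr
            if 0 <= offset < load_size:
                result[offset] = byte
    return result
```

In this sketch a 32-bit load can collect three bytes from one store and a fourth from a later 8-bit store, which is the situation that makes queue control more complicated.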

U.S. Pat. No. 6,108,770 discloses a method for scheduling instruction execution in a computer processor. The method fetches instructions from an instruction memory and executes the fetched instructions out of program order. When a load/store ordering conflict is detected, the result of the load and of the instructions that depend on it is discarded, and those instructions are re-executed. The load is then associated with the stores that produce the data it depends on; the set of all such stores is called a store set. On subsequent issues of the load, its execution is delayed until all stores in its store set have issued. Two loads can share one store set; when a load belonging to one store set is found to depend on a store of another store set, the two store sets are merged. A preferred embodiment in US6108770 uses two tables. The first is a store set ID table (SSIT), indexed by part of, or a hash of, the PC (program counter) of an instruction; an SSIT entry supplies the store set ID used to index the second table. For each store set, the second table holds a pointer to the most recently fetched store that has not yet executed.

U.S. Pat. No. 5,999,727 discloses another method for resolving memory access dependences: the history of memory dependences is recorded in the instruction cache (Icache) together with the load and store (Store) instructions, for reference during instruction scheduling.

As shown above, existing methods for resolving address dependences between memory operations not only require a fully associative lookup for every memory operation, but the latency of that lookup also degrades rapidly as the queue grows longer. A fully associative lookup over the memory access queue also incurs very high dynamic power consumption.

Summary of the invention

The object of the present invention is to provide a method and a processor that resolve address dependences between memory operations, reduce processor power consumption without sacrificing performance, and at the same time give the memory access queue a degree of scalability.

To achieve this object, the invention provides a processor comprising:

A fetch and decode unit, which obtains the instruction stream from memory, decodes it, and sends it to the register renaming unit;

A register renaming unit, which resolves the write-after-write (WAW) and write-after-read (WAR) dependences between instructions or microcode; every instruction or microcode is sent to the issue unit after register renaming. The register renaming unit further comprises a memory dependence prediction table (MDP); every load (LOAD) instruction queries this table to find whether it contains a matching entry;

An issue unit, which tracks the operands of all instructions or microcode; as soon as the required operands of an instruction or microcode are ready, it issues it to the back-end execution units for execution;

Back-end execution units, comprising several fixed-point arithmetic logic units (ALUs), floating-point arithmetic logic units (ALUs), and several memory access units, each memory access unit having a memory access queue that maintains the program order of all memory operations;

An instruction reorder queue (ROQ), which maintains the program order of all instructions or microcode in the processor pipeline; once an instruction or microcode completes, it is removed from the instruction reorder queue.

Another aspect of the invention provides a method for resolving memory access dependences in a processor, the processor comprising: a fetch and decode unit; a register renaming unit that further comprises a memory dependence prediction table (MDP); an issue unit; back-end execution units comprising several fixed-point arithmetic logic units (ALUs), floating-point arithmetic logic units (ALUs), and several memory access units; and an instruction reorder queue (ROQ). The method comprises the following steps:

1) receiving a memory access instruction from the fetch and decode unit;

2) determining whether the memory access instruction is a load instruction or a store instruction;

3) if it is a load instruction, querying the memory dependence prediction table to determine whether an entry matches the program counter address of the load;

4) if the memory dependence prediction table contains an entry matching the program counter address of the load, the issue unit holds back issue of the load to the memory access unit until the store operation that has an address dependence with it has finished and written its value to the data cache (DCACHE);

5) if no matching entry is found in the memory dependence prediction table, the issue unit issues the load to the memory access unit;

6) searching forward through the memory operations in the memory access queue to determine whether the value to be loaded can be obtained from one of the stores found.

Brief description of the drawings

The above objects, advantages and features of the invention will become apparent from the following detailed description of the preferred embodiments taken in conjunction with the drawings, in which:

Fig. 1 shows the basic block diagram of a prior-art microprocessor;

Fig. 2 shows the prior-art process of performing a fully associative lookup on the memory access queue;

Fig. 3 shows the basic block diagram of a microprocessor implementing the invention;

Fig. 4 shows the basic structure of the memory dependence predictor of the invention;

Fig. 5 shows the basic control flow of the issue control logic of the invention;

Fig. 6 shows the process of resolving memory access dependences based on partial associative lookup according to the invention;

Fig. 7 shows a block diagram of the memory access unit in the preferred embodiment of the invention.

Detailed description of the embodiments

The preferred embodiments of the present invention are described below with reference to the accompanying drawings.

Fig. 1 is a block diagram of a modern superscalar microprocessor. In a complex instruction set (CISC) processor, the basic unit handled by each functional unit is the microcode; in a reduced instruction set (RISC) processor, it is the instruction. The invention applies to both CISC and RISC processors, so to avoid needless repetition, unless otherwise stated, this specification and the claims use "instruction" to denote the instruction or microcode that is the basic unit processed by each functional unit. The parts of the figure are briefly described below:

Fetch and decode (101): obtains the instruction stream from memory, decodes it, and sends it to the register renaming module. To improve performance, an instruction cache (CACHE) and an instruction TLB are usually placed here as well; because they have little bearing on the invention, they are not shown in the figure.

Register renaming (102): resolves mainly the write-after-write (WAW) and write-after-read (WAR) dependences between instructions or microcode through register renaming. After renaming, every instruction or microcode is sent both to the issue module and to the instruction reorder queue (ROQ), which maintains the program order of all in-flight instructions or microcode.

Instruction issue unit (103): tracks the operands of all instructions or microcode; as soon as the required operands of an instruction or microcode are ready, it issues it to the back-end execution units. The issue unit learns that a register value is ready by snooping the result bus, and then issues to the functional units the instructions or microcode that were stalled waiting for that register.

The back-end execution units comprise several fixed-point ALUs and floating-point ALUs, and in particular several memory access units. Existing processors usually place a memory access queue in each memory access unit to maintain the program order (Program Order) of all memory operations. As soon as an instruction or microcode finishes in a functional unit, its result is written back to the result bus.

Instruction reorder queue (Reorder Queue, ROQ): maintains the program order of all instructions or microcode in the processor pipeline. Once an instruction or microcode has executed correctly, the ROQ notifies the register renaming logic to update the rename table and at the same time removes the completed instruction or microcode from the ROQ.

Fig. 2 illustrates the fully associative lookup performed on the memory access queue to resolve memory access dependences. The example is a load that fully associatively searches the queue of stores to find the nearest store with an address dependence. The addresses of all stores are held in the address content-addressable memory (Address CAM) (201). To find the nearest address-dependent store, the load address is first compared with the address of every store in the Address CAM (201), yielding a bit vector that marks which entries have an address dependence. The bit vector is then fed to the priority encoder (Priority Encoder) (202), which selects the nearest address-dependent store. When the queue has many entries, the bit vector produced by the address comparison in the Address CAM (201) is correspondingly long, so the priority encoder (202) has a large delay. Worse, the delay of the priority encoder (202) grows with the number of queue entries, which limits the scalability of the processor. A store that searches the queue of loads to find the nearest address-dependent load that executed early must go through the same fully associative lookup shown in Fig. 2.
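The two stages of the Fig. 2 lookup can be modelled in software as follows. This is a behavioural sketch only: the function names `cam_match` and `priority_encode` are hypothetical, and the ordering convention (index 0 is the oldest store) is an assumption rather than something the text specifies.

```python
from typing import List, Optional


def cam_match(store_addrs: List[Optional[int]], load_addr: int) -> List[bool]:
    """Stage 1: compare the load address with every store address at once,
    as the address CAM does, and return the match bit vector.
    A None entry models an empty queue slot."""
    return [a is not None and a == load_addr for a in store_addrs]


def priority_encode(match: List[bool]) -> Optional[int]:
    """Stage 2: pick the youngest matching store, i.e. the one nearest to the
    load. Assumption: index 0 is the oldest entry, the last index the youngest.
    The scan covers the whole vector, so its cost grows with queue length."""
    for i in range(len(match) - 1, -1, -1):
        if match[i]:
            return i
    return None
```

The sequential loop in `priority_encode` stands in for hardware whose delay grows with the vector length, which is the scaling problem the paragraph above points out.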

Fig. 3 shows the basic block diagram of a microprocessor implementing the invention. Compared with Fig. 1, the key differences lie in register renaming, the instruction/microcode issue unit, and the memory access unit. The register renaming module (302) implements the memory dependence predictor (MDP) of the invention; the instruction/microcode issue module (303) implements the issue control logic of the invention; and the memory access unit (305) implements the invention's approach of resolving memory access dependences through partial associative lookup.

Fig. 4 shows the basic structure of the memory dependence predictor (Memory Dependence Predictor, MDP). It is a fully associative table in which each entry has two basic fields: the load PC and the memory distance (Memory Distance), also referred to below as the memory access distance. Only load instructions/microcode can occupy an entry. The load PC is the program counter value of the load instruction/microcode; in this patent, the memory distance is the number of dynamic memory operations between a load and a store in the processor that have an address dependence. Whenever a load instruction/microcode passes through the register renaming module, its PC is used to search the MDP table fully associatively. A matching entry indicates that an address-dependent store precedes the load. The issue queue must then hold back the load until the earlier store identified by the indexed memory access distance has finished, that is, until the value to be stored has been written to the data cache (DCACHE).
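A small software model of such a table might look like the sketch below. The class name `MemoryDependencePredictor`, its methods, the default of 8 entries, and the first-in-first-out replacement policy are all assumptions made for illustration; the text does not say how an entry is chosen for eviction.

```python
from collections import OrderedDict
from typing import Optional


class MemoryDependencePredictor:
    """Fully associative table mapping a load PC to a memory distance."""

    def __init__(self, entries: int = 8) -> None:
        self.entries = entries
        self.table: "OrderedDict[int, int]" = OrderedDict()

    def lookup(self, load_pc: int) -> Optional[int]:
        """Queried at register renaming; None means no valid distance."""
        return self.table.get(load_pc)

    def allocate(self, load_pc: int, distance: int) -> None:
        """Record a (load PC, memory distance) pair in a new entry."""
        if load_pc in self.table:
            self.table.pop(load_pc)          # refresh an existing entry
        elif len(self.table) >= self.entries:
            self.table.popitem(last=False)   # assumed policy: evict the oldest
        self.table[load_pc] = distance
```

A load whose PC hits in the table would be held in the issue queue until the store at the recorded distance has written the data cache; a miss lets the load issue as soon as its address is ready.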

In this patent application, because the MDP table is searched fully associatively, it cannot be very large. Experiments show that for a memory access queue of length 16, an MDP of about 8 entries is sufficient. The number of MDP entries in an actual implementation depends on the circumstances, but an MDP of any size falls within the spirit and scope of the invention.

Fig. 5 shows the basic control flow of the issue control logic. The logic first checks whether the instruction/microcode to be issued is a memory access instruction/microcode (502). If it is not, it waits until the required operands are ready (509) and is then issued to an ALU (510) for execution; the ALUs here include both fixed-point and floating-point arithmetic units. If it is a memory access instruction/microcode, loads and stores are handled separately. For a load (503), the logic first checks whether the earlier MDP query made at register renaming found a matching entry, that is, whether the memory access distance obtained is valid (504). If no entry was found, the distance is invalid, and the load is issued to the memory access unit (513) once its address is ready (508). If the distance is valid, an address-dependent store precedes the load; the load must then wait until that store has finished and written its value to the data cache (DCACHE) (507), after which it is issued to the memory access unit (513).

Issue control for stores in Fig. 5 is different. The logic first checks whether the address is ready (505); if not, the store waits in the issue queue. If the address is ready, the logic then checks whether the value is ready (506). If the value is not ready, a first-phase issue (511) is performed in advance: the store is issued to the memory access queue so that memory access dependences can be resolved. Once the value of the store is ready, the second-phase issue (512) delivers the value to the memory access queue. Note that if the value is already ready when the address becomes ready, two phases are unnecessary, and a single issue delivers both address and value to the memory access queue. In addition, although some stores are issued in two phases, each is treated as a single store in both the issue queue and the memory access queue.
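The decision flow of Fig. 5 can be condensed into a single function, sketched below. The `Action` names, the `issue_decision` signature and the `phase1_done` flag are hypothetical. The sketch also assumes that a load with a valid MDP distance still needs its own address ready before it issues, which the text does not state explicitly.

```python
from enum import Enum, auto
from typing import Optional


class Action(Enum):
    WAIT = auto()
    ISSUE_TO_ALU = auto()        # (510)
    ISSUE_LOAD = auto()          # (513)
    ISSUE_STORE_PHASE1 = auto()  # address only (511)
    ISSUE_STORE_PHASE2 = auto()  # value only (512)
    ISSUE_STORE_SINGLE = auto()  # address and value in one issue


def issue_decision(kind: str, *, operands_ready: bool = False,
                   addr_ready: bool = False, value_ready: bool = False,
                   mdp_distance: Optional[int] = None,
                   dependent_store_done: bool = False,
                   phase1_done: bool = False) -> Action:
    if kind == "alu":                      # not a memory access (502)
        return Action.ISSUE_TO_ALU if operands_ready else Action.WAIT
    if kind == "load":                     # (503)
        # A valid distance (504) means an earlier dependent store must first
        # finish and write the data cache (507).
        if mdp_distance is not None and not dependent_store_done:
            return Action.WAIT
        return Action.ISSUE_LOAD if addr_ready else Action.WAIT  # (508)
    if kind == "store":
        if not addr_ready:                 # (505)
            return Action.WAIT
        if phase1_done:                    # address already in the queue
            return Action.ISSUE_STORE_PHASE2 if value_ready else Action.WAIT
        if value_ready:                    # (506): both ready, single issue
            return Action.ISSUE_STORE_SINGLE
        return Action.ISSUE_STORE_PHASE1
    raise ValueError(f"unknown instruction kind: {kind}")
```

Issuing the address early in phase one lets younger loads check for a dependence on the store before its data arrives, which is the stated reason for splitting the issue.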

Fig. 6 illustrates resolving memory access dependences through partial associative lookup, which is the improvement made by the invention. Taking again a load that searches the stores as the example: unlike the conventional implementation in Fig. 2, which searches the whole queue, only a few entries (L) ahead of the load are searched, and the qualifying store is selected from among them. Analysis shows that even when only these few entries are searched, more than 99% of loads still obtain valid data (a load that needs no forwarding (Forward) simply reads from the cache (Cache), which is also counted as a correct lookup here). Because L is generally much smaller than the queue length N and does not grow with N, the lookup latency does not increase as N grows.

Likewise, when a store enters the memory access queue, it must search backward to see whether any load has already written back early. Existing implementations again search the entire load queue (Fig. 2). As analysed above, this is unnecessary: it suffices to check whether any of the few adjacent loads has written back early (Fig. 6).
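Both bounded searches can be sketched over a single program-ordered queue. This is an illustrative model, not the patent's circuit: `MemOp`, `find_forwarding_store` and `find_violating_load` are hypothetical names, index 0 is assumed to be the oldest entry, and addresses are compared for exact equality only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MemOp:
    is_store: bool
    addr: int
    written_back: bool = False  # loads only: executed early and written back


def find_forwarding_store(queue: List[MemOp], load_idx: int, L: int) -> Optional[int]:
    """Load side: search at most L entries ahead of (older than) the load and
    return the index of the nearest store with the same address, if any."""
    oldest = max(0, load_idx - L)
    for i in range(load_idx - 1, oldest - 1, -1):
        if queue[i].is_store and queue[i].addr == queue[load_idx].addr:
            return i
    return None


def find_violating_load(queue: List[MemOp], store_idx: int, L: int) -> Optional[int]:
    """Store side: search at most L entries behind (younger than) the store
    and return the oldest load with the same address that has already
    written back, if any."""
    limit = min(len(queue), store_idx + 1 + L)
    for i in range(store_idx + 1, limit):
        op = queue[i]
        if not op.is_store and op.written_back and op.addr == queue[store_idx].addr:
            return i
    return None
```

The loop bounds depend only on L, never on the queue length, which mirrors the argument that lookup latency stays flat as N grows.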

Experiments show that for a memory access queue of length 16 and L = 8, more than 99% of loads obtain correct data for most programs; at the same time, 100% of stores can determine whether a load has executed or written back early by searching fewer than 8 memory operations. The value of L in an actual implementation depends on the circumstances, but any value of L falls within the spirit and scope of the invention.

Although the vast majority of forwarding (Forwarding) occurs between loads and stores separated by a small memory access distance, a minority of loads still need data forwarded from stores at a large memory access distance. Note that forwarding (Forwarding or Forward) in this invention means the process by which a load instruction/microcode, when it executes, obtains the latest data from a store that precedes it in the memory access queue. Loads that need forwarding but whose memory access distance is large (> 8) are called mis-forwarded loads (mis-forward loads). Close analysis of these mis-forwarded loads shows that they are highly predictable. The invention therefore provides the memory dependence prediction table (MDP) to record the history of mis-forwarded loads.

When the ROQ notifies the memory access unit that a store has completed, the address of that store is used to search the memory access queue. If an address-dependent load is found, the memory access distance between the store (Store) and the load is greater than L, and the load did not obtain the latest value from a store at memory access distance <= L, then the load has read a wrong value from the data cache. Because the load may already have written back, and later instructions/microcode may have used its wrong result, the processor pipeline is flushed at this point to avoid incorrect execution. At the same time a new entry is allocated in the MDP, and the PC of the faulting load together with the memory access distance between the offending store and the faulting load is recorded in the new entry.
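The completion-time check can be sketched as below. Several details are assumptions made for illustration: the names `QueueEntry` and `on_store_commit`, the use of a plain dictionary for the MDP, the `executed` flag (a load that has not executed cannot have read a stale value), and the choice to measure memory access distance as the difference between queue positions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class QueueEntry:
    pc: int
    is_store: bool
    addr: int
    executed: bool = False          # loads only
    got_near_forward: bool = False  # load got its value from a store within L


def on_store_commit(queue: List[QueueEntry], store_idx: int, L: int,
                    mdp: Dict[int, int]) -> Optional[int]:
    """Search the queue with the completed store's address. Return the index
    of the first mis-forwarded load (the pipeline is flushed from there),
    or None if no load read a stale value."""
    store = queue[store_idx]
    for i in range(store_idx + 1, len(queue)):
        entry = queue[i]
        distance = i - store_idx  # assumed definition of memory access distance
        if (not entry.is_store and entry.executed
                and entry.addr == store.addr
                and distance > L and not entry.got_near_forward):
            mdp[entry.pc] = distance  # new MDP entry: load PC and distance
            return i
    return None
```

After the flush, the next time the same load PC is renamed it hits in the MDP and is held until the distant store has written the data cache, so the same mistake is not repeated.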

Fig. 7 shows a block diagram of the memory access unit in the preferred embodiment of the invention. When a memory operation is issued from the issue queue (701), it is first sent to MemAddr (702), which computes the memory address. After its address is computed, each memory operation enters the LD/ST queue (703); this memory access queue is the control center of the memory access unit and maintains the order among all memory operations. In the preferred embodiment of Fig. 7, the Dcache access (704) and the tag comparison (706) take two cycles, and the DTLB is queried in parallel with the Dcache access to translate the virtual address to a physical address (705). If the tag comparison (TAGCMP) finds a cache hit, the result is written back directly to the ROQ (707) over the mmres bus; otherwise, the cache interface (Cache Interface) (708) is notified over the memread bus and sends a read request to the second-level cache (not shown in the figure). When an X86 instruction that accesses memory commits (Commit), the ROQ notifies the LD/ST queue over Cmtbus to remove the corresponding microcode or instruction from the queue.

Although the invention has been shown above in connection with its preferred embodiments, those skilled in the art will appreciate that various modifications, substitutions and changes may be made without departing from the spirit and scope of the invention. The invention should therefore not be limited by the above embodiments, but by the claims and their equivalents.

Claims

1. A processor, comprising:

an instruction reorder queue (ROQ), which maintains the program order of all instructions or microcode in the processor pipeline; once an instruction or microcode completes, it is removed from the instruction reorder queue.

2. The processor of claim 1, wherein each entry of said memory dependence prediction table comprises at least two fields: the program counter (PC) value of a load instruction and a memory distance, said memory distance being the number of dynamic memory operations between a load instruction and a store instruction in the processor that have an address dependence.

3. The processor of claim 2, wherein if a matching entry is found in the memory dependence prediction table, the issue unit holds back issue of the load instruction to the memory access unit until the store instruction that has an address dependence with it has finished and written its value to the data cache (DCACHE).

4. The processor of claim 2, wherein if no matching entry is found in the memory dependence prediction table, the issue unit issues the load instruction to the memory access unit.

5. The processor of claim 1, wherein if said memory access instruction is a store instruction, the register renaming unit does not search said memory dependence prediction table and sends the store instruction directly to the issue unit.

6. The processor of claim 5, wherein if both the address and the value of the store instruction are ready, the issue unit issues the store instruction directly to the memory access unit.

7. The processor of claim 5, wherein if the address of the store instruction is not ready, the issue unit keeps said store instruction waiting in the issue queue.

8. The processor of claim 5, wherein if the address of the store instruction is ready but its value is not, the issue unit performs a first-phase issue, issuing the store instruction to the memory access unit.

9. The processor of claim 8, wherein if the value of the store instruction is ready, the issue unit performs a second-phase issue, issuing the value of the store instruction to the memory access unit.

10. The processor of any one of claims 1-9, wherein the number of entries L of the memory access queue that said memory access unit searches forward when executing a load instruction is less than the length N of said memory access queue.

11. The processor of any one of claims 1-9, wherein the number of entries L of the memory access queue that said memory access unit searches backward when executing a store instruction is less than the length N of said memory access queue.

12. The processor of any one of claims 10-11, wherein when a store instruction completes, the instruction reorder queue notifies the memory access unit to search the memory access queue with the address of that store instruction.

13. The processor of claim 12, wherein if the memory access unit finds in its memory access queue a load instruction that has an address dependence with the store instruction, the memory access distance between the store instruction and the load instruction is greater than L, and the load instruction did not obtain the latest value from a store instruction at memory access distance <= L, then the processor flushes its pipeline and at the same time allocates a new entry in the memory dependence prediction table, recording in the new entry the program counter value of said load instruction and the memory access distance between said store instruction and said load instruction.

14. A method for resolving memory access dependences in a processor, said processor comprising: a fetch and decode unit; a register renaming unit that further comprises a memory dependence prediction table (MDP); an issue unit; back-end execution units comprising several fixed-point arithmetic logic units (ALUs), floating-point arithmetic logic units (ALUs), and several memory access units; and an instruction reorder queue (ROQ); said method comprising the following steps:

1) receiving a memory access instruction from the fetch and decode unit;

3) if it is a load instruction, querying the memory dependence prediction table to determine whether an entry matches the program counter value of the load instruction;

4) if said memory dependence prediction table contains an entry matching the program counter value of the load instruction, the issue unit holds back issue of the load instruction to the memory access unit until the store operation that has an address dependence with it has finished and written its value to the data cache (DCACHE);

15. The method of claim 14, wherein if step 2) determines that said memory access instruction is a store instruction, said method further comprises the step of:

7) the register renaming unit does not search said memory dependence prediction table and sends the store instruction directly to the issue unit.

16. The method of claim 15, further comprising the step of:

8) if both the address and the value of the store instruction are ready, the issue unit issues the store instruction directly to the memory access unit.

17. The method of claim 15, further comprising the step of:

9) if the address of the store instruction is not ready, the issue unit keeps said store instruction waiting in the issue queue.

18. The method of claim 15, further comprising the step of:

10) if the address of the store instruction is ready but its value is not, the issue unit performs a first-phase issue, issuing the store instruction to the memory access unit.

19. The method of claim 15, further comprising the step of:

11) if the value of the store instruction is ready, the issue unit performs a second-phase issue, issuing the value of the store instruction to the memory access unit.

20. The method of any one of claims 14-19, wherein the number of entries L of the memory access queue that said memory access unit searches forward when executing a load instruction is less than the length N of said memory access queue.

21. The method of any one of claims 14-19, wherein the number of entries L of the memory access queue that said memory access unit searches backward when executing a store instruction is less than the length N of said memory access queue.

22. The method of any one of claims 16-21, further comprising the step of:

12) when a store instruction completes, the instruction reorder queue notifies the memory access unit to search the memory access queue with the address of that store instruction.

23. The method of claim 22, further comprising the step of:

13) if the memory access unit finds in its memory access queue a load instruction that has an address dependence with the store instruction, the memory access distance between the store instruction and the load instruction is greater than L, and the load instruction did not obtain the latest value from a store instruction at memory access distance <= L, then the processor flushes its pipeline and at the same time allocates a new entry in the memory dependence prediction table, recording in the new entry the program counter value of said load instruction and the memory access distance between said store instruction and said load instruction.

24. A computer apparatus employing the processor of any one of claims 1-15.