CN104615548B - Data prefetching method and microprocessor - Google Patents

Data prefetching method and microprocessor

Info

Publication number
CN104615548B
CN104615548B (application CN201510101351.6A)
Authority
CN
China
Prior art keywords
mentioned
cache
address
pattern
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510101351.6A
Other languages
Chinese (zh)
Other versions
CN104615548A (en)
Inventor
Rodney E. Hooker
John M. Greer
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Via Technologies Inc
Original Assignee
Via Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/033,765 external-priority patent/US8762649B2/en
Priority claimed from US13/033,809 external-priority patent/US8645631B2/en
Priority claimed from US13/033,848 external-priority patent/US8719510B2/en
Application filed by Via Technologies Inc
Publication of CN104615548A
Application granted
Publication of CN104615548B
Legal status: Active
Anticipated expiration

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A data prefetching method and a microprocessor are disclosed. The microprocessor includes a first-level cache memory, a second-level cache memory, and a prefetch unit. The prefetch unit detects a direction and a pattern of recent access requests appearing in the second-level cache and, according to the direction and pattern, prefetches multiple cache lines into the second-level cache. It receives from the first-level cache an address of an access request received by the first-level cache, where the address relates to a cache line; determines one or more cache lines indicated by the pattern beyond the related cache line in the detected direction; and causes the one or more cache lines to be prefetched into the first-level cache.

Description

Data prefetching method and microprocessor
This application is a divisional application of application No. 201110077108.7, filed on March 29, 2011, entitled "Prefetch Unit, Data Prefetching Method, and Microprocessor".
Technical field
The present invention relates generally to the cache memories of microprocessors, and more particularly to prefetching data into the cache memory of a microprocessor.
Background
In recent computer systems, when a cache miss occurs, the time required for the microprocessor to access system memory can be one or two orders of magnitude greater than the time required to access its cache memory. Therefore, to improve the cache hit rate, microprocessors incorporate prefetching techniques that examine recent data access patterns and attempt to predict which data the program will access next; the benefits of prefetching are well known.
However, the present applicant has observed that the access patterns of certain programs cannot be detected by the prefetch units of existing microprocessors. For example, Fig. 1 shows the access pattern of a second-level cache (L2 cache) while executing a program that performs a sequence of store operations through memory; the figure plots the memory address of each access over time. As shown in Fig. 1, although the general trend is for the memory addresses to increase with time, i.e., upward, in many cases a given access is to a lower memory address than an earlier access rather than following the general upward trend, which differs from what existing prefetch units actually predict.
Although over a relatively large number of samples the general trend is to advance in one direction, there are two reasons an existing prefetch unit is likely to become confused when facing a small sample. The first reason is the way the program accesses memory, whether caused by the nature of its algorithms or by poor programming. The second reason is that the pipelines and queues of an out-of-order execution microprocessor core, even when functioning normally, often perform memory accesses in an order different from the program order in which they were generated.
Therefore, what is needed is a data prefetch unit that can effectively prefetch data for programs whose memory access instructions (operations) exhibit no clear trend when examined over a small time window, yet exhibit a clear trend when examined over a larger number of samples.
Summary of the invention
The present invention discloses a prefetch unit disposed in a microprocessor having a cache memory. The prefetch unit receives multiple access requests to multiple addresses within a memory block, each access request corresponding to one of the addresses of the memory block, where the addresses of the access requests increase or decrease non-monotonically as a function of time. The prefetch unit includes a storage device and control logic coupled to the storage device. As the access requests are received, the control logic maintains in the storage device a maximum address and a minimum address of the access requests together with counts of the changes to the maximum and minimum addresses, and maintains a history of the most recently accessed cache lines in the memory block, the recently accessed cache lines being associated with the addresses of the access requests. The control logic determines an access direction from the counts, determines an access pattern from the history, and prefetches into the cache memory, according to the access pattern and along the access direction, cache lines of the memory block not yet indicated by the history as having been accessed.
The present invention discloses a data prefetching method for prefetching data into a cache memory of a microprocessor. The data prefetching method includes receiving multiple access requests to multiple addresses within a memory block, each access request corresponding to one of the addresses of the memory block, where the addresses of the access requests increase or decrease non-monotonically as a function of time; as the access requests are received, maintaining a maximum address and a minimum address within the memory block and counting the changes to the maximum and minimum addresses; as the access requests are received, maintaining a history of the most recently accessed cache lines in the memory block, the recently accessed cache lines being associated with the addresses of the access requests; determining an access direction from the counts; determining an access pattern from the history; and prefetching into the cache memory, according to the access pattern and along the access direction, cache lines of the memory block not yet indicated by the history as having been accessed.
The present invention discloses a microprocessor including multiple cores, a cache memory, and a prefetch unit. The cache memory is shared by the cores and receives multiple access requests to multiple addresses within a memory block, each access request corresponding to one of the addresses of the memory block, where the addresses of the access requests increase or decrease non-monotonically as a function of time. The prefetch unit monitors the access requests; maintains a maximum address and a minimum address within the memory block together with counts of the changes to the maximum and minimum addresses; determines an access direction from the counts; and prefetches into the cache memory, along the access direction, cache lines of the memory block that are missing from the cache.
The present invention discloses a microprocessor including a first-level cache memory, a second-level cache memory, and a prefetch unit. The prefetch unit detects a direction and a pattern of recent access requests appearing in the second-level cache and prefetches multiple cache lines into the second-level cache according to the direction and pattern; receives from the first-level cache an address of an access request received by the first-level cache, where the address relates to a cache line; determines one or more cache lines indicated by the pattern beyond the related cache line in the detected direction; and causes the one or more cache lines to be prefetched into the first-level cache.
The present invention discloses a data prefetching method for prefetching data into a first-level cache memory of a microprocessor that has a second-level cache memory. The data prefetching method includes detecting a direction and a pattern of recent access requests appearing in the second-level cache and prefetching multiple cache lines into the second-level cache according to the direction and pattern; receiving from the first-level cache an address of an access request received by the first-level cache, where the address relates to a cache line; determining one or more cache lines indicated by the pattern beyond the related cache line in the detected direction; and causing the one or more cache lines to be prefetched into the first-level cache.
The present invention discloses a microprocessor including a cache memory and a prefetch unit. The prefetch unit detects a pattern among multiple memory access requests to a first memory block and prefetches multiple cache lines from the first memory block into the cache memory according to the pattern; monitors a new memory access request to a second memory block; determines whether the first memory block is virtually adjacent to the second memory block and whether, when the pattern is extended from the first memory block into the second memory block, the pattern predicts the cache line of the second memory block implicated by the new memory access request; and prefetches cache lines from the second memory block into the cache memory according to the pattern.
The present invention discloses a data prefetching method for prefetching data into a cache memory of a microprocessor. The data prefetching method includes detecting a pattern among multiple memory access requests to a first memory block and prefetching cache lines from the first memory block into the cache memory according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block and whether, when the pattern is extended from the first memory block into the second memory block, the pattern predicts the cache line of the second memory block implicated by the new memory access request; and prefetching multiple cache lines from the second memory block into the cache memory according to the pattern, in response to the determining step.
Description of the drawings
Fig. 1 shows the access pattern of a second-level cache memory while executing a program that includes a sequence of store operations through memory.
Fig. 2 is a block diagram of a microprocessor according to the present invention.
Fig. 3 is a more detailed block diagram of the prefetch unit of Fig. 2.
Fig. 4 is a flowchart of the operation of the microprocessor of Fig. 2, and in particular of the prefetch unit of Fig. 3.
Fig. 5 is a flowchart of the operation of the prefetch unit of Fig. 3 to perform a step of Fig. 4.
Fig. 6 is a flowchart of the operation of the prefetch unit of Fig. 3 to perform a step of Fig. 4.
Fig. 7 is a flowchart of the operation of the prefetch request queue of Fig. 3.
Figs. 8A and 8B plot the access points of two patterns within a memory block, illustrating the bounding-box prefetch unit of the present invention.
Fig. 9 is a block diagram of an example operation of the microprocessor of Fig. 2.
Fig. 10 is a block diagram of an example operation of the microprocessor of Fig. 2, continuing the example of Fig. 9.
Figs. 11A and 11B are block diagrams of example operations of the microprocessor of Fig. 2, continuing the example of Figs. 9 and 10.
Fig. 12 is a block diagram of a microprocessor according to another embodiment of the present invention.
Fig. 13 is a flowchart of the operation of the prefetch unit of Fig. 12.
Fig. 14 is a flowchart of the operation of the prefetch unit of Fig. 12 to perform a step of Fig. 13.
Fig. 15 is a block diagram of a microprocessor having a bounding-box prefetch unit according to another embodiment of the present invention.
Fig. 16 is a block diagram of the virtual hash table of Fig. 15.
Fig. 17 is a flowchart of the operation of the microprocessor of Fig. 15.
Fig. 18 shows the contents of the virtual hash table of Fig. 16 after operation of the prefetch unit according to the example described with respect to Fig. 17.
Figs. 19A and 19B are a flowchart of the operation of the prefetch unit of Fig. 15.
Fig. 20 is a block diagram of a hashed physical address to hashed virtual address thesaurus used by the prefetch unit of Fig. 15 according to another embodiment of the present invention.
Fig. 21 is a block diagram of a multi-core microprocessor according to the present invention.
Description of reference numerals
100~microprocessor
102~instruction cache
104~instruction decoder
106~register alias table
108~reservation station
112~execution units
132~other execution units
134~load/store unit
124~prefetch unit
114~retire unit
116~first-level data cache
118~second-level cache memory
122~bus interface unit
162~virtual hash table
198~queue
172~first-level data search pointer
178~first-level data pattern address
196~first-level data memory address
194~pattern-predicted cache line address
192~cache line allocation request
188~cache line data
354~memory block virtual hash address field
356~status field
302~block bit mask register
303~block number register
304~min pointer register
306~max pointer register
308~min_change counter
312~max_change counter
314~total counter
316~middle pointer register
318~period match counter
342~direction register
344~pattern register
346~pattern period register
348~pattern location register
352~search pointer register
332~hardware unit
322~control logic
328~prefetch request queue
324~pop pointer
326~push pointer
2002~hashed virtual address thesaurus
2102A~core A
2102B~core B
2104~highly reactive prefetch unit
2106~shared highly reactive prefetch unit
Detailed description
The making and using of various embodiments of the present invention are discussed in detail below. It should be noted, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed are merely illustrative of ways to make and use the invention and do not limit the scope of the invention.
Generally speaking, the solution to the problem described above may be explained as follows. If all accesses (loads, stores, requests) to a memory block are represented on a graph, the set of all accesses can be enclosed by a bounding box. If additional access requests are plotted on the same graph, these accesses can also be enclosed by a resized bounding box. Fig. 8A illustrates the first two accesses (loads or stores) to a memory block. The X-axis indicates the time order of the accesses; the Y-axis indicates the index of the accessed 64-byte cache line within the 4KB block. First, the initial two accesses are plotted: the first access is to cache line 5, and the second access is to cache line 6. A bounding box, as shown, encloses the two points representing these accesses.
Next, a third access occurs to cache line 7, and the bounding box grows so that the new point representing the third access is enclosed. As new accesses keep arriving, the bounding box must expand along the X-axis, and the upper edge of the bounding box also expands along the Y-axis (this is an upward example). The history of movements of the upper and lower edges of the bounding box is used to determine whether the trend of the access pattern is upward, downward, or neither.
In addition to tracking the trend of the upper and lower edges of the bounding box to determine a trend direction, it is also necessary to track the individual access requests, because access requests frequently skip one or two cache lines. Therefore, to avoid skipping the prefetch of such cache lines, once an upward or downward trend is detected, the prefetch unit uses additional criteria to decide which cache lines to prefetch. Because access requests may be transiently reordered, the prefetch unit discards the temporal ordering of the accesses from the access history. It does this by marking bits in a bit mask, where each bit corresponds to one cache line of a memory block; a set bit in the bit mask indicates that the corresponding cache line has been accessed. Once a sufficient number of access requests to the memory block have been made, the prefetch unit can use the bit mask (which carries no ordering information about the accesses) to consider the entire block from a broad, large view, rather than from the narrow, small view of existing prefetch units that operate only on the order in which accesses occur.
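To make the bit-mask idea concrete, the following sketch (a hypothetical software model in Python; the patent describes hardware registers, not code) marks accessed cache lines of a block in a single mask and shows that the mask is independent of the order in which the accesses arrive:

```python
LINES_PER_BLOCK = 64   # 4KB block / 64-byte cache lines, per the embodiment

def mark_accesses(line_indices):
    """Set one bit per accessed cache line; ordering information is discarded."""
    mask = 0
    for idx in line_indices:
        mask |= 1 << idx
    return mask

# Two transient orderings of the same accesses (e.g., reordered by an
# out-of-order core) produce the identical bit mask.
in_order = mark_accesses([5, 6, 7, 9, 10])
reordered = mark_accesses([7, 5, 10, 6, 9])
assert in_order == reordered
print(bin(in_order))  # bits 5, 6, 7, 9, 10 set
```

This order-invariance is exactly why the bit mask gives the "large view": the transient reordering visible in Fig. 1 disappears once accesses are recorded positionally rather than temporally.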
Fig. 2 shows a block diagram of a microprocessor 100 according to the present invention. The microprocessor 100 includes a pipeline of multiple stages, and the pipeline includes various functional units. The pipeline includes an instruction cache 102 coupled to an instruction decoder 104; the instruction decoder 104 is coupled to a register alias table (RAT) 106; the register alias table 106 is coupled to a reservation station 108; the reservation station 108 is coupled to execution units 112; and finally, the execution units 112 are coupled to a retire unit 114. The instruction decoder 104 may include an instruction translator that translates macroinstructions (e.g., of the x86 architecture) into microinstructions of the microinstruction set of the microprocessor 100, which resembles a reduced instruction set computer (RISC) instruction set. The reservation station 108 issues instructions to the execution units 112 for execution out of program order. The retire unit 114 includes a reorder buffer that enforces retirement of instructions in program order. The execution units 112 include a load/store unit 134 and other execution units 132, such as integer units, floating-point units, branch units, or single-instruction-multiple-data (SIMD) units. The load/store unit 134 reads data from, and writes data to, a first-level data cache 116 (level-1 data cache). A second-level cache memory 118 backs the first-level data cache 116 and the instruction cache 102. The second-level cache 118 reads and writes system memory through a bus interface unit 122, which is the interface between the microprocessor 100 and a bus (such as a local bus or a memory bus). The microprocessor 100 also includes a prefetch unit 124 that prefetches data from system memory into the second-level cache 118 and/or the first-level data cache 116.
Fig. 3 is a more detailed block diagram of the prefetch unit 124 of Fig. 2. The prefetch unit 124 includes a block bit mask register 302. Each bit in the block bit mask register 302 corresponds to one cache line of a memory block whose block number is stored in a block number register 303. In other words, the block number register 303 stores the upper address bits of the memory block. A true value of a bit in the block bit mask register 302 indicates that the corresponding cache line has been accessed. The block bit mask register 302 is initialized with all bits false. In one embodiment, the size of a memory block is 4KB and the size of a cache line is 64 bytes; thus, the block bit mask register 302 holds 64 bits. In some embodiments, the size of a memory block may be the same as the size of a physical memory page. However, the cache line size may be different in other embodiments. Furthermore, the size of the memory region tracked by the block bit mask register 302 is variable and need not correspond to the size of a physical memory page. More precisely, the size of the memory region (or block) tracked by the block bit mask register 302 can be any size (preferably a power of two), as long as it covers enough cache lines to enable detection of a clear prefetch direction and pattern.
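For the 4KB-block / 64-byte-line embodiment, the block number and the bit position within the block bit mask follow directly from the access address. A minimal sketch (hypothetical helper names; the hardware would simply route address bit fields):

```python
BLOCK_SHIFT = 12  # log2(4KB block size)
LINE_SHIFT = 6    # log2(64-byte cache line)

def block_number(addr):
    """Upper address bits identifying the memory block (cf. register 303)."""
    return addr >> BLOCK_SHIFT

def line_index(addr):
    """Index of the cache line within its block (bit position in register 302)."""
    return (addr >> LINE_SHIFT) & 0x3F

addr = 0x12345ABC
assert block_number(addr) == 0x12345  # address bits above bit 11
assert line_index(addr) == 0x2A       # address bits 11:6
```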
The prefetch unit 124 also includes a min pointer register 304 and a max pointer register 306. Once the prefetch unit 124 begins tracking accesses to a memory block, these registers are continuously updated to point to the index of the lowest and highest cache line, respectively, accessed within the block. The prefetch unit 124 further includes a min_change counter 308 and a max_change counter 312, which count the number of times the min pointer register 304 and the max pointer register 306, respectively, have changed since the prefetch unit 124 began tracking accesses to the block. The prefetch unit 124 also includes a total counter 314, which counts the total number of cache lines accessed since the prefetch unit 124 began tracking accesses to the block. The prefetch unit 124 also includes a middle pointer register 316, which points to the index of the middle cache line among those accessed (namely, the average of the min pointer register 304 and the max pointer register 306) since the prefetch unit 124 began tracking accesses to the block. The prefetch unit 124 also includes a direction register 342, a pattern register 344, a pattern period register 346, a pattern location register 348, and a search pointer register 352, whose functions are described below.
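The per-access bookkeeping of registers 304 through 316 can be sketched as follows (a hypothetical software model; whether the very first access counts as a "change" of the pointers is not specified by the text and is a modeling choice here):

```python
class BlockTracker:
    """Software model of one hardware unit's min/max tracking state."""
    def __init__(self):
        self.min_ptr = None    # min pointer register 304
        self.max_ptr = None    # max pointer register 306
        self.min_change = 0    # min_change counter 308
        self.max_change = 0    # max_change counter 312
        self.total = 0         # total counter 314
        self.mid_ptr = None    # middle pointer register 316

    def access(self, line_idx):
        self.total += 1
        if self.min_ptr is None or line_idx < self.min_ptr:
            self.min_ptr = line_idx
            self.min_change += 1
        if self.max_ptr is None or line_idx > self.max_ptr:
            self.max_ptr = line_idx
            self.max_change += 1
        # Middle pointer is the average of the min and max pointers.
        self.mid_ptr = (self.min_ptr + self.max_ptr) // 2

t = BlockTracker()
for idx in [5, 6, 7, 6, 9, 8, 10]:   # a mostly-upward access stream
    t.access(idx)
assert (t.min_ptr, t.max_ptr) == (5, 10)
assert t.max_change > t.min_change   # the upper edge moved more: upward trend
```

Note that the min/max pointers play the role of the lower and upper edges of the bounding box described above, and the change counters record how often each edge moved.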
The prefetch unit 124 also includes multiple period match counters 318. Each period match counter 318 maintains a count for a different period. In one embodiment, the periods are 3, 4, and 5. The period is the number of bits to the left/right of the middle pointer register 316. The period match counters 318 are updated after each memory access to the block. If the block bit mask register 302 indicates that the accesses to the left of the middle pointer 316 within the period match the accesses to the right of the middle pointer 316, the prefetch unit 124 increments the period match counter 318 associated with that period. The application and operation of the period match counters 318 are described in more detail below, particularly with respect to Figs. 4 and 5.
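The period-matching test can be sketched as follows. This is one plausible reading of the text (the description does not pin down whether the right-hand window starts at or after the middle pointer; here it starts at the middle pointer, which makes a genuinely periodic mask always match its own period):

```python
def period_matches(mask, mid, period):
    """Compare the `period` mask bits left of the middle pointer with the
    `period` bits at and to the right of it."""
    left = (mask >> (mid - period)) & ((1 << period) - 1)
    right = (mask >> mid) & ((1 << period) - 1)
    return left == right

# Accesses to every other cache line (2, 4, ..., 12); middle pointer is 7.
mask = sum(1 << i for i in [2, 4, 6, 8, 10, 12])
matches = {p: period_matches(mask, 7, p) for p in (3, 4, 5)}
assert matches == {3: False, 4: True, 5: False}
```

Here the every-other-line pattern is detected at period 4 (a multiple of its true period 2, which is the smallest candidate period that preserves the pattern's alignment), so the period-4 match counter would be the one incremented on such accesses.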
The prefetch unit 124 also includes a prefetch request queue 328, a pop pointer 324, and a push pointer 326. The prefetch request queue 328 comprises a circular queue of entries, each of which stores a prefetch request generated by the operation of the prefetch unit 124 (described particularly with respect to Figs. 4, 6, and 7). The push pointer 326 indicates the next entry to be allocated in the prefetch request queue 328. The pop pointer 324 indicates the next entry to be removed from the prefetch request queue 328. In one embodiment, because prefetch requests may complete out of order, the prefetch request queue 328 is capable of popping completed entries out of order. In one embodiment, the size of the prefetch request queue 328 is chosen so that the number of its entries is at least as large as the number of pipeline stages in the second-level cache 118, so that in the pipelined flow every request can be selected into the tag pipeline of the second-level cache 118. A prefetch request is maintained until the pipeline of the second-level cache 118 finishes it, at which point the request may have one of three outcomes, as described in more detail with respect to Fig. 7: it hits in the second-level cache 118, it is replayed, or an entry is pushed down the full pipeline to fetch the needed data from system memory.
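The circular-queue discipline of the prefetch request queue 328 with its push and pop pointers can be sketched as follows (a hypothetical minimal software model; the real queue additionally supports out-of-order removal of completed entries, which this sketch omits for brevity):

```python
class PrefetchRequestQueue:
    """Minimal circular queue with push (326) and pop (324) pointers."""
    def __init__(self, size):
        self.entries = [None] * size
        self.push_ptr = 0   # next entry to allocate
        self.pop_ptr = 0    # next entry to remove
        self.count = 0

    def push(self, request):
        assert self.count < len(self.entries), "queue full"
        self.entries[self.push_ptr] = request
        self.push_ptr = (self.push_ptr + 1) % len(self.entries)
        self.count += 1

    def pop(self):
        assert self.count > 0, "queue empty"
        request = self.entries[self.pop_ptr]
        self.pop_ptr = (self.pop_ptr + 1) % len(self.entries)
        self.count -= 1
        return request

q = PrefetchRequestQueue(4)
for line in ["A", "B", "C"]:
    q.push(line)
assert q.pop() == "A" and q.pop() == "B"   # FIFO order in this simple model
```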
The prefetch unit 124 also includes control logic 322, which controls the elements of the prefetch unit 124 to perform their functions.
Although Fig. 3 shows only one set 332 of the hardware associated with one active memory block (the block bit mask register 302, block number register 303, min pointer register 304, max pointer register 306, min_change counter 308, max_change counter 312, total counter 314, middle pointer register 316, pattern period register 346, pattern location register 348, and search pointer register 352), the prefetch unit 124 may include multiple hardware units 332 as shown in Fig. 3 to track accesses to multiple active memory blocks.
In one embodiment, the microprocessor 100 also includes one or more highly reactive prefetch units (not shown), which use different algorithms operating on very small transient samples of accesses and work in conjunction with the prefetch unit 124, as described below. Because the prefetch unit 124 described herein analyzes a larger number of memory accesses (compared with a highly reactive prefetch unit), it necessarily tends to take longer to begin prefetching a new memory block, as described below, but is more accurate than a highly reactive prefetch unit. Therefore, by operating a highly reactive prefetch unit and the prefetch unit 124 simultaneously, the microprocessor 100 enjoys both the faster response time of the highly reactive prefetch unit and the higher accuracy of the prefetch unit 124. In addition, the prefetch unit 124 may monitor the requests from the other prefetch units and use these requests in its own prefetch algorithm.
Fig. 4 is a flowchart of the operation of the microprocessor 100 of Fig. 2, and particularly of the prefetch unit 124 of Fig. 3. Flow begins at block 402.
At block 402, the prefetch unit 124 receives a load/store memory access request, i.e., a load or store memory access to a memory address. In one embodiment, the prefetch unit 124 may distinguish load memory accesses from store memory accesses when deciding which cache lines to prefetch. In other embodiments, the prefetch unit 124 does not distinguish loads from stores when deciding which cache lines to prefetch. In one embodiment, the prefetch unit 124 receives the memory access requests output by the load/store unit 134. The prefetch unit 124 may receive memory access requests from various sources, including but not limited to the load/store unit 134, the first-level data cache 116 (e.g., an allocation request generated by the first-level data cache 116 when a load/store unit 134 memory access misses in the first-level data cache 116), and/or other sources, such as other prefetch units (not shown) of the microprocessor 100 that employ prefetch algorithms different from that of the prefetch unit 124. Flow proceeds to block 404.
At block 404, the control logic 322 determines whether the memory access is to an active memory block by comparing the memory access address with the value of each block number register 303. That is, the control logic 322 determines whether a hardware unit 332 of Fig. 3 has been allocated for the memory block implicated by the memory address specified by the memory access request. If not, flow proceeds to block 406; otherwise, flow proceeds to block 408.
At block 406, the control logic 322 allocates a hardware unit 332 of Fig. 3 for the implicated memory block. In one embodiment, the control logic 322 allocates the hardware units 332 in a round-robin fashion. In other embodiments, the control logic 322 maintains least-recently-used information for the hardware units 332 and allocates on a least-recently-used basis. In addition, the control logic 322 initializes the allocated hardware unit 332. In particular, the control logic 322 clears all bits of the block bit mask register 302, populates the block number register 303 with the upper bits of the memory access address, and clears to zero the min pointer register 304, max pointer register 306, min_change counter 308, max_change counter 312, total counter 314, and period match counters 318. Flow proceeds to block 408.
In step 408, the control logic 322 updates the hardware unit 332 based on the memory access address, as described with respect to Fig. 5. Flow proceeds to step 412.
In step 412, the control logic 322 examines the total counter 314 to determine whether the program has made enough access requests to the memory block to detect an access pattern. In one embodiment, the control logic 322 determines whether the count of the total counter 314 exceeds a predetermined value. In one embodiment, the predetermined value is 10, although the predetermined value may take various other values and the invention is not limited thereto. If enough access requests have been made, flow proceeds to step 414; otherwise, flow ends.
In step 414, the control logic 322 determines whether the access requests indicated in the block bitmask register 302 exhibit a clear trend. That is, the control logic 322 determines whether the access requests have a clear upward trend (increasing access addresses) or a clear downward trend (decreasing access addresses). In one embodiment, the control logic 322 determines whether a clear trend exists according to whether the difference between the minimum change counter 308 and the maximum change counter 312 exceeds a predetermined value. In one embodiment, the predetermined value is 2, although it may be other values in other embodiments. If the count of the minimum change counter 308 exceeds the count of the maximum change counter 312 by the predetermined value, there is a clear downward trend; conversely, if the count of the maximum change counter 312 exceeds the count of the minimum change counter 308 by the predetermined value, there is a clear upward trend. If a clear trend exists, flow proceeds to step 416; otherwise, flow ends.
In step 416, the control logic 322 determines whether the access requests indicated in the block bitmask register 302 exhibit a clear pattern period winner. In one embodiment, the control logic 322 determines whether there is a clear pattern period winner according to whether the difference between one of the period match counters 318 and all the other period match counters 318 exceeds a predetermined value. In one embodiment, the predetermined value is 2, although it may be other values in other embodiments. The updating of the period match counters 318 is described in detail with respect to Fig. 5. If a clear pattern period winner exists, flow proceeds to step 418; otherwise, flow ends.
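The three checks of steps 412, 414, and 416 can be sketched as follows. This is an illustrative software model, not the hardware itself; the thresholds (10 accesses, a difference of 2) follow the embodiments described above, and the counter values in the usage line are hypothetical.

```python
def should_prefetch(cntr_total, cntr_min_change, cntr_max_change,
                    period_match_counters,
                    total_threshold=10, trend_threshold=2, period_threshold=2):
    """Return (direction, period) if a clear trend and a clear pattern
    period winner exist (steps 412-416 of Fig. 4), else None."""
    # Step 412: enough accesses to the block to attempt pattern detection.
    if cntr_total < total_threshold:
        return None
    # Step 414: clear upward or downward trend.
    if cntr_max_change - cntr_min_change >= trend_threshold:
        direction = "up"
    elif cntr_min_change - cntr_max_change >= trend_threshold:
        direction = "down"
    else:
        return None
    # Step 416: one period match counter must beat all others by the threshold.
    best = max(period_match_counters, key=period_match_counters.get)
    others = [v for k, v in period_match_counters.items() if k != best]
    if all(period_match_counters[best] - v >= period_threshold for v in others):
        return (direction, best)
    return None

# Hypothetical state resembling the Fig. 10 example: 12 accesses, a clear
# upward trend, and a period-5 winner.
result = should_prefetch(12, 0, 5, {1: 0, 2: 0, 3: 1, 4: 1, 5: 4})
```

When all three criteria are met, the returned direction and period are what steps 418 and 422 would load into the direction register 342 and pattern period register 346.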
In step 418, the control logic 322 populates the direction register 342 to indicate the clear direction trend determined in step 414. In addition, the control logic 322 populates the pattern period register 346 with the clear winning pattern period (N) detected in step 416. Finally, the control logic 322 populates the pattern register 344 with the clear pattern period winner detected in step 416. That is, the control logic 322 populates the pattern register 344 with the N bits of the block bitmask register 302 to the right or left (depending on the match described with respect to step 518 of Fig. 5) of the middle pointer register 316. Flow proceeds to step 422.
In step 422, the control logic 322 begins prefetching the non-fetched cache lines of the memory block according to the detected direction and pattern (as shown in Fig. 6). Flow ends at step 422.
Fig. 5 shows the flow of operation of the prefetch unit 124 of Fig. 3 in performing step 408 of Fig. 4. Flow begins at step 502.
In step 502, the control logic 322 increments the total counter 314. Flow proceeds to step 504.
In step 504, the control logic 322 determines whether the current memory access address (more precisely, the index, within the memory block, of the cache line associated with the most recent memory access address) is greater than the value of the maximum index register 306. If so, flow proceeds to step 506; otherwise, flow proceeds to step 508.
In step 506, the control logic 322 updates the maximum index register 306 with the index, within the memory block, of the cache line associated with the most recent memory access address, and increments the maximum change counter 312. Flow proceeds to step 514.
In step 508, the control logic 322 determines whether the index, within the memory block, of the cache line associated with the most recent memory access address is less than the value of the minimum index register 304. If so, flow proceeds to step 512; otherwise, flow proceeds to step 514.
In step 512, the control logic 322 updates the minimum index register 304 with the index, within the memory block, of the cache line associated with the most recent memory access address, and increments the minimum change counter 308. Flow proceeds to step 514.
In step 514, the control logic 322 computes the average of the minimum index register 304 and the maximum index register 306, and updates the middle pointer register 316 with the computed average. Flow proceeds to step 516.
In step 516, the control logic 322 examines the block bitmask register 302 and, centered at the middle pointer register 316, isolates the N bits to the left and to the right of the middle pointer, where N is the number of bits associated with the respective period match counter 318. Flow proceeds to step 518.
In step 518, the control logic 322 determines whether the N bits to the left of the middle pointer register 316 match the N bits to the right of the middle pointer register 316. If so, flow proceeds to step 522; otherwise, flow ends.
In step 522, the control logic 322 increments the period match counter 318 associated with the period N. Flow ends at step 522.
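The per-access update of Fig. 5 (together with the bit set of step 408) can be sketched as follows. The representation of the hardware state as a dictionary, the choice of initializing the minimum index high so that the first access updates both extremes (consistent with the Fig. 9 example, where steps 506 and 512 both fire on the first access), and the exact alignment of the two N-bit windows around the middle pointer are illustrative assumptions.

```python
def update_on_access(state, index, periods=(1, 2, 3, 4, 5)):
    """Update the tracking state for one access to cache-line `index`
    within the memory block (steps 502-522 of Fig. 5)."""
    state["cntr_total"] += 1                        # step 502
    if index > state["max_index"]:                  # steps 504/506
        state["max_index"] = index
        state["cntr_max_change"] += 1
    if index < state["min_index"]:                  # steps 508/512
        state["min_index"] = index
        state["cntr_min_change"] += 1
    state["bitmask"] |= 1 << index                  # step 408 of Fig. 4
    mid = (state["min_index"] + state["max_index"]) // 2   # step 514
    state["middle"] = mid
    for n in periods:                               # steps 516-522
        left = (state["bitmask"] >> max(mid - n, 0)) & ((1 << n) - 1)
        right = (state["bitmask"] >> mid) & ((1 << n) - 1)
        if left == right:                           # step 518
            state["cntr_period_matches"][n] += 1    # step 522
    return state

# Replay the first two accesses of the Fig. 9 example (indices 12 and 9).
state = {"cntr_total": 0, "cntr_min_change": 0, "cntr_max_change": 0,
         "min_index": 63, "max_index": 0, "middle": 0, "bitmask": 0,
         "cntr_period_matches": {n: 0 for n in (1, 2, 3, 4, 5)}}
update_on_access(state, 12)
update_on_access(state, 9)
```

After these two accesses the model holds min/max indices of 9 and 12, matching the cntr_min_change and cntr_max_change progression shown in Fig. 9.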
Fig. 6 shows the flow of operation of the prefetch unit 124 of Fig. 3 in performing step 422 of Fig. 4. Flow begins at step 602.
In step 602, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 one detected pattern period away from the middle pointer register 316, in the detected direction. That is, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to the value of the middle pointer register 316 plus/minus the detected period (N). For example, if the value of the middle pointer register 316 is 16, N is 5, and the trend indicated by the direction register 342 is upward, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to 21. Thus, in this example, for comparison purposes (described below), the five bits of the pattern register 344 are located against bits 21 through 25 of the block bitmask register 302. Flow proceeds to step 604.
In step 604, the control logic 322 examines the bit of the block bitmask register 302 at the search pointer register 352, and the corresponding bit of the pattern register 344 (whose location within the block bitmask register 302 is indicated by the pattern location register 348), in order to predict whether the corresponding cache line within the memory block should be prefetched. Flow proceeds to step 606.
In step 606, the control logic 322 predicts whether the examined cache line is needed. If the bit of the pattern register 344 is true, the control logic 322 predicts the cache line is needed, i.e., the pattern predicts the program will access the cache line. If the cache line is needed, flow proceeds to step 614; otherwise, flow proceeds to step 608.
In step 608, the control logic 322 determines whether there are any more unexamined cache lines in the memory block, according to whether the search pointer register 352 has reached the end of the block bitmask register 302 in the direction indicated by the direction register 342. If there are no unexamined cache lines, flow ends; otherwise, flow proceeds to step 612.
In step 612, the control logic 322 increments/decrements the search pointer register 352. In addition, if the search pointer register 352 has passed beyond the last bit of the pattern register 344, the control logic 322 updates the pattern location register 348 with the new value of the search pointer register 352, i.e., shifts the pattern register 344 to the new search pointer location. Flow proceeds to step 604.
In step 614, the control logic 322 determines whether the needed cache line has already been fetched. If the bit of the block bitmask register 302 is true, the control logic 322 determines that the needed cache line has already been fetched. If the needed cache line has already been fetched, flow proceeds to step 608; otherwise, flow proceeds to step 616.
In step 616, if the direction register 342 indicates downward, the control logic 322 determines whether the cache line under consideration is more than a predetermined amount away from the minimum index register 304 (the predetermined amount is 16 in one embodiment); if the direction register 342 indicates upward, the control logic 322 determines whether the cache line under consideration is more than the predetermined amount away from the maximum index register 306. If the cache line under consideration is more than the predetermined amount away, flow ends; otherwise, flow proceeds to step 618. It is worth noting that ending the flow because the cache line is significantly beyond the minimum index register 304 / maximum index register 306 does not mean the prefetch unit 124 will never prefetch the other cache lines of the memory block; rather, subsequent accesses to cache lines of the memory block may trigger further prefetch operations according to the steps of Fig. 4.
In step 618, the control logic 322 determines whether the prefetch request queue 328 is full. If the prefetch request queue 328 is full, flow proceeds to step 622; otherwise, flow proceeds to step 624.
In step 622, the control logic 322 stalls until the prefetch request queue 328 is non-full. Flow proceeds to step 624.
In step 624, the control logic 322 pushes an entry into the prefetch request queue 328 in order to prefetch the cache line. Flow proceeds to step 608.
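The prediction walk of steps 604 through 624 can be sketched as follows for the upward direction. This is a simplified software model under stated assumptions: the stall when the queue is full (steps 618/622) and the far-from-maximum cutoff (step 616) are omitted, and bit 0 of the pattern is assumed to align with the pattern location register.

```python
def predict_prefetches(bitmask, pattern, period, start, nbits=64):
    """Walk the detected pattern upward across the memory block
    (steps 604-624 of Fig. 6) and return the indices of cache lines the
    pattern predicts are needed but that have not yet been fetched."""
    requests = []
    search = start            # search pointer register 352
    pat_loc = start           # pattern location register 348
    while search < nbits:                              # step 608
        pat_bit = (pattern >> (search - pat_loc)) & 1  # step 604
        fetched = (bitmask >> search) & 1
        if pat_bit and not fetched:                    # steps 606/614
            requests.append(search)                    # step 624: push request
        search += 1                                    # step 612
        if search - pat_loc >= period:                 # shift pattern forward
            pat_loc = search
    return requests

# Hypothetical state after pattern detection: pattern "01010" (period 5)
# placed at bit 21, with lines 22 and 24 already fetched.
requests = predict_prefetches((1 << 22) | (1 << 24), 0b01010, 5, 21, nbits=33)
```

In this example the walk skips the already-fetched lines at bits 22 and 24 and requests the predicted lines at bits 27, 29, and 32, mirroring how instance 11 of Fig. 11 issues the request for bit 32.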
Fig. 7 shows the flow of operation of the prefetch request queue 328 of Fig. 3. Flow begins at step 702.
In step 702, a prefetch request that was pushed into the prefetch request queue 328 in step 624 is granted access (the prefetch request accesses the second-level cache 118) and proceeds down the pipeline of the second-level cache 118. Flow proceeds to step 704.
In step 704, the second-level cache 118 determines whether the cache line address hits in the second-level cache 118. If the cache line address hits in the second-level cache 118, flow proceeds to step 706; otherwise, flow proceeds to step 708.
In step 706, because the cache line is already present in the second-level cache 118, there is no need to prefetch the cache line, and flow ends.
In step 708, the control logic 322 determines whether the response of the second-level cache 118 indicates that the prefetch request must be replayed. If so, flow proceeds to step 712; otherwise, flow proceeds to step 714.
In step 712, the prefetch request to prefetch the cache line is re-pushed into the prefetch request queue 328. Flow ends at step 712.
In step 714, the second-level cache 118 pushes an entry into a fill queue (not shown) of the microprocessor 100 to request the bus interface unit 122 to read the cache line into the microprocessor 100. Flow ends at step 714.
Fig. 9 shows an example of operation of the microprocessor 100 of Fig. 2. Fig. 9 shows, for ten accesses to a memory block, the contents of the block bitmask register 302 (an asterisk at a bit position indicates an access to the corresponding cache line), the minimum change counter 308, the maximum change counter 312, and the total counter 314 upon the first, second, and tenth accesses. In Fig. 9, the minimum change counter 308 is referred to as "cntr_min_change", the maximum change counter 312 as "cntr_max_change", and the total counter 314 as "cntr_total". The location of the middle pointer register 316 is indicated in Fig. 9 by an "M".
Because the first access, to address 0x4dced300 (e.g., at step 402 of Fig. 4), is to the cache line at index 12 within the memory block, the control logic 322 sets bit 12 of the block bitmask register 302 (step 408 of Fig. 4), as shown. In addition, the control logic 322 updates the minimum change counter 308, the maximum change counter 312, and the total counter 314 (steps 502, 506, and 512 of Fig. 5).
Because the second access, to address 0x4dced260, is to the cache line at index 9 within the memory block, the control logic 322 accordingly sets bit 9 of the block bitmask register 302, as shown. In addition, the control logic 322 updates the counts of the minimum change counter 308 and the total counter 314.
On the third through tenth accesses (the addresses of the third through ninth accesses are not shown; the address of the tenth access is 0x4dced6c0), the control logic 322 accordingly sets the appropriate bits of the block bitmask register 302, as shown. In addition, for each access, the control logic 322 updates the counts of the minimum change counter 308, the maximum change counter 312, and the total counter 314.
The bottom of Fig. 9 shows the contents of the period match counters 318 after the control logic 322 has performed steps 514 through 522 for each of the ten memory accesses. In Fig. 9, the period match counters 318 are referred to as "cntr_period_N_matches", where N is 1, 2, 3, 4, or 5.
In the example of Fig. 9, although the criterion of step 412 is met (the total counter 314 is at least ten) and the criterion of step 416 is met (the period-5 period match counter 318 exceeds all the other period match counters 318 by at least 2), the criterion of step 414 is not met (the difference between the minimum change counter 308 and the maximum change counter 312 is less than 2). Therefore, no prefetch operation is performed for this memory block at this time.
The bottom of Fig. 9 also shows, for periods 3, 4, and 5, the pattern of bits to the right and to the left of the middle pointer register 316.
Fig. 10 shows the operation of the microprocessor 100 of Fig. 2, continuing the example of Fig. 9. Fig. 10 depicts information similar to that of Fig. 9, but after the eleventh and twelfth accesses to the memory block (the address of the twelfth access is 0x4dced760). As shown, the criterion of step 412 is met (the total counter 314 is at least ten), the criterion of step 414 is met (the difference between the minimum change counter 308 and the maximum change counter 312 is at least 2), and the criterion of step 416 is met (the period-5 period match counter 318 exceeds all the other period match counters 318 by at least 2). Therefore, according to step 418 of Fig. 4, the control logic 322 populates the direction register 342 (to indicate an upward direction trend), the pattern period register 346 (with the value 5), and the pattern register 344 (with the pattern "* *", i.e., "01010"). According to step 422 of Fig. 4 and Fig. 6, the control logic 322 also performs prefetch prediction for the memory block, as shown in Fig. 11. Fig. 10 further shows that, per the operation of step 602 of Fig. 6, the search pointer register 352 and the pattern location register 348 point at bit 21.
Fig. 11 shows the operation of the microprocessor 100 of Fig. 2, continuing the example of Figs. 9 and 10. Fig. 11 depicts, for each of twelve successive instances (labeled 0 through 11) of the prefetch unit 124 performing steps 604 through 616 of Fig. 6, the operation by which the prefetch unit 124 predicts and finds a cache line of the memory block that needs to be prefetched. As shown, in each instance the value of the search pointer register 352 is incremented according to step 612 of Fig. 6. As shown in Fig. 11, in instances 5 and 10, the pattern location register 348 is updated according to step 612 of Fig. 6. As shown in instances 0, 2, 4, 5, 7, and 10, because the bit of the pattern register 344 at the search pointer 352 is false, the pattern indicates that the cache line at the search pointer 352 will not be needed. As also shown, in instances 1, 3, 6, and 8, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed; however, the cache line has already been fetched, as indicated by the corresponding bit of the block bitmask register 302 being true. Finally, as shown, in instance 11, because the bit of the pattern register 344 at the search pointer 352 is true, the pattern register 344 indicates that the cache line at the search pointer 352 will be needed, and because the corresponding bit of the block bitmask register 302 is false, the cache line has not yet been fetched. Therefore, according to step 624 of Fig. 6, the control logic 322 pushes a prefetch request into the prefetch request queue 328 to prefetch the cache line at address 0x4dced800, which corresponds to bit 32 of the block bitmask register 302.
In one embodiment, the one or more predetermined values described herein may be programmed by the operating system (e.g., via a model specific register (MSR)) or via fuses of the microprocessor 100, which may be blown during manufacture of the microprocessor 100.
In one embodiment, the size of the block bitmask register 302 may be reduced in order to save power and die real estate. That is, the number of bits in each block bitmask register 302 is less than the number of cache lines in a memory block. For example, in one embodiment, the number of bits in each block bitmask register 302 is only half the number of cache lines included in a memory block. The block bitmask register 302 tracks accesses to only the upper half or the lower half of the block, depending on which half of the memory block is accessed first, and an additional bit indicates whether the lower half or the upper half of the memory block was accessed first.
In one embodiment, rather than examining the N bits above and below the middle pointer register 316 as described with respect to steps 516/518, the control logic 322 includes a serial engine that scans the block bitmask register 302 one or two bits at a time in order to find patterns whose period exceeds the maximum period otherwise provided for (5 in the example above).
In one embodiment, if no clear direction trend is detected in step 414, or no clear pattern period is detected in step 416, and the count of the total counter 314 reaches a predetermined threshold (indicating that a large fraction of the cache lines in the memory block have been accessed), the control logic 322 proceeds to prefetch the remaining cache lines of the memory block. The predetermined threshold is a relatively high percentage of the number of cache lines in the memory block, e.g., of the number of bits in the block bitmask register 302.
Prefetch Unit Combined with the Second-Level Cache and the First-Level Data Cache
Modern microprocessors include a hierarchy of cache memories. Typically, a microprocessor includes both a small and fast first-level data cache and a larger but slower second-level cache, such as the first-level data cache 116 and the second-level cache 118 of Fig. 2, respectively. A cache hierarchy lends itself to prefetching data into the caches in order to improve the cache hit rate. Given the speed of the first-level data cache 116, it is preferable to prefetch data into the first-level data cache 116. However, because the storage capacity of the first-level data cache 116 is relatively small, the cache hit rate may actually be worsened if the prefetch unit incorrectly prefetches data into the first-level data cache 116, because the prefetched data, ultimately being unneeded, displaces other data that is needed. Consequently, whether it is beneficial to prefetch data into the first-level data cache 116 or into the second-level cache 118 is a function of how accurately the prefetch unit can predict whether the data will be needed. Because the first-level data cache 116 is required to be small, a first-level data cache prefetch unit tends to be small and therefore less accurate; conversely, because the size of the second-level cache tag and data arrays dwarfs the size of a prefetch unit, a second-level cache prefetch unit can be larger and therefore more accurate.
An advantage of the microprocessor 100 described in the embodiments of the present invention is that a single prefetch unit serves as the basis for the prefetch needs of both the second-level cache 118 and the first-level data cache 116. The embodiments of the present invention apply the improved accuracy of the (second-level cache 118) prefetch unit to solve the problem, described above, of prefetching into the first-level data cache 116. Furthermore, the embodiments accomplish the goal of handling the prefetch operations of both the first-level data cache 116 and the second-level cache 118 with a single body of logic.
Fig. 12 shows the microprocessor 100 according to embodiments of the present invention. The microprocessor 100 of Fig. 12 is similar to the microprocessor 100 of Fig. 2 and has additional features described below.
The first-level data cache 116 provides first-level data memory addresses 196 to the prefetch unit 124. A first-level data memory address 196 is the physical address of a load/store access made by the load/store unit 134 to the first-level data cache 116. That is, the prefetch unit 124 eavesdrops as the load/store unit 134 accesses the first-level data cache 116. The prefetch unit 124 provides pattern-predicted cache line addresses 194 to a queue 198 of the first-level data cache 116; a pattern-predicted cache line address 194 is the address of a cache line that the prefetch unit 124 predicts, based on the first-level data memory addresses 196, the load/store unit 134 will soon request of the first-level data cache 116. The first-level data cache 116 issues cache line allocation requests 192 to the second-level cache 118 for the cache lines whose addresses are stored in the queue 198. Finally, the second-level cache 118 provides the requested cache line data 188 to the first-level data cache 116.
The prefetch unit 124 also includes a first-level data search pointer 172 and a first-level data pattern address 178, as shown in Fig. 12. The uses of the first-level data search pointer 172 and the first-level data pattern address 178 are described below.
Fig. 13 shows a flowchart of the operation of the prefetch unit 124 of Fig. 12. Flow begins at step 1302.
In step 1302, the prefetch unit 124 receives the first-level data memory address 196 of Fig. 12 from the first-level data cache 116. Flow proceeds to step 1304.
In step 1304, the prefetch unit 124 detects that the first-level data memory address 196 falls within a memory block (e.g., a page) for which the prefetch unit 124 has previously detected an access pattern and begun prefetching cache lines from system memory into the second-level cache 118, as described above with respect to Figs. 1 through 11. More specifically, because the access pattern has been detected, the prefetch unit 124 is maintaining a block number register 303 that specifies the base address of the memory block. The prefetch unit 124 detects whether the first-level data memory address 196 falls within the memory block by detecting whether the bits of the block number register 303 match the corresponding bits of the first-level data memory address 196. Flow proceeds to step 1306.
In step 1306, starting at the first-level data memory address 196, the prefetch unit 124 finds, within the memory block, the next two cache lines in the previously detected access direction. The operation performed in step 1306 is described in more detail below with respect to Fig. 14. Flow proceeds to step 1308.
In step 1308, the prefetch unit 124 provides the physical addresses of the next two cache lines found in step 1306 to the first-level data cache 116 as pattern-predicted cache line addresses 194. In other embodiments, the number of cache line addresses provided by the prefetch unit 124 may be more or less than two. Flow proceeds to step 1312.
In step 1312, the first-level data cache 116 pushes the addresses provided in step 1308 into the queue 198. Flow proceeds to step 1314.
In step 1314, whenever the queue 198 is non-empty, the first-level data cache 116 takes the next address out of the queue 198 and issues a cache line allocation request 192 to the second-level cache 118 for the cache line at that address. However, if an address in the queue 198 is already present in the first-level data cache 116, the first-level data cache 116 dumps the address and forgoes requesting its cache line from the second-level cache 118. The second-level cache 118 then provides the requested cache line data 188 to the first-level data cache 116. Flow ends at step 1314.
Fig. 14 shows a flowchart of the operation of the prefetch unit 124 of Fig. 12 in performing step 1306 of Fig. 13. The operation described in Fig. 14 assumes the pattern direction detected in Fig. 3 is upward; however, the prefetch unit 124 is also operable to perform the same function if the detected pattern direction is downward. The operation of steps 1402 through 1408 places the pattern register 344 of Fig. 3 at the appropriate location within the memory block so that the prefetch unit 124 can search for the next two cache lines by walking the pattern of the pattern register 344 starting at the first-level data memory address 196, replicating the pattern over the memory block as needed. Flow begins at step 1402.
In step 1402, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 of Fig. 12 using the pattern period register 346 and the middle pointer register 316 of Fig. 3, in a manner similar to the way step 602 of Fig. 6 initializes the search pointer register 352 and the pattern location register 348. For example, if the value of the middle pointer register 316 is 16, the value of the pattern period register 346 is 5, and the direction register 342 indicates an upward direction, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 to 21. Flow proceeds to step 1404.
In step 1404, the prefetch unit 124 determines whether the first-level data memory address 196 falls within the pattern of the pattern register 344 at its current location; the current location of the pattern is initially determined according to step 1402 and is updated according to step 1406. That is, the prefetch unit 124 determines whether the value of the relevant bits of the first-level data memory address 196 (i.e., excluding the bits that identify the memory block and the bits that specify the byte offset within the cache line) is greater than or equal to the value of the first-level data search pointer 172 and less than or equal to the sum of the value of the first-level data search pointer 172 and the value of the pattern period register 346. If the first-level data memory address 196 falls within the pattern of the pattern register 344, flow proceeds to step 1408; otherwise, flow proceeds to step 1406.
In step 1406, the prefetch unit 124 increments the first-level data search pointer 172 and the first-level data pattern address 178 by the value of the pattern period register 346. Per the operation of step 1406 (and of step 1418 described below), the search terminates if the first-level data search pointer 172 reaches the end of the memory block. Flow returns to step 1404.
In step 1408, the value setting (set) of first order data search pointer 172 is first by pre-fetch unit 124 Level data storage address 196 the storage page of relevant cache line offset (offset).Flow proceeds to step 1412。
In step 1412, the prefetch unit 124 examines the bit of the pattern register 344 at the first-level data search pointer 172. Flow proceeds to step 1414.
In step 1414, the prefetch unit 124 determines whether the bit examined in step 1412 is set. If the bit examined in step 1412 is set, flow proceeds to step 1416; otherwise, flow proceeds to step 1418.
In step 1416, the prefetch unit 124 marks the cache line predicted by the pattern register 344 in step 1414 as ready, so that its physical address may be sent to the first-level data cache 116 as a pattern-predicted cache line address 194. Flow ends at step 1416.
In step 1418, the prefetch unit 124 increments the first-level data search pointer 172. In addition, if the first-level data search pointer 172 has passed beyond the last bit of the pattern register 344, the prefetch unit 124 updates the first-level data pattern address 178 with the new value of the first-level data search pointer 172, i.e., shifts the pattern register 344 to the new first-level data search pointer 172 location. The operation of steps 1412 through 1418 repeats until two cache lines (or another predetermined number of cache lines) are found. Flow ends at step 1418.
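The search of Fig. 14 can be sketched as follows for the upward direction. This is an illustrative model under stated assumptions: cache-line offsets within the page stand in for addresses, the window test uses the bounds stated in step 1404, and the scan is assumed to begin at the accessed line's own position per step 1408; the hardware's exact alignment details may differ.

```python
def find_next_lines(l1d_index, pattern, period, middle, count=2, nbits=64):
    """Place the pattern window containing the L1D access (steps 1402-1408),
    then walk the replicated pattern upward for the next `count` cache
    lines it predicts will be needed (steps 1412-1418 of Fig. 14)."""
    # Steps 1402/1406: start the pattern one period above the middle pointer
    # and advance it period by period until the access falls inside it.
    pat_addr = middle + period        # first-level data pattern address 178
    while not (pat_addr <= l1d_index <= pat_addr + period):
        pat_addr += period
        if pat_addr >= nbits:         # reached the end of the memory block
            return []
    # Step 1408: begin the search at the accessed cache line's offset.
    search = l1d_index                # first-level data search pointer 172
    found = []
    while search < nbits and len(found) < count:
        if (pattern >> (search - pat_addr)) & 1:   # steps 1412/1414
            found.append(search)                   # step 1416: mark ready
        search += 1                                # step 1418
        if search - pat_addr >= period:
            pat_addr = search                      # shift the pattern forward
    return found

# Reusing the running example: pattern "01010", period 5, middle pointer 16,
# and a (hypothetical) L1D access at cache-line offset 23.
lines = find_next_lines(23, 0b01010, 5, 16)
```

With these inputs the pattern window is placed at bits 21 through 25, and the walk yields offsets 24 and 27 as the next two predicted lines to be sent as pattern-predicted cache line addresses 194.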
A benefit of prefetching cache lines into the first-level data cache 116 in the manner of Fig. 13 is that the changes required to the first-level data cache 116 and the second-level cache 118 are small. However, in other embodiments, the prefetch unit 124 need not provide pattern-predicted cache line addresses 194 to the first-level data cache 116. For example, in one embodiment, the prefetch unit 124 directly requests the bus interface unit 122 to obtain the cache lines from memory and then writes the received cache lines into the first-level data cache 116. In another embodiment, the prefetch unit 124 requests the cache lines from the second-level cache 118, which provides the data to the prefetch unit 124 (obtaining the cache lines from memory if they are missing), and the prefetch unit 124 writes the received cache lines into the first-level data cache 116. In other embodiments, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains them from memory if they are missing), and the cache lines are written directly into the first-level data cache 116.
As described above, various embodiments of the present invention are advantageous in that with single 124 total counter 314 of pre-fetch unit, The basis for prefetching needs as both second level memory cache 118 and first order data cache 116.Although figure 2, (content as discussed below) shown in Figure 12 and Figure 15 runs after fame bright different block, and pre-fetch unit 124 can on arrangement space Occupy the position of the label (tag) for being adjacent to second level memory cache 118 and data row (data array) and concept Upper includes second level memory cache 118, as shown in figure 21.Each embodiment allows the peace of the tool large space of loading/storage element 134 Row is come the demand of its accuracy and its large space for being promoted, to handle first order data cache using a monomer logic 116 and second level memory cache 118 pre- extract operation, with solve can only prefetch in the prior art into data to capacity compared with The problem of small first order data cache 116.
Bounding-Box Prefetch Unit with Reduced Warm-Up Penalty on Page Crossings
The prefetch unit 124 of the present invention detects relatively complex access patterns over a memory block (for example, a physical memory page) that conventional prefetch units do not detect. For example, the prefetch unit 124 can detect that a program is accessing a memory block according to a pattern even when the out-of-order execution pipeline of the microprocessor 100 re-orders the memory accesses out of program-instruction order, a situation which would likely cause a conventional prefetch unit to fail to detect the access pattern and therefore perform no prefetching. This is because the prefetch unit 124 considers the accesses to a memory block as a whole; their time order is not a consideration.
However, in order to achieve the ability to identify more complex access patterns and/or reordered access patterns, the prefetch unit 124 of the present invention may need a longer time than conventional prefetch units to detect an access pattern, referred to below as its "warm-up time." A method of reducing the warm-up time of the prefetch unit 124 is therefore needed.
The prefetch unit 124 predicts whether a program that was previously accessing a memory block according to an access pattern has actually crossed over into a new memory block virtually adjacent to the old memory block, and predicts whether the program will continue to access the new memory block according to the same pattern. In response, the prefetch unit 124 uses the pattern, direction and other relevant information from the old memory block to accelerate the detection of the access pattern in the new memory block, that is, to reduce the warm-up time.
Figure 15 is a block diagram of a microprocessor 100 having a prefetch unit 124. The microprocessor 100 of Figure 15 is similar to the microprocessor 100 of Figures 2 and 12, and has the additional features described below.
As described in connection with Figure 3, the prefetch unit 124 includes a plurality of hardware units 332. Compared with Figure 3, each hardware unit 332 further includes a hashed virtual address of memory block (HVAMB) field 354 and a status field 356. During the initialization of the allocated hardware unit 332 at step 406 of Figure 4, the prefetch unit 124 takes the physical block number from the block number register 303, translates the physical block number into a virtual address, hashes that virtual address according to the same hashing algorithm performed at step 1704 of subsequent Figure 17, and stores the result of the hash computation into the hashed virtual address of memory block field 354. The status field 356 has three possible values: inactive, active, and probationary, as described below. The prefetch unit 124 also includes a virtual hash table (VHT) 162; the organization and operation of the virtual hash table 162 are described in detail with reference to subsequent Figures 16 through 19.
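The fields a hardware unit 332 carries after the Figure 15 extensions can be summarized in a short sketch; the Python field names and default values are illustrative assumptions, and register widths are omitted:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Status(Enum):
    INACTIVE = auto()       # unit is free for allocation
    PROBATIONARY = auto()   # inherited pattern not yet confirmed
    ACTIVE = auto()         # pattern confirmed; prefetching enabled

@dataclass
class HardwareUnit:
    # Fields carried over from Figure 3 (names assumed from the text):
    block_number: int = 0   # physical block number register 303
    direction: int = 0      # direction register 342 (+1 up, -1 down)
    pattern: int = 0        # pattern register 344, one bit per cache line
    # Fields added in Figure 15:
    hvamb: int = 0          # hashed virtual address of memory block 354
    status: Status = Status.INACTIVE  # status field 356

unit = HardwareUnit(hvamb=0b101101)
print(unit.status.name)  # → INACTIVE
```

A newly allocated unit starts inactive; the probationary state comes into play only when a pattern is inherited from a virtually adjacent block, as described below.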
Figure 16 shows the virtual hash table 162 of Figure 15. The virtual hash table 162 includes a plurality of entries, preferably organized as a queue. Each entry includes a valid bit (not shown) and three fields: a minus-1 hashed virtual address 1602 (HVAM1), an unmodified hashed virtual address 1604 (HVAUN), and a plus-1 hashed virtual address 1606 (HVAP1). The generation of the values that populate these fields is described with reference to subsequent Figure 17.
Figure 17 is a flowchart of the operation of the microprocessor 100 of Figure 15. Flow begins at step 1702.
In step 1702, the first-level data cache 116 receives a load/store request from the load/store unit 134; the load/store request includes a virtual address. Flow proceeds to step 1704.
In step 1704, the first-level data cache 116 performs a hash function on selected bits of the virtual address received at step 1702 to generate the unmodified hashed virtual address 1604 (HVAUN). In addition, the first-level data cache 116 adds the memory block size (MBS) to the selected bits of the virtual address received at step 1702 to generate a sum, and performs the hash function on the sum to generate the plus-1 hashed virtual address 1606 (HVAP1). In addition, the first-level data cache 116 subtracts the memory block size from the selected bits of the virtual address received at step 1702 to generate a difference, and performs the hash function on the difference to generate the minus-1 hashed virtual address 1602 (HVAM1). In one embodiment, the memory block size is 4KB. In one embodiment, the virtual address is 40 bits, and bits 39:30 and 11:0 of the virtual address are ignored by the hash function. The remaining 18 virtual address bits are "dealt", like a hand of cards, into the hash bit positions. The idea is that the lower bits of the virtual address have the highest entropy and the upper bits have the lowest entropy; dealing them in this way ensures that the entropy levels are spread more consistently across the hash bits. In one embodiment, the remaining 18 virtual address bits are hashed down to 6 bits according to the method of Table 1 below. However, other embodiments may use different hash algorithms; furthermore, if design considerations dictate that performance dominates space and power consumption, embodiments may use no hash algorithm at all. Flow proceeds to step 1706.
assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];
Table 1
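Under the assumption that `VA[n]` in Table 1 denotes bit n of the 40-bit virtual address, the 18-bit-to-6-bit hash can be sketched in software as follows (the function name is illustrative):

```python
def hash_va(va: int) -> int:
    """Fold virtual-address bits VA[29:12] into a 6-bit hash per Table 1.

    Bits VA[39:30] and VA[11:0] are ignored; each output bit XORs one
    high (low-entropy) bit with two lower (higher-entropy) bits, spreading
    entropy evenly across the six hash bits.
    """
    bit = lambda n: (va >> n) & 1
    h5 = bit(29) ^ bit(18) ^ bit(17)
    h4 = bit(28) ^ bit(19) ^ bit(16)
    h3 = bit(27) ^ bit(20) ^ bit(15)
    h2 = bit(26) ^ bit(21) ^ bit(14)
    h1 = bit(25) ^ bit(22) ^ bit(13)
    h0 = bit(24) ^ bit(23) ^ bit(12)
    return (h5 << 5) | (h4 << 4) | (h3 << 3) | (h2 << 2) | (h1 << 1) | h0

print(hash_va(1 << 29))  # → 32 (only hash[5] is set)
```

Note that two addresses exactly one memory block (4KB) apart differ only in bits 12 and up, so their hashes generally differ, which is what lets HVAM1/HVAUN/HVAP1 distinguish adjacent blocks.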
In step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, plus-1 hashed virtual address (HVAP1) 1606 and minus-1 hashed virtual address (HVAM1) 1602 generated at step 1704 to the prefetch unit 124. Flow proceeds to step 1708.
In step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the unmodified hashed virtual address (HVAUN) 1604, plus-1 hashed virtual address (HVAP1) 1606 and minus-1 hashed virtual address (HVAM1) 1602 received at step 1706. That is, if the virtual hash table 162 already contains an entry with this unmodified hashed virtual address 1604 (HVAUN), plus-1 hashed virtual address 1606 (HVAP1) and minus-1 hashed virtual address 1602 (HVAM1), the prefetch unit 124 foregoes updating the virtual hash table 162. Otherwise, the prefetch unit 124 pushes the unmodified hashed virtual address 1604 (HVAUN), plus-1 hashed virtual address 1606 (HVAP1) and minus-1 hashed virtual address 1602 (HVAM1) into the virtual hash table 162 in a first-in-first-out manner and marks the pushed entry valid. Flow ends at step 1708.
Figure 18 shows the contents of the virtual hash table 162 of Figure 16 after the prefetch unit 124 has operated according to the description of Figure 17, in a case where the load/store unit 134, in response to program execution, has proceeded in an upward direction through two memory blocks (denoted A and A+MBS) and into a third memory block (denoted A+2*MBS), causing the prefetch unit 124 to populate the virtual hash table 162 as shown. Specifically, the entry two entries from the tail of the virtual hash table 162 contains the hash of A-MBS in the minus-1 hashed virtual address (HVAM1) 1602, the hash of A in the unmodified hashed virtual address (HVAUN) 1604, and the hash of A+MBS in the plus-1 hashed virtual address (HVAP1) 1606; the entry one entry from the tail contains the hash of A in HVAM1 1602, the hash of A+MBS in HVAUN 1604, and the hash of A+2*MBS in HVAP1 1606; and the tail entry (i.e., the most recently pushed entry) contains the hash of A+MBS in HVAM1 1602, the hash of A+2*MBS in HVAUN 1604, and the hash of A+3*MBS in HVAP1 1606.
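The FIFO behavior of steps 1704 through 1708, and the Figure 18 contents it produces, can be modeled as below. The table capacity and the stand-in hash function are assumptions made purely for illustration:

```python
from collections import deque

MBS = 4096  # 4 KB memory block size, per the embodiment above

def hash_stub(va):
    # Stand-in for the Table 1 hash; any function of the block suffices here.
    return (va // MBS) & 0x3F

def vht_push(vht, va, capacity=8):
    # One entry per access: (HVAM1, HVAUN, HVAP1) = hashes of the block
    # below, the block itself, and the block above (steps 1704-1708).
    entry = (hash_stub(va - MBS), hash_stub(va), hash_stub(va + MBS))
    if entry in vht:
        return                      # already present: forgo the update
    if len(vht) == capacity:
        vht.popleft()               # FIFO replacement of the oldest entry
    vht.append(entry)               # most recent entry at the tail

vht = deque()
A = 0x10 * MBS
for va in (A, A + MBS, A + 2 * MBS):   # upward walk of Figure 18
    vht_push(vht, va)
print(list(vht)[-1])  # → (17, 18, 19)
```

The three resulting entries reproduce the Figure 18 layout: each entry's HVAUN is the accessed block, flanked by the hashes of its two virtual neighbors.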
Figure 19 (made up of Figures 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Figure 15. Flow begins at step 1902.
In step 1902, the first-level data cache 116 transmits a new allocation request (AR) to the second-level cache 118. The allocation request is for a new memory block; that is, the prefetch unit 124 determines that the memory block associated with the allocation request is new, meaning that no hardware unit 332 has yet been allocated to the memory block associated with the allocation request. In other words, the prefetch unit 124 has not recently encountered an allocation request for the memory block. In one embodiment, the allocation request is generated when a load/store misses in the first-level data cache 116 and consequently requests the same cache line from the second-level cache 118. In one embodiment, the allocation request specifies a physical address, translated from an associated virtual address. The first-level data cache 116 hashes the virtual address associated with the physical address of the allocation request according to a hash function (i.e., the same hash function as at step 1704 of Figure 17) to generate an allocation request hashed virtual address (HVAAR), and provides the hashed virtual address of the allocation request to the prefetch unit 124. Flow proceeds to step 1903.
In step 1903, the prefetch unit 124 allocates a new hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates an inactive hardware unit 332 to the new memory block. Otherwise, in one embodiment, the prefetch unit 124 allocates the least-recently-used hardware unit 332 to the new memory block. In one embodiment, once the prefetch unit 124 has prefetched all the cache lines of a memory block indicated by the pattern, the prefetch unit 124 inactivates the hardware unit 332. In one embodiment, the prefetch unit 124 has the ability to pin a hardware unit 332 so that it is not eligible for replacement even if it becomes the least-recently-used hardware unit 332. For example, if the prefetch unit 124 has detected a predetermined number of accesses to a memory block according to the pattern, but has not yet completed all the prefetches for the entire memory block according to the pattern, the prefetch unit 124 pins the hardware unit 332 associated with the memory block so that it is not eligible for replacement even if it becomes the least-recently-used hardware unit 332. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (since its allocation), and when the age reaches a predetermined threshold, the prefetch unit 124 inactivates the hardware unit 332.
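The step 1903 allocation policy — inactive unit first, otherwise evict the least-recently-used unit, skipping pinned units — might be sketched as follows; the `pinned` and `age` field names are assumptions for illustration:

```python
def allocate_unit(units):
    # Prefer an inactive hardware unit (step 1903); otherwise evict the
    # least-recently-used unit that is not pinned.
    for u in units:
        if u["status"] == "inactive":
            return u
    candidates = [u for u in units if not u["pinned"]]
    return max(candidates, key=lambda u: u["age"]) if candidates else None

units = [
    {"id": 0, "status": "active", "pinned": True,  "age": 9},  # pinned: skipped
    {"id": 1, "status": "active", "pinned": False, "age": 7},
    {"id": 2, "status": "active", "pinned": False, "age": 3},
]
print(allocate_unit(units)["id"])  # → 1
```

Unit 0 is the oldest but is pinned (its pattern is still being prefetched), so the oldest unpinned unit is chosen instead.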
In another embodiment, if the prefetch unit 124 has detected a virtually adjacent memory block (via subsequent steps 1904 to 1926) and has completed the prefetches from the virtually adjacent memory block, the prefetch unit 124 may selectively reuse the hardware unit 332 of the virtually adjacent memory block rather than allocating a new hardware unit 332. In this embodiment, the prefetch unit 124 selectively initializes the various storage elements of the reused hardware unit 332 (such as the direction register 342, pattern register 344 and pattern location register 348) so as to retain the useful information stored within them. Flow proceeds to step 1904.
In step 1904, the prefetch unit 124 compares the hashed virtual address (HVAAR) generated at step 1902 with the minus-1 hashed virtual address 1602 (HVAM1) and plus-1 hashed virtual address 1606 (HVAP1) of each entry of the virtual hash table 162. Through the operations of steps 1904 to 1922, the prefetch unit 124 determines whether any active memory block is virtually adjacent to the new memory block; through the operations of steps 1924 to 1928, the prefetch unit 124 predicts whether memory accesses will continue from the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction, in order to reduce the warm-up time of the prefetch unit 124 so that it can begin prefetching the new memory block sooner. Flow proceeds to step 1906.
In step 1906, the prefetch unit 124 determines, from the comparison performed at step 1904, whether the hashed virtual address (HVAAR) matches the plus-1 hashed virtual address (HVAP1) 1606 of any entry of the virtual hash table 162. If the hashed virtual address (HVAAR) matches an entry of the virtual hash table 162, flow proceeds to step 1908; otherwise, flow proceeds to step 1912.
In step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. Flow proceeds to step 1916.
In step 1912, the prefetch unit 124 determines, from the comparison performed at step 1904, whether the hashed virtual address (HVAAR) matches the minus-1 hashed virtual address (HVAM1) 1602 of any entry of the virtual hash table 162. If the hashed virtual address (HVAAR) matches an entry of the virtual hash table 162, flow proceeds to step 1914; otherwise, flow ends.
In step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. Flow proceeds to step 1916.
In step 1916, the prefetch unit 124 sets a candidate_hva register (not shown) to the value of the unmodified hashed virtual address 1604 (HVAUN) of the virtual hash table 162 entry determined at step 1906 or 1912. Flow proceeds to step 1918.
In step 1918, the prefetch unit 124 compares the candidate_hva with the hashed virtual address of memory block (HVAMB) field 354 of each active hardware unit 332. Flow proceeds to step 1922.
In step 1922, the prefetch unit 124 determines, from the comparison performed at step 1918, whether the candidate_hva matches the hashed virtual address of memory block (HVAMB) field 354 of any memory block. If the candidate_hva matches an HVAMB field 354, flow proceeds to step 1924; otherwise, flow ends.
In step 1924, the prefetch unit 124 has determined that the matching active memory block found at step 1922 is indeed virtually adjacent to the new memory block. The prefetch unit 124 therefore compares the candidate direction (set at step 1908 or 1914) with the direction register 342 of the matching active memory block in order to predict, according to the previously detected access pattern and direction, whether the memory accesses will continue from the virtually adjacent active memory block into the new memory block. Specifically, if the candidate direction differs from the direction register 342 of the virtually adjacent memory block, the memory accesses are unlikely to continue from the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction. Flow proceeds to step 1926.
In step 1926, the prefetch unit 124 determines, from the comparison performed at step 1924, whether the candidate direction matches the direction register 342 of the matching active memory block. If the candidate direction matches the direction register 342 of the matching active memory block, flow proceeds to step 1928; otherwise, flow ends.
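Steps 1904 through 1926 amount to the following search, assuming the virtual hash table entries are (HVAM1, HVAUN, HVAP1) tuples and each active hardware unit exposes its HVAMB and direction; all names here are illustrative:

```python
UP, DOWN = +1, -1

def find_adjacent_block(hva_ar, vht, active_units):
    # Steps 1904-1914: does the new block's hash appear in a neighbor slot?
    for hvam1, hvaun, hvap1 in vht:
        if hva_ar == hvap1:
            candidate_dir = UP     # new block lies just above this one
        elif hva_ar == hvam1:
            candidate_dir = DOWN   # new block lies just below this one
        else:
            continue
        candidate_hva = hvaun      # step 1916
        # Steps 1918-1926: find an active unit covering that block whose
        # detected direction matches the candidate direction.
        for unit in active_units:
            if unit["hvamb"] == candidate_hva and unit["direction"] == candidate_dir:
                return unit, candidate_dir
    return None, 0

vht = [(15, 16, 17)]                     # hashes of blocks A-1, A, A+1
units = [{"id": 0, "hvamb": 16, "direction": UP}]
unit, d = find_adjacent_block(17, vht, units)
print(unit["id"], d)  # → 0 1
```

An allocation request hashing to 17 matches the HVAP1 slot of block A's entry, so block A is virtually adjacent below the new block and its upward direction agrees with the candidate direction.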
In step 1928, the prefetch unit 124 determines whether the new allocation request received at step 1902 is directed to a cache line that would have been predicted by the pattern register 344 of the matching virtually adjacent active memory block detected at step 1926. In one embodiment, in order to perform the determination of step 1928, the prefetch unit 124 effectively shifts a copy of the pattern register 344 of the matching virtually adjacent active memory block according to its pattern period register 346, continuing the pattern beyond the pattern location register 348 of the virtually adjacent memory block, so as to maintain the continuity of the pattern into the new memory block. If the new allocation request is for a cache line predicted by the pattern register 344 of the matching active memory block, flow proceeds to step 1934; otherwise, flow proceeds to step 1932.
In step 1932, the prefetch unit 124 initializes and populates the new hardware unit 332 (allocated at step 1903) according to steps 406 and 408 of Figure 4, in the expectation that a new pattern of accesses to the new memory block will eventually be detected according to the methods described above in connection with Figures 4 through 6; this will require a warm-up time. Flow ends at step 1932.
In step 1934, the prefetch unit 124 predicts that the accesses will continue into the new memory block according to the pattern register 344 and direction register 342 of the matching virtually adjacent active memory block. The prefetch unit 124 therefore populates the new hardware unit 332 in a manner similar to step 1932, but with several differences. Specifically, the prefetch unit 124 populates the direction register 342, pattern register 344 and pattern period register 346 with the corresponding values from the hardware unit 332 of the virtually adjacent memory block. In addition, the new value of the pattern location register 348 is determined by repeatedly advancing it by the value of the pattern period register 346 until it crosses into the new memory block, so that the pattern register 344 is continued into the new memory block, as described in connection with step 1928. Furthermore, the status field 356 of the new hardware unit 332 is marked probationary. Finally, the search pointer 352 is initialized to begin searching at the beginning of the memory block. Flow proceeds to step 1936.
In step 1936, the prefetch unit 124 continues to monitor access requests to the new memory block. If the prefetch unit 124 detects that at least a predetermined number of subsequent access requests to the memory block are for cache lines predicted by the pattern register 344, the prefetch unit 124 promotes the status field 356 of the hardware unit 332 from probationary to active, and then begins prefetching from the new memory block as described in connection with Figure 6. In one embodiment, the predetermined number of access requests is two, although other embodiments contemplate other predetermined numbers. Flow ends at step 1936.
Figure 20 shows a hashed physical address-to-hashed virtual address thesaurus 2002 employed in the prefetch unit 124 of Figure 15. The thesaurus 2002 includes an array of entries, each including a physical address (PA) 2004 and a corresponding hashed virtual address (HVA) 2006. The corresponding hashed virtual address 2006 is the result of hashing the virtual address to which the physical address 2004 is translated. The prefetch unit 124 populates the thesaurus 2002 with recent pairs by eavesdropping on the pipeline of the load/store unit 134. In another embodiment, at step 1902 of Figure 19, the first-level data cache 116 does not provide the hashed virtual address (HVAAR) but instead provides only the physical address associated with the allocation request to the prefetch unit 124. The prefetch unit 124 looks up the provided address in the hashed physical address-to-hashed virtual address thesaurus 2002 to find a matching physical address (PA) 2004 and obtain the associated hashed virtual address (HVA) 2006, which then serves as the hashed virtual address (HVAAR) in the other portions of Figure 19. Including the thesaurus 2002 in the prefetch unit 124 relieves the first-level data cache 116 of the need to provide the hashed virtual address of the allocation request, and thus simplifies the interface between the first-level data cache 116 and the prefetch unit 124.
In one embodiment, each entry in the hashed physical address-to-hashed virtual address thesaurus 2002 includes a hashed physical address rather than the physical address 2004, and the prefetch unit 124 hashes the allocation request physical address received from the first-level data cache 116 into a hashed physical address with which to look up the thesaurus 2002 in order to obtain the appropriate corresponding hashed virtual address (HVA) 2006. This embodiment allows a smaller thesaurus 2002, but requires additional time to hash the physical address.
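A software model of the thesaurus 2002, including the snooping fill from the load/store pipeline and the lookup used when only a physical address accompanies the allocation request, might look as follows; the capacity, class and method names are assumptions:

```python
class HvaThesaurus:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.entries = []            # (PA 2004, HVA 2006) pairs, oldest first

    def snoop(self, pa, hva):
        # Fill from the load/store pipeline: keep only the most recent
        # translation observed for each physical address.
        self.entries = [(p, h) for p, h in self.entries if p != pa]
        self.entries.append((pa, hva))
        if len(self.entries) > self.capacity:
            self.entries.pop(0)      # drop the oldest pair

    def lookup(self, pa):
        # Alternative to step 1902: recover HVAAR from the PA alone.
        for p, h in reversed(self.entries):
            if p == pa:
                return h
        return None                  # miss: no hashed VA available

t = HvaThesaurus(capacity=2)
t.snoop(0x1000, 0b010101)
t.snoop(0x2000, 0b111000)
print(t.lookup(0x1000))  # → 21
```

The smaller-thesaurus variant described above would simply store a hash of `pa` as the key instead of `pa` itself, trading lookup storage for the extra hashing delay.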
Figure 21 shows a multi-core microprocessor 100 according to an embodiment of the present invention. The multi-core microprocessor 100 includes two cores (denoted core A 2102A and core B 2102B), referred to collectively as cores 2102 (or individually as a core 2102). Each core has elements similar to the single-core microprocessor 100 of Figure 2 or Figure 15. In addition, each core 2102 has a highly reactive prefetch unit 2104, described above. The two cores 2102 share the second-level cache 118 and the prefetch unit 124. Specifically, the first-level data cache 116, load/store unit 134 and highly reactive prefetch unit 2104 of each core 2102 are coupled to the shared second-level cache 118 and prefetch unit 124. In addition, a shared highly reactive prefetch unit 2106 is coupled to the second-level cache 118 and the prefetch unit 124. In one embodiment, the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 prefetch only the next sequential cache line after the cache line associated with a memory access.
In addition to monitoring the memory accesses of the load/store units 134 and first-level data caches 116, the prefetch unit 124 may also monitor the memory accesses generated by the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 in making its prefetch decisions. The prefetch unit 124 may monitor memory accesses from various combinations of memory access sources in order to perform the various functions described herein. For example, the prefetch unit 124 may monitor a first combination of memory accesses to perform the functions described in connection with Figures 2 through 11, a second combination of memory accesses to perform the functions described in connection with Figures 12 through 14, and a third combination of memory accesses to perform the functions described in connection with Figures 15 through 19. In one embodiment, the shared prefetch unit 124 cannot readily monitor the activity of the load/store unit 134 of each core 2102 due to timing considerations. The shared prefetch unit 124 therefore monitors the load/store unit 134 activity indirectly via the traffic generated by the first-level data caches 116 as a result of load/store misses.
Various embodiments of the present invention are described herein, but those skilled in the art should understand that these embodiments serve only as examples and are not limiting. Those skilled in the art may make various changes in form and detail without departing from the spirit of the invention. For example, software can enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods of the present invention, accomplished through general programming languages (C, C++), hardware description languages (HDL, including Verilog HDL, VHDL, etc.) or other available programming languages. Such software can be stored on any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk or optical disc (e.g., CD-ROM, DVD-ROM, etc.), or transmitted via the Internet or by wired, wireless or other communication media. The apparatus and method embodiments of the present invention may be included in a semiconductor intellectual property core, such as a microprocessor core (realized in HDL), and converted into hardware in the production of integrated circuits. In addition, the apparatus and methods of the present invention may be realized by a combination of hardware and software. Therefore, the present invention should not be limited to the disclosed embodiments, but is defined by the appended claims and their equivalents. In particular, the present invention may be implemented in a microprocessor device used in a general-purpose computer. Finally, although the present invention is disclosed above by way of preferred embodiments, they are not intended to limit the scope of the present invention; those skilled in the art may make various changes and refinements without departing from the spirit and scope of the present invention, and the scope of protection of the present invention is therefore defined by the appended claims.

Claims (30)

1. A microprocessor, comprising:
a first-level cache memory;
a second-level cache memory; and
a prefetch unit, configured to:
maintain a maximum address and a minimum address, wherein the maximum address specifies the highest address, within a memory block, of the access requests appearing in the second-level cache memory, and the minimum address specifies the lowest address, within the memory block, of the access requests;
maintain a count of changes to the maximum address and a count of changes to the minimum address, wherein control logic updates the count of changes to the maximum address when the control logic changes the maximum address, and the control logic updates the count of changes to the minimum address when the control logic changes the minimum address;
detect a direction and a pattern of the access requests appearing in the second-level cache memory and, according to the direction and pattern, prefetch a plurality of cache lines into the second-level cache memory, wherein the direction is detected according to the counts;
receive, from the first-level cache memory, an address of an access request received by the first-level cache memory, wherein the address is associated with a cache line;
determine one or more cache lines indicated by the pattern beyond the associated cache line in the direction; and
cause the one or more cache lines to be prefetched into the first-level cache memory.
2. The microprocessor of claim 1, wherein:
the memory block is a small subset of the memory range accessible by the microprocessor;
in order to determine the one or more cache lines indicated by the pattern beyond the associated cache line in the direction, the prefetch unit is configured to:
locate the pattern within the memory block such that the address falls within the pattern; and
search in the direction, beginning at the address, until a cache line indicated by the pattern is encountered.
3. The microprocessor of claim 2, wherein:
the pattern comprises a sequence of cache lines;
wherein, in order to locate the pattern within the memory block such that the address falls within the pattern, the prefetch unit is configured to shift the pattern by the sequence within the memory block.
4. The microprocessor of claim 2, wherein the addresses of the access requests of the memory block appearing in the second-level cache memory increase and decrease non-monotonically as a function of time.
5. The microprocessor of claim 4, wherein the addresses of the access requests of the memory block appearing in the second-level cache memory may be non-consecutive.
6. microprocessor as described in claim 1, further includes:
Multiple cores;Wherein
Above-mentioned second level memory cache and pre-fetch unit are shared by above-mentioned core;And
Each above-mentioned core includes a different example of above-mentioned first order memory cache.
7. microprocessor as described in claim 1, wherein in order to cause said one or multiple cache lines to be prefetched to above-mentioned In first order memory cache, above-mentioned pre-fetch unit is providing the address of said one or multiple cache lines to the above-mentioned first order Memory cache, wherein above-mentioned first order memory cache to required from the above-mentioned second level memory cache said one or Multiple cache lines.
8. microprocessor as claimed in claim 7, wherein above-mentioned first order memory cache includes a queue, to store from The address above mentioned that above-mentioned pre-fetch unit is received.
9. microprocessor as described in claim 1, wherein in order to cause said one or multiple cache lines to be prefetched to above-mentioned In first order memory cache, above-mentioned pre-fetch unit is the Bus Interface Unit requirement one or more from above-mentioned microprocessor Cache line, and then by provide it is above-mentioned it is required to cache line be provided to above-mentioned first order memory cache.
10. The microprocessor as described in claim 1, wherein, to cause the one or more cache lines to be prefetched into the first-level cache memory, the prefetch unit requests the one or more cache lines from the second-level cache memory.
11. The microprocessor as claimed in claim 10, wherein the prefetch unit then provides the requested cache lines to the first-level cache memory.
12. The microprocessor as claimed in claim 10, wherein the second-level cache memory then provides the requested cache lines to the first-level cache memory.
13. The microprocessor as described in claim 1, wherein, to detect the pattern, the prefetch unit:
maintains, as the access requests occur, a history of the cache lines accessed by the access requests to the memory block; and
detects the pattern according to the history.
14. The microprocessor as claimed in claim 13, wherein the step of detecting the direction according to the counts includes:
detecting that the direction is upward when the difference between the count of changes to the maximum address and the count of changes to the minimum address exceeds a predetermined value; and
detecting that the direction is downward when the difference between the count of changes to the minimum address and the count of changes to the maximum address exceeds the predetermined value.
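As an illustration only, the direction test recited in claims 13 and 14 might look like the following sketch. The threshold value and all function and parameter names are assumptions for illustration, not anything specified by the claims:

```python
THRESHOLD = 2  # stand-in for the claims' "predetermined value" (assumed)

def detect_direction(max_change_count, min_change_count, threshold=THRESHOLD):
    """Return 'up', 'down', or None based on which boundary of the accessed
    region has moved more often: a stream marching upward repeatedly raises
    the maximum address, a downward stream repeatedly lowers the minimum."""
    if max_change_count - min_change_count > threshold:
        return "up"
    if min_change_count - max_change_count > threshold:
        return "down"
    return None  # no clear trend yet
```

A stream that has raised the maximum address five times while the minimum stayed put would thus be classified as upward.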
15. The microprocessor as claimed in claim 13, wherein:
the history includes a bitmask indicating the cache lines accessed by the access requests to the memory block; and
the prefetch unit further performs the following steps:
computing a middle indicator of the accessed cache lines within the bitmask; and
for each of a plurality of distinct bit periods, incrementing a match counter associated with the bit period when the N bits of the bitmask to the left of the middle indicator match the N bits of the bitmask to the right of the middle indicator, where N is the number of bits of the bit period.
16. The microprocessor as claimed in claim 15, wherein the step of detecting the access pattern according to the bitmask includes:
detecting whether the difference between the match counter associated with one of the bit periods and the match counters associated with the other bit periods exceeds a predetermined value; and
detecting the access pattern specified by the N bits of the bitmask on one side of the middle indicator, where N is the number of bits of the one bit period whose associated match counter exceeds the match counters associated with the other bit periods by more than the predetermined value.
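The period-matching test of claims 15 and 16 can be illustrated with a sketch that examines a single snapshot of the bitmask. In the claimed hardware the match counters accumulate across many accesses; here one comparison per candidate period is shown, and every name is an assumption:

```python
def period_match_counts(bits, middle, periods=(1, 2, 3, 4, 5)):
    """For each candidate bit period N, record whether the N bits of the
    bitmask to the left of the middle indicator equal the N bits to its
    right (the match test of claim 15). `bits` is a list of 0/1 flags,
    one per cache line of the memory block."""
    counts = {}
    for n in periods:
        if middle - n < 0 or middle + n > len(bits):
            counts[n] = 0  # not enough accessed history on one side
            continue
        counts[n] = int(bits[middle - n:middle] == bits[middle:middle + n])
    return counts
```

For an every-other-line access mask such as `[1, 0, 1, 0, 1, 0, 1, 0]` with the middle indicator at bit 4, the even periods match and the odd periods do not, which is the signal claim 16 uses to pick the period that specifies the access pattern.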
17. A data prefetching method for prefetching data into a first-level cache memory of a microprocessor having a second-level cache memory, the data prefetching method comprising:
maintaining a maximum address and a minimum address, wherein the maximum address specifies the highest address, within a memory block, of the access requests appearing in the second-level cache memory, and the minimum address specifies the lowest address, within the memory block, of the access requests;
maintaining a count of changes to the maximum address and a count of changes to the minimum address, wherein control logic updates the count of changes to the maximum address when it changes the maximum address, and updates the count of changes to the minimum address when it changes the minimum address;
detecting a direction and a pattern of the access requests appearing in the second-level cache memory, wherein the direction is detected according to the counts, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and pattern;
receiving, from the first-level cache memory, an address of an access request received by the first-level cache memory, wherein the address relates to a cache line;
determining one or more cache lines indicated by the pattern beyond the related cache line in the direction; and
causing the one or more cache lines to be prefetched into the first-level cache memory.
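The bounding-box bookkeeping recited in claim 17 (a maximum and minimum address per memory block, each with a change counter) can be sketched as follows. The class and attribute names are illustrative assumptions, not taken from the patent:

```python
class BoundingBoxTracker:
    """Tracks the highest and lowest accessed addresses within one memory
    block, and how often each boundary has moved. The two change counters
    are the inputs to the direction detection of claim 17."""

    def __init__(self):
        self.max_addr = None
        self.min_addr = None
        self.max_changes = 0
        self.min_changes = 0

    def access(self, addr):
        """Record one access request's address within the block."""
        if self.max_addr is None:  # first access initializes both bounds
            self.max_addr = self.min_addr = addr
            return
        if addr > self.max_addr:
            self.max_addr = addr
            self.max_changes += 1
        if addr < self.min_addr:
            self.min_addr = addr
            self.min_changes += 1
```

Because only boundary-moving accesses bump a counter, the counts keep a usable trend even when the addresses arrive out of order and non-monotonically, which is exactly the situation claims 4, 20, and 29 contemplate.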
18. The data prefetching method as claimed in claim 17, wherein:
the memory block is a small subset of the memory range accessible by the microprocessor; and
the step of determining the one or more cache lines indicated by the pattern beyond the related cache line in the direction includes:
placing the pattern onto the memory block such that the address is located within the pattern; and
searching from the address along the direction until a cache line indicated by the pattern is encountered.
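The placement-and-search step of claim 18 might be sketched as follows, representing the pattern as a periodic list of per-cache-line flags replicated across the memory block; all parameter names and defaults are assumptions:

```python
def next_prefetch_lines(pattern, line_index, direction, block_lines=64, count=2):
    """Walk from the triggering cache line in `direction` ('up' or 'down')
    until `count` pattern-indicated lines are found within the block.
    `pattern` is one period of 0/1 flags, conceptually tiled across the
    block so that the triggering line falls inside it (claim 18)."""
    step = 1 if direction == "up" else -1
    found = []
    i = line_index + step
    while 0 <= i < block_lines and len(found) < count:
        if pattern[i % len(pattern)]:  # line indicated by the placed pattern
            found.append(i)
        i += step
    return found
```

With an every-other-line pattern `[1, 0]` and a trigger at line 10, an upward search would select lines 12 and 14, and a downward search lines 8 and 6.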
19. The data prefetching method as claimed in claim 18, wherein the pattern includes a sequence of cache lines, and the step of placing the pattern onto the memory block such that the address is located within the pattern includes shifting the pattern onto the memory block according to the sequence.
20. The data prefetching method as claimed in claim 18, wherein the addresses of the access requests to the memory block appearing in the second-level cache memory increase and decrease non-monotonically as a function of time.
21. The data prefetching method as claimed in claim 20, wherein the addresses of the access requests to the memory block appearing in the second-level cache memory may be non-contiguous.
22. The data prefetching method as claimed in claim 17, wherein the microprocessor further includes a plurality of cores, the second-level cache memory and the prefetch unit are shared by the cores, and each of the cores includes a distinct instance of the first-level cache memory.
23. The data prefetching method as claimed in claim 17, wherein the step of causing the one or more cache lines to be prefetched into the first-level cache memory includes a prefetch unit of the microprocessor providing the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory requests the one or more cache lines from the second-level cache memory.
24. The data prefetching method as claimed in claim 17, wherein the step of causing the one or more cache lines to be prefetched into the first-level cache memory includes a prefetch unit of the microprocessor providing the addresses of the one or more cache lines to the first-level cache memory, wherein the first-level cache memory requests the one or more cache lines from a bus interface unit of the microprocessor, and the requested one or more cache lines are then provided to the first-level cache memory.
25. The data prefetching method as claimed in claim 17, wherein the step of causing the one or more cache lines to be prefetched into the first-level cache memory includes a prefetch unit requesting the one or more cache lines from the second-level cache memory.
26. The data prefetching method as claimed in claim 25, wherein the step of causing the one or more cache lines to be prefetched into the first-level cache memory includes the prefetch unit then providing the requested one or more cache lines to the first-level cache memory.
27. The data prefetching method as claimed in claim 25, further including the second-level cache memory then providing the requested one or more cache lines to the first-level cache memory.
28. A computer-readable medium having computer-readable program code embedded therein, wherein the data prefetching method as claimed in claim 17 is performed when the computer-readable program code is executed.
29. A microprocessor, comprising:
a first-level cache memory;
a second-level cache memory; and
a prefetch unit, configured to:
detect a pattern of the second-level cache memory and a direction of the recent access requests appearing in the second-level cache memory, wherein the direction relates to changes in the addresses of the recent access requests, and prefetch a plurality of cache lines into the second-level cache memory according to the direction and pattern;
receive, from the first-level cache memory, an address of an access request received by the first-level cache memory, wherein the address relates to a cache line;
determine, according to the detected pattern, one or more cache lines beyond the related cache line in the direction; and
cause the one or more cache lines to be prefetched into the first-level cache memory,
wherein the addresses of the recent access requests appearing in the second-level cache memory increase and decrease non-monotonically as a function of time.
30. A data prefetching method for prefetching data into a first-level cache memory of a microprocessor having a second-level cache memory, the data prefetching method comprising:
detecting a pattern of the second-level cache memory and a direction of the recent access requests appearing in the second-level cache memory, wherein the direction relates to changes in the addresses of the recent access requests, and prefetching a plurality of cache lines into the second-level cache memory according to the direction and pattern;
receiving, from the first-level cache memory, an address of an access request received by the first-level cache memory, wherein the address relates to a cache line;
determining, according to the detected pattern, one or more cache lines beyond the related cache line in the direction; and
causing the one or more cache lines to be prefetched into the first-level cache memory,
wherein the addresses of the recent access requests appearing in the second-level cache memory increase and decrease as a function of time, but not monotonically.
CN201510101351.6A 2010-03-29 2011-03-29 Data prefetching method and microprocessor Active CN104615548B (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US31859410P 2010-03-29 2010-03-29
US61/318,594 2010-03-29
US13/033,809 2011-02-24
US13/033,765 US8762649B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher
US13/033,809 US8645631B2 (en) 2010-03-29 2011-02-24 Combined L2 cache and L1D cache prefetcher
US13/033,765 2011-02-24
US13/033,848 US8719510B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher with reduced warm-up penalty on memory block crossings
US13/033,848 2011-02-24
CN201110077108.7A CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201110077108.7A Division CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor

Publications (2)

Publication Number Publication Date
CN104615548A CN104615548A (en) 2015-05-13
CN104615548B true CN104615548B (en) 2018-08-31

Family

ID=44490596

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201110077108.7A Active CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor
CN201510101303.7A Active CN104636274B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor
CN201510494634.1A Active CN105183663B (en) 2010-03-29 2011-03-29 Pre-fetch unit and data prefetching method
CN201510101351.6A Active CN104615548B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor

Family Applications Before (3)

Application Number Title Priority Date Filing Date
CN201110077108.7A Active CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor
CN201510101303.7A Active CN104636274B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor
CN201510494634.1A Active CN105183663B (en) 2010-03-29 2011-03-29 Pre-fetch unit and data prefetching method

Country Status (2)

Country Link
CN (4) CN102169429B (en)
TW (5) TWI534621B (en)

Families Citing this family (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8959320B2 (en) * 2011-12-07 2015-02-17 Apple Inc. Preventing update training of first predictor with mismatching second predictor for branch instructions with alternating pattern hysteresis
US9442759B2 (en) * 2011-12-09 2016-09-13 Nvidia Corporation Concurrent execution of independent streams in multi-channel time slice groups
US9772845B2 (en) 2011-12-13 2017-09-26 Intel Corporation Method and apparatus to process KECCAK secure hashing algorithm
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US20140189310A1 (en) 2012-12-27 2014-07-03 Nvidia Corporation Fault detection in instruction translations
CN104133780B (en) 2013-05-02 2017-04-05 华为技术有限公司 A kind of cross-page forecasting method, apparatus and system
US9891916B2 (en) * 2014-10-20 2018-02-13 Via Technologies, Inc. Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent system
CN105653199B (en) * 2014-11-14 2018-12-14 群联电子股份有限公司 Method for reading data, memory storage apparatus and memorizer control circuit unit
EP3049915B1 (en) * 2014-12-14 2020-02-12 VIA Alliance Semiconductor Co., Ltd. Prefetching with level of aggressiveness based on effectiveness by memory access type
US10152421B2 (en) * 2015-11-23 2018-12-11 Intel Corporation Instruction and logic for cache control operations
CN106919367B (en) * 2016-04-20 2019-05-07 上海兆芯集成电路有限公司 Detect the processor and method of modification program code
US10579522B2 (en) * 2016-09-13 2020-03-03 Andes Technology Corporation Method and device for accessing a cache memory
US10353601B2 (en) * 2016-11-28 2019-07-16 Arm Limited Data movement engine
US10452288B2 (en) 2017-01-19 2019-10-22 International Business Machines Corporation Identifying processor attributes based on detecting a guarded storage event
US10496311B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Run-time instrumentation of guarded storage event processing
US10496292B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Saving/restoring guarded storage controls in a virtualized environment
US10725685B2 (en) * 2017-01-19 2020-07-28 International Business Machines Corporation Load logical and shift guarded instruction
US10732858B2 (en) 2017-01-19 2020-08-04 International Business Machines Corporation Loading and storing controls regulating the operation of a guarded storage facility
US10579377B2 (en) 2017-01-19 2020-03-03 International Business Machines Corporation Guarded storage event handling during transactional execution
CN109857786B (en) * 2018-12-19 2020-10-30 成都四方伟业软件股份有限公司 Page data filling method and device
CN111797052B (en) * 2020-07-01 2023-11-21 上海兆芯集成电路股份有限公司 System single chip and system memory acceleration access method
KR102253362B1 (en) * 2020-09-22 2021-05-20 쿠팡 주식회사 Electronic apparatus and information providing method using the same
CN112416437B (en) * 2020-12-02 2023-04-21 海光信息技术股份有限公司 Information processing method, information processing device and electronic equipment
WO2022233391A1 (en) * 2021-05-04 2022-11-10 Huawei Technologies Co., Ltd. Smart data placement on hierarchical storage
CN114116529B (en) * 2021-12-01 2024-08-20 上海兆芯集成电路股份有限公司 Quick loading device and data cache method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101013401A (en) * 2006-02-03 2007-08-08 国际商业机器公司 Method and processor for prefetching instruction lines
CN101180611A (en) * 2005-05-24 2008-05-14 德克萨斯仪器股份有限公司 Configurable cache system depending on instruction type
WO2008155815A1 (en) * 2007-06-19 2008-12-24 Fujitsu Limited Information processor and cache control method
CN101398787A (en) * 2007-09-28 2009-04-01 英特尔公司 Address translation caching and i/o cache performance improvement in virtualized environments

Family Cites Families (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5003471A (en) * 1988-09-01 1991-03-26 Gibson Glenn A Windowed programmable data transferring apparatus which uses a selective number of address offset registers and synchronizes memory access to buffer
SE515718C2 (en) * 1994-10-17 2001-10-01 Ericsson Telefon Ab L M Systems and methods for processing memory data and communication systems
US6484239B1 (en) * 1997-12-29 2002-11-19 Intel Corporation Prefetch queue
US6810466B2 (en) * 2001-10-23 2004-10-26 Ip-First, Llc Microprocessor and method for performing selective prefetch based on bus activity level
JP4067887B2 (en) * 2002-06-28 2008-03-26 富士通株式会社 Arithmetic processing device for performing prefetch, information processing device and control method thereof
US7310722B2 (en) * 2003-12-18 2007-12-18 Nvidia Corporation Across-thread out of order instruction dispatch in a multithreaded graphics processor
US8103832B2 (en) * 2007-06-26 2012-01-24 International Business Machines Corporation Method and apparatus of prefetching streams of varying prefetch depth
CN100449481C (en) * 2007-06-29 2009-01-07 东南大学 Storage control circuit with multiple-passage instruction pre-fetching function
US7890702B2 (en) * 2007-11-26 2011-02-15 Advanced Micro Devices, Inc. Prefetch instruction extensions
US8140768B2 (en) * 2008-02-01 2012-03-20 International Business Machines Corporation Jump starting prefetch streams across page boundaries
JP2009230374A (en) * 2008-03-21 2009-10-08 Fujitsu Ltd Information processor, program, and instruction sequence generation method
US7958317B2 (en) * 2008-08-04 2011-06-07 International Business Machines Corporation Cache directed sequential prefetch
US8402279B2 (en) * 2008-09-09 2013-03-19 Via Technologies, Inc. Apparatus and method for updating set of limited access model specific registers in a microprocessor
US9032151B2 (en) * 2008-09-15 2015-05-12 Microsoft Technology Licensing, Llc Method and system for ensuring reliability of cache data and metadata subsequent to a reboot
CN101887360A (en) * 2009-07-10 2010-11-17 威盛电子股份有限公司 Data prefetcher of microprocessor and method
CN101667159B (en) * 2009-09-15 2012-06-27 威盛电子股份有限公司 High speed cache system and method of trb

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101180611A (en) * 2005-05-24 2008-05-14 德克萨斯仪器股份有限公司 Configurable cache system depending on instruction type
CN101013401A (en) * 2006-02-03 2007-08-08 国际商业机器公司 Method and processor for prefetching instruction lines
WO2008155815A1 (en) * 2007-06-19 2008-12-24 Fujitsu Limited Information processor and cache control method
CN101398787A (en) * 2007-09-28 2009-04-01 英特尔公司 Address translation caching and i/o cache performance improvement in virtualized environments

Also Published As

Publication number Publication date
TWI506434B (en) 2015-11-01
CN102169429A (en) 2011-08-31
CN104636274B (en) 2018-01-26
CN105183663B (en) 2018-11-27
TW201624289A (en) 2016-07-01
CN102169429B (en) 2016-06-29
TW201535119A (en) 2015-09-16
TW201535118A (en) 2015-09-16
CN104636274A (en) 2015-05-20
TW201447581A (en) 2014-12-16
TWI574155B (en) 2017-03-11
TW201135460A (en) 2011-10-16
TWI547803B (en) 2016-09-01
CN105183663A (en) 2015-12-23
TWI519955B (en) 2016-02-01
CN104615548A (en) 2015-05-13
TWI534621B (en) 2016-05-21

Similar Documents

Publication Publication Date Title
CN104615548B (en) Data prefetching method and microprocessor
CN105701030B (en) It is selected according to the dynamic caching replacement path of label bit
CN105701031B (en) The operating method of processor and its cache memory and cache memory
CN105701033B (en) The cache memory dynamically configurable depending on mode
CN1632877B (en) Variable latency stack cache and method for providing data
US7707397B2 (en) Variable group associativity branch target address cache delivering multiple target addresses per cache line
US7958317B2 (en) Cache directed sequential prefetch
CN105701022B (en) Set associative cache
CN100517274C (en) Cache memory and control method thereof
CN105700856B (en) According to the benefit of memory body access type and cooperate prefetching for positive level
US8677049B2 (en) Region prefetcher and methods thereof
CN114579479A (en) Low-pollution cache prefetching system and method based on instruction flow mixed mode learning
US9164900B1 (en) Methods and systems for expanding preload capabilities of a memory to encompass a register file
US20230205699A1 (en) Region aware delta prefetcher
Kim et al. LPR: learning-based page replacement scheme for scientific applications
JPH05189307A (en) Computer system
US20230089349A1 (en) Computer Architecture with Register Name Addressing and Dynamic Load Size Adjustment
CN101887360A (en) Data prefetcher of microprocessor and method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant