CN105183663A - Prefetch Unit And Data Prefetch Method - Google Patents

Prefetch Unit and Data Prefetch Method

Info

Publication number: CN105183663A (also published as CN105183663B)
Application number: CN201510494634.1A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: Rodney E. Hooker (罗德尼.E.虎克), John M. Greer (约翰.M.吉尔)
Assignee (current and original): Via Technologies Inc
Priority claimed from US13/033,765 (US8762649B2), US13/033,848 (US8719510B2), and US13/033,809 (US8645631B2)
Legal status: Granted; Active
Prior art keywords: cache line, pattern, memory block, period, prefetch unit
Classification: Memory System of a Hierarchy Structure

Abstract

A data prefetcher is arranged in a microprocessor. The data prefetcher includes a plurality of period match counters, each associated with a different pattern period. The data prefetcher also includes control logic that updates the plurality of period match counters in response to accesses to a memory block by the microprocessor, determines a clear pattern period based on the plurality of period match counters, and prefetches into the microprocessor not-yet-fetched cache lines within the memory block based on a pattern having the clear pattern period so determined.

Description

Prefetch unit and data prefetching method
This application is a divisional application of the application entitled "Prefetch Unit, Data Prefetching Method and Microprocessor", filed on March 29, 2011 under application number 201110077108.7.
Technical field
The present invention relates generally to caches of microprocessors, and more particularly to prefetching data into a cache of a microprocessor.
Background
In recent computer systems, on a cache miss, the time required for a microprocessor to access system memory can be one or two orders of magnitude greater than the time required to access its cache. Therefore, to improve the cache hit rate, microprocessors incorporate prefetching techniques that examine recent data access patterns and attempt to predict which data the program will access next; the benefits of prefetching are well known.
However, the applicant has observed that the access patterns of some programs are not detected by the prefetch units of existing microprocessors. For example, Figure 1 shows the accesses to a level-2 (L2) cache over time when the executing program performs a sequence of store operations through memory, plotting the memory address of each access against time. As shown in Figure 1, although the general trend over time is toward increasing memory addresses, i.e., upward, in many cases a given access is to a memory address lower than that of an earlier access rather than following the general upward trend, with the result that existing prefetch units fail to predict these accesses.
Although over a relatively large number of samples the general trend is clearly in one direction, there are two reasons an existing prefetch unit may be confused when it considers only a small sample of accesses. The first is that the program accesses memory in that order, whether because of the characteristics of its algorithms or because of poor programming. The second is that, under normal operation, the pipelines and queues of an out-of-order execution microprocessor core often perform memory accesses in an order different from the program order in which they were generated.
Therefore, what is needed is a data prefetch unit that can effectively prefetch data for programs whose memory access instructions (operations) exhibit no clear trend when examined over small time windows, but exhibit a clear trend when examined over a larger number of samples.
Summary of the invention
The present invention discloses a prefetch unit, arranged in a microprocessor, comprising: a plurality of period match counters, each corresponding to a different pattern period; and control logic that, in response to accesses by the microprocessor to a memory block, updates the plurality of period match counters, determines a clear pattern period based on the count values of the plurality of period match counters, and prefetches the not-yet-prefetched cache lines among the plurality of cache lines in the memory block according to a pattern having the clear pattern period determined from the plurality of period match counters.
The present invention discloses a data prefetching method, comprising: by a microprocessor, in response to accesses to a memory block, updating a plurality of period match counters, wherein the plurality of period match counters each correspond to a different pattern period; determining a clear pattern period based on the count values of the plurality of period match counters; and prefetching the not-yet-prefetched cache lines among the plurality of cache lines in the memory block according to a pattern having the clear pattern period determined from the plurality of period match counters.
The present invention discloses a prefetch unit, arranged in a microprocessor having a cache, wherein the prefetch unit receives a plurality of access requests to a plurality of addresses within a memory block, the address of each access request falling within the memory block, and the addresses of the access requests increasing or decreasing non-monotonically as a function of time. The prefetch unit comprises a storage device and control logic. The control logic, coupled to the storage device, as the access requests are received: maintains in the storage device a maximum address and a minimum address of the access requests, and counts of the changes to the maximum address and the minimum address; maintains a history of the recently accessed cache lines in the memory block, the recently accessed cache lines being implicated by the addresses of the access requests; determines an access direction based on the counts; determines an access pattern based on the history; and, according to the access pattern and along the access direction, prefetches into the cache those cache lines of the memory block that the history has not yet marked as accessed.
The present invention discloses a data prefetching method for prefetching data into a cache of a microprocessor. The data prefetching method comprises: receiving a plurality of access requests to a plurality of addresses within a memory block, the address of each access request falling within the memory block, and the addresses of the access requests increasing or decreasing non-monotonically as a function of time; as the access requests are received, maintaining a maximum address and a minimum address within the memory block, and counts of the changes to the maximum and minimum addresses; as the access requests are received, maintaining a history of the recently accessed cache lines in the memory block, the recently accessed cache lines being implicated by the addresses of the access requests; determining an access direction based on the counts; determining an access pattern based on the history; and, according to the access pattern and along the access direction, prefetching into the cache those cache lines of the memory block that the history has not yet marked as accessed.
The present invention discloses a microprocessor comprising a plurality of cores, a cache, and a prefetch unit. The cache, shared by the cores, receives a plurality of access requests to a plurality of addresses within a memory block, the address of each access request falling within the memory block, and the addresses of the access requests increasing or decreasing non-monotonically as a function of time. The prefetch unit monitors the access requests; maintains a maximum address and a minimum address within the memory block, and counts of the changes to the maximum address and the minimum address; determines an access direction based on the counts; and prefetches into the cache, along the access direction, cache lines of the memory block that are missing from the cache.
The present invention discloses a microprocessor comprising a first-level cache, a second-level cache, and a prefetch unit. The prefetch unit detects a direction and a pattern of recent access requests presented to the second-level cache and prefetches a plurality of cache lines into the second-level cache according to the direction and pattern; receives from the first-level cache an address of an access request received by the first-level cache, the address implicating a cache line; determines one or more cache lines indicated by the pattern beyond the implicated cache line in the direction; and causes the one or more cache lines to be prefetched into the first-level cache.
The present invention discloses a data prefetching method for prefetching data into a first-level cache of a microprocessor that also has a second-level cache. The data prefetching method comprises: detecting a direction and a pattern of recent access requests presented to the second-level cache, and prefetching a plurality of cache lines into the second-level cache according to the direction and pattern; receiving from the first-level cache an address of an access request received by the first-level cache, the address implicating a cache line; determining one or more cache lines indicated by the pattern beyond the implicated cache line in the direction; and causing the one or more cache lines to be prefetched into the first-level cache.
The present invention discloses a microprocessor comprising a cache and a prefetch unit. The prefetch unit detects a pattern of a plurality of memory access requests within a first memory block and prefetches a plurality of cache lines into the cache from the first memory block according to the pattern; monitors a new memory access request to a second memory block; determines whether the first memory block is virtually adjacent to the second memory block and whether, when extended from the first memory block into the second memory block, the pattern predicts the cache line implicated by the new memory access request to the second memory block; and, according to the pattern, prefetches cache lines from the second memory block into the cache.
The present invention discloses a data prefetching method for prefetching data into a cache of a microprocessor. The data prefetching method comprises: detecting a pattern of a plurality of memory access requests within a first memory block, and prefetching cache lines from the first memory block into the cache according to the pattern; monitoring a new memory access request to a second memory block; determining whether the first memory block is virtually adjacent to the second memory block and whether, when extended from the first memory block into the second memory block, the pattern predicts the cache line implicated by the new memory access request to the second memory block; and, in response to so determining, prefetching a plurality of cache lines from the second memory block into the cache according to the pattern.
Brief description of the drawings
Figure 1 shows the accesses to a level-2 cache over time when the executing program performs a sequence of store operations through memory.
Fig. 2 is a block diagram of a microprocessor of the present invention.
Fig. 3 is a more detailed block diagram of the prefetch unit of Fig. 2.
Fig. 4 is a flowchart of the operation of the microprocessor of Fig. 2, and in particular of the prefetch unit of Fig. 3.
Fig. 5 is a flowchart of the operation of the prefetch unit of Fig. 3 in performing a step of Fig. 4.
Fig. 6 is a flowchart of the operation of the prefetch unit of Fig. 3 in performing another step of Fig. 4.
Fig. 7 is a flowchart of the operation of the prefetch request queue of Fig. 3.
Fig. 8A and Fig. 8B plot accesses to a memory block in two access patterns, illustrating the bounding-box prefetch unit of the present invention.
Fig. 9 is a block diagram illustrating an example operation of the microprocessor of Fig. 2.
Fig. 10 is a block diagram illustrating the example operation of the microprocessor of Fig. 2 continued from the example of Fig. 9.
Fig. 11A and Fig. 11B are block diagrams illustrating the example operation of the microprocessor of Fig. 2 continued from the examples of Figs. 9 and 10.
Fig. 12 is a block diagram of a microprocessor according to another embodiment of the present invention.
Fig. 13 is a flowchart of the operation of the prefetch unit of Fig. 12.
Fig. 14 is a flowchart of the operation of the prefetch unit of Fig. 12 in performing a step of Fig. 13.
Fig. 15 is a block diagram of a microprocessor having a bounding-box prefetch unit according to another embodiment of the present invention.
Fig. 16 is a block diagram of the virtual hash table of Fig. 15.
Fig. 17 is a flowchart of the operation of the microprocessor of Fig. 15.
Fig. 18 shows the contents of the virtual hash table of Fig. 16 after the prefetch unit operates through the example described with respect to Fig. 17.
Fig. 19A and Fig. 19B are a flowchart of the operation of the prefetch unit of Fig. 15.
Fig. 20 is a block diagram of a hashed-physical-address-to-hashed-virtual-address store used in the prefetch unit of Fig. 15 according to another embodiment of the present invention.
Fig. 21 is a block diagram of a multi-core microprocessor of the present invention.
Reference numerals
100 ~ microprocessor
102 ~ instruction cache
104 ~ instruction decoder
106 ~ register alias table (RAT)
108 ~ reservation station
112 ~ execution units
132 ~ other execution units
134 ~ load/store unit
124 ~ prefetch unit
114 ~ retire unit
116 ~ level-1 (L1) data cache
118 ~ level-2 (L2) cache
122 ~ bus interface unit
162 ~ virtual hash table
198 ~ queue
172 ~ L1 data search pointer
178 ~ L1 data pattern address
196 ~ L1 data memory address
194 ~ pattern-predicted cache line address
192 ~ cache line allocation request
188 ~ cache line data
354 ~ memory block virtual hash address field
356 ~ status field
302 ~ block bitmask register
303 ~ block number register
304 ~ min pointer register
306 ~ max pointer register
308 ~ min change counter
312 ~ max change counter
314 ~ total counter
316 ~ middle pointer register
318 ~ period match counter
342 ~ direction register
344 ~ pattern register
346 ~ pattern period register
348 ~ pattern location register
352 ~ search pointer register
332 ~ hardware unit
322 ~ control logic
328 ~ prefetch request queue
324 ~ pop pointer
326 ~ push pointer
2002 ~ hashed virtual address store
2102A ~ core A
2102B ~ core B
2104 ~ highly reactive prefetch unit
2106 ~ shared highly reactive prefetch unit
Detailed description
The making and using of various embodiments of the present invention are discussed in detail below. It should be noted, however, that the present invention provides many applicable inventive concepts that can be embodied in a wide variety of specific contexts. The specific embodiments discussed merely illustrate the making and using of the invention and do not limit its scope.
Broadly speaking, the solution to the problem described above may be explained as follows. If all accesses (instructions, operations, or requests) to a memory block are represented on a graph, the set of all accesses can be enclosed by a bounding box. If additional access requests are plotted on the same graph, the bounding box can be resized to enclose them as well. Fig. 8A shows such a graph with two accesses (instructions or operations) to a memory block. The X axis of Fig. 8A represents the time of the access; the Y axis represents the index of the accessed 64-byte cache line within the 4 KB block. The first two accesses are plotted: the first access is to cache line 5, the second to cache line 6. As shown, a bounding box encloses the two points representing the access requests.
Now suppose a third access request occurs to cache line 7; the bounding box grows so that the new point representing the third access request is enclosed by it. As new accesses continue to occur, the bounding box must expand along the X axis, and its upper edge also expands along the Y axis (this example trends upward). The history of the movements of the upper and lower edges of the bounding box is used to determine whether the trend of the access pattern is upward, downward, or neither.
In addition to tracking the trend of the upper and lower edges of the bounding box to determine a trend direction, it is also necessary to track the individual access requests, because access requests frequently skip one or two cache lines. Therefore, to avoid skipping cache lines that should be prefetched, once an upward or downward trend has been detected, the prefetch unit uses additional criteria to determine which cache lines to prefetch. Because the access requests tend to be reordered, the prefetch unit smooths out this transient reordering of the access history. It does so by marking bits in a bitmask, each of which corresponds to one cache line of a memory block; a set bit in the bitmask indicates that the corresponding cache line has been accessed. Once a sufficient number of access requests to the memory block have been made, the prefetch unit can use the bitmask, which carries no indication of the temporal order of the accesses, to make prefetching decisions for the whole block based on the broad, large view of the accesses as described below, rather than operating on the narrow, small view of individual accesses in time order, as existing prefetch units do.
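To make the bitmask bookkeeping concrete, the following is a minimal C sketch of the marking step, assuming the 4 KB memory block and 64-byte cache line sizes of the embodiment described below; the names are illustrative, not taken from the patent.

```c
#include <stdint.h>

#define BLOCK_SHIFT      12   /* 4 KB memory block  */
#define LINE_SHIFT       6    /* 64-byte cache line */
#define LINES_PER_BLOCK  64   /* one bitmask bit per cache line */

/* Set the bitmask bit for the cache line implicated by an access address. */
static uint64_t mark_access(uint64_t bitmask, uint64_t paddr)
{
    unsigned line_index =
        (unsigned)((paddr >> LINE_SHIFT) & (LINES_PER_BLOCK - 1));
    return bitmask | (1ULL << line_index);
}
```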
Fig. 2 is a block diagram of a microprocessor 100 of the present invention. Microprocessor 100 comprises a pipeline of multiple stages that includes various functional units. The pipeline comprises an instruction cache 102 coupled to an instruction decoder 104; the instruction decoder 104 is coupled to a register alias table (RAT) 106; the RAT 106 is coupled to a reservation station 108; the reservation station 108 is coupled to execution units 112; finally, the execution units 112 are coupled to a retire unit 114. The instruction decoder 104 may comprise an instruction translator that translates macroinstructions (for example, of the x86 architecture) into microinstructions of the RISC-like (reduced instruction set computer) microinstruction set of microprocessor 100. The reservation station 108 issues instructions to the execution units 112 for execution out of program order. The retire unit 114 comprises a reorder buffer that enforces retirement of instructions in program order. The execution units 112 comprise a load/store unit 134 and other execution units 132, such as integer units, floating-point units, branch units, or SIMD (Single Instruction Multiple Data) units. The load/store unit 134 reads data from, and writes data to, a level-1 (L1) data cache 116. A level-2 (L2) cache 118 backs the L1 data cache 116 and the instruction cache 102. The L2 cache 118 reads and writes system memory via a bus interface unit 122, which interfaces microprocessor 100 to a bus, such as a local bus or a memory bus. Microprocessor 100 also comprises a prefetch unit 124 that prefetches data from system memory into the L2 cache 118 and/or the L1 data cache 116.
Fig. 3 is a more detailed block diagram of the prefetch unit 124 of Fig. 2. The prefetch unit 124 comprises a block bitmask register 302. Each bit in the block bitmask register 302 corresponds to one cache line within a memory block whose block number is stored in a block number register 303; that is, the block number register 303 stores the upper address bits of the memory block. A true value of a bit in the block bitmask register 302 indicates that the corresponding cache line has been accessed. The block bitmask register 302 is initialized with all bits false. In one embodiment, the size of a memory block is 4 KB and the size of a cache line is 64 bytes; thus, the block bitmask register 302 holds 64 bits. In some embodiments, the size of a memory block may equal the size of a physical memory page. However, the size of a cache line may vary in other embodiments. Furthermore, the size of the memory region over which the block bitmask register 302 is maintained may vary and need not correspond to the size of a physical memory page. Rather, the memory region (or block) over which the block bitmask register 302 is maintained may be of any size (a power of two is best), as long as it encompasses enough cache lines to enable detection of a direction and pattern useful for prefetching.
The prefetch unit 124 also comprises a min pointer register 304 and a max pointer register 306, which maintain pointers to the lowest and highest cache line index, respectively, accessed within the memory block since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also comprises a min change counter 308 and a max change counter 312, which count the number of times the min pointer register 304 and the max pointer register 306, respectively, have changed since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also comprises a total counter 314, which counts the total number of cache lines accessed since the prefetch unit 124 began tracking accesses to the memory block. The prefetch unit 124 also comprises a middle pointer register 316, which points to the middle cache line index (i.e., the average of the min pointer register 304 and the max pointer register 306) within the memory block since the prefetch unit 124 began tracking accesses. The prefetch unit 124 also comprises a direction register 342, a pattern register 344, a pattern period register 346, a pattern location register 348, and a search pointer register 352, whose functions are described below.
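Collecting the storage elements just listed, the per-block tracking state can be modeled as a plain C structure. This is an illustrative sketch keyed to the reference numerals of Fig. 3, not a description of the actual hardware registers:

```c
#define NUM_PERIODS 3                       /* periods 3, 4 and 5 below */
static const int periods[NUM_PERIODS] = {3, 4, 5};

typedef struct {
    uint64_t block_number;                  /* 303: upper address bits      */
    uint64_t block_bitmask;                 /* 302: one bit per cache line  */
    int      min_ptr, max_ptr;              /* 304/306: lowest/highest idx  */
    int      min_change, max_change;        /* 308/312: pointer changes     */
    int      total;                         /* 314: lines accessed so far   */
    int      mid_ptr;                       /* 316: (min_ptr + max_ptr) / 2 */
    int      period_match[NUM_PERIODS];     /* 318: one counter per period  */
    int      direction;                     /* 342: +1 upward, -1 downward  */
    uint64_t pattern;                       /* 344: the winning N-bit pattern */
    int      pattern_period;                /* 346: the winning period N    */
    int      pattern_location;              /* 348: where the pattern sits  */
    int      search_ptr;                    /* 352: next index examined     */
} hw_unit_t;                                /* 332: one per active block    */
```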
The prefetch unit 124 also comprises a plurality of period match counters 318. Each period match counter 318 maintains a count for a different period. In one embodiment, the periods are 3, 4, and 5. The period is the number of bits to the left/right of the middle pointer register 316. The period match counters 318 are updated after each memory access to the block. If the block bitmask register 302 indicates that the accesses to the left of the middle pointer register 316 match, over the period, the accesses to the right of the middle pointer register 316, the prefetch unit 124 increments the period match counter 318 associated with that period. The use and operation of the period match counters 318 are described in more detail below with respect to Figs. 4 and 5.
The prefetch unit 124 also comprises a prefetch request queue 328, a pop pointer 324, and a push pointer 326. The prefetch request queue 328 comprises a circular queue of entries, each of which stores a prefetch request generated by the operation of the prefetch unit 124 (described mainly with respect to Figs. 4, 6, and 7). The push pointer 326 indicates the next entry of the prefetch request queue 328 to be allocated; the pop pointer 324 indicates the next entry to be removed from the prefetch request queue 328. In one embodiment, because prefetch requests may complete out of order, the prefetch request queue 328 is capable of popping completed entries out of order. In one embodiment, the size of the prefetch request queue 328 is chosen so that the number of its entries at least equals the number of pipeline stages of the L2 cache 118, since by the time a request is selected into the tag pipeline of the L2 cache 118, all of the entries could be in flight in that pipeline. A prefetch request remains in the queue until the L2 cache 118 pipeline completes it, at which point the request has one of three outcomes, as described in more detail with respect to Fig. 7: it hits in the L2 cache 118, it is replayed, or it pushes an entry into a fill queue to prefetch the required data from system memory.
The prefetch unit 124 also comprises control logic 322, which controls the elements of the prefetch unit 124 to perform their functions.
Although Fig. 3 shows only one set of the hardware 332 associated with one active memory block (the block bitmask register 302, block number register 303, min pointer register 304, max pointer register 306, min change counter 308, max change counter 312, total counter 314, middle pointer register 316, pattern period register 346, pattern location register 348, and search pointer register 352), the prefetch unit 124 may comprise a plurality of the hardware units 332 shown in Fig. 3 in order to track accesses to multiple active memory blocks.
In one embodiment, microprocessor 100 also comprises one or more highly reactive prefetch units (not shown) that prefetch based on very small temporal samples of accesses using different algorithms, and that operate in conjunction with the prefetch unit 124 described herein. Because the prefetch unit 124 described herein analyzes a larger number of memory accesses than a highly reactive prefetch unit does, it necessarily tends to take longer to begin prefetching from a new memory block, as described below, but is more accurate than a highly reactive prefetch unit. Thus, by operating both kinds of prefetch units, microprocessor 100 can enjoy both the faster reaction time of the highly reactive prefetch unit and the higher accuracy of the prefetch unit 124. Additionally, the prefetch unit 124 may monitor the requests from the other prefetch units and use them in its own prefetch algorithm.
Fig. 4 is a flowchart of the operation of the microprocessor 100 of Fig. 2, and in particular of the prefetch unit 124 of Fig. 3. Flow begins at step 402.
At step 402, the prefetch unit 124 receives a load/store memory access request to a memory address. In one embodiment, the prefetch unit 124 distinguishes between load and store memory access requests when deciding which cache lines to prefetch; in other embodiments, it does not distinguish loads from stores. In one embodiment, the prefetch unit 124 receives the memory access requests issued by the load/store unit 134. The prefetch unit 124 may receive memory access requests from various sources, including but not limited to: the load/store unit 134; the L1 data cache 116 (e.g., allocation requests generated by the L1 data cache 116 when a load/store unit 134 access misses in the L1 data cache 116); and/or other prefetch units (not shown) of microprocessor 100 that employ prefetch algorithms different from that of the prefetch unit 124 to prefetch data. Flow proceeds to step 404.
At step 404, the control logic 322 determines whether the memory access is to an active memory block by comparing the memory access address with the value of each block number register 303. That is, the control logic 322 determines whether a hardware unit 332 of Fig. 3 has already been allocated to the memory block implicated by the memory address specified by the memory access request. If not, flow proceeds to step 406; otherwise, flow proceeds to step 408.
At step 406, the control logic 322 allocates a hardware unit 332 of Fig. 3 to the implicated memory block. In one embodiment, the control logic 322 allocates the hardware units 332 in a round-robin fashion. In other embodiments, the control logic 322 maintains least-recently-used information for the hardware units 332 and allocates on a least-recently-used basis. Additionally, the control logic 322 initializes the allocated hardware unit 332. Specifically, the control logic 322 clears all bits of the block bitmask register 302, populates the block number register 303 with the upper bits of the memory access address, and clears to zero the min pointer register 304, max pointer register 306, min change counter 308, max change counter 312, total counter 314, and period match counters 318. Flow proceeds to step 408.
At step 408, the control logic 322 updates the hardware unit 332 based on the memory access address, as described with respect to Fig. 5. Flow proceeds to step 412.
At step 412, the control logic 322 examines the total counter 314 to determine whether the program has made enough accesses to the memory block to detect an access pattern. In one embodiment, the control logic 322 determines whether the total counter 314 count value is greater than a predetermined amount, which in one embodiment is ten, although the predetermined amount may vary; the present invention is not limited in this respect. If enough accesses have been made, flow proceeds to step 414; otherwise, flow ends.
At step 414, the control logic 322 determines whether the accesses specified in the block bitmask register 302 exhibit a clear trend. That is, the control logic 322 determines whether the accesses exhibit a clear upward trend (increasing access addresses) or downward trend (decreasing access addresses). In one embodiment, the control logic 322 decides whether a clear trend exists based on whether the difference between the min change counter 308 and the max change counter 312 is greater than a predetermined amount, which in one embodiment is two, although the predetermined amount may vary in other embodiments. If the count value of the min change counter 308 exceeds that of the max change counter 312 by the predetermined amount, a clear downward trend exists; conversely, if the count value of the max change counter 312 exceeds that of the min change counter 308 by the predetermined amount, a clear upward trend exists. If a clear trend exists, flow proceeds to step 416; otherwise, flow ends.
At step 416, the control logic 322 determines whether the accesses specified in the block bitmask register 302 exhibit a clear pattern period winner. In one embodiment, the control logic 322 decides whether a clear pattern period winner exists based on whether the difference between the count value of one of the period match counters 318 and the count values of all the other period match counters 318 is greater than a predetermined amount, which in one embodiment is two, although the predetermined amount may vary in other embodiments. The updating of the period match counters 318 is described in detail with respect to Fig. 5. If a clear pattern period winner exists, flow proceeds to step 418; otherwise, flow ends.
At step 418, the control logic 322 populates the direction register 342 to indicate the clear direction trend determined at step 414. Additionally, the control logic 322 populates the pattern period register 346 with the clear winning pattern period (N) detected at step 416. Finally, the control logic 322 populates the pattern register 344 with the clearly winning pattern detected at step 416; that is, the control logic 322 populates the pattern register 344 with the N bits of the block bitmask register 302 to the right or left of the middle pointer register 316 (which match each other, per step 518 of Fig. 5). Flow proceeds to step 422.
At step 422, the control logic 322 begins prefetching the not-yet-fetched cache lines of the memory block according to the detected direction and pattern, as described with respect to Fig. 6. Flow ends at step 422.
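Taken together, steps 412 through 422 amount to the following decision procedure, sketched here in C against the illustrative hw_unit_t above. The thresholds of ten and two are the example values given in the text, the bit alignment of the latched pattern is an assumption, and start_prefetching stands for the Fig. 6 operation sketched later.

```c
static void start_prefetching(hw_unit_t *u);     /* Fig. 6 sketch, below */

static void detect_trend_and_pattern(hw_unit_t *u)
{
    if (u->total < 10)                           /* step 412: enough accesses? */
        return;

    if (u->max_change >= u->min_change + 2)      /* step 414: clear up trend   */
        u->direction = +1;
    else if (u->min_change >= u->max_change + 2) /* clear downward trend       */
        u->direction = -1;
    else
        return;

    int winner = -1;                             /* step 416: clear period?    */
    for (int i = 0; i < NUM_PERIODS; i++) {
        int clear = 1;
        for (int j = 0; j < NUM_PERIODS; j++)
            if (j != i && u->period_match[i] < u->period_match[j] + 2)
                clear = 0;
        if (clear)
            winner = i;
    }
    if (winner < 0)
        return;

    /* Step 418: latch the period and the N bitmask bits beside the middle
     * pointer as the pattern. */
    int n = periods[winner];
    u->pattern_period = n;
    u->pattern = (u->block_bitmask >> (u->mid_ptr + 1)) & ((1ULL << n) - 1);

    start_prefetching(u);                        /* step 422 */
}
```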
Fig. 5 is a flowchart of the operation of the prefetch unit 124 of Fig. 3 in performing step 408 of Fig. 4. Flow begins at step 502.
At step 502, the control logic 322 increments the total counter 314. Flow proceeds to step 504.
At step 504, the control logic 322 determines whether the current memory access address (more precisely, the index within the memory block of the cache line implicated by the most recent memory access address) is greater than the value of the max pointer register 306. If so, flow proceeds to step 506; otherwise, flow proceeds to step 508.
At step 506, the control logic 322 updates the max pointer register 306 with the index within the memory block of the cache line implicated by the most recent memory access address, and increments the count value of the max change counter 312. Flow proceeds to step 514.
At step 508, the control logic 322 determines whether the index within the memory block of the cache line implicated by the most recent memory access address is less than the value of the min pointer register 304. If so, flow proceeds to step 512; otherwise, flow proceeds to step 514.
At step 512, the control logic 322 updates the min pointer register 304 with the index within the memory block of the cache line implicated by the most recent memory access address, and increments the count value of the min change counter 308. Flow proceeds to step 514.
At step 514, the control logic 322 computes the average of the min pointer register 304 and the max pointer register 306 and updates the middle pointer register 316 with the computed average. Flow proceeds to step 516.
At step 516, the control logic 322 examines the block bitmask register 302 and, with the middle pointer register 316 as the center, isolates the N bits to its left and the N bits to its right, where N is the number of bits associated with each respective period match counter 318. Flow proceeds to step 518.
At step 518, the control logic 322 determines whether the N bits to the left of the middle pointer register 316 match the N bits to its right. If so, flow proceeds to step 522; otherwise, flow ends.
At step 522, the control logic 322 increments the count value of the period match counter 318 associated with period N. Flow ends at step 522.
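The following C sketch captures the Fig. 5 update on the illustrative hw_unit_t. The handling of the very first tracked access and the exact placement of the compared bit windows around the middle pointer are assumptions, since the text does not pin them down.

```c
static void update_unit(hw_unit_t *u, int line_index)
{
    u->total++;                                    /* step 502 */
    u->block_bitmask |= 1ULL << line_index;

    if (u->total == 1) {                           /* first access moves both */
        u->min_ptr = u->max_ptr = line_index;
        u->min_change++;
        u->max_change++;
    } else if (line_index > u->max_ptr) {          /* steps 504-506 */
        u->max_ptr = line_index;
        u->max_change++;
    } else if (line_index < u->min_ptr) {          /* steps 508-512 */
        u->min_ptr = line_index;
        u->min_change++;
    }

    u->mid_ptr = (u->min_ptr + u->max_ptr) / 2;    /* step 514 */

    /* Steps 516-522: for each period N, compare the N bitmask bits just to
     * the left of the middle pointer with the N bits just to its right. */
    for (int i = 0; i < NUM_PERIODS; i++) {
        int n = periods[i];
        if (u->mid_ptr - n < 0 || u->mid_ptr + n >= LINES_PER_BLOCK)
            continue;                              /* window out of range */
        uint64_t left  = (u->block_bitmask >> (u->mid_ptr - n)) &
                         ((1ULL << n) - 1);
        uint64_t right = (u->block_bitmask >> (u->mid_ptr + 1)) &
                         ((1ULL << n) - 1);
        if (left == right)
            u->period_match[i]++;                  /* step 522 */
    }
}
```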
Fig. 6 is a flowchart of the operation of the prefetch unit 124 of Fig. 3 in performing step 422 of Fig. 4. Flow begins at step 602.
At step 602, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to one pattern period away from the middle pointer register 316 in the detected direction. That is, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to the value of the middle pointer register 316 plus/minus the detected period (N). For example, if the middle pointer register 316 value is 16, N is 5, and the trend indicated in the direction register 342 is upward, the control logic 322 initializes the search pointer register 352 and the pattern location register 348 to 21. Thus, in this example, the five bits of the pattern register 344 would be aligned, for comparison purposes (described below), against bits 21 through 25 of the block bitmask register 302. Flow proceeds to step 604.
At step 604, the control logic 322 examines the bit of the block bitmask register 302 at the search pointer register 352, together with the corresponding bit of the pattern register 344 (which is positioned, relative to the block bitmask register 302, at the pattern location register 348), in order to predict whether to prefetch the corresponding cache line of the memory block. Flow proceeds to step 606.
At step 606, the control logic 322 predicts whether the examined cache line is needed. The control logic 322 predicts the cache line is needed if the corresponding bit of the pattern register 344 is true; that is, the pattern predicts that the program will access the cache line. If the cache line is needed, flow proceeds to step 614; otherwise, flow proceeds to step 608.
At step 608, the control logic 322 determines whether there are any more unexamined cache lines in the memory block, based on whether the search pointer register 352 has reached the end of the block bitmask register 302 in the direction indicated by the direction register 342. If there are no more unexamined cache lines, flow ends; otherwise, flow proceeds to step 612.
At step 612, the control logic 322 increments/decrements the search pointer register 352. Additionally, if the search pointer register 352 has now passed beyond the last bit of the pattern register 344, the control logic 322 updates the pattern location register 348 with the new value of the search pointer register 352, i.e., it shifts the pattern register 344 along to the new search pointer position. Flow returns to step 604.
At step 614, the control logic 322 determines whether the needed cache line has already been fetched. The control logic 322 determines that the needed cache line has already been fetched if the corresponding bit of the block bitmask register 302 is true. If the needed cache line has already been fetched, flow proceeds to step 608; otherwise, flow proceeds to step 616.
At decision step 616, if the direction register 342 indicates downward, the control logic 322 determines whether the cache line under consideration is more than a predetermined amount (sixteen, in one embodiment) away from the min pointer register 304; if the direction register 342 indicates upward, the control logic 322 determines whether the cache line under consideration is more than the predetermined amount away from the max pointer register 306. If the cache line is more than the predetermined amount away, flow ends; otherwise, flow proceeds to decision step 618. It should be noted that ending the flow because the cache line is significantly beyond (away from) the min pointer register 304/max pointer register 306 does not mean the prefetch unit 124 will never prefetch the remaining cache lines of the memory block, since, per the steps of Fig. 4, subsequent accesses to cache lines of the memory block may trigger further prefetching.
At step 618, the control logic 322 determines whether the prefetch request queue 328 is full. If so, flow proceeds to step 622; otherwise, flow proceeds to step 624.
At step 622, the control logic 322 stalls until the prefetch request queue 328 becomes non-full. Flow proceeds to step 624.
At step 624, the control logic 322 pushes an entry into the prefetch request queue 328 to prefetch the cache line. Flow returns to step 608.
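The search loop of Fig. 6 might look like the following C sketch for the upward direction, again over the illustrative hw_unit_t; queue_push is a hypothetical helper standing in for pushing an entry into the prefetch request queue 328 (stalling while it is full, per steps 618-622), and the distance limit of sixteen is the example value from step 616.

```c
/* Hypothetical: push a prefetch request for one cache line of a block,
 * stalling while the prefetch request queue 328 is full. */
static void queue_push(uint64_t block_number, int line_index);

static void start_prefetching(hw_unit_t *u)      /* upward direction only */
{
    int n = u->pattern_period;

    /* Step 602: start one period beyond the middle pointer. */
    u->search_ptr       = u->mid_ptr + n;
    u->pattern_location = u->mid_ptr + n;

    while (u->search_ptr < LINES_PER_BLOCK) {    /* step 608 */
        /* Step 612: re-anchor the pattern once the search passes its end. */
        if (u->search_ptr >= u->pattern_location + n)
            u->pattern_location = u->search_ptr;

        int bit       = u->search_ptr - u->pattern_location;
        int predicted = (int)((u->pattern >> bit) & 1);        /* 604-606 */
        int fetched   = (int)((u->block_bitmask >> u->search_ptr) & 1); /* 614 */

        if (predicted && !fetched) {
            if (u->search_ptr - u->max_ptr > 16) /* step 616: too far ahead */
                return;
            queue_push(u->block_number, u->search_ptr);  /* steps 618-624 */
        }
        u->search_ptr++;                         /* step 612 */
    }
}
```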
Fig. 7 is a flowchart of the operation of the prefetch request queue 328 of Fig. 3. Flow begins at step 702.
At step 702, a prefetch request pushed into the prefetch request queue 328 at step 624 is granted arbitration into the L2 cache 118 (the prefetch request being a request to access the L2 cache 118) and proceeds down the L2 cache 118 pipeline. Flow proceeds to step 704.
At step 704, the L2 cache 118 determines whether the cache line address hits in the L2 cache 118. If so, flow proceeds to step 706; otherwise, flow proceeds to decision step 708.
At step 706, since the cache line is already present in the L2 cache 118, it does not need to be prefetched, and flow ends.
At step 708, the control logic 322 determines whether the response of the L2 cache 118 is that this prefetch request must be replayed. If so, flow proceeds to step 712; otherwise, flow proceeds to step 714.
At step 712, the prefetch request to prefetch the cache line is re-pushed into the prefetch request queue 328. Flow ends at step 712.
At step 714, the L2 cache 118 pushes a request into a fill queue (not shown) of microprocessor 100 to request the bus interface unit 122 to read the cache line into microprocessor 100. Flow ends at step 714.
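The three outcomes of Fig. 7 can be summarized in a short sketch; l2_lookup and fill_queue_push are invented names for the L2 tag pipeline lookup and the fill queue push, and queue_push is the hypothetical helper declared in the Fig. 6 sketch above.

```c
typedef struct {
    uint64_t block_number;
    int      line_index;
} prefetch_request_t;

enum l2_result { L2_HIT, L2_REPLAY, L2_MISS };

/* Hypothetical stand-ins for the L2 tag pipeline and the fill queue. */
enum l2_result l2_lookup(uint64_t block_number, int line_index);
void fill_queue_push(uint64_t block_number, int line_index);

static void process_prefetch_request(prefetch_request_t req)
{
    switch (l2_lookup(req.block_number, req.line_index)) {
    case L2_HIT:        /* step 706: line already present, nothing to do */
        break;
    case L2_REPLAY:     /* step 712: re-push and arbitrate again later   */
        queue_push(req.block_number, req.line_index);
        break;
    case L2_MISS:       /* step 714: have the bus interface unit read it */
        fill_queue_push(req.block_number, req.line_index);
        break;
    }
}
```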
Fig. 9 illustrates an example operation of the microprocessor 100 of Fig. 2. Fig. 9 shows, after the first, second, and tenth of ten accesses to a memory block, the contents of the block bitmask register 302 (an asterisk at a bit position indicates an access to the corresponding cache line), the min change counter 308, the max change counter 312, and the total counter 314. In Fig. 9, the min change counter 308 is denoted "cntr_min_change", the max change counter 312 is denoted "cntr_max_change", and the total counter 314 is denoted "cntr_total". The position of the middle pointer register 316 is indicated in Fig. 9 by an "M".
Since the first access (step 402 of Fig. 4), to address 0x4dced300, is to the cache line at index 12 of the memory block, the control logic 322 sets bit 12 of the block bitmask register 302 (step 408 of Fig. 4), as shown. Additionally, the control logic 322 updates the min change counter 308, the max change counter 312, and the total counter 314 (steps 502, 506, and 512 of Fig. 5).
Since the second access, to address 0x4dced260, is to the cache line at index 9 of the memory block, the control logic 322 sets bit 9 of the block bitmask register 302, as shown. Additionally, the control logic 322 updates the count values of the min change counter 308 and the total counter 314.
For the third through tenth accesses (the addresses of the third through ninth accesses are not shown; the address of the tenth access is 0x4dced6c0), the control logic 322 sets the appropriate bits of the block bitmask register 302, as shown. Additionally, the control logic 322 updates the count values of the min change counter 308, the max change counter 312, and the total counter 314 in response to each access.
At the bottom of Fig. 9 are the contents of the period match counters 318 after the control logic 322 has performed steps 514 through 522 for each of the ten memory accesses. In Fig. 9, the period match counters 318 are denoted "cntr_period_N_matches", where N is 1, 2, 3, 4, or 5.
In the example of Fig. 9, although the criterion of step 412 is met (the total counter 314 is at least ten) and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least two), the criterion of step 414 is not met (the difference between the min change counter 308 and the max change counter 312 is less than two). Therefore, no prefetching is performed for this memory block yet.
Also shown at the bottom of Fig. 9, for each of the periods 3, 4, and 5, are the patterns to the right and to the left of the middle pointer register 316.
Fig. 10 continues the example of Fig. 9 for the microprocessor 100 of Fig. 2. Fig. 10 depicts information similar to that of Fig. 9, but after the eleventh and twelfth accesses to the memory block (the address of the twelfth access is 0x4dced760). As shown, the criterion of step 412 is met (the total counter 314 is at least ten), the criterion of step 414 is met (the difference between the min change counter 308 and the max change counter 312 is at least two), and the criterion of step 416 is met (the period match counter 318 for period 5 exceeds all the other period match counters 318 by at least two). Therefore, per step 418 of Fig. 4, the control logic 322 populates the direction register 342 (to indicate an upward direction trend), the pattern period register 346 (with the value 5), and the pattern register 344 (with the pattern " * * " or "01010"). The control logic 322 also performs prefetch prediction for the memory block per step 422 of Fig. 4 and Fig. 6, as shown in Fig. 11. Fig. 10 also shows that, per the operation of step 602 of Fig. 6, the search pointer register 352 is positioned at bit 21.
Fig. 11 continues the example of Figs. 9 and 10 for the microprocessor 100 of Fig. 2. Fig. 11 describes, in a table of twelve rows (labeled 0 through 11), twelve passes of the prefetch unit 124 through steps 604 to 616 of Fig. 6, until the prefetch unit 124 finds a cache line of the memory block that needs to be prefetched. As shown, in each pass the value of the search pointer register 352 is incremented per step 612 of Fig. 6. In passes 5 and 10, the pattern location register 348 is also updated per step 612 of Fig. 6. In passes 0, 2, 4, 5, 7, and 10, because the bit of the pattern register 344 at the search pointer register 352 is false, the pattern indicates that the cache line at the search pointer register 352 will not be needed. In passes 1, 3, 6, and 8, because the bit of the pattern register 344 at the search pointer register 352 is true, the pattern indicates that the cache line at the search pointer register 352 will be needed, but the cache line has already been fetched, as indicated by the true bit of the block bitmask register 302. Finally, in pass 11, because the bit of the pattern register 344 at the search pointer register 352 is true, the pattern indicates that the cache line at the search pointer register 352 will be needed, and the cache line has not yet been fetched, as indicated by the false bit of the block bitmask register 302. Therefore, per step 624 of Fig. 6, the control logic 322 pushes a prefetch request into the prefetch request queue 328 to prefetch the cache line at address 0x4dced800, which corresponds to bit 32 of the block bitmask register 302.
In one embodiment, one or more of the predetermined amounts described herein are programmable, either by the operating system (such as via a model specific register (MSR)) or via fuses of microprocessor 100 that may be blown during its manufacture.
In one embodiment, the size of the block bitmask register 302 is reduced in order to save power and die real estate. That is, the number of bits in each block bitmask register 302 is smaller than the number of cache lines in a memory block. For example, in one embodiment, each block bitmask register 302 holds only half as many bits as the number of cache lines in a memory block. The block bitmask register 302 tracks accesses to only the lower or upper half of the memory block, depending on which half was accessed first, and an additional bit indicates whether the lower or the upper half of the memory block was accessed first.
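A minimal sketch of this reduced-bitmask variant, under the assumption (not stated in the text) that accesses to the untracked half are simply ignored:

```c
typedef struct {
    uint32_t half_bitmask;        /* one bit per line of one 2 KB half  */
    int      upper_half_first;    /* 1 => upper half was accessed first */
} half_bitmask_t;

static void mark_access_half(half_bitmask_t *t, int line_index, int is_first)
{
    int upper = line_index >= LINES_PER_BLOCK / 2;
    if (is_first)
        t->upper_half_first = upper;          /* latch the tracked half */
    if (upper == t->upper_half_first)
        t->half_bitmask |= 1u << (line_index % (LINES_PER_BLOCK / 2));
    /* accesses to the other half are not tracked in this sketch */
}
```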
In one embodiment, rather than examining the N bits to either side of the middle pointer register 316 as described with respect to steps 516/518, the control logic 322 includes a serial engine that scans the block bitmask register 302 one or two bits at a time, in order to find patterns whose period is greater than the largest period otherwise examined (5, in the example above).
In one embodiment, if no clear direction trend is detected at step 414, or no clear pattern period is detected at step 416, and the total counter 314 reaches a predetermined threshold (indicating that most of the cache lines of the memory block have been accessed), the control logic 322 proceeds to prefetch the remaining cache lines of the memory block. The predetermined threshold is a relatively high percentage of the number of cache lines in the memory block, e.g., of the number of bits of the block bitmask register 302.
Prefetch unit shared between the second-level cache and the first-level data cache
Modern microprocessors include a hierarchy of caches. Typically, a microprocessor includes both a small, fast level-1 (L1) data cache and a larger but slower level-2 (L2) cache, such as the L1 data cache 116 and the L2 cache 118 of Fig. 2. In a cache hierarchy it is beneficial to prefetch data into the caches to improve the cache hit rate. Because of the speed of the L1 data cache 116, it is preferable to prefetch data into the L1 data cache 116. However, because the L1 data cache 116 has a small capacity, the net effect on the cache hit rate may actually be negative if the prefetch unit incorrectly prefetches into the L1 data cache 116 data that turns out not to be needed, since that data displaces other data that is needed. Whether data is brought into the L1 data cache 116 or the L2 cache 118 is therefore a function of how well the prefetch unit can correctly predict whether the data will be needed. Because the L1 data cache 116 is required to be small, an L1 prefetch unit has tended to be small as well, and therefore less accurate; conversely, because the size of the L2 cache tag and data arrays dwarfs an L2 prefetch unit, an L2 prefetch unit can be larger and therefore more accurate.
An advantage of the microprocessor 100 described in embodiments of the present invention is that a single body of logic, the prefetch unit 124, serves as the basis for prefetching into both the L2 cache 118 and the L1 data cache 116. Embodiments of the invention thus bring the higher accuracy of the L2-scale prefetch unit to bear on the problem of prefetching into the L1 data cache 116, and accomplish the goal of using a single body of logic to handle prefetching for both the L1 data cache 116 and the L2 cache 118.
Be the microprocessor 100 according to various embodiments of the present invention as shown in figure 12.The microprocessor 100 of Figure 12 similar in appearance to Fig. 2 microprocessor 100 and there is extra characteristic as described below.
First order data quick storer 116 provides first order data memory addresses 196 to pre-fetch unit 124.First order data memory addresses 196 is loaded into/stores the physical address of access by loading/storage element 134 pairs of first order data quick storeies 116.That is, pre-fetch unit 124 can carry out eavesdropping (eavesdrops) along with when loading/storage element 134 accesses first order data quick storer 116.Pre-fetch unit 124 provides a queue 198 of pattern prediction cache line address 194 a to first order data quick storer 116, pattern prediction cache line address 194 is the address of cache line, and cache line is wherein according to first order data memory addresses 196, pre-fetch unit 124 predicts that loading/storage element 134 is about to first order data quick storer 116 proposed requirement.First order data quick storer 116 provides cache line configuration requirement 192 a to pre-fetch unit 124, and in order to require cache line from second level memory cache 118, and the address of these cache lines is stored in queue 198.Finally, second level memory cache 118 provides required cache line data 188 to first order data quick storer 116.
The prefetch unit 124 also includes a first-level data search pointer 172 and a first-level data pattern address 178, as shown in Fig. 12. The use of the first-level data search pointer 172 and the first-level data pattern address 178 relates to Fig. 4 and is described below.
Fig. 13 is a flowchart of the operation of the prefetch unit 124 of Fig. 12. Flow begins at step 1302.
At step 1302, the prefetch unit 124 receives from the first-level data cache 116 a first-level data memory address 196 of Fig. 12. Flow proceeds to step 1304.
At step 1304, the prefetch unit 124 detects that the first-level data memory address 196 falls within a memory block (e.g., a page) for which the prefetch unit 124 has previously detected an access pattern and has begun prefetching cache lines from system memory into the second-level cache 118, as described above with respect to Figs. 1 through 11. Specifically, because the access pattern has been detected, the prefetch unit 124 maintains a block number register 303 that specifies the base address of the memory block. The prefetch unit 124 detects that the first-level data memory address 196 falls within the memory block by determining whether the bits of the block number register 303 match the corresponding bits of the first-level data memory address 196. Flow proceeds to step 1306.
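For illustration, the membership test of step 1304 can be sketched in C as below; this is a minimal sketch, and the 4 KB block size and the helper name addr_in_block are assumptions of the sketch, not taken from the embodiment.

#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SHIFT 12 /* assumed 4 KB memory block (one physical page) */

/* Step 1304 (sketch): the first-level data memory address 196 falls within
   the tracked memory block when its upper bits equal the base address held
   in the block number register 303. */
static bool addr_in_block(uint64_t l1d_addr_196, uint64_t block_number_303)
{
    return (l1d_addr_196 >> BLOCK_SHIFT) == (block_number_303 >> BLOCK_SHIFT);
}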
At step 1306, the prefetch unit 124 searches within the memory block, starting at the first-level data memory address 196 and moving in the previously detected access direction, for the next two cache lines implicated by the previously detected access pattern. The operation of step 1306 is described in more detail below with respect to Fig. 14. Flow proceeds to step 1308.
At step 1308, the prefetch unit 124 provides the physical addresses of the next two cache lines found at step 1306 to the first-level data cache 116 as pattern-predicted cache line addresses 194. In other embodiments, the number of cache line addresses provided by the prefetch unit 124 may be greater or less than two. Flow proceeds to step 1312.
At step 1312, the first-level data cache 116 pushes the addresses provided at step 1308 into the queue 198. Flow proceeds to step 1314.
At step 1314, whenever the queue 198 is non-empty, the first-level data cache 116 takes the next address out of the queue 198 and sends a cache line allocation request 192 to the second-level cache 118 to obtain the cache line at that address. However, if an address in the queue 198 is already present in the first-level data cache 116, the first-level data cache 116 dumps the address and forgoes requesting its cache line from the second-level cache 118. The second-level cache 118 then returns the requested cache line data 188 to the first-level data cache 116. Flow ends at step 1314.
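As a rough model of step 1314, the C sketch below drains the queue 198, skipping lines already present in the first-level data cache 116; the queue layout, its depth, and the two stub functions standing in for cache hardware are hypothetical.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define QUEUE_DEPTH 8 /* assumed depth of queue 198 */

struct queue198 {
    uint64_t addr[QUEUE_DEPTH];
    int head, count;
};

/* Stubs standing in for the cache hardware (hypothetical). */
static bool l1d_contains(uint64_t line) { (void)line; return false; }
static void l2_allocation_request_192(uint64_t line)
{
    printf("request line 0x%llx from L2\n", (unsigned long long)line);
}

/* Step 1314 (sketch): while the queue 198 is non-empty, pop the next
   address; dump it if the line is already present in the L1 data cache,
   otherwise send a cache line allocation request 192 to the L2 cache. */
static void drain_queue198(struct queue198 *q)
{
    while (q->count > 0) {
        uint64_t line = q->addr[q->head];
        q->head = (q->head + 1) % QUEUE_DEPTH;
        q->count--;
        if (!l1d_contains(line))
            l2_allocation_request_192(line);
        /* else: address dumped, request forgone */
    }
}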
Fig. 14 is a flowchart of the operation of the prefetch unit 124 of Fig. 12 to perform step 1306 of Fig. 13. Fig. 14 describes the operation for the case in which the pattern direction detected with respect to Fig. 3 is upward; however, the prefetch unit 124 is also capable of performing the same function when the detected pattern direction is downward. The operation of steps 1402 through 1408 places the pattern of the pattern register 344 of Fig. 3 at the appropriate location within the memory block so that the prefetch unit 124 can search for the next two cache lines in the pattern starting from the first-level data memory address 196, replicating the pattern of the pattern register 344 across the memory block as needed. Flow begins at step 1402.
At step 1402, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 of Fig. 12 to the sum of the pattern period register 346 of Fig. 3 and the middle pointer register 316, in a manner similar to the way step 602 of Fig. 6 initializes the search pointer 352 and the pattern location register 348. For example, if the value of the middle pointer register 316 is 16 and the pattern period register 346 is 5, and the direction register 342 indicates the upward direction, the prefetch unit 124 initializes the first-level data search pointer 172 and the first-level data pattern address 178 to 21. Flow proceeds to step 1404.
At step 1404, the prefetch unit 124 determines whether the first-level data memory address 196 falls within the pattern of the pattern register 344 at its currently specified location within the memory block, which was determined initially at step 1402 and which may be updated at step 1406. That is, the prefetch unit 124 determines whether the value of the relevant bits of the first-level data memory address 196 (i.e., excluding the bits that identify the memory block and the bits that specify the byte offset within the cache line) is greater than or equal to the value of the first-level data search pointer 172 and less than or equal to the sum of the value of the first-level data search pointer 172 and the value of the pattern period register 346. If the first-level data memory address 196 falls within the pattern of the pattern register 344, flow proceeds to step 1408; otherwise, flow proceeds to step 1406.
At step 1406, the prefetch unit 124 increments the first-level data search pointer 172 and the first-level data pattern address 178 by the value of the pattern period register 346. Per the operation of step 1406 (and of step 1418 below), if the first-level data search pointer 172 reaches the end of the memory block, the search terminates. Flow returns to step 1404.
At step 1408, the prefetch unit 124 sets the value of the first-level data search pointer 172 to the offset, within the memory page, of the cache line implicated by the first-level data memory address 196. Flow proceeds to step 1412.
At step 1412, the prefetch unit 124 examines the bit of the pattern register 344 at the position indicated by the first-level data search pointer 172. Flow proceeds to step 1414.
At step 1414, the prefetch unit 124 determines whether the bit examined at step 1412 is set. If the bit examined at step 1412 is set, flow proceeds to step 1416; otherwise, flow proceeds to step 1418.
At step 1416, the prefetch unit 124 marks the cache line predicted by the pattern register 344 bit found set at step 1414 as ready to have its physical address sent to the first-level data cache 116 as a pattern-predicted cache line address 194. Flow proceeds to step 1418.
At step 1418, the prefetch unit 124 increments the value of the first-level data search pointer 172. In addition, if the first-level data search pointer 172 has passed beyond the last bit of the pattern register 344 at its current location, the prefetch unit 124 updates the first-level data pattern address 178 with the new value of the first-level data search pointer 172, that is, shifts the pattern register 344 to the new first-level data search pointer 172 location. The operation of steps 1412 through 1418 repeats until two cache lines (or another predetermined number of cache lines) have been found. Flow ends at step 1418.
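The search of Fig. 14 (upward direction) can be modeled roughly in the following C sketch. The 64-line block, the bit-level layout, and the function name are assumptions of the sketch; the pattern conceptually repeats every period_346 lines, and the loop marks the next two set pattern bits at or above the triggering access as prefetch candidates.

#include <stdint.h>
#include <stdio.h>

#define LINES_PER_BLOCK 64 /* assumed: 4 KB block of 64-byte cache lines */

/* Sketch of steps 1402-1418 (upward direction). 'pattern_344' holds one
   period of the detected pattern; the pattern is replicated every
   'period_346' lines starting at 'pattern_addr_178'. */
static int find_next_lines(uint64_t pattern_344, int period_346,
                           int mid_pointer_316, int trigger_line,
                           int out_lines[2])
{
    /* Step 1402: initialize search pointer and pattern address to
       middle pointer + period (e.g., 16 + 5 = 21). */
    int search_172 = mid_pointer_316 + period_346;
    int pattern_addr_178 = search_172;

    /* Steps 1404-1406: slide the pattern window up by whole periods until
       the triggering access falls inside it (or the block ends). */
    while (!(trigger_line >= search_172 &&
             trigger_line <= search_172 + period_346)) {
        search_172 += period_346;
        pattern_addr_178 += period_346;
        if (search_172 >= LINES_PER_BLOCK)
            return 0; /* search terminated at end of memory block */
    }

    /* Step 1408: restart the scan at the triggering cache line's offset. */
    search_172 = trigger_line;

    /* Steps 1412-1418: scan pattern bits upward, collecting set bits. */
    int found = 0;
    while (found < 2 && search_172 < LINES_PER_BLOCK) {
        int bit = (search_172 - pattern_addr_178 + period_346) % period_346;
        if ((pattern_344 >> bit) & 1)            /* steps 1412-1416 */
            out_lines[found++] = search_172;
        if (++search_172 >= pattern_addr_178 + period_346)
            pattern_addr_178 += period_346;      /* step 1418: shift pattern */
    }
    return found;
}

int main(void)
{
    int lines[2];
    /* pattern 0b101: lines 0 and 2 of each 5-line period are accessed */
    int n = find_next_lines(0x5, 5, 16, 22, lines);
    for (int i = 0; i < n; i++)
        printf("predict line %d\n", lines[i]); /* prints lines 23 and 26 */
    return 0;
}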
A benefit of prefetching cache lines into the first-level data cache 116 according to Fig. 13 is that the changes required to the first-level data cache 116 and the second-level cache 118 are small. However, in other embodiments, the prefetch unit 124 does not provide the pattern-predicted cache line addresses 194 to the first-level data cache 116. For example, in one embodiment, the prefetch unit 124 directly requests the bus interface unit 122 to obtain the cache lines from memory and then writes the received cache lines into the first-level data cache 116. In another embodiment, the prefetch unit 124 requests the cache lines from the second-level cache 118, which provides the data to the prefetch unit 124 (obtaining the cache lines from memory if they miss), and the prefetch unit 124 writes the cache lines into the first-level data cache 116. In other embodiments, the prefetch unit 124 requests the cache lines from the second-level cache 118 (which obtains them from memory if they miss), and the second-level cache 118 writes the cache lines directly into the first-level data cache 116.
As mentioned above, a benefit of the embodiments of the present invention is that a single prefetch unit 124 serves as the basis for the prefetching needs of both the second-level cache 118 and the first-level data cache 116. Although shown as distinct blocks in Figs. 2, 12 and 15 (as discussed below), the prefetch unit 124 may be spatially located adjacent to the tag and data arrays of the second-level cache 118 and may conceptually be included in the second-level cache 118, as shown in Fig. 21. This arrangement affords the prefetch unit 124, which observes the load/store unit 134, the large space that its accuracy demands, and applies a single body of logic to handle the prefetch operations of both the first-level data cache 116 and the second-level cache 118, thereby solving the prior-art problem of prefetching data into the small first-level data cache 116.
Bounding-box prefetch unit with reduced warm-up penalty on page crossings
The prefetch unit 124 of the present invention, described above, detects relatively complex access patterns over a memory block (e.g., a physical memory page) that an ordinary prefetch unit would not detect. For example, the prefetch unit 124 can detect that a program is accessing a memory block according to a pattern even when the out-of-order execution pipeline of the microprocessor 100 re-orders the memory accesses out of program-instruction order, which might cause an ordinary prefetch unit to fail to detect the memory access pattern and consequently not prefetch. This is because the prefetch unit 124 effectively considers which locations of the memory block have been accessed, without regard to the time order of the accesses.
However, because of its ability to recognize more complex access patterns and/or re-ordered access patterns, the prefetch unit 124 of the present invention may require more time than an ordinary prefetch unit to detect an access pattern, referred to below as its "warm-up time". A method of reducing the warm-up time of the prefetch unit 124 is therefore needed.
The prefetch unit 124 predicts, before a program that is accessing a memory block according to a pattern actually does so, whether the program will cross over into a new memory block that is virtually adjacent to the old memory block, and predicts whether the program will continue to access the new memory block according to the same pattern. In response, the prefetch unit 124 uses the pattern, direction and other relevant information from the old memory block to speed up the detection of the access pattern in the new memory block, i.e., to reduce the warm-up time.
Fig. 15 is a block diagram of a microprocessor 100 having a prefetch unit 124. The microprocessor 100 of Fig. 15 is similar to the microprocessor 100 of Figs. 2 and 12 and has other features described below.
As described with respect to Fig. 3, the prefetch unit 124 includes a plurality of hardware units 332. Relative to the description of Fig. 3, each hardware unit 332 additionally includes a hashed virtual address of memory block (HVAMB) field 354 and a status field 356. During the initialization of an allocated hardware unit 332 at step 406 of Fig. 4, the prefetch unit 124 takes the physical block number from the block number register 303, translates the physical block number into a virtual address, hashes the virtual address according to the same hashing algorithm performed at step 1704 of Fig. 17 (described below), and stores the result of the hash calculation into the HVAMB field 354. The status field 356 has three possible values: inactive, active, or probationary, as described below. The prefetch unit 124 also includes a virtual hash table (VHT) 162; the organization and operation of the virtual hash table 162 are described in detail with respect to Figs. 16 through 19 below.
Fig. 16 shows the virtual hash table 162 of Fig. 15. The virtual hash table 162 includes a plurality of entries, preferably organized as a queue. Each entry includes a valid bit (not shown) and three fields: a minus-1 hashed virtual address (HVAM1) 1602, an unmodified hashed virtual address (HVAUN) 1604, and a plus-1 hashed virtual address (HVAP1) 1606. The generation of the values that populate these fields is described with respect to Fig. 17 below.
Fig. 17 is a flowchart of the operation of the microprocessor 100 of Fig. 15. Flow begins at step 1702.
At step 1702, the first-level data cache 116 receives a load/store request from the load/store unit 134, the request including a virtual address. Flow proceeds to step 1704.
At step 1704, the first-level data cache 116 performs a hash function on selected bits of the virtual address received at step 1702 to produce an unmodified hashed virtual address (HVAUN) 1604. In addition, the first-level data cache 116 adds a memory block size (MBS) to the selected bits of the virtual address received at step 1702 to produce a sum, and performs the hash function on the sum to produce a plus-1 hashed virtual address (HVAP1) 1606. Additionally, the first-level data cache 116 subtracts the memory block size from the selected bits of the virtual address received at step 1702 to produce a difference, and performs the hash function on the difference to produce a minus-1 hashed virtual address (HVAM1) 1602. In one embodiment, the memory block size is 4KB. In one embodiment, the virtual address is 40 bits, and bits 39:30 and 11:0 of the virtual address are ignored by the hash function. The remaining 18 virtual address bits are "dealt", like a hand of cards, into the hashed bit positions. The idea is that the lower bits of the virtual address have the highest entropy and the higher bits the lowest entropy; dealing them in this fashion ensures that bits of consistently mixed entropy levels are hashed together. In one embodiment, the remaining 18 virtual address bits are hashed down to 6 bits according to the logic of Table 1 below. However, other embodiments may employ different hash algorithms; moreover, in designs where performance dominates space and power considerations, embodiments may forgo hashing altogether. Flow proceeds to step 1706.
assign hash[5] = VA[29] ^ VA[18] ^ VA[17];
assign hash[4] = VA[28] ^ VA[19] ^ VA[16];
assign hash[3] = VA[27] ^ VA[20] ^ VA[15];
assign hash[2] = VA[26] ^ VA[21] ^ VA[14];
assign hash[1] = VA[25] ^ VA[22] ^ VA[13];
assign hash[0] = VA[24] ^ VA[23] ^ VA[12];
Table 1
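A C rendering of Table 1 and of the three hashes of step 1704 is given below as a minimal sketch; it assumes the 40-bit virtual address and the 4 KB memory block size mentioned above, and the function names are illustrative.

#include <stdint.h>
#include <stdio.h>

#define MBS 0x1000ULL /* assumed 4 KB memory block size */

static inline unsigned bit(uint64_t va, int i) { return (va >> i) & 1; }

/* The hash of Table 1: bits 29:12 of the virtual address (bits 39:30 and
   11:0 are ignored) are dealt into six XOR trees, mixing high-entropy and
   low-entropy bits in each output bit. */
static unsigned hash6(uint64_t va)
{
    unsigned h = 0;
    h |= (bit(va, 29) ^ bit(va, 18) ^ bit(va, 17)) << 5;
    h |= (bit(va, 28) ^ bit(va, 19) ^ bit(va, 16)) << 4;
    h |= (bit(va, 27) ^ bit(va, 20) ^ bit(va, 15)) << 3;
    h |= (bit(va, 26) ^ bit(va, 21) ^ bit(va, 14)) << 2;
    h |= (bit(va, 25) ^ bit(va, 22) ^ bit(va, 13)) << 1;
    h |= (bit(va, 24) ^ bit(va, 23) ^ bit(va, 12)) << 0;
    return h;
}

int main(void)
{
    uint64_t va = 0x12345678ULL;      /* example load/store virtual address */
    unsigned hvaun = hash6(va);       /* step 1704: HVAUN 1604 */
    unsigned hvap1 = hash6(va + MBS); /* step 1704: HVAP1 1606 */
    unsigned hvam1 = hash6(va - MBS); /* step 1704: HVAM1 1602 */
    printf("HVAUN=%u HVAP1=%u HVAM1=%u\n", hvaun, hvap1, hvam1);
    return 0;
}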
At step 1706, the first-level data cache 116 provides the unmodified hashed virtual address (HVAUN) 1604, the plus-1 hashed virtual address (HVAP1) 1606 and the minus-1 hashed virtual address (HVAM1) 1602 produced at step 1704 to the prefetch unit 124. Flow proceeds to step 1708.
At step 1708, the prefetch unit 124 selectively updates the virtual hash table 162 with the unmodified hashed virtual address (HVAUN) 1604, the plus-1 hashed virtual address (HVAP1) 1606 and the minus-1 hashed virtual address (HVAM1) 1602 received at step 1706. That is, if the virtual hash table 162 already includes an entry with the same HVAUN 1604, HVAP1 1606 and HVAM1 1602, the prefetch unit 124 forgoes updating the virtual hash table 162. Otherwise, the prefetch unit 124 pushes the HVAUN 1604, HVAP1 1606 and HVAM1 1602 into the top entry of the virtual hash table 162 in a first-in-first-out fashion and marks the pushed entry valid. Flow ends at step 1708.
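The selective first-in-first-out update of step 1708 might look like the following C sketch; the table depth and structure names are assumptions of the sketch.

#include <stdbool.h>
#include <string.h>

#define VHT_ENTRIES 8 /* assumed depth of virtual hash table 162 */

struct vht_entry {
    bool valid;
    unsigned hvam1_1602, hvaun_1604, hvap1_1606;
};

struct vht { struct vht_entry e[VHT_ENTRIES]; };

/* Step 1708 (sketch): forgo the update if an identical valid entry already
   exists; otherwise push the new triple in FIFO order (entry 0 models the
   tail, i.e., the most recently pushed entry). */
static void vht_update(struct vht *t, unsigned m1, unsigned un, unsigned p1)
{
    for (int i = 0; i < VHT_ENTRIES; i++)
        if (t->e[i].valid && t->e[i].hvam1_1602 == m1 &&
            t->e[i].hvaun_1604 == un && t->e[i].hvap1_1606 == p1)
            return; /* already present: no update */

    memmove(&t->e[1], &t->e[0], (VHT_ENTRIES - 1) * sizeof t->e[0]);
    t->e[0] = (struct vht_entry){ true, m1, un, p1 };
}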
Fig. 18 shows the contents of the virtual hash table 162 of Fig. 16 after the prefetch unit 124 has performed the operation described with respect to Fig. 17 in response to the load/store unit 134, executing a program, proceeding in an upward direction through two memory blocks (denoted A and A+MBS) and into a third memory block (denoted A+2*MBS), thereby populating the virtual hash table 162. Specifically, the entry two entries from the tail includes the hash of A-MBS in the minus-1 hashed virtual address (HVAM1) field 1602, the hash of A in the unmodified hashed virtual address (HVAUN) field 1604 and the hash of A+MBS in the plus-1 hashed virtual address (HVAP1) field 1606; the entry one entry from the tail includes the hash of A in the HVAM1 field 1602, the hash of A+MBS in the HVAUN field 1604 and the hash of A+2*MBS in the HVAP1 field 1606; and the entry at the tail (i.e., the most recently pushed entry) includes the hash of A+MBS in the HVAM1 field 1602, the hash of A+2*MBS in the HVAUN field 1604 and the hash of A+3*MBS in the HVAP1 field 1606.
Fig. 19 (comprising Figs. 19A and 19B) is a flowchart of the operation of the prefetch unit 124 of Fig. 15. Flow begins at step 1902.
At step 1902, the first-level data cache 116 sends a new allocation request (AR) to the second-level cache 118. The allocation request is a request for a new memory block; that is, the prefetch unit 124 determines that the memory block implicated by the allocation request is new, meaning that it has not yet allocated a hardware unit 332 for the implicated memory block, i.e., the prefetch unit 124 has not recently encountered an allocation request for the new memory block. In one embodiment, the allocation request is the result of a load/store miss in the first-level data cache 116 and is a request to the second-level cache 118 for the same missing cache line. In one embodiment, the allocation request specifies a physical address, from which the associated virtual address was translated. The first-level data cache 116 hashes the virtual address associated with the physical address of the allocation request according to a hash function (namely, the same hash function as at step 1704 of Fig. 17) to produce a hashed virtual address of the allocation request (HVAAR), and provides the HVAAR to the prefetch unit 124. Flow proceeds to step 1903.
At step 1903, the prefetch unit 124 allocates a new hardware unit 332 to the new memory block. If an inactive hardware unit 332 exists, the prefetch unit 124 allocates the inactive hardware unit 332 to the new memory block; otherwise, in one embodiment, the prefetch unit 124 allocates the least-recently-used hardware unit 332 to the new memory block. In one embodiment, once the prefetch unit 124 has prefetched all the cache lines of a memory block indicated by its pattern, the prefetch unit 124 inactivates the hardware unit 332. In one embodiment, the prefetch unit 124 has the ability to pin a hardware unit 332 so that it cannot be replaced even if it becomes the least recently used. For example, if the prefetch unit 124 detects a predetermined number of accesses to the memory block according to the pattern, but the prefetch unit 124 has not yet completed all of the prefetches for the entire memory block according to the pattern, the prefetch unit 124 may pin the hardware unit 332 associated with the memory block so that, even if it becomes the least recently used, it remains ineligible for replacement. In one embodiment, the prefetch unit 124 maintains the relative age of each hardware unit 332 (from original allocation), and when the age reaches a predetermined age threshold, the prefetch unit 124 inactivates the hardware unit 332. In another embodiment, if the prefetch unit 124 detects (via steps 1904 through 1926 below) a virtually adjacent memory block and has completed the prefetches from the virtually adjacent memory block, the prefetch unit 124 may selectively reuse the hardware unit 332 of the virtually adjacent memory block rather than allocating a new hardware unit 332. In this embodiment, the prefetch unit 124 selectively refrains from initializing the various storage elements of the reused hardware unit 332 (such as the direction register 342, the pattern register 344 and the pattern location register 348) in order to retain the useful information stored therein. Flow proceeds to step 1904.
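The allocation choices of step 1903 can be summarized in the C sketch below; the unit count, the LRU encoding and the field names are assumptions of the sketch, and aging-based inactivation is omitted for brevity.

#include <stdbool.h>
#include <stdint.h>

#define NUM_UNITS 8 /* assumed number of hardware units 332 */

enum status_356 { INACTIVE, ACTIVE, PROBATIONARY };

struct hw_unit_332 {
    enum status_356 status;
    bool pinned;      /* pinned units are ineligible for replacement */
    uint32_t lru_age; /* larger = less recently used (hypothetical encoding) */
};

/* Step 1903 (sketch): prefer an inactive unit; otherwise evict the
   least-recently-used unit that is not pinned. */
static int allocate_unit(struct hw_unit_332 u[NUM_UNITS])
{
    int victim = -1;
    for (int i = 0; i < NUM_UNITS; i++) {
        if (u[i].status == INACTIVE)
            return i;
        if (!u[i].pinned && (victim < 0 || u[i].lru_age > u[victim].lru_age))
            victim = i;
    }
    return victim; /* -1 only if every unit is pinned */
}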
At step 1904, the prefetch unit 124 compares the hashed virtual address of the allocation request (HVAAR) produced at step 1902 against the minus-1 hashed virtual address (HVAM1) 1602 and the plus-1 hashed virtual address (HVAP1) 1606 of each entry of the virtual hash table 162. Per the operation of steps 1904 through 1922, the prefetch unit 124 determines whether an active memory block is virtually adjacent to the new memory block; per the operation of steps 1924 through 1928, the prefetch unit 124 predicts whether memory accesses will continue, according to the previously detected access pattern and direction, out of the virtually adjacent active memory block and into the new memory block, in order to reduce the warm-up time of the prefetch unit 124 so that it can begin prefetching from the new memory block sooner. Flow proceeds to step 1906.
At step 1906, based on the comparison performed at step 1904, the prefetch unit 124 determines whether the HVAAR matches the plus-1 hashed virtual address (HVAP1) 1606 of any entry of the virtual hash table 162. If the HVAAR matches an entry of the virtual hash table 162, flow proceeds to step 1908; otherwise, flow proceeds to step 1912.
At step 1908, the prefetch unit 124 sets a candidate_direction flag to a value indicating the upward direction. Flow proceeds to step 1916.
At step 1912, based on the comparison performed at step 1904, the prefetch unit 124 determines whether the HVAAR matches the minus-1 hashed virtual address (HVAM1) 1602 of any entry of the virtual hash table 162. If the HVAAR matches an entry of the virtual hash table 162, flow proceeds to step 1914; otherwise, flow ends.
At step 1914, the prefetch unit 124 sets the candidate_direction flag to a value indicating the downward direction. Flow proceeds to step 1916.
At step 1916, the prefetch unit 124 sets a candidate_hva register (not shown) to the value of the unmodified hashed virtual address (HVAUN) 1604 of the virtual hash table 162 entry matched at step 1906 or 1912. Flow proceeds to step 1918.
At step 1918, the prefetch unit 124 compares the candidate_hva against the hashed virtual address of memory block (HVAMB) field 354 of each active memory block in the prefetch unit 124. Flow proceeds to step 1922.
At step 1922, based on the comparison performed at step 1918, the prefetch unit 124 determines whether the candidate_hva matches the HVAMB field 354 of any active memory block. If the candidate_hva matches the HVAMB field 354 of a memory block, flow proceeds to step 1924; otherwise, flow ends.
At step 1924, the prefetch unit 124 has determined that the matching active memory block found at step 1922 is virtually adjacent to the new memory block. Therefore, the prefetch unit 124 compares the candidate_direction (assigned at step 1908 or 1914) with the direction register 342 of the matching active memory block, in order to predict whether memory accesses will continue out of the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction. Specifically, if the candidate_direction differs from the direction register 342 of the virtually adjacent memory block, memory accesses are unlikely to continue out of the virtually adjacent active memory block into the new memory block according to the previously detected access pattern and direction. Flow proceeds to step 1926.
At step 1926, based on the comparison performed at step 1924, the prefetch unit 124 determines whether the candidate_direction matches the direction register 342 of the matching active memory block. If the candidate_direction matches the direction register 342 of the matching active memory block, flow proceeds to step 1928; otherwise, flow ends.
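Steps 1904 through 1926 amount to the following C sketch; the structures repeat the layouts of the earlier sketches so the block stands alone, and the names hvamb_354 and dir_342 model the per-unit fields for illustration.

#include <stdbool.h>

#define VHT_ENTRIES 8 /* assumed depth of virtual hash table 162 */
#define NUM_UNITS 8   /* assumed number of hardware units 332 */

enum dir { DOWN = 0, UP = 1 };

struct vht_entry {
    bool valid;
    unsigned hvam1_1602, hvaun_1604, hvap1_1606;
};

struct hw_unit_332 {
    bool active;
    unsigned hvamb_354; /* hashed virtual address of the memory block */
    enum dir dir_342;   /* previously detected access direction */
};

/* Steps 1904-1926 (sketch): find a VHT entry whose plus-1 or minus-1 hash
   matches the new block's HVAAR (yielding a candidate direction), then look
   for an active unit whose HVAMB matches that entry's HVAUN and whose
   direction agrees. Returns the matching unit index, or -1 if none. */
static int find_virtually_adjacent(unsigned hvaar,
                                   const struct vht_entry vht[VHT_ENTRIES],
                                   const struct hw_unit_332 u[NUM_UNITS])
{
    for (int i = 0; i < VHT_ENTRIES; i++) {
        if (!vht[i].valid)
            continue;
        enum dir cand;
        if (vht[i].hvap1_1606 == hvaar)      /* steps 1906-1908 */
            cand = UP;
        else if (vht[i].hvam1_1602 == hvaar) /* steps 1912-1914 */
            cand = DOWN;
        else
            continue;
        unsigned candidate_hva = vht[i].hvaun_1604; /* step 1916 */
        for (int j = 0; j < NUM_UNITS; j++)         /* steps 1918-1926 */
            if (u[j].active && u[j].hvamb_354 == candidate_hva &&
                u[j].dir_342 == cand)
                return j;
    }
    return -1;
}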
At step 1928, the prefetch unit 124 determines whether the new allocation request received at step 1902 is for a cache line predicted by the pattern register 344 of the matching virtually adjacent active memory block detected at step 1926. In one embodiment, to perform the determination of step 1928, the prefetch unit 124 effectively shifts a copy of the pattern register 344 of the matching virtually adjacent active memory block by its pattern period register 346, continuing from the pattern location register 348 of the virtually adjacent memory block, in order to maintain the continuity of the pattern into the new memory block. If the new allocation request is for a cache line implicated by the pattern register 344 of the matching active memory block, flow proceeds to step 1934; otherwise, flow proceeds to step 1932.
At step 1932, the prefetch unit 124 initializes and populates the new hardware unit 332 (allocated at step 1903) according to steps 406 and 408 of Fig. 4, in the hope that it will eventually detect a new pattern of accesses to the new memory block according to the method described above with respect to Figs. 4 through 6, which will require the warm-up time. Flow ends at step 1932.
At step 1934, the prefetch unit 124 predicts that access requests will continue into the new memory block according to the pattern register 344 and the direction register 342 of the matching virtually adjacent active memory block. Therefore, the prefetch unit 124 populates the new hardware unit 332 in a manner similar to step 1932, but with some differences. Specifically, the prefetch unit 124 populates the direction register 342, pattern register 344 and pattern period register 346 with the corresponding values from the hardware unit 332 of the virtually adjacent memory block. In addition, the new value of the pattern location register 348 is determined by continuing to shift the pattern location by the value of the pattern period register 346 until it crosses over into the new memory block, thereby continuing the pattern register 344 seamlessly into the new memory block, as described with respect to step 1928. Moreover, the status field 356 of the new hardware unit 332 is marked probationary. Finally, the search pointer 352 is initialized to begin searching at the beginning of the memory block. Flow proceeds to step 1936.
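Continuing the pattern location across the block boundary (steps 1928 and 1934, upward direction) can be sketched as below; the 64-line block and the function name are assumptions of the sketch.

#define LINES_PER_BLOCK 64 /* assumed 4 KB block of 64-byte cache lines */

/* Steps 1928/1934 (sketch): keep shifting the pattern location by the
   pattern period until it crosses into the new (upward-adjacent) block,
   so the pattern register 344 continues seamlessly across the boundary.
   Returns the pattern location expressed relative to the new block. */
static int continue_pattern_loc(int pattern_loc_348, int period_346)
{
    int loc = pattern_loc_348;     /* location within the old block */
    while (loc < LINES_PER_BLOCK)  /* shift until block crossover */
        loc += period_346;
    return loc - LINES_PER_BLOCK;  /* same phase, new block's offset */
}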
At step 1936, the prefetch unit 124 continues to monitor the access requests made to the new memory block. If the prefetch unit 124 detects that at least a predetermined number of subsequent access requests to the memory block are for cache lines predicted by the pattern register 344, the prefetch unit 124 promotes the status field 356 of the hardware unit 332 from probationary to active and then begins prefetching from the new memory block as described with respect to Fig. 6. In one embodiment, the predetermined number of access requests is two, although other embodiments may contemplate other predetermined amounts. Flow ends at step 1936.
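A small sketch of the probation logic of step 1936, using the threshold of two from the described embodiment; the structure and names are illustrative.

#define PROMOTE_THRESHOLD 2 /* predetermined number from the embodiment */

enum unit_status { STATUS_INACTIVE, STATUS_ACTIVE, STATUS_PROBATIONARY };

struct probation {
    enum unit_status status;
    int pattern_hits; /* subsequent accesses that matched pattern 344 */
};

/* Step 1936 (sketch): promote the probationary unit to active once enough
   subsequent accesses hit cache lines predicted by pattern register 344. */
static void on_new_block_access(struct probation *p, int predicted_by_pattern)
{
    if (p->status == STATUS_PROBATIONARY && predicted_by_pattern &&
        ++p->pattern_hits >= PROMOTE_THRESHOLD)
        p->status = STATUS_ACTIVE; /* begin prefetching per Fig. 6 */
}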
Fig. 20 shows a hashed physical address-to-hashed virtual address thesaurus 2002 used by the prefetch unit 124 of Fig. 15. The thesaurus 2002 comprises an array of entries, each of which includes a physical address (PA) 2004 and a corresponding hashed virtual address (HVA) 2006. The corresponding HVA 2006 is the result of hashing the virtual address from which the physical address 2004 was translated. The prefetch unit 124 populates the thesaurus 2002 by eavesdropping on recent physical addresses travelling down the pipeline of the load/store unit 134. In another embodiment, at step 1902 of Fig. 19, the first-level data cache 116 does not provide the hashed virtual address (HVAAR) to the prefetch unit 124, but instead provides only the physical address associated with the allocation request. The prefetch unit 124 looks up the provided address in the thesaurus 2002 to find a matching physical address (PA) 2004 and obtain the associated hashed virtual address (HVA) 2006, which then serves as the HVAAR in the remainder of Fig. 19. Including the thesaurus 2002 in the prefetch unit 124 relieves the first-level data cache 116 of the need to provide the hashed virtual address with the allocation request, thereby simplifying the interface between the first-level data cache 116 and the prefetch unit 124.
In one embodiment, each entry of the thesaurus 2002 contains a hashed physical address rather than the physical address 2004, and the prefetch unit 124 hashes the allocation request physical address received from the first-level data cache 116 into a hashed physical address, with which it looks up the thesaurus 2002 to obtain the appropriate corresponding hashed virtual address (HVA) 2006. This embodiment allows a smaller thesaurus 2002 at the cost of the additional time required to hash the physical address.
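The lookup of Fig. 20 reduces to a small associative search, as in the C sketch below; the table size and names are assumptions, and the same code serves the hashed-physical-address variant if pa_2004 holds a hashed value.

#include <stdint.h>
#include <stdbool.h>

#define THESAURUS_ENTRIES 32 /* assumed size of thesaurus 2002 */

struct thesaurus_entry {
    bool valid;
    uint64_t pa_2004;  /* or a hashed PA in the smaller variant */
    unsigned hva_2006; /* hash of the VA from which pa_2004 was translated */
};

/* Sketch: find the entry whose physical address matches the allocation
   request and return its hashed virtual address, which then serves as the
   HVAAR in the remainder of Fig. 19. Returns false on a miss. */
static bool thesaurus_lookup(const struct thesaurus_entry t[THESAURUS_ENTRIES],
                             uint64_t ar_pa, unsigned *hvaar)
{
    for (int i = 0; i < THESAURUS_ENTRIES; i++) {
        if (t[i].valid && t[i].pa_2004 == ar_pa) {
            *hvaar = t[i].hva_2006;
            return true;
        }
    }
    return false;
}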
Fig. 21 shows a multi-core microprocessor 100 according to an embodiment of the present invention. The multi-core microprocessor 100 includes two cores (denoted core A 2102A and core B 2102B), referred to collectively as cores 2102 or individually as a core 2102. Each core 2102 has elements similar to those of the single-core microprocessor 100 of Figs. 2, 12 or 15. In addition, each core 2102 has a highly reactive prefetch unit 2104, described above. The two cores 2102 share the second-level cache 118 and the prefetch unit 124; specifically, the first-level data cache 116, load/store unit 134 and highly reactive prefetch unit 2104 of each core 2102 are coupled to the shared second-level cache 118 and prefetch unit 124. In addition, a shared highly reactive prefetch unit 2106 is coupled to the second-level cache 118 and the prefetch unit 124. In one embodiment, the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 prefetch only the next adjacent cache line after the cache line implicated by a memory access.
In addition to monitoring the memory accesses of the load/store units 134 and the first-level data caches 116, the prefetch unit 124 may also monitor the memory accesses generated by the highly reactive prefetch units 2104 and the shared highly reactive prefetch unit 2106 in making its prefetch decisions. The prefetch unit 124 may monitor memory accesses from various combinations of memory access sources to perform the different functions of the present invention. For example, the prefetch unit 124 may monitor a first combination of memory accesses to perform the functions described with respect to Figs. 2 through 11, a second combination of memory accesses to perform the functions described with respect to Figs. 12 through 14, and a third combination of memory accesses to perform the functions described with respect to Figs. 15 through 19. In one embodiment, due to timing considerations, the shared prefetch unit 124 cannot directly monitor the behavior of the load/store unit 134 of each core 2102; instead, the shared prefetch unit 124 monitors the load/store unit 134 behavior indirectly via the traffic that the first-level data cache 116 generates as a result of its load/store misses.
Various embodiments of the present invention are described herein, but those skilled in the art should understand that these embodiments serve only as examples and are not limiting. Those skilled in the art may make various changes in form and detail without departing from the spirit of the present invention. For example, software may enable the function, fabrication, modeling, simulation, description and/or testing of the apparatus and methods described in the embodiments of the present invention, through general programming languages (C, C++), hardware description languages (HDL, including Verilog HDL, VHDL, etc.) or other available programming languages. Such software may be disposed on any known computer-usable medium, such as magnetic tape, semiconductor, magnetic disk or optical disc (e.g., CD-ROM, DVD-ROM, etc.), or transmitted via the Internet or wired, wireless or other communication media. Embodiments of the apparatus and methods of the present invention may be included in a semiconductor intellectual property core, such as a microprocessor core (embodied in HDL), and converted into hardware as integrated circuit products. In addition, the apparatus and methods of the present invention may be realized by a combination of hardware and software. Therefore, the present invention should not be limited to the disclosed embodiments, but is defined according to the claims of the present invention and their equivalents. In particular, the present invention may be implemented in a microprocessor device used in a general-purpose computer. Finally, although the present invention is disclosed above with preferred embodiments, they are not intended to limit the scope of the present invention; those skilled in the art may make some changes and refinements without departing from the spirit and scope of the present invention, and the scope of protection of the present invention is therefore defined by the claims of the present invention.

Claims (24)

1. A prefetch unit, disposed in a microprocessor, comprising:
a plurality of period match counters, each corresponding to one of a plurality of different pattern periods; and
control logic, configured to:
update the plurality of period match counters in response to accesses to a memory block by the microprocessor;
determine a clear pattern period based on the count values of the plurality of period match counters; and
prefetch into the microprocessor not-yet-prefetched ones of a plurality of cache lines within the memory block, based on a pattern having the clear pattern period determined from the plurality of period match counters.
2. The prefetch unit of claim 1, further comprising:
a bitmask register, having a bit corresponding to each of the plurality of cache lines within the memory block;
wherein the control logic updates the bitmask register in response to accesses to the memory block, to indicate which of the plurality of cache lines within the memory block have been accessed; and
wherein the control logic determines the pattern having the clear pattern period from the bitmask register.
3. The prefetch unit of claim 2, further comprising:
a middle pointer register, having a value calculated by the control logic in response to accesses to the memory block, pointing to the middle one of the plurality of cache lines within the memory block that the bitmask register indicates have been accessed; and
wherein each of the plurality of different pattern periods is a number of bits to the left or right of the middle pointer register.
4. The prefetch unit of claim 3, wherein to update the plurality of period match counters, the control logic performs the following for each of the plurality of period match counters:
incrementing the count value of the period match counter when the N bits to the left of the middle pointer register match the N bits to the right of the middle pointer register;
wherein N is the one of the plurality of different pattern periods corresponding to the period match counter.
5. The prefetch unit of claim 3, further comprising:
a minimum pointer register, pointing to the lowest one of the plurality of cache lines within the memory block that have been accessed;
a maximum pointer register, pointing to the highest one of the plurality of cache lines within the memory block that have been accessed; and
wherein the control logic maintains the value of the middle pointer register as the average of the values of the minimum pointer register and the maximum pointer register.
6. The prefetch unit of claim 3, wherein the pattern having the clear pattern period that the control logic determines from the bitmask register is the clear pattern period's worth of bits of the bitmask register to the left or right of the middle pointer register.
7. The prefetch unit of claim 1, wherein to determine the clear pattern period based on the plurality of period match counters, the control logic is further configured to:
determine whether the difference between the count value of one of the plurality of period match counters and the count values of the others of the plurality of period match counters is greater than a predetermined value.
8. The prefetch unit of claim 1, wherein the plurality of different pattern periods comprises at least three different pattern periods.
9. The prefetch unit of claim 1, wherein the control logic prefetches the not-yet-prefetched cache lines into the microprocessor only once at least nine of the cache lines within the memory block have been accessed.
10. The prefetch unit of claim 1, wherein the control logic prefetches the not-yet-prefetched cache lines into the microprocessor only once at least ten of the plurality of cache lines within the memory block have been accessed.
11. The prefetch unit of claim 1, wherein to prefetch into the microprocessor the not-yet-prefetched cache lines within the memory block based on the pattern having the clear pattern period, the control logic is further configured to:
position the pattern, at a search pointer register, beyond the bitmask register; and
make a prediction for each bit in the pattern, the prediction comprising: when the bit is set, predicting that the cache line corresponding to the bit is needed; and when the cache line corresponding to the bit is needed, determining whether the cache line has not yet been prefetched, wherein the cache line is determined to be not yet prefetched when the bit of the bitmask register corresponding to the bit in the pattern indicates that the cache line has not yet been accessed.
12. The prefetch unit of claim 11, wherein to prefetch into the microprocessor the not-yet-prefetched cache lines within the memory block based on the pattern having the clear pattern period, the control logic is further configured to:
prefetch a cache line that the control logic predicts is needed and that has not yet been prefetched only when the cache line lies within a predetermined value of the end of the cache lines within the memory block that have been accessed.
13. A data prefetching method, comprising:
updating, by a microprocessor, a plurality of period match counters in response to accesses to a memory block, wherein the plurality of period match counters each correspond to one of a plurality of different pattern periods;
determining a clear pattern period based on the count values of the plurality of period match counters; and
prefetching not-yet-prefetched ones of a plurality of cache lines within the memory block, based on a pattern having the clear pattern period determined from the plurality of period match counters.
14. The data prefetching method of claim 13, further comprising:
updating a bitmask register in response to accesses to the memory block, to indicate which of the plurality of cache lines within the memory block have been accessed, wherein the bitmask register has a bit corresponding to each of the cache lines within the memory block; and
determining the pattern having the clear pattern period from the bitmask register.
15. The data prefetching method of claim 14, further comprising:
calculating a middle pointer in response to accesses to the memory block, the middle pointer pointing to the middle one of the plurality of cache lines within the memory block that the bitmask register indicates have been accessed; and
wherein each of the plurality of different pattern periods is a number of bits to the left or right of the middle pointer register.
16. The data prefetching method of claim 15, wherein said updating the plurality of period match counters further comprises performing the following for each of the plurality of period match counters:
incrementing the count value of the period match counter when the N bits to the left of the middle pointer register match the N bits to the right of the middle pointer register;
wherein N is the one of the plurality of different pattern periods corresponding to the period match counter.
17. The data prefetching method of claim 15, further comprising:
maintaining a minimum pointer register, so that the minimum pointer register points to the lowest one of the plurality of cache lines within the memory block that have been accessed;
maintaining a maximum pointer register, so that the maximum pointer register points to the highest one of the plurality of cache lines within the memory block that have been accessed; and
maintaining the value of the middle pointer register as the average of the values of the minimum pointer register and the maximum pointer register.
18. The data prefetching method of claim 15, wherein the pattern having the clear pattern period determined from the bitmask register is the clear pattern period's worth of bits of the bitmask register to the left or right of the middle pointer register.
19. The data prefetching method of claim 13, wherein said determining the clear pattern period based on the plurality of period match counters further comprises:
determining whether the difference between the count value of one of the plurality of period match counters and the count values of the others of the plurality of period match counters is greater than a predetermined value.
20. The data prefetching method of claim 13, wherein the plurality of different pattern periods comprises at least three different pattern periods.
21. The data prefetching method of claim 13, wherein said prefetching of the not-yet-prefetched cache lines into the microprocessor is performed only once at least nine of the plurality of cache lines within the memory block have been accessed.
22. The data prefetching method of claim 13, wherein said prefetching of the not-yet-prefetched cache lines into the microprocessor is performed only once at least ten of the plurality of cache lines within the memory block have been accessed.
23. The data prefetching method of claim 13, wherein said prefetching, based on the pattern having the clear pattern period, of the not-yet-prefetched cache lines within the memory block further comprises:
positioning the pattern, at a search pointer register, beyond the bitmask register; and
making a prediction for each bit in the pattern, the prediction comprising: when the bit is set, predicting that the cache line corresponding to the bit is needed; and when the cache line corresponding to the bit is needed, determining whether the cache line has not yet been prefetched, wherein the cache line is determined to be not yet prefetched when the bit of the bitmask register corresponding to the bit in the pattern indicates that the cache line has not yet been accessed.
24. The data prefetching method of claim 23, wherein said prefetching, based on the pattern having the clear pattern period, of the not-yet-prefetched cache lines within the memory block further comprises:
prefetching a cache line predicted as needed and determined to be not yet prefetched only when the cache line lies within a predetermined value of the end of the plurality of cache lines within the memory block that have been accessed.
CN201510494634.1A 2010-03-29 2011-03-29 Pre-fetch unit and data prefetching method Active CN105183663B (en)

Applications Claiming Priority (9)

Application Number Priority Date Filing Date Title
US31859410P 2010-03-29 2010-03-29
US61/318,594 2010-03-29
US13/033,765 US8762649B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher
US13/033,848 US8719510B2 (en) 2010-03-29 2011-02-24 Bounding box prefetcher with reduced warm-up penalty on memory block crossings
US13/033,848 2011-02-24
US13/033,765 2011-02-24
US13/033,809 US8645631B2 (en) 2010-03-29 2011-02-24 Combined L2 cache and L1D cache prefetcher
US13/033,809 2011-02-24
CN201110077108.7A CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201110077108.7A Division CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor

Publications (2)

Publication Number Publication Date
CN105183663A true CN105183663A (en) 2015-12-23
CN105183663B CN105183663B (en) 2018-11-27

Family

ID=44490596

Family Applications (4)

Application Number Title Priority Date Filing Date
CN201510494634.1A Active CN105183663B (en) 2010-03-29 2011-03-29 Pre-fetch unit and data prefetching method
CN201510101351.6A Active CN104615548B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor
CN201110077108.7A Active CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor
CN201510101303.7A Active CN104636274B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor

Family Applications After (3)

Application Number Title Priority Date Filing Date
CN201510101351.6A Active CN104615548B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor
CN201110077108.7A Active CN102169429B (en) 2010-03-29 2011-03-29 Pre-fetch unit, data prefetching method and microprocessor
CN201510101303.7A Active CN104636274B (en) 2010-03-29 2011-03-29 Data prefetching method and microprocessor

Country Status (2)

Country Link
CN (4) CN105183663B (en)
TW (5) TWI506434B (en)

Families Citing this family (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8959320B2 (en) * 2011-12-07 2015-02-17 Apple Inc. Preventing update training of first predictor with mismatching second predictor for branch instructions with alternating pattern hysteresis
US9442759B2 (en) * 2011-12-09 2016-09-13 Nvidia Corporation Concurrent execution of independent streams in multi-channel time slice groups
WO2013089682A1 (en) 2011-12-13 2013-06-20 Intel Corporation Method and apparatus to process keccak secure hashing algorithm
US10146545B2 (en) 2012-03-13 2018-12-04 Nvidia Corporation Translation address cache for a microprocessor
US9880846B2 (en) 2012-04-11 2018-01-30 Nvidia Corporation Improving hit rate of code translation redirection table with replacement strategy based on usage history table of evicted entries
US10241810B2 (en) 2012-05-18 2019-03-26 Nvidia Corporation Instruction-optimizing processor with branch-count table in hardware
US20140189310A1 (en) 2012-12-27 2014-07-03 Nvidia Corporation Fault detection in instruction translations
CN104133780B (en) * 2013-05-02 2017-04-05 华为技术有限公司 A kind of cross-page forecasting method, apparatus and system
US9891916B2 (en) * 2014-10-20 2018-02-13 Via Technologies, Inc. Dynamically updating hardware prefetch trait to exclusive or shared in multi-memory access agent system
CN105653199B (en) * 2014-11-14 2018-12-14 群联电子股份有限公司 Method for reading data, memory storage apparatus and memorizer control circuit unit
US10387318B2 (en) * 2014-12-14 2019-08-20 Via Alliance Semiconductor Co., Ltd Prefetching with level of aggressiveness based on effectiveness by memory access type
US10152421B2 (en) * 2015-11-23 2018-12-11 Intel Corporation Instruction and logic for cache control operations
CN106919367B (en) * 2016-04-20 2019-05-07 上海兆芯集成电路有限公司 Detect the processor and method of modification program code
US10579522B2 (en) * 2016-09-13 2020-03-03 Andes Technology Corporation Method and device for accessing a cache memory
US10353601B2 (en) * 2016-11-28 2019-07-16 Arm Limited Data movement engine
US10725685B2 (en) 2017-01-19 2020-07-28 International Business Machines Corporation Load logical and shift guarded instruction
US10496311B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Run-time instrumentation of guarded storage event processing
US10452288B2 (en) 2017-01-19 2019-10-22 International Business Machines Corporation Identifying processor attributes based on detecting a guarded storage event
US10579377B2 (en) 2017-01-19 2020-03-03 International Business Machines Corporation Guarded storage event handling during transactional execution
US10496292B2 (en) 2017-01-19 2019-12-03 International Business Machines Corporation Saving/restoring guarded storage controls in a virtualized environment
US10732858B2 (en) 2017-01-19 2020-08-04 International Business Machines Corporation Loading and storing controls regulating the operation of a guarded storage facility
CN109857786B (en) * 2018-12-19 2020-10-30 成都四方伟业软件股份有限公司 Page data filling method and device
CN111797052B (en) * 2020-07-01 2023-11-21 上海兆芯集成电路股份有限公司 System single chip and system memory acceleration access method
CN112416437B (en) * 2020-12-02 2023-04-21 海光信息技术股份有限公司 Information processing method, information processing device and electronic equipment
WO2022233391A1 (en) * 2021-05-04 2022-11-10 Huawei Technologies Co., Ltd. Smart data placement on hierarchical storage
CN114116529A (en) * 2021-12-01 2022-03-01 上海兆芯集成电路有限公司 Fast loading device and data caching method

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5003471A (en) * 1988-09-01 1991-03-26 Gibson Glenn A Windowed programmable data transferring apparatus which uses a selective number of address offset registers and synchronizes memory access to buffer
SE515718C2 (en) * 1994-10-17 2001-10-01 Ericsson Telefon Ab L M Systems and methods for processing memory data and communication systems
US6484239B1 (en) * 1997-12-29 2002-11-19 Intel Corporation Prefetch queue
US6810466B2 (en) * 2001-10-23 2004-10-26 Ip-First, Llc Microprocessor and method for performing selective prefetch based on bus activity level
US7310722B2 (en) * 2003-12-18 2007-12-18 Nvidia Corporation Across-thread out of order instruction dispatch in a multithreaded graphics processor
US7237065B2 (en) * 2005-05-24 2007-06-26 Texas Instruments Incorporated Configurable cache system depending on instruction type
US20070186050A1 (en) * 2006-02-03 2007-08-09 International Business Machines Corporation Self prefetching L2 cache mechanism for data lines
US8103832B2 (en) * 2007-06-26 2012-01-24 International Business Machines Corporation Method and apparatus of prefetching streams of varying prefetch depth
US8161243B1 (en) * 2007-09-28 2012-04-17 Intel Corporation Address translation caching and I/O cache performance improvement in virtualized environments
US7890702B2 (en) * 2007-11-26 2011-02-15 Advanced Micro Devices, Inc. Prefetch instruction extensions
US8140768B2 (en) * 2008-02-01 2012-03-20 International Business Machines Corporation Jump starting prefetch streams across page boundaries
US7958317B2 (en) * 2008-08-04 2011-06-07 International Business Machines Corporation Cache directed sequential prefetch
US8402279B2 (en) * 2008-09-09 2013-03-19 Via Technologies, Inc. Apparatus and method for updating set of limited access model specific registers in a microprocessor
US9032151B2 (en) * 2008-09-15 2015-05-12 Microsoft Technology Licensing, Llc Method and system for ensuring reliability of cache data and metadata subsequent to a reboot
CN101667159B (en) * 2009-09-15 2012-06-27 威盛电子股份有限公司 High speed cache system and method of trb

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040003179A1 (en) * 2002-06-28 2004-01-01 Fujitsu Limited Pre-fetch control device, data processing apparatus and pre-fetch control method
CN101689144A (en) * 2007-06-19 2010-03-31 富士通株式会社 Information processor and cache control method
CN101078979A (en) * 2007-06-29 2007-11-28 东南大学 Storage control circuit with multiple-passage instruction pre-fetching function
CN101539853A (en) * 2008-03-21 2009-09-23 富士通株式会社 Information processing unit, program, and instruction sequence generation method
CN101887360A (en) * 2009-07-10 2010-11-17 威盛电子股份有限公司 The data pre-acquisition machine of microprocessor and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102253362B1 (en) * 2020-09-22 2021-05-20 쿠팡 주식회사 Electronic apparatus and information providing method using the same
US11182297B1 (en) 2020-09-22 2021-11-23 Coupang Corp. Electronic apparatus and information providing method using the same
KR102366011B1 (en) * 2020-09-22 2022-02-23 쿠팡 주식회사 Electronic apparatus and information providing method using the same
US11544195B2 (en) 2020-09-22 2023-01-03 Coupang Corp. Electronic apparatus and information providing method using the same

Also Published As

Publication number Publication date
CN105183663B (en) 2018-11-27
TWI547803B (en) 2016-09-01
TW201624289A (en) 2016-07-01
TWI574155B (en) 2017-03-11
CN104636274B (en) 2018-01-26
TWI519955B (en) 2016-02-01
CN104615548A (en) 2015-05-13
TWI534621B (en) 2016-05-21
CN104636274A (en) 2015-05-20
CN102169429B (en) 2016-06-29
TW201135460A (en) 2011-10-16
TW201447581A (en) 2014-12-16
TW201535119A (en) 2015-09-16
CN102169429A (en) 2011-08-31
TW201535118A (en) 2015-09-16
TWI506434B (en) 2015-11-01
CN104615548B (en) 2018-08-31

Similar Documents

Publication Publication Date Title
CN102169429B (en) Pre-fetch unit, data prefetching method and microprocessor
CN102498477B (en) TLB prefetching
CN100517274C (en) Cache memory and control method thereof
US5694568A (en) Prefetch system applicable to complex memory access schemes
CN100501739C (en) Method and system using stream prefetching history to improve data prefetching performance
US7640420B2 (en) Pre-fetch apparatus
US8677049B2 (en) Region prefetcher and methods thereof
US20050268076A1 (en) Variable group associativity branch target address cache delivering multiple target addresses per cache line
CN105700857A (en) Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
CN105701033A (en) Multi-mode set associative cache memory dynamically configurable to selectively select one or a plurality of its sets depending upon mode
CN105701031A (en) Multi-mode set associative cache memory dynamically configurable to selectively allocate into all or subset or tis ways depending on mode
CN105700856A (en) Prefetching with level of aggressiveness based on effectiveness by memory access type
US7047362B2 (en) Cache system and method for controlling the cache system comprising direct-mapped cache and fully-associative buffer
CN104871144A (en) Speculative addressing using a virtual address-to-physical address page crossing buffer
CN101833517B (en) Quick memory system and its access method
TWI469044B (en) Hiding instruction cache miss latency by running tag lookups ahead of the instruction accesses
US20070260862A1 (en) Providing storage in a memory hierarchy for prediction information
JP2007272681A (en) Cache memory device, and method for replacing cache line in same
CN115934170A (en) Prefetching method and device, prefetching training method and device, and storage medium
US11048637B2 (en) High-frequency and low-power L1 cache and associated access technique
CN101916184B (en) Method for updating branch target address cache in microprocessor and microprocessor
CN114450668A (en) Circuit and method
US20230205699A1 (en) Region aware delta prefetcher
CN101833518A (en) The data of microprocessor and microprocessor are got access method soon
GR20180200044U (en) Memory address translation

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant