GB2456405A - Managing fetch and store requests in a cache pipeline - Google Patents

Managing fetch and store requests in a cache pipeline

Info

Publication number
GB2456405A
GB2456405A GB0822457A GB0822457A GB2456405A GB 2456405 A GB2456405 A GB 2456405A GB 0822457 A GB0822457 A GB 0822457A GB 0822457 A GB0822457 A GB 0822457A GB 2456405 A GB2456405 A GB 2456405A
Authority
GB
United Kingdom
Prior art keywords
store
request
cache
fetch
cycles
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB0822457A
Other versions
GB0822457D0 (en)
GB2456405B (en)
Inventor
Christian Jacobi
Simon Fabel
Matthias Pflanz
Hans-Werner Tast
Hanno Ulrich
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp filed Critical International Business Machines Corp
Publication of GB0822457D0 publication Critical patent/GB0822457D0/en
Publication of GB2456405A publication Critical patent/GB2456405A/en
Application granted granted Critical
Publication of GB2456405B publication Critical patent/GB2456405B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0844Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F12/0855Overlapped cache accessing, e.g. pipeline
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893Caches characterised by their organisation or structure

Abstract

In a cache accessed under the control of a cache pipeline (14), store requests are managed in a store queue (10) and read requests are managed in a read queue (12), respectively, and a prioritization logic (18) decides if a read request or a store request is to be forwarded to the cache pipeline (14). The prioritization logic (62) aborts a store request that has started if a fetch request arrives within a predetermined store abort window and grants cache access to the arrived fetch request. When the fetch request no longer requires the input stage of the cache pipeline, a control mechanism repeats the access control of the aborted store request for a further trial to access the pipeline (14). Preferably, the store abort window spans 3 to 7 cycles, preferably 4 or 5 cycles, and starts after 2 to 4 cycles, preferably 3 cycles.

Description

DESCRIPTION
Store Aborting
1. BACKGROUND OF THE INVENTION
1.1. FIELD OF THE INVENTION
The present invention relates to the field of computer processor technology, and in particular to the performance of caches which are accessed under the control of a cache pipeline.
1.2. DESCRIPTION AND DISADVANTAGES OF PRIOR ART
The prior-art IBM eServer z990 microprocessor is a high-performing mainframe processor. An overview of its superscalar microarchitecture is given in the IBM Journal of Research and Development, Volume 48, No. 3/4, May/July 2004, pages 295-309. Although it is a superscalar microprocessor, it executes instructions in strict architectural order. It makes up for this by having a shorter pipeline and much larger caches and translation lookaside buffers compared with other processors, along with other performance-enhancing features.
Figure 1 gives a schematic overview of this prior-art cache access. A store request queue 10 and a fetch request queue 12 manage the respective requests for store instructions and fetch instructions, which must enter a cache pipeline 14 that is used in the prior art for accessing the cache 16. A store request stores data in the cache 16, whereas a fetch request reads some data from the cache. A priority select logic 18 manages which of those requests, i.e. a store request, a fetch request or possibly a further type of request, is granted access to the cache pipeline.
The characteristic of this prior-art pipelined cache access through the cache pipeline 14 that is relevant for the present invention is that a cache access takes place in a fixed pipe cycle and that each request entering that pipe cycle necessarily performs its cache access. Access to the pipe facility is serialized via an arbitration scheme implemented in the priority logic 18. To prevent collisions in the cache arrays, after a request has been granted pipe access, an arbiter of logic 18 blocks pipe access for as long as the current request busies the cache arrays.
There are two essential types of cache accesses, fetch and store requests. Crucial for system level performance is the performance of fetch accesses to the cache, whereas store performance does not directly influence system level performance. However, since store requests are considerably more frequent than fetch requests (typically by a factor of 10 for an L2 cache when the L1 cache is of the store-through type), they can block fetch requests away from pipe access and therefore indirectly impact system level performance. The state-of-the-art measure to prevent stores from constraining fetch performance is to arbitrate fetch requests with higher priority than store requests.
In modern high-frequency computer designs, however, this prior-art measure proves not to be sufficient any more. With steadily increasing processor frequency, the fetch performance starts to suffer from store performance and, without additional measures, the adverse impact on system performance can be as high as 1 to 2%. Such a performance impact may occur for caches in the storage hierarchy of IBM's zSeries systems, or of other processors.
Figure 2 is a pipeline diagram showing the resource allocations for store requests and fetch requests dependent on the pipe cycle number, denoted A3..A31. The upper box shows the cycle number, the second box gives a descriptive schematic of current state-of-the-art fetch and store requests, the third box denotes a hypothetical ideal state not reached in the prior art, and the bottommost box illustrates the delay for a fetch request which comes too late relative to a store request, i.e., the worst case of the "earliest fetch".
The right column shows the delay potential: on the left in total, on the right per store request (as there is only a single store, the numbers for 'total' and 'per store' are the same).
The fields 'Ft' and 'St' are the arbitration cycles. The fields D1...D6 are cycles where the directory is used, F1...F16 are cycles where the cache array is used for a fetch operation, and S1...S4 are cycles where the cache array is used for a store operation. In any one cycle the cache array can be used either for a store or for a fetch, i.e. the array has a single access port only.
The cause of the problem is that store requests access the cache array in a later pipe cycle relative to the directory access than fetch requests, see the arrows in the second box. Fetch requests read the cache directory - actions denoted as D1...D6 - and access the cache arrays to fetch their data in parallel, see the actions denoted as F1...F16.
Store requests in contrast, after reading the cache directory in actions D1 and D2, need to calculate the store destination based on the directory content (time interval 22), and then write the data to the cache, see the actions S1...S4.
To avoid cache collisions, in state-of-the-art cache designs, the arbiter, after a store request has been granted pipe access, blocks pipe access for any other cache request until the store data have been written to cache. As a consequence, upon a store, the cache appears busy not only for the actual cache array busy time caused by the store writing to the arrays but also for the
number of cycles within interval 22 by which a store access to the cache arrays occurs later in the cache access pipeline than a fetch access. This difference between a store and a fetch access, and hence the size of the blocking window 22, increases from system generation to generation as the frequency of the technology increases.
In the following, let n denote the size of the blocking window. Then n is the maximum number of cycles a fetch request, arriving after a store request has been granted, can be blocked away from pipe access. The average delay for a fetch arriving in the blocking window is (n+1)/2 cycles. The impact on performance is of the order of n² because the probability that a fetch request arrives within the blocking window also increases with the window length n.
The following is a formula for the size of the blocking window 22 in a hypothetical, straightforward example, if the above-mentioned prior-art z990 processor were operated with an increased blocking window as discussed above.
Let C0 be the pipe cycle in which a request gets granted for cache access, and let Cf and Cs be the pipe cycles in which a fetch and a store, respectively, access the cache. Let bs denote the number of cycles a store busies the cache arrays to write its data.
Then, when a store has been granted in C0, the earliest cycle in which a fetch arriving in C1 or later can be granted is Cs + bs - Cf = bs + d, with d denoting the difference Cs - Cf between a store request's and a fetch request's cache access cycle. In other words, the size of the blocking window is n = d + bs - 1 cycles.
In this example the store busy time is bs = 4 cycles and the cache access difference between fetches and stores is d = 8 cycles.
This means that after a store has been granted, fetches are blocked away from cache access for a maximum of n = 11 cycles rather than the 3 cycles one would expect from bs = 4. Compared
to a hypothetical optimum design with d = 0, see the third box, the performance impact in this example - see the bottommost box of figure 2 - is about 2 additional queuing cycles, on average, for a fetch accessing the cache.
Figure 2 compares the state-of-the-art solution (box 4) to the hypothetical optimum design with d = 0 (box 3): A fetch arriving in cycle A4 will be delayed by 11 cycles compared to only 5 cycles in the ideal case, see the bidirectional arrow. When all cycles are equally likely for a fetch to arrive, the average delay a fetch suffers is proportional to the sum of the delays for each cycle, hence to 66 cycles in state-of-the-art solutions compared to only 15 cycles in the hypothetical optimum case.
In practice, fetches and stores are clustered and not uniformly distributed. Therefore, modeling tools are required to determine the actual average delay of a fetch request. Modeling has shown that in real user environments, fetches accessing the cache can be delayed by 2 additional queuing cycles on average, and that the resulting impact on system level performance can be as high as 2%.
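The arithmetic above can be checked directly. The following short Python sketch (not part of the patent; the function and variable names are illustrative) reproduces the blocking-window size and the delay-potential figures quoted for the example:

```python
# Minimal sketch (not from the patent): blocking-window arithmetic of the
# z990-style example, assuming a fetch is equally likely to arrive in any
# cycle of the blocking window.

def blocking_window(d: int, bs: int) -> int:
    """n = d + bs - 1: cycles during which a granted store blocks fetches."""
    return d + bs - 1

def delay_potential(max_delay: int) -> int:
    """Sum of the per-cycle fetch delays max_delay, max_delay - 1, ..., 1."""
    return sum(range(1, max_delay + 1))

d, bs = 8, 4                      # cache access difference and store busy time
n = blocking_window(d, bs)
print(n, delay_potential(n))      # 11 66 -> the 66-cycle total quoted above
print(delay_potential(5))         # 15    -> hypothetical optimum case with a
                                  #          maximum fetch delay of 5 cycles
```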
1.3. OBJECTIVES OF THE INVENTION
The objective of the present invention is to provide an improved method and respective system for controlling the access of a store request to a cache pipeline.
2. SUMMARY AND ADVANTAGES OF THE INVENTION
This objective of the invention is achieved by the features stated in enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims. Reference should now be made to the appended claims.
The invention presented here resolves a performance problem encountered in pipelined cache structures where fetch accesses are liable to be impeded by store accesses.
According to the most basic aspect of this invention, a method and respective system is disclosed for operating processor caches accessed under the control of a cache pipeline, wherein store requests are managed in a store queue, and read requests are managed in a read queue, respectively, wherein both of the queues compete for accessing the cache pipeline, and a prioritization logic decides if a read request or a store request is to be forwarded to said cache pipeline,
in which the method is characterised by the steps of:
a) aborting a store request, if a fetch request arrives within a predetermined store abort window,
b) granting access to said arrived fetch request, and
c) using a control mechanism, in order to repeat the access control of the aborted store request for a further trial to access the pipeline (14), when said fetch request does not require the input stage of said cache pipeline any more.
Further advantageously, a speculative and a deferred read pointer pointing to store queue entries are maintained to make sure that a store request which has been aborted gets repeated. The speculative read pointer points to the store request which will be executed next unless a previous store gets aborted. It is forwarded to the next subsequent store queue position whenever the current store request enters the cache pipeline. At the very beginning, the deferred read pointer points to the same store queue entry as the speculative read pointer. Unlike the speculative read pointer, however, it only gets forwarded to the next store queue entry when the current store request has passed said store abort window without having been aborted.
Further advantageously, aborting of a store request is allowed for a window of 3 to 7 cycles, preferably 4 or 5 cycles, beginning after 2 to 4 cycles, preferably after 3 cycles.
The present invention concerning store aborting can advantageously be combined with a further functional feature referred to herein as store request grouping. This functional feature is summarized as follows and is further described with reference to figures 3, 4 and 5.
As a person skilled in the art may appreciate, store requests, also referred to as "Stores" in the drawings, access a cache in a later cycle than fetches. This tendency increases as processor frequencies are pushed higher.
To avoid collisions in the cache arrays, fetch requests arriving after a store request must be delayed by this difference. This delay of fetch accesses has an adverse impact on system level performance. In today's systems, it can be as high as 1 to 2%. The inventive method of grouping the store requests to some extent before granting access to the processor pipeline is a means to "hide" this delay and thereby save system performance. It cuts the performance impact by about one half.
The invention presented herein resolves the problem by reducing the number of cycles an affected fetch is blocked. More precisely, the inventive method and system insert a window into a store's fetch blocking window of size n = (d - 1) + bs (see above) within which the actual store can be aborted in case a new fetch request arrives and requests cache access.
Let w ≤ n denote the size of the store abort window. Then the maximum delay a fetch arriving within the blocking window n can suffer is about (n - w)/2, provided w divides n into two parts of roughly equal size. This is less than half of what it was before and, as a consequence, the average delay a fetch would suffer is also less than half of what it was before. As a side effect, the size of the blocking window per store is reduced by w. This in turn reduces the probability that a fetch request would be affected at all (see above).
The system architecture is enriched by some logic - ref. 32 in figure 3 - which is implemented within the priority select logic 18. Details of the control flow of this logic 32 are further described with reference to figure 4.
This control logic 32 is connected via a read and a write line to a counter 34 which detects the store requests when they come into the input of the store request queue 10. The basic procedure followed and controlled by logic 32 in communication with counter 34 and the store request queue 10 is to give priority to the stores if the current number of waiting store requests is greater than a predetermined level. In this example the trigger level is a count of 4. Otherwise a store request is ignored and a fetch request can be given priority for accessing the cache pipeline 14.
A preferred implementation of this inventive control logic 18, 32 can, for example, be integrated into the store queue of a cache of the IBM z6 system. Figure 5 illustrates the scheme as already known from figure 2, when it is enriched by the inventive method.
The problem reduced by this invention is caused by the difference in pipeline length between a cache fetch and a cache store. The fetch performs the cache directory read and cache data array read in parallel. From the cache data array all possible data for all ways of the n-way-set-associative cache are read. A late select based on the directory read results selects one of the 'n' data.
In contrast to the late select performed for the cache fetch, a store to the cache needs an early select: First the directory is read and based on the result the write of the cache data array can start.
The difference between the late select and the early select is 8 cycles in the preferred implementation.
The "Pure Store aborting" is described with reference to figures 6, 7, and 8.
The combination of store aborting with store grouping is described with reference to figures 9 and 10.
Summary of Store Grouping:
In short, the present invention discloses an inventive method and system which resolves the above problem by reducing the size of the fetch blocking window. More precisely, since the window size is a consequence of physical conditions which cannot be changed, we reduce the effective size per store by gathering stores in groups:
Rather than requesting cache access for each single store,
stores are grouped together and cache access is requested once for the whole group. The individual stores in the group access the cache back-to-back.
That way, the fetch blocking windows of each of the stores in the store group overlap and the effective window per store gets reduced accordingly: With k stores per group and with all stores accessing the cache back-to-back, the total blocking window is nk = n + (k-1)·bs, where n = n1 = (d - 1) + bs is the blocking window size caused by a single store (k = 1, see above). Per single store, the window size thus reduces from n1 = (d - 1) + bs to nk/k = (d - 1)/k + bs.
The above objective of the invention is achieved by the features stated in enclosed independent claims. Further advantageous arrangements and embodiments of the invention are set forth in the respective subclaims. Reference should now be made to the appended claims.
According to the broadest aspect of the present invention a method and respective system for operating processor caches accessed under the control of a cache pipeline is disclosed, wherein store requests are managed in a store queue, and read requests are managed in a read queue, respectively, wherein both of the queues compete for accessing the cache pipeline, and a prioritization logic decides if a read request or a store request is to be forwarded to the cache pipeline, wherein the method is characterised by the steps of:
a) halting the processing of store requests until
a1) a group of at least a predetermined minimum number of store requests has been accumulated in the store queue for being granted access to the cache pipeline, or
a2) a timeout happens, defined by a programmable timeout counter, or
a3) a fetch request requests data that currently resides in the store queue - and thus has to be written to the cache array before the fetch can be granted,
b) when the minimum number of store requests has been accumulated, forwarding the group of store requests for accessing the cache processing pipe for being processed in an overlapping form, such that if no fetch request arrives during this shortened period, the group of store requests is processed without an interruption, and
c) operating the cache pipeline with the group of store requests according to prior art.
The inventive method thus resolves the above-mentioned performance problem encountered in pipelined cache structures where fetch accesses are liable to be impeded by store accesses.
As a person skilled in the art may appreciate, store requests, also referred to as "Stores" in the drawings, access a cache in a later cycle than fetches. This tendency increases as processor frequencies are pushed higher.
To avoid collisions in the cache arrays, fetch requests arriving after a store request must be delayed by this difference. This delay of fetch accesses has an adverse impact on system level performance. In today's systems, it can be as high as 1 to 2%. The inventive method of grouping the store requests to some extent before granting access to the processor pipeline is a means to "hide" this delay and thereby save system performance. It cuts the performance impact by about one half.
3. BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 illustrates the most basic structural components of a prior art pipelined cache access logic,
Figure 2 illustrates a schematic diagram of actions over cycles showing the different processing actions of store requests and fetch requests dependent on the pipe cycle number according to a prior art circuit as shown in figure 1,
Figure 3 illustrates the most basic structural components of an inventive pipelined cache access logic including an inventive store grouping control logic
Figure 4 illustrates the control flow of the most important steps of a preferred embodiment of the inventive method implemented within the prioritization logic, and
Figure 5 illustrates a scheme according to figure 2, when implementing the inventive store grouping logic.
Figure 6A illustrates the most basic structural components of an inventive pipelined cache access logic including an inventive store aborting control logic
Figure 6B illustrates a more detailed view into the inventive pipelined cache access logic from figure 6A,
Figure 7 illustrates the control flow of the most important steps of a preferred embodiment of the inventive method implemented within the prioritization logic, and
Figure 8 illustrates a cycle scheme according to figure 2, when implementing the inventive store aborting logic.
Figure 9 illustrates a cycle scheme according to figures 2, 5 and 8, and
Figure 10 illustrates a cycle scheme combining the effects of store aborting and store grouping according to a preferred embodiment of the inventive method.
4. DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
With general reference to the figures and with special reference now to figure 3, the system architecture view is enriched by some logic 32, which is implemented within the priority select logic 18. Details of the control flow of this logic 32 are further described with reference to figure 4.
This control logic 32 is connected with a read and write line to a counter 34 which detects the store requests when they come
into the input of the store request queue 10. The basic procedure followed and controlled by logic 32 in communication with counter 34 and the store request queue 10 is to request priority if the current number of waiting store requests is greater than a predetermined level. In this example the trigger level is the count of 4. Otherwise a store request is ignored.
A preferred implementation of this inventive control logic 18, 32 can, for example, be integrated into the store queue of a cache of a successor product of the prior-art z990 system.
Figure 5 illustrates the scheme as already known from figure 2, when it is enriched by the inventive method.
The store queue of the cache in this system has a depth of 16 entries. A state-of-the-art design requests arbitration when at least one entry of the store queue is used (= valid).
To ensure that groups of store requests are performed instead of evenly spread-out single stores as known from the prior art, a threshold value is implemented. The store grouping threshold is programmable (0 to 15). As long as the number of stores in the store queue is smaller than or equal to the threshold, no arbitration is requested. Once the number of valid entries in the store queue is bigger than the threshold, arbitration is requested for all store requests (one after the other), until the store queue is empty.
A preferred default value for the store grouping threshold is >4 entries. As soon as 5 or more stores are in the store queue, which is sensed via a sense line 36 in figure 3, arbitration is requested. If a fetch has dependencies on any store request in the store queue, arbitration is requested regardless of the store grouping threshold. Also, if no store was put into the store queue for a predetermined 'grouping timeout' number of cycles, arbitration is requested for any store that is in the store
queue even if the number of stores is smaller than the threshold value. The grouping timeout has a default value of 750 cycles, but it is also programmable.
For the purpose of completeness, the following alternative description of the inventive method is given as an addition to the section described before:
Accordingly, it is proposed to defer the request for pipeline prioritization until k stores have been stacked in the store queue, wherein k is the group size, and then process those k stores back-to-back, i.e., sequentially, yielding the above-stated improvement. In other words, the inventive method deliberately delays the processing of store requests until enough of them are available to process them more effectively without interruption by a fetch request.
According to figure 4, which illustrates the control flow of the most important steps of a preferred embodiment of the inventive method implemented within the prioritization logic, in a first step 510 the store request counter and the grouping timer are set to zero in order to initialize them. Then, in a further step 512 the store requests are examined in order to detect whether a fetch depends on a store held in the group. If a dependency is detected, control is forwarded to step 560, which gives immediate pipeline access priority to the store requests currently contained in the store request queue.
Then, in a further step 514 the grouping timer is checked to see whether the timeout has been reached. If the timeout is reached, control is also forwarded to step 560.
Then, in a checking step 520 a decision is taken as to whether a new store request has arrived at the store queue entrance. This is done via the sense line 36 in figure 3. In the Yes-branch of step 520 the store request counter value is increased by 1, step 530. In
the No-branch of decision 520 control is fed back to step 512 in order to recheck the timeout and "fetch dependencies".
Then, after increasing the counter, in a further step 540 the store request counter value is checked to determine whether it is greater than the predetermined trigger value; here, the value is set to 4. In the No-case the enqueued group of store requests is not yet large enough to be forwarded to the cache pipeline, so control is fed back to step 512 in order to recheck the grouping timeout and fetch dependencies.
Otherwise, in the Yes-branch of decision 540, the current value of the store request counter is "5" when the trigger value is "4", meaning that five store requests are currently enqueued in the store request queue. In this case the inventive grouping of the store requests takes effect: in a step 560 the control logic 32 requests arbitration from the priority logic 18. If no fetch or other request of higher priority requests access to the cache, the five or more store requests are moved into the cache pipeline in order to access the cache.
The additional checks 512 and 514 cover dependencies of fetches on store requests in the store queue and the grouping timeout, respectively.
If in step 512 a dependency of a fetch on a store inside the store queue is detected, arbitration is requested regardless of the store grouping threshold. In step 514, if no store was put into the store queue for the predetermined 'grouping timeout' number of cycles, arbitration is requested for any store request that is in the store request queue even if the number of stores is smaller than the threshold value. The grouping timeout has a default value of 750 cycles, but it is also programmable.
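As a rough illustration of the decision flow just described, the following Python sketch (not part of the patent; all variable and method names are assumptions) models the per-cycle arbitration-request decision of figure 4 with the default threshold of 4 and the default grouping timeout of 750 cycles:

```python
# Minimal sketch (not from the patent): the store-grouping arbitration
# decision of figure 4, evaluated once per cycle.

GROUP_THRESHOLD = 4      # programmable 0..15; default ">4" entries
GROUPING_TIMEOUT = 750   # programmable; default 750 cycles

class StoreGroupingControl:
    def __init__(self):
        self.store_count = 0      # step 510: counter initialised to zero
        self.idle_cycles = 0      # step 510: grouping timer initialised to zero

    def cycle(self, new_store: bool, fetch_depends_on_queued_store: bool) -> bool:
        """Returns True when arbitration shall be requested for the queued stores."""
        if new_store:                                  # steps 520/530
            self.store_count += 1
            self.idle_cycles = 0
        else:
            self.idle_cycles += 1

        if fetch_depends_on_queued_store:              # step 512: dependency found
            return self._request()
        if self.store_count and self.idle_cycles >= GROUPING_TIMEOUT:
            return self._request()                     # step 514: timeout reached
        if self.store_count > GROUP_THRESHOLD:         # step 540: group complete
            return self._request()
        return False                                   # keep accumulating stores

    def _request(self) -> bool:
        # Step 560: request arbitration; the store queue is then drained one
        # store after the other until it is empty.
        self.store_count = 0
        self.idle_cycles = 0
        return True
```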
Figure 5 shows the effect of the control flow described with reference to figure 4 above:
The second box shows again the prior-art behavior, with the fetch delay for the earliest fetch request arriving just when a store request has been accepted for entering the cache pipeline. This is the same as depicted in the bottom box of figure 2.
In the bottom box of figure 5 a further embodiment is sketched in which a group of four store requests is handled in succession. The store grouping threshold is set to ">3" instead of ">4" as described in the preceding example with reference to figure 4. The beginning of each of these successive store requests is marked by a respective arrow. Each store request covers a number of 11 cycles; this corresponds to the length of each store request frame 52.
The fetch delay lines indicate the respective number of delay cycles for a respective fetch request. The example shows that the number of cycles a fetch has to wait has an accumulated value of 180. But as 4 stores are processed during this time, only 45 cycles of delay are attributed statistically to each store. A fetch arriving in cycle A4 has to wait for 11 cycles, because only the 1st store is performed and the rest of the store group is not arbitrated.
Like in the state-of-the-art case, each store request covers a range of 14 cycles. Thereof, the first 11 cycles are cycles where no fetch can be given pipe access. This fetch delay area is depicted by the shaded frames 52. Like before, a fetch request arriving at cycle A4 will be delayed by 11 cycles due to a conflict with the group's first store. The fetch will suppress the group's subsequent stores 2 to 4. Accordingly, a fetch arriving in cycle A5 will be delayed by 10 cycles and so on.
However, differing from the state-of-the-art case, a fetch request arriving at cycle A8 will conflict with the group's second store request and will therefore be delayed by 11 cycles rather than by the 7 cycles the first store would impose.
Summing up the delays for each cycle, the total delay potential imposed by the store group is 180 cycles. Since 4 stores are grouped, the delay potential per single store is only 45 cycles as compared to the 66 cycles in the state-of-the-art case. The reason for the reduction is that the delay areas 52 of the individual stores in the store group now overlap in a systematic way.
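The 180-cycle figure can be reproduced with a small calculation. The sketch below (not from the patent; how the per-cycle delays compose is an illustrative assumption) sums the delays for a group of four back-to-back stores, where each of the first three stores exposes only the four cycles before the next store takes over, and the last store exposes its full 11-cycle window:

```python
# Minimal sketch (not from the patent): delay potential of a group of
# k = 4 stores executed back-to-back (figure 5), with a single store's
# blocking window of n = 11 cycles and bs = 4 store busy cycles.

k, n, bs = 4, 11, 4

# First k-1 stores: delays 11, 10, 9, 8 before the next store takes over;
# last store: its full 11..1 window.
delays = [n - i for _ in range(k - 1) for i in range(bs)] + list(range(n, 0, -1))

print(sum(delays))       # 180 cycles total delay potential for the group
print(sum(delays) / k)   # 45.0 cycles per store, versus 66 for a single store
```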
The skilled reader will understand that the problem reduced by this invention is caused by the difference in pipeline length between a cache fetch and a cache store. The fetch performs the cache directory read and cache data array read in parallel. From the cache data array all possible data for all ways of the n-way-set-associative cache are read.
A late select based on the directory read results selects one of the 'n' data.
In contrast to the late select performed for the cache fetch, a store to the cache needs an early select:
First the directory is read and based on the result the write of the cache data array can start.
The difference between the late select and the early select is 6 cycles in the implementation shown with reference to figure 2. The arbitration circuitry is optimized for the fetch access,
which adds 2 cycles for the store access counted from the arbitration cycle. The difference Cs - Cf is d = 6 + 2 = 8.
In an IBM zSeries example with bs = 4 and d = 8, the blocking window per store is reduced from n1 = 11 to n4/4 = 5.75 cycles when k = 4 stores are grouped together. The pipe queuing penalty for
fetches caused by stores and the resulting impact on system performance are both cut by about half.
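A quick numerical check of the per-store window formula nk/k = (d - 1)/k + bs with the example values is sketched below (not part of the patent text):

```python
# Minimal check (not from the patent) of the store-grouping arithmetic
# above, using the zSeries example values d = 8, bs = 4.

def window_per_store(d: int, bs: int, k: int) -> float:
    """Effective fetch blocking window per store when k stores are grouped:
    total window (d - 1) + k*bs, divided by the k stores in the group."""
    return (d - 1) / k + bs

d, bs = 8, 4
print(window_per_store(d, bs, 1))   # 11.0  -> single store, n1 = d + bs - 1
print(window_per_store(d, bs, 4))   # 5.75  -> four grouped stores
```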
The foregoing section described the aspect of pure store request grouping.
In the following, the aspect of pure store aborting will be described in more detail with reference to figures 6A, 6B, 7 and 8:
As the overview depictions of figures 6A and 8 illustrate, the system architecture of a preferred embodiment is enriched by some control logic 62, which is implemented within the priority select logic 18. This logic 62 aborts a store request within a predetermined store abort window in case a fetch request arrives before the window expires. In the sample figure 8, this window extends from cycle A7 through A10. When a fetch request arrives in time, the logic grants pipe access to the fetch request, aborts the current store request, and uses a control mechanism consisting of a speculative and a deferred read pointer pointing to store queue entries to ensure that all of the aborted store requests get repeated as soon as the fetch request allows.
Details of the circuit logic of this logic 62 are further described with reference to figures 6B and 7, illustrating a preferred implementation as implemented in the before-mentioned IBM zSeries example, wherein a 4-cycle store request abort window was implemented, starting at a store request's fourth pipe cycle. The basic functioning and its effects are depicted in figure 8:
The first row compartment in figure 8 shows the "as is" state-of-the-art solution: A store request's grant cycle is denoted by "St". In the next cycle, the store enters the pipe. In the sample picture, this happens at cycle A4. 10 cycles later, in cycle A14, the store request begins to write to the cache
arrays. The write operation takes four cycles through A17. Therefore, A18 is the first cycle a subsequent fetch request can read from the cache arrays. In the preferred implementation, a fetch request's read from the cache arrays starts in the request's third pipe cycle. Therefore, A15 is the first cycle a fetch request can be granted pipe access by the arbiter logic. As a consequence, a fetch request arriving between A4 and A14 must be delayed. For reasons not related to the current invention, the minimum delay a fetch request suffers in case of a conflict is three cycles. Therefore, a fetch request arriving between A4 and A12 will be delayed to A15 and requests arriving at A13 and A14 will be delayed by the base delay of three cycles. These delays are depicted in the row labeled "Fetch Delay". They sum up to a total delay potential of 69 cycles.
The second row compartment shows the effect of the inventive solution with a store abort window: The store abort window opens up in the store's fourth pipe cycle, which happens in A7, and it extends over four cycles through A10. Within this window, the store request can be aborted. Once this window is passed, the store request will necessarily complete its pipe pass. As a result, a fetch request arriving between A4 and A10 would enter the pipe and abort the current store request. Since A7 is the first cycle a fetch request can be granted pipe access, a fetch request arriving at A4 will be delayed to A7, a fetch request arriving at A5 will be delayed to A8 and so on. Therefore, three cycles is the maximum delay a fetch request suffers when arriving between A4 and A10.
Beyond the window, that is from A11 onward, the system behaves like in the "as is" case: An arrival at A11 will be delayed by four cycles and all later arrivals by three cycles. The delays sum up to a total delay potential of no more than 34 cycles. Compared to the "as is" case, the delay potential is about cut by half.
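The two delay-potential totals can be verified by simply summing the per-cycle delays just listed; the following sketch (not from the patent; the delay profiles are transcribed from the description of figure 8) does exactly that:

```python
# Minimal sketch (not from the patent): summing the per-cycle fetch delays
# of figure 8 for a store granted in A3 / entering the pipe in A4.

# "As is": arrivals A4..A12 are delayed to A15 (11 down to 3 cycles),
# arrivals A13 and A14 by the 3-cycle base delay.
as_is = [11, 10, 9, 8, 7, 6, 5, 4, 3, 3, 3]

# With the A7..A10 abort window: arrivals A4..A10 wait only the 3-cycle
# base delay, A11 waits 4 cycles, A12..A14 again 3 cycles.
with_abort = [3] * 7 + [4] + [3] * 3

print(sum(as_is))       # 69 cycles total delay potential
print(sum(with_abort))  # 34 cycles, roughly half of the "as is" case
```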
As a bottom-line result, the pipe queuing penalty for fetches caused by stores and the resulting impact on system performance
are both cut by about one half with the inventive solution as compared to the state-of-the-art solution.
In a state-of-the-art arbitration cycle, the store queue reads out the current store request and then steps its read pointer forward to set itself up for the next store request. In order to implement an improved mechanism according to this preferred embodiment of the present invention, the same is done for a so-called "speculative read-pointer" 64, see steps 710, 720. Additionally, a so-called "deferred read-pointer" 66 is maintained, which is only stepped forward after a store has successfully passed the abort window - see steps 730 and 740, which is signaled through control line 68 - and hence can not be aborted anymore. When a store request gets aborted due to a fetch request - see steps 730 and 750, which is signaled through control line 69 - the "deferred read-pointer" is not stepped forward; instead, the aborted store request will later be repeated.
In order to recover from the abort, the original "deferred read-pointer" is therefore copied back into the "speculative read-pointer", control is then fed back to step 710 using the deferred read pointer, and the store request processing continues by repeating the aborted store request. Respective multiplexers 65, 67 and the control lines 68, 69 implement the corresponding selection logic.
Next, and with reference to figure 8, a cycle scheme is given describing the effects of the before-mentioned control logic. Pipe cycles of a store request are denoted as CS0, CS1, ... and pipe cycles of a fetch request as CF0, CF1, .... CS0 and CF0 are the respective arbitration cycles where a request gets granted pipe access. In the drawings, they are denoted by "St" and "Ft" respectively. The cycles when a store actually writes to the
cache are labelled S1, S2, S3, S4 and the cycles a fetch reads from cache are labelled F1, F2 through F16. D1 and D2 are directory access cycles.
A fetch can only abort a store if the store is of low priority (this is the normal case). When, however, a fetch has dependencies on any store in the store queue, the stores are given high priority and cannot be aborted.
When a fetch request competes with a store request in the arbitration cycle and there are no dependencies (which is the normal case), the fetch request wins. When the fetch is requested one to three cycles later, CF0 happens in parallel to CS4, and the current store request in its fourth pipe cycle CS4 is aborted. The abort window 80 starts in CS4 and ends in CS7, meaning that CF0 can only happen in one of the current store cycles CS4 through CS7, but not in parallel to CS1 through CS3 nor in parallel to CS8 through CS11. Within its abort window, a store request will be aborted when a fetch request enters the arbiter logic.
Since stores busy the cache arrays for four consecutive cycles, subsequent stores are arbitrated at least four cycles apart from each other. Therefore, CS5 is a store's first window cycle where a subsequent store can already have entered the pipe and execute its own CS1 cycle. When, in this constellation, a fetch request's CF0 falls into the first store's CS5, both store requests in the pipe are aborted.
When CF0 happens in parallel to CS6, a second store may be in its own CS1 or CS2 cycle. In this case also, both stores are aborted. The same applies when CF0 happens in parallel to CS7. This way, all stores entering the pipe after an aborted store will be aborted as well.
The store window control logic also takes care of stores running ahead of a store reaching its abort window. When a current store enters its CS4 cycle, a preceding store will be at least 4
cycles ahead in its own CS8 cycle. Therefore, when at this cycle a fetch request exploits the abort window and gains pipe access, it aborts the current store, but the preceding store cannot be aborted any more. In this situation, the fetch request's access to the cache arrays is delayed by 4 rather than the usual 3 cycles to avoid a collision with the store ahead.
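To make the abort rules of the last paragraphs more concrete, the following sketch (not part of the patent; the function and data layout are illustrative assumptions) decides which in-flight stores are aborted when a fetch gains pipe access during the current store's CS4..CS7 window:

```python
# Minimal sketch (not from the patent): which in-flight stores are aborted
# when a fetch is granted (its CF0) while some store is inside its CS4..CS7
# abort window. 'cs' is each store's pipe cycle index at the fetch's CF0.

ABORT_WINDOW = range(4, 8)   # CS4 .. CS7

def aborted_stores(cs_of_inflight_stores):
    """A store is aborted (and later re-executed) if it has not yet passed
    its abort window at CF0; a store already at CS8 or later completes."""
    if not any(cs in ABORT_WINDOW for cs in cs_of_inflight_stores):
        return []             # CF0 cannot fall into CS1..CS3 or CS8..CS11
    return [cs for cs in cs_of_inflight_stores if cs <= 7]

print(aborted_stores([5, 1]))   # CF0 in first store's CS5: both aborted
print(aborted_stores([8, 4]))   # preceding store at CS8 completes, the
                                # store in its CS4 abort window is aborted
```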
The correct re-execution of aborted stores is assured by two pointers to store queue entries, a speculative and a deferred read pointer. Their behaviour and management are described with respect to figure 7:
The speculative read pointer points to the store queue entry which is the candidate for the next store request arbitration. When this store request enters the pipe with its CS1 cycle, the speculative read pointer gets incremented and points to the next position in the store queue.
The deferred read pointer, at the very beginning, points to the same entry as the speculative read pointer. Unlike the speculative read pointer, it gets incremented only when the current store has successfully passed its abort window without abortion. In the preferred embodiment, this happens in the store's CS11 cycle.
When however the store has been aborted, then in CS11 the deferred read pointer is not incremented but instead loaded into the speculative read pointer. In this case, when in parallel a new store happens to be in its own CS1 cycle, the normal CS1 incrementing of the speculative read pointer is suppressed. This makes sure that the new speculative read pointer points to the first store that has been aborted. Since the speculative read pointer identifies the store queue position which will execute next, this logic makes sure that the complete sequence of aborted stores will be re-executed. In order to let all stores leave the pipeline after an abort, no store can arbitrate for 12 cycles after the first aborted store's CS10.
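A minimal sketch of the pointer handling just described is given below (not part of the patent; class and method names are assumptions, queue wrap-around and the 12-cycle arbitration hold-off are omitted, and the suppression of the parallel CS1 increment is implicit in the reload):

```python
# Minimal sketch (not from the patent): the speculative/deferred read
# pointer handling of figure 7.

class StoreQueuePointers:
    def __init__(self):
        self.speculative = 0   # next store to be arbitrated (steps 710/720)
        self.deferred = 0      # oldest store not yet safely past its abort window

    def store_entered_pipe(self):
        # CS1: the speculative pointer moves on to the next queue entry.
        self.speculative += 1

    def store_passed_abort_window(self):
        # CS11 without abort (steps 730/740, control line 68): the store can
        # no longer be aborted, so the deferred pointer may follow.
        self.deferred += 1

    def store_aborted(self):
        # CS11 after an abort (steps 730/750, control line 69): reload the
        # speculative pointer from the deferred pointer so the first aborted
        # store - and every store after it - is re-executed.
        self.speculative = self.deferred
```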
In the following, the aspect of combined store aborting and store grouping will be described in more detail:
Figure 9 recalls the prior art ("as is") in the first row, the effects of store aborting in the second row, and the effect of store grouping in the third row, with aborting and grouping each applied independently, without the respective other mechanism.
The first row compartment shows the "as is" state-of-the-art. As discussed above, a store granted pipe access in A3 will suppress pipe access for fetches from A4 through A14. The resulting delay a fetch can suffer is indicated in the row labeled "Fetch Delay". It ranges from 11 cycles down to the minimum of 3 cycles and sums up to a total delay potential of 69 cycles.
The second row compartment shows the effect of store aborting: As discussed above, a fetch arriving between A4 and A14 is delayed by 3 cycles in all cycles other than A11, where the delay is 4 cycles. This is a total delay potential of 34 cycles, a little less than half the "as is" delay potential.
The third row compartment shows the effect of store grouping independently of store aborting. Assembling groups of four stores and executing them as close as possible makes their individual fetch blocking windows overlap and thus reduces the delay potential per store to one fourth of 183 cycles or 45.75 cycles.
Advantageously, the inventive store grouping can be combined with the inventive store aborting as depicted in figure 10 and described in the following:
The second row compartment recalls the "as is" state with its delay potential of 69 cycles per store.
The third row compartment shows the effects of combining store grouping with aborting: Each individual store's delay pattern is
the same as with the standalone store aborting technique. Store grouping causes these delay zones to overlap, thus reducing the total delay potential per store to one fourth of the total of 73 cycles, i.e. 18.25 cycles.
This result of a combined store grouping and aborting technique is very close to the hypothetical lowest delay potential of 18 cycles depicted in the first row compartment:
The hypothetical minimum delay would be encountered if the store and the fetch pipeline behaved similarly, with stores accessing the cache arrays in parallel to their second directory cycle like fetches do. If they could, their cache access would happen in their fifth pipe cycle, which in the drawing falls into A8. This hypothetical behaviour would delay fetches by 5 cycles at most, resulting in the above-mentioned hypothetically lowest total delay potential of 18 cycles.
The invention can take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment containing both hardware and software elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc.
Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
The medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk and an optical disk. Current examples of optical disks include compact disk - read only memory (CD-ROM), compact disk - read/write (CD-R/W) and DVD.
A data processing system suitable for storing and/or executing program code will include at least one processor coupled directly or indirectly to memory elements through a system bus. The memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during execution.
Input/output or I/O devices (including but not limited to keyboards, displays, pointing devices, etc.) can be coupled to the system either directly or through intervening I/O controllers.
Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modem and Ethernet cards are just a few of the currently available types of network adapters.
The circuit as described above is part of the design for an integrated circuit chip. The chip design is created in a graphical computer programming language, and stored in a computer storage medium (such as a disk, tape, physical hard
drive, or virtual hard drive such as in a storage access network). If the designer does not fabricate chips or the photolithographic masks used to fabricate chips, the designer transmits the resulting design by physical means (e.g., by providing a copy of the storage medium storing the design) or electronically (e.g., through the Internet) to such entities, directly or indirectly. The stored design is then converted into the appropriate format (e.g., GDSII) for the fabrication of photolithographic masks, which typically include multiple copies of the chip design in question that are to be formed on a wafer. The photolithographic masks are utilized to define areas of the wafer (and/or the layers thereon) to be etched or otherwise processed.

Claims (6)

1. A method for operating processor caches (16) accessed under the control of a cache pipeline (14), wherein store requests are managed in a store queue (10), and read requests are managed in a read queue (12), respectively, wherein both of said queues (10, 12) compete for accessing said cache pipeline (14), and a prioritization logic (18) decides if a read request or a store request is to be forwarded to said cache pipeline (14),
characterised by the steps of:
a) aborting a store request, if a fetch request arrives within a predetermined store abort window,
b) granting access to said arrived fetch request,
c) using a control mechanism, in order to repeat the access control of the aborted store request for a further trial to access the pipeline (14), when said fetch request does not require the input stage of said cache pipeline any more.
2. The method according to claim 1, wherein a deferred read pointer is used for pointing to a position of the store request queue in which the current store request currently being forwarded into the cache pipeline is located,
and wherein a speculative read pointer is forwarded from said current store request to the next store request enqueued, when said current store request enters the cache pipeline, and wherein said deferred read pointer is forwarded to the position of the next store request only after the processing of said aborted store request has successfully passed said abort window.
3. The method according to claim 1, wherein an aborting of the store requests is allowed for a number of 3 to 7 cycles, preferably 4 or 5 cycles, after a number of 2 to 4 cycles,
preferably after 3 cycles.
4. The method according to claim 1, further comprising the steps of:
a) halting the processing of store requests until a group of at least a predetermined minimum number of store requests has been accumulated in said store queue (10) for being granted access to the cache pipeline (14),
b) when (540) said minimum number of store requests has been accumulated,
c) forwarding (560) said group of store requests for accessing said cache processing pipe (14) for being processed without an interruption being allowed by occurrence of a read request,
d) operating said cache (16) processing pipe with said group of store requests according to prior art.
5. The method according to the preceding claim, wherein a counter (34) is implemented which counts the current number of enqueued store requests, and a control logic (32) is implemented which gives priority to the processing of the store queue (10) in relation to the read queue (12), as soon as a respective counter value is reached which is greater than said minimum number of store requests.
6. An electronic data processing system having means for operating processor caches (16) accessed under the control of a cache pipeline (14), wherein store requests are managed in a store queue (10), and read requests are managed in a read queue (12), respectively, wherein both of said queues (10, 12) compete for accessing said cache pipeline (14), and a prioritization logic (18) decides if a
read request or a store request is to be forwarded to said cache pipeline (14)
characterized by a control logic (62) performing the steps of:
a) aborting a store request, if a fetch request arrives within a predetermined store abort window,
b) granting access to said arrived fetch request,
c) using a control mechanism, in order to repeat the access control of the aborted store request for a further trial to access the pipeline (14), when said fetch request does not require the input stage of said cache pipeline any more.
GB0822457.8A 2008-01-15 2008-12-10 Store aborting Active GB2456405B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
EP08100460 2008-01-15

Publications (3)

Publication Number Publication Date
GB0822457D0 GB0822457D0 (en) 2009-01-14
GB2456405A true GB2456405A (en) 2009-07-22
GB2456405B GB2456405B (en) 2012-05-02

Family

ID=40289745

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0822457.8A Active GB2456405B (en) 2008-01-15 2008-12-10 Store aborting

Country Status (1)

Country Link
GB (1) GB2456405B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4631668A (en) * 1982-02-03 1986-12-23 Hitachi, Ltd. Storage system using comparison and merger of encached data and update data at buffer to cache to maintain data integrity
US5450564A (en) * 1990-05-04 1995-09-12 Unisys Corporation Method and apparatus for cache memory access with separate fetch and store queues
US5699551A (en) * 1989-12-01 1997-12-16 Silicon Graphics, Inc. Software invalidation in a multiple level, multiple cache system
US20020083244A1 (en) * 2000-12-27 2002-06-27 Hammarlund Per H. Processing requests to efficiently access a limited bandwidth storage area
US20030056066A1 (en) * 2001-09-14 2003-03-20 Shailender Chaudhry Method and apparatus for decoupling tag and data accesses in a cache memory
US20080091883A1 (en) * 2006-10-12 2008-04-17 International Business Machines Corporation Load starvation detector and buster

Also Published As

Publication number Publication date
GB0822457D0 (en) 2009-01-14
GB2456405B (en) 2012-05-02

Similar Documents

Publication Publication Date Title
JP3014773B2 (en) Processor architecture
US8484438B2 (en) Hierarchical bloom filters for facilitating concurrency control
US8689221B2 (en) Speculative thread execution and asynchronous conflict events
US9244724B2 (en) Management of transactional memory access requests by a cache memory
US9524164B2 (en) Specialized memory disambiguation mechanisms for different memory read access types
US9513959B2 (en) Contention management for a hardware transactional memory
US9367264B2 (en) Transaction check instruction for memory transactions
US6557095B1 (en) Scheduling operations using a dependency matrix
JP5398375B2 (en) Optimizing grace period detection for preemptable reads, copies, and updates on uniprocessor systems
US9396115B2 (en) Rewind only transactions in a data processing system supporting transactional storage accesses
US7024543B2 (en) Synchronising pipelines in a data processing apparatus
US20150154045A1 (en) Contention management for a hardware transactional memory
US9798577B2 (en) Transactional storage accesses supporting differing priority levels
US20100174840A1 (en) Prioritization for conflict arbitration in transactional memory management
US10725937B2 (en) Synchronized access to shared memory by extending protection for a store target address of a store-conditional request
US20150052312A1 (en) Protecting the footprint of memory transactions from victimization
US10108464B2 (en) Managing speculative memory access requests in the presence of transactional storage accesses
GB2456621A (en) Queuing cache store requests while read requests are processed until the number of store request reaches a limit or a timeout happens.
US10884740B2 (en) Synchronized access to data in shared memory by resolving conflicting accesses by co-located hardware threads
US20060047495A1 (en) Analyzer for spawning pairs in speculative multithreaded processor
US10592517B2 (en) Ranking items
Sakalis et al. Seeds of SEED: Preventing priority inversion in instruction scheduling to disrupt speculative interference
JP3146058B2 (en) Parallel processing type processor system and control method of parallel processing type processor system
GB2456405A (en) Managing fetch and store requests in a cache pipeline
WO2021037124A1 (en) Task processing method and task processing device

Legal Events

Date Code Title Description
746 Register noted 'licences of right' (sect. 46/1977)

Effective date: 20130107