KR100987832B1 - System, apparatus and method for managing predictions of various access types to a memory associated with cache memory - Google Patents

System, apparatus and method for managing predictions of various access types to a memory associated with cache memory

Info

Publication number
KR100987832B1
Authority
KR
South Korea
Prior art keywords
address
prediction
addresses
predictions
trigger
Prior art date
Application number
KR1020077003839A
Other languages
Korean (ko)
Other versions
KR20070050443A (en)
Inventor
Radoslav Danilak
Brian Keith Langendorf
Dmitry Vyshetsky
Brad W. Simeral
Stefano A. Pescador
Ziyad S. Hakura
Original Assignee
NVIDIA Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to US10/921,026 (US7206902B2)
Priority to US10/920,682 (US7461211B2)
Priority to US10/920,995 (US7260686B2)
Priority to US10/920,610 (US7441087B2)
Application filed by NVIDIA Corporation
Publication of KR20070050443A
Application granted
Publication of KR100987832B1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/34 Addressing or accessing the instruction operand or the result; formation of operand address; addressing modes
    • G06F9/345 Addressing modes of multiple operands or results
    • G06F9/38 Concurrent instruction execution, e.g. pipeline, look ahead
    • G06F9/3802 Instruction prefetching
    • G06F9/3824 Operand accessing
    • G06F9/383 Operand prefetching
    • G06F9/3832 Value prediction for operands; operand history buffers

Abstract

Systems, apparatus, and methods are disclosed for predicting accesses to memory. In one embodiment, an exemplary apparatus includes a processor configured to execute program instructions and process program data, a memory containing the program instructions and program data, and a memory processor. The memory processor may include a speculator configured to receive an address associated with a program instruction or program data. Such a speculator may include a sequential predictor and a nonsequential predictor to generate configurable numbers of sequential and nonsequential addresses, respectively. In one embodiment, a prefetcher implements the apparatus. In various embodiments, the speculator can include any of a facilitator, a suppressor, a prediction inventory, an inventory filter, a post-inventory filter, and a data return cache memory, which can include short term and long term caches.
Sequential address predictor, non-sequential address predictor

Description

SYSTEM, APPARATUS AND METHOD FOR MANAGING PREDICTIONS OF VARIOUS ACCESS TYPES TO A MEMORY ASSOCIATED WITH CACHE MEMORY

Brief Description of the Invention

FIELD OF THE INVENTION The present invention relates generally to computing systems and, more particularly, to predicting sequential and nonsequential accesses to memory by generating a configurable amount of predictions, storing them, for example, in a prediction inventory and/or a multi-level cache, and suppressing and filtering unnecessary predictions.

Background of the Invention

A prefetcher is used to fetch program instructions and program data before they are needed, so that the retrieved information is immediately available to the processor. Prefetchers predict the instructions and data that a processor is likely to use in the future, so that the processor does not have to wait for instructions or data to be accessed from system memory, which typically operates at a slower rate than the processor. With a prefetcher implemented between the processor and system memory, the processor is less likely to remain idle while waiting for requested data from memory. As such, prefetchers generally improve processor performance.

In general, the more predictions a prefetcher generates, the more likely it is to have the necessary instructions and data available to the processor, and thus the more likely it is that processor latency will be reduced. However, conventional prefetchers typically lack sufficient management of the prediction process. Without such management, these prefetchers tend to overload computational and memory resources when the number of predicted addresses exceeds the amount the prefetcher can handle. Thus, to prevent resource overload, conventional prefetchers tend to be conservative in generating predictions, so as not to produce an amount of predictions that would overload either the prefetcher or the memory resources. In addition, conventional prefetchers typically generate predictions without considering the cost of implementing the prediction process, and so fail to weigh the benefits gained by streamlining that process against the amount of resources needed to support it. In particular, typical prefetchers rely on standard techniques for generating predictions that are sequential in nature, and they do not store predictions in a computationally or otherwise resource-conserving manner. Moreover, many conventional prefetchers lack the ability to manage predictions after they are generated and before the processor requests them. These prefetchers generally store prefetched data in a single cache memory that lacks functionality for limiting superfluous predictions, that is, predictions redundant with those already stored in the cache. The cache memory used in conventional prefetchers serves only to store data and is not designed to effectively manage the predicted addresses stored therein.

In view of the foregoing, it would be desirable to provide a system, apparatus, and method for efficiently predicting accesses to a memory. Ideally, such exemplary systems, apparatus, or methods would minimize or eliminate at least the drawbacks described above.

Summary of the Invention

Systems, apparatus, and methods are disclosed for predicting accesses to memory. In one embodiment, an exemplary apparatus includes a processor configured to execute program instructions and process program data, a memory including the program instructions and program data, and a memory processor. The memory processor may include a speculator configured to receive an address associated with a program instruction or program data. Such a speculator may include a sequential predictor for generating a configurable number of sequential addresses. The speculator may also include an out of order predictor configured to associate a subset of addresses with the received address. In addition, the out of order predictor may be configured to predict a group of addresses based on one or more addresses of the subset, wherein one or more addresses of the subset are not patternable with the received address. In one embodiment, an exemplary out of order predictor anticipates accesses to memory. The out of order predictor includes a prediction generator configured to generate an index and a tag from an address. The out of order predictor also includes a target cache coupled to the prediction generator. The target cache includes a number of memory portions, each having memory locations for storing trigger-target combinations. A trigger-target combination stored in a first portion of the memory is associated with a higher priority than other trigger-target combinations stored in a second portion of the memory.

In one embodiment of the invention, the apparatus includes a prediction inventory comprising queues, each configured to hold a group of items. Typically, a group of items is associated with a triggering address corresponding to that group, and each item in the group is a prediction of one type. The apparatus also includes an inventory filter configured to compare a number of predictions against one or more queues holding predictions of the same type. In some cases, the inventory filter is configured to compare the number of predictions against one or more other queues holding predictions of different types. For example, a number of forward sequential predictions may be filtered against a back queue, and the like. In at least one embodiment, the apparatus includes a data return cache memory to manage predicted accesses to the memory. The data return cache memory can include, for example, a short term cache memory configured to store predictions having an age less than a threshold, and a long term cache memory configured to store predictions having a lifetime greater than or equal to the threshold. The long term cache memory typically has a greater memory capacity than the short term cache memory. In addition, the prefetcher can include an interface configured to detect in parallel, over one or two cycles of operation, whether a number of predictions are stored in the short term cache memory, the long term cache memory, or both, wherein the interface uses two or more representations of the multiple predictions when examining the short term cache memory and the long term cache memory, respectively.

Brief description of the drawings

The invention is more fully understood from the following detailed description taken in conjunction with the accompanying drawings.

1 is a block diagram illustrating an exemplary speculator implemented in a memory processor, in accordance with certain embodiments of the present invention.

2 is a diagram of an exemplary speculator, according to one embodiment of the invention.

3A is a diagram of an exemplary forward sequential predictor, in accordance with certain embodiments of the present invention.

3B is a diagram of an exemplary blind back sequential predictor, in accordance with certain embodiments of the present invention.

FIG. 3C is a diagram of an exemplary back sector sequential predictor, in accordance with certain embodiments of the present invention.

3D is a diagram illustrating the operation of an exemplary reverse sequential predictor in accordance with certain embodiments of the present invention.

4 is a diagram illustrating an exemplary out of order predictor, in accordance with an embodiment of the invention.

5 is a diagram illustrating an example technique for suppressing out of order prediction for a stream of interleaved sequential addresses, in accordance with an embodiment of the present invention.

6 is a diagram illustrating an exemplary technique for suppressing out of order prediction for an interleaved sequential address across multiple threads, in accordance with an embodiment of the present invention.

7 is a diagram illustrating another technique for suppressing out of order prediction based on arrival times of a reference address and out of order address, in accordance with certain embodiments of the present invention.

8 is a diagram illustrating an exemplary technique for facilitating generation of prediction, in accordance with certain embodiments of the present invention.

9 is a diagram illustrating another exemplary speculator including a predictive filter, in accordance with an embodiment of the present invention.

10 is a block diagram illustrating a prefetcher implementing an exemplary out of order predictor, in accordance with certain embodiments of the present invention.

11 is a diagram illustrating an exemplary out of order predictor in accordance with an embodiment of the present invention.

12 is a diagram illustrating an exemplary prediction generator, in accordance with an embodiment of the present invention.

13 is a diagram illustrating an exemplary priority adjuster, in accordance with certain embodiments of the present invention.

14 is a diagram illustrating an exemplary pipeline for operating an out of order prediction generator when forming out of order predictions, in accordance with certain embodiments of the present invention.

FIG. 15 is a diagram illustrating an example pipeline for operating a priority adjuster that prioritizes out of order prediction, in accordance with certain embodiments of the present invention.

16 is a block diagram illustrating an exemplary predictive inventory in a memory processor, in accordance with certain embodiments of the present invention.

17 is a diagram illustrating an exemplary predictive inventory in accordance with an embodiment of the present invention.

18 is a diagram illustrating an example of an inventory filter according to a particular embodiment of the present invention.

19A and 19B are diagrams illustrating example techniques for filtering redundancy, in accordance with certain embodiments of the present invention.

20 is a diagram illustrating another exemplary predictive inventory placed in a prefetcher, in accordance with an embodiment of the present invention.

21 is a block diagram illustrating a prefetcher including an exemplary cache memory, in accordance with certain embodiments of the present invention.

22 is a diagram illustrating an exemplary multi-level cache, in accordance with an embodiment of the present invention.

FIG. 23A is a diagram illustrating an exemplary first query interface for a first address store, in accordance with certain embodiments of the present invention.

FIG. 23B illustrates a number of input addresses that can be checked simultaneously using the first query interface of FIG. 23A.

24 is a diagram illustrating an exemplary second query interface for a second address store, in accordance with certain embodiments of the present invention.

FIG. 25A is a diagram illustrating a possible arrangement (or representation thereof) of example addresses stored in a second address store, in accordance with an embodiment of the present invention.

FIG. 25B is a diagram illustrating an exemplary hit generator that generates a result based on ordering and valid bits, in accordance with one embodiment of the present invention.

FIG. 26 is a schematic representation of a component that generates R, one result of the hit generator of FIG. 25B, in accordance with an embodiment of the present invention.

27 is a diagram illustrating an example of a hit generator, in accordance with certain embodiments of the present invention.

28 is a diagram showing another example of a hit generator, according to another embodiment of the present invention.

The same reference numerals indicate corresponding parts throughout the several views.

Detailed Description of Exemplary Embodiments

The present invention provides a system, apparatus, and method for effectively predicting accesses to a memory so as to retrieve program instructions and program data that a processor can be expected to require. By effectively predicting accesses to memory, the latency of providing the necessary data to one or more processors can be minimized. In accordance with certain embodiments of the present invention, an apparatus includes a speculator configured to predict memory accesses. The exemplary speculator can be configured to generate a configurable amount of predictions so as to vary the rate of prediction generation. In another embodiment, the speculator can suppress the generation of certain predictions to limit the number of unnecessary predictions, such as redundant predictions, that a prefetcher might otherwise be required to manage. Further, in certain embodiments, the speculator can filter out unnecessary predictions by examining whether a cache memory, or an inventory containing predictions, already includes predictions more appropriate for presentation to the processor. In one embodiment, the cache memory stores predictions in a short term cache and a long term cache memory, which are checked simultaneously to filter out redundant predictions.

Generating Sequential and Out of Order Predictions

Exemplary Embodiments of Prefetchers and Speculators

FIG. 1 is a block diagram illustrating an exemplary speculator in accordance with certain embodiments of the present invention. In this example, speculator 108 is shown residing within prefetcher 106, and prefetcher 106 is shown residing within memory processor 104, which is designed to control at least memory accesses by one or more processors. Prefetcher 106 operates to "fetch" both program instructions and program data from memory 112 before they are required, and then to provide the fetched program instructions and program data to processor 102 upon request by the processor. By fetching them prior to use (ie, "prefetching"), processor idle time (eg, the time during which processor 102 is starved of data) is minimized. Prefetcher 106 also includes a cache memory 110 for storing and managing the presentation of prefetched data to processor 102. Cache memory 110 functions as a data store for speeding up instruction execution and data retrieval. In particular, cache memory 110 resides in prefetcher 106 and operates to supplement other memory caches, such as the "L1" and "L2" caches, which are generally employed to reduce some latency separately from memory processor 104.

In operation, speculator 108 monitors system bus 103 for requests by processor 102 ("read requests") to access memory 112. In particular, as processor 102 executes program instructions, speculator 108 detects read requests for addresses containing the program instructions and program data used by processor 102. For illustration purposes, an "address" is associated with a cache line or unit of memory that is generally transferred between memory 112 and cache memory 110. An "address" of a cache line may indicate a memory location, and the cache line may contain data from one or more addresses of memory 112. The term "data" refers to a unit of information that can be prefetched, while the terms "program instruction" and "program data" refer, respectively, to the instructions and data used by processor 102 in its processing. Thus, data (eg, any number of bits) can represent prefetchable information constituting program instructions and/or program data. The term "prediction" may also be used interchangeably with the term "predicted address". When a predicted address is used to access memory 112, one or more cache lines containing that predicted address, as well as other addresses (predicted or otherwise), are typically fetched.

Based on the detected read requests, speculator 108 can then generate a configurable number of predicted addresses that might subsequently be requested by processor 102. To do so, speculator 108 uses one or more speculation techniques in accordance with at least one embodiment of the present invention. Speculator 108 implements these speculation techniques as predictors, whose implementations are described below. In addition, speculator 108 suppresses the generation of some predictions and filters out others. By suppressing or filtering certain predictions, or both, the number of redundant predictions is reduced, thereby conserving resources. Examples of conserved resources include memory resources, such as cache memory 110, and bus resources (eg, bandwidth), such as memory bus 111.

After the predictions of speculator 108 undergo any additional filtering, memory processor 104 sends the remaining predictions (ie, the unfiltered predictions) to memory 112 via memory bus 111. In response, memory 112 returns the prefetched data associated with the predicted addresses. Cache memory 110 temporarily stores the returned data until memory processor 104 transmits that data to processor 102. At an appropriate time, memory processor 104 sends the prefetched data to processor 102 over system bus 103, ensuring, among other things, that latency is minimized.

FIG. 2 illustrates an exemplary speculator in accordance with an embodiment of the present invention. Speculator 108 is configured to receive read requests 201 and to generate predictions 203. As shown, speculator 108 includes a prediction controller 202 configured to provide control information and address information to a sequential predictor ("SEQ. Predictor") 206 and an out of order predictor ("NONSEQ. Predictor") 216, which generate the predictions 203. Prediction controller 202 functions, in whole or in part, to manage the prediction generation process so as to provide an optimal amount and optimal types of predictions. For example, prediction controller 202 can vary the number and the types of predictions generated for a particular cache line, or group of cache lines, specified in a read request 201. As another example, prediction controller 202 can include a suppressor 204 that suppresses the generation of certain predictions in order to conserve resources, such as the memory available in target cache 218, and to minimize unnecessary accesses to memory 112 caused by over-predicted addresses. Additionally, prediction controller 202 can include an accelerator 205 for expediting the generation of out of order predictions. As shown in FIG. 8, accelerator 205 operates to trigger the generation of an out of order prediction upon detection of an address earlier than the address immediately preceding the nonlinear portion of the address stream with which the out of order prediction is associated. A more detailed description of prediction controller 202 follows the descriptions below of sequential predictor 206 and out of order predictor 216.

Sequential predictor 206 is configured to generate predictions (ie, predicted addresses) that have some degree of expectancy. That is, sequential predictor 206 generates predictions that can be expected to follow one or more patterns observed in read requests 201 over time. These patterns arise from the fact that memory references have spatial locality among them. For example, as processor 102 executes program instructions, the stream of read requests 201 may be in a natural, sequential order as the requests cross system bus 103. To predict addresses that follow such a sequential pattern, a type of speculation technique referred to below as "forward sequential prediction" can be used to predict sequential addresses. This type of speculation technique is described below.

Forward sequential predictor 208 is configured to generate a number of sequential addresses in ascending order. Thus, when processor 102 sends a series of read requests 201 over system bus 103 that includes a stream of ascending addresses, forward sequential predictor 208 generates a number of predictions for additional ascending addresses to be prefetched. An example of a forward sequential predictor ("FSP") 208 is shown in FIG. 3A. As shown in FIG. 3A, FSP 208 receives an address, such as address A0, and generates one or more addresses in forward (ie, ascending) order from the A0 address. The notation A0 identifies the reference address (ie, A+0) from which one or more predictions are made. Thus, the notations A1, A2, A3, and so on represent addresses A+1, A+2, A+3, and so on, while A(-1), A(-2), A(-3), and so on represent addresses A-1, A-2, A-3, and so on. Although these notations represent series of addresses ascending or descending by one address at a time, any patternable set of addresses may be treated as sequential. As used throughout this specification, sequential addresses may be represented by a single letter. For example, "A" represents A0, A1, A2, A3, and so on, and "B" represents B0, B1, B2, B3, and so on. In this way, "A" and "B" each represent a sequential address stream, but the "B" address stream is nonsequential with respect to the "A" address stream.

Referring further to FIG. 3A, FSP 208 is shown receiving at least an enable signal and a batch signal, both of which are provided by prediction controller 202. The enable signal controls whether forward sequential prediction occurs and, if so, the batch signal controls the number of sequential addresses that FSP 208 generates. In this example, the batch signal indicates that "7" addresses beyond the reference address are to be predicted. As such, FSP 208 generates forward sequential addresses A1 to A7. Thus, when speculator 108 receives an address such as A0 as part of a read request 201, sequential predictor 206 can provide addresses A1, A2, A3, ..., Ab as part of prediction 203, where b is the "batch" number.
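
The batch mechanism can be sketched in software as follows. This is a minimal illustration rather than the patented hardware; the function name, the 64-byte line granularity, and the use of Python are assumptions made only for clarity.

    # Minimal sketch (not the patented hardware) of forward sequential prediction:
    # given a detected reference address A0 and a batch signal b, emit the next b
    # cache-line addresses A1..Ab. A 64-byte line size is assumed for illustration.
    CACHE_LINE_BYTES = 64

    def forward_sequential_predictions(ref_addr, batch, enabled=True):
        """Return the next `batch` cache-line addresses following ref_addr."""
        if not enabled:          # the enable signal gates prediction entirely
            return []
        line = ref_addr // CACHE_LINE_BYTES
        return [(line + i) * CACHE_LINE_BYTES for i in range(1, batch + 1)]

    # With batch = 7, a read of A0 yields predictions A1..A7.
    print([hex(a) for a in forward_sequential_predictions(0x1000, batch=7)])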

Blind back sequential predictor 210 of FIG. 2 generates one sequential address, in descending order from the reference address. An example of a blind back sequential predictor ("blind back") 210 is shown in FIG. 3B; it receives one or more addresses, such as address A0, and generates only one prediction, such as address A(-1), backward (ie, in descending order) from the A0 address. In addition, as with FSP 208, blind back sequential predictor 210 receives an enable signal that controls whether descending predictions are generated.

Back sector sequential predictor 214 of FIG. 2 is configured to generate a particular cache line as a prediction after detecting another particular cache line on system bus 103. In particular, when back sector sequential predictor 214 detects that a particular read request 201 is a request for a high-order cache line, the associated low-order cache line is generated as a prediction. High-order cache lines may be viewed as upper ("front") sectors containing odd addresses, while low-order cache lines may be viewed as lower ("back") sectors containing even addresses. For purposes of illustration, suppose the cache line contains 128 bytes and consists of a 64-byte high-order cache line (ie, the upper half of the 128 bytes) and a 64-byte low-order cache line (ie, the lower half of the 128 bytes).

One example of a back sector sequential predictor 214 is shown in FIG. 3C, which depicts a back sector sequential predictor ("back sector") 214 receiving one or more addresses. Upon receiving a read request 201 for the upper or front sector of a cache line, such as address AU, back sector sequential predictor 214 generates only one prediction: address AL. This type of speculation technique exploits the phenomenon in which processor 102 generally requests the upper or front sector of a cache line and requests the lower or back sector some time later. Back sector sequential predictor 214 also receives an enable signal that controls whether back sector predictions are generated.
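
For concreteness, the front-to-back sector relationship under the 128-byte line and 64-byte sector sizes assumed above can be sketched as follows; the function name and the Python rendering are illustrative only.

    # Minimal sketch of back sector prediction: a 128-byte cache line is split into
    # a 64-byte front (upper, odd) sector and a 64-byte back (lower, even) sector.
    # A read of the front sector yields one prediction: the back sector of the same line.
    LINE_BYTES = 128
    SECTOR_BYTES = 64

    def back_sector_prediction(addr):
        """Return the back-sector address if addr lies in the front sector, else None."""
        if addr % LINE_BYTES >= SECTOR_BYTES:       # request fell in the front (upper) sector
            return addr - (addr % LINE_BYTES)       # base of the line, ie, the back (lower) sector
        return None                                 # back-sector reads produce no prediction

    print(hex(back_sector_prediction(0x1040)))      # front-sector read AU yields prediction AL = 0x1000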

Reverse sequential predictor 212 of FIG. 2 is configured to generate multiple sequential addresses in descending order. Thus, when processor 102 sends a series of read requests over system bus 103 that includes a stream of descending addresses, reverse sequential predictor 212 generates multiple predictions for additional descending addresses. An example of a reverse sequential predictor ("RSP") 212 is shown in FIG. 3D. As shown in FIG. 3D, RSP 212 detects an address stream, such as addresses A0, A(-1), and A(-2), and, in response, generates one or more addresses sequentially in reverse (ie, descending) order from reference address A0. FIG. 3D also indicates that RSP 212 receives at least an enable signal, a batch signal, and a confidence level ("Conf.") signal, all of which are provided by prediction controller 202. Although the enable and batch signals operate in the same way as with FSP 208, the confidence level ("Conf.") signal defines a threshold for triggering the generation of reverse sequential predictions.

FIG. 3D also shows a chart 310 illustrating the operation of an exemplary RSP 212, in accordance with certain embodiments of the present invention. Here, a confidence level of "2" sets trigger level 312, and the batch signal indicates that "5" addresses beyond the trigger address are to be predicted. The trigger address is the address that causes the predictor to generate its predictions. After detecting A(0) during interval I1, assume that RSP 212 detects address A(-1) during the subsequent interval I2. Next, during interval I3, address A(-2) is detected, and the detected stream, a series of descending addresses, reaches a particular level of confidence. This level of confidence is reached when trigger level 312 is exceeded, which causes RSP 212 to generate the reverse sequential addresses A(-3) to A(-7). Thus, if speculator 108 receives a certain number of addresses, such as A0, A(-1), and A(-2), as a series of read requests 201, sequential predictor 206 can provide addresses A(-3), A(-4), A(-5), ..., A(-b) as part of prediction 203, where b is the "batch" number. In some embodiments, RSP 212 does not employ a confidence level, but instead generates predictions starting immediately after the reference address. In other embodiments of the present invention, the concept of a confidence level is employed in the other predictors described herein. Control of RSP 212 and the other constituent predictors of sequential predictor 206 is further described below; out of order predictor 216 of FIG. 2 is described next.
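
The trigger-level behavior of chart 310 can be sketched roughly as below; the class name, signal names, and the 64-byte line granularity are assumptions made for illustration.

    # Minimal sketch of reverse sequential prediction with a confidence threshold:
    # each consecutive descending line raises the confidence count, and once the
    # trigger level is reached, a batch of further descending addresses is predicted.
    CACHE_LINE_BYTES = 64

    class ReverseSequentialPredictor:
        def __init__(self, trigger_level=2, batch=5):
            self.trigger_level = trigger_level
            self.batch = batch
            self.last_line = None
            self.confidence = 0

        def observe(self, addr):
            line = addr // CACHE_LINE_BYTES
            if self.last_line is not None and line == self.last_line - 1:
                self.confidence += 1            # stream continues downward
            else:
                self.confidence = 0             # pattern broken; start over
            self.last_line = line
            if self.confidence >= self.trigger_level:
                # Predict `batch` lines below the trigger address.
                return [(line - i) * CACHE_LINE_BYTES for i in range(1, self.batch + 1)]
            return []

    rsp = ReverseSequentialPredictor()
    for a in (0x2000, 0x1FC0, 0x1F80):          # A0, A(-1), A(-2)
        predictions = rsp.observe(a)
    print([hex(p) for p in predictions])        # A(-3)..A(-7)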

Out of order predictor 216 is configured to generate one or more predictions (ie, predicted addresses) following an address detected by speculator 108, even when that address lies within a nonlinear stream of read requests 201. Typically, when there is no observable pattern among the requested addresses, predicting the next address based only on a previous address is difficult. However, in accordance with one embodiment of the present invention, out of order predictor 216 generates out of order predictions, which include predicted addresses that are not patternable from one or more prior addresses. A "nonpatternable" prediction is a prediction that cannot be patterned with, or is irregular with respect to, a prior address. One type of nonpatternable prediction is an out of order prediction. The prior address on which an out of order prediction is based may be the immediately preceding address or any address configured as a trigger address. In particular, the lack of one or more patterns across two or more addresses in the stream of read requests 201 may result from processor 102 executing program instructions in a somewhat scattered manner, fetching instructions and data from various spatial memory locations.

Out of order predictor 216 includes a target cache 218 as a repository for storing combinations between a prior address and one or more possible out of order addresses, each of which may qualify as an out of order prediction. Target cache 218 is designed to readily compare its contents against incoming detected addresses so as to generate out of order predictions in a timely manner. The detected address that produces an out of order prediction is referred to as the "trigger" address, and the resulting prediction is the "target" of the unpatternable combination between the two addresses. An exemplary out of order predictor 216 is described below.

FIG. 4 illustrates an exemplary out of order predictor 216, in accordance with an embodiment of the present invention. Out of order predictor 216 includes an out of order prediction engine ("NonSeq. Prediction Engine") 420 operatively coupled to a repository, target cache 422. Target cache 422 maintains combinations between each trigger address and one or more corresponding target addresses. FIG. 4 illustrates one of many ways of combining out of order addresses. Here, a tree structure associates a particular trigger address with its corresponding target addresses. In this example, target cache 422 includes address "A" as a trigger address that forms combinations with addresses of possible out of order predictions, such as addresses "B", "X", and "L". Moreover, those three target addresses are themselves trigger addresses for addresses "C" and "G", "Y", and "M", respectively. The formation and operation of target cache 422 are described in more detail below. Note that address "A" may also be a target address for a trigger address not shown in FIG. 4, and that many other combinations are possible among addresses not shown.

Out of order prediction engine 420 is configured to receive at least four signals and any number of addresses 402. To control the operation of out of order prediction engine 420, prediction controller 202 provides a "batch" signal and an "enable" signal, which are substantially similar to the signals previously described. Prediction controller 202 also provides two other signals: a width ("W") signal and a depth ("D") signal. These signals control the formation of target cache 422: the width signal W sets the number of possible targets that a trigger address can predict, and the depth signal D sets the number of levels of combination below a trigger address. An example of the latter is when D represents a depth of "4". This means that address A is at the first level, address B is at the second level, addresses C and G are at the third level, and address D is at the fourth level. An example of the former is when W is set to "2". This means that only two of the three addresses "B", "X", and "L" are used for out of order prediction.

FIG. 4 also illustrates out of order prediction engine 420 configured to receive exemplary addresses 402 from prediction controller 202, such as the addresses conceptually shown in nonsequential address streams 404, 406, 408, 410, and 412, where each out of order address stream includes an address that is not patternable with a previously detected address. For example, stream 404 includes address "A" before address "B", which in turn is followed by address "C". As is generally the case with out of order addresses, detecting a pattern for predicting "B" from "A" and "C" from "B" is a difficult problem when limited to monitoring read requests 201 from processor 102. To accomplish this, out of order predictor 216 forms target cache 422 to enable prediction of unpatternable combinations between a particular trigger address and its target addresses. When out of order prediction engine 420 forms an out of order prediction, it generates a group of predictions from the combined target address. Thus, if trigger address "A" leads to an out of order prediction of address "B" (ie, B0 as a reference address), then the predicted addresses include B0, B1, B2, ..., Bb, where b is the number set by the batch signal.

In one embodiment of the invention, out of order prediction engine 420 forms target cache 422 by storing the combination from each address 402 to a subsequent address. For example, upon detecting the addresses of stream 404, out of order prediction engine 420 causes target cache 422 to hold combinations such as A to B, B to C, C to D, and so on. Out of order prediction engine 420 operates in the same way when detecting the addresses of the other streams 406, 408, and so on.

According to a particular embodiment, target cache 422 stores these combinations in the form of tables, such as tables 430, 440, and 450. These tables include a trigger column 426 and a target column 428 for storing, respectively, the trigger address and the target address of each combination. Next, consider that the addresses 402 of all the streams have been stored in tables 430, 440, and 450 of target cache 422. As shown in table 430, trigger-target combinations 432, 434, and 436 describe the combinations A to B, B to C, and G to Q, respectively. Other trigger-target combinations 438 include the combination C to D, and the like. Similarly, table 440 includes trigger-target combination 442 describing the combination from A to X, and table 450 includes trigger-target combination 452 describing the combination from A to L.

In FIG. 4, tables 430, 440, and 450 are identified as "Way 0", "Way 1", and "Way 2", respectively, and describe the relative priorities of multiple trigger-target combinations for the same trigger address. In this case, Way 0 is associated with the highest priority and Way 1 with the second highest priority. In this example, trigger-target combination 432 of table 430 indicates that the combination from A to B has higher priority than the combination from A to X, which is trigger-target combination 442 of table 440. Thus, after target cache 422 contains these combinations, when out of order prediction engine 420 subsequently detects address A (assuming prediction controller 202 has enabled out of order prediction engine 420 to operate), address B is predicted with the highest priority, followed by address X with the second highest priority, according to the relative priorities of the tables.

According to one embodiment of the invention, the relative priorities are determined in at least two ways. First, a trigger-target combination is assigned the highest priority when it is first detected and placed into target cache 422. Second, a trigger-target combination is assigned the highest priority when out of order prediction engine 420 determines that the combination has been successful (eg, that the most recent cache hit resulted from an out of order prediction based on that particular combination). The "most recent" cache hit is the latest cache hit on a target address combined with a particular trigger address. In addition, the previous "highest priority" combination (also designated leg 0) is shuffled down to the second highest priority (also designated leg 1) by moving that combination into the Way 1 table. As an example, consider a first point in time when a combination of A to X is introduced into target cache 422 as the first trigger-target combination. As a result, it is assigned the highest priority (ie, initially leg 0) by being placed into table 430 (ie, Way 0). At a later point in time, target cache 422 inserts a combination of A to B into table 430 (highest priority, leg 0), and the combination of A to X is moved to table 440 (second highest priority, leg 1). In a particular embodiment of the invention, the table in which a trigger-target combination is stored depends on some of the address bits that make up an index.
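
A simplified software model of this priority handling might look like the following. The class, its dictionary-based storage, and the promotion rule are assumptions made for illustration and are not the hardware organization of target cache 422.

    # Minimal sketch of a target cache holding trigger -> target combinations in
    # priority-ordered "ways": way 0 holds the highest-priority target for a trigger.
    # A newly learned, or recently hit, combination is promoted to way 0 and the
    # previous occupant shifts down one way.
    class TargetCache:
        def __init__(self, width=2):
            self.width = width          # number of ways kept per trigger (the W signal)
            self.ways = {}              # trigger address -> list of targets, way 0 first

        def update(self, trigger, target):
            targets = self.ways.setdefault(trigger, [])
            if target in targets:
                targets.remove(target)  # a recent hit re-promotes the combination
            targets.insert(0, target)   # newest or most recently hit combination -> way 0
            del targets[self.width:]    # combinations beyond the last way are dropped

        def predict(self, trigger):
            return list(self.ways.get(trigger, []))

    tc = TargetCache(width=2)
    tc.update(0xA000, 0xE000)           # A -> X learned first: occupies way 0
    tc.update(0xA000, 0xB000)           # A -> B learned next: B takes way 0, X moves to way 1
    print([hex(t) for t in tc.predict(0xA000)])   # ['0xb000', '0xe000']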

Referring back to FIG. 2, prediction controller 202 is configured to control both sequential predictor 206 and out of order predictor 216. Prediction controller 202 controls the types as well as the amount of predictions generated by sequential predictor 206 or out of order predictor 216, or both. In addition, prediction controller 202 suppresses the generation of unnecessary predictions 203, such as redundant or duplicate predictions. Since each of predictors 208, 210, 212, 214, and 216 can operate simultaneously, the number of predictions 203 must be managed so as not to overload prefetcher resources. Prediction controller 202 employs suppressor 204 to perform this and other similar operations.

In one embodiment of the invention, suppressor 204 controls the amount of predictions that are generated. It does so by first determining a particular characteristic of a read request 201. In particular, suppressor 204 determines whether read request 201 relates to program instructions (ie, "code") or to program data (ie, "not code"). Typically, read requests 201 that retrieve code rather than program data tend to form a natural sequence, or at least a patternable one. This is because processor 102 commonly executes instructions in a more linear manner than it requests program data. As such, suppressor 204 can direct sequential predictor 206 or out of order predictor 216 to suppress the generation of predictions when read request 201 relates to program data. This helps prevent the generation of spurious predictions.

In addition, suppressor 204 can adjust the amount of predictions generated by sequential predictor 206 and out of order predictor 216 by checking whether a read request 201 is a non-prefetch ("demand") request or a prefetch. In some cases, processor 102 absolutely requires program instructions or program data to be retrieved from memory 112 (non-prefetch requests), while in other cases processor 102 merely anticipates a future need and requests that program instructions or program data be prefetched. Since an absolute need can be more important than an anticipated need, suppressor 204 can direct a particular predictor to suppress predictions based on prefetch read requests 201 more than predictions based on demand read requests 201.

Table I shows an exemplary technique for suppressing the number of predictions generated. If read request 201 relates both to code and to a demand request, suppressor 204 suppresses the least. In other words, prediction controller 202 sets "batch" to the large size indicated as Batch Size 4 in Table I; in a particular example, Batch Size 4 can be set to seven. However, for the reasons described above, if read request 201 relates to program data (ie, not code) and is a processor-generated prefetch, suppressor 204 suppresses the most. As such, prediction controller 202 sets "batch" to the small size indicated as Batch Size 1 in Table I; as an example, Batch Size 1 can be set to one. In other cases, prediction controller 202 can vary the level of prediction suppression by using other batch sizes, such as Batch Size 2 and Batch Size 3. Although the suppressor according to one embodiment of the invention is configured to suppress the generation of one or more predicted addresses by reducing the "batch" amount when the processor request is for data or is a prefetch request, or both, Table I is not limiting. For example, processor requests for code or instructions could instead decrease, rather than increase, the "batch" size. As another example, a demand request could also decrease, rather than increase, the "batch" size. Those of ordinary skill in the art will understand that many variations are possible within the scope of the present invention.

[Table I]

(Table image not reproduced; it tabulates the batch sizes used for the combinations of code versus data and non-prefetch (demand) versus prefetch read requests.)
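
One way to read Table I as code is sketched below. The concrete sizes, and which of the two mixed cases receives the larger batch, are assumptions; the text fixes only the two extremes (code plus demand largest, data plus prefetch smallest).

    # Minimal sketch of batch-size suppression keyed on whether a read request
    # fetches code or data and whether it is a demand (non-prefetch) request or a
    # processor-generated prefetch. Values for the two mixed cases are assumed.
    def batch_size(is_code, is_demand):
        if is_code and is_demand:
            return 7                    # least suppression (Batch Size 4 in Table I)
        if not is_code and not is_demand:
            return 1                    # most suppression (Batch Size 1 in Table I)
        return 4 if is_code else 2      # intermediate suppression levels, assumed split

    print(batch_size(True, True), batch_size(False, False))   # 7 1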

Suppressor 204 can also adjust the types of predictions that sequential predictor 206 and out of order predictor 216 generate. First, consider that prediction controller 202 can enable forward sequential predictor 208 and reverse sequential predictor 212 at the same time. In this case, when reverse sequential predictor 212 triggers (ie, the confidence level is exceeded) because processor 102 is issuing read requests in descending order, suppressor 204 instructs prediction controller 202 to disable at least forward sequential predictor 208, so as to minimize the prediction of addresses in ascending order.

Second, consider the case in which a particular address triggers a back prediction (ie, blind back sequential predictor 210 or back sector sequential predictor 214) while prediction controller 202 has enabled a sequential prediction (ie, forward sequential predictor 208 or reverse sequential predictor 212). In this case, suppressor 204 reduces the batch by one from the initial amount for either forward sequential predictor 208 or reverse sequential predictor 212. That is, if "batch" is initially set to seven, then "batch" is reduced when either blind back sequential predictor 210 or back sector sequential predictor 214 is triggered or enabled. For example, if forward sequential predictor 208 is set to generate addresses A0, A1, A2, ..., A7, and blind back sequential predictor 210 is enabled for one or more read requests 201, then forward sequential predictor 208 generates only predictions A1, A2, ..., A6. The final result is a set of predictions A(-1), A(0), A1, A2, ..., A6 for these read requests 201, where the back prediction provides prediction A(-1).

Third, prediction controller 202 can additionally disable either blind back sequential predictor 210 or back sector sequential predictor 214 to suppress those predictions after the first prediction has been made within a sequential stream of processor addresses 201. This is because, after a sequential reference address has been established, subsequent forward or reverse sequential predictions already cover the back-type speculation (albeit lagging by one address). For example, forward sequential predictions A2, A3, and A4 (if the reference address is A0) together cover the back-type predictions A1, A2, and A3. Suppressor 204 can be configured to suppress other types of predictions as well; examples are described below.

FIG. 5 illustrates an exemplary technique for suppressing out of order predictions, in accordance with an embodiment of the present invention. According to this technique, suppressor 204 detects interleaved sequential streams that might otherwise be considered out of order and require the storage of trigger-target combinations in target cache 422. To conserve resources, especially the memory available in target cache 422, suppressor 204 analyzes nonsequential addresses, such as those in stream 502, and models them as interleaved sequential streams. As shown, stream 502 consists of addresses A0, B0, C0, A1, B1, C1, A2, B2, and C2, detected during intervals I1, I2, I3, I4, I5, I6, I7, I8, and I9. Suppressor 204 includes a data structure, such as table 504, to model nonsequential addresses as sequential. Table 504 can include any number of stream trackers for decomposing stream 502. In particular, stream trackers 520, 522, and 524 are designed to model the sequential streams B0, B1, B2; A0, A1, A2; and C0, C1, respectively. Subsequently detected read addresses in stream 502, such as A7 (not shown), are compared against these streams to determine whether out of order predictions can be suppressed for the stream being tracked.

In operation, suppressor 204 tracks a sequential stream by storing a reference address 510, such as the first address of the sequence, and then retaining the last-detected address 514. For each new last-detected address (eg, B2 of stream tracker 520), the previous last-detected address (eg, B1 of stream tracker 520) is retired by placing it in an additional column 512 as a "void". Through this exemplary technique, suppressor 204 suppresses the generation of unnecessary out of order predictions when other types of predictions can be used instead. Thus, for the example shown in FIG. 5, forward sequential predictor 208 is sufficient to generate predictions for stream 502.
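
A simplified model of this stream-tracking suppression is sketched below; it keeps only a last-detected line per tracked stream and omits the reference-address and void columns, so the names and structure are illustrative rather than the patent's table 504.

    # Minimal sketch of interleaved-stream suppression: several concurrent
    # sequential streams are tracked, and a read that sequentially extends one of
    # them is treated as sequential, so no trigger -> target combination needs learning.
    CACHE_LINE_BYTES = 64

    class StreamTrackers:
        def __init__(self, max_streams=4):
            self.max_streams = max_streams
            self.streams = []           # each entry: [reference_line, last_detected_line]

        def observe(self, addr):
            """Return True if addr continues a tracked stream (suppress out of order prediction)."""
            line = addr // CACHE_LINE_BYTES
            for stream in self.streams:
                if line == stream[1] + 1:
                    stream[1] = line    # stream extended sequentially
                    return True
            self.streams.append([line, line])       # start tracking a new stream
            del self.streams[:-self.max_streams]    # keep only the most recent trackers
            return False

    trackers = StreamTrackers()
    for a in (0x0000, 0x4000, 0x8000, 0x0040, 0x4040, 0x8040):   # A0 B0 C0 A1 B1 C1
        suppress = trackers.observe(a)
    print(suppress)    # True: C1 extends the tracked "C" stream, so it is handled sequentially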

FIG. 6 illustrates another exemplary technique for suppressing out of order predictions, in accordance with an embodiment of the present invention. In accordance with this technique, suppressor 204 models out of order addresses as interleaved sequential streams, similar to the process shown in FIG. 5. However, the technique of FIG. 6 implements a separate data structure for detecting sequential streams on each of any number of threads. In this example, tables 604, 606, and 608 include respective stream trackers for thread 0 ("T"), thread 1 ("T'"), and thread 2 ("T''"). Through this technique, the out of order addresses of stream 602 can be modeled as multiple interleaved sequential streams across multiple threads in order to suppress out of order predictions. Note that this technique may also be applied to reverse sequential streams or to other types of predictions.

FIG. 7 illustrates another technique for suppressing out of order predictions, in accordance with certain embodiments of the present invention. In the stream of addresses 702, a nonsequential transition exists between addresses A4 and B0. However, in some cases, if the time difference between these requested read addresses is too short, there is not enough time for an out of order prediction to be useful. Matcher 706 of suppressor 204 operates to compare the time difference d between addresses A4 and B0 against a threshold. If d is equal to or greater than the threshold TH, matcher 706 signals that out of order predictor 216 should be enabled (ie, "do not suppress"). However, if d is less than TH, matcher 706 signals that out of order predictor 216 should be disabled, thereby suppressing the prediction.
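
The arrival-time check performed by matcher 706 reduces to a comparison like the one below; the units, the names, and the threshold value are illustrative only.

    # Minimal sketch of arrival-time suppression: if the gap between the last
    # address of one run and the first address of the next run is below the
    # threshold TH, there is too little time for the prediction to help, so the
    # out of order predictor is disabled for that pair.
    def suppress_nonsequential(t_prev, t_next, threshold):
        """True means suppress (disable) the out of order predictor."""
        return (t_next - t_prev) < threshold

    print(suppress_nonsequential(100.0, 104.0, threshold=10.0))   # True: too close together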

Other suppression mechanisms that can be implemented by suppressor 204 are described below. In general, a finite amount of time elapses after processor 102 requests a front sector address before it requests the corresponding back sector address. If that amount of time is long enough, the back sector read request may appear to be irregular (ie, not patternable with the front sector). To prevent this, suppressor 204 is configured to maintain a list of front sector reads by processor 102. Subsequently detected addresses are compared against the listed front sector addresses, so that when the corresponding back sector read arrives, it is recognized. Thus, an out of order prediction, as well as other predictions, can be suppressed.

FIG. 8 illustrates an exemplary technique for facilitating the generation of predictions, in accordance with certain embodiments of the present invention. Specifically, accelerator 205 (FIG. 2) operates in accordance with this technique to expedite the generation of out of order predictions. In this example, stream 802 includes two adjacent sequential streams, A0 to A4 and B0 to B3. Out of order predictor 216 would typically designate address A4 as trigger address 808, with address B0 as target address 810. However, to reduce the time needed to generate the out of order prediction, trigger address 808 can be changed to a new trigger address 804 (ie, A0). Thus, by designating a new trigger address for the target address, the next time processor 102 requests the addresses of stream 802, out of order predictor 216 can generate its prediction upon detection of an earlier address rather than a later one (ie, the prediction can be generated when A0, rather than A4, is detected as the "new" trigger address). This helps ensure that out of order predictions are generated at the most appropriate time.
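
The effect of the accelerator can be sketched with a plain dictionary standing in for the target cache; the addresses and the dictionary representation are assumptions used only for illustration.

    # Minimal sketch of the accelerator idea: the trigger for target B0 is moved
    # from A4 (the address just before the discontinuity) back to the earlier
    # address A0, so the prediction can be issued as soon as A0 is seen again.
    A0, A4, B0 = 0x1000, 0x1100, 0x9000
    trigger_to_target = {A4: B0}                       # combination as first learned: A4 -> B0

    trigger_to_target[A0] = trigger_to_target.pop(A4)  # accelerator re-keys it to trigger A0

    print(hex(trigger_to_target[A0]))                  # 0x9000: target available already at A0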

FIG. 9 shows another exemplary speculator, according to an embodiment of the present invention. In this example, prefetcher 900 includes a speculator 908 having a filter 914 for filtering out redundant addresses, so as to keep the generation of unnecessary predictions to a minimum. Prefetcher 900 of FIG. 9 also includes a multi-level cache 920 and a prediction inventory 916. Here, multi-level cache 920 is composed of a first level data return cache ("DRC1") 922 and a second level data return cache ("DRC2") 924. First level data return cache 922 can generally be described as a short term data store, and second level data return cache 924 can generally be described as a long term data store. Multi-level cache 920 stores program instructions and program data prefetched from memory 112 until processor 102 requires them. Similarly, prediction inventory 916 provides temporary storage for generated predictions until they are selected by arbiter 918 to access memory 112. Arbiter 918 is configured to determine, according to arbitration rules, which of the generated predictions are issued to access memory 112 for prefetching instructions and data.

Filter 914 includes at least two filters: cache filter 910 and inventory filter 912. Cache filter 910 is configured to compare newly generated predictions against previously generated predictions whose instructions and data are already stored in multi-level cache 920. Thus, if one or more newly generated predictions are redundant with any previously generated prediction associated with multi-level cache 920, the redundant predictions are canceled to minimize the number of predictions. In addition, inventory filter 912 is configured to compare newly generated predictions against predictions already generated and stored in prediction inventory 916. Thus, if one or more newly generated predictions are redundant with the prior predictions stored in prediction inventory 916, the redundant predictions can be canceled to minimize the number of predictions, thereby freeing prefetcher resources.
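
The redundancy filtering described here amounts to membership tests against the cache contents and the inventory contents; the sketch below uses plain sets, which is an assumption about representation, not the patent's implementation.

    # Minimal sketch of prediction filtering: newly generated predicted addresses
    # are dropped if they are already covered by data in the multi-level cache or
    # by a prediction already waiting in the prediction inventory.
    def filter_predictions(new_predictions, cached_addresses, inventoried_addresses):
        return [p for p in new_predictions
                if p not in cached_addresses and p not in inventoried_addresses]

    kept = filter_predictions([0x1000, 0x1040, 0x1080],
                              cached_addresses={0x1040},
                              inventoried_addresses={0x1080})
    print([hex(p) for p in kept])    # ['0x1000']: the redundant predictions are canceled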

Exemplary Embodiments of an Out of Order Predictor

FIG. 10 is a block diagram illustrating an exemplary out of order ("NONSEQ") predictor 1010, in accordance with certain embodiments of the present invention. In this example, out of order predictor 1010 is shown residing within speculator 1008, which also includes a sequential predictor 1012 for generating sequential predictions. Prefetcher 1006, which includes speculator 1008, operates to "fetch" both program instructions and program data from a memory (not shown) before they are required, and then to provide the fetched program instructions and program data to a processor (not shown) upon request by that processor. By fetching them before use (ie, "prefetching"), processor idle time (eg, the time during which the processor is starved of data) is minimized. Out of order predictor 1010 includes an out of order prediction engine ("prediction engine") 1020 for generating predictions and a target cache 1030 for storing and prioritizing the predictions.

Prefetcher 1006 also includes a filter 1014, an optional prediction inventory 1016, an optional arbiter 1018, and a multi-level cache 1040. Here, filter 1014 includes a cache filter (not shown) configured to compare newly generated predictions against the earlier predictions that previously caused program instructions and program data to be prefetched into multi-level cache 1040. Thus, if any newly generated prediction is redundant with any previously generated prediction stored in multi-level cache 1040, the redundant prediction is canceled to minimize the number of predictions, thereby freeing prefetcher resources. Prediction inventory 1016 provides temporary storage for generated predictions until they are selected by arbiter 1018 to access memory. Arbiter 1018 is configured to determine which of the generated predictions are issued to access the memory for prefetching instructions and data.

The multi-level cache 1040 is composed of a first level data return cache ("DRC1") 1042 and a second level data return cache ("DRC2") 1044. The first level data return cache 1042 can generally be described as a short term data store, and the second level data return cache 1044 can generally be described as a long term data store. According to one embodiment of the invention, either or both of the first level data return cache 1042 and the second level data return cache 1044 store program instructions and program data prefetched based on a predicted address (i.e., a target address). As shown, the prefetched prediction information stored in the multi-level cache 1040 is represented as data(TRT1) and data(TRT2). This notation indicates that the target addresses TRT1 and TRT2 caused the data constituting the prediction information to be prefetched. As shown, and as described below, data(TRT1) and data(TRT2) are stored in the multi-level cache 1040 with prediction identifiers ("PID") 1 and 2, respectively. When either data(TRT1) or data(TRT2) is requested by the processor, the corresponding target address (e.g., TRT1) and the prediction identifier are communicated to the out of order predictor 1010.

In operation, speculator 1008 monitors the system bus as the processor requests access to memory ("read requests"). As the processor executes program instructions, speculator 1008 detects read requests for addresses that contain program instructions and program data being used by the processor. For purposes of discussion, an "address" is associated with a cache line or unit of memory that is generally transferred between memory and a cache memory, such as the multi-level cache 1040. Note that the cache memory is an example of storage external to the target cache 1030.

Based on the detected read request, the out of order predictor 1010 may generate a configurable number of predicted addresses that are likely to be requested subsequently by the processor. In particular, out of order predictor 1010 is configured to generate one or more predictions (i.e., predicted addresses) following detection of an address, even when that address lies in a nonlinear stream of read requests. Typically, under such conditions there is no observable pattern among the requested addresses from which the next address could be predicted based on one previous address. However, according to one embodiment of the present invention, the out of order prediction engine 1020 generates out of order predictions, which are predicted addresses that are not patternable from one or more preceding addresses. A "nonpatternable" prediction is a prediction that cannot be patterned with, or is irregular relative to, a preceding address. One type of nonpatternable prediction is an out of order prediction. The preceding address on which an out of order prediction is based may be either the immediately preceding address or any earlier address configured as a trigger address. In particular, the lack of a pattern across two or more addresses in the stream of read requests suggests a process that executes program instructions in a somewhat indiscriminate manner, fetching instructions and data from various spatial locations in memory.

Out of order predictor 1010 includes a target cache 1030 as storage for maintaining combinations of preceding addresses with one or more potential out of order addresses, each of which may qualify as an out of order prediction. The target cache 1030 is designed to compare its contents against an incoming address in a fast manner to generate out of order predictions. The target cache 1030 is also configured to prioritize out of order predictions in response to, for example, a hit in the cache memory, or upon the first instance in which the out of order predictor 1010 establishes a combination between a new out of order prediction and a specific trigger address. A "trigger" address is a detected address that causes out of order predictor 1010 to generate an out of order prediction, with the resulting prediction called the "target" of the unpatternable combination between the two. In accordance with at least one embodiment of the present invention, note that the target cache 1030 may be a single-ported memory to conserve resources otherwise consumed by a multi-ported memory.

Prefetcher 1006 issues predictions from out of order predictor 1010, and the out of order predictions are used to access the memory. In response, the memory returns prefetched data together with reference information for the predicted address, where the reference information may include a prediction identifier ("PID") and the corresponding target address. Next, the multi-level cache memory 1040 temporarily stores the returned data until the processor requests it. As described below, when the processor requests the prefetched data (i.e., the prediction information), the reference information is passed to the out of order predictor 1010 to reprioritize the out of order predictions if necessary.

FIG. 11 illustrates an exemplary out of order predictor 1010, in accordance with an embodiment of the present invention. Out of order predictor 1010 includes an out of order prediction engine ("NonSeq. Prediction Engine") 1120 operably connected to the storage illustrated by target cache 1130. The out of order prediction engine 1120 also includes a prediction generator 1122 and a priority adjuster 1124. Prediction generator 1122 generates predictions and manages the trigger-target combinations stored in target cache 1130. Priority adjuster 1124 operates, for example, to prioritize the trigger-target combinations from the most recently successful target address to the least recently successful target address. Prediction generator 1122 and priority adjuster 1124 are described in more detail with reference to FIGS. 12 and 13, respectively.

The target cache 1130 maintains combinations between each trigger address ("TGR") and one or more corresponding target addresses ("TRT"). FIG. 11 illustrates one of a number of ways of combining out of order addresses. Here, a tree structure relates a particular trigger address to its corresponding target addresses. In this example, target cache 1130 includes address "A" as a trigger address that forms combinations with addresses of possible out of order predictions, such as addresses "B", "X", and "L". In addition, these three target addresses are themselves trigger addresses for the respective addresses "C" and "G", "Y", and "M". The formation and operation of target cache 1130, by which prediction generator 1122 discovers a new trigger-target combination and inserts the combination into target cache 1130, is described in detail below. Also, the address "A" may itself be a target address for a trigger address not shown in FIG. 11. In addition, many other combinations are possible among addresses not shown.
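As an informal illustration of the trigger-target combinations just described, the following Python sketch models the target cache as a mapping from a trigger address to an ordered list of target addresses, with list position standing in for way priority (position 0 corresponding to Way 0, the highest priority). The class name and the simple dict representation are assumptions made for clarity; they do not describe the actual tag/index hardware organization.

```python
# Illustrative sketch of a target cache holding trigger-to-target combinations.
# A dict keyed by trigger address stands in for the tag/index lookup, and list
# position stands in for way priority (index 0 ~ Way 0, the highest priority).

class TargetCacheSketch:
    def __init__(self, width=2):
        self.width = width                 # number of legs used for prediction ("w")
        self.combinations = {}             # trigger address -> ordered target addresses

    def insert(self, trigger, target):
        targets = self.combinations.setdefault(trigger, [])
        if target not in targets:
            targets.insert(0, target)      # a newly found combination gets highest priority

    def predict(self, trigger):
        # Return up to "width" target addresses as out of order predictions.
        return self.combinations.get(trigger, [])[: self.width]


if __name__ == "__main__":
    tc = TargetCacheSketch(width=2)
    for trig, tgt in [("A", "L"), ("A", "X"), ("A", "B"), ("B", "C"), ("B", "G")]:
        tc.insert(trig, tgt)
    print(tc.predict("A"))                 # -> ['B', 'X']: leg 0 and leg 1 for trigger A
```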

As shown, the target cache can be configured by the out of order prediction engine 1120 according to at least three variables, in accordance with one embodiment of the invention: width ("w"), depth ("d"), and height ("h"). The width, w, sets the number of possible targets that a trigger address can predict, and the depth, d, sets the number of levels associated with a trigger address. The height, h, sets the number of consecutive trigger addresses used to generate out of order predictions. As an example, consider d set to a depth of "4". This means that address A is at the first level, address B is at the second level, addresses C and G are at the third level, and address D is at the fourth level. As another example, consider w set to "2". This means that only two of the three addresses "B", "X", and "L" are used for out of order prediction, as leg 0 and leg 1, even though all three addresses are at the second level. In a particular embodiment, the variable h sets the number of levels beyond the first level used to enable multi-level prediction generation.

As shown in FIG. 11, consider that h is set to two. This means that there are two levels of trigger addresses: the trigger address (e.g., address A) at the first level and successive trigger addresses (e.g., address B) at the second level. Thus, with h set to 2, a first grouping of predictions is formed in response to triggering address A. That is, any of the target addresses at the second level may generate one or more groups of out of order addresses. For example, any of the addresses "B", "X", and "L" may serve as the basis for generating out of order predictions, where the number of such addresses is selected by the number of active legs defined by out of order prediction engine 1120 (e.g., leg 0 through leg 2). Further, under multi-level prediction generation (and with h set to 2), each of the addresses "B", "X", and "L" may act as a successive trigger address that generates a second grouping of predictions based on the target addresses of the next lower level. Thus, the third-level target addresses C and G can be used to generate additional out of order predictions based on the successive trigger address B. Similarly, target addresses Y and M may be used to generate out of order predictions based on successive trigger addresses X and L, respectively. Those of ordinary skill in the art will appreciate that many implementations are possible by changing one or more of the three aforementioned variables.
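The following sketch illustrates, under stated assumptions, how the width and height variables might govern multi-level prediction generation: each level's targets become successive triggers for the next level, up to a configurable number of levels. The function name and the flat list of resulting predictions are illustrative only; the actual engine would queue groupings through its pipelines as described later.

```python
# Hypothetical sketch of multi-level out of order prediction governed by the
# width (w) and height (h) variables: targets of a trigger become successive
# triggers for the next level of predictions.

def multilevel_predict(target_map, trigger, width=2, height=2):
    """target_map: trigger -> list of targets ordered by priority (Way 0 first)."""
    predictions = []
    current_level = [trigger]
    for _ in range(height):                      # one grouping of predictions per level
        next_level = []
        for t in current_level:
            legs = target_map.get(t, [])[:width] # leg 0 .. leg (w-1)
            predictions.extend(legs)
            next_level.extend(legs)              # targets act as successive triggers
        current_level = next_level
    return predictions


if __name__ == "__main__":
    # The combinations of FIG. 11: A -> {B, X, L}, B -> {C, G}, X -> {Y}, L -> {M}
    target_map = {"A": ["B", "X", "L"], "B": ["C", "G"], "X": ["Y"], "L": ["M"]}
    print(multilevel_predict(target_map, "A", width=2, height=2))
    # -> ['B', 'X', 'C', 'G', 'Y'] : first grouping from A, second grouping from B and X
```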

Out of order prediction engine 1120 is configured to receive exemplary addresses 1101 of read requests. FIG. 11 conceptually illustrates non-sequential address streams 1102, 1104, 1106, 1108, and 1110, each of which includes addresses that cannot be patterned with previously-detected addresses. For example, stream 1102 includes address "A" followed by address "B", followed by address "C". Because these are non-sequential addresses, detecting a pattern for predicting "B" from "A", or "C" from "B", is difficult with nothing more than monitoring of the read requests 1101. To accomplish this, prediction generator 1122 establishes the contents of target cache 1130 to enable prediction of an unpatternable combination between a particular trigger address and its target address. For example, upon detecting address A of stream 1102 (as well as the subsequent addresses), prediction generator 1122 populates the target cache 1130 with combinations such as an A-to-B combination, a B-to-C combination, a C-to-D combination, and the like. Out of order prediction engine 1120 operates in the same manner when the addresses of other streams 1104, 1106, etc. are detected.

According to a particular embodiment, the target cache 1130 stores these combinations in the form of tables, such as tables 1140, 1150, and 1160. These tables include a trigger column ("TGR") and a target column ("TRT") for storing a trigger address and a target address, respectively. Next, consider that the addresses 1101 of all the streams have been stored in tables 1140, 1150, and 1160. As shown in table 1140, trigger-target combinations 1142, 1144, and 1146 describe the A-to-B, B-to-C, and G-to-Q combinations, respectively. Other trigger-target combinations 1148 include the C-to-D combination and the like. Similarly, table 1150 includes a trigger-target combination 1152 that describes the A-to-X combination, and table 1160 includes a trigger-target combination 1162 that describes the A-to-L combination.

FIG. 11 shows that tables 1140, 1150, and 1160 are identified as "Way 0", "Way 1", and "Way 2", respectively, which indicate the relative positions of multiple trigger-target combinations within target cache 1130 for the same trigger address. Priority adjuster 1124 typically assigns priority to a trigger-target combination, and thus to its prediction, by associating the combination with a prioritized memory location. In this case, Way 0 is associated with the highest priority and Way 1 with the second highest priority. In this example, the trigger-target combination 1142 of table 1140 indicates that the A-to-B combination has higher priority than the A-to-X combination, which is the trigger-target combination 1152 of table 1150. Thus, after target cache 1130 includes these combinations, the next time out of order prediction engine 1120 detects address A, out of order prediction engine 1120 may provide one or more predictions. Typically, out of order prediction engine 1120 generates out of order predictions in order of priority. In particular, the out of order prediction engine 1120 generates the highest priority prediction before generating lower priority predictions. As such, out of order prediction engine 1120 may generate a configurable number of predictions based on priority. For example, the out of order prediction engine 1120 can limit the number of predictions to two: leg 0 and leg 1 (i.e., the top two trigger-target combinations). In some cases this means that the out of order prediction engine 1120 tends to provide address B rather than address X, by virtue of the relative priorities of the tables. Note that only the relative priority of the trigger-target combinations is relevant. This means that the target cache 1130 may, for example, locate the highest priority combination for a particular trigger address in Way 4 and place the second highest priority combination in Way 9. Note also that the target cache 1130 may associate any number of "legs" with one address; it is not limited to only leg 0 and leg 1.

FIG. 12 illustrates an example prediction generator 1222, in accordance with an embodiment of the present invention. In this example, prediction generator 1222 is connected to target cache 1230 to generate predictions as well as to manage the trigger-target combinations stored therein. Prediction generator 1222 includes an index generator 1204, a tag generator 1206, a target determiner 1208, and a combiner 1210. The prediction generator 1222 also includes an inserter 1202 for inserting newly found trigger-target combinations into the target cache 1230.

In generating a prediction, the index generator 1204 and the tag generator 1206 operate, respectively, to generate an index and a tag representing a first address "addr_1", which may be an address that precedes another address. Index generator 1204 forms an index, "index(addr_1)", from addr_1 to access a subset of memory locations in target cache 1230. Typically, the value of index(addr_1) selects a corresponding memory location in each way. In addition, the tag generator 1206 forms a tag, "tag(addr_1)", so that the prediction generator 1222 can access the specific trigger-target combination in the target cache 1230 associated with addr_1.

As an example, consider that addr_1 is "G". From this address, prediction generator 1222 generates index(G) to select the memory locations associated with that index. In this example, index(G) has a value "I" of 3 (i.e., I = 3). This means that index(G), identified by I = 3, can be used to select a corresponding memory location in each of way ("Way 0") 1240, way ("Way 1") 1250, through way ("Way N") 1260, where N is a configurable number representing the number of ways available within the target cache 1230. For the same address G, tag generator 1206 generates tag(G) to identify the particular memory locations associated with G. Therefore, given index(G) and tag(G), the target addresses Q and P (or surrogate representations thereof) may respectively be retrieved from, or stored at, memory locations in way 1240 and way 1250, as shown in FIG. 12. In a particular embodiment, each address consists of 36 bits. Bits [28:18] can represent the tag for an address, and any of bit groups [19:9], [18:8], [17:7], or [16:6] can represent a configurable index for that address. Alternatively, in one embodiment, only a portion of an address is used to represent a target address. For example, bits [30:6] of the 36-bit target address are kept in the TRT column of the target cache 1230. With reduced representations of the target address and trigger address, less hardware is required, thereby reducing costs associated with materials, resources, and the like.
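For concreteness, the following sketch forms a tag and an index from a 36-bit address using the example bit ranges given above (tag in bits [28:18], index drawn from one of the configurable ranges such as [16:6]). The helper names are hypothetical, and the choice of index range is one of the alternatives listed above, not a requirement.

```python
# Sketch of forming a tag and a configurable index from a 36-bit address,
# using the example bit ranges described above. Helper names are illustrative.

def bit_field(value, high, low):
    """Extract bits [high:low] (inclusive) from an integer."""
    mask = (1 << (high - low + 1)) - 1
    return (value >> low) & mask

def make_tag(address):
    return bit_field(address, 28, 18)          # 11-bit tag from bits [28:18]

def make_index(address, high=16, low=6):
    return bit_field(address, high, low)       # one of the configurable index ranges

def make_pid(address):
    """A prediction identifier pairing the trigger's index and tag, as in [index(G), tag(G)]."""
    return (make_index(address), make_tag(address))


if __name__ == "__main__":
    addr_G = 0xABCD1234                        # arbitrary example address that fits in 36 bits
    print("index:", make_index(addr_G), "tag:", make_tag(addr_G), "PID:", make_pid(addr_G))
```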

Target determiner 1208 determines whether a trigger-target combination exists for a particular trigger and, if one is present, determines each target address for that trigger. With respect to the previous example, the target determiner 1208 retrieves the target addresses Q and P in response to tag(G) matching against the tags stored at index(G), which may represent different trigger addresses. Those skilled in the art will understand that a known comparator circuit (not shown) may suitably be implemented in either the prediction generator 1222 or the target cache 1230 to identify matching tags. If one or more target addresses are found, these addresses are passed to combiner 1210. The combiner 1210 combines each target address 1214 with a prediction identifier ("PID") 1212, which consists of the index and the tag of the trigger address. PID 1212 identifies the trigger address that caused the target addresses Q and P to be predicted. Thus, if PID 1212 is represented as [index(G), tag(G)], the out of order prediction generated by prediction generator 1222 is represented as [[index(G), tag(G)], Q]. Note that Q is the prediction, with [index(G), tag(G)] associated with it as reference information. Therefore, the prediction information prefetched into the cache memory may be represented as data(Q) + [[index(G), tag(G)], Q].

The combiner 1210 can be configured to receive a "batch" signal 1226 that directs generation of a number of additional predictions that are out of order with respect to the trigger address. For example, assume that batch signal 1226 directs combiner 1210 to generate "b" predictions as a group of predictions spanning a range that includes the matched target address. Thus, when the trigger address "G" generates an out of order prediction of the address "Q" (i.e., Q0 as the reference address), the predicted addresses may include Q0, Q1, Q2, ... Qb, where b is the number set by the batch signal. In some cases, where a back sector or blind back sequential prediction occurs at the same time, the batch may be set to b-1. As such, the group of predicted addresses then includes Q(-1), Q0, Q1, Q2, ... Q(b-1). Note that each of the predicted addresses in the group can be associated with the PID 1212. In a particular embodiment, the target address 1214 inherits the attributes of the trigger address, where these attributes indicate whether the trigger address is associated with code or with program data, and whether or not the trigger address is a processor request address. Also, in another particular embodiment, fewer than all of the predicted addresses in the group may be associated with the PID 1212. In one example, only target address Q0 is associated with PID 1212, and one or more others of the group (e.g., Q(-1), Q2, Q3, etc.) need not be associated with PID 1212. As such, when the subsequent target address Q0 for trigger address G is hit, PID 1212 is returned to the out of order predictor. But when Q2 or any other member of the group is hit, the PID 1212 is not returned. This reduces the number of redundant entries in the target cache. Thus, only the combination "G->Q0" is stored and reprioritized as a result of a hit on the prediction. When the address Q1 is detected within the address stream, the out of order predictor need not insert the combination "G->Q1".
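A short sketch of the batch expansion just described follows. It assumes a 64-byte cache-line granularity purely for illustration and uses a hypothetical function name; the only point shown is how one matched target address Q0 can be expanded into the group Q0..Qb, or into Q(-1), Q0..Q(b-1) when a back-sector prediction occurs at the same time.

```python
# Sketch of expanding one matched target address into a "batch" of predictions.
# The 64-byte line size and function name are assumptions for illustration.

LINE = 64  # assumed cache-line size in bytes

def batch_predictions(target_q0, b, include_back_sector=False):
    """Return the group of predicted addresses: Q0..Qb, or Q(-1), Q0..Q(b-1)
    when a back-sector or blind back sequential prediction occurs at the same time."""
    if include_back_sector:
        return [target_q0 - LINE] + [target_q0 + i * LINE for i in range(b)]
    return [target_q0 + i * LINE for i in range(b + 1)]


if __name__ == "__main__":
    print([hex(a) for a in batch_predictions(0x4000, b=4)])
    print([hex(a) for a in batch_predictions(0x4000, b=4, include_back_sector=True)])
```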

Next, consider the case in which the target determiner 1208 does not detect a target address for addr_1. The target determiner 1208 then indicates to the inserter 1202 that no trigger-target combination exists for addr_1. In response, inserter 1202 forms a trigger-target combination for addr_1 and inserts the combination into target cache 1230. To do this, inserter 1202 first identifies, using index(addr_1), a memory location for storing tag(addr_1). In addition, inserter 1202 is configured to receive a subsequent address "addr_2" to be stored as the target address for trigger address addr_1. Since no trigger-target combination existed prior to the newly-formed one, inserter 1202 stores tag(addr_1) and addr_2 in the TGR and TRT columns, respectively, of way 1240, which is the highest priority way (i.e., Way 0). For example, considering the address streams of FIG. 11, suppose a stream represents a first instance in which "Z" follows "Y". After determining that no "Y-to-Z" trigger-target combination exists, the inserter 1202 of FIG. 12 stores the new trigger-target combination at index(Y). As such, the "Y-to-Z" combination is stored as trigger-target combination 1242 in way 1240. In a particular embodiment, inserter 1202 receives an insert signal ("INS") 1224 from the priority adjuster 1324 described below.

FIG. 13 illustrates an example priority adjuster 1324 in accordance with an embodiment of the present invention. In general, priority adjuster 1324 operates to prioritize the trigger-target combinations from the most recently successful target address to the least recently successful target address. For example, a trigger-target combination may be assigned the highest priority (i.e., placed in Way 0) when no prior target exists for a particular trigger. In addition, a trigger-target combination can be assigned the highest priority if its predicted target address proves successful (e.g., the processor reads data that was prefetched based on the out of order prediction). In this example, priority adjuster 1324 is coupled to target cache 1230 to prioritize the trigger-target combinations stored therein. Priority adjuster 1324 includes a register 1302, an index decoder 1308, a tag decoder 1310, a target determiner 1318, a matcher 1314, and a reprioritizer 1316.

In general, priority adjuster 1324 receives, from a source external to out of order predictor 1010, information indicating that a particular address was successful in providing data requested by the processor. Such information may be generated by a cache memory, such as the multi-level cache shown in FIG. 10. Priority adjuster 1324 receives this information in register 1302 as "Hit Info". The Hit Info is reference information that includes at least an address 1304 of the data (e.g., the program instructions and/or program data actually requested by the processor). Address 1304 is designated addr_2. The reference information also includes a PID 1306 associated with address 1304.

Index decoder 1308 and tag decoder 1310 extract index(addr_1) and tag(addr_1), respectively, from PID 1306 to determine whether addr_2 has an appropriate level of priority. To accomplish this, priority adjuster 1324 identifies whether addr_2 is the target address of an existing trigger-target combination in target cache 1230. Priority adjuster 1324 applies tag(addr_1) and index(addr_1) to target cache 1230, and any matching trigger address in the TGR column of target cache 1230 is received by target determiner 1318. Upon detection of one or more target addresses associated with addr_1, target determiner 1318 provides these target addresses to matcher 1314.

However, if the target determiner 1318 determines that no matching target address is present in a trigger-target combination (i.e., no addr_2 is associated with the address addr_1), then an insert signal ("INS") 1224 is communicated to the inserter 1202 of FIG. 12 to insert the new trigger-target combination. Insert signal 1224 typically includes address information such as addr_1 and addr_2. Typically, a situation in which no matching target address exists for the PID 1306 of the Hit Info means that the processor used data prefetched under a previously issued out of order prediction, but the target cache 1230 has since eliminated the trigger-target combination that formed the basis for that previously issued out of order prediction. As such, the out of order predictor 1010 inserts or reinserts a trigger-target combination that can again be used to predict out of order addresses successfully used by the processor.

If target determiner 1318 detects one or more target addresses, it provides the detected target addresses to matcher 1314. The matcher 1314 compares each detected target address against addr_2 (i.e., address 1304) to determine whether a matching target address already exists for addr_1 and, if so, in which way the corresponding trigger-target combination resides. If necessary, the matcher 1314 provides the result of the matching to the reprioritizer 1316 to modify the priority.

First, consider an example in which one or more target addresses are detected as being associated with PID 1306 (i.e., with addr_1 as the trigger address), but no trigger-target combination includes addr_2. In this case, reprioritizer 1316 inserts the new trigger-target combination into the position representing the highest priority (e.g., Way 0) and demotes the priority of the existing trigger-target combinations for the same trigger. For example, as shown in FIG. 12, consider that the "A-to-X" trigger-target combination is in the memory location with the highest priority, while the "A-to-L" combination has a lower priority. Next, assume that PID 1306 represents address A as addr_1 and that addr_2 is address B. Reprioritizer 1316 then operates to store the "A-to-B" combination in the highest priority position, with the other prior combinations stored in ways of lower priority, as shown in FIG. 13.

Second, consider an example in which two target addresses are detected as being associated with PID 1306 (i.e., addr_1), but the two trigger-target combinations need to have their priorities exchanged. In this case, reprioritizer 1316 inserts the now highest priority trigger-target combination into the position representing the highest priority (e.g., Way 0), and inserts the previous highest priority trigger-target combination into another position representing the second highest priority (e.g., Way 1). For example, as shown in FIG. 12, consider that the "B-to-G" trigger-target combination is at the memory location representing the highest priority, while the "B-to-C" combination has a lower priority. Next, assume that PID 1306 represents address B as addr_1 and that addr_2 is address C. Reprioritizer 1316 then operates to store the "B-to-C" combination in Way 0, with the other combination moved to Way 1 of lower priority, as shown in FIG. 13. Note that this prioritizing technique is useful when at least the two highest priority combinations are maintained as "leg 0" and "leg 1", respectively.

Next, consider an example in which two target addresses are detected as being associated with PID 1306 (i.e., addr_1) and the two trigger-target combinations already have their priorities properly assigned. In this case, the reprioritizer 1316 takes no action, since the corresponding trigger-target combinations are already where they should be.
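The three reprioritization cases above can be summarized with the following sketch. A per-trigger list stands in for the ways of the target cache, with position 0 playing the role of Way 0 (highest priority); the function name and data layout are assumptions for exposition, not the disclosed circuit.

```python
# Sketch of the three reprioritization cases described above. A list per trigger
# stands in for the ways of the target cache, position 0 ~ Way 0 (highest priority).

def reprioritize(target_cache, addr_1, addr_2):
    """addr_1 is the trigger recovered from the PID; addr_2 is the address the
    processor actually requested (the successful target)."""
    targets = target_cache.setdefault(addr_1, [])
    if addr_2 not in targets:
        # Case 1: no combination includes addr_2 -> insert it at the highest
        # priority and demote the existing combinations for the same trigger.
        targets.insert(0, addr_2)
    elif targets.index(addr_2) != 0:
        # Case 2: the combination exists but is not highest priority -> exchange.
        targets.remove(addr_2)
        targets.insert(0, addr_2)
    # Case 3: already highest priority -> no action taken.
    return targets


if __name__ == "__main__":
    cache = {"A": ["X", "L"], "B": ["G", "C"]}
    print(reprioritize(cache, "A", "B"))   # case 1 -> ['B', 'X', 'L']
    print(reprioritize(cache, "B", "C"))   # case 2 -> ['C', 'G']
    print(reprioritize(cache, "B", "C"))   # case 3 -> unchanged
```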

FIG. 14 illustrates an example pipeline 1400 for operating a prediction generator to form out of order predictions, in accordance with certain embodiments of the present invention. In FIG. 14, the solid boxes represent storage during or between stages, and the dotted boxes represent operations performed by the out of order predictor. During stage 0, addr_1 of a read request is decoded by the combine-tag-and-index generator 1402, which may be an amalgam of the index decoder 1308 and the tag decoder 1310 of FIG. 13. In one embodiment, the combine-tag-and-index generator 1402 is a multiplexer configured to separate addr_1 into a first portion of the address and a second portion of the address. The first portion is maintained as tag(addr_1) at 1406 and the second portion is maintained as index(addr_1) at 1408. Also during this stage, index(addr_1) is applied to the target cache at 1410 to retrieve data representing trigger-target combinations. Additionally, addr_1 of the read request may be temporarily stored in buffer 1404 while the target cache is being written.

During stage 1, tag(addr_1) and index(addr_1) are maintained at 1412 and 1414, respectively. At 1416, the target addresses are read from the target cache. During stage 2, the out of order prediction engine selects the appropriate out of order prediction by first matching tag(addr_1) against the tags associated with index(addr_1) at 1418. At 1420, the out of order prediction engine configures a multiplexer, for example, to send the highest priority target address (i.e., from the way storing the highest priority trigger-target combination) to the leg 0 prediction queue at 1422, and the second highest priority target address (i.e., from the way storing the second highest priority trigger-target combination) to the leg 1 prediction queue at 1424. In stage 3, these two out of order predictions are output, for example, to the combiner at 1430. Although FIG. 14 shows out of order predictions being generated in four stages, it is noted that out of order prediction pipelines in other embodiments may have more or fewer stages.

FIG. 15 illustrates an example pipeline 1500 for operating a priority adjuster to prioritize out of order predictions in accordance with certain embodiments of the present invention. Solid boxes represent storage during or between stages, and dashed boxes represent operations that can be performed by the priority adjuster. Pipeline 1500 illustrates an example method of inserting trigger-target combinations into a target cache and of reprioritizing the target cache combinations. Stage -1 determines whether the priority adjuster performs an insertion or a reprioritization. If the priority adjuster performs an insertion, then at 1502 the address addr_1 of the read request is stored at 1506 during this stage. This address has the potential to be a trigger address for a target address. If the priority adjuster performs reprioritization, then at 1504 the priority adjuster receives, at 1508, a PID representing the addr_1 address from an external source (e.g., cache memory) and, also during this stage, receives the address addr_2 at 1510.

FIGS. 14 and 15 illustrate out of order prediction using one level of prediction. To enable multi-level prediction generation, example pipelines 1400 and 1500 can be modified to feed the predictions generated at the end of each pipeline 1400 and 1500 back to the pipelines 1400 and 1500 as input addresses. These predictions are then queued for further levels of prediction generation. For example, if A is detected, then target cache 1130 generates target addresses B and X (e.g., from the two highest priority ways). Next, address B, as a successive trigger, is input again at the top of the pipeline, where the target cache 1130 generates addresses C and G. In other words, a feedback loop is added to the example pipelines 1400 and 1500 for implementing additional levels of prediction.

First, consider the case in which, during stage 0, the priority adjuster performs a trigger-target combination insertion. In this example, addr_1 is decoded by the combine-tag-and-index generator 1514 and addr_2 is selected from 1512 via the multiplexer 1516. Combine-tag-and-index generator 1514 performs the collective functions of the index generator and the tag generator. In one embodiment, the combine-tag-and-index generator 1514 is a multiplexer configured to select an address from either 1506 or 1508. In this case, the combine-tag-and-index generator 1514 forms a first address portion that is maintained as tag(addr_1) at 1520 and forms a second address portion that is maintained as index(addr_1) at 1522. In addition, during this stage, index(addr_1) is applied to the target cache at 1524 through multiplexer 1518 to retrieve data describing trigger-target combinations. Next, consider the case in which, during stage 0, the priority adjuster performs reprioritization of the target cache. In this example, addr_1 (or another representation thereof) is received from 1508 and addr_2 is selected from 1510 via multiplexer 1516. The combine-tag-and-index generator 1514 then forms the first and second portions from the PID 1508. Next, the index(addr_1) formed from the PID 1508 is applied to the target cache at 1524 through the multiplexer 1518 to retrieve data describing trigger-target combinations. In stages 1 through 3, the pipeline 1500 operates similarly regardless of whether the priority adjuster performs insertion or reprioritization.

During stage 1, tag(addr_1) and index(addr_1) are maintained at 1530 and 1532, respectively. At 1534, the target addresses are read from the target cache. During stage 2, the priority adjuster first matches tag(addr_1) against the stored tags. If no tag matches at 1540, the multiplexer is configured at 1542 to prepare to insert the trigger-target combination. However, if at least one tag matches at 1544, the trigger-target combinations are reprioritized at 1554 so that the highest priority trigger-target combination resides in the way corresponding to the highest priority. To accomplish this, the multiplexer is selected at 1552 to reorder or insert a new trigger-target combination. During stage 3, the fully-connected prioritization multiplexer is configured to store addr_2 at 1556. This address is written as the target address in Way 0 during stage 0, as determined by the index(addr_1) maintained at 1550. As shown, other trigger-target combinations determined by the fully-connected prioritization multiplexer at 1560 are written as cache write data into the target cache at 1524 using the index(addr_1) maintained at 1550. After the pipeline 1500 returns to stage 0, the priority adjuster continues to operate accordingly.

Example Embodiments That Issue Predictions from Inventory

FIG. 16 is a block diagram illustrating an example prediction inventory 1620 in accordance with certain embodiments of the present invention. In this example, prediction inventory 1620 is shown to reside within prefetcher 1606. Also, prefetcher 1606 is shown to operate within a memory processor 1604 designed to control at least memory accesses by one or more processors. Prefetcher 1606 operates to "fetch" both program instructions and program data from memory 1612 before they are requested, and then to provide the fetched program instructions and program data to processor 1602 upon a request by the processor. By fetching them before use (i.e., "prefetching"), the idle time of the processor (e.g., the time during which processor 1602 is starved of data) is minimized. Prefetcher 1606 also includes a speculator 1608 for generating predictions and a filter 1622 for removing unnecessary predictions.

Filter 1622 represents one or both of an inventory filter and a post-inventory filter. By eliminating unnecessary predictions, prefetcher 1606 can conserve the computational and memory resources otherwise used to manage unnecessary, duplicative predictions. The inventory filter (a pre-inventory filter) operates to remove unnecessary predictions prior to insertion into prediction inventory 1620, while the post-inventory filter removes unnecessary predictions before issuance to memory 1612. An example of a post-inventory filter is shown in FIG. 20. The operation of the prefetcher 1606 and its components is described below.

In operation, speculator 1608 monitors system bus 1603 for requests ("read requests") by processor 1602 to access memory 1612. As the processor 1602 executes program instructions, the speculator 1608 detects read requests for addresses that contain program instructions and program data subsequently used by the processor 1602. For illustration purposes, an "address" is generally associated with a cache line or unit of memory transferred between memory 1612 and a cache memory (not shown). Note that the cache memory is an example of prediction storage that is separate from the prediction inventory. An "address" of a cache line may refer to a memory location, and the cache line may contain data from one or more addresses of memory 1612. The term "data" refers to a unit of information that can be prefetched, while the terms "program instructions" and "program data" respectively refer to the instructions and data used by the processor 1602 in its processing. Thus, the data (e.g., any number of bits) may represent predictable information that constitutes program instructions and/or program data.

Based on the detected read requests, speculator 1608 can generate numerous predictions to improve the chances of accurately predicting accesses to memory 1612 by processor 1602, and these numerous predictions may include redundant predictions. Examples of such predictions include forward sequential predictions, reverse sequential predictions, blind back sequential predictions, back sector sequential predictions, non-sequential predictions, and the like. To remove this redundancy, the inventory filter 1622 filters out the duplicate predictions to yield surviving predictions, which are then stored in the prediction inventory 1620. To remove the redundancy, the inventory filter 1622 compares the generated predictions against existing items before inserting the predictions into the prediction inventory 1620. If a match is found between a prediction and an item remaining in the prediction inventory 1620, the inventory filter 1622 invalidates that prediction. However, if no match is found, inventory filter 1622 inserts the surviving prediction into prediction inventory 1620. Note that within a new group of predictions (i.e., predictions generated by one event, or by the same trigger address), some predictions may match existing contents while other predictions may not. In this case, the inventory filter 1622 invalidates the individual predictions that match and inserts the unmatched predictions (e.g., those not marked "invalid") into the prediction inventory 1620.

Once resident within the prediction inventory 1620, a prediction remains as an "item" of the inventory. The term "item" refers to either a prediction or a triggering address (which generated the predictions) stored in prediction inventory 1620. These items can be compared against later-generated predictions for filtering purposes. Prefetcher 1606 manages these items in the inventory while issuing them to memory 1612 at varying rates. The rate of issuance depends on the type of prediction (e.g., forward sequential prediction, non-sequential prediction, etc.), the priority of each type of prediction, and other factors described below.

One way that a prediction can be redundant is when processor 1602 issues an actual read request for a particular address and a prediction for that address already exists in the prediction inventory 1620. In this case, the prediction is filtered out (i.e., invalidated) and the actual read request of the processor 1602 is maintained. This is especially true for predictions such as sequential-type and back-type predictions. In addition, some predictions in the prediction inventory 1620 may become redundant between the time they are received and the time prefetcher 1606 issues them to memory, and prefetcher 1606 can also filter predictions before issuing the items. This again reduces the number of redundant predictions that arise during the interval in which later-generated predictions are inserted into the prediction inventory 1620. Also, as the number of redundant predictions decreases, more resources are conserved.

After the prefetcher 1606 issues predictions from the prediction inventory 1620, the memory processor 1604 transfers the remaining predictions (i.e., at least those not filtered out by the post-inventory filter) over the memory bus 1611 to memory 1612. In response, memory 1612 returns prefetched data for the predicted addresses. A cache memory (not shown), which may be located inside or outside the prefetcher 1606, temporarily stores the returned data until the time comes to transfer the data to the memory processor. At the appropriate point in time, the memory processor 1604 sends the prefetched data to the processor 1602 via the system bus 1603 to ensure, among other things, that data latency is minimized.

FIG. 17 illustrates an example prediction inventory 1620 according to one embodiment of the invention. Prediction inventory 1620 includes multiple queues 1710, 1712, 1714, and 1716 to store predictions, where the queues may be buffers or any such components for storing predictions until each is issued or filtered. In addition, prediction inventory 1620 includes an inventory manager 1704 and one or more queue attributes 1706, and the inventory manager 1704 configures the structure and/or operation of each of the queues according to the corresponding queue attributes 1706.

Each queue maintains predictions as items, all of which are generally of the same particular type of prediction, such as forward sequential prediction. As shown, prediction inventory 1620 includes four queues: a sequential queue ("S Queue") 1710, a back queue ("B Queue") 1712, an out of order zero-queue ("NS0 Queue") 1714, and an out of order one-queue ("NS1 Queue") 1716. The sequential queue 1710 may be configured to contain either forward sequential predictions or reverse sequential predictions, while the back queue 1712 may contain either blind back sequential predictions or back sector sequential predictions. For illustrative purposes, forward sequential predictions, reverse sequential predictions, and the like may be collectively referred to as "series-type" predictions, while blind back sequential predictions, back sector sequential predictions, and the like may be collectively referred to as "back-type" predictions.

Predictive inventory 1620 includes a "0th" out of order queue and a "first" out of order queue. Out of order ("zero-") queue 1714 and out of order ("one-") queue 1716 include out of order predictions having "highest" and "second highest" priorities, respectively. In particular, out of order zero-queue 1714 maintains out of order prediction including the highest priority target address (any number of target addresses) that may be generated by the corresponding trigger address. The "trigger" address is the detected address at which speculator 1608 generates the prediction. This prediction (ie, predicted address) is a "target" address that is not patterned (eg, out of order) with a trigger that generates the target. Similarly, out of order one-queue 1716 includes a second highest priority target address that can be generated by the corresponding trigger address, instead of maintaining out of order prediction.

Each queue may consist of any number of groups 1720, such as groups 0, 1, 2, and 3. Each group 1720 contains a configurable number of items, such as a triggering address and the corresponding predictions that the triggering address generated. For example, a group 1720 of sequential queue 1710 may include a triggering address and seven sequential predictions, while a group 1720 of back queue 1712 may include a triggering address and one back-type prediction (or, in some cases, these queues contain only predictions as items). In addition, a group 1720 of either or both of the out of order zero-queue 1714 and the out of order one-queue 1716 may include a trigger address and four out of order predictions (or, in some cases, these queues contain only predictions as items). In a particular embodiment, the number of items per group 1720 stored in prediction inventory 1620 is determined by the "batch" number that is set to cause speculator 1608 to generate a particular number of predictions. By storing predictions in prediction inventory 1620 as grouped items, group 1720 reduces the amount of information typically needed to manage each prediction separately and, in turn, facilitates arbitration when issuing predictions.

Inventory manager 1704 is configured to manage the inventory of items in each queue as well as to control the structure and/or operation of the queues. To manage prediction inventory 1620, inventory manager 1704 operates in whole or in part using one or more queue attributes 1706. A first example of a queue attribute is the type of queue. For example, any of queues 1710-1716 can be configured as a "first-in first-out" (FIFO) buffer, a "last-in first-out" (LIFO) buffer, or any other type of buffer. The type of queue, such as FIFO or LIFO, affects how items are inserted into and removed from that queue. In one embodiment, the sequential queue 1710 is configured as a LIFO, and each of the out of order zero-queue 1714 and the out of order one-queue 1716 is configured as a FIFO.

A second example of a queue attribute is an expiration time, or lifetime, assignable to a queue, group, or item. This attribute controls the degree of staleness of predictions. As the predictions in any group 1720, or the queue itself, age, the likelihood that they no longer reflect accurate predictions gradually increases. Thus, to minimize stale items, inventory manager 1704 allows a group to remain in the inventory only until a specific expiration time, after which inventory manager 1704 removes the entire stale group, or any remaining items of it that have not yet been issued. In one embodiment of the present invention, the lifetimes for queues, groups, or items can be configured to hold them indefinitely. In other words, they can be set to be "permanent," meaning that an item remains in a queue until it is issued or otherwise removed. In a particular embodiment, an expiration time is associated with a group when that group is inserted into a queue. Thereafter, a timer counts down from the expiration time; when it reaches zero, any remaining items of the group are invalidated. In another embodiment, the expiration time for a group 1720 of either the out of order zero-queue 1714 or the out of order one-queue 1716 is set longer than that of a group 1720 of the sequential queue 1710, to increase the likelihood that out of order predictions are issued and consequently hit within the data cache.

A third example of a queue attribute is an insertion indicator associated with a queue, which indicates how inventory manager 1704 inserts predictions into the queue when the queue is full. In one example, the insertion indicator indicates whether inventory manager 1704 drops a newly-generated prediction rather than inserting it, or writes over an older item residing in the particular queue. With the insertion indicator set to "drop," inventory manager 1704 discards any new prediction instead of inserting it. However, if the insertion indicator is set to "overwrite," the inventory manager 1704 takes one of two courses of action depending on the type of the particular queue. If the queue is configured as a LIFO, the inventory manager 1704 pushes new predictions onto the LIFO as a stack, effectively pushing out the oldest items and/or groups at the bottom of the LIFO. However, if the queue is configured as a FIFO, the new prediction overwrites the oldest item in the FIFO.

A fourth example of a queue attribute is a priority associated with each queue, used to determine the particular queue from which the next item is to be issued. In one embodiment, an order of priority is set for each of queues 1710, 1712, 1714, and 1716 to arbitrate among the queues when selecting the next prediction. In applications where series-type predictions occur more abundantly, servicing the sequential queue 1710 is important. Thus, this queue is typically associated with a relatively high priority. This means, for example, that the out of order zero-queue ("NS0 Queue") 1714 and the out of order one-queue ("NS1 Queue") 1716 are most likely set to a lower priority than the sequential queue 1710. Another example of a queue attribute is the queue size associated with each queue, which determines how many predictions can be stored therein temporarily. For example, the sequential queue may have a size, or depth, of two groups, the back queue may have a depth of one group, and the out of order queues may have a depth of four groups. Note that the queue size can control the number of predictions issued by the prefetcher 1606 by controlling how much inventory memory is allocated to the different types of predictions.
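Pulling the queue attributes above together, the following sketch shows one way such a configuration could be expressed. The specific expiration values, priorities, and field names are assumptions chosen only to mirror the examples in the text (LIFO sequential queue, FIFO out of order queues, depths of two, one, and four groups); they are not disclosed parameter values.

```python
# Illustrative configuration of per-queue attributes (type, expiration time,
# insertion policy, priority, size). Values and field names are assumptions.

from dataclasses import dataclass
from typing import Optional

@dataclass
class QueueAttributes:
    queue_type: str             # "LIFO" or "FIFO"
    expiration: Optional[int]   # lifetime in cycles; None means "permanent"
    insertion: str              # "drop" or "overwrite" when the queue is full
    priority: int               # lower number = higher priority for arbitration
    depth: int                  # number of groups the queue can hold

QUEUE_CONFIG = {
    "S Queue":   QueueAttributes("LIFO", expiration=256,  insertion="overwrite", priority=0, depth=2),
    "B Queue":   QueueAttributes("FIFO", expiration=256,  insertion="drop",      priority=1, depth=1),
    "NS0 Queue": QueueAttributes("FIFO", expiration=1024, insertion="overwrite", priority=2, depth=4),
    "NS1 Queue": QueueAttributes("FIFO", expiration=1024, insertion="overwrite", priority=3, depth=4),
}

if __name__ == "__main__":
    for name, attrs in QUEUE_CONFIG.items():
        print(name, attrs)
```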

The priority of the back queue 1712 may be dynamically raised, or modified to be higher than the priority of the sequential queue 1710, in accordance with one embodiment of the present invention. This feature aids in retrieving predictable information from the memory 1612 after the speculator 1608 detects an upper, or "front," sector. This is because the processor 1602 is likely to request the lower, or "back," sector soon after requesting the upper or front sector of a cache line. Thus, increasing the priority of the back queue 1712, especially while it holds back sector sequential predictions, increases the likelihood that the prefetcher 1606 will issue the appropriate back sector sequential prediction to the memory 1612. In a particular embodiment, a back queue counter (not shown) counts the number of items issued from queues other than the back queue 1712. When this counter reaches a threshold, the back queue 1712 is promoted to at least a higher priority than the sequential queue 1710. Thereafter, an item (e.g., a back sector item) can be issued from the back queue 1712. After one or more back-type items are issued, or the back queue 1712 is emptied (e.g., by aging or by issuing all items), the priority of the back queue 1712 returns to its initial priority and the back queue counter is reset.
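The counter-driven promotion just described is sketched below under simple assumptions: the threshold value, priority encodings, and class name are illustrative only, and the sketch treats "priority" as a small integer where lower means higher priority.

```python
# Sketch of back-queue promotion: a counter tracks issues from queues other
# than the back queue; reaching a threshold promotes the back queue above the
# sequential queue, and issuing a back-type item restores the initial priority.

class BackQueuePromoter:
    def __init__(self, threshold=8, normal_priority=1, promoted_priority=0):
        self.threshold = threshold
        self.normal_priority = normal_priority
        self.promoted_priority = promoted_priority
        self.back_priority = normal_priority
        self.counter = 0

    def on_issue(self, queue_name):
        if queue_name != "B Queue":
            self.counter += 1
            if self.counter >= self.threshold:
                self.back_priority = self.promoted_priority   # promote the back queue
        else:
            # a back-type item was issued: restore priority and reset the counter
            self.back_priority = self.normal_priority
            self.counter = 0


if __name__ == "__main__":
    p = BackQueuePromoter(threshold=3)
    for q in ["S Queue", "S Queue", "S Queue", "B Queue"]:
        p.on_issue(q)
        print(q, "-> back queue priority:", p.back_priority)
```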

In general, for any group 1720 of out of order predictions, there may be a mix of series-type and back-type predictions as target addresses for the out of order prediction. In particular, a group of non-sequential addresses may contain only series-type (i.e., forward or reverse) predictions. However, such a group can also include several series-type predictions mixed with back-type predictions. As in the previous example, suppose speculator 1608 has a trigger address "A" associated with a target address "B" and another target address "C". If target address B has higher priority than C, B remains in out of order zero-queue 1714 along with the group of out of order predictions for trigger address A. This group may include predictions B0 (i.e., address B), B1, B2, and B3, which are out of order with respect to address A but are all forward series-type. As a further example, group 1720 may include out of order predictions B(-1) (i.e., address B-1), B0, B1, and B2, where prediction B(-1) is a back-type prediction mixed with the other series-type predictions. Or, group 1720 may include any other arrangement of predictions not described here. Since C has the second highest priority after B, C remains in the out of order one-queue 1716 with a similar group of out of order predictions. As a result, predictions B0, B1, B2, and B3 can be inserted as group 3 of the out of order zero-queue 1714, and predictions C0, C1, C2, and C3 can be inserted as group 3 of the out of order one-queue 1716.

In addition, FIG. 17 illustrates that, in one embodiment, the prediction inventory 1620 is configured to receive predictions 1701 via the inventory filter 1702, with surviving predictions passing along the surviving prediction path. The surviving predictions are then inserted into the appropriate queue and managed by the inventory manager 1704 described above. An example inventory filter 1702 is described below.

FIG. 18 illustrates an example of an inventory filter 1702 in accordance with certain embodiments of the present invention. Although this example is applied to filter forward sequential predictions against a sequential queue, such as sequential queue 1710 of FIG. 17, inventory filter 1702 can be used in conjunction with any queue to filter any type of prediction. That is, inventory filter 1702 can be configured to compare any number of predictions of any prediction type against one or more other queues that contain predictions of different prediction types. For example, a number of forward sequential predictions can be filtered against the back queue, and the like. Inventory filter 1702 includes at least a matcher 1804 to match a number of predictions 1802 against the items in group 1806. Group 1806 includes items A1 through A7, each associated with item A0. A0 is the triggering address that previously generated the predictions identified as items A1 through A7. Further, group 1806 can reside as any group 1720 in sequential queue 1710. As for the number of predictions 1802, they include "TA" as the triggering address and include predictions B1 through B7, all generated by the speculator 1608 upon detection of the address TA. Note that although FIG. 18 shows only one group (i.e., group 1806), other groups 1720 of the same queue can be filtered simultaneously in the same manner.

In a particular embodiment, matcher 1804 consists of a number of comparators identified as CMP0, CMP1, CMP2, ... CMPM (not shown). Comparator CMP0 is configured to compare TA against N items in group 1806, and each of comparators CMP1, CMP2, ... CMPM compares a prediction of predictions 1802 against N items of group 1806, where M is set to accommodate the largest number of predictions generated. As an example, consider that M is 7 (therefore requiring seven comparators) and N is 3, so that each comparator compares one element of predictions 1802 against three items of group 1806. In addition, assume that each element of predictions 1802 is compared beginning with the corresponding item having the same position (e.g., first against first, second against second, etc.). As such, CMP0 compares TA against A0, item A1, and item A2, and CMP1 compares prediction B1 against items A1, A2, A3, and so on. The number N can be set to minimize the amount of comparator hardware, yet set large enough to sufficiently filter a sequential stream of predictions that may result from small jumps (i.e., jumps not greater than N) in the stream of addresses detected on the system bus 1603.
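The comparison pattern described above is sketched below in software form. The sliding window of N items aligned by position is an interpretation of the comparator arrangement just described, and the function name and return value (the set of matching positions) are assumptions made purely for illustration.

```python
# Sketch of the matcher of FIG. 18: the trigger address TA and predictions
# B1..BM are each compared against a window of N items in an inventory group,
# aligned by position. The window size N limits comparator hardware in the
# description; here it simply limits the slice compared.

def match_against_group(trigger, predictions, group_items, n=3):
    """Return the positions in [TA, B1, ..., BM] that match an inventory item."""
    matched = set()
    elements = [trigger] + list(predictions)          # TA, B1, B2, ... BM
    for pos, element in enumerate(elements):
        window = group_items[pos : pos + n]           # compare against N items, aligned by position
        if element in window:
            matched.add(pos)
    return matched


if __name__ == "__main__":
    group = ["A0", "A1", "A2", "A3", "A4", "A5", "A6", "A7"]
    # Suppose the new predictions happen to overlap the group starting at A1.
    print(match_against_group("A0", ["A1", "A2", "A9", "A4"], group, n=3))
    # -> {0, 1, 2, 4}: TA matches A0, B1 matches A1, B2 matches A2, B4 matches A4
```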

In one embodiment, the queue stores a page address to represent A0 and offsets to represent item A1, item A2, and so on. To determine whether a match exists in this case, the page address of address TA in predictions 1802 and the offset of a particular prediction are compared against the page address of A0 and the corresponding offset, respectively. In certain embodiments of the present invention, the inventory filter 1702 does not filter sequential predictions against out of order predictions, and thus does not operate with either the out of order zero-queue 1714 or the out of order one-queue 1716. This is because sequential predictions may be less likely to be redundant with out of order predictions.

FIGS. 19A and 19B are diagrams illustrating exemplary techniques for filtering redundancies in accordance with certain embodiments of the present invention. When the matcher 1804 determines a match, either the newly-generated prediction (i.e., new item K) or the previously generated item (i.e., existing item K) is invalidated. FIG. 19A shows which of new item K and existing item K is filtered, or invalidated, when queue 1902 is a FIFO. In this case, the new item K is invalidated, thereby keeping the existing item K. In contrast, FIG. 19B shows that if the queue 1904 is a LIFO, the existing item K is invalidated, thereby keeping the new item K. In general, whichever of new item K or existing item K will issue earliest is maintained, while the item that would issue later is invalidated. Those skilled in the art will appreciate that the inventory filter 1702 may employ other techniques without departing from the scope and spirit of the invention.

FIG. 20 illustrates another example prediction inventory placed in a prefetcher, according to one embodiment of the invention. In this example, the prefetcher 2000 includes a speculator 1608 and a filter 2014. The prefetcher 2000 of FIG. 20 also includes a multi-level cache 2020 and prediction inventory 1620. Here, the multi-level cache 2020 is composed of a first level data return cache ("DRC1") 2022 and a second level data return cache ("DRC2") 2024. The first level data return cache 2022 may generally be described as a short term data store, and the second level data return cache 2024 may generally be described as a long term data store. The multi-level cache 2020 stores program instructions and program data prefetched from memory 1612 until processor 1602 requires them. The caches of the multi-level cache 2020 also store reference information about the predictions that caused the predictable information to be prefetched, so that newly-generated predictions can be filtered against the multi-level cache 2020. For example, DRC1 2022 and DRC2 2024 store two types of information as references in addition to the data of a cache line or memory unit: (1) the address of the stored cache line, which is used to filter against new predictions, and (2) the trigger address, if the cache line was brought into the cache as a result of a prediction. In particular, the trigger address is used in reprioritizing the out of order predictions within speculator 1608.

Prediction inventory 1620 provides a temporary repository for generated predictions until they are selected by arbiter 2018. The predictions stored in prediction inventory 1620 are also used to filter out redundancies among predictions that would otherwise be issued. Arbiter 2018 is configured to determine, in accordance with arbitration rules, which generated predictions are issued for prefetching instructions and data. In general, such arbitration rules provide criteria for selecting the particular queue from which to issue a prediction. For example, arbiter 2018 selects and issues predictions based in part or in whole on the relative priorities among queues and/or groups.

Filter 2014 includes at least two filters: cache filter 2010 and inventory filter 1702. Cache filter 2010 is configured to compare newly-generated predictions against those previously-generated predictions that caused instructions and data already stored in multi-level cache 2020 to be prefetched. Thus, if one or more newly-generated predictions are redundant with respect to any previously-generated prediction represented in multi-level cache 2020, the redundant predictions are canceled to minimize the number of predictions requiring processing. Note that the redundant prediction (i.e., the surplus, unnecessary prediction) is typically the newly-generated prediction. Inventory filter 1702 is configured to compare newly-generated predictions against predictions that have already been generated and stored in prediction inventory 1620. In one embodiment, inventory filter 1702 is similar in structure and/or function to that shown in FIG. 18. Again, if one or more newly-generated predictions are redundant with respect to predictions previously stored in prediction inventory 1620, the redundant predictions are canceled to free prefetcher resources.

To further reduce the number of redundant predictions, a post-inventory filter 2016 is included in prefetcher 2000. Immediately before or as prefetcher 1606 issues predictions from prediction inventory 1620, post-inventory filter 2016 filters out any redundancies that arise between the time these predictions are first received and the time arbiter 2018 selects predictions for issuance. Typically, such a redundancy arises because a prediction representing the same prediction address as an item in the prediction inventory may already have been issued from prediction inventory 1620 to memory, while the predictable information (and the reference information against which to filter) has not yet returned to cache 2020. In one embodiment, post-inventory filter 2016 may be similar in structure and/or functionality to either inventory filter 1702 of FIG. 18 or cache filter 2010.

In one embodiment, post-inventory filter 2016 maintains issuance information for each item of each group 1720 in prediction inventory 1620. In particular, this issuance information indicates which items of a particular group have been issued. Post-inventory filter 2016 does not, however, remove an issued item from prediction inventory 1620; rather, issued items remain so that incoming predictions can be compared against them when filtering out redundancies. As each item of a particular group is issued, the issuance information is updated to reflect it. Once all items have been issued, the group is removed and the queue is freed to take additional items.
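
A minimal sketch (illustrative names only) of this bookkeeping: issued items stay in their group so that incoming redundant predictions can still be matched against them, and a group's queue slot is freed only once every item has issued.

    class Group:
        def __init__(self, items):
            self.items = list(items)            # trigger address plus predictions
            self.issued = [False] * len(items)  # issuance information per item

        def mark_issued(self, index):
            self.issued[index] = True

        def fully_issued(self):
            return all(self.issued)

    def retire_groups(queue):
        # Free queue slots whose groups have issued every item.
        queue[:] = [g for g in queue if not g.fully_issued()]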

In one embodiment, arbiter 2018 may control some aspects of prediction inventory 1620 related to issuing predictions. In particular, arbiter 2018 can change the relative priority among queues, groups or items so as to issue predictions most efficiently. In certain embodiments, arbiter 2018 is configured to modify the relative priorities in order to throttle the generation of numerous predictions that would excessively burden memory (i.e., cause memory over-utilization), such as memory 1612, cache memory 2020, or other components of the memory subsystem. For example, arbiter 2018 can assign a configurable load threshold to each queue. This threshold represents the maximum rate at which a particular queue may contribute predictions. The load threshold is compared against the contents of an average load accumulator (not shown), which maintains an accumulated measure of the units of work requested of memory 1612. A unit of work is any requested operation of memory 1612, such as a read, a write, and the like. As additional units of work are requested of memory 1612, the value in the average load accumulator increases; over time (e.g., every particular number of clock cycles), the value decreases. In operation, arbiter 2018 compares the load threshold of each queue with the value of the average load accumulator. If the load threshold is exceeded by the average load value, arbiter 2018 performs one of two example operations: it can instruct prediction inventory 1620 to stop accepting predictions for a particular queue until the items therein are either issued or discarded, or it can remove items from the queue by overwriting them. When arbiter 2018 detects that the average load value falls below the load threshold, the queue is again able to issue predictions.
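
A minimal sketch (parameter names are assumptions) of this load-throttling rule: each queue's configurable threshold is compared against an average load accumulator that rises as work is requested of memory and decays over time.

    class AverageLoadAccumulator:
        def __init__(self, decay_per_cycle=1):
            self.value = 0
            self.decay = decay_per_cycle

        def add_work(self, units=1):
            self.value += units                  # each requested read/write adds work

        def tick(self, cycles=1):
            self.value = max(0, self.value - self.decay * cycles)

    def queue_may_issue(queue_load_threshold, accumulator):
        # A queue contributes predictions only while its threshold is not exceeded.
        return accumulator.value <= queue_load_threshold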

Exemplary Embodiments for Performing Look-Ahead Lookups of Predictive Information in Cache Memory

FIG. 21 is a block diagram illustrating a prefetcher 2100 that includes an exemplary multi-level cache 2120 in accordance with certain embodiments of the present invention. In this example, multi-level cache 2120 includes a cache filter 2110, a first level data return cache ("DRC1") 2122, and a second level data return cache ("DRC2") 2124. Cache filter 2110 can quickly test, or perform a "look-ahead lookup" on, first level DRC 2122 and second level DRC 2124 to detect the presence or absence of an input address, such as a prediction address, within these caches. The look-ahead lookup is a test of cache memory that determines simultaneously whether a number of predictions already exist in multi-level cache 2120, for example.

Depending on the presence or absence of a prediction, multi-level cache 2120 manages the contents of first level DRC 2122 and second level DRC 2124 according to caching policies such as the examples described below. First level DRC 2122 may be described generally as a short-term data store and second level DRC 2124 as a long-term data store, with predictions in first level DRC 2122 eventually moving to second level DRC 2124 when the processor does not request them. In accordance with one embodiment of the present invention, either or both of first level DRC 2122 and second level DRC 2124 store program instructions and program data prefetched on the basis of predicted addresses as well as processor-requested addresses. In addition, cache filter 2110, first level DRC 2122 and second level DRC 2124 work together not only to reduce redundant predictions but also to speed up the prefetching of predictable information (e.g., to reduce the latency, such as that of opening a page, in providing prefetched program instructions and program data). Although the description below relates to multi-level cache memory (i.e., multiple caches), note that any of the following exemplary embodiments may instead include a single cache memory.

Cache filter 2110 is configured to compare a range of input addresses against each of a number of multiple caches at the same time, where the multiple caches are hierarchical in nature. For example, the first cache may be smaller in size and configured to store relatively short-term predictions, while the second cache may be larger in size and configured to store predictions for periods longer than those of the first cache. In addition, according to one embodiment of the present invention, the second cache receives only prediction addresses and corresponding prediction data that were previously held in the first cache. To test both caches at the same time, especially when the second cache is larger than the first, the cache filter generates two representations of each address being "looked up" or tested within the caches; both caches are tested simultaneously, with one representation used for the first cache and the other used for the second cache. One reason for this is that the larger cache contains more addresses and entries requiring testing than the smaller cache, so if both are tested at the same time, a more efficient technique is needed for testing addresses against the larger cache than against the smaller one. The query interfaces described below perform these functions.

Prefetcher 2100 also includes speculator 2108 to generate predictions. In particular, speculator 2108 includes a sequential predictor ("SEQ. Predictor") 2102 to generate sequential predictions, such as forward sequential predictions, reverse sequential predictions, blind back sequential predictions, back sector sequential predictions, and the like. Speculator 2108 also includes a nonsequential predictor ("NONSEQ. Predictor") 2104 for forming nonsequential predictions. Prefetcher 2100 uses these predictions to "fetch" both program instructions and program data from memory (not shown) and then store them in multi-level cache 2120 before the processor (not shown) requests those instructions or data. By fetching them before they are used (i.e., "prefetching"), processor idle time (e.g., time during which the processor is starved of data) is minimized.

Nonsequential predictor 2104 includes a target cache (not shown) as a repository for storing the above-described associations of an address with one or more potential nonsequential addresses, each of which may qualify as a nonsequential prediction. The target cache is designed to compare its contents against incoming detected addresses so as to generate nonsequential predictions quickly, and it is configured to prioritize its stored associations, for example in response to a hit in multi-level cache 2120. In particular, if multi-level cache 2120 provides a predicted address to the processor upon request, then the stored trigger-target association to which that address belongs is raised in priority. The "trigger" address is the detected address from which nonsequential predictor 2104 generates a nonsequential prediction, and the resulting prediction is referred to as the "target" of the unpatternable association formed between the two. Note that an address giving rise to a sequential prediction may also be referred to as a trigger address, and the resulting prediction as a target address.
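
A minimal sketch, with an illustrative software structure rather than the disclosed hardware, of a target cache that stores trigger-to-target associations and promotes an association when the multi-level cache reports that its target was requested by the processor.

    class TargetCache:
        def __init__(self):
            self.assoc = {}                       # trigger address -> targets, highest priority first

        def record(self, trigger, target):
            targets = self.assoc.setdefault(trigger, [])
            if target not in targets:
                targets.append(target)

        def predict(self, detected_address):
            # Return the highest-priority target for a detected trigger, if any.
            targets = self.assoc.get(detected_address)
            return targets[0] if targets else None

        def promote(self, trigger, target):
            # Called when a cache hit shows the prediction was useful.
            targets = self.assoc.get(trigger, [])
            if target in targets:
                targets.remove(target)
                targets.insert(0, target)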

In addition, prefetcher 2100 includes a filter 2114, an additional prediction inventory 2116, an additional post-inventory filter 2117, and an additional arbiter 2118. Here, filter 2114 may be configured to include an inventory filter (not shown) for comparing generated predictions against previously-generated predictions residing within prediction inventory 2116. Prediction inventory 2116 provides temporary storage for generated predictions until arbiter 2118 selects predictions to access memory. Arbiter 2118 is configured to determine which of the generated predictions are issued to access memory when prefetching instructions and data. In some embodiments, filter 2114 may also include cache filter 2110 and can be configured to compare generated predictions against previously-generated predictions that caused program instructions and program data already to be "prefetched" into multi-level cache 2120. Thus, if any generated prediction is redundant with respect to any previously-generated prediction represented in multi-level cache 2120, the redundant prediction is canceled (or invalidated) to minimize the number of predictions requiring handling, which can free prefetcher resources.

In operation, speculator 2108 monitors the system bus as the processor requests access to memory ("read requests"). As the processor executes program instructions, speculator 2108 detects read requests for addresses containing program instructions and program data not yet used by the processor. For purposes of explanation, an "address" is associated with a cache line or memory unit that is generally transferred between memory and a cache memory, such as multi-level cache 2120. An "address" of a cache line may refer to a memory location, and the cache line may contain data from more than one address of the memory. The term "data" refers to a unit of information that can be prefetched, whereas the terms "program instructions" and "program data" respectively refer to the instructions and data used by the processor in its processing. Thus, data (e.g., any number of bits) may represent "predictable information," which refers to information constituting either or both of program instructions and program data. The term "prediction" may also be used interchangeably with the term "prediction address." When a prediction address is used to access memory, typically one or more cache lines containing that prediction address, as well as other addresses (prediction addresses or otherwise), are fetched.

When prefetcher 2100 issues predictions, it may append or associate reference information with each prediction. If the prediction is a nonsequential prediction, the reference information associated with it may include a prediction identifier ("PID") and a corresponding target address. The PID (not shown) identifies the trigger address (or a representation thereof) that gave rise to the predicted corresponding target address. This reference information is received by multi-level cache 2120 when the memory returns the prefetched data. Multi-level cache 2120 then temporarily stores the returned data until the processor requires it. During the time that multi-level cache 2120 stores the prefetched data, it manages that data by filtering against newly-generated predictions, ensuring consistency of the data stored therein, and classifying the data as either short-term or long-term data. When the processor requests prefetched data (i.e., predictable information), that data is sent to the processor. If the data located in multi-level cache 2120 is the result of a nonsequential prediction, reference information may be sent to nonsequential predictor 2104 to reprioritize, if necessary, the nonsequential predictions stored in the target cache.

FIG. 22 illustrates an exemplary multi-level cache 2220 in accordance with an embodiment of the present invention. Multi-level cache 2220 includes a cache filter 2210, a first level data return cache ("DRC1") 2222 and a second level data return cache ("DRC2") 2224. Cache filter 2210 includes a DRC1 query interface 2204 and a DRC2 query interface 2214, which respectively interface first level DRC 2222 and second level DRC 2224 with other components, such as components of prefetcher 2100 as well as components of a memory processor (not shown). One such memory processor component is write-back cache 2290 of FIG. 21, which operates according to well-known caching methods whereby a modification to data in the cache is not copied to the cache source (e.g., system memory) until necessary. Write-back cache 2290 is similar in structure and functionality to write-back caches well known in the art and need not be described in detail. In addition, DRC1 query interface 2204 includes a DRC1 matcher 2206 and a DRC1 processor 2208, and DRC2 query interface 2214 includes a DRC2 matcher 2216 and a DRC2 processor 2218.

First level DRC 2222 includes a DRC1 address store 2230 for storing addresses (e.g., prediction addresses), and DRC1 address store 2230 is coupled to a DRC1 data store 2232 that stores data (i.e., predictable information) and a PID. For example, prefetched data derived from a prediction address ("PA") may be stored as data (PA) 2232a in association with PID 2232b; this notation indicates the prediction address PA that contributed to prefetching the data representing the predictable information. When data (PA) 2232a is requested by the processor, the corresponding prediction address PA and prediction identifier PID 2232b are passed to nonsequential predictor 2104 to change the priority of the prediction address if necessary. Prediction identifier PID 2232b typically includes information indicating the trigger address that generated PA. A PA generated by nonsequential predictor 2104 may also be referred to as a target address, and processor-requested addresses (and their associated data) may also be stored in multi-level cache 2220. Further, data (PA) 2232a need not have an associated PID 2232b.
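
A minimal sketch (field names are assumptions) of a DRC1 entry holding the prefetched line, its prediction address PA, and an optional PID identifying the trigger, together with hit handling that reports the association back to the nonsequential predictor using the promote operation sketched earlier.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class DRC1Entry:
        prediction_address: int               # PA used to prefetch the line
        data: bytes                           # e.g., a 64-byte cache line
        pid: Optional[int] = None             # identifies the trigger address; may be absent

    def on_processor_hit(entry, nonseq_predictor):
        if entry.pid is not None:
            # Report (trigger, target) back so the association's priority can be raised.
            nonseq_predictor.promote(entry.pid, entry.prediction_address)
        return entry.data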

In addition, DRC1 address store 2230 and DRC1 data store 2232 are communicatively coupled to a DRC1 manager 2234, which manages their functionality and/or structure. Second level DRC 2224 includes a DRC2 address store 2240 coupled to a DRC2 data store 2242, which stores the same types of information as data (PA) 2232a and PID 2232b. Both DRC2 address store 2240 and DRC2 data store 2242 are communicatively coupled to a DRC2 manager 2246, which manages their functionality and/or structure.

In a particular embodiment of the invention, second level DRC 2224 also includes a store of "valid bits" 2244 for maintaining valid bits separately from DRC2 address store 2240, each valid bit indicating whether the stored prediction is valid (available to service a processor request for data) or invalid (not available). Entries holding invalid predictions may be treated as empty entries. Keeping the bits of valid bits 2244 separate from the addresses makes resetting or setting one or more valid bits faster and computationally less burdensome than if DRC2 address store 2240 stored the valid bits together with the corresponding addresses. In most cases, the valid bits for the addresses of DRC1 are stored together with, or as part of, those addresses.

In operation, DRC1 query interface 2204 and DRC2 query interface 2214 are configured to test the contents of first level DRC 2222 and second level DRC 2224, respectively, to determine whether they contain any one or more of the addresses applied as "input addresses." An input address may originate in speculator 2108 as a generated prediction, or it may originate in the write-back cache, or another element external to multi-level cache 2220, as a write address. As described herein, the input address is generally a generated prediction that is compared against the contents of multi-level cache 2220 to filter out redundancies. Sometimes, however, the input address is a write address identifying a memory location to which data is being, or is to be, written. In that case, multi-level cache 2220 is tested to determine whether any operation is required to remain consistent among the memory, DRC1 data store 2232, and DRC2 data store 2242.

DRC1 matcher 2206 and DRC2 matcher 2216 are configured to determine whether one or more input addresses presented at input/output port ("I/O") 2250 reside in DRC1 address store 2230 and DRC2 address store 2240, respectively. If either DRC1 matcher 2206 or DRC2 matcher 2216 detects that an input address matches a content of first level DRC 2222 or second level DRC 2224, the associated processor, such as DRC1 processor 2208 or DRC2 processor 2218, operates either to filter out the redundant prediction or to ensure that the data in multi-level cache 2220 remains consistent with memory. DRC1 matcher 2206 and DRC2 matcher 2216 can be configured to compare a range of input addresses against the contents of first level DRC 2222 and second level DRC 2224 simultaneously (i.e., at the same time or nearly the same time, such as within one or two cycles of operation (e.g., clock cycles), or some other minimal number of cycles, depending on the structure of multi-level cache 2220). An example of a range of input addresses that can be compared simultaneously against a cache is address A0 (a trigger address) and prediction addresses A1, A2, A3, A4, A5, A6 and A7, the latter seven addresses being generated by sequential predictor 2102.

When the test is performed simultaneously, matchers 2206 and 2216 performing this comparison are said to perform a "look-ahead lookup." In some embodiments, the look-ahead lookup is performed when the processor is idle or is not requesting data from prefetcher 2100. Also, although similar in functionality, DRC1 matcher 2206 and DRC2 matcher 2216 are each structured to work with DRC1 address store 2230 and DRC2 address store 2240, respectively, and therefore need not be structurally similar to each other. Examples of DRC1 matcher 2206 and DRC2 matcher 2216 are described below with respect to FIGS. 23A and 24, respectively, in accordance with one or more specific embodiments of the present invention.

Next, consider a situation in which query interfaces 2204 and 2214 perform a filtering operation. By comparing a large number of input addresses against the contents of multi-level cache 2220 and detecting which input addresses do not match, processors 2208 and 2218 can take appropriate action: input addresses that do not match proceed as generated predictions for fetching predictable information, while matching predictions (i.e., redundant predictions) are filtered out. As such, multi-level cache 2220 and its cache filter 2210 reduce latency by determining more quickly which cache lines should begin to be fetched. Because first level DRC 2222 and second level DRC 2224 are then more likely to contain prefetched predictable information than they would be if predictions were not compared simultaneously or not filtered at all, the likelihood of reducing the latency experienced by the processor is correspondingly greater.

DRC1 address store 2230 and DRC2 address store 2240 each store addresses associated with the prefetched data stored in DRC1 data store 2232 and DRC2 data store 2242, respectively. Each of address stores 2230 and 2240 stores either the address itself or another representation of the address. According to one embodiment of the invention, the exemplary DRC1 address store 2230 is configured to store each address in full as a distinct entry. For example, bits 35:6 of each address are stored in DRC1 to identify these addresses individually. Addresses stored in DRC1 address store 2230 may be viewed as including a common portion (e.g., a tag) and a delta portion (e.g., an index), the common portion and delta portion being used to represent an address during a look-ahead lookup of DRC1 in accordance with at least one embodiment. In addition, DRC1 address store 2230 and DRC1 data store 2232 are configured to store 32 address entries and a 64-byte cache line of data per address entry. In general, although the prefetched data comes from a memory such as dynamic random access memory ("DRAM"), it can come from the write-back cache when the data in DRC1 data store 2232 requires updating.

By contrast, the exemplary DRC2 address store 2240 can be configured as a four-way set-associative store, with each entry able to reside in any of four ways, and can be configured to store a base portion (e.g., a tag) to represent an address. In addition, DRC2 address store 2240 and DRC2 data store 2242 are configured to store 1024 address entries and a 64-byte cache line of data per address entry, respectively. DRC2 data store 2242 stores prefetched data received from DRC1 data store 2232 and, in some implementations, can be composed of any number of memory banks (e.g., four banks: 0, 1, 2, and 3).

Although the memory from which the predictable information is prefetched is typically a DRAM memory (e.g., memory arranged in "Dual In-Line Memory Modules"), the memory can be of any other known memory technology. Typically, memory is subdivided into "pages," which are sections of memory available within a particular row address. When a particular page is accessed, or "opened," another page is closed; this process of opening and closing pages requires time to complete. Thus, when a processor executes program instructions in a somewhat indiscriminate manner, accesses to memory become nonsequential with respect to fetching instructions and data located at various memory locations in the DRAM. As such, a stream of read requests may extend across page boundaries. If the next address on the next page is not available, the processor must fetch the program instructions and program data directly from memory, which increases the latency of retrieving those instructions and data. Thus, by prefetching and storing predictable information spanning multiple pages in multi-level cache 2220, the latency associated with opening a page is reduced in accordance with the present invention. Since the prefetched data is served from the cache, the latency experienced by the processor while an accessed page is being opened is reduced.

For example, consider that nonsequential predictor 2104 accurately predicts that address "00200" will be accessed following a processor read of address "00100." Nonsequential predictor 2104 therefore causes a range of cache lines starting at address "00200" (as well as addresses 00201, 00202, 00203 and 00204 when the batch size is 4) to be fetched before the processor actually accesses address "00200" (i.e., one target address and four prediction addresses, the number of predictions being configurable and defined by the batch size "b"). When the processor actually performs a read of address "00200," the look-ahead lookup of multi-level cache 2220 quickly determines which cache lines within a particular range following address "00200" have already been prefetched. Since a nonsequential transition within a read address stream can require a DRAM page-open operation, the look-ahead lookup lets prefetcher 2100 quickly look ahead within the stream of read requests and determine which addresses or cache lines still need to be prefetched. By fetching them quickly, prefetcher 2100 can hide the latency of DRAM page-open operations, so that the sequential stream of cache lines based on the trigger address (here, the target address "00200") continues without incurring a delay penalty in the processor, despite the nonsequential transition.
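
A minimal sketch (addresses and batch size are illustrative) of the look-ahead lookup in this example: the range following the target address is tested against the cache contents in one pass, and only the missing lines are prefetched, hiding part of the page-open latency.

    def lookahead_lookup(cached_addresses, base, batch=4):
        wanted = [base + i for i in range(batch + 1)]     # target plus b predictions
        present = [a for a in wanted if a in cached_addresses]
        missing = [a for a in wanted if a not in cached_addresses]
        return present, missing

    cached = {0x00200, 0x00202}
    present, to_fetch = lookahead_lookup(cached, 0x00200)
    # to_fetch == [0x00201, 0x00203, 0x00204]; only these lines are prefetched.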

FIG. 22 shows DRC1 manager 2234 and DRC2 manager 2246 as separate entities, but they need not be separate. That is, DRC1 manager 2234 and DRC2 manager 2246 can be combined into a single management entity, can be placed outside multi-level cache 2220, or both. First level DRC 2222 and second level DRC 2224 are structurally and/or functionally different from the conventional L1 and L2 caches of a processor, and unique means of managing the predictable information stored in multi-level cache 2220 are employed. Examples of such means include: means for allocating memory in each data return cache; means for copying information between the short-term and long-term data stores; and means for maintaining consistency between multi-level cache 2220 and other entities, such as the write-back cache.

First, consider the copy means used to manage copying of predictable information from first level DRC 2222 to second level DRC 2224 as this information ages from short-term to long-term information. DRC1 manager 2234 cooperates with DRC2 manager 2246 to send data from DRC1 data store 2232 to DRC2 data store 2242 once that data has resided within first level DRC 2222 beyond a certain time threshold. Note that this threshold may be a fixed quantity or may vary during operation. Typically, aged data may be configured to be sent whenever DRC1 has fewer than N invalid (i.e., available) entries, where N is programmable. In operation, when data is copied from the short-term to the long-term store, the corresponding entry in first level DRC 2222 is deleted (i.e., invalidated).

Second, consider the allocation means for inserting predictable information into first level DRC 2222 and second level DRC 2224. When inserting predictable information into first level DRC 2222, DRC1 manager 2234 selects as a candidate any invalid entry of DRC1 data store 2232, excluding locked entries. If DRC1 manager 2234 does not detect any invalid entry in which the predictable information can be stored, the oldest entry can be used to allocate space for the new entry. With regard to allocating entries in DRC2 data store 2242, DRC2 manager 2246 can use any one of a number of ways (i.e., one of four ways) to receive data copied from first level DRC 2222 to second level DRC 2224. For example, the index of the prediction address may select a set of four entries for storing the data. Initially, DRC2 data store 2242 allocates any one of the unused (i.e., invalidated) ways. If all ways are already allocated, the first one in is the first one out (i.e., the oldest is overwritten); however, if the oldest entries are valid and of the same age, DRC2 manager 2246 allocates an unlocked entry among them. Finally, if all entries in the set of ways are locked, DRC2 manager 2246 suppresses writing from first level DRC 2222 to second level DRC 2224 and maintains the valid first level DRC 2222 entry. Note also that, typically, second level DRC 2224 receives only data stored in first level DRC 2222.
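
A minimal sketch (assumed data layout) of this way-allocation rule for one four-way DRC2 set: an unused way is preferred, otherwise the oldest unlocked entry is overwritten, and the copy is suppressed if every way is locked.

    def allocate_way(ways):
        # ways: list of dicts with 'valid', 'locked' and 'age' fields.
        # Returns the index of the way to use, or None to suppress the copy.
        for i, w in enumerate(ways):
            if not w["valid"]:
                return i                          # an invalid (unused) way is available
        unlocked = [i for i, w in enumerate(ways) if not w["locked"]]
        if not unlocked:
            return None                           # all ways locked: keep the DRC1 entry instead
        return max(unlocked, key=lambda i: ways[i]["age"])   # overwrite the oldest unlocked entry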

Another means that DRC1 manager 2234 and DRC2 manager 2246 can employ concerns maintaining consistency. DRC1 manager 2234 maintains first level DRC 2222 consistency by updating the data of any entry whose address matches a write address whose data is being written. Typically, write-back cache 2290 (FIG. 21) temporarily stores a write address (and corresponding data) until the write is performed to memory (e.g., DRAM). In cases where the address of a read request matches a write address in write-back cache 2290, multi-level cache 2220 merges the data of the write address with the data read from memory before sending that data to first level DRC 2222. DRC2 manager 2246 maintains second level DRC 2224 consistency by invalidating any entry whose address matches a write address loaded into write-back cache 2290. Since second level DRC 2224 receives data only from DRC1, and since first level DRC 2222 is kept consistent with memory and write-back cache 2290, second level DRC 2224 generally does not contain inconsistent data. In addition, any address being copied from DRC1 to DRC2 may first be checked against write-back cache ("WBC") 2290; if a match is found in WBC 2290, the copy operation fails, and otherwise the copy from DRC1 to DRC2 proceeds. This additional check further helps maintain consistency.
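
A minimal sketch (illustrative only) of these consistency rules: a write address updates a matching DRC1 entry, invalidates a matching DRC2 entry, and a DRC1-to-DRC2 copy is abandoned if the address is found in the write-back cache.

    def handle_write(write_addr, data, drc1, drc2):
        if write_addr in drc1:
            drc1[write_addr] = data               # keep the short-term store up to date
        if write_addr in drc2:
            del drc2[write_addr]                  # invalidate the long-term copy

    def copy_to_drc2(addr, drc1, drc2, write_back_cache):
        if addr in write_back_cache:
            return False                          # copy fails; consistency preserved
        drc2[addr] = drc1[addr]
        return True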

FIG. 23A illustrates an exemplary DRC1 query interface 2323 for a first address store 2305, in accordance with certain embodiments. In this example, an input address such as a trigger address ("A0") 2300 (e.g., a processor-requested address) consists of a common address portion 2302a and a delta address portion 2302b. Address 2300 may in some cases be a prediction address or, in other cases (when maintaining consistency), a write address. If address 2300 is a trigger address that generates a group of prediction addresses, this group 2307 may contain addresses such as those identified as address ("A1") 2301 through address ("Am") 2303, where "m" represents any number of predictions that can be used to perform a look-ahead lookup in accordance with at least one embodiment of the present invention. In some cases, "m" is set equal to the batch size "b."

Each entry 2306 of DRC1 address store 2305 includes a first entry portion 2306a (e.g., a tag) and a second entry portion 2306b (e.g., an index). In a particular embodiment, first entry portion 2306a and second entry portion 2306b correspond to common address portion 2302a and delta address portion 2302b, respectively. Second entry portion 2306b represents the displacement of the address held in that particular entry 2306 from the trigger address ("A0") 2300. Thus, when DRC1 matcher 2312 compares an input address, such as trigger address ("A0") 2300, to entries 2306, common portion 2302a can be used to represent the common portion of all the addresses of group 2307. Because common portion 2302a of address 2300 is generally the same as the common portions of addresses ("A1") 2301 through ("Am") 2303, only common portion 2302a needs to be compared against the first entry portions 2306a of entries 2306. The delta portions 2302b for addresses ("A1") 2301 through ("Am") 2303 are then matched against the multiple second entry portions 2306b of entries 2306.

In one embodiment, DRC1 matcher 2312 includes a common comparator 2308 for matching the common address portion against the first entry portions, and delta comparators 2310 for matching the delta address portions against the second entry portions. In particular, common portion 2302a is compared simultaneously against the first portions 2306a of entries 0 through n, and delta portions 2302b are compared simultaneously against the second portions 2306b of the same entries. In some embodiments, common comparator 2308 is a "wide" comparator for comparing high-order bits (e.g., bits 35:12 of a 36-bit address), and delta comparators 2310 are "narrow" comparators for comparing low-order bits (e.g., bits 11:6 of a 36-bit address). FIG. 23A shows one delta comparator per delta portion 2302b; in some cases, the number of delta comparators 2310 equals m*n (not shown), where each delta comparator receives as inputs one delta portion 2302b and one second entry portion 2306b. The comparator sizes limit the amount of physical resources required to perform these comparisons; accordingly, the addresses previewed simultaneously are constrained to lie within the same memory page (e.g., a memory page is typically 4K bytes). Although this prevents a look-ahead lookup from crossing page boundaries, such configurations reduce the cost, in physical resources, of performing the look-ahead lookup. Common portion 2302a and delta portions 2302b are compared against entries 2306 simultaneously, or nearly simultaneously.
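
A minimal sketch (bit positions taken from the 36-bit example above, otherwise illustrative) of the split comparison: one wide compare of the common portion and narrow compares of the delta portions, with all simultaneously previewed addresses assumed to lie within one page.

    COMMON_SHIFT = 12                             # bits 35:12 form the common portion
    DELTA_MASK = (1 << 12) - 1                    # bits 11:6 (above the line offset) form the delta

    def split(addr):
        return addr >> COMMON_SHIFT, (addr & DELTA_MASK) >> 6

    def drc1_hit_list(trigger, predictions, entries):
        # entries: list of (common, delta) pairs held in the DRC1 address store.
        common, _ = split(trigger)
        hits = []
        for addr in [trigger] + list(predictions):
            _, delta = split(addr)
            hits.append(any(e_common == common and e_delta == delta
                            for e_common, e_delta in entries))
        return hits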

The outputs of common comparator 2308 and delta comparators 2310 are Hbase(0), Hbase(1), ... Hbase(m) and H0, H1, H2, ... HN, respectively, each being either 0 (e.g., no match) or 1 (e.g., match). Together they form hit vectors of 0s and 1s, which are sent to DRC1 processor 2314 to take action, depending on whether it is filtering or maintaining consistency. Hit list generator 2313 generates a list of hits ("hit list") indicating which addresses of range "r" (i.e., group 2307) reside in DRC1 address store 2305. If an address matches (i.e., the prediction is already stored there), the address is included in the hit list, while an address that does not match (i.e., the prediction is not stored) is excluded from it. This hit list is used in issuing predictions or in maintaining consistency within DRC1 address store 2305.

FIG. 23B shows a number of exemplary input addresses 2352 that can be tested simultaneously using the DRC1 query interface 2323 of FIG. 23A, in accordance with certain embodiments. Here, DRC1 query interface 2350 can accept any range of addresses 2352 to match against DRC1 address store 2305. Matcher 2312 of FIG. 23A is replicated as many times as required to perform parallel look-ahead lookups over the number of input addresses. For example, for forward sequential predictions with a batch size "b" set to 7, DRC1 query interface 2350 requires matchers for matching addresses A1 through A7, as group 2307, based on the base (or trigger) address. For blind back prediction, only A(-1) requires matching against the base address as group 2307, whereas for backward sequential prediction, addresses A(-1) through A(-7) require matching. Note that the range of addresses 2352 is applied in parallel to the DRC1 and DRC2 query interfaces simultaneously.

FIG. 24 illustrates an exemplary DRC2 query interface 2403 for a DRC2 address store 2404, in accordance with certain embodiments. DRC2 query interface 2403 is configured to receive input addresses 2402 to compare against the contents of DRC2 address store 2404. In this example, input address 2402 is the base portion (e.g., tag) of an address, such as tag A0. Also in this example, DRC2 address store 2404 is composed of four banks 2406 of memory (bank 0, bank 1, bank 2 and bank 3), each bank containing entries 2410. In this case, an entry 2410 may be located in any one of four ways W0, W1, W2, and W3.

DRC2 matcher 2430 includes a number of comparators to compare tag A0 against entries 2410. In general, any matching address in DRC2 address store 2404 shares the same tag A0 but may differ in other groups of bits (e.g., in the index). In a particular embodiment of the present invention, determining whether a tag matches any entry in DRC2 address store 2404 generally proceeds as follows. First, for each bank 2406, one of the indexes in that bank is selected to be searched for a potentially matching address. As shown in FIG. 25A, the index selected for searching depends on the bank in which a specific address (e.g., A0 in FIG. 25A) resides, as identified by certain index bits of that address, so the selected index can vary per bank. Second, all four ways of the selected index are accessed for each bank 2406. Next, the tags stored for the four ways (e.g., W0 to W3) are compared against tag A0, which is the base address 2402 in this example. In general, it is sufficient to compare tag A0 without comparing other tags such as tag A1, because these tags are assumed to be identical (e.g., tag A0 = tag A1 = tag A2); simultaneous searching of predictions is usually limited to addresses within the same page, such as a 4-kilobyte page, which results in the same tag. Third, if an address match is found by DRC2 matcher 2430, the resulting hit vector and the valid bits are combined to obtain a final hit vector, similar to that described with respect to FIGS. 27 and 28.

Hit generator 2442 of DRC2 query interface 2403 receives the tag comparison results ("TCR") 2422 from DRC2 matcher 2430 and further compares those results against the corresponding valid bits to generate an ordered set of predictions ("ordered predictions") 2450. Here, the tag comparison results are the results of the tag comparisons from banks 0, 1, 2 and 3, represented by TCR(a), TCR(b), TCR(c) and TCR(d), respectively, each being one or more bits indicating whether the tag matches one or more entries 2410. The ordered predictions may be an ordered set of predictions that match (or do not match) input address 2402. The ordered predictions may also be a vector of bits, each indicating whether the corresponding input address is present in DRC2 address store 2404. Any number of input addresses 2402 can be matched in a similar manner by DRC2 query interface 2403 when additional DRC2 matchers 2430 are included. FIGS. 25A-28 illustrate exemplary hit generators in accordance with some embodiments of the present invention.

FIG. 25A illustrates possible arrangements (or representations thereof) of addresses stored in DRC2 address store 2404 in accordance with one embodiment of the present invention. Note that ways W0, W1, W2 and W3 are not shown, to simplify the description below. Input addresses A0, A1, A2, and A3 are stored in DRC2 address store 2404. As an example, sequential predictor 2102 (not shown) may generate sequential predictions A1, A2, and A3 based on trigger address A0 (which may be stored in any of the four ways). The first arrangement 2502 results from A0 being stored in bank 0. Similarly, the second arrangement 2504, third arrangement 2506 and fourth arrangement 2508 result from address A0 residing in banks 1, 2 and 3, respectively, with the subsequent addresses stored in series following the trigger address. As such, these addresses (or portions thereof, such as their tags) are generally output from DRC2 address store 2404 without any particular order.

FIG. 25B illustrates an exemplary hit generator 2430 that generates results based on unordered addresses and correspondingly ordered valid bits in accordance with one embodiment of the present invention. In this example, sequential predictor 2102 generates sequential predictions A1, A2, A3, A4, A5, A6, and A7 based on trigger address A0, and these are stored in the particular arrangement shown (i.e., trigger address A0 stored in bank 1, with the other addresses following). The hit generator receives the unordered addresses A2, A6, A1, A5, A0, A4, A3, A7 and the ordered valid bits VB0 through VB7, orders them, compares them, and produces results R0 through R7, which may be a bit vector (match or no match) or a list of addresses. Note that a valid bit indicating that a prediction has been invalidated prevents that invalidated prediction from being matched; this is one reason for matching valid bits against the contents of the address store. In accordance with certain embodiments of the present invention, four addresses are considered simultaneously rather than eight, such as addresses A2, A1, A0 and A3, or addresses A6, A5, A4 and A7. In that case, it is not necessary for addresses A0 through A7 to be accessible simultaneously in the "overlapped" manner shown in FIG. 25B. To consider addresses A0 through A7 of FIG. 25B simultaneously, however, DRC2 may be configured as a dual-port random access memory ("RAM") to permit two simultaneous accesses to the same RAM (or the same DRC2).

FIG. 26 is a schematic representation of a hit generator 2600 corresponding to hit generator 2442 of FIG. 24. Hit generator 2600 produces one or more results R0 through R7 by multiplexing, for each input address, the addresses in ways 0 to 3 and/or the valid bits, where a result R is determined by comparing the multiplexed address bits or valid bits. If a valid bit indicates that the tag indicated by the corresponding tag comparison result ("TCR") is valid, that tag is output as result R. Note that a TCR may be the tag of an address or may be a single bit having a value of either "1" (i.e., a hit in DRC2) or "0" (i.e., no hit in DRC2). As described below with respect to FIGS. 27 and 28, a tag for an address (e.g., tag A1) is generally represented by a single TCR bit.

FIG. 27 shows an example of hit generator 2442 in accordance with an embodiment of the present invention. Hit generator 2442 is configured to order the unordered tags for addresses A3, A0, A1 and A2 arriving from the ways of banks 0, 1, 2, and 3. Here, the tags for addresses A3, A0, A1 and A2 are each represented by a single bit, the TCR for that tag. Next, the ordered TCRs (shown as ordered tags for addresses A0, A1, A2, A3) are tested against valid bits VB0-VB3 of valid bits 2244; AND operator ("AND") 2706 performs the test as a logical AND function. Thus, if the valid bit is true and the single-bit TCR is true, there is a hit, and the result R reflects it. That is, results R0, R1, R2 and R3 may be bits indicating a match or mismatch, forming an ordered prediction result of tags that do or do not match an address. If the tag itself is used as the TCR (e.g., tag A3), then AND operator 2706 masks the bits of any tag whose corresponding valid bit is zero (e.g., the result R is forced to zero in that case).

FIG. 28 shows another example of hit generator 2442 in accordance with another embodiment of the present invention. Here, hit generator 2442 includes a valid bit ("VB") sequencer 2802 configured to un-order the valid bits VB0-VB3, which are stored in order in valid bits 2244. That is, valid bit sequencer 2802 reorders VB0, VB1, VB2, VB3 into VB3, VB0, VB1, VB2 so as to match the order of the TCRs representing the tags for addresses A3, A0, A1 and A2. Next, the unordered tags for the addresses (i.e., the unordered TCRs for these tags) are tested against the similarly unordered valid bits by an AND operator ("AND") 2806. The unordered results R3, R0, R1 and R2 then pass through a result sequencer 2810 to obtain R0, R1, R2 and R3 as an ordered prediction result, a form usable by elements of prefetcher 2100 for filtering, consistency maintenance, and so on. By reordering the valid bits and the results (which may each be a single result bit), less hardware is needed than if each multi-bit address were reordered. Orderer 2702 and result sequencer 2810 are illustrative; other mappings for ordering and reordering bits are within the scope of the present invention.
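
A minimal sketch (illustrative only) of the FIG. 28 arrangement: the ordered valid bits are permuted to match the unordered single-bit TCRs, combined by AND, and the results are permuted back into address order, so only small bit vectors are ever reordered.

    def hit_vector(unordered_tcr, ordered_valid_bits, order):
        # order[i] gives the address position of unordered_tcr[i]
        # (e.g., order = [3, 0, 1, 2] for tags arriving as A3, A0, A1, A2).
        permuted_valid = [ordered_valid_bits[pos] for pos in order]        # VB sequencer
        unordered_results = [t & v for t, v in zip(unordered_tcr, permuted_valid)]
        results = [0] * len(unordered_results)
        for i, pos in enumerate(order):                                    # result sequencer
            results[pos] = unordered_results[i]
        return results

    # Example: tags arrive in order A3, A0, A1, A2.
    print(hit_vector([1, 1, 0, 1], [1, 1, 1, 0], [3, 0, 1, 2]))   # -> [1, 0, 1, 0]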

In a particular embodiment of the invention, prefetcher 2100 of FIG. 21, which includes nonsequential predictor 2104 and multi-level cache 2120, is disposed within a memory processor having at least some of the functionality of a Northbridge chip in a Northbridge-Southbridge chipset architecture. Such a memory processor is designed to at least manage memory accesses by one or more processors, such as a CPU, a graphics processing unit ("GPU"), or the like. In a Northbridge implementation, prefetcher 2100 may be connected to the GPU via an AGP/PCI Express interface, a front side bus ("FSB") may be used as the system bus between the processor and the memory, and the memory may be system memory. Alternatively, multi-level cache 2120 may be employed in any other structure, circuit, device, or the like that controls access to memory, as a memory processor does. In addition, multi-level cache 2120 and its elements, as well as the other components of prefetcher 2100, may be composed of hardware or software modules, or both, and may be distributed or combined in any manner.

For purposes of explanation, the foregoing descriptions have used specific nomenclature to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required in order to practice the invention. Accordingly, the foregoing descriptions of specific embodiments of the present invention are presented for purposes of illustration and description. They are not intended to be exhaustive or to limit the invention to the precise forms disclosed; obviously, many modifications and variations are possible in view of the above teachings. Indeed, this description should not be read to limit any feature or aspect of the invention to any embodiment, and the features and aspects of one embodiment may readily be interchanged with those of other embodiments. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, thereby enabling others skilled in the art to best utilize the invention and its various embodiments with various modifications. It is intended that the following claims and their equivalents define the scope of the invention.

Claims (30)

  1. A prefetcher for predicting accesses to a memory, comprising:
    A first address predictor configured to associate a subset of addresses to one address and predict a group of addresses based on one or more addresses of the subset,
    wherein the one or more addresses of the subset are nonpatternable with respect to the one address.
  2. The prefetcher of claim 1,
    The first address predictor,
    And a nonsequential predictor for generating the group of addresses as nonsequential predictions when the address is detected.
  3. The prefetcher of claim 2,
    wherein the nonsequential predictor comprises:
    a repository storing the subset of addresses and combinations of the addresses, each address of the subset being prioritized relative to the remainder, wherein the address is stored as a trigger address and the subset of addresses is stored as target addresses; and
    a nonsequential prediction engine configured to detect the address in a stream of addresses and further configured to select the one or more addresses as nonsequential predictions based on their highest priority.
  4. The prefetcher of claim 3, wherein
    The highest priority is
    At least indicating that a processor has requested the one or more addresses most recently, relative to the remainder of the subset of addresses.
  5. The prefetcher of claim 3, wherein
    The first address predictor is configured to generate indexes and tags from addresses, the repository including a plurality of ways each having memory locations for storing trigger-target combinations,
    wherein a trigger-target combination stored in a first way has a higher priority than other trigger-target combinations stored in a second way.
  6. The prefetcher of claim 5,
    Each trigger-target combination comprises a tag having a tag size and at least a portion of a target address having a portion size;
    Wherein the tag represents a trigger address and the tag size and the size of the portions are configured to minimize a size requirement for a memory location.
  7. The prefetcher of claim 5,
    The first address predictor is configured to compare the first address tag with each tag identified by one or more of the indexes to detect any trigger-target combination that includes a first address tag,
    wherein the first address predictor uses either or both of a target address from one or more trigger-target combinations to form a nonsequential prediction, or two consecutive trigger addresses to form additional nonsequential predictions based on one or more other trigger-target combinations,
    Each of the one or more other trigger-target combinations relates to lower levels in the target cache than one of the trigger-target combination or the other trigger-target combination.
  8. The prefetcher of claim 5,
    Further comprising a priority adjuster configured to modify a priority for one of the trigger-target combinations that includes a target address that matches a second address,
    The target address is identified by a trigger address consisting of one or more of the index and a first address tag.
  9. The prefetcher of claim 3,
    further comprising an accelerator configured to designate a first address of a sequential stream of addresses as a new trigger address for the one or more addresses, so that, when the trigger address is within the sequential stream, the nonsequential prediction occurs earlier through the new trigger address than it would through the trigger address.
  10. The prefetcher of claim 3, wherein
    And a suppressor configured to suppress generating one or more prediction addresses.
  11. The prefetcher of claim 10,
    The suppressor is configured to reduce the batch quantity of addresses for the group if the address relates to one or both of a request for data or a prefetch request, thereby suppressing in advance the generation of the one or more prediction addresses.
  12. The prefetcher of claim 10,
    The suppressor is further configured to suppress generating the group of addresses as nonsequential predictions if the interval of time from the detection of the address as the trigger address to the generation of the group of addresses as nonsequential predictions is less than a threshold,
    the threshold being at least defined by an amount of time between a first processor request for the trigger address and a second processor request for the one or more addresses, that time being less than the time required to prefetch one or more of the group of addresses from memory.
  13. The prefetcher of claim 10,
    The suppressor is further configured to keep track of a base address and a last-detected address for each of a plurality of interleaved sequential streams, to determine whether another address is within the address stream from the base address to the last-detected address for any of the plurality of interleaved sequential streams, and, if so, to suppress the generation of at least the prediction address based on the other address.
  14. The prefetcher of claim 13,
    Each of the plurality of interleaved sequential streams is part of one of a plurality of threads.
  15. The prefetcher of claim 10,
    And a second address predictor comprising a sequential predictor to generate a plurality of additional predictive addresses based on the one or more other addresses.
  16. The prefetcher of claim 15,
    The plurality of additional prediction addresses;
    A first number of addresses sequentially ordered from the one or more other addresses, or
    Include a second number of addresses sequentially in descending order from the one or more other addresses, or
    Includes both the first number of addresses and the second number of addresses,
    wherein the suppressor is configured to:
    Detect the one or more other addresses are part of a first address stream in ascending order, and suppress the plurality of additional predictive addresses based on the second number of addresses ordered in descending order,
    And detect the one or more other addresses as being part of a second address stream in descending order and suppress the plurality of additional predictive addresses based on the first number of addresses ordered in ascending order.
  17. The prefetcher of claim 15,
    The plurality of additional prediction addresses;
    A back address sequentially ordered by one from the one or more other addresses in descending order, or
    One or both of the back sector addresses of the one or more other addresses,
    And the suppressor is further configured to reduce the batch amount by one when the plurality of additional predictive addresses includes one of the back address or the back sector address.
  18. The prefetcher of claim 15,
    A prediction inventory comprising a plurality of queues each configured to maintain predictions of the same prediction type until it is published or filtered; And
    An inventory filter for generating a subset of the filtered addresses,
    The inventory filter is configured to filter extra addresses in the predictive inventory, or one of the group of addresses and the plurality of additional predictive addresses,
    The prefetcher is configured to provide one or more of the filtered subset of addresses.
  19. The prefetcher of claim 18,
    And the plurality of queues further comprises one or more queues that maintain different types of predictions than predictions maintained in other queues of the plurality of queues.
  20. The prefetcher of claim 18,
    Further comprising an inventory manager configured to control each of the plurality of queues according to one or more queue attributes,
    The one or more queue attributes are
    Type of queue,
    Expiry Time,
    Cue size,
    An insertion indicator indicating a method for inserting an incoming prediction into a full queue, and
    Prefetcher, which is the priority for selecting the next prediction.
  21. The prefetcher of claim 20,
    The plurality of queues are:
    Sequential queues with sequential queue priorities;
    A back queue having a back queue priority configurable to indicate a precedence exceeding the sequential queue priority; And
    Further comprising one or more out of order queues, each having a unique priority relative to the priorities of the other queues.
  22. The prefetcher of claim 20,
    The inventory manager manages predictions by each group of items including a triggering address and one or more items,
    Wherein each of the plurality of queues is configured to be searched to match the predictions to other predictions that are independent of the plurality of queues upon publication.
  23. The prefetcher of claim 22,
    Further comprising an arbiter configured to publish the one or more items as an issued item to access a memory,
    The published item is selected based on a priority for a publication queue, and the priority is changeable by the arbiter upon detecting that the publication queue is contributing to memory over-use.
  24. The prefetcher of claim 23,
    A cache memory containing predictable information and references; And
    And a post-inventory filter configured to compare the published item against the references to filter out the published item as a redundant prediction.
  25. The prefetcher of claim 15,
    Further comprising a cache memory for managing predictable accesses to the memory;
    The cache memory,
    A short term cache memory configured to store predictions having a lifetime less than a threshold;
    A long term cache memory configured to store predictions that have a lifetime greater than or equal to the threshold and having more memory capacity than the short term cache memory; And
    An interface configured to detect in parallel whether multiple predictions are stored in one or both of the short term cache memory or the long term cache memory, and
    The interface uses two or more representations for each of the multiple predictions when examining the short term cache memory and the long term cache memory.
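Claim 25's interface checks the short-term and long-term caches for the same prediction; in hardware this would happen in parallel, while the sketch below simply probes both structures. The name DualPredictionCache and the map-based storage are assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative lookup that consults both the short-term and the long-term
// prediction caches for the same address.
struct DualPredictionCache {
    std::unordered_map<uint64_t, int> short_term;  // address -> entry index
    std::unordered_map<uint64_t, int> long_term;

    bool contains(uint64_t addr) const {
        return short_term.count(addr) != 0 || long_term.count(addr) != 0;
    }
};
```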
  26. The prefetcher of claim 25,
    further comprising a data return cache manager configured to copy a stored prediction, as a copied prediction, from the short-term cache memory to the long-term cache memory when the stored prediction ages past the threshold,
    wherein the short-term cache memory is the only source of data for the long-term cache memory.
  27. The prefetcher of claim 26,
    wherein the data return cache manager is further configured to:
    store the copied prediction in an entry of the long-term cache memory located in one of a plurality of ways when such a way is available, or
    store the copied prediction, if none of the plurality of ways is available, in the entry of the long-term cache memory that contains the oldest stored prediction.
  28. The prefetcher of claim 26,
    wherein the data return cache manager is further configured to:
    store a prediction in an entry of the short-term cache memory that contains an invalid prediction, or
    store the prediction in another entry of the short-term cache memory that contains the oldest prediction.
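Claims 26 to 28 together describe placement and promotion policies: an aged-out prediction is copied from the short-term to the long-term cache, and new entries prefer an invalid (free) slot before displacing the oldest one. A minimal sketch under assumed capacities and an assumed aging threshold; freeing the short-term slot after promotion is also an assumption.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Illustrative placement rules: reuse an invalid slot first, otherwise
// displace the oldest entry; promote aged-out short-term predictions into
// the long-term cache.
struct Entry { bool valid; uint64_t addr; unsigned age; };

template <std::size_t N>
std::size_t pick_victim(const std::array<Entry, N>& entries) {
    std::size_t victim = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (!entries[i].valid) return i;                       // invalid slot first
        if (entries[i].age > entries[victim].age) victim = i;  // otherwise oldest
    }
    return victim;
}

struct PredictionCaches {
    std::array<Entry, 8> short_term{};  // assumed capacities, value-initialized
    std::array<Entry, 4> long_term{};
    unsigned threshold = 16;            // assumed aging threshold

    void insert_short(uint64_t addr) {
        short_term[pick_victim(short_term)] = Entry{true, addr, 0};
    }

    // The short-term cache is the only data source for the long-term cache.
    void age_and_promote() {
        for (Entry& e : short_term) {
            if (!e.valid) continue;
            if (++e.age > threshold) {
                long_term[pick_victim(long_term)] = e;  // the copied prediction
                e.valid = false;                        // assumed: free the slot
            }
        }
    }
};
```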
  29. The prefetcher of claim 26,
    wherein the data return cache manager is further configured to:
    match a write address against a next stored prediction to form a matched prediction,
    merge, if the next stored prediction is stored in the short-term cache memory, at least a portion of the data of the write address into a portion of the prediction information of the matched prediction, and
    invalidate the next stored prediction if the next stored prediction is stored in the long-term cache memory.
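Claim 29's write handling can be read as merge-on-match in the short-term cache and invalidate-on-match in the long-term cache. A hypothetical sketch; the 64-byte line size, the offset handling, and all names are assumptions.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>

// Illustrative write handling: a write that matches a stored prediction
// merges its data in the short-term cache, but invalidates the matched
// prediction in the long-term cache.
struct CachedPrediction {
    std::array<uint8_t, 64> data{};  // assumed 64-byte line of prediction information
    bool valid = true;
};

struct PredictionStore {
    std::unordered_map<uint64_t, CachedPrediction> short_term;
    std::unordered_map<uint64_t, CachedPrediction> long_term;

    void on_write(uint64_t line_addr, const uint8_t* bytes,
                  std::size_t len, std::size_t offset) {
        auto st = short_term.find(line_addr);
        if (st != short_term.end() && offset < st->second.data.size()) {
            std::size_t n = std::min(len, st->second.data.size() - offset);
            std::memcpy(st->second.data.data() + offset, bytes, n);  // merge write data
            return;
        }
        auto lt = long_term.find(line_addr);
        if (lt != long_term.end())
            lt->second.valid = false;  // invalidate the matched long-term prediction
    }
};
```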
  30. The prefetcher of claim 25,
    wherein the short-term and long-term cache memories are configured to reduce processor-related latency due to opening pages of the memory by storing nonsequential predictions as a subset of a range of predictions,
    the subset including predictions in at least two pages of the memory,
    wherein the first address predictor generates the range of predictions in response to a trigger address,
    the trigger address being in a page of the memory different from any page that includes the range of predictions,
    wherein the short-term and long-term cache memories are configured to store a prediction identifier associated with each entry of stored predictions and to send the prediction identifier to the first address predictor,
    and wherein the long-term cache memory is configured to store valid bits separately from each entry configured to store a prediction.
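Claim 30 has each cached prediction carry a prediction identifier that is returned to the address predictor on a hit. The sketch below illustrates that round trip; the class and field names are invented for illustration, and keeping the valid flag alongside each slot is only a stand-in for the separately stored valid bits.

```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative round trip: each cached prediction carries a prediction
// identifier, which is reported back to the address predictor on a hit.
class PredictionCache {
public:
    void store(uint64_t addr, uint32_t prediction_id) {
        entries_[addr] = Slot{prediction_id, true};
    }
    // Returns true on a hit and reports the identifier, which the predictor
    // could use to reinforce the trigger-to-range association.
    bool lookup(uint64_t addr, uint32_t& prediction_id_out) const {
        auto it = entries_.find(addr);
        if (it == entries_.end() || !it->second.valid) return false;
        prediction_id_out = it->second.prediction_id;
        return true;
    }
private:
    struct Slot { uint32_t prediction_id; bool valid; };
    std::unordered_map<uint64_t, Slot> entries_;
};
```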
KR1020077003839A 2004-08-17 2005-08-16 System, apparatus and method for managing predictions of various access types to a memory associated with cache memory KR100987832B1 (en)

Priority Applications (8)

Application Number Priority Date Filing Date Title
US10/920,682 US7461211B2 (en) 2004-08-17 2004-08-17 System, apparatus and method for generating nonsequential predictions to access a memory
US10/920,995 US7260686B2 (en) 2004-08-17 2004-08-17 System, apparatus and method for performing look-ahead lookup on predictive information in a cache memory
US10/920,610 2004-08-17
US10/920,995 2004-08-17
US10/921,026 US7206902B2 (en) 2004-08-17 2004-08-17 System, apparatus and method for predicting accesses to a memory
US10/920,682 2004-08-17
US10/921,026 2004-08-17
US10/920,610 US7441087B2 (en) 2004-08-17 2004-08-17 System, apparatus and method for issuing predictions from an inventory to access a memory

Publications (2)

Publication Number Publication Date
KR20070050443A (en) 2007-05-15
KR100987832B1 (en) 2010-10-13

Family

ID=36142947

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020077003839A KR100987832B1 (en) 2004-08-17 2005-08-16 System, apparatus and method for managing predictions of various access types to a memory associated with cache memory

Country Status (4)

Country Link
JP (1) JP5059609B2 (en)
KR (1) KR100987832B1 (en)
TW (1) TWI348097B (en)
WO (1) WO2006038991A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020046517A1 (en) * 2018-08-30 2020-03-05 Micron Technology, Inc. Asynchronous forward caching memory systems and methods

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7636813B2 (en) * 2006-05-22 2009-12-22 International Business Machines Corporation Systems and methods for providing remote pre-fetch buffers
JP6252348B2 (en) * 2014-05-14 2017-12-27 富士通株式会社 Arithmetic processing device and control method of arithmetic processing device
US9817764B2 (en) 2014-12-14 2017-11-14 Via Alliance Semiconductor Co., Ltd Multiple data prefetchers that defer to one another based on prefetch effectiveness by memory access type
KR101757098B1 (en) * 2014-12-14 2017-07-26 비아 얼라이언스 세미컨덕터 씨오., 엘티디. Prefetching with level of aggressiveness based on effectiveness by memory access type
JP2017072929A (en) 2015-10-06 2017-04-13 富士通株式会社 Data management program, data management device, and data management method
US10509726B2 (en) * 2015-12-20 2019-12-17 Intel Corporation Instructions and logic for load-indices-and-prefetch-scatters operations
US20170177349A1 (en) * 2015-12-21 2017-06-22 Intel Corporation Instructions and Logic for Load-Indices-and-Prefetch-Gathers Operations
KR102142498B1 (en) 2018-10-05 2020-08-10 성균관대학교산학협력단 GPU memory controller for GPU prefetching through static analysis and method of control

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5561782A (en) 1994-06-30 1996-10-01 Intel Corporation Pipelined cache system having low effective latency for nonsequential accesses
US5623608A (en) 1994-11-14 1997-04-22 International Business Machines Corporation Method and apparatus for adaptive circular predictive buffer management
US6789171B2 (en) * 2002-05-31 2004-09-07 Veritas Operating Corporation Computer system implementing a multi-threaded stride prediction read ahead algorithm

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06103169A (en) * 1992-09-18 1994-04-15 Nec Corp Read data prefetching mechanism for central arithmetic processor
US5426764A (en) * 1993-08-24 1995-06-20 Ryan; Charles P. Cache miss prediction apparatus with priority encoder for multiple prediction matches and method therefor
JP3741945B2 (en) * 1999-09-30 2006-02-01 富士通株式会社 Instruction fetch control device


Also Published As

Publication number Publication date
KR20070050443A (en) 2007-05-15
TW200619937A (en) 2006-06-16
JP2008510258A (en) 2008-04-03
JP5059609B2 (en) 2012-10-24
TWI348097B (en) 2011-09-01
WO2006038991A2 (en) 2006-04-13
WO2006038991A3 (en) 2006-08-03

Similar Documents

Publication Publication Date Title
US8880807B2 (en) Bounding box prefetcher
US9524164B2 (en) Specialized memory disambiguation mechanisms for different memory read access types
US10474584B2 (en) Storing cache metadata separately from integrated circuit containing cache controller
US9720839B2 (en) Systems and methods for supporting a plurality of load and store accesses of a cache
KR100227278B1 (en) Cache control unit
US6684296B2 (en) Source controlled cache allocation
US8347039B2 (en) Programmable stream prefetch with resource optimization
US8271736B2 (en) Data block frequency map dependent caching
US6460114B1 (en) Storing a flushed cache line in a memory buffer of a controller
US5553305A (en) System for synchronizing execution by a processing element of threads within a process using a state indicator
US6675280B2 (en) Method and apparatus for identifying candidate virtual addresses in a content-aware prefetcher
DE69816044T2 (en) Timeline based cache storage and replacement techniques
US8645631B2 (en) Combined L2 cache and L1D cache prefetcher
US7133981B2 (en) Prioritized bus request scheduling mechanism for processing devices
US7461209B2 (en) Transient cache storage with discard function for disposable data
KR100240911B1 (en) Progressive data cache
US6839816B2 (en) Shared cache line update mechanism
US7493452B2 (en) Method to efficiently prefetch and batch compiler-assisted software cache accesses
US6496902B1 (en) Vector and scalar data cache for a vector multiprocessor
JP3618385B2 (en) Method and system for buffering data
US7904661B2 (en) Data stream prefetching in a microprocessor
US6578130B2 (en) Programmable data prefetch pacing
US7739477B2 (en) Multiple page size address translation incorporating page size prediction
CA1322058C (en) Multi-processor computer systems having shared memory and private cache memories
US9798590B2 (en) Post-retire scheme for tracking tentative accesses during transactional execution

Legal Events

Date Code Title Description
A201 Request for examination
E701 Decision to grant or registration of patent right
GRNT Written decision to grant
FPAY Annual fee payment

Payment date: 20130926

Year of fee payment: 4

FPAY Annual fee payment

Payment date: 20140923

Year of fee payment: 5

FPAY Annual fee payment

Payment date: 20181001

Year of fee payment: 9

FPAY Annual fee payment

Payment date: 20191001

Year of fee payment: 10