WO2023129126A1

WO2023129126A1 - Machine learning for stride predictor for memory prefetch

Info

Publication number: WO2023129126A1
Application number: PCT/US2021/065275
Authority: WO
Inventors: Sang Wook Do
Original assignee: Futurewei Technologies, Inc.
Priority date: 2021-12-28
Filing date: 2021-12-28
Publication date: 2023-07-06

Abstract

An apparatus includes a processor core, a memory hierarchy, and a prefetcher configured to track recent histories for strides. The recent history for a respective stride specifies whether the respective stride occurred in respective time slots as a result of the processor core requesting a memory block from the memory hierarchy. The prefetcher is configured to train a supervised machine learning algorithm model to predict for each respective stride whether the respective stride will occur in a next time slot. The training for each respective stride is based on the recent histories for the respective stride. The prefetcher is configured to apply the model to predict, for each of the respective strides, whether the stride will occur in a next time slot from a current time slot. The prefetcher is configured to prefetch memory blocks from the memory hierarchy for strides for which the prediction is above a threshold.

Description

MACHINE LEARNING FOR STRIDE PREDICTOR FOR MEMORY PREFETCH

FIELD

[0001] The disclosure generally relates to prefetching into caches in computing systems.

BACKGROUND

[0002] A computing system may use a cache memory to improve computing performance. For instance, a computing system may store data and/or instructions that it needs to access more frequently in a smaller, faster cache memory instead of storing the data and/or instructions in a slower, larger memory (e.g., a main memory unit). The term “information” will be used herein to refer to “data and/or instructions”. Accordingly, the computing system is able to access the information quicker, which can reduce the latency of memory accesses.

[0003] A computing system may have a hierarchy of caches that are ordered in what are referred to herein as cache levels. Typically, the cache levels are numbered from a highest level cache to a lowest level cache. There may be two, three, four, or even more levels of cache in the cache hierarchy. Herein, a convention is used to refer to the highest-level cache with the lowest number, with progressively lower levels receiving progressively higher numbers. For example, the highest-level cache in the hierarchy may be referred to as cache level 1 (L1 ). Here, the lower level cache levels may be referred to as L2, L3, L4, etc. Cache level 1 (L1 ) is typically a small, fast cache near the processor. The lowest level cache is typically referred to as a last level cache (LLC). The computer system also has main memory, with the main memory being at a lower level of a memory hierarchy that includes the main memory and the caches. [0004] When a processor needs information (referred to as target information), the processor typically requests the target information from the highest level cache (e.g., L1 ). If the target information is not in a cache, this is referred to as a cache miss. In the event of a cache miss, the next level cache is typically examined to determine if the target information is at the next level cache. This process is typically repeated until the lowest level cache is searched for the target information. If none of the caches have the target information, then the target information is accessed from main memory. Cache misses will thus degrade performance due to the processor core being idle while the information is being accessed.

[0005] Prefetching is a technique in which instructions and/or data is fetched prior to a request from the processor core. Thus, instruction prefetching will fetch instructions before they need to be executed by the processor core. Data prefetching will fetch data before it is needed by the processor core. The instructions and/or data are prefetched from a lower level of the memory hierarchy to a higher level of the memory hierarchy. For example, the instructions and/or data may be prefetched from main memory to an L2 cache or an L1 cache. However, the instructions and/or data could be prefetched from a lower cache level to a higher cache level. In some cases, the prefetched instructions and/or data could be stored into a prefetch buffer.

[0006] A stride prefetcher is one type of prefetcher. The stride refers to the gap between the memory addresses of two memory blocks requested in succession by the processor core. Each memory block has a unique memory address. A stride prefetcher will predict whether the processor core will request a memory block at a certain stride (gap) from the memory address of the memory block currently requested by the processor core. Some conventional stride pre-fetchers will predict the next stride based on a recent pattern of strides. The recent history indicates what strides recently occurred. For example, a sequence of 2, 3, 1 , 3, 1 , 2 indicates what strides just occurred. A problem with conventional stride pre-fetchers is that they need to store an extremely large number of such stride sequences, which wastes memory. BRIEF SUMMARY

[0007] According to one aspect of the present disclosure, there is provided an apparatus comprising an apparatus for prefetching memory blocks. The apparatus comprises a processor core and a memory hierarchy comprising main memory and one or more caches coupled between the main memory and the processor core. The memory hierarchy is configured to store memory blocks. The apparatus also comprises a prefetcher configured to track a plurality of recent histories for each of a plurality of strides. The recent history for a respective stride specifies, for each respective time slot of a plurality of recent time slots, whether the respective stride occurred in the respective time slot as a result of the processor core requesting a memory block from the memory hierarchy. The prefetcher is configured to train a supervised machine learning algorithm model to predict for each respective stride whether the respective stride will occur in a next time slot from a current time slot. The training for each respective stride is based on the plurality for recent histories for the respective stride. The prefetcher is configured to apply the model to predict, for each of the respective strides, whether the stride will occur in a next time slot from a current time slot. The prefetcher is configured to prefetch memory blocks from the memory hierarchy for strides for which the prediction is above a threshold.

[0008] Optionally, in any of the preceding aspects, the prefetcher is further configured to learn weights for each respective stride based on the plurality of recent histones for each respective stride.

[0009] Optionally, in any of the preceding aspects, the prefetcher is further configured to predict whether the respective stride will occur in a next time slot from a current time slot based on the weights for the respective stride and the recent history for the current time slot for the respective stride.

[0010] Optionally, in any of the preceding aspects, the apparatus further comprises further comprising a stride table. The prefetcher is configured to store into the stride table, for each of the plurality of strides, a recent history vector for each stride that indicates the recent history for a respective stride. The prefetcher is configured to update the recent history vector for each stride each time slot. [0011] Optionally, in any of the preceding aspects, the recent history vector for each stride comprises a plurality of bits. Each bit corresponds to a different time slot in a series of recent time slots in which the processor core requested a memory block from the memory hierarchy. The bit for each respective time slot indicates whether the stride occurred in the respective time slot.

[0012] Optionally, in any of the preceding aspects, the prefetcher is further configured to shift the bits in the recent history vector for a respective stride with each new request from the processor core of a memory block to update the recent history vector for the respective stride in the stride table.

[0013] Optionally, in any of the preceding aspects, the prefetcher is further configured to store in the stride table, for each of the plurality of strides, a weight vector for each stride. The weight vector for a respective stride comprises a weight for each bit in the recent history vector for the respective stride.

[0014] Optionally, in any of the preceding aspects, the prefetcher is further configured to compute, for each respective stride, a vector dot product between the recent history vector for the respective stride and the weight vector for the respective stride. The prefetcher is further configured to predict whether the respective stride will occur in a next time slot from the current time slot based on the vector dot product for the respective stride.

[0015] Optionally, in any of the preceding aspects, the prefetcher is further configured to update the weight vector for a respective stride in response to a metric based on the vector dot product having an absolute value less than a training threshold.

[0016] Optionally, in any of the preceding aspects, the prefetcher is further configured to update the weights for the weight vector for each respective stride to improve the respective predictions of whether the respective stride will occur in a next time slot from a current time slot.

[0017] Optionally, in any of the preceding aspects, the prefetcher is further configured to update the weight vector for a respective stride in response to a previous prediction of whether the stride will occur in a next time slot from a time slot in which the prediction was made being incorrect. [0018] Optionally, in any of the preceding aspects, the prefetcher is further configured to maintain a hit count in the stride table to track an occurrence frequency of each stride.

[0019] Optionally, in any of the preceding aspects, the prefetcher is further configured to determine whether a current stride between a memory address of a memory block requested by the processor core in a current time slot and a memory address of a memory block requested by the processor core in an immediately previous time slot is in the stride table. The prefetcher is further configured to replace an entry for a stride in the stride table with an entry for the current stride responsive to a determination that the stride table does not have an entry for the current stride.

[0020] Optionally, in any of the preceding aspects, the prefetcher is further configured to determine which entry to replace in the stride table based on the hit count of the strides in the stride table.

[0021] Optionally, in any of the preceding aspects, the supervised machine learning algorithm model comprises a perceptron model.

[0022] According to one other aspect of the present disclosure, there is provided a method of prefetching memory blocks. The method comprises tracking, in a stride prediction table, a plurality of recent histories for each of a plurality of strides. The recent history for a respective stride specifies, for each respective time slot of a plurality of recent time slots, whether the respective stride occurred in the respective time slot as a result of a processor core requesting a memory block from a memory hierarchy comprising main memory and one or more caches between the main memory and the processor core. The method comprises training a supervised machine learning algorithm model to predict, for each of the plurality of strides, whether the respective stride will occur in a next time slot from a current time slot. The training for each respective stride is based on the plurality for recent histories for the respective stride. The method comprises applying the model to predict, for each of the plurality of strides, whether the stride will occur in a next time slot from a current time slot. The method comprises prefetching memory blocks from the memory hierarchy for strides for which the prediction is above a threshold. [0023] According to still one other aspect of the present disclosure, there is provided a computer system comprising a processor core and a memory hierarchy comprising main memory and one or more caches between the main memory and the processor core. The computer system comprises a stride table and a stride prefetcher in communication with the memory hierarchy and the stride table. The stride prefetcher is configured to track a plurality of recent history vectors for each of a plurality of strides. The recent history vector for a respective stride comprises a plurality of bits. Each bit corresponds to a different time slot in a series of recent time slots in which the processor core requested a memory block from the memory hierarchy. The bit for each respective time slot indicates whether the stride occurred in the respective time slot. The stride prefetcher is configured to train a binary classifier to predict for each respective stride whether the processor core will in a next time slot request a memory block in the memory hierarchy at a memory address obtained by adding the respective stride to a memory address requested by the processor core in a current time slot. The training for each respective stride occurs in response to memory requests from the processor core and is based on the plurality of recent history vectors for the respective stride, with each recent history vector corresponding to one of the memory requests. The stride prefetcher is configured apply the binary classifier to predict, for each of the respective strides, whether the processor core’s memory request in a next time slot will be at a memory address obtained by adding the respective stride to a memory address requested by the processor core in a current time slot. The stride prefetcher is configured to prefetch the memory hierarchy for strides for which the prediction is above a threshold.

[0024] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter. The claimed subject matter is not limited to implementations that solve any or all disadvantages noted in the Background. BRIEF DESCRIPTION OF THE DRAWINGS

[0025] Aspects of the present disclosure are illustrated by way of example and are not limited by the accompanying figures (FIGs) for which like references indicate elements.

[0026] FIG. 1 A depicts an example of a memory system.

[0027] FIG. 1 B depicts an embodiment of a memory system with an MLA stride predictor cache prefetcher.

[0028] FIG. 2 depicts one embodiment of an MLA stride predictor cache prefetcher.

[0029] FIGs. 3A, 3B, and 3C depict one embodiment of a stride prediction table.

[0030] FIG. 4 is a flowchart of one embodiment of a process of cache prefetching using MLA based stride prediction.

[0031] FIG. 5 is a flowchart of one embodiment of a process of cache prefetching using MLA based stride prediction.

[0032] FIG. 6 is a flowchart of one embodiment of a process of training an MLA model.

[0033] FIG. 7 is a flowchart of one embodiment of a process of predicting whether a stride will occur in a next time slot.

[0034] FIG. 8 depicts components of an embodiment of an MLA stride prefetch circuitry that may be used to calculate strides and memory addresses to be prefetched.

[0035] FIG. 9 is a flowchart of one embodiment of a process of calculating a memory address to be prefetched.

[0036] FIG. 10 is a flowchart of one embodiment of processing a stride when the processor core makes a memory request.

DETAILED DESCRIPTION

[0037] The present disclosure will now be described with reference to the figures, which in general relate to cache prefetching. Information (e.g., data and/or instructions) may be cached and retrieved from cache in units of a memory block. A memory block is a basic unit of storage in a memory hierarchy. The memory block may also be referred to as a cache block or as a cache line. Herein, a “cache prefetch” is defined as a fetch of one or more memory blocks from its current location in a memory hierarchy into a cache at a higher level in the memory hierarchy prior to a demand from a processor core for the memory block. The term “prefetch” may be used herein instead of “cache prefetch” for brevity. The current location in the memory hierarchy refers to the highest level in the memory hierarchy at which the memory block currently resides (where higher means closer to a processor core and lower is further from a processor core).

[0038] Aspects of the present technology use a supervised machine learning algorithm (MLA) to predict the next memory request that will be made by the processor core. A supervised MLA is an algorithm that maps an input object to an output value based on example input object I output value pairs. The example input object I output value pairs are sometimes referred to as labeled training data. The input object may be a vector, which may be referred to as an input feature vector. The training process learns to predict the output value based on the labeled data. For example, the training process learns weights for one or more weight vectors. The weights determine how much influence the input will have on the output. Learning the weights for a weight vector will thus assign, adjust, and/or update one or more weights in the weight vector. In some techniques, the output is the vector dot product of the input feature vector by the weight vector. In one embodiment, the supervised MLA includes a binary classifier. In one embodiment, the supervised MLA includes a perceptron model.

[0039] Thus, the next stride may be predicted and a prefetch may be based on the prediction of the stride. The term “time slot” is used herein to refer to the period in which the processor core requests a memory block. The processor core requests one memory block each time slot, wherein the time slots can be of any length. Thus, there is a one-to-one correspondence between time slots and memory requests.

[0040] In one embodiment, a stride prediction table is maintained that has a row for each different stride being tracked. In one embodiment, the stride prediction table includes a recent history vector for each stride. The recent history vector indicates, for each of a number of recent time slots, whether the stride occurred. The recent history vector in the table may serve as the “input feature vector” of the MLA. Weights for a weight vector for each stride are learned. Learning the weights will assign, adjust and/or update a value for one or more weights in the weight vector. In one embodiment, a prediction of whether a stride will occur in a next time slot is based on a vector dot product of the present history vector and the weight vector.

[0041] It is understood that the present embodiments of the disclosure may be implemented in many different forms and that claim scope should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided to convey the inventive embodiment concepts to those skilled in the art. Indeed, the disclosure is intended to cover alternatives, modifications, and equivalents of these embodiments, which are included within the scope and spirit of the disclosure as defined by the appended claims. Furthermore, in the following detailed description of the present embodiments of the disclosure, numerous specific details are set forth in order to provide a thorough understanding. However, it will be clear to those of ordinary skill in the art that the present embodiments of the disclosure may be practiced without such specific details.

[0042] Figure 1A is a conceptual illustration of an example of a processor core 102 connected to a memory hierarchy 106. Memory hierarchy 106 includes three levels of cache, level 1 (L1 ) cache 108, level 2 (L2) cache 110, and level 3 (L3) cache 112 connected to main memory 120. Other cache configurations including different numbers of caches may be used (e.g. one or two caches, or more than three caches) in a memory hierarchy. In general, caches that are closer to the processor core are smaller and faster to access than caches that are closer to the main memory. For example, L1 cache 108 is smaller and faster than L2 cache 110, which is smaller and faster than L3 cache 112, which is smaller and faster than main memory 120. Any cache in a memory hierarchy such as memory hierarchy 106 may hold prefetched data. Different criteria may be applied for prefetching in different caches.

[0043] The processor core 102 will request memory blocks. If a requested memory block is found in the L1 cache (“L1 cache hit”), the memory block is returned to the processor core 102 with a minimum delay, otherwise (“L1 cache miss”) searching for the memory block goes down deep into the memory hierarchy towards the bottom until the memory block is found. Every memory block is uniquely identified by a memory address. The size of a memory block is typically 8, 16, 32 or 64 bytes, depending on the underlying implementation.

[0044] A cache prefetcher may prefetch data and/or instructions. For example, a data prefetcher may predict data for future address requests by observing load and store traffic of load and store instructions and bringing the data closer to the core ahead of its future demands. In an embodiment, the cache prefetcher prefetches memory blocks.

[0045] Cache prefetching may be made based on a likelihood that there will be an upcoming demand or need for the memory block from a processor core. If a request for the memory block is received, the memory block can be accessed much faster from the cache than if the memory block were still in main memory, or at a lower level of the memory hierarchy. However, it is possible that a prefetched memory block will not be demanded by the processor core, which means that the space in the higher-level cache is not used efficiently. Prefetching the memory block may result in an eviction of a victim memory block from cache storage. If there is a demand for the victim memory block but not for the prefetched memory block, performance may be degraded. Also, the prefetch requests use bandwidth in the memory hierarchy. Hence, the prefetch requests that do not lead to a demand for a prefetched memory block may waste resources including cache capacity and bandwidth in the memory hierarchy and may impact performance of a memory system. Embodiments of an MLA stride predictor for cache prefetching are able to accurately predict the address of a memory block that the processor core 102 will request in a next time slot. Embodiments of an MLA stride predictor for cache prefetching make efficient use of bandwidth in the memory hierarchy 106.

[0046] FIG. 1 B depicts one embodiment of a system (or apparatus) 100 that includes a memory hierarchy 106 coupled to processor core 102. Memory hierarchy 106 includes a multi-level cache storage 126, main memory 120, cache controllers 140, and a main memory controller 122.

[0047] Processor core 102 may also be referred to as a central processing unit (CPU). The multi-level cache storage 126 and main memory 120 may store data and/or program instructions, which are provided to processor core 102 in response to a demand or request from processor core 102, which may execute the program instructions. A “program instruction” may be defined as an instruction that is executable on a processor (e.g., microprocessor or CPU). A processor core 102 may have a program counter (PC) that contains a value that uniquely defines the program instruction. For example, during sequential execution of program instructions, the program counter may be incremented by one with execution of each program instruction. As is well understood by those of ordinary skill in the art, it is possible for the program instructions to be executed non-sequentially, such as with branch instructions. Thus, the value in the program counter could be increased or decreased by more than one.

[0048] The multi-level cache storage 126 includes multiple cache levels. For example, the multi-level cache storage 126 may include level 1 (L1 ) cache 214, level 2 (L2) cache 216, and last level cache (LLC) 218. There may be other levels of cache. For example, there may be additional cache(s) between the level 2 cache 216 and the LLC 218. In one embodiment, the L1 cache 214 is divided into an instruction cache for caching program instructions, and a data cache for caching program data.

[0049] In one embodiment, the level 1 cache 214 is on the same semiconductor die (e.g., chip) as processor core 102. In one embodiment, both the level 1 cache 214 and the level 2 cache are on the same semiconductor die (e.g., chip) as the processor core 102. A cache that is on the same semiconductor die (e.g., chip) as the processor core 102 may be referred to as internal cache. Alternatively, the L2 cache 216 could be external to the semiconductor die that contains the processor core 102.

[0050] In one embodiment, the LLC 218 is an external cache, by which it is meant that the cache is external to the semiconductor die that contains the processor core 102. In one embodiment, the LLC 218 is implemented using eDRAM. There may be more than one external cache. For example, there could be a level 3 (L3) and a level 4 (L4) cache.

[0051] Some, or all, of the caches may be private caches, by which it is meant that the caches are only accessible by the processor core 102. In one embodiment, the L1 cache 214 is a private cache. In one embodiment, both the L1 cache 214 and the L2 cache 216 are private caches. The LLC 218 could in some cases be private cache. Alternatively, some, or all, of the caches may be shared caches, by which it is meant that the caches are shared by the processor core 102 and another processor core. For example, the LLC 218 could be a shared cache.

[0052] The cache controllers 140 control reads and writes of the multi-level cache storage 126. In one embodiment, there is a cache controller for each cache level. Each cache controller may be responsible for managing a cache in the multi-level cache storage 126. For example, when a cache controller receives a request for a cache line, it checks the address of the cache line to determine whether the cache line is in the cache. If the cache line is in the cache, the cache line may be read from the cache. If the cache line is not in the cache (referred to as a cache miss), the cache controller sends a request to a lower level cache (i.e. , a cache closer to main memory 120), or to main memory if there is not a cache closer to main memory 120. Cache controllers may be implemented using dedicated hardware, using one or more processors configured by firmware, or a combination of dedicated hardware and firmware.

[0053] The cache controllers 140 include control circuits configured to transfer data between processor core 102 and multi-level cache storage 126 (e.g. perform read/write, load/store, or other memory access operations), including prefetching memory blocks. The main memory controller 122 control reads and writes of the main memory 120. The cache controllers 140 and main memory controller 122 may be implemented as dedicated control circuits (e.g. in an Application Specific Control Circuit or ASIC), programmable logic circuits, using a processor configured by firmware to perform memory controller functions, or otherwise.

[0054] The load store unit 131 is responsible for executing all load and store instructions. A load store unit may be implemented in a processor core (e.g. load store unit 131 ) or externally (e.g. in memory hierarchy 106 or some combination (e.g. partially implemented in a processor core and partially implemented by components outside the processor core). The present technology is not limited to any particular load store location or configuration. The load store unit 131 provides data transfer between storage in the memory hierarchy 106 (e.g., multi-level cache storage 126, main memory 120) and registers in the processor core 102). [0055] In one embodiment, the processor core 102 sends demands (requests) to the memory hierarchy 106 for target memory blocks. These demands may occur in response to the processor core 102 executing a program instruction such as, but not limited to, a load instruction. In one embodiment, the demands are sent to the load store unit (LSU) 131. Progressively lower levels of the multi-level cache storage 126 may be searched for the target memory block. If the target memory block is not found at any level of the multi-level cache storage 126, then the main memory 120 is searched. If the target memory block is not found in the memory hierarchy 106, another memory such as a solid state drive (or hard disk drive) may be searched for the target memory block. The amount of time it takes to provide the memory block to the processor core 102 increases greatly with each further level that is searched. In one embodiment, the target memory block, once located, is cached at a highest level of the multi-level cache storage 126 (such as an L1 cache) because, in general, a memory block demanded by the processor core 102 may be demanded again in the near future. However, due to the limited space in the highest level cache, if there is no available space in the highest level cache for the target memory block, an existing memory block is chosen as a “victim” and is then evicted out of the highest level cache to make a room for the target memory block. In one embodiment, the process of evicting and replacing the victim memory block and the caching of the target memory block is based on a replacement algorithm. In some cases, the memory block is prefetched to a level other than the highest level, such as an L2 cache. If the target memory block is found in multi-level cache storage 126, the processor core 102 experiences a smaller delay than if the target memory block is in the main memory 120. If the target memory block is in a cache level that is very close to the processor core 102, the delay may be only one cycle execution time.

[0056] The machine learning algorithm (MLA) stride predictor prefetch circuitry 150 (“prefetch circuitry”) is configured to prefetch memory blocks. The present technology is not limited to any particular location or configuration of prefetch circuitry 150. The prefetch circuitry 150 may be implemented in processor core 102 or outside of the processor core 102 or some combination (e.g. partially implemented in the processor core 102 and partially implemented by components outside the processor core). In some embodiments, prefetch circuitry 150 comprise control logic, arithmetic logic, and storage. The arithmetic logic may be used for operations such as determining a next prefetch address. In an embodiment, the prefetch circuitry 150 include hardware prefetchers that are implemented in hardware and may include additional control circuits associated with the hardware prefetchers that may be implemented in hardware or firmware. In one embodiment, prefetch circuitry 150 is implemented using discrete circuitry. For example, prefetch circuitry 150 may be implemented using discrete logic, which may include but is not limited to NAND gates and/or NOR gates. Prefetch circuitry 150 may significantly reduce delays in providing target memory blocks by prefetching memory blocks. Prefetches are in addition to the demands for memory blocks made by the processor core 102. While Figure 1 B shows prefetch circuitry 150 as separate to LSU 131 , in some examples, some portion of prefetch circuitry 150 may be implemented in LSU 131.

[0057] Each prefetch request at one point in time is for a different memory block. As one example, if a request is received from processor core 102 for a memory block at memory address X, then prefetch circuitry 150 might issue prefetch requests for memory blocks at memory addresses X + a and X + [3. The terms a and [3 may be any integers (positive or negative) and may each be referred to as a stride or offset. In this example, there is a likelihood that subsequent requests will be for the subsequent memory blocks as indicated by these memory addresses. The number of such prefetch requests that are issued by the prefetch circuitry 150 is referred to as the degree of prefetch (or “prefetch degree”). Prefetching generally prefetches memory blocks with an address ahead of a current address and the distance refers to the number of addresses between the address of the prefetched memory block and the current address (i.e. , how far ahead prefetching is performed).

[0058] FIG. 2 is a diagram of one embodiment of an MLA stride predictor prefetcher 150. The prefetch circuitry 150 inputs addresses of memory blocks requested by the processor core 102. The prefetch circuitry 150 outputs prefetch requests, which prefetch memory blocks from somewhere in the memory hierarchy 106. The memory hierarchy 106 may also be referred to as a memory system. The prefetch circuitry 150 has a stride prediction table 250, stride prefetch training circuit 252, stride prefetch prediction circuit 254, prefetch address buffer 258, and prefetch circuit 256. [0059] The prefetch circuitry 150 maintains the stride prediction table 250. In an embodiment, the stride prediction table 250 has a recent history vector and a weight vector for each stride. In an embodiment, the recent history vector for a stride serves as an input feature vector for an MLA. In an embodiment, the weight vector for each stride serves as a weight vector for the MLA.

[0060] The recent history vector for each stride in the stride prediction table 250 is used to track a recent history for the stride. Each recent history corresponds to a number of recent time slots. Each time slot corresponds to the processor core 102 requesting a memory block. The recent history for each respective stride specifies, for each recent time slot, whether the stride occurred in the time slot as a result of the processor core 102 requesting a memory block from the memory hierarchy 106.

[0061] The stride prefetch training circuit 252 updates entries in the stride prediction table 250. With each new memory request made by the processor core 102, the stride prefetch training circuit 252 updates the recent history vectors and the weights. Updating the entries serves to train a supervised MLA model to predict for each respective stride whether the respective stride will occur in a next time slot from a current time slot.

[0062] The stride prefetch prediction circuit 254 predicts, for each respective stride in the stride prediction table 250, whether the stride will occur in a next time slot from a current time slot. In other words, the stride prefetch prediction circuit 254 applies the MLA model to predict whether the processor core’s next memory request will be for a memory block having an address that is any of the various strides from the address of the currently requested memory block.

[0063] Herein, the phrase, “a stride occurring in a current time slot” or the like means that in the current time slot the processor core 102 requested a memory block in the memory hierarchy 106 at a memory address that is the respective stride from a memory address of a memory block requested by the processor core 102 in the time slot immediately prior to the current time slot.

[0064] Herein, the phrase, “predict that a stride will occur in a next time slot from a current time slot” or the like means to predict that in the next time slot the processor core 102 will request a memory block in the memory hierarchy 106 at a memory address that is the respective stride from a memory address of a memory block requested by the processor core in the current time slot.

[0065] The prefetch address buffer 258 is used to store addresses of memory blocks that to be prefetched. The stride prefetch prediction circuit 254 stores such addresses based on the stride predictions.

[0066] The prefetch circuit 256 is configured to issue prefetch requests based on the entries in the prefetch address buffer 258. A prefetch request will prefetch a memory block.

[0067] FIG. 3A depicts one embodiment of a stride prediction table 250. The stride prediction table 250 is depicted with entries for “m” different strides 302. The number of entries may increase or decrease over time. For example, if a stride that was not previously being tracked occurs, an entry for the new stride may be added to the stride prediction table 250. In an embodiment, there is a limit to the number of stride entries. In this case, if a new stride is to be tracked, the entry for another stride may be removed from the stride prediction table 250. However, the number of entries may be relatively small compared to a conventional stride prefetching techniques that tracks a large number of possible stride sequences. Therefore, embodiments of stride prediction make efficient use of memory. FIG. 3B has an example in which three strides are being tracked. The strides have values of 1 , 2, and 4.

[0068] Referring again to FIG. 3A, there is a recent history vector 304 for each stride entry. The recent history vector 304 tracks the history over “n” recent time slots. The recent history vector 304 indicates, for each recent time slot, whether the stride occurred in that time slot. In an embodiment, the recent history vector 304 for each stride comprises a number of bits, with each bit corresponding to a different time slot in a series of recent time slots in which the processor core 102 made a memory request. In Figure 3A, the recent history vector 304 for stride 1 has a format of H1 NH1 N- 1...H11, where H1 i indicates whether the stride occurred in the most recent time slot. Figure 3B shows an example in which the strides are being tracked for five recent time slots. A value of“1” indicates that the stride occurred in that time slot, whereas a value of “0” indicates that the stride did not occur in that time slot. The value of “11001” for stride 1 indicates stride 1 occurred in three recent time slots. Consistent with the example in FIG. 3A, the rightmost bit is for the most recent time slot, but the leftmost bit could instead be used for the most recent time slot. The value of “00100” for stride 2 indicates stride 2 occurred once in the recent time slots. Likewise, the value of “00010” for stride 3 indicates stride 3 occurred once in the recent time slots. In an embodiment, each time that the processor core 102 requests a memory block, the recent history vectors 304 are updated. In an embodiment, this includes dropping the “oldest” or leftmost bit, shifting the remaining bits left, and adding a new rightmost bit to indicate whether the stride just occurred. FIG. 3C depicts an example in which the recent history vectors 304 are updated for a case in which stride 4 has just occurred.

[0069] Referring again to FIG. 3A, there is a weight vector 306 for each stride entry. In the example of FIG. 3A, there are n+1 values in the weight vector. The “nth” value in the weight vector 306 for a stride corresponds to the “nth” entry in the recent history vector 304 for that stride. Each weight vector 306 has an additional entry with a zero subscript (e.g., W1o for stride 1 ). A corresponding entry with a zero subscript for the recent history vector is not depicted. For example, the recent history vector 304 for stride 1 may have an H1 o element for purposes of calculation. The zero subscript element for the recent history vector may be set to 1 for biased input to the MLA. Thus, the recent history vector 304, in one embodiment, has a zero subscript element that always keeps the value of 1 .

[0070] The stride prediction table 250 contains a stride prediction 308 for each stride. The stride prediction is based on the current values of the recent history vector 304 and the weight vector 306. In one embodiment, the prediction is based on a vector dot product. In an embodiment, the prediction is whether the respective stride will occur in a next time slot from a current time slot. Stated another way, the prediction is whether the processor core 102 will, in a next time slot, request a memory block in the memory hierarchy 106 at a memory address obtained by adding the respective stride to a memory address requested by the processor core in the current time slot.

[0071] The stride prediction table 250 contains a hit count 310 for each stride. The hit count tracks how many times the stride has occurred. The hit count may also be referred to as an occurrence frequency. This tracking goes beyond the time slots in the recent history vector 304. In an embodiment, the hit count 310 is used to determine which entry to remove from the stride prediction table 250 when a new entry is added. [0072] FIG. 4 is a flowchart of one embodiment of a process 400 of cache prefetching using MLA based stride prediction. In an embodiment, the process 400 is performed by the prefetch circuitry 150.

[0073] Step 402 includes tracking a number of recent histones for different strides. The recent history for a stride specifies for each recent time slot whether the stride occurred in the time slot. In an embodiment, the recent histones are tracked in the stride prediction table 250. In an embodiment, the recent history for each stride includes a recent history vector 304. FIG. 3A - 3C depict recent history vectors 304 being tracked in an embodiment of a stride prediction table 250. In step 402, the tracking for a particular stride refers to a number of the recent history vectors 304 for that stride. In the example in FIGs. 3A - 3C, only one recent history is stored for each stride at one time.

[0074] Step 404 includes training a supervised MLA model to predict, for each stride, whether the respective stride will occur in a next time slot. In an embodiment, the MLA model includes a binary classifier. In an embodiment, the MLA model includes a perceptron model. The training is based on the recent histories that are tracked. Thus, the training for a particular stride is based on a set of the recent history vectors 304. In an embodiment, the set of the recent history vectors serve as training data for the MLA. In an embodiment, step 404 includes learning weights for a weight vector. The weights may be learned based on the recent histories (e.g., learning weights based on a set of the recent history vectors for a stride). In an embodiment, step 404 includes updating the weight vectors 306 in the stride prediction table 250. Updating a weight vector 306 may include updating a value for one or more of the weights in the weight vector.

[0075] Step 406 includes applying the MLA model to predict, for each stride, whether the stride will occur in a next time slot. In an embodiment, step 406 includes forming a vector dot product between a recent history vector 304 in the stride prediction table 250 and the weight vector 306 for each stride. In an embodiment, a current recent history vector 304 in the stride prediction table 250 serves as an input feature vector for the MLA. [0076] Step 408 includes prefetching for strides having a prediction above a threshold. Prefetching for a stride means to prefetch a memory block at a memory address that is obtained by adding the stride to the memory address of the memory block that is currently requested by the processor core 102.

[0077] FIG. 5 is a flowchart of one embodiment of a process 500 of cache prefetching using MLA based stride prediction. In an embodiment, the process 500 is performed by the prefetch circuitry 150.

[0078] Step 502 includes the processor core 102 requesting a memory block. The memory block resides in the memory hierarchy 106. The memory block could reside in one or more of the caches in the multi-level cache 126. The address of the memory block is referred to as a “current memory address” for a “current time slot.”

[0079] Step 504 includes updating an MLA model that predicts, for a number of strides, whether the processor’s next memory request will be at the address obtained by adding the stride to the current memory address. The next memory request refers to the address of the memory block requested in the next time slot. The next time slot refers to the time slot that immediately follows the current time slot. In one embodiment, step 504 includes updating the weights for the weight vector of one or more of the strides. It is not required that the weights for the weight vector for a stride are updated each time that the processor core requests a memory block. Hence, in some cases, step 504 could be skipped for some or all of the strides. In one embodiment, the weights are updated based on whether a previous prediction for the stride was correct.

[0080] Step 506 includes applying the MLA model to predict, for each stride, whether the processor’s next memory request will be at the address obtained by adding the respective stride to the current memory address. With reference to stride prediction table 250 in FIG. 3A, each of the m strides in the table is analyzed. The analysis includes making a prediction based on the vector dot product of the recent history vector 304 and the weight vector 306 for each respective stride. In an embodiment, if the vector dot product is above a threshold, then the prediction is “yes.” The stride prediction 308 in stride prediction table 250 may be updated for each stride. [0081] Step 508 includes a determination of whether to prefetch for any strides. In an embodiment, prefetching is initiated for any stride having a prediction of yes in stride prediction table 250. Step 510 includes prefetching for strides having a prediction above a threshold. Prefetching for a memory block at a given stride may include a determination of whether the memory block already resides in a desired level of the cache hierarchy in which case it is not required that that the memory block be over-written.

[0082] After either step 508 or 510, the process 500 returns to step 502, which is the next time slot.

[0083] FIG. 6 is a flowchart of one embodiment of a process 600 of training an MLA model. Process 600 provides further details for an embodiment of step 504 of process 500. In an embodiment, the process 600 is performed by the prefetch circuitry 150.

[0084] Step 602 includes the processor core 102 requesting a memory block. The address of the memory block is referred to as a current memory address for a current time slot.

[0085] Step 604 includes setting a stride index to 1 . The stride index is used to point to a row in the stride prediction table 250.

[0086] Steps 606 and 608 are used to determine whether to update the weights for this stride. Updating the weights improves the ability to predict whether the stride will occur in a next time slot from a current time slot. Step 606 is a determination of whether the most recent prediction for this stride was incorrect. If the most recent prediction for this stride was incorrect, then the weights for this stride are updated (in step 610). Even if the most recent prediction for this stride was not incorrect, the weights may still be updated. Step 608 examines the absolute value of the vector dot product of the recent history vector and the weight vector. This vector dot product is referred to as “Y”, which serves as a metric upon which the prediction is made. If the absolute value of Y is below a training threshold, then the weights are updated in step 610. In some cases, the weights for a stride will not be updated. [0087] Table I contains example pseudocode for updating the weights in the stride prediction table 250.

_ Table I

Integer prediction = 0; occurrence = 0; Ho = 1 ;

Integer i;

If [ (CMA - LMA) + S ] then [ occurrence = -1] else [ occurrence = 1];

If Y > 0 then prediction = 1 ; else -1 ;

If [ { prediction + occurrence } or { |Y| < Training-Threshold (TT) } ] and [ S + 0] then

[ for ( i = 0 to N ) do { if (Hi = 0) then Hi = -1 ;

Wi = Wi + (occurrence X Hi);

}

[0088] In Table I, the “occurrence” variable indicates whether the stride has occurred in the current time slot, setting its value as 1 if it has and otherwise to -1. In an embodiment, this is done by comparing the stride (S) with a value derived from subtracting the memory address of the memory block requested in the previous time slot (Last Memory Address or LMA) from the memory address of the memory block currently being requested (Current Memory Address or CMA). The Y’ value, which was computed the last time slot, indicates what the prediction result was for the last time slot. That is, Y indicates whether the corresponding stride was predicted to occur or not. The value for Y is used to set the ‘prediction’ variable to 1 (the stride was predicted to occur) or -1 (the stride was not predicted to occur). Thus, note that the “prediction” in Table I is not a prediction of whether the stride will occur in a next time slot, but is for a previous prediction for the stride.

[0089] In Table I, “prediction + occurrence” may be used in step 606 as a test of whether the prediction was correct. In Table 1 , “ | Y| < Training-Threshold (TT)” may be used in step 608, where “TT” is the training threshold. TT may be used to limit the number of trainings and may be decided empirically. The number of bits used to represent a weight (including the sign bit) may also be limited by the TT value because the weights may be changed when the magnitude of Y is less than the TT value.

[0090] In Table I, the value “N” in the loop is the number of bits in the recent history vector. For example, in FIG. 3A, the recent history vectors 304 each have N bits. The step of “if (Hi = 0) then Hi = -1” is used such that all values in the stride history vector 304 will be either “1” (stride occurred in the relevant time slot) or “-1” (stride did not occur in the relevant time slot). Each time the calculation of “Wi = Wi + (occurrence X Hi)” is performed one of the weights in the weight vector for the stride is updated. Hence, this calculation may be performed in step 610.

[0091] Steps 612 and 614 are performed to update the recent history vector in the stride prediction table 250. Step 612 includes shifting the bits in the recent history vector to the left. The oldest or leftmost bit is eliminated. Step 614 includes adding a rightmost bit to the most recent history vector based on whether the stride occurred in the current time slot. The stride occurring in the current time slot means that address of the memory block requested in the last time slot plus the stride equals the address of the memory block requested in the current time slot. FIG. 3C depicts updating the recent history vectors relative to FIG. 3B.

[0092] Table II contains example pseudocode for updating the recent history vector in the stride prediction table 250.

_ Table II _

If [ (CMA - LMA) + S ] then [ {for ( i = N to 2 ) do ( Hi = Hi - 1 ) }; Hi = 0; ] else [ {for ( i = N to 2 ) do ( Hi = Hi - 1 ) }; Hi = 1 ; ]

[0093] Step 616 includes updating the hit count for this stride in stride prediction table 250, if needed. Table III contains example pseudocode for updating the hit count in the stride prediction table 250. The condition of testing whether the hit count (HC) is equal to MHT (Maximum Hit Count) is performed to prevent a stride from residing in the table for too long. For example, a stride could occur a great many times (e.g., 10,000) and then rarely occur again. In an embodiment, the hit count is reset to 1 if HC = MHT. There are other techniques that could be used to prevent a stride from residing in the table 250 for too long, such as having a stride table counter that increments with each memory request from the processor core. When the stride table counter reaches a threshold, then all of the hit counts may be reset to 1 .

_ Table III

If [ (CMA - LMA) = S ]

Then [HC = HC + 1 ;]

If [HC = MHT]

Then HC = 1 ;

[0094] Step 618 includes a determination of whether there are more strides in the stride prediction table 250. If so, the stride index is incremented in step 620. Then, the process performs steps 606 - 616 for the next stride. When all strides have been processed, the process ends.

[0095] FIG. 7 is a flowchart of one embodiment of a process 700 of predicting whether a stride will occur in a next time slot. Process 700 may be used in one embodiment of step 508 of process 500. Process 700 may be used in combination with process 600. In an embodiment, the process 700 is performed by the prefetch circuitry 150.

[0096] Step 702 includes the processor core 102 requesting a memory block. The address of the memory block is referred to as a current memory address for a current time slot.

[0097] Step 704 includes setting a stride index to 1. The stride index is used to point to a row in the stride prediction table 250.

[0098] Step 706 includes forming a dot product of a recent history vector for this stride and a weight vector for this stride. In an embodiment, step 706 is performed after process 600 is used to update the weights and update the recent history vector. In an embodiment, the recent history vector is stored in the stride prediction table 250 as a string of “1 s” (stride occurred in time slot) and “Os” (stride did not occur in the time slot. Each 0 in the recent history vector may be converted to -1 for purpose of the dot product calculation. Equation 1 shows an example of the vector dot product calculation.

Equation 1

[0099] In Equation 1 , the Ho represents a biased input and may be set to 1. Referring back to FIG. 3A, Ho is not depicted in the recent history vectors 304, as the value of Ho does not change.

[00100] Step 708 includes updating the prediction for this stride in the stride prediction table 250. In an embodiment, the value in table 1 is set to one of two values (e.g., 1 or 0) to indicate whether the stride is predicted to occur in the next time slot. The prediction may be based on the value of Y that was calculated in Equation 1. In one embodiment, if Y is greater than or equal to 0, then the stride is predicted to occur. If Y is negative then the stride is predicted to not occur. However, in one embodiment, the value of Y should be a positive value above some threshold in order for the prediction to be yes.

[00101] Step 710 includes a determination of whether there are more strides in the stride prediction table 250. If so, the stride index is incremented in step 712. Then, the process performs steps 706 - 708 for the next stride. When all strides have been processed, the process ends.

[00102] FIG. 8 depicts components of an embodiment of a prefetch circuitry 150 that may be used to calculate strides and memory addresses to be prefetched. A portion of the stride prediction table 250 is depicted. In particular, fields for the stride 302 and the stride prediction 308 are depicted. The ellipses in the stride prediction table 250 indicates that other fields (e.g., recent history vector, weight vector) are not depicted.

[00103] The current memory address register 802 is used to store the address of the memory block that is requested by the processor core 102 in the current time slot. The previous memory address register 804 is used to store the address of the memory block that was requested by the processor core 102 in the immediately previous time slot. The stride calculator 806 is configured to calculate the current stride based on the current and previous memory addresses. The stride calculator 806 stores the current stride in the current stride register 808. The MUX 810 is configured to select the value of one of the strides from stride prediction table 250 and input that value into the adder 812. The adder is configured to add the value of the selected stride to the current memory address and store the result into the prefetch address buffer 814. The prefetch address buffer 814 is configured to store one or more addresses to be prefetched.

[00104] FIG. 9 is a flowchart of one embodiment of a process 900 of calculating a memory address to be prefetched. Process 900 may be performed by prefetch circuitry 150. Step 902 includes setting a stride index to 1. The stride index is used to walk through the entries in the stride prediction table 250.

[00105] Step 904 is a determination of whether the prediction for this stride is greater than a threshold. This value may be accessed from the stride prediction 308 in the stride prediction table 250. As noted above, the value may be stored into the stride prediction table 250 during process 700. In some cases, the value in the stride prediction table 250 will be one of two values (e.g., 0, 1 ). In this case, a value a 1 may indicate that the prediction for this stride is greater than the threshold.

[00106] If the value is above the threshold, then in step 906 the stride is added to the address of a current memory request to form a prefetch address. In step 908, the prefetch address is added to the prefetch address buffer 814.

[00107] Step 910 includes a determination of whether there are more strides in the stride prediction table 250. If so, the stride index is incremented in step 912. Then, the process performs steps 906 - 908 for the next stride. When all strides have been processed, the process ends.

[00108] FIG. 10 is a flowchart of one embodiment of a process 1000 of processing a stride when the processor core makes a memory request. Process 1000 may be performed by prefetch circuitry 150. Step 1002 includes the processor core 102 requesting a memory block. In an embodiment, when the processor core requests a memory block the address that was in the current memory address register 802 is moved to the previous memory address register 804. After the movement, the current address is placed into the current memory address register 802. [00109] Step 1004 includes calculating a stride relative to the last memory request. In an embodiment, the stride calculator 806 subtracts the value in the previous memory address register 804 from the value in the current memory address register 802 and stores the result in the current stride register 808. The current stride may have a positive value or a negative value.

[00110] Step 1006 includes a determination of whether the stride resides in the stride prediction table 250. If it does, then in step 1008 the entry for this stride is updated in the stride prediction table 250. In step 1010, a prediction is made for this stride. If an entry for the stride does not exist in the stride prediction table 250 then, in step 1012, a determination is made as to whether the stride prediction table 250 is full. The stride prediction table 250 being full means that the maximum number of allowed entries already exist in the table. If the stride prediction table 250 is not full then, in step 1014, a new stride entry is added to the stride prediction table 250. If the stride prediction table 250 is full then, in step 1016, an entry in the stride prediction table 250 is replaced with an entry for the new stride. In one embodiment, the hit count 310 is used to determine which stride to replace. For example, the stride with the lowest hit count may be replaced.

[00111] The technology described herein can be implemented using hardware, software, or a combination of both hardware and software. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.

[00112] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

[00113] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Applicationspecific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/ storage devices, peripherals and/or communication interfaces.

[00114] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details. [00115] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[00116] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated.

[00117] For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

[00118] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims

-29- CLAIMS What is claimed is:

1 . An apparatus for prefetching memory blocks, the apparatus comprising: a processor core; a memory hierarchy comprising main memory and one or more caches coupled between the main memory and the processor core, the memory hierarchy configured to store memory blocks; and a prefetcher configured to: track a plurality of recent histories for each of a plurality of strides, wherein the recent history for a respective stride specifies, for each respective time slot of a plurality of recent time slots, whether the respective stride occurred in the respective time slot as a result of the processor core requesting a memory block from the memory hierarchy; train a supervised machine learning algorithm model to predict for each respective stride whether the respective stride will occur in a next time slot from a current time slot, wherein the training for each respective stride is based on the plurality for recent histories for the respective stride; apply the model to predict, for each of the respective strides, whether the stride will occur in a next time slot from a current time slot; and prefetch memory blocks from the memory hierarchy for strides for which the prediction is above a threshold.

2. The apparatus of claim 1 , wherein the prefetcher is further configured to: learn weights for each respective stride based on the plurality of recent histones for each respective stride.

3. The apparatus of claim 2, wherein the prefetcher is further configured to: predict whether the respective stride will occur in a next time slot from a current time slot based on the weights for the respective stride and the recent history for the current time slot for the respective stride. -30-

4. The apparatus of any of claims 1 to 3, further comprising a stride table, wherein the prefetcher is configured to store into the stride table, for each of the plurality of strides: a recent history vector for each stride that indicates the recent history for a respective stride; and update the recent history vector for each stride each time slot.

5. The apparatus of claim 4, wherein: the recent history vector for each stride comprises a plurality of bits, wherein each bit corresponds to a different time slot in a series of recent time slots in which the processor core requested a memory block from the memory hierarchy, wherein the bit for each respective time slot indicates whether the stride occurred in the respective time slot.

6. The apparatus of claim 4 or 5, wherein the prefetcher is further configured to: shift the bits in the recent history vector for a respective stride with each new request from the processor core of a memory block to update the recent history vector for the respective stride in the stride table.

7. The apparatus of any of claims 4 to 6, wherein the prefetcher is further configured to store in the stride table, for each of the plurality of strides: a weight vector for each stride, wherein the weight vector for a respective stride comprises a weight for each bit in the recent history vector for the respective stride.

8. The apparatus of claim 7, wherein the prefetcher is further configured to: compute, for each respective stride, a vector dot product between the recent history vector for the respective stride and the weight vector for the respective stride; and predict whether the respective stride will occur in a next time slot from the current time slot based on the vector dot product for the respective stride.

9. The apparatus of claim 7 or 8, wherein the prefetcher is further configured to: update the weight vector for a respective stride in response to a metric based on the vector dot product having an absolute value less than a training threshold.

10. The apparatus of any of claims 7 to 9, wherein the prefetcher is further configured to: update the weights for the weight vector for each respective stride to improve the respective predictions of whether the respective stride will occur in a next time slot from a current time slot.

11 . The apparatus of any of claims 7 to 10, wherein the prefetcher is further configured to: update the weight vector for a respective stride in response to a previous prediction of whether the stride will occur in a next time slot from a time slot in which the prediction was made being incorrect.

12. The apparatus of any of claims 1 to 11 , wherein the prefetcher is further configured to: maintain a hit count in the stride table to track an occurrence frequency of each stride.

13. The apparatus of any of claims 4 to 12, wherein the prefetcher is further configured to: determine whether a current stride between a memory address of a memory block requested by the processor core in a current time slot and a memory address of a memory block requested by the processor core in an immediately previous time slot is in the stride table; and replace an entry for a stride in the stride table with an entry for the current stride responsive to a determination that the stride table does not have an entry for the current stride.

14. The apparatus of claim 12 or 13, wherein the prefetcher is further configured to: determine which entry to replace in the stride table based on the hit count of the strides in the stride table.

15. The apparatus of any of claims 1 to 14, wherein the supervised machine learning algorithm model comprises a perceptron model.

16. A method of prefetching memory blocks, the method comprising: tracking, in a stride prediction table, a plurality of recent histories for each of a plurality of strides, wherein the recent history for a respective stride specifies, for each respective time slot of a plurality of recent time slots, whether the respective stride occurred in the respective time slot as a result of a processor core requesting a memory block from a memory hierarchy comprising main memory and one or more caches between the main memory and the processor core; training a supervised machine learning algorithm model to predict, for each of the plurality of strides, whether the respective stride will occur in a next time slot from a current time slot, wherein the training for each respective stride is based on the plurality for recent histories for the respective stride; applying the model to predict, for each of the plurality of strides, whether the stride will occur in a next time slot from a current time slot; and prefetching memory blocks from the memory hierarchy for strides for which the prediction is above a threshold.

17. The method of claim 16, wherein training the supervised machine learning algorithm model comprises: learning weights for each respective stride based on the plurality of recent histones for each respective stride.

18. The method of claim 17, wherein applying the model to predict, for each of the plurality of strides, whether the stride will occur in a next time slot from a current time slot comprises: predicting whether the respective stride will occur in a next time slot from a current time slot based on the weights for the respective stride and the recent history for the respective stride for the current time slot.

19. A computing system comprising: -33- a processor core; a memory hierarchy comprising main memory and one or more caches between the main memory and the processor core; a stride table; and a stride prefetcher in communication with the memory hierarchy and the stride table, wherein the stride prefetcher is configured to: track a plurality of recent history vectors for each of a plurality of strides, wherein the recent history vector for a respective stride comprises a plurality of bits, wherein each bit corresponds to a different time slot in a series of recent time slots in which the processor core requested a memory block from the memory hierarchy, wherein the bit for each respective time slot indicates whether the stride occurred in the respective time slot; train a binary classifier to predict for each respective stride whether the processor core will in a next time slot request a memory block in the memory hierarchy at a memory address obtained by adding the respective stride to a memory address requested by the processor core in a current time slot, wherein the training for each respective stride occurs in response to memory requests from the processor core and is based on the plurality of recent history vectors for the respective stride with each recent history vector corresponding to one of the memory requests; apply the binary classifier to predict, for each of the respective strides, whether the processor core’s memory request in a next time slot will be at a memory address obtained by adding the respective stride to a memory address requested by the processor core in a current time slot; and prefetch from the memory hierarchy for strides for which the prediction is above a threshold.

20. The computing system of claim 19, wherein the binary classifier comprises a perceptron model.