WO2022171309A1 - An apparatus and method for performing enhanced pointer chasing prefetcher - Google Patents


Info

Publication number
WO2022171309A1
Authority
WO
WIPO (PCT)
Prior art keywords
pointer
load
data structure
dtt
prefetch
Prior art date
Application number
PCT/EP2021/053637
Other languages
French (fr)
Inventor
Leeor Peled
Nikolay CHERNUHA
Lyu NAN
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/053637 priority Critical patent/WO2022171309A1/en
Priority to EP21705943.5A priority patent/EP4248321A1/en
Publication of WO2022171309A1 publication Critical patent/WO2022171309A1/en


Classifications

    • G06F12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with prefetch
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/0897 Caches characterised by their organisation or structure with two or more cache hierarchy levels
    • G06F2212/1024 Performance improvement by latency reduction
    • G06F2212/507 Control mechanisms for virtual memory, cache or TLB using speculative control
    • G06F2212/6024 History based prefetching
    • G06F2212/6028 Prefetching based on hints or prefetch instructions

Definitions

  • the present disclosure in some embodiments thereof, relates to computer systems, more specifically, but not exclusively, to an apparatus and method for performing enhanced pointer chasing prefetcher.
  • a linked data structure is a data structure which consists of a set of data records (nodes) linked together and organized by references (links or pointers).
  • the link between data can also be called a connector.
  • links are usually treated as special data types that can only be dereferenced or compared for equality.
  • Linked data structures are thus contrasted with arrays and other data structures that require performing arithmetic operations on pointers. This distinction holds even when the nodes are actually implemented as elements of a single array, and the references are actually array indices: as long as no arithmetic is done on those indices, the data structure is essentially a linked one. Linking can be done in two ways - using dynamic allocation and using array index linking.
  • Linked data structures include linked lists, search trees, expression trees, and many other widely used data structures. They are also key building blocks for many efficient algorithms, such as topological sort and set union-find. Many programs in all market segments today employ algorithms that use various types of lists, trees, graphs and other forms of linked data structures.
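The two linking styles described above can be sketched in a short illustrative Python example (the class and function names are invented for illustration, not part of the disclosure): dynamic allocation links nodes through references, while array-index linking keeps nodes in arrays and links them by indices on which no arithmetic is performed.

```python
class Node:
    """Dynamically allocated node: the link is a reference (pointer)."""
    def __init__(self, value, nxt=None):
        self.value = value
        self.next = nxt

def traverse_pointers(head):
    """Follow the pointer chain; each step dereferences a link."""
    out = []
    node = head
    while node is not None:
        out.append(node.value)
        node = node.next
    return out

def traverse_indices(values, links, start):
    """Array-index linking: links are indices into parallel arrays.
    No arithmetic is done on the indices, so the structure is still
    essentially a linked one, as noted above."""
    out = []
    i = start
    while i != -1:                  # -1 marks the end of the list
        out.append(values[i])
        i = links[i]
    return out

# Both traversals visit the same logical list 10 -> 20 -> 30.
head = Node(10, Node(20, Node(30)))
assert traverse_pointers(head) == [10, 20, 30]
assert traverse_indices([30, 10, 20], [-1, 2, 0], 1) == [10, 20, 30]
```

In both cases the next element's location is only known after the current element's data has returned from memory, which is what serializes the memory latencies that the disclosed prefetcher targets.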
  • Prefetches are memory accesses originated in a prefetcher unit at a cache layer and are dispatched from the cache layer to the memory.
  • the dispatching of enhanced pointer chasing prefetches reduces memory access latency for memory load operations done by a program using linked data structures.
  • the present disclosure presents a hardware solution for prefetching complex linked data structures, and a hardware recursive traversal prefetcher for linked structures.
  • a computing apparatus detects, in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers which are addresses or indexes used for another load instruction; maps the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch which is a memory access originated in a prefetcher unit at a cache layer and is dispatched from the cache layer to the memory; and recursively dispatches prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
  • analyzing the pointer load is done according to the following data structures: a data type table, DTT, comprising one or more lines, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines, representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetches addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
  • each line in the linkQ table contains: a column of a prefetch address; and the line and columns in the DTT of the pointer that triggered the prefetch stored in that linkQ line; and wherein every address or index contained in a pointer load that matches an address stored in the linkQ table updates the DTT line representing the structure type the pointer load accesses, populating the link value at the line and columns in the DTT representing the pointer that prefetched the address that was matched in the linkQ.
  • the linkQ table indicates when an execution of a program is using a width mode where, upon the returned data of a pointer load, more than one addresses or indexes in the pointed data structure of the program are used, or a depth mode where only one address or index in the pointed data structure is used per every pointer load accessing a data structure.
  • the linkQ table indicates when to dispatch one prefetch or more than one prefetch from the cache layer, according to the indicated width or depth mode.
  • the computing apparatus is further adapted to choose which pointer to prefetch in case of a depth mode, predicting the path of pointer loads executed by the program in case of a depth mode, using a table that records which pointer was previously used per any given history of pointers leading to the pointer that was previously used.
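The depth-mode path prediction described above can be sketched as a small software model. The `PathPredictor` class, its history length, and the pointer names are assumptions made for illustration only; the disclosure describes a hardware table recording which pointer was previously used for any given history of pointers.

```python
from collections import deque

class PathPredictor:
    """Illustrative table keyed by the recent history of pointer choices;
    it records which pointer was taken last time under the same history."""
    def __init__(self, history_len=2):
        self.history = deque(maxlen=history_len)
        self.table = {}              # history tuple -> pointer taken last time

    def predict(self):
        """Pointer predicted for the current history, or None if unseen."""
        return self.table.get(tuple(self.history))

    def record(self, pointer_taken):
        """Learn which pointer the program actually followed."""
        self.table[tuple(self.history)] = pointer_taken
        self.history.append(pointer_taken)

# A program alternating left/right steps: after one full repetition the
# predictor can suggest which pointer to prefetch along next.
p = PathPredictor()
for ptr in ["left", "right", "left", "right", "left"]:
    p.record(ptr)
assert p.predict() == "right"   # history (right, left) previously led right
```

In depth mode only one prefetch is dispatched per node, so such a prediction chooses which of the node's pointers that single prefetch should follow.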
  • the data structure is one of the following: a linked list, a tree, a graph, or a combination thereof.
  • the recursively dispatched prefetches are dispatched up to a predefined number of times.
  • the cache layer is a second cache layer L2 or a third cache layer L3.
  • a method for reducing memory access latency comprises: detecting in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers which are addresses or indexes used for another load instruction; mapping the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch, which is a memory access originated in a prefetcher unit at a cache layer and dispatched from the cache layer to the memory; and recursively dispatching prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
  • analyzing the pointer load is done according to the following data structures: a data type table, DTT, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetches addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
  • FIG. 1 schematically shows a sequence diagram of the linked data structure traversal and serialization of memory latencies
  • FIG. 2A is a schematic sequence diagram of a load of a linked data structure roundtrip shortening, according to some embodiments of the present disclosure
  • FIG. 2B is a schematic sequence diagram of a recursive dispatch of further prefetches based on previous ones, according to some embodiments of the present disclosure
  • FIG. 3 schematically shows a block diagram of a computing apparatus for performing enhanced pointer chasing prefetches, according to some embodiments of the present disclosure
  • FIG. 4 schematically shows an example of the tables stored in the L2 layer, according to some embodiments of the present disclosure
  • FIG. 5 schematically shows a flowchart of a method for performing enhanced pointer chasing prefetches, according to some embodiments of the present disclosure
  • FIG. 6 schematically shows a flow of the enhanced pointer chasing prefetches, according to some embodiments of the present disclosure
  • FIG. 7 schematically shows an example of two lines in a DTT table which points to the same type of structure, and the merge of the two lines, according to some embodiments of the present disclosure.
  • the present disclosure in some embodiments thereof, relates to computer systems, and, more specifically, but not exclusively, to an apparatus and method for enhanced pointer chasing prefetcher.
  • a linked data structure consists of a set of data records called nodes, linked together and organized by references which are called links or pointers.
  • the pointers are usually treated as special data types that can only be dereferenced, i.e. access the value in the memory address the pointer points to, or compared for equality.
  • a generic node in a complex data structure may have static values, and may contain multiple pointers linking to other nodes of any type. The values and pointers in the node can be accessed by certain load instructions in the program (herein after loads), at certain locations in the program code represented by specific program counters (PCs) addresses.
  • FIG. 1 schematically shows a sequence diagram of the linked data structure traversal (which is a path on the linked data structure that starts from some node and passes several other nodes in order to reach some target) and serialization of memory latencies.
  • a load instruction of a linked data structure is executed.
  • An execution layer (OEX) 101 of a processor executes a program, with a load instruction, which loads from a pointer that is kept in register x1, and writes the result back to register x1.
  • the pointer [x1] points to an address in memory 105.
  • the load traverses a path through the load store unit (LSU) layer 102, and the first cache layer L1. Then it passes through the L2 layer 103 and L3 layer 104, which are cache layers with a prefetcher unit, until it gets to memory 105.
  • the load brings data from the memory to the OEX 101, and when the returned load reaches the execution layer, it completes one roundtrip from the OEX 101 and back, with a roundtrip time 106.
  • the load of the data structure which returns from the memory with data that contains pointers to other addresses, is defined as a pointer load. From FIG. 1 it can be seen that the time latency for executing loads of data structures is very long due to the path the executed load instruction needs to pass.
  • a first existing solution uses temporal correlation, which records all address sequences or correlations emerging from the data structure.
  • This solution suffers from either bad scalability due to limited storage, when relying on in-core tables to store the relations, or from huge storage footprints, when using stolen memory or cache ranges.
  • a second solution attempts to bypass some of the traversal steps through shortcut links and/or hints added during compilation.
  • these solutions rely on compiler assistance with special optimizations, and incur additional code footprint and more importantly - additional storage that scales with the data structure size.
  • a third solution attempts to shorten the latency roundtrip by processing the traversal closer to memory or even in memory (this technique is sometimes called processing in memory, PIM).
  • the present disclosure provides a solution based on running ahead across the data structure faster than the program traverses it. This may be achieved by dedicated hardware that can detect linked traversal steps at lower cache levels (i.e. L2 103 or L3 104), and perform them once the pointer data arrives, thereby saving the remaining latency of sending the data to the load and store unit (LSU) of the processor, waking up the dependent loads, performing the arithmetic logic unit (ALU) operations to extract the next pointer, and sending the next pointer load.
  • Embodiments may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • a non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing.
  • a computer readable storage medium is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers.
  • a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
  • Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
  • These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
  • the computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 2A is a schematic sequence diagram of a load of a linked data structure roundtrip shortening, according to some embodiments of the present disclosure. From FIG. 2A it can be seen that the memory access roundtrip may be reduced by processing the data and dispatching the next step fetch closer to the memory at the L2 cache layer 203, and earlier than the program execution sends the next load to memory 205, thereby reducing a fraction of the memory access latency.
  • a second load request 221 original (dashed line) round-trip is replaced by a prefetch 222 originated at L2 203, thanks to its early dispatch.
  • the prefetch is a memory access originated in a prefetcher unit at a cache layer and is dispatched from the cache layer to the memory.
  • the early dispatch of prefetch 222 reduces the roundtrip time of the load 221 down to the roundtrip time 223, instead of the original time at which load 221 would have returned had it not encountered the prefetch along the way (lower dashed line).
  • the prefetch may be originated and dispatched from the L3 cache layer 204, or even from a memory controller (not shown) which is outside the processor and is considered part of the memory, further reducing the roundtrip time 223.
  • FIG. 2B schematically shows a sequence diagram of a recursive dispatch of further prefetches based on previous ones, according to some embodiments of the present disclosure.
  • the recursion of the prefetches dispatch accumulates the latency reduction in the roundtrip time and allows the sequence of prefetches to advance to an arbitrary depth ahead of the program that is enough to cover the memory latency.
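The recursive dispatch described above, bounded by a predefined depth as stated earlier, can be sketched with a toy software memory model (the dictionary-based memory, the offset value, and the function name are illustrative assumptions, not the patent's hardware):

```python
def recursive_prefetch(memory, address, pointer_offset, max_depth):
    """As each prefetch 'returns' node data, dispatch the next prefetch
    along the pointer found at `pointer_offset`, up to `max_depth` steps
    ahead of the program."""
    prefetched = []
    depth = 0
    while address is not None and depth < max_depth:
        node = memory[address]               # prefetched data returns
        prefetched.append(address)
        address = node.get(pointer_offset)   # next pointer in returned data
        depth += 1
    return prefetched

# Toy memory: each node stores its next pointer at offset 8.
memory = {
    0x100: {8: 0x200},
    0x200: {8: 0x300},
    0x300: {8: None},
}
assert recursive_prefetch(memory, 0x100, 8, max_depth=2) == [0x100, 0x200]
```

Each recursion step removes one full cache-to-execution roundtrip from the critical path, which is how the sequence of prefetches advances arbitrarily far ahead of the program.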
  • the enhanced pointer chasing prefetch may detect complex linked data structures. According to some embodiments of the present disclosure, the enhanced pointer chasing prefetch attempts to learn each node type and map the locations of all internal offsets of pointers linking out of the node, as well as the type of each pointer. However, in order to prefetch links within the data structures, the enhanced pointer chasing prefetch must learn the pointers’ offsets within the program type semantics (the layout of structures in the program and the offsets of pointers within them), and the identity of the loads used to dereference them.
  • learning the pointers’ offsets within the program type semantics, and the identity of the loads used to dereference them may be done using hardware logic only (no hints from software), by tracking dependent memory operations within an out-of-order (OOO) execution layer and learning the relations between the memory operations (for example, two load operations) and the offsets used for dereferencing one after the other.
  • FIG. 3 schematically shows a block diagram of a computing apparatus for performing enhanced pointer chasing prefetch, according to some embodiments of the present disclosure.
  • Apparatus 300 includes an execution layer (OEX) 301, a load and store unit (LSU) 302, a cache layer L2, 303, and a system on chip (SoC) unit 304, which represent all the components along the path of the transaction fetching data from memory, including the L3 cache layer, memory, memory controller and the like.
  • the OEX 301 which in many cases is an out of order (OOO) execution layer, detects pointer loads by searching for load return values (from the memory) used to dereference further loads.
  • the OEX 301 checks load-to-load source-destination dependency, by checking when the outcome value of one load is used as a source for another.
  • the load-to-load source-destination dependency may be detected through intermediate operations (load->operation->load) using a physical register tracking map.
  • the load->operation pair is captured and the PC is stored in a table indexed by the physical destination register of the operation.
  • when a potential second pair (of the form operation->load) is detected using the same physical register, the PC of the first load is confirmed as a pointer load (and the accumulated offset is stored).
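The physical-register tracking described above can be modeled in software as follows. The trace format, register names, and function name are invented for illustration; the disclosure describes hardware logic performing the equivalent tracking.

```python
def detect_pointer_loads(trace):
    """trace: list of (kind, pc, src_reg, dst_reg, offset) tuples,
    where kind is 'load' or 'op'.  A map indexed by physical destination
    register records (origin load PC, accumulated offset); a later load
    sourcing that register confirms the origin as a pointer load."""
    reg_map = {}           # dst physical register -> (origin load PC, offset)
    pointer_loads = {}     # confirmed pointer-load PC -> accumulated offset
    for kind, pc, src, dst, offset in trace:
        if kind == "load":
            if src in reg_map:             # second pair: operation->load
                origin_pc, acc = reg_map[src]
                pointer_loads[origin_pc] = acc + offset
            reg_map[dst] = (pc, 0)         # this load may itself be an origin
        elif kind == "op":
            if src in reg_map:             # first pair: load->operation
                origin_pc, acc = reg_map[src]
                reg_map[dst] = (origin_pc, acc + offset)
    return pointer_loads

# load @0x130 writes r1; add r2 = r1 + 16; load @0x140 reads [r2 + 8]:
trace = [
    ("load", 0x130, None, "r1", 0),
    ("op",   0x134, "r1", "r2", 16),
    ("load", 0x140, "r2", "r3", 8),
]
assert detect_pointer_loads(trace) == {0x130: 24}
```

The accumulated offset (16 + 8 = 24 here) is exactly the pointer offset within the structure that the prefetcher later uses when a node's data returns.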
  • a scheduler code executed by the processor in the OEX 301 manages data dependencies and wakeups, and looks for pairs of loads where the first load passes data through its destination register to one of the sources of the second load.
  • the OEX 301 passes the data about the detected pointer loads, the offsets of the pointer loads, and the PC pointing to the pointer loads to the LSU unit 302, which passes it on to the L2 (cache) layer 303.
  • the L2 layer 303 is the layer, which maps the pointer load to: (a) a data structure type, (b) one or more offsets of the one or more pointers in the data structure, and (c) the data structure types the one or more pointers point to. The mapping is done by analyzing the pointer load when executed by the processor, for dispatching a prefetch.
  • computing the pointer chasing prefetch steps is carried out at L2 layer 303, although in some other embodiments of the present disclosure, the implementation of the analysis of the pointer loads may be moved to the L3 layer (not shown) for further latency reduction.
  • the L2 layer 303 includes several data structures, which enable analyzing the pointer loads and the prefetches.
  • a possible implementation for the data structures in the L2 layer 303 may be data tables.
  • a data type table (DTT) 306 which is a table that contains one or more lines, where each line represents one data structure type in a program executed by the processor.
  • each line stores one or more columns of offsets of up to N pointers observed within the data structure, one or more columns of the pointers’ usage rate, i.e. a hit counter, and one or more columns of links.
  • the link column represents a type of structure the pointer points to, according to the program type of the pointer in each offset (a self-pointer is also allowed, and would link to its own DTT line).
  • the L2 layer 303 also includes a data structure, implemented as a PC hash table 305, which is a table used to map PCs of the pointer load with the DTT 306 lines, representing the data structure type the pointer load points to, and points to their index in the DTT 306 (i.e. to their line in the DTT).
  • a linkQ table 307 is another data structure implemented as a table stored at the L2 layer 303, which is a table of dispatched prefetches.
  • the pointer load compares the address or index contained in the pointer load to the prefetch addresses in the linkQ table to detect a prefetch with the same address as a pointer load, thereby detecting the structure type that each prefetch points to.
  • the linkQ table 307 is used to track recent prefetch addresses and connect chains of dereferences in order to populate the link information for each offset in the DTT 306.
  • the linkQ table 307 is also used to track the usefulness of sent prefetches, judging by the number of demands, i.e. pointer loads executed by the program, hitting the addresses of the prefetches (a pointer load with the same address as a prefetch).
  • Each line in the linkQ table 307 is allocated by sending a pointer chasing prefetch.
  • the linkQ table 307 contains the prefetch address, and the line and column in the DTT 306 of the pointer that triggered the prefetch stored in that linkQ line.
  • the line is represented by the DTT 306 line index (id) and the column is the index of the column storing the link value for the pointer offset that was used to generate the prefetch.
  • when a demand, i.e. a pointer load executed by the program, returns back from the memory, the type (DTT id) of that demand (according to its PC) is connected as the link in the line and columns of the source pointer in the DTT 306.
  • another component in apparatus 300 is a counter array (not shown), which tracks the number of pointer load demands from the memory (i.e. executed pointer load by the program) hitting in the linkQ table (i.e. matches the address of a prefetch which was dispatched by the L2 layer 303) on each set of triggered prefetches issued by one pointer returning. This indicates both the usefulness of the prefetcher (low values show no useful prefetches) and the ratio between prefetches brought and used.
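The linkQ allocation, demand matching, DTT link population, and hit counting described in the bullets above can be sketched as a toy software model (the class shape and field names are illustrative assumptions, not the hardware organization of the disclosure):

```python
class LinkQ:
    """Illustrative linkQ: a line per dispatched prefetch, recording the
    DTT line and way of the pointer that triggered it."""
    def __init__(self):
        self.lines = {}      # prefetch address -> (dtt_line, dtt_way)
        self.hits = 0        # demands that matched a dispatched prefetch

    def allocate(self, prefetch_addr, dtt_line, dtt_way):
        """A line is allocated when a pointer chasing prefetch is sent."""
        self.lines[prefetch_addr] = (dtt_line, dtt_way)

    def demand(self, addr, demand_dtt_id, dtt):
        """On a returning demand pointer load, check for an address match;
        if found, populate the link in the triggering DTT entry and
        count the prefetch as useful."""
        if addr in self.lines:
            line, way = self.lines.pop(addr)
            dtt[line]["links"][way] = demand_dtt_id
            self.hits += 1
            return True
        return False

dtt = {0: {"links": {}}, 1: {"links": {}}}
q = LinkQ()
q.allocate(0x1000, dtt_line=0, dtt_way=1)  # prefetch triggered by DTT 0, way 1
assert q.demand(0x1000, demand_dtt_id=1, dtt=dtt)  # demand of type 1 hits it
assert dtt[0]["links"][1] == 1 and q.hits == 1
```

A low hit count relative to allocated lines corresponds to the low-usefulness case the counter array is said to detect.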
  • FIG. 4 schematically shows an example of the tables stored in the L2 layer 303, according to some embodiments of the present disclosure.
  • Table 401 is an example of a DTT table. The first column represents the ID of the data structure type of the pointer observed. Then, there are three repetitive columns of offset, usage rate (number of hits) and a link column, which represents the data structure type the pointer points to. Each triplet of offset, usage rate and link columns represents a specific pointer within the data structure type represented by the DTT line, and is marked as a way.
  • Table 402 is an example of a PC hash table. In some embodiments of the present disclosure the PC hash table 402 contains two columns.
  • the first column is a tag column, which holds the PC identified, for example PC1, PC2, PC3 and so on, or the address of the PC identified, for example 0x120, 0x130, 0x140 and so on.
  • the second column is an index column, which links to the DTT line that represents the type of data structure that is accessed by the load instruction residing at the PC.
  • the PC at address 0x120 is denoted with the index 0.
  • the index 0 represents line 0 in the table 401 of the DTT, as the ID of a data structure type.
  • the two next PCs are at addresses 0x130 and 0x140, and both PCs are denoted with the same index of 1.
  • both PCs link to the same line in the table 401 of the DTT, in the line of ID 1, which means both loads at these PC addresses access the same type of structure.
  • PC2 with the address 0x130 is entered as way 0 and PC3 with address 0x140 is entered as way 1.
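As a rough software analogy of the PC hash table in Table 402 (the example values come from the text above, but the structure names and hashing scheme here are illustrative assumptions, not the patented hardware), the table can be sketched as a small direct-mapped lookup keyed on load PCs:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch of a PC hash table: each entry maps the PC of a pointer
 * load to the DTT line (index) representing the data structure type the
 * load accesses. All names and the slot function are assumptions. */
typedef struct {
    unsigned long pc;  /* tag: address of the pointer load instruction */
    int dtt_index;     /* index of the DTT line (data structure type)  */
} pc_hash_entry;

#define PC_TABLE_SIZE 8

static size_t pc_slot(unsigned long pc) {
    return (pc >> 4) % PC_TABLE_SIZE;  /* toy direct-mapped index */
}

/* Record that the load at 'pc' accesses the structure type in DTT line 'dtt'. */
static void pc_hash_insert(pc_hash_entry *table, unsigned long pc, int dtt) {
    table[pc_slot(pc)].pc = pc;
    table[pc_slot(pc)].dtt_index = dtt;
}

/* Return the DTT line for 'pc', or -1 on a miss. */
static int pc_hash_lookup(const pc_hash_entry *table, unsigned long pc) {
    const pc_hash_entry *e = &table[pc_slot(pc)];
    return (e->pc == pc) ? e->dtt_index : -1;
}
```

With the example values above, the loads at 0x130 and 0x140 both resolve to DTT line 1, while the load at 0x120 resolves to line 0.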
  • Table 403 is an example of a linkQ table. It is a table of prefetches generated and dispatched by the L2 layer 303, before the corresponding pointer loads are executed by the program in the OEX 301.
  • the first column in the linkQ table 403 is the address of the prefetches.
  • the pointer load is checked against the LinkQ table to see whether there is a prefetch with the same address as the address of the executed pointer load. When such a match is detected, the DTT type of the hitting pointer load's PC is read from the PC hash table, and the DTT set and way of the matched prefetch are read from the second and third columns of the LinkQ table.
  • the linkQ table indicates when an execution of a program is using a width mode.
  • a width mode is a program behavior where, upon the returned data of a pointer load, more than one address or index in the pointed data structure of the program is used.
  • the linkQ table indicates when the execution of the program is using a depth mode, a behavior where only one address or index in the pointed data structure is used for every pointer load accessing a data structure.
  • When there is more than a single pointer leading from the current structure, a decision on how to proceed is taken by the L2 layer 303 with the prefetcher unit. In case it is assumed that the program will visit multiple nodes linked from the current one, or that it is not known which link is the most likely to be used and there is enough bandwidth to explore all the nodes, a width mode is chosen. In case it is assumed that a specific path is traversed, the L2 layer 303 with the pointer chasing prefetch method may try to predict the specific path based on the history of the traversal and past visits to the current node. This mode may still prefetch out-of-path nodes if it is unsure of the correct path, but most of the steps focus on a single path.
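The width-mode choice described above can be sketched as follows (a toy software model; the field names, layout and threshold handling are illustrative assumptions, not the patented hardware):

```c
#include <assert.h>

/* Sketch of width mode: when a node's data returns, a prefetch is issued
 * for every pointer way in its DTT line whose usage counter exceeds a
 * threshold. Field names and MAX_WAYS are assumptions. */
#define MAX_WAYS 3

typedef struct {
    int offset[MAX_WAYS];  /* pointer offset within the structure */
    int hits[MAX_WAYS];    /* usage-rate counter per way */
    int nways;             /* ways currently populated */
} dtt_line;

/* Collect the addresses of all qualifying pointer slots relative to the
 * returned node's base address; returns how many prefetches would issue. */
static int width_mode_prefetch(const dtt_line *line, unsigned long node_addr,
                               int threshold, unsigned long *out) {
    int n = 0;
    for (int w = 0; w < line->nways; ++w)
        if (line->hits[w] > threshold)
            out[n++] = node_addr + (unsigned long)line->offset[w];
    return n;
}
```

In this sketch a line with three ways, two of which are frequently used, would yield two prefetches per returning node; a depth-mode decision would instead pick a single way.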
  • the L2 layer 303 further includes an offset history table (OHT) 308, which is used to predict depth traversal paths.
  • OHT offset history table
  • the OHT table records, for any given history of preceding pointers, which pointer was used next.
  • the linkQ table indicates when to dispatch one prefetch from the cache layer or more than one prefetch, according to the indicated width or depth mode.
  • in case of a depth mode, the L2 layer 303 chooses which pointer to prefetch, predicting the path of pointer loads executed by the program, using the OHT table and the accumulated history of recent pointer loads and the DTT types and pointer offsets they used.
  • the prefetches are recursively dispatched from the L2 layer 303, after data of a pointer load or a previous prefetch returns from the memory, based on the element (pointer) offsets of each data structure type, which the pointer load is mapped to, before the pointer load is executed by the processor.
  • FIG. 5 schematically discloses a method for performing enhanced pointer chasing prefetch, according to some embodiments of the present disclosure.
  • a pointer load is detected by the execution layer (OEX) 301, while executing a program.
  • the pointer load is detected by searching for return values (from the memory) used to dereference further loads.
  • the pointer load is mapped to: (a) the data structure type, (b) one or more offsets of the one or more pointers in the data structure, and (c) the data structure types the one or more pointers point to.
  • the mapping is done by analyzing the pointer load when executed by the processor, for dispatching a prefetch.
  • the analysis of the pointer load is done in a cache layer for example L2 layer 303, which includes a prefetcher unit, and which dispatches prefetches to the memory.
  • the analysis includes the creation and population of tables and analysis of the tables, based on the information captured from the pointer load when dispatched from the execution and/or out-of-order layer.
  • the prefetcher unit at the cache layer L2 303 recursively dispatches prefetches.
  • the prefetches are dispatched after data of a pointer load or previous prefetch returns from the memory, based on the detected element offsets stored in the DTT of the data structure type, which the pointer load is mapped to in the PC hash. These prefetches are dispatched before the data is returned to the OEX 301 and before any dependent pointer load is executed by the processor in the OEX 301.
  • a node (data structure) type is recorded and its offsets are known
  • each time the program visits that certain node type, prefetches will be triggered at the moment the node data is returned from memory.
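This recursive triggering can be modeled with a toy sketch (not the patented hardware): here "memory" is mocked as an array indexed by address, and each returning node immediately yields the next prefetch, bounded by a depth limit:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of recursive dispatch: each node's word in the mock memory
 * holds the address of the next node. When a prefetch "returns", the next
 * pointer is extracted and a further prefetch is issued, up to a
 * predefined recursion depth. */
static int recursive_prefetch(const uint64_t *mem, uint64_t addr,
                              int max_depth, uint64_t *issued) {
    int n = 0;
    while (addr != 0 && max_depth-- > 0) {
        issued[n++] = addr;  /* dispatch a prefetch for this node */
        addr = mem[addr];    /* node data "returns": extract next pointer */
    }
    return n;
}
```

The depth bound mirrors the predefined recursion threshold mentioned later in the flow of FIG. 6.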
  • the pointer (or index) load is sent to fetch data from memory
  • the prefetcher unit in the L2 layer 303 waits for the data to return from memory, and looks up the data type table (DTT) to retrieve the structure layout information matching the pointer of the current load. If the load dereferences a pointer of type X* (pointing at a structure of type X), its PC will point to a line in the DTT holding all pointer offsets in structure X.
  • DTT data type table
  • All other loads of pointers of type X* will also be mapped to the same line in the DTT.
  • the values at the offsets the DTT line specifies are extracted (all relative to the load address, and depending on pointer sizes for the current run mode). These will be the pointers within the structure leading to other structures (possibly of other types).
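The extraction step might look as follows in software terms (illustrative only; an 8-byte pointer size is assumed here, corresponding to a 64-bit run mode):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Once the node's data has returned, the pointer values are read from that
 * data at the offsets the DTT line specifies, relative to the load address.
 * Each extracted value is a candidate address for the next prefetch. */
static uint64_t extract_pointer(const uint8_t *node_data, int offset) {
    uint64_t ptr;
    memcpy(&ptr, node_data + offset, sizeof ptr);  /* value at DTT offset */
    return ptr;
}
```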
  • the values in the pointers are translated from virtual address (VA) to physical address (PA) using a built-in translation lookaside buffer (TLB) (a subset of the actual data TLB), and a prefetch is issued to that address.
  • VA virtual address
  • PA physical address
  • TLB built-in translation lookaside buffer
  • FIG. 6 schematically shows a flow of the enhanced pointer chasing prefetch, according to some embodiments of the present disclosure.
  • the flow is divided into a train part and a trigger part.
  • the train part is the part where pointer loads are executed by the program
  • the trigger part is the part where prefetches are triggered and dispatched to the memory.
  • the pointer loads are detected in the OEX layer 301. The detection begins when a load returns data to the execution layer and writes back to one of the source registers for another load (with or without an offset).
  • a load has the following form:
  • the OEX layer 301 also supports indirect pointer dereference, for example, pointer with additional operations in the middle, such as:
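For illustration, hypothetical C equivalents of the two detected load patterns (these are assumptions standing in for the patent's own examples, which are not reproduced in this text). In both cases, the data returned by one load feeds the address of the next load:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node type; the patent does not define this layout. */
struct node { int value; struct node *next; };

/* Direct pointer dereference: the loaded value is itself the next address. */
static int chase_direct(const struct node *n) {
    return n->next->value;  /* load n->next, then load from the result */
}

/* Indirect dereference: arithmetic (base + scaled index) sits between the
 * two loads; the OEX layer also recognizes this as a pointer chase. */
static int chase_indexed(const int *array, const int *index_ptr) {
    return array[*index_ptr];  /* load the index, then load array[index] */
}
```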
  • the OEX layer 301 is also responsible for annotating the loads with a traversal ID. This makes the ID unique per each load chain, so that the chain may be individually tracked.
  • at step 602 it is checked by the L2 layer 303 whether PC1 exists in the PC hash table.
  • PC1 is not in the PC hash table
  • at step 603 a new line is allocated in the DTT table by the L2 layer 303, to insert the pointer load of PC1 into the DTT table.
  • the offset of the pointer load is inserted and the rate of usage is updated to 1; the link column is not updated at this stage, as it is not yet known what structure type the pointer points to.
  • the PC hash table is updated to include PC1.
  • the DTT table is checked by the L2 layer 303, to find the DTT line of PC1.
  • at step 606 it is checked whether the offset in the DTT line is a new offset or the same as the offset of PC1. In case the offset is not new and it is the same as the offset in the DTT table line, at step 607 the counter is raised and the rate of usage is updated. In case the offset is new, then at step 608 the new offset is added to the DTT table in the same line, but as a new way (which includes the three columns of offset, usage rate and link).
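The per-line part of this training (steps 606-608) can be sketched as follows (names and the way limit are illustrative assumptions):

```c
#include <assert.h>

/* Training sketch: for a pointer load with a given offset, either bump the
 * usage rate of an existing way, or open a new way for the new offset. */
#define MAX_WAYS 3

typedef struct {
    int offset[MAX_WAYS];
    int hits[MAX_WAYS];
    int nways;
} dtt_line;

/* Returns the way that was updated or newly allocated, or -1 if full. */
static int dtt_train(dtt_line *line, int offset) {
    for (int w = 0; w < line->nways; ++w)
        if (line->offset[w] == offset) {  /* known offset: raise usage rate */
            line->hits[w]++;
            return w;
        }
    if (line->nways == MAX_WAYS)
        return -1;                        /* no free way for a new offset */
    int w = line->nways++;                /* new offset: allocate a new way */
    line->offset[w] = offset;
    line->hits[w] = 1;
    return w;
}
```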
  • the DTT ID is marked on the memory request dispatched by the load to the memory, and at step 610, the L2 layer 303 waits for the memory request data to return from the memory with the dereferenced data, to analyze the data.
  • the depth of the recursion is checked to ensure the predefined threshold of the recursion is not crossed.
  • when the recursion depth is checked, the DTT table is consulted to determine the type of the pointer that was dispatched to the memory as a prefetch.
  • it is decided by the L2 layer 303 which mode to use, a width mode or a depth mode, to further dispatch prefetches for the one or more pointers in the returned prefetch.
  • in a width mode, at step 614, prefetches are dispatched for the pointers at all the known pointer offsets in the returned data structure which have a hit counter above a predefined threshold.
  • in a depth mode, an offset history register (OHR) is read to obtain the history of the offsets, at step 616 the OHT table is checked to decide which specific one or more offsets to dispatch as a prefetch, and at step 617 the prefetch is dispatched.
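One plausible software sketch of the OHR/OHT pair (the hashing scheme and table size are assumptions; the patent text does not specify them here):

```c
#include <assert.h>

#define OHT_SIZE 16

/* The offset history register (OHR) accumulates recent pointer offsets;
 * the OHT maps each history value to the offset predicted to be used next
 * on the depth path. */
typedef struct {
    unsigned ohr;       /* hashed history of recent offsets */
    int oht[OHT_SIZE];  /* predicted next offset per history (0 = untrained) */
} depth_predictor;

static unsigned fold(unsigned ohr, int offset) {
    return ((ohr << 3) ^ (unsigned)offset) % OHT_SIZE;
}

/* Train: remember which offset followed the current history. */
static void predictor_train(depth_predictor *p, int used_offset) {
    p->oht[p->ohr] = used_offset;
    p->ohr = fold(p->ohr, used_offset);
}

/* Predict the next offset for the current history. */
static int predictor_predict(const depth_predictor *p) {
    return p->oht[p->ohr];
}
```

After observing an alternating offset pattern, the sketch predicts the next offset on the path from the current history alone.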
  • the addresses of the dispatched prefetches (a specific one or more prefetches in a depth mode, or all the prefetches in a width mode), which may be referred to as predicted addresses, are inserted into the linkQ table, at step 618.
  • at step 619, in the train part, the L2 layer 303 checks, for future pointer loads, whether their address matches one of the predicted addresses added in step 618 to the linkQ table.
  • When the address is found in the linkQ table, it means that one of the prefetches predicted the address of the pointer load. This means the offset of the prefetch is a useful offset, and it enables finding out the type of the structure that the pointer load points to. Therefore, at step 620, the link column in the DTT line of the checked pointer load is populated with the type of structure the pointer load points to.
  • step 621 of detecting a link conflict and step 622 of triggering a merge refer to a case where two different lines in the DTT table represent the same type of structure and should therefore be merged.
  • when a pointer in the DTT that is marked as linking to a specific DTT line, and a pointer load which the PC hash table maps to another DTT line, are using the same address, the two or more lines in the DTT are merged into one line, which represents the type of data structure the pointer in the DTT and the pointer load point to.
  • FIG. 7 schematically shows an example of two lines in a DTT table, which represent the same type of structure and points to each other, and the merge of the two lines, according to some embodiments of the present disclosure.
  • code 701 is a part of an executed program, which describes a binary tree.
  • left represents the left pointer of the binary tree and right represents the right pointer of the binary tree.
  • the pointer ‘a’ has the same type, but the L2 layer 303 cannot know that, since the accesses to the pointers are from two different PCs. Therefore, the two PCs are referred to as containing loads with two different types.
  • DTT lines link to each other representing the concept that the path is always switching directions (which is the case for two real different structures).
  • the enhanced pointer chasing prefetch method tries to link DTT[0].way[0] back to DTT[0], as can be seen in DTT table 706, which results in a conflict.
  • DTT[0] equals DTT[1]
  • the two lines are merged, so that the line of DTT[1] is inserted into the line of DTT[0] as way 1, as seen in DTT table 707.
  • each pointer returning data for DTT[0] can prefetch both left and right pointers (or choose from among them based on the prediction scheme in case of a depth mode).
  • the merge process extracts the offsets from both lines and matches them. Cross links between the two structures become self links, and self links remain.
  • the line is cleared. Following the merge, the cleared DTT line is retained for a predefined time in a special state in order to correct the pointers to it in the PC hash table to point at the new merged line.
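The merge of FIG. 7 can be sketched as follows (illustrative only; this covers the way matching and link remapping, with assumed field names):

```c
#include <assert.h>

/* Merge sketch: ways of the duplicate line are folded into the surviving
 * line, matching offsets where they already exist; links that pointed at
 * either of the two lines become self links to the merged line, as in the
 * DTT[0]/DTT[1] binary-tree example. */
#define MAX_WAYS 4

typedef struct {
    int offset[MAX_WAYS];
    int link[MAX_WAYS];  /* DTT line id the pointer way links to */
    int nways;
} dtt_line;

static void dtt_merge(dtt_line *dst, const dtt_line *src,
                      int dst_id, int src_id) {
    for (int s = 0; s < src->nways; ++s) {   /* fold src's ways into dst */
        int found = 0;
        for (int w = 0; w < dst->nways; ++w)
            if (dst->offset[w] == src->offset[s]) { found = 1; break; }
        if (!found && dst->nways < MAX_WAYS) {
            dst->offset[dst->nways] = src->offset[s];
            dst->link[dst->nways] = src->link[s];
            dst->nways++;
        }
    }
    for (int w = 0; w < dst->nways; ++w)     /* cross links -> self links */
        if (dst->link[w] == src_id || dst->link[w] == dst_id)
            dst->link[w] = dst_id;
}
```

After the merge, any pointer returning data for the merged line can prefetch both the left and the right pointer slots.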
  • composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • the singular form “a”, “an” and “the” include plural references unless the context clearly dictates otherwise.
  • the term “a compound” or “at least one compound” may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
  • a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range.
  • the phrases “ranging/ranges between” a first indicate number and a second indicate number and “ranging/ranges from” a first indicate number “to” a second indicate number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.

Abstract

An apparatus and methods for performing enhanced pointer chasing prefetch are disclosed. The method includes detecting, during an execution of a program, a pointer load, which is a load instruction of one or more elements in a data structure, wherein returned data of the load instruction from a memory contains one or more pointers for another load instruction; mapping the pointer load to: a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, for dispatching a prefetch from the cache layer to the memory; and recursively dispatching prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the offsets of pointers within each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.

Description

AN APPARATUS AND METHOD FOR PERFORMING ENHANCED POINTER CHASING PREFETCHER TECHNICAL FIELD
The present disclosure, in some embodiments thereof, relates to computer systems, more specifically, but not exclusively, to an apparatus and method for performing enhanced pointer chasing prefetcher. BACKGROUND
In computer science, a linked data structure is a data structure which consists of a set of data records (nodes) linked together and organized by references (links or pointers). The link between data can also be called a connector.
In linked data structures, the links are usually treated as special data types that can only be dereferenced or compared for equality. Linked data structures are thus contrasted with arrays and other data structures that require performing arithmetic operations on pointers. This distinction holds even when the nodes are actually implemented as elements of a single array, and the references are actually array indices: as long as no arithmetic is done on those indices, the data structure is essentially a linked one. Linking can be done in two ways - using dynamic allocation and using array index linking.
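The two linking styles named above can be sketched in C (the types and traversals are illustrative, not from the patent):

```c
#include <assert.h>
#include <stddef.h>

struct heap_node { int value; struct heap_node *next; };  /* dynamic allocation */
struct arr_node  { int value; int next; };                /* array index linking */

/* Both traversals only dereference or follow links, with no arithmetic on
 * the links themselves, which is what makes the structure "linked" in the
 * sense used here. */
static int sum_heap_list(const struct heap_node *n) {
    int s = 0;
    for (; n != NULL; n = n->next) s += n->value;
    return s;
}

static int sum_arr_list(const struct arr_node *nodes, int head) {
    int s = 0;
    for (int i = head; i >= 0; i = nodes[i].next) s += nodes[i].value;
    return s;
}
```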
Linked data structures include linked lists, search trees, expression trees, and many other widely used data structures. They are also key building blocks for many efficient algorithms, such as topological sort and set union-find. Many programs in all market segments today employ algorithms that use various types of lists, trees, graphs and other forms of linked data structures. SUMMARY
It is an object of the present disclosure to provide an apparatus and method for performing an enhanced pointer chasing prefetching. Prefetches are memory accesses originated in a prefetcher unit at a cache layer and are dispatched from the cache layer to the memory. The dispatching of enhanced pointer chasing prefetches reduces memory access latency for memory load operations done by a program using linked data structures. The present disclosure presents a hardware solution for prefetching complex linked data structures, and a hardware recursive traversal prefetcher for linked structures.
It is another object of the present disclosure to provide a method for reconstructing semantics of program data-structures and algorithmic behavior in hardware.
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect of the present disclosure, a computing apparatus, is disclosed. The computing apparatus: detects, in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers which are addresses or indexes used for another load instruction; maps the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch which is a memory access originated in a prefetcher unit at a cache layer and is dispatched from the cache layer to the memory; and recursively dispatches prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
The combination of the analysis of the prefetches and the recursively dispatched prefetches significantly reduces the memory access latency. In a further implementation of the first aspect, analyzing the pointer load is done according to the following data structures: a data type table, DTT, comprising one or more lines, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines, representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetch addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
In a further implementation of the first aspect each three DTT columns of offset, usage rate and link to the DTT line representing the data structure type that the pointer points to, represent a specific pointer within the data structure type represented by the DTT line.
In a further implementation of the first aspect, when a pointer in the DTT that is marked as linking to a specific DTT line, and a pointer load which the PC hash table maps to another DTT line, are using a same address, then the two or more lines in the DTT are merged to one line, which represents the type of data structure the pointer in the DTT and the pointer load point to.
In a further implementation of the first aspect, each line in the linkQ table contains: a column of a prefetch address; and the line and columns in the DTT of the pointer that triggered the prefetch stored in that linkQ line; and wherein every address or index contained in a pointer load that matches an address stored in the LinkQ table, updates the DTT line representing the structure type of the pointer load accesses, in the DTT, populating the link value at the line and columns in the DTT, representing the pointer that prefetched the address that was matched in the linkQ.
In a further implementation of the first aspect, the linkQ table indicates when an execution of a program is using a width mode where, upon the returned data of a pointer load, more than one addresses or indexes in the pointed data structure of the program are used, or a depth mode where only one address or index in the pointed data structure is used per every pointer load accessing a data structure.
In a further implementation of the first aspect, the linkQ table indicates when to dispatch one prefetch from the cache layer or more than one prefetch, according to the indicated width or depth mode.
In a further implementation of the first aspect the computing apparatus is further adapted to choose which pointer to prefetch in case of a depth mode, predicting the path of pointer loads executed by the program in case of a depth mode, using a table that records which pointer was previously used per any given history of pointers leading to the pointer that was previously used.
In a further implementation of the first aspect, the data structure is one of the following: a linked list, a tree, a graph, or a combination thereof.
In a further implementation of the first aspect, the recursively dispatched prefetches are dispatched up to a predefined number of times.
In a further implementation of the first aspect, the cache layer is a second cache layer L2 or a third cache layer L3.
According to a second aspect, a method for reducing memory access latency is disclosed. The method comprises: detecting in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers which are addresses or indexes used for another load instruction; mapping the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch, which is a memory access originated in a prefetcher unit at a cache layer and dispatched from the cache layer to the memory; and recursively dispatching prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
In a further implementation of the second aspect, analyzing the pointer load is done according to the following data structures: a data type table, DTT, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetches addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments pertain. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWING(S)
Some embodiments are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments may be practiced.
In the drawings:
FIG. 1 schematically shows a sequence diagram of the linked data structure traversal and serialization of memory latencies;
FIG. 2A is a schematic sequence diagram of a load of a linked data structure roundtrip shortening, according to some embodiments of the present disclosure;
FIG. 2B is a schematic sequence diagram of a recursive dispatch of further prefetches based on previous ones, according to some embodiments of the present disclosure;
FIG. 3 schematically shows a block diagram of a computing apparatus for performing enhanced pointer chasing prefetches, according to some embodiments of the present disclosure;
FIG. 4 schematically shows an example of the tables stored in the L2 layer, according to some embodiments of the present disclosure;
FIG. 5 schematically shows a flowchart of a method for performing enhanced pointer chasing prefetches, according to some embodiments of the present disclosure;
FIG. 6 schematically shows a flow of the enhanced pointer chasing prefetches, according to some embodiments of the present disclosure;
FIG. 7 schematically shows an example of two lines in a DTT table which point to the same type of structure, and the merge of the two lines, according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The present disclosure, in some embodiments thereof, relates to computer systems, and, more specifically, but not exclusively, to an apparatus and method for enhanced pointer chasing prefetcher.
Linked data structures such as linked lists, trees and graphs, or a combination thereof, are widely used in client and server applications. A linked data structure consists of a set of data records called nodes, linked together and organized by references which are called links or pointers. The pointers are usually treated as special data types that can only be dereferenced, i.e. used to access the value in the memory address the pointer points to, or compared for equality. A generic node in a complex data structure may have static values, and may contain multiple pointers linking to other nodes of any type. The values and pointers in the node can be accessed by certain load instructions in the program (hereinafter: loads), at certain locations in the program code represented by specific program counter (PC) addresses. The pointers can also be in the form of indexes to specific arrays. The node is contiguous in memory, and each link pointer resides at a specific fixed offset relative to the head of the node, known at compilation time. FIG. 1 schematically shows a sequence diagram of the linked data structure traversal (which is a path on the linked data structure that starts from some node and passes several other nodes in order to reach some target) and the serialization of memory latencies. In FIG. 1, a load instruction of a linked data structure is executed. An execution layer (OEX) 101 of a processor executes a program with a load instruction, which loads from a pointer that is kept in register x1, and writes the result back to register x1. The pointer [x1] points to an address in memory 105. During the execution, the load traverses a path through the load store unit (LSU) layer 102 and the first cache layer L1. Then it passes through the L2 layer 103 and L3 layer 104, which are cache layers with a prefetcher unit, until it gets to memory 105.
The load brings data from the memory to the OEX 101, and when the returned load reaches the execution layer, it finishes one roundtrip from the OEX 101 to the OEX with a roundtrip time 106. In the data returned from memory 105, there may be further addresses (pointers) for other loads to access the memory. The load of the data structure, which returns from the memory with data that contains pointers to other addresses, is defined as a pointer load. From FIG. 1 it can be seen that the time latency for executing loads of data structures is very long, due to the path the executed load instruction needs to pass.
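In C terms, the serialized traversal of FIG. 1 looks like the following sketch (illustrative): each iteration's load address depends on data returned by the previous load, so the memory roundtrips cannot overlap:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative node type for a pointer-chasing chain. */
struct node { int value; struct node *next; };

static int chain_length(const struct node *head) {
    int len = 0;
    for (const struct node *n = head; n != NULL; n = n->next)
        ++len;  /* the load of n->next cannot issue until n has returned */
    return len;
}
```

Every step of this loop is a pointer load in the sense defined above: its returned data contains the address used by the next load.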
Although linked data structures are widely used in a variety of applications, they remain some of the hardest to prefetch, i.e. to access memory for earlier than the program runtime, and are among the slowest workloads in use by modern processors. The constantly widening gap between processor performance and memory latencies makes the memory serialization inherent in these workloads a critical bottleneck that will only intensify in the future.
To date, there are partial solutions for the pointer chasing problem, which fall under three categories. The first is called temporal correlation, which records all address sequences or correlations emerging from the data structure. This solution suffers from either bad scalability due to limited storage, when relying on in-core tables to store the relations, or from huge storage footprints, when using stolen memory or cache ranges. A second solution attempts to bypass some of the traversal steps through shortcut links and/or hints added during compilation. However, these solutions rely on compiler assistance with special optimizations, and incur additional code footprint and, more importantly, additional storage that scales with the data structure size.
A third solution attempts to shorten the latency roundtrip by processing the traversal closer to memory or even in memory (this technique is sometimes called processing in memory, PIM). However, it suffers from lack of information on the page-map (or translation lookaside buffer (TLB) caching it) and therefore has to implement its own page walks far from the core. In addition it does not have the ability to learn pointer offsets of complex data structures and their interactions.
There is therefore a need to find a solution that allows prefetching (i.e., accessing memory earlier than the executed program) ahead at runtime by using only knowledge of the running program. However, the nature of most linked data structures is that there are no “shortcuts” (except for premeditated ones such as skip lists). There is no way for the hardware to speculate or extrapolate ahead other than to prefetch all elements along the way (the only other possible way is to construct links that skip subsets of the structure, also known as jump pointers, but that approach is expensive to store).
The present disclosure provides a solution based on running ahead across the data structure faster than the program traverses it. This may be achieved by dedicated hardware that can detect linked traversal steps at lower cache levels (i.e. L2 103 or L3 104), and perform them once the pointer data arrives, thereby saving the remaining latency of sending the data to the load and store unit (LSU) of the processor, waking up the dependent loads, performing the arithmetic logic unit (ALU) operations to extract the next pointer, and sending the next pointer load.
Before explaining at least one embodiment in detail, it is to be understood that embodiments are not necessarily limited in their application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. Implementations described herein are capable of other embodiments or of being practiced or carried out in various ways.
Embodiments may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the embodiments.
The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of embodiments may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of embodiments.
Aspects of embodiments are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 2A, which is a schematic sequence diagram of roundtrip shortening for a load of a linked data structure, according to some embodiments of the present disclosure. From FIG. 2A it can be seen that the memory access roundtrip may be reduced by processing the data and dispatching the next step fetch closer to the memory, at the L2 cache layer 203, and earlier than the program execution sends the next load to memory 205, thereby reducing a fraction of the memory access latency. In FIG. 2A, the original round-trip (dashed line) of a second load request 221 is replaced by a prefetch 222 originated at L2 203, thanks to its early dispatch. The prefetch is a memory access originated in a prefetcher unit at a cache layer and is dispatched from the cache layer to the memory. The early dispatch of prefetch 222 reduces the roundtrip time of the load 221 down to the roundtrip time 223, instead of the original time at which load 221 would have returned had it not encountered the prefetch along the way (lower dashed line). In some embodiments of the present disclosure, the prefetch may be originated and dispatched from the L3 cache layer 204, or even from a memory controller (not shown), which is outside the processor and is considered a part of the memory, reducing the roundtrip time 223 even further. FIG. 2B schematically shows a sequence diagram of a recursive dispatch of further prefetches based on previous ones, according to some embodiments of the present disclosure. The recursion of the prefetch dispatch accumulates the latency reduction in the roundtrip time and allows the sequence of prefetches to advance to an arbitrary depth ahead of the program that is enough to cover the memory latency.
According to some embodiments of the present disclosure, the enhanced pointer chasing prefetch may detect complex linked data structures. According to some embodiments of the present disclosure, the enhanced pointer chasing prefetch attempts to learn each node type and map the locations of all internal offsets of pointers linking out of the node, as well as the type of each pointer. However, in order to prefetch links within the data structures, the enhanced pointer chasing prefetch must learn the pointers’ offsets within the program type semantics (the layout of structures in the program and the offsets of pointers within them), and the identity of the loads used to dereference them. According to some embodiments of the present disclosure, learning the pointers’ offsets within the program type semantics, and the identity of the loads used to dereference them, may be done using hardware logic only (no hints from software), by tracking dependent memory operations within an out-of-order (OOO) execution layer and learning the relations between the memory operations (for example, two load operations) and the offsets used for dereferencing one after the other.
FIG. 3 schematically shows a block diagram of a computing apparatus for performing enhanced pointer chasing prefetch, according to some embodiments of the present disclosure.
Apparatus 300 includes an execution layer (OEX) 301, a load and store unit (LSU) 302, a cache layer L2, 303, and a system on chip (SoC) unit 304, which represents all the components along the path of the transaction fetching data from memory, including the L3 cache layer, memory, memory controller and the like. The OEX 301, which in many cases is an out of order (OOO) execution layer, detects pointer loads by searching for load return values (from the memory) used to dereference further loads. In some embodiments of the present disclosure, the OEX 301 checks load-to-load source-destination dependency, by checking when the outcome value of one load is used as a source for another. In some other embodiments of the present disclosure, in a more complex implementation, the load-to-load source-destination dependency may be detected through intermediate operations (load->operation->load) using a physical register tracking map. The load->operation pair is captured and the PC is stored in a table indexed by the physical destination register of the operation. Once a potential second pair (of the form operation->load) is detected using the same physical register, the PC of the first load is confirmed as a pointer load (and the accumulated offset is stored). A scheduler, executed by the processor in the OEX 301, manages data dependencies and wakeups, and looks for pairs of loads where the first load passes data through its destination register to one of the sources of the second load. The OEX 301 passes the data about the detected pointer loads, the offsets of the pointer loads, and the PC pointing to the pointer loads to the LSU unit 302, which passes it on to the L2 (cache) layer 303.
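As an illustrative, non-limiting sketch of the simple (direct load-to-load) case of this detection, the logic can be modelled in software as follows; the trace format and names are assumptions made for illustration, not part of the disclosure:

```python
# Hedged sketch of load-to-load dependency detection: the PC of a load is
# remembered under its physical destination register, and when a later load
# uses that register as an address source, the earlier load is confirmed
# as a pointer load (with the offset used for the dereference).

def detect_pointer_loads(instructions):
    """instructions: list of (pc, kind, dest_reg, src_reg, offset) tuples,
    where kind is 'load'. Returns {pointer_load_pc: dereference_offset}."""
    by_dest = {}        # physical dest register -> PC of the producing load
    pointer_loads = {}  # confirmed pointer loads and the offset used
    for pc, kind, dest, src, offset in instructions:
        if kind == 'load':
            if src in by_dest:                 # address comes from a prior load
                pointer_loads[by_dest[src]] = offset
            by_dest[dest] = pc                 # remember producer of dest
    return pointer_loads

# PC 0x120 loads x2; PC 0x130 dereferences x2 -> 0x120 is a pointer load.
trace = [(0x120, 'load', 'x2', 'x1', 0),
         (0x130, 'load', 'x3', 'x2', 8)]
print(detect_pointer_loads(trace))  # {288: 8}  (0x120 -> offset 8)
```

The indirect case (load->operation->load) would additionally propagate the producing PC through intermediate operations while accumulating their offsets.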
According to some embodiments of the present disclosure, the L2 layer 303 is the layer which maps the pointer load to: (a) a data structure type, (b) one or more offsets of the one or more pointers in the data structure, and (c) the data structure types the one or more pointers point to. The mapping is done by analyzing the pointer load when executed by the processor, for dispatching a prefetch. In fact, computing the pointer chasing prefetch steps is carried out at the L2 layer 303, although in some other embodiments of the present disclosure, the implementation of the analysis of the pointer loads may be moved to the L3 layer (not shown) for further latency reduction. In some embodiments of the present disclosure, the L2 layer 303 includes several data structures, which enable analyzing the pointer loads and the prefetches. According to some embodiments of the present disclosure, a possible implementation for the data structures in the L2 layer 303 may be data tables. For example, a data type table (DTT) 306 is a table that contains one or more lines, where each line represents one data structure type in a program executed by the processor. In the DTT 306, each line stores one or more columns of offsets of up to N pointers observed within the data structure, one or more columns of the pointers' usage rate (i.e. a hit counter), and one or more columns of links. The link column represents the type of structure the pointer points to, according to the program type of the pointer at each offset (a self-pointer is also allowed, and would link to its own DTT line).
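For illustration, one possible software model of a single DTT line is sketched below; the way count and field names are assumptions, not taken from the disclosure:

```python
# Illustrative model of one DTT line: each "way" records a pointer offset
# observed inside the structure, a hit counter (usage rate), and a link to
# the DTT line of the structure type that pointer points to.

N_WAYS = 3  # assumed maximum number of tracked pointers per structure type

class DTTLine:
    def __init__(self, line_id):
        self.line_id = line_id
        self.ways = []  # each way: {'offset': int, 'hits': int, 'link': id or None}

    def record_offset(self, offset):
        for way in self.ways:
            if way['offset'] == offset:
                way['hits'] += 1          # known pointer offset: bump usage rate
                return
        if len(self.ways) < N_WAYS:       # new pointer offset: allocate a way
            self.ways.append({'offset': offset, 'hits': 1, 'link': None})

line = DTTLine(0)
line.record_offset(8)
line.record_offset(8)
line.record_offset(16)
print([(w['offset'], w['hits']) for w in line.ways])  # [(8, 2), (16, 1)]
```

A self-pointing structure (e.g. a linked list) would simply have its way's `link` field set to its own `line_id`.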
According to some embodiments of the present disclosure, the L2 layer 303 also includes a data structure, implemented as a PC hash table 305, which is a table used to map PCs of pointer loads to the DTT 306 lines representing the data structure type the pointer load points to, by pointing to their index in the DTT 306 (i.e. to their line in the DTT). According to some embodiments of the present disclosure, another data structure implemented as a table stored at the L2 layer 303 is a linkQ table 307, which is a table of dispatched prefetches. In the linkQ table 307, the address or index contained in a pointer load is compared to the prefetch addresses in the linkQ table to detect a prefetch with the same address as the pointer load, and thereby detect the structure type that each prefetch points to. The linkQ table 307 is used to track recent prefetch addresses and connect chains of dereferences in order to populate the link information for each offset in the DTT 306. The linkQ table 307 is also used to track the usefulness of sent prefetches, judging by the number of demands of pointer loads executed by the program hitting the addresses of the prefetches (i.e. pointer loads with the same address as a prefetch). Each line in the linkQ table 307 is allocated by sending a pointer chasing prefetch. The linkQ table 307 contains the prefetch address, and the line and column in the DTT 306 of the pointer that triggered the prefetch stored in that linkQ line. The line is represented by the DTT 306 line index (id) and the column is the index of the column storing the link value for the pointer offset that was used to generate the prefetch. When a demand (i.e. a pointer load executed by the program, which returns back from the memory) hits one of the addresses in the linkQ table, the type (DTT id) of that demand (according to its PC) is connected as the link in the line and column of the source pointer in the DTT 306.
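As an illustrative sketch (names and data shapes are assumptions), the linkQ mechanism that connects a returning demand to the pointer that generated the matching prefetch can be modelled as:

```python
# Sketch of the linkQ table: each sent prefetch records its address together
# with the (DTT line, way) of the pointer that triggered it. When a demand
# pointer load later hits that address, the demand's own DTT type is written
# as the link of the triggering pointer, populating the DTT link column.

class LinkQ:
    def __init__(self):
        self.entries = {}  # prefetch address -> (dtt_line_id, way_index)

    def on_prefetch(self, address, dtt_id, way):
        self.entries[address] = (dtt_id, way)

    def on_demand(self, address, demand_dtt_id, dtt):
        """A pointer load executed by the program hits a prefetch address:
        record the demand's DTT type as the link of the source pointer."""
        if address in self.entries:
            src_id, way = self.entries.pop(address)
            dtt[src_id][way] = demand_dtt_id   # populate the link column
            return True
        return False

dtt = {0: [None], 1: [None]}   # DTT line -> link column per way (links only)
lq = LinkQ()
lq.on_prefetch(0x4000, dtt_id=0, way=0)   # prefetch from DTT line 0, way 0
lq.on_demand(0x4000, demand_dtt_id=1, dtt=dtt)  # demand of type 1 hits it
print(dtt)  # {0: [1], 1: [None]}: line 0's pointer now links to type 1
```

A demand address with no matching linkQ entry returns `False`, which is also the signal used to judge prefetch usefulness.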
According to some embodiments of the present disclosure, another component in apparatus 300 is a counter array (not shown), which tracks the number of pointer load demands from the memory (i.e. pointer loads executed by the program) hitting in the linkQ table (i.e. matching the address of a prefetch which was dispatched by the L2 layer 303) on each set of triggered prefetches issued by one pointer returning. This indicates both the usefulness of the prefetcher (low values show no useful prefetches) and the ratio between prefetches brought and used.
FIG. 4 schematically shows an example of the tables stored in the L2 layer 303, according to some embodiments of the present disclosure. Table 401 is an example of a DTT table. The first column represents the ID of the data structure type of the pointer observed. Then, there are three repeating columns of offset, usage rate (number of hits) and link, where the link column represents the data structure type the pointer points to. Each triplet of offset, usage rate and link columns represents a specific pointer within the data structure type represented by the DTT line, and is marked as a way. Table 402 is an example of a PC hash table. In some embodiments of the present disclosure the PC hash table 402 contains two columns. The first column is a tag column, which holds the PC identified, for example PC1, PC2, PC3 and so on, or the address of the PC identified, for example 0x120, 0x130, 0x140 and so on. The second column is an index column, which links to the DTT line that represents the type of data structure that is accessed by the load instruction residing at the PC. For example, in table 402 there is a PC at address 0x120 denoted with the index 0. The index 0 represents line 0 in table 401 of the DTT, as the ID of a data structure type. The two next PCs are at addresses 0x130 and 0x140, and both PCs are denoted with the same index of 1. Therefore, both PCs link to the same line in table 401 of the DTT, the line of ID 1, which means both loads at these PC addresses access the same type of structure. PC1 at address 0x130 is entered as way 0 and PC2 at address 0x140 is entered as way 1.
Table 403 is an example of a linkQ table. It is a table of prefetches generated and dispatched by the L2 layer 303 before being executed by the program in the OEX 301. The first column in the linkQ table 403 is the address of the prefetches. Each time a pointer load is executed by the program, it is checked in the linkQ table whether there is a prefetch with the same address as the address of the executed pointer load. When such a match is detected, the DTT type of the hitting pointer load PC is read from the PC hash table, and the DTT set and way of the matched prefetch are read from the second and third columns of the linkQ table. Then, the DTT type of the matched pointer load is recorded in the link column of the DTT table, at the DTT line and way of the prefetch taken from the linkQ, as the pointer load's type of data structure is now known to be the type that the recorded pointer at that DTT line and way points to. According to some embodiments of the present disclosure, the linkQ table indicates when an execution of a program is using a width mode. A width mode is a program behavior where, upon the return of the data of a pointer load, more than one address or index in the pointed data structure of the program is used. Alternatively, the linkQ table indicates when the execution of the program is using a depth mode, a behavior where only one address or index in the pointed data structure is used per every pointer load accessing a data structure. When there is more than a single pointer leading from the current structure, a decision on how to proceed is taken by the L2 layer 303 with the prefetcher unit. When it is assumed that the program will visit multiple nodes linked from the current one, or that it is not known which link is the most likely to be used and there is enough bandwidth to explore all the nodes, a width mode is chosen.
When it is assumed that a specific path is traversed, the L2 layer 303 with the pointer chasing prefetch method may try to predict the specific path based on the history of the traversal and past visits in the current node. This mode may still prefetch out-of-path nodes if it is unsure of the correct path, but most of the steps focus on a single path. According to some embodiments of the present disclosure, the L2 layer 303 further includes an offset history table (OHT) 308, which is used to predict depth traversal paths. The OHT records which pointer is used for any given history of pointers leading up to the previously used pointer. According to some embodiments of the present disclosure, the linkQ table indicates whether to dispatch one prefetch from the cache layer or more than one prefetch, according to the indicated width or depth mode. According to some embodiments of the present disclosure, in case of a depth mode, the L2 layer 303 chooses which pointer to prefetch, predicting the path of pointer loads executed by the program using the OHT and the accumulated history of recent pointer loads and the DTT types and pointer offsets they used.
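As a hedged illustration of depth-mode prediction with an offset history table, the sketch below uses the recent sequence of pointer offsets as a key and predicts the offset that historically followed that sequence; the history length and table shape are assumptions made for illustration:

```python
# Illustrative OHT sketch: train on the stream of used pointer offsets,
# keyed by the last few offsets, then predict the next offset to prefetch.

from collections import deque

class OffsetHistoryTable:
    def __init__(self, history_len=2):
        self.history = deque(maxlen=history_len)
        self.table = {}  # tuple of recent offsets -> offset that followed

    def train(self, next_offset):
        if len(self.history) == self.history.maxlen:
            self.table[tuple(self.history)] = next_offset
        self.history.append(next_offset)

    def predict(self):
        """Offset to prefetch next given the current history, or None."""
        return self.table.get(tuple(self.history))

oht = OffsetHistoryTable()
for off in [8, 16, 8, 16, 8]:   # alternating left/right-style traversal
    oht.train(off)
print(oht.predict())  # 16: after the history (16, 8), offset 16 followed
```

A real implementation would track confidence per entry and fall back to width mode (or no prefetch) when the history has no reliable match.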
According to some embodiments of the present disclosure, the prefetches are recursively dispatched from the L2 layer 303, after data of a pointer load or a previous prefetch returns from the memory, based on the element (pointer) offsets of each data structure type, which the pointer load is mapped to, before the pointer load is executed by the processor.
Reference is now made to FIG. 5, which schematically discloses a method for performing enhanced pointer chasing prefetch, according to some embodiments of the present disclosure. At 501, a pointer load is detected by the execution layer (OEX) 301 while executing a program. The pointer load is detected by searching for return values (from the memory) used to dereference further loads. At 502, the pointer load is mapped to: (a) the data structure type, (b) one or more offsets of the one or more pointers in the data structure, and (c) the data structure types the one or more pointers point to. The mapping is done by analyzing the pointer load when executed by the processor, for dispatching a prefetch. According to some embodiments of the present disclosure, the analysis of the pointer load is done in a cache layer, for example the L2 layer 303, which includes a prefetcher unit and which dispatches prefetches to the memory. The analysis includes the creation, population and analysis of tables, based on the information captured from the pointer load when dispatched from the execution and/or out-of-order layer. At 503, the prefetcher unit at the cache layer L2 303 recursively dispatches prefetches. The prefetches are dispatched after data of a pointer load or a previous prefetch returns from the memory, based on the detected element offsets stored in the DTT for the data structure type which the pointer load is mapped to in the PC hash. These prefetches are dispatched before the data is returned to the OEX 301 and before any dependent pointer load is executed by the processor in the OEX 301.
According to some embodiments of the present disclosure, once a node (data structure) type is recorded and its offsets are known, each time the program visits that node type (recognized by its load PCs), prefetches are triggered at the moment the node data is returned from memory. Once the pointer (or index) load is sent to fetch data from memory, the prefetcher unit in the L2 layer 303 waits for the data to return from memory, and looks up the data type table (DTT) to retrieve the structure layout information matching the pointer of the current load. If the load dereferences a pointer of type X* (pointing at a structure of type X), its PC will point to a line in the DTT holding all pointer offsets in structure X. All other loads of pointers of type X* will also be mapped to the same line in the DTT. When the data arrives, the values at the offsets the DTT line specifies are extracted (all relative to the load address, and depending on pointer sizes for the current run mode). These will be the pointers within the structure leading to other structures (possibly of other types). Then, the values in the pointers are translated from virtual address (VA) to physical address (PA) using a built-in translation lookaside buffer (TLB) (a subset of the actual data TLB), and a prefetch is issued to each resulting address.
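The trigger step just described can be sketched in software as follows; memory is modelled as a dict of word values and the VA-to-PA translation is reduced to an identity function, both assumptions made purely for illustration:

```python
# Sketch of the trigger step: when data for a structure of a known DTT type
# returns, the values at the recorded pointer offsets are extracted and the
# resulting (translated) addresses are dispatched as prefetches.

def on_data_return(base_addr, memory, dtt_offsets, translate=lambda va: va):
    """Extract the pointer value at base_addr + offset for every recorded
    offset and return the prefetch addresses that would be dispatched."""
    prefetches = []
    for offset in dtt_offsets:
        pointer_value = memory.get(base_addr + offset)  # value inside the node
        if pointer_value:                               # skip null pointers
            prefetches.append(translate(pointer_value))
    return prefetches

# A structure at 0x1000 holding two pointers, at offsets 8 and 16.
memory = {0x1008: 0x2000, 0x1010: 0x3000}
print([hex(a) for a in on_data_return(0x1000, memory, [8, 16])])
# ['0x2000', '0x3000']
```

In hardware the `translate` step would consult the built-in TLB subset, and each returned prefetch would itself recurse through the same routine when its data arrives.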
Reference is now made to FIG. 6, which schematically shows a flow of the enhanced pointer chasing prefetch, according to some embodiments of the present disclosure. The flow is divided into a train part and a trigger part. The train part is the part where pointer loads are executed by the program, and the trigger part is the part where prefetches are triggered and dispatched to the memory. At a first step 601, the pointer loads are detected in the OEX layer 301. The detection begins when a load returns data to the execution layer and writes back to one of the source registers for another load (with or without an offset). In a generic case, a load has the following form:
PC1: LDR [x1+offset1] -> x2
PC2: LDR [x2+offset2] -> x3
In specific cases, this becomes simpler: in a recursive structure a single load may be called repeatedly to do the traversal (so PC1 = PC2). If there is a single pointer per structure, offset1 will be the same as offset2.
For example, a simple linked list will have the form:
LDR [x1+offset1] -> x1
Multiple links are also allowed, as a single pointer load may feed multiple following loads:

PC1: LDR [x1+offset1] -> x2

PC2: LDR [x2+offset2] -> x3

PC3: LDR [x2+offset3] -> x4
In some embodiments of the present disclosure, the OEX layer 301 also supports indirect pointer dereference, for example, a pointer with additional operations in the middle, such as:
PC1: LDR [x1+offset1] -> x2
PC2: (intermediate operation using x2, producing x3)
PC3: LDR [x3+offset2] -> x4
Finally, the OEX layer 301 is also responsible for annotating the loads with a traversal ID. This makes the ID unique per each load chain, so that the chain may be individually tracked.
At step 602, it is checked by the L2 layer 303 whether PC1 exists in the PC hash table. When PC1 is not in the PC hash table, then at step 603, a new line is allocated in the DTT table by the L2 layer 303 to insert the pointer load of PC1 into the DTT table. The offset of the pointer load is inserted and the rate of usage is updated to 1; the link column is not updated at this stage, as it is not yet known what structure type the pointer points to. At step 604, the PC hash table is updated to include PC1. Alternatively, when PC1 already exists in the PC hash table, then at step 605, the DTT table is checked by the L2 layer 303 to find the DTT line of PC1. At step 606, it is checked whether the offset in the DTT line is a new offset or the same as the offset of PC1. In case the offset is not new and is the same as the offset in the DTT table line, at step 607 the counter is raised and the rate of usage is updated. In case the offset is new, then at step 608 the new offset is added to the DTT table in the same line but as a new way (which includes the three columns of offset, usage rate and link). At step 609, the DTT ID is marked on the memory request dispatched by the load to the memory, and at 610, the L2 layer 303 waits for the memory request data to return from the memory with the data dereferenced, in order to analyze the data. At 611, after the prefetch returns from the memory, the depth of the recursion is checked to ensure the predefined threshold of the recursion is not crossed. When the recursion depth is not crossed, at step 612, it is checked in the DTT table what the type of the pointer dispatched to the memory as a prefetch is. At step 613, it is decided by the L2 layer 303 what mode to use, a width mode or a depth mode, to further dispatch the one or more pointers in the returned prefetch.
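The train-side steps 602 to 608 above can be modelled in software as the following illustrative sketch (table shapes and field names are assumptions, not part of the disclosure):

```python
# Sketch of DTT training: look up the load PC in the PC hash; allocate a new
# DTT line for an unseen PC (steps 603-604), otherwise update the matching
# line, bumping the hit counter for a known offset (step 607) or opening a
# new way for a new offset (step 608).

def train(pc, offset, pc_hash, dtt):
    if pc not in pc_hash:                       # step 602 -> steps 603, 604
        line_id = len(dtt)
        dtt[line_id] = [{'offset': offset, 'hits': 1, 'link': None}]
        pc_hash[pc] = line_id
        return line_id
    line_id = pc_hash[pc]                       # step 605
    for way in dtt[line_id]:                    # step 606
        if way['offset'] == offset:
            way['hits'] += 1                    # step 607: known offset
            return line_id
    dtt[line_id].append({'offset': offset, 'hits': 1, 'link': None})  # 608
    return line_id

pc_hash, dtt = {}, {}
train(0x130, 8, pc_hash, dtt)    # new PC: allocates DTT line 0
train(0x130, 8, pc_hash, dtt)    # same offset: hit counter -> 2
train(0x130, 16, pc_hash, dtt)   # new offset: new way on line 0
print([(w['offset'], w['hits']) for w in dtt[0]])  # [(8, 2), (16, 1)]
```

Steps 609 onward (marking the DTT ID on the memory request and reacting when the data returns) belong to the trigger side modelled earlier.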
In case of a width mode, at step 614, prefetches are dispatched for the pointers at all the known pointer offsets in the returned data structure which have a hit counter above a predefined threshold. In case of a depth mode, at step 615, an offset history register (OHR) is read to obtain the history of the offset, at step 616 the OHT table is checked to decide which specific one or more offsets to dispatch as a prefetch, and at step 617 the prefetch is dispatched. Once the width and/or depth mode has been executed, the addresses of the dispatched prefetches (a specific one or more prefetches in a depth mode, or all the prefetches in a width mode), which may be referred to as predicted addresses, are inserted into the linkQ table in step 618. For future pointer loads in the train part, at step 619, the L2 layer 303 checks whether their address matches one of the predicted addresses added in step 618 to the linkQ table. When the address is in the linkQ table, it means that one of the prefetches predicted the address of the pointer load. This means the offset of the prefetch is a useful offset, and it makes it possible to find out the type of the structure that the pointer load points to. Therefore, at step 620, the link column in the DTT line of the checked pointer load is populated with the type of structure the pointer load points to. According to some embodiments of the present disclosure, step 621 of detecting a link conflict and step 622 of triggering a merge refer to a case where two different lines in the DTT table represent the same type of structure and should therefore be merged. When a pointer in the DTT that is marked as linking to a specific DTT line, and a pointer load which the PC hash table maps to another DTT line, are using a same address, then the two or more lines in the DTT are merged into one line, which represents the type of data structure that the pointer in the DTT and the pointer load point to.
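As an illustrative sketch of the width-mode decision at step 614, the filter below prefetches only the recorded offsets whose hit counters exceed a confidence threshold; the threshold value is an assumption made for illustration:

```python
# Sketch of width-mode offset selection: every way of the returned
# structure's DTT line whose hit counter is above the threshold is
# considered confident enough to be dispatched as a prefetch.

HIT_THRESHOLD = 2  # assumed confidence threshold

def width_mode_offsets(dtt_line):
    """Return the pointer offsets worth prefetching from a returned node."""
    return [w['offset'] for w in dtt_line if w['hits'] > HIT_THRESHOLD]

line = [{'offset': 8, 'hits': 5},
        {'offset': 16, 'hits': 1},   # rarely used: filtered out
        {'offset': 24, 'hits': 3}]
print(width_mode_offsets(line))  # [8, 24]
```

In depth mode, the same line would instead be narrowed to the single offset predicted by the OHT for the current offset history.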
FIG. 7 schematically shows an example of two lines in a DTT table which represent the same type of structure and point to each other, and the merge of the two lines, according to some embodiments of the present disclosure. In this example, code 701 is a part of an executed program, which describes a binary tree. In the code, two PCs which contain loads are detected: a = a->left, denoted as PC1, and a = a->right, denoted as PC2. left represents the left pointer of the binary tree and right represents the right pointer of the binary tree. The pointer 'a' has the same type, but the L2 layer 303 cannot know that, since the accesses to the pointers are from two different PCs. Therefore, the two PCs are referred to as containing loads with two different types, and two different lines are inserted into the DTT table 703. However, as the enhanced pointer chasing prefetching progresses, addresses in the linkQ table are hit and link the two structures of PC1 and PC2 of the two lines in the DTT table, because the result of dereferencing ->left could be used to dereference ->right and vice versa. Once all the step permutations are encountered: left->left, left->right, right->left and right->right, both structures in the DTT table try to self-link and cross-link, which is impossible (since there is only a single link ID per structure). An attempt to override link IDs therefore actually indicates that the two linked destinations are in fact the same type. This triggers a merge. Back to FIG. 7, the dynamic program flow 704 would link 'left', which is the left pointer (DTT[0].way[0], i.e. the DTT line with ID=0 and way 0), as leading to the path using the right pointer (i.e. points to the structure type of DTT[1], the DTT line with ID=1). In addition, 'right', which is the right pointer (DTT[1].way[0], i.e. the DTT line with ID=1 and way 0), leads to the path going left (i.e. points to the structure type of DTT[0], the DTT line with ID=0). Both DTT lines link to each other, representing the concept that the path is always switching directions (which would be the case for two real different structures). However, once the program performs a left->left step, the enhanced pointer chasing prefetch method tries to link DTT[0].way[0] back to DTT[0], as can be seen in DTT table 706, which results in a conflict. The conclusion is that DTT[0] equals DTT[1], so the two lines are merged, such that the line of DTT[1] is inserted into the line of DTT[0] as way 1, as seen in DTT table 707. From this moment on, each pointer returning data for DTT[0] can prefetch both left and right pointers (or choose from among them based on the prediction scheme in case of a depth mode). The merge process extracts the offsets from both lines and matches them. Cross-links between the two structures become self-links, and self-links remain. When there is any mismatch in the offset match (i.e. the merge results in ways with the same offsets but different links), the line is cleared. Following the merge, the cleared DTT line is retained for a predefined time in a special state, in order to correct the pointers to it in the PC hash table to point at the new merged line.
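The merge step for the binary-tree example above can be sketched as follows; the table representation and function names are assumptions made for illustration, not part of the disclosure:

```python
# Hedged sketch of a DTT line merge: the ways of the two lines are combined
# into one line, cross-links between the two lines become self-links, and a
# way whose offset collides with a different link is dropped (the "mismatch"
# case described in the text).

def merge_dtt_lines(dtt, keep_id, drop_id):
    merged = {}
    for way in dtt[keep_id] + dtt[drop_id]:
        # Any link to either of the merged lines becomes a self-link.
        link = keep_id if way['link'] in (keep_id, drop_id) else way['link']
        off = way['offset']
        if off in merged and merged[off]['link'] != link:
            del merged[off]            # offset collision with different links
        elif off not in merged:
            merged[off] = {'offset': off, 'link': link}
    dtt[keep_id] = list(merged.values())
    del dtt[drop_id]

# Binary-tree example: line 0 (left pointer at offset 8) and line 1 (right
# pointer at offset 16) cross-link to each other; after the merge both ways
# live on line 0 and self-link to it.
dtt = {0: [{'offset': 8, 'link': 1}], 1: [{'offset': 16, 'link': 0}]}
merge_dtt_lines(dtt, keep_id=0, drop_id=1)
print(dtt)  # {0: [{'offset': 8, 'link': 0}, {'offset': 16, 'link': 0}]}
```

After such a merge, a single returning node of the tree type can trigger prefetches for both its left and right children.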
The descriptions of the various embodiments have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant methods and apparatuses for performing enhanced pointer chasing prefetcher will be developed, and the scope of the term "method and apparatus for performing enhanced pointer chasing prefetcher" is intended to include all such new technologies a priori.
As used herein the term “about” refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method. As used herein, the singular forms "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. Any particular embodiment may include a plurality of “optional” features unless such features conflict.
Throughout this application, various embodiments may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of embodiments. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals there between.
It is appreciated that certain features of embodiments, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of embodiments, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable sub-combination or as suitable in any other described embodiment. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements. Although embodiments have been described in conjunction with specific embodiments thereof, it is evident that many alternatives, modifications and variations will be apparent to those skilled in the art. Accordingly, it is intended to embrace all such alternatives, modifications and variations that fall within the spirit and broad scope of the appended claims. It is the intent of the applicant(s) that all publications, patents and patent applications referred to in this specification are to be incorporated in their entirety by reference into the specification, as if each individual publication, patent or patent application was specifically and individually noted when referenced that it is to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to embodiments. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. A computing apparatus, configured to: detect, in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers, each of the pointers being an address or index used for another load instruction; map the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch which is a memory access originated in a prefetcher unit at a cache layer and is dispatched from the cache layer to the memory; and recursively dispatch prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
2. The computing apparatus of claim 1, wherein analyzing the pointer load is done according to the following data structures: a data type table, DTT, comprising one or more lines, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines, representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetch addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
3. The computing apparatus of claim 2, wherein each set of three DTT columns, comprising an offset, a usage rate and a link to the DTT line representing the data structure type that the pointer points to, represents a specific pointer within the data structure type represented by the DTT line.
4. The computing apparatus of claim 3, wherein when a pointer in the DTT that is marked as linking to a specific DTT line, and a pointer load which the PC hash table maps to another DTT line, are using a same address, then the two or more lines in the DTT are merged into one line, which represents the type of data structure the pointer in the DTT and the pointer load point to.
5. The computing apparatus of claim 2, wherein each line in the linkQ table contains: a column of a prefetch address; and the line and columns in the DTT of the pointer that triggered the prefetch stored in that linkQ line; and wherein every address or index contained in a pointer load that matches an address stored in the linkQ table writes the DTT line representing the structure type that the pointer load accesses into the DTT, populating the link value at the line and columns in the DTT representing the pointer that prefetched the address that was matched in the linkQ.
6. The computing apparatus of claim 2, wherein the linkQ table indicates when an execution of a program is using a width mode where, upon the returned data of a pointer load, more than one addresses or indexes in the pointed data structure of the program are used, or a depth mode where only one address or index in the pointed data structure is used per every pointer load accessing a data structure.
7. The computing apparatus of claim 6, wherein the linkQ table indicates when to dispatch one prefetch from the cache layer or more than one prefetch, according to the indicated width or depth mode.
8. The computing apparatus of claim 6, further adapted to choose which pointer to prefetch in case of a depth mode, predicting the path of pointer loads executed by the program using a table that records which pointer was previously used per any given history of pointers leading to the pointer that was previously used.
9. The computing apparatus of claim 3, wherein the data structure is one of the following: a linked list, a tree, a graph, or a combination thereof.
10. The computing apparatus of claim 1, wherein the recursively dispatched prefetches are dispatched up to a predefined number of times.
11. The computing apparatus of claim 1, wherein the cache layer is a second cache layer L2 or a third cache layer L3.
12. A method for reducing memory access latency, comprising: detecting in an execution layer, a pointer load which is a load instruction of one or more elements in a data structure wherein returned data of the load instruction from a memory to a processor contains one or more pointers which are addresses or indexes used for another load instruction; mapping the pointer load to a data structure type, one or more offsets of the one or more pointers in the data structure, and the data structure types the one or more pointers point to, by analyzing the pointer load when executed by the processor, for dispatching a prefetch, which is a memory access originated in a prefetcher unit at a cache layer and dispatched from the cache layer to the memory; and recursively dispatching prefetches after data of a pointer load or a previous prefetch returns from the memory, based on the element offsets of each data structure type which the pointer load is mapped to, before the pointer load is executed by the processor.
13. The method of claim 12, wherein analyzing the pointer load is done according to the following data structures: a data type table, DTT, where each line represents one data structure type in a program executed by the processor; a program counter, PC, hash table which maps PCs of the pointer load with the DTT lines representing the data structure type the pointer load points to; and a linkQ table of dispatched prefetches wherein the pointer load compares the address or index contained in the pointer load to the prefetch addresses in the linkQ table to detect a prefetch with the same address and detect accordingly the structure type that each prefetch points to.
PCT/EP2021/053637 2021-02-15 2021-02-15 An apparatus and method for performing enhanced pointer chasing prefetcher WO2022171309A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2021/053637 WO2022171309A1 (en) 2021-02-15 2021-02-15 An apparatus and method for performing enhanced pointer chasing prefetcher
EP21705943.5A EP4248321A1 (en) 2021-02-15 2021-02-15 An apparatus and method for performing enhanced pointer chasing prefetcher

Publications (1)

Publication Number Publication Date
WO2022171309A1 true WO2022171309A1 (en) 2022-08-18

Family

ID=74661392



Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9886385B1 (en) * 2016-08-25 2018-02-06 Apple Inc. Content-directed prefetch circuit with quality filtering


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
AL-SUKHNI H ET AL: "Compiler-directed content-aware prefetching for dynamic data structures", PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES, 2003. PACT 2003. PROCEEDINGS. 12TH INTERNATIONAL CONFERENCE ON 27 SEPT. - 1 OCT. 2003, PISCATAWAY, NJ, USA, IEEE, 27 September 2003 (2003-09-27), pages 91 - 100, XP010662177, ISBN: 978-0-7695-2021-6, DOI: 10.1109/PACT.2003.1238005 *
CHI-KEUNG LUK ET AL: "Compiler-based prefetching for recursive data structures", 7TH. INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATION SYSTEMS. CAMBRIDGE, MA., OCT. 1 - 5, 1996; [INTERNATIONAL CONFERENCE ON ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATION SYSTEMS (ASPLOS)], NE, 1 September 1996 (1996-09-01), pages 222 - 233, XP058389038, ISBN: 978-0-89791-767-4, DOI: 10.1145/237090.237190 *
SEUNGRYUL CHOI ET AL: "A general framework for prefetch scheduling in linked data structures and its application to multi-chain prefetching", ACM TRANSACTIONS ON COMPUTER SYSTEMS (TOCS), ASSOCIATION FOR COMPUTING MACHINERY, INC, US, vol. 22, no. 2, 1 May 2004 (2004-05-01), pages 214 - 280, XP058194218, ISSN: 0734-2071, DOI: 10.1145/986533.986536 *

Also Published As

Publication number Publication date
EP4248321A1 (en) 2023-09-27


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21705943; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021705943; Country of ref document: EP; Effective date: 20230623)
NENP Non-entry into the national phase (Ref country code: DE)