WO2002027481A2 - System and method for pre-fetching for pointer linked data structures - Google Patents
- Publication number
- WO2002027481A2 (PCT/US2001/030225)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- cache
- data
- prefetch
- processor
- program
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3824—Operand accessing
- G06F9/383—Operand prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30098—Register arrangements
- G06F9/30101—Special purpose registers
Definitions
- Use of a cache can also reduce the memory latency period during write operations by writing to the cache. This reduces memory latency in two ways. First, it enables the processor to write at the much greater speed of the cache, and second, storing or loading the data into the cache enables it to be obtained directly from the cache should the processor need the data again in the near future.
- the cache is divided logically into two main components or functional units.
- a data-store where the cached information is actually stored, and a tag-field, a small area of memory used by the cache to keep track of the location in the memory where the associated data can be found.
- the data-store is structured or organized as a number of cache-lines each having a tag-field associated therewith, and each capable of storing multiple blocks of data.
- each cache-line stores 32 or 64 blocks or bytes of data.
- the tag-field for each cache-line includes an index that uniquely identifies each cache-line in the cache, and a tag that is used in combination with the index to identify an address in lower-level memory from which data stored in the cache-line has been read or to which it has been written.
- the tag-field for each cache-line also includes one or more bits, commonly known as a validity-bit, to indicate whether the cache-line contains valid data.
- the tag-field may contain other bits, for example, for indicating whether data at the location is dirty, that is, has been modified but not written back to lower-level memory.
- To speed up memory access operations, caches rely on principles of temporal and spatial locality. These principles of locality are based on the assumption that, in general, a computer program accesses only a relatively small portion of the information available in computer memory in a given period of time. In particular, temporal locality holds that if some information is accessed once, it is likely to be accessed again soon, and spatial locality holds that if one memory location is accessed then other nearby memory locations are also likely to be accessed. Thus, in order to exploit temporal locality, caches temporarily store information from a lower-level memory the first time it is accessed so that if it is accessed again soon it need not be retrieved from the lower-level memory.
- caches transfer several blocks of data from contiguous addresses in lower-level memory, besides the requested block of data, each time data is written to the cache from lower-level memory.
- the most important characteristic of a cache is its hit rate, that is the fraction of all memory accesses that are satisfied from the cache over a given period of time. This in turn depends in large part on how the cache is mapped to addresses in the lower-level memory.
- the choice of mapping technique is so critical to the design of the cache that the cache is often named after this choice. There are generally three different ways to map the cache to the addresses in memory: direct mapping, fully-associative and set-associative.
- Direct-mapping is the simplest way to map a cache to addresses in main-memory.
- the number of cache-lines is determined, the addresses in memory divided into the same number of groups of addresses, and addresses in each group associated with one cache-line. For example, for a cache having 2^n cache-lines, the addresses in memory are divided into 2^n groups and each address in a group is mapped to a single cache-line.
- the lowest n address bits of an address correspond to the index of the cache-line to which data from the address can be stored.
- the remaining top address bits are stored as a tag that identifies from which of the several possible addresses in the group the data in the cache-line originated.
- each cache-line is shared by a group of 4,096 addresses in main-memory.
- To address 64 MB of memory requires 26 address bits since 64 MB is 2^26 bytes.
- the lowest five of these address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 blocks of data in the cache-line to access.
- the next 14 address bits, A5 to A18, provide the index of the cache-line to which the address is mapped.
- any cache-line can hold data from any one of 4,096 possible addresses in main-memory
- the next seven highest address bits, A19 to A25, are used as a tag to identify to the processor which of the addresses the cache-line holds data from.
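The bit-field split described above can be sketched in a few lines. This is a minimal illustration only, assuming the 26-bit address, 32-byte cache-line and 14-bit index of the example; the function name `split_direct_mapped` and the constant names are hypothetical and do not come from the patent.

```python
# Field widths taken from the worked example in the text.
OFFSET_BITS = 5   # A0-A4: selects one of 32 bytes within the cache-line
INDEX_BITS = 14   # A5-A18: selects one of 16,384 cache-lines
TAG_BITS = 7      # A19-A25: identifies which address in the group is held


def split_direct_mapped(address):
    """Split a 26-bit address into (tag, index, offset). Hypothetical helper."""
    if not 0 <= address < 1 << (OFFSET_BITS + INDEX_BITS + TAG_BITS):
        raise ValueError("address does not fit in 26 bits")
    offset = address & ((1 << OFFSET_BITS) - 1)
    index = (address >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Two addresses that differ only in their tag bits return the same index, which is why they compete for the same cache-line.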
- This scheme, while simple, can result in a cache-conflict or thrashing in which a sequence of accesses to memory repeatedly overwrites the same cache entry, resulting in a cache-miss on every access. This can happen, for example, if two blocks of data, which are mapped to the same set of cache locations, are needed simultaneously.
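The thrashing behaviour can be reproduced with a small simulation. The sketch below is an assumption-laden toy model, not the patent's hardware: it tracks only which tag each cache-line currently holds and counts misses, using the 14 index bits and 5 offset bits of the direct-mapped example. The name `count_misses` is hypothetical.

```python
def count_misses(addresses, index_bits=14, offset_bits=5):
    """Count misses for an access sequence in a toy direct-mapped cache."""
    lines = {}  # cache-line index -> tag currently stored there
    misses = 0
    for address in addresses:
        index = (address >> offset_bits) & ((1 << index_bits) - 1)
        tag = address >> (offset_bits + index_bits)
        if lines.get(index) != tag:
            misses += 1          # wrong tag or empty line: fetch and overwrite
            lines[index] = tag
    return misses


a = 0x40              # maps to cache-line index 2
b = a + (1 << 19)     # same index, different tag: conflicts with a
thrash = count_misses([a, b] * 4)        # alternating accesses evict each other
no_conflict = count_misses([a, a + 32] * 4)  # neighbouring lines coexist
```

In this model the alternating sequence misses on every access, while the sequence touching two different cache-lines misses only on the first access to each.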
- a fully-associative mapped cache avoids the cache-conflict of the directly mapped cache by allowing blocks of data from any address in main-memory to be stored anywhere in the cache.
- one problem with fully associative caches is that the whole main-memory address must be used as a tag, thereby increasing the size of the tag-field and
- a set-associative cache is a compromise between the direct mapped and fully associative designs. In this design, the cache is broken into sets each having a number, n, of cache-lines, and each address in main-memory is assigned to a set and can be stored in any one of the cache-lines within the set.
- a cache is referred to as an n-way set associative cache where n is the number of cache-lines in each set.
- Memory addresses are mapped to the set-associative cache in a manner similar to the directly-mapped cache. For example, to map a 64-MB main-memory having 26 address bits to a 512-KB 4-way set associative cache, the cache is divided into 4,096 sets of 4 cache-lines each and 16,384 addresses in main-memory are associated with each set. Address bits A5 to A16 of a memory address represent the index of the set to which the address maps. The memory address can be mapped to any of the four cache-lines in the set.
- any cache-line within a set can hold data from any one of 16,384 possible memory addresses
- the next nine highest address bits, A17 to A25, are used as a tag to identify to the processor which of the memory addresses the cache-line holds data from.
- the lowest five address bits, A0 to A4, are ignored in the mapping process, although the processor will use them later to determine which of the 32 bytes of data in the cache-line to access.
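The figures quoted in both mapping examples follow from the memory size, cache size, cache-line size and associativity. The sketch below recomputes them as a consistency check, assuming all sizes are powers of two; the function name `cache_geometry` and the dictionary keys are hypothetical.

```python
def cache_geometry(memory_bytes, cache_bytes, line_bytes, ways):
    """Derive set count and address field widths (all sizes powers of two)."""
    sets = cache_bytes // (line_bytes * ways)
    address_bits = memory_bytes.bit_length() - 1   # log2 of the memory size
    offset_bits = line_bytes.bit_length() - 1      # log2 of the line size
    index_bits = sets.bit_length() - 1             # log2 of the set count
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": address_bits - index_bits - offset_bits,
        "addresses_per_set": memory_bytes // sets,
    }


MB, KB = 2**20, 2**10
four_way = cache_geometry(64 * MB, 512 * KB, 32, ways=4)
direct = cache_geometry(64 * MB, 512 * KB, 32, ways=1)
```

With `ways=1` the same function describes the direct-mapped case, where each set is a single cache-line.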
- a cache-line is selected to be written-back or flushed to main-memory or to a lower-level victim cache.
- the new data is then stored in place of the flushed cache-line.
- the cache-line to be flushed is chosen based on a replacement policy implemented via a replacement algorithm.
- LRU Least Recently Used
- a Cache-controller maintains in a register several status bits that keep track of the number of times in which the cache-line was last accessed. Each time one of the cache-lines is accessed, it is marked most recently used and the others are adjusted accordingly. A cache-line is selected to be flushed if it has been accessed (read or written to) less recently than any
- the LRU replacement policy is based on the assumption that, in
- the cache-line which has not been accessed for longest time is least likely to be accessed in the near future.
- other replacement schemes include random replacement, an algorithm that picks any cache-line with equal probability, and First-In-First-Out (FIFO), an algorithm that simply replaces the first cache-line loaded in a particular set or group of cache-lines.
- FIFO First-In-First-Out
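The LRU policy described above can be modelled for a single set as follows. This is a sketch under stated assumptions: a real cache-controller keeps a few status bits per cache-line rather than an ordered dictionary, and the class name `LRUSet` and its `access` method are hypothetical.

```python
from collections import OrderedDict


class LRUSet:
    """Toy model of one cache set with least-recently-used replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Access a tag; return (hit, evicted_tag)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)   # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            # Flush the line that has gone longest without being accessed.
            evicted, _ = self.lines.popitem(last=False)
        self.lines[tag] = None
        return False, evicted
```

A FIFO variant would differ only in omitting the `move_to_end` call on a hit, so that load order alone decides which cache-line is replaced.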
- Another commonly used method of reducing memory latency involves prefetching instructions or data from main-memory to the cache ahead of the time when it will actually be needed by the processor.
- Various approaches and mechanisms have been tried in an attempt to predict the processor's need ahead of time. For example, one approach described in U.S. Pat. No. 5,778,436, to Kedem et al., teaches a predictive caching system and method for prefetching data blocks in which a record of cache misses and hits is maintained in a prediction table, and data to be prefetched is determined based on the last cache-miss and the previous cache-hit. While a significant improvement over cache systems without prefetching, all of the prior art prefetching mechanisms suffer from a common shortcoming.
- the present invention provides a system and method for efficiently prefetching data in a pointer linked data structure.
- the present invention provides a data processing system including a processor capable of executing a program, a main-memory and a prefetch engine configured to prefetch data from a plurality of locations in main-memory in response to a prefetch request from the processor.
- the prefetch engine is configured to traverse the linked-data-structure and prefetch data from the nodes.
- the prefetch request includes a starting address of a first node to be prefetched, an offset value, and a termination value.
- the prefetch engine is configured to determine from data contained in a prefetched first node at the offset value from the address of the first node, an address for a second node to be prefetched.
- the termination value is, for example, an end address
- the prefetch engine is configured to compare the address of the last node to be prefetched to the termination value to determine whether the prefetch request has been satisfied.
- the termination value is a number of nodes to be prefetched
- the prefetch engine is configured to count the number of nodes prefetched as they are prefetched and to compare the number of nodes prefetched to the termination value.
- the prefetch engine includes a number of sets of prefetch registers, one set of prefetch registers for each prefetch request from the processor that is yet to be completed.
- Each set of prefetch registers includes (i) a prefetch address register; (ii) an offset register; (iii) a termination register; (iv) a status register; and (v) a returned data register.
- the data processing system includes a cache capable of storing data transferred between the processor and a main-memory, and the prefetch engine is configured to write the prefetched data from the node to the cache.
- the data processing system further includes a cache controller configured to store data in the cache, and the prefetch engine is configured to issue a prefetch instruction to the cache controller.
- the present invention is directed to a method of prefetching data in a data processing system having a pointer linked data structure with a number of nodes, each with data stored therein.
- a prefetch request from a processor is received in a prefetch engine, the prefetch request including a starting address of a node, an offset value and a termination value.
- Data in the node is prefetched. It is determined whether a termination condition has been met and the prefetch request satisfied. If not,
- the offset and the starting address are used to load a new starting address from the prefetched data. That is, the address indicated by the sum of the starting address and the offset holds the new starting address for the next node to be prefetched.
- the process is repeated until the termination condition is met.
- the termination value is an end address, in which case the step of determining whether the termination condition has been met involves comparing the address of the last node from which data was prefetched to the termination value.
- the termination value is a predetermined number of nodes to be prefetched, the step of determining whether the termination condition has been satisfied includes the steps of (i) counting the number of nodes prefetched, and (ii) comparing the number of nodes prefetched to the termination value.
- the step of prefetching data in the node involves recording data from the node in a register in the prefetch engine.
- the step of prefetching data involves writing data from the node to the cache.
- the step of prefetching data is accomplished by issuing a prefetch instruction to the cache controller.
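The method steps above amount to a pointer-chasing loop. The sketch below is a software illustration under stated assumptions: main-memory is modelled as a dictionary from address to stored word, "prefetching" a node is modelled as recording its address, and a link value of 0 is treated as the end of the structure (a safety check not taken from the text). The name `prefetch_linked` and its parameters are hypothetical.

```python
def prefetch_linked(memory, start, offset, end_address=None, max_nodes=None):
    """Follow links from `start`, returning the node addresses prefetched.

    The termination value is either an end address or a node count.
    """
    if end_address is None and max_nodes is None:
        raise ValueError("a termination value is required")
    prefetched = []
    address = start
    while address != 0:  # assumption: 0 marks a null link
        prefetched.append(address)  # stands in for writing the node to cache
        if end_address is not None and address == end_address:
            break  # last node reached
        if max_nodes is not None and len(prefetched) >= max_nodes:
            break  # requested number of nodes prefetched
        # The word at (starting address + offset) holds the next node's address.
        address = memory[address + offset]
    return prefetched


# Three scattered nodes whose link field sits 8 bytes into each node.
memory = {0x100 + 8: 0x480, 0x480 + 8: 0x220, 0x220 + 8: 0}
```

For example, `prefetch_linked(memory, 0x100, 8, max_nodes=2)` visits the first two nodes, and passing `end_address=0x220` instead visits all three.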
- the invention is directed to a data processing system including a processor, a main-memory having a linked-data-structure with a plurality of nodes with data stored therein, and a prefetching means for traversing the linked-data-structure and prefetching data from the nodes in response to a prefetch request from the processor.
- the prefetching means includes: (i) means for receiving the prefetch request from the processor, the prefetch request including a starting address of a node, an offset value and a termination value; (ii) means for prefetching data in the node; (iii) means for determining whether a termination condition has been satisfied using the termination value; and (iv) means for loading an address for another node to be prefetched using the offset and the starting address.
- the prefetching means includes a plurality of sets of prefetch registers, one set of prefetch registers for each prefetch request from the processor that is yet to be completed. Each set of prefetch registers includes (i) a prefetch address register; (ii) an offset register; (iii) a termination register; (iv) a status register; and (v) a returned data register.
- the present invention is directed to a computer system having a cache memory system for caching data transferred between a processor executing a program and a main-memory.
- the cache memory system has at least one cache with multiple cache-lines each capable of caching data therein and is configured to enable a program executed on the processor to directly control caching of data in at least one of the cache-lines.
- the program includes computer program code adapted to: (i) create an address space for each cache to be controlled by the program and (ii) utilize special instructions to directly control caching of data in the cache.
- the address space for each cache is created by setting values in the control registers of the Cache-controller.
- the address space can be created by system calls to an operating system to set up the address space.
- the system call can be a newly created special purpose system call or an adaptation of an existing system call.
- MMAP system call or command
- MMAP can be used to set up the address space for each cache to be controlled by the program.
- the special instructions for directly controlling caching of data in the cache are generated by a compiler and inserted into the program during compiling of the program.
- the special instructions are instructions for loading data from cache to the register of the processor and for storing data from registers to the cache.
- the special instructions can also include instructions to transfer data between caches.
- the special instructions can take the form of LOAD_L1_CACHE [r1], r2; STORE_L1_CACHE r1, [r2]; PREFETCH_L1_CACHE [r1], [r2], READ; or PREFETCH_L1_CACHE [r1], [r2], WRITE, where L1 is a particular cache, r1 is an address in the cache, and r2 is a register in the processor to which data is to be loaded or from which it is to be stored.
- ASI Alternate Space Indexing
- ASI instructions can take the form of LOAD [A], [ASI], R, STORE [A], [ASI], R or PREFETCH [A], [ASI], R, where A is an address in main-memory and ASI is a number representing one of a number of possible ASI instructions.
- the cache memory system further includes a cache-controller configured to cache data to the cache, and the cache-controller has sole control over at least one cache-line while the program has sole control over at least one of the other cache-lines.
- the cache is a set-associative-cache having a number of sets each with several cache-lines
- at least one cache-line of each set is under the sole control of the program.
- the processor includes a processor-state-register and a control bit in the processor-state-register is set to decide which cache-line or lines are controlled by the program.
- the present invention provides a method for operating a cache memory system having at least one cache with a number of cache-lines capable of caching data therein.
- a cache address space is provided for each cache controlled by a program executed by a processor and special instructions are generated and inserted into the program to directly control caching of data in at least one of the cache-lines.
- the special instructions are received in the cache memory system and executed to cache data in the cache.
- the step of generating special instructions can be accomplished during compiling of the program.
- the method can further include the step of determining which cache-line in a set to flush to main-memory before caching new data to the set.
- the invention is directed to a computer system that includes a processor capable of executing a program, a main-memory, a cache memory system capable of caching data transferred between the processor and the main-memory, the cache memory system having at least one cache with a number of cache-lines capable of caching data therein, and means for enabling the program executed on the processor to directly control caching of data in at least one of the cache-lines.
- the means for enabling the program executed on the processor to directly control caching of data includes computer program code adapted to: (i) create an address space for each cache to be controlled by the program, and (ii) utilize special instructions to directly control caching of data in the cache.
- the step of utilizing special instructions and inserting them into the program can be accomplished by a compiler during compiling of the program.
- the system and method of the present invention is particularly useful in a computer system having a processor and one or more levels of hierarchically organized memory in addition to the cache memory system.
- the system and method of the present invention can be used in a cache memory system coupled between the processor and a lower-level main-memory.
- the system and method of the present invention can also be used in a buffer or interface coupled between the processor or main-memory and a mass-storage-device such as a magnetic, optical or optical-magnetic disk drive.
- the advantages of the present invention include: (i) predictable access times, (ii) reduced cache-misses, (iii) the reduction or elimination of cache conflicts and (iv) direct program control of what data is in a certain portion of the cache.
- FIG. 1 is a schematic diagram illustrating a computer network for which an embodiment of a method according to the present invention is particularly useful
- FIG. 2 is a block diagram illustrating a pointer linked data structure having a plurality of nodes with data stored therein;
- FIG. 3 is a flowchart showing an embodiment of a process for prefetching data in a pointer linked data structure according to an embodiment of the present invention
- FIG. 4 is a block diagram illustrating a computer system having an embodiment of a cache memory system according to the present invention
- FIG. 5 illustrates a schema of a cache-line of a cache in cache memory system according to an embodiment of the present invention
- FIG. 6 is a block diagram illustrating a schema of a set in a four-way set associative cache according to an embodiment of the present invention.
- FIG. 7 is a flowchart showing a process for operating a cache memory system according to an embodiment of the present invention to directly control caching of data in a cache.
- the present invention is directed to a system and method for prefetching data stored in a pointer linked data structure.
- FIG. 1 shows a block diagram of an exemplary embodiment of a data processing system with a prefetch engine capable of prefetching data according to an embodiment of the present invention.
- data processing systems 100 that are widely known and are not relevant to the present invention have been omitted.
- data processing system 100 typically includes central processing unit (CPU) or processor 110 for executing instructions for a computer application or program (not shown), main-memory 115 for storing data and instructions while running the application, a cache memory system 120 for storing data transferred between the processor and the main-memory, a mass-data-storage device, such as disk drive 125, for a more permanent storage of data and instructions, system bus 130 coupling components of the data processing system, and various input and output devices such as a monitor, keyboard or pointing device (not shown).
- CPU central processing unit
- Main-memory 115 holds information including data and instructions arranged in a data structure.
- a frequently used type of data structure is a pointer-linked data structure, such as a tree or a list.
- a block diagram illustrating a pointer linked data structure is shown in FIG. 2.
- Pointer linked data structure 140 consists of a number of nodes 145 (singularly 145A and 145B), each containing data 150 and a link 155, which is an address, or an offset, specifying a successive node. Unlike an array-type data structure, successive nodes 145 need not be contiguous, but rather are allocated separately and are widely scattered throughout main-memory 115.
- Cache memory system 120 could have a cache memory or cache 160 separate and distinct from processor 110, as shown, or a cache located on the same chip as the processor (not shown).
- Cache memory system 120 also includes cache controller 165 for controlling operation of cache 160 by controlling mapping of addresses from main-memory 115 to the cache and the replacement of data 150 in the cache in accordance with a cache replacement policy.
- cache memory system 120 could also have additional, separate caches for instructions and data, which are accessed at the same time, thereby allowing an instruction fetch to overlap with a data read or write.
- the data processing system 100 also includes
- prefetch engine 175 is an integrated circuit having a processor and an associated memory (not shown) configured to receive a prefetch request from processor 110, prefetch data 150 in a node 145, determine from the data an offset value and calculate from the offset value and the address of the first node 145A the address for a second node 145B to be prefetched.
- the prefetch engine 175 includes a plurality of sets of prefetch registers 180, only one of which is shown. The prefetch engine 175 includes one set of prefetch registers 180 for each prefetch request from processor 110 that is yet to be completed.
- the number of sets of prefetch registers 180 depends on the speed of processor 110 relative to main-memory 115, and the resources, such as power or space, available on the processor to be allocated to the prefetching function. In general, it has been found that from about 2 to about 50 sets of prefetch registers 180 is sufficient.
- Each set of prefetch registers 180 includes (i) a prefetch address register 190; (ii) an offset register 195; (iii) a termination register 200; (iv) a status register 205; and (v) a returned data register 210.
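One way to picture the register sets is as a fixed pool of records, one per outstanding prefetch request. The sketch below is illustrative only: the field names mirror the five registers listed above, but the status values, the slot-allocation behaviour when the pool is full, and the names `PrefetchRegisterSet` and `PrefetchEngine` are assumptions not stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefetchRegisterSet:
    prefetch_address: int                 # address of the node being prefetched
    offset: int                           # position of the link within a node
    termination: int                      # end address or node count
    status: str = "pending"               # assumed encoding of the status register
    returned_data: Optional[int] = None   # data returned from main-memory


class PrefetchEngine:
    """Toy pool holding one register set per uncompleted prefetch request."""

    def __init__(self, num_sets):
        self.sets: List[Optional[PrefetchRegisterSet]] = [None] * num_sets

    def submit(self, start, offset, termination):
        """Claim a free register set; return its slot number, or None if full."""
        for slot, current in enumerate(self.sets):
            if current is None:
                self.sets[slot] = PrefetchRegisterSet(start, offset, termination)
                return slot
        return None

    def complete(self, slot):
        """Release the register set once its request has been satisfied."""
        self.sets[slot] = None
```

The pool size corresponds to the number of prefetch requests that can be in flight at once, which is why the text ties it to processor speed and available chip resources.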
- a prefetch request from processor 110 is received in prefetch engine 175 (step 220), and the data in the node is prefetched (step 225).
- the prefetch request includes a starting address of a node, an offset value and a termination value.
- the termination value could be either an end address of a last node to be prefetched or a predetermined number of nodes to be prefetched.
- the step of prefetching data in the node (step 225) involves the step of writing data from node 145 to cache 160. This is accomplished by issuing a prefetch instruction to cache controller 165.
- In step 230, it is determined whether a termination condition has been satisfied.
- Where the termination value is the end address, this involves comparing the address of the last node from which data was prefetched to the termination value.
- Where the termination value is a predetermined number of nodes to be prefetched, this is accomplished by keeping track of the number of nodes prefetched and comparing the number of nodes prefetched to the termination value.
- a new starting address for another node to be prefetched is loaded from the prefetched data using the offset and the starting address (step 235). That is, the address indicated by the sum of the starting address and the offset holds the new starting address for the next node to be prefetched.
- the process includes the further step (not shown) of keeping track of the status of a prefetch, that is, whether data 150 has been returned to returned data register 210 and written to cache 160.
- the present invention is directed to cache memory systems and methods of operating the same to provide improved handling of data in a cache memory system for caching data transferred between a processor capable of executing a program and a main-memory.
- FIG. 4 shows a block diagram of an exemplary embodiment of a computer system 300 in which an embodiment of a cache memory system 305 of the present invention can be incorporated.
- computer system 300 typically includes central processing unit (CPU) or processor 310 for executing instructions for a computer application or program (not shown), main-memory 315 for storing data and instructions while running the application, a mass-data-storage device, such as disk drive 320, for a more permanent storage of data and instructions, system bus 330 coupling components of the computer system, and various input and output devices such as a monitor, keyboard or pointing device (not shown).
- Cache memory system 305 has a cache memory or cache separate and distinct from the processor, shown here as level 2 (L2) cache 335, for temporarily storing data and instructions recently read from or written to lower level main-memory 315 or mass-storage-device 320.
- Cache-controller 340 controls operation and content of cache 335 by controlling mapping of memory addresses to the cache and the replacement of data in the cache in accordance with a cache replacement policy.
- cache memory system 305 can further include primary or level 1 (L1) cache 345 integrally formed with processor 310 and one or more level 3 (L3) or victim caches 355 for temporarily storing data replaced or displaced from the L1 or L2 cache to speed up subsequent read or write operations.
- L1 level 1
- L3 level 3
- L1 cache 345 typically has a capacity of from about 1 to 64 KB, while lower-level L2 and L3 caches 335, 355, can have capacities of from about 128 KB to 64 MB in size.
- cache memory system 305 can also have separate caches for instructions and data, which can be accessed at the same time, thereby allowing an instruction fetch to overlap with a data read or write.
- the caches 335, 345, 355, can be organized as directly-mapped, fully-associative or set-associative caches as described above. In one embodiment, the caches 335, 345, 355, are organized as n-way set-associative caches, where n is an integer of two or more.
- FIG. 5 illustrates a schema of cache-line 360 of cache 335, 345, 355, in cache memory system 305 of FIG. 4.
- Cache-line 360 includes data-store 365 capable of storing multiple blocks or bytes of data, and tag-field 375 containing address information and control bits.
- the address information includes index 380 that uniquely identifies each cache-line 360 in cache 335, 345, 355, and a tag 385 that is used in combination with index 380 to identify an address in main-memory 315 from which data stored in the cache-line has been read from or written to. Often index 180 is not stored in cache 335, 345, 355, but is implicit, provided by the location or address of cache-line 360 within the cache.
- Control bits can include validity bit 390 which indicates if the cache-line contains valid data, bits for implementing a replacement algorithm, and a dirty-data-bit for indicating whether data in the cache-line has been modified but not written-back to lower- level memory.
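The cache-line layout described above can be sketched as a C structure, together with the split of a main-memory address into tag and implicit index. This is a minimal sketch assuming a directly-mapped cache of 256 lines with 64-byte data-stores; those sizes and all function names are illustrative assumptions, not values from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u   /* assumed data-store size per cache-line    */
#define NUM_LINES  256u  /* assumed number of lines (directly-mapped) */

/* One cache-line: a data-store plus a tag-field with control bits. */
typedef struct {
    uint8_t  data[LINE_BYTES]; /* data-store 365 */
    uint64_t tag;              /* tag 385 */
    bool     valid;            /* validity bit 390 */
    bool     dirty;            /* modified but not yet written back */
} cache_line;

/* Index: not stored, implied by the line's position in the cache. */
uint64_t addr_index(uint64_t addr) { return (addr / LINE_BYTES) % NUM_LINES; }

/* Tag: the remaining high-order part of the address. */
uint64_t addr_tag(uint64_t addr) { return addr / LINE_BYTES / NUM_LINES; }

/* Rebuild the main-memory address of a line's first byte from tag and index. */
uint64_t line_base_addr(uint64_t tag, uint64_t index)
{
    assert(index < NUM_LINES);
    return (tag * NUM_LINES + index) * LINE_BYTES;
}

/* A lookup hits only if the line is valid and its tag matches. */
bool line_hits(const cache_line *line, uint64_t addr)
{
    return line->valid && line->tag == addr_tag(addr);
}
```

Under these assumptions, address 0x12345 maps to index 141 with tag 4, and the line holding it starts at main-memory address 0x12340.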
- FIG. 6 is a block diagram illustrating a schema of a set in a four-way set associative cache according to an embodiment of the present invention.
- Each way 370 has a data store 365 and tag-field 375 associated therewith.
- The program has sole control over at least one of the ways 370 in each set, while the cache-controller 340 has control over the remainder of the ways.
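One way to picture this split of control is a replacement routine that never selects a program-owned way. The sketch below assumes that way 0 of each four-way set is reserved for the program and that the cache-controller uses a least-recently-used policy on the other three; the reservation of way 0, the LRU policy, and all names are assumptions for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WAYS         4 /* four-way set, as in FIG. 6 */
#define PROGRAM_WAYS 1 /* assumed: way 0 is reserved for the program */

typedef struct {
    uint64_t tag;
    bool     valid;
    uint32_t last_used; /* timestamp for the assumed LRU policy */
} way_state;

typedef struct {
    way_state way[WAYS];
} cache_set;

/* Pick the way the cache-controller may replace: an invalid way if one
 * exists, otherwise the least recently used one. Ways below
 * PROGRAM_WAYS are never chosen; only the program fills or evicts them. */
int controller_victim(const cache_set *set)
{
    assert(set != NULL);
    int victim = PROGRAM_WAYS;
    for (int w = PROGRAM_WAYS; w < WAYS; ++w) {
        if (!set->way[w].valid)
            return w;
        if (set->way[w].last_used < set->way[victim].last_used)
            victim = w;
    }
    return victim;
}
```

Even when the program-owned way holds the oldest data in the set, the controller leaves it alone, so data the program has placed there is not displaced by ordinary cache traffic.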
- FIG. 7 is a flowchart showing a process for operating cache memory system 305 having at least one cache 335, 345, 355 for caching data transferred between processor 310 and main-memory 315, with a plurality of cache-lines 360 capable of caching data therein, according to an embodiment of the present invention.
- A cache address space is provided for the cache 335, 345, 355 (step 400), and special instructions are generated (step 405) that, when executed in a program by processor 310, enable it to directly control caching of data in at least one of the plurality of cache-lines 360.
- The special instructions are received in the cache memory system 305 (step 410) and executed to cache data in the cache 335, 345, 355 (step 415).
- The cache memory system 305 includes a set-associative-cache with a number of sets each
- The method further includes the step (not shown) of determining which cache-line in a set to flush to main-memory 315 before caching new data to the set.
- The step of generating special instructions involves inserting the special instructions into the program during compilation of the program, in an optimization run.
- A compiler is a program that converts a program from a source or programming language to a machine language or object code. The compiler analyzes and determines data caching requirements, and inserts instructions into the machine-level code for more accurate and efficient cache operation. By being able to directly control the cache 335, 345, 355, the compiler can more accurately predict the speed of execution of the resulting code.
- A loop within a program is readily recognized as a sequence of code that is iteratively executed some number of times. The sequence of such operations is predictable because the same set of operations is repeated for each iteration of the loop.
- An index variable is often used to address elements of arrays that correspond to a regular sequence of memory locations.
- Such array references by a loop can lead to overwriting of data that will be needed in the future (a cache-conflict), resulting in a significant number of cache-misses.
- A compiler according to the present invention is able to deduce the number of cycles required by a repetitively executed sequence of code, such as a loop, and then insert prefetch instructions into loops such that array elements that are likely to be needed in future iterations are retrieved ahead of time. Ideally, the instructions are inserted far enough in advance that by the time the data is actually required by the processor, it has been retrieved from memory and stored in the cache.
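The idea of inserting a prefetch "far enough in advance" can be sketched as a prefetch distance: the memory latency divided by the cycle count of one loop iteration, rounded up. The sketch below is not the patent's special instruction; it substitutes the GCC/Clang builtin `__builtin_prefetch` as a stand-in, and the function names and any latency figures are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Iterations to prefetch ahead: ceil(miss latency / cycles per iteration). */
size_t prefetch_distance(size_t miss_latency_cycles, size_t cycles_per_iter)
{
    assert(cycles_per_iter > 0);
    return (miss_latency_cycles + cycles_per_iter - 1) / cycles_per_iter;
}

/* Sum an array while prefetching the element `dist` iterations ahead,
 * mimicking what a compiler might insert into the loop body. */
long sum_with_prefetch(const long *a, size_t n, size_t dist)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist], 0 /* read */, 3 /* keep in cache */);
#endif
        total += a[i];
    }
    return total;
}
```

With an assumed 200-cycle miss latency and an 8-cycle loop body, the distance is 25 iterations. The prefetch is only a hint, so the loop computes the same result with or without it.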
- Alternatively, the special instructions can be written into the program by the author of the executable program in an assembly language. It will be appreciated that both methods can be used in the same program. That is, at points in the program where performance is critical, the present invention allows the programmer to adapt the prefetching to the precise needs of the program, while the compiler is allowed to insert prefetch instructions at other, less critical points in the program as required.
- The special instructions can take the form of LOAD_L1_CACHE [r1], r2; STORE_L1_CACHE r1, [r2]; PREFETCH_L1_CACHE [r1], [r2], READ; and PREFETCH_L1_CACHE [r1], [r2], WRITE, where L1 is a cache and r1 and r2 are registers in processor 310 that the data is to be loaded to or from.
- The special instructions can also include instructions to transfer data between caches.
- Processor 310 has an architecture supporting alternate address space indexing and alternate or special load instructions, and the step of generating special instructions, step 405, involves receiving from processor 310 pre-existing special instructions.
- Processor 310 can have a SPARC® architecture supporting Alternate Space Indexing (ASI), and the step of generating special instructions, step 405, involves using an ASI instruction to control caching of data in the cache 335, 345, 355.
- SPARC®, or scalable processor architecture, is an instruction set architecture designed by Sun Microsystems, Inc., of Palo Alto, CA.
- ASI instructions are an alternate set of load (or store) instructions originally developed for running diagnostics on the processor and for providing access to memory not accessible using ordinary instructions (non-faulting memory).
- There are 256 possible ASI address spaces that can be specified.
- The ASI instruction used can take the form of LOAD [A], [ASI], R; STORE [A], [ASI], R; and PREFETCH [A], [ASI], R, where A is the address of the data in main-memory 315, R is a register in processor 310 that the data is to be loaded to or from, and ASI is a number representing one of 256 possible ASI address spaces.
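To make the role of the ASI operand concrete, the following software model steers an access to main memory or to a particular cache according to the ASI number carried by the instruction. It is only a sketch: the ASI values, the target names and the table-driven decode are hypothetical and are not taken from the SPARC specification or from the text, which states only that the ASI is one of 256 possible address spaces.

```c
#include <assert.h>

#define ASI_SPACES 256 /* an ASI selects one of 256 possible address spaces */

/* Hypothetical ASI numbers chosen for this sketch only. */
enum { ASI_EXAMPLE_MEMORY = 0x80, ASI_EXAMPLE_L1 = 0xC0, ASI_EXAMPLE_L2 = 0xC1 };

typedef enum { TARGET_UNMAPPED = 0, TARGET_MEMORY, TARGET_L1, TARGET_L2 } target;

/* Table from ASI number to the storage an access is steered to. */
static target asi_table[ASI_SPACES];

void asi_init(void)
{
    for (int i = 0; i < ASI_SPACES; ++i)
        asi_table[i] = TARGET_UNMAPPED;
    asi_table[ASI_EXAMPLE_MEMORY] = TARGET_MEMORY;
    asi_table[ASI_EXAMPLE_L1]     = TARGET_L1;
    asi_table[ASI_EXAMPLE_L2]     = TARGET_L2;
}

/* Decode the ASI operand of a LOAD, STORE or PREFETCH [A], [ASI], R. */
target asi_decode(int asi)
{
    assert(asi >= 0 && asi < ASI_SPACES);
    return asi_table[asi];
}
```

In this model, a program or compiler that wants data placed in a specific cache would issue the access with the ASI number mapped to that cache, while ordinary accesses keep using the memory-mapped ASI.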
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP01975464A EP1320801A2 (en) | 2000-09-29 | 2001-09-26 | System and method for pre-fetching for pointer linked data structures |
AU2001294788A AU2001294788A1 (en) | 2000-09-29 | 2001-09-26 | System and method for pre-fetching for pointer linked data structures |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/677,090 | 2000-09-29 | ||
US09/677,092 | 2000-09-29 | ||
US09/677,090 US6782454B1 (en) | 2000-09-29 | 2000-09-29 | System and method for pre-fetching for pointer linked data structures |
US09/677,092 US6668307B1 (en) | 2000-09-29 | 2000-09-29 | System and method for a software controlled cache |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2002027481A2 (en) | 2002-04-04 |
WO2002027481A3 (en) | 2002-12-19 |
Family
ID=27101704
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2001/030225 WO2002027481A2 (en) | 2000-09-29 | 2001-09-26 | System and method for pre-fetching for pointer linked data structures |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP1320801A2 (en) |
AU (1) | AU2001294788A1 (en) |
WO (1) | WO2002027481A2 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108874691B (en) * | 2017-05-16 | 2021-04-30 | 龙芯中科技术股份有限公司 | Data prefetching method and memory controller |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0723221A2 (en) * | 1995-01-20 | 1996-07-24 | Hitachi, Ltd. | Information processing apparatus for prefetching data structure either from a main memory or its cache memory |
US5652858A (en) * | 1994-06-06 | 1997-07-29 | Hitachi, Ltd. | Method for prefetching pointer-type data structure and information processing apparatus therefor |
-
2001
- 2001-09-26 EP EP01975464A patent/EP1320801A2/en not_active Withdrawn
- 2001-09-26 WO PCT/US2001/030225 patent/WO2002027481A2/en active Application Filing
- 2001-09-26 AU AU2001294788A patent/AU2001294788A1/en not_active Abandoned
Non-Patent Citations (1)
Title |
---|
KARLSSON M ET AL: "A prefetching technique for irregular accesses to linked data structures" HIGH-PERFORMANCE COMPUTER ARCHITECTURE, 2000. HPCA-6. PROCEEDINGS. SIXTH INTERNATIONAL SYMPOSIUM ON TOULUSE, FRANCE 8-12 JAN. 2000, LOS ALAMITOS, CA, USA,IEEE COMPUT. SOC, US, 8 January 2000 (2000-01-08), pages 206-217, XP010371910 ISBN: 0-7695-0550-3 * |
Also Published As
Publication number | Publication date |
---|---|
AU2001294788A1 (en) | 2002-04-08 |
EP1320801A2 (en) | 2003-06-25 |
WO2002027481A3 (en) | 2002-12-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
AK | Designated states |
Kind code of ref document: A3 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PH PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG UZ VN YU ZA ZW |
|
AL | Designated countries for regional patents |
Kind code of ref document: A3 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2001975464 Country of ref document: EP |
|
WWP | Wipo information: published in national office |
Ref document number: 2001975464 Country of ref document: EP |
|
REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
NENP | Non-entry into the national phase in: |
Ref country code: JP |