CA1309509C - Cache memory with variable fetch and replacement schemes - Google Patents

Cache memory with variable fetch and replacement schemes

Info

Publication number
CA1309509C
CA1309509C CA000615858A CA615858A
Authority
CA
Canada
Prior art keywords
cache
data
information
cache memory
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Lifetime
Application number
CA000615858A
Other languages
French (fr)
Inventor
Allen Baum
William R. Bryg
Michael J. Mahon
Ruby Bei-Loh Lee
Steven S. Muchnick
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
HP Inc
Original Assignee
Hewlett Packard Co
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CA000525049A external-priority patent/CA1279731C/en
Application filed by Hewlett Packard Co filed Critical Hewlett Packard Co
Application granted granted Critical
Publication of CA1309509C publication Critical patent/CA1309509C/en
Anticipated expiration legal-status Critical
Expired - Lifetime legal-status Critical Current

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A method is provided for swapping data out of a cache memory. An instruction is presented to the cache; the instruction includes a cache control specifier which identifies a type of data being requested. Based on the cache control specifier, one of a plurality of replacement schemes is selected for swapping a data block out of the cache.

Description

CACHE MEMORY WITH VARIABLE FETCH AND REPLACEMENT SCHEMES

Field of the Invention

The present invention relates to computer systems and, more particularly, to a computer system which utilizes a cache memory which can switch from a standard replacement scheme to a more efficient method of data replacement when warranted.

Background of the Invention

Most modern computer systems include a central processing unit (CPU) and a main memory. The speed at which the CPU can decode and execute instructions to process data has for some time exceeded the speed at which instructions and operands can be transferred from main memory to the CPU. In an attempt to reduce the problems caused by this mismatch, many computer systems include a cache memory or buffer between the CPU and main memory.

A cache memory is a small, high-speed buffer memory which is used to hold temporarily those portions of the contents of main memory which it is believed will be used in the near future by the CPU. The main purpose of a cache memory is to shorten the time necessary to perform memory accesses, either for data or instruction fetch. The information located in cache memory may be accessed in much less time than that located in main memory. Thus, a CPU with a cache memory needs to spend far less time waiting for instructions and operands to be fetched and/or stored. For example, in typical large, high-speed computers, main memory can be accessed in 300 to 600 nanoseconds; information can be obtained from a cache memory, on the other hand, in 50 to 100 nanoseconds. For such machines, the cache memory produces a very substantial increase in execution speed, but processor performance is still limited in instruction execution rate by cache memory access time. Additional increases in instruction execution rate can be gained by further decreasing the cache memory access time.

A cache memory is made up of many blocks of one or more words of data. Each block has associated with it an address tag that uniquely identifies which block of main memory it is a copy of. Each time the processor makes a memory reference, the cache makes an address tag comparison to see if it has a copy of the requested data. If it does, it supplies the data. If it does not, it retrieves the block from main memory to replace one of the blocks stored in the cache, then supplies the data to the processor.
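To make the tag-comparison mechanics concrete, here is a minimal sketch in C of a fully associative lookup. The type names, the sizes, and the software loop standing in for parallel hardware comparators are all illustrative assumptions, not details taken from the patent.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define WORDS_PER_BLOCK 4    /* hypothetical block size       */
    #define NUM_BLOCKS      256  /* hypothetical cache capacity   */

    typedef struct {
        bool     valid;                    /* block holds usable data        */
        uint32_t tag;                      /* which main-memory block this   */
                                           /* is a copy of                   */
        uint32_t data[WORDS_PER_BLOCK];
    } CacheBlock;

    /* On each memory reference, compare the requested block address
     * against every stored address tag. A match is a "hit"; otherwise
     * the caller must fetch the block from main memory and replace one
     * of the blocks already in the cache. */
    bool cache_lookup(CacheBlock cache[NUM_BLOCKS], uint32_t block_addr,
                      uint32_t out[WORDS_PER_BLOCK])
    {
        for (int i = 0; i < NUM_BLOCKS; i++) {
            if (cache[i].valid && cache[i].tag == block_addr) {
                memcpy(out, cache[i].data, sizeof cache[i].data);
                return true;   /* hit: supply the data */
            }
        }
        return false;          /* miss: fetch and replace */
    }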

Optimizing the design of a cache memory generally has four aspects:

(1) maximizing the probability of finding a memory reference's information in the cache (the so-called "hit" ratio),
(2) minimizing the time required to access information that is indeed in the cache (access time),
(3) minimizing the delay due to a cache "miss", and
(4) minimizing the overhead of updating main memory and maintaining multicache consistency.

All of these objectives must be accomplished under cost constraints and in view of the interrelationship between the parameters, for example, the trade-off between "hit" ratio and access time.

It is obvious that the larger the cache memory, the higher the probability of finding the needed information in it. Cache sizes cannot be expanded without limit, however, for several reasons: cost, the most important reason in many machines, especially small ones; physical size, the cache must fit on the boards and in the cabinets; and access time, the larger the cache, the slower it will become.
Information is generally retrieved from cache associatively to determine if there is a "hit". However, large associative memories are both very expensive and somewhat slow. In early cache memories, all the elements were searched associatively for each request by the CPU. In order to provide the access time required to keep up with the CPU, cache sizes were limited and the hit ratio was thus rather low.
More recently, cache memories have been organized into groups of smaller associative memories called sets. Each set contains a number of locations, referred to as the set size. For a cache of size m, divided into L sets, there are s = m/L locations in each set. When an address in main memory is mapped into the cache, it can appear in any of the L sets. For a cache of a given size, searching each of the sets in parallel can improve access time by a factor of L. However, the time to complete the required associative search is still undesirably lengthy.
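As a small worked example of the set arithmetic (the numbers are invented, purely for illustration): a 256-block cache divided into L = 64 sets has s = m/L = 4 locations per set, and a block address selects its set by a simple modulo.

    /* Hypothetical parameters: 4-way set-associative organization. */
    enum { M_BLOCKS = 256, L_SETS = 64, SET_SIZE = M_BLOCKS / L_SETS };

    /* A main-memory block may reside only in one set; the hardware
     * then searches that set's SET_SIZE locations in parallel. */
    unsigned set_index(unsigned block_addr)
    {
        return block_addr % L_SETS;
    }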
The operation of cache memories to date has been based upon the assumption that, because a particular memory location has been referenced, that location and locations very close to it are very likely to be accessed in the near future. This is often referred to as the property of locality. The property of locality has two aspects, temporal and spatial. While over short periods of time, a program distributes its memory references nonuniformly over its address space, the portions of the address space which are favored remain largely the same for long periods of time. This first property, called temporal locality, or locality by time, means that the information which will be in use in the near future is likely to be in use already. This type of behavior can be expected from certain data structures, such as program loops, in which both data and instructions are reused.

The second property, locality by space, means that portions of the address space which are in use generally consist of a fairly small number of individually contiguous segments of that address space. Locality by space, then, means that the loci of reference of the program in the near future are likely to be near the current loci of reference. This type of behavior can be expected from common knowledge of program structure: related data items (variables, arrays) are usually stored together, and instructions are mostly executed sequentially. Since the cache memory retains segments of information that have been recently used, the property of locality implies that needed information is also likely to be found in the cache. See Smith, A. J., "Cache Memories," ACM Computing Surveys, 14:3 (Sept. 1982), pp. 473-530.
If a cache has more than one set, as described above, then when there is a cache miss, the cache must decide which of several blocks of information should be swapped out to make room for the new block being retrieved from main memory. To decide which block will be swapped out, different caches use different replacement schemes.
The most commonly utilized replacement scheme is Least Recently Used ("LRU"). According to the LRU replacement scheme, for each group of blocks at a particular index, the cache maintains several status bits that keep track of the order in which these blocks were last accessed. Each time one of the blocks is accessed, it is marked most recently used and the others are adjusted accordingly. When there is a cache miss, the block swapped out to make room for the block being retrieved from main memory is the block that was least recently used.
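A brief software sketch of LRU bookkeeping for the group of blocks at one index, assuming a rank array in place of the hardware status bits (an assumption for illustration; the patent does not specify the encoding):

    #define WAYS 4  /* hypothetical number of blocks per index */

    /* lru_rank[w] == 0 marks way w most recently used; the largest
     * rank marks the least recently used way. */
    void lru_touch(unsigned lru_rank[WAYS], unsigned way)
    {
        unsigned old = lru_rank[way];
        for (unsigned w = 0; w < WAYS; w++)
            if (lru_rank[w] < old)
                lru_rank[w]++;          /* newer blocks age by one   */
        lru_rank[way] = 0;              /* accessed block is newest  */
    }

    /* On a miss, swap out the way with the largest rank. */
    unsigned lru_victim(const unsigned lru_rank[WAYS])
    {
        unsigned victim = 0;
        for (unsigned w = 1; w < WAYS; w++)
            if (lru_rank[w] > lru_rank[victim])
                victim = w;
        return victim;
    }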
Other replacement schemes that are used are First In First Out (FIFO) and random replacement, the nomenclature being self-explanatory.
Contrary to the above-stated assumption, however, not all computer data structures have the same kind of data locality. For some simple structures, such as data stacks or sequential data, an LRU replacement scheme is not optimal. Thus, in cache memory structures used in the past and in accordance with the basic assumption that the most likely data to be referenced is that which was referenced most recently or is close to that data in physical address, no provision has been made in cache memory operation for deviation from the standard data replacement scheme.

SUMMARY
It is an object of an aspect of the present invention to minimize access time to a cache memory.
It is an object of an aspect of the invention to provide flexibility in cache memory operation so that the cache can switch from a standard replacement scheme to a more efficient method of data replacement when warranted.
These and other objects of the invention are accomplished by including a cache control specifier in each instruction that is presented to the cache memory. When presented with a store or load instruction, the cache identifies the cache control specifier in the instruction and implements one of a number of replacement schemes as indicated by the cache control specifier. By embedding the cache control specifier in the instruction, the cache can treat stack or sequential data structures appropriately, thereby increasing cache performance. The cache can also receive "hints" as to when it is advantageous to prefetch data from main memory.
Various aspects of the invention are as follows:
A method for replacing a data block in a cache memory of a computer system wherein the computer system comprises a processor for fetching and executing data, a main memory, and a cache memory, the method comprising:
presenting an instruction from the processor, said instruction including a cache control specifier indicating that a new data block should be transferred from main memory to the cache.
A computer device comprising:
a processor;
a cache system coupled to the processor, the cache system comprising:
a cache memory; and
an extracting means, coupled to the cache memory, for extracting information embedded in a field of an instruction sent from the processor to the cache system, the extracting means supplying the information to the cache memory for cache management.
A method for improving the transfer of information between registers and memory comprising the steps of:
encoding bits in an instruction;
interpreting said bits by a cache controller to identify select information likely to be accessed by future instructions; and storing said select information in a cache memory.

A method for replacing a data block in a cache memory of a computer system wherein the computer system comprises a processor for fetching and executing data, a main memory, and a cache memory, the data block being part of a sequential data structure, and including a plurality of serially arranged words, the method comprising:
presenting an instruction from the processor, said instruction including a cache control specifier indicating that access to a data block from a sequential data structure is desired; searching the cache to determine whether said desired data block is contained in the cache;
in the event that said instruction references the last word in said data block, and said block is contained in the cache, marking said data block most eligible for replacement; in the event that said instruction references a STORE to the first word in said data block and said block is not contained in the cache, allocating a new data block in the cache, marking said new data block as least eligible for replacement, and performing the STORE to said new data block.
A method for replacing a data block in a cache memory of a computer system wherein the computer system comprises a processor for fetching and executing data, a main memory, and a cache memory, the method comprising:
presenting an instruction from the processor, said instruction including a cache control specifier indicating that a new data block should be transferred from main memory to the cache.

Brief Description of the Drawings

Figure 1 is a schematic block diagram of a computer system which includes a cache memory;
Figure 2 is a schematic illustration of a multiple-set cache memory structure;
Figure 3 is a schematic block diagram illustrating the operation of a two-set cache in accordance with the method of the present invention; and
Figure 4 is a schematic illustration of a stacked data structure which can be utilized in conjunction with the method of the present invention.

Detailed Description of the Invention

Figure 1 shows a computer system which incorporates a cache memory. A CPU 11 communicates with main memory 13 and input/output channel 15 via bus 17. The CPU includes a processor 19 which fetches, decodes and executes instructions to process data. Since, as discussed above, it is not practical to store all the instructions and data used by the computer system in CPU 11, data and instructions are stored in main memory 13, transferred to processor 19 when they are requested during the execution of a program or routine and returned to main memory 13 after the program or routine has been completed.
Access to main memory 13 is relatively slow compared with the operation of processor 19. If processor 19 had to wait for main memory access to be completed each time an instruction or data was needed, its execution rate would be significantly reduced. Therefore, in order to provide access times which more closely match the needs of the processor 19, a buffer memory, cache memory 21, stores a limited number of instructions and data.

Since cache memory 21 is much smaller than main memory 13, it can be economically built to have higher access rates. Nevertheless, there is still a trade-off between the access time for the cache and the size of the cache. As discussed above, as the cache becomes larger it becomes more expensive and the access time for the cache increases.
Thus, if cache 21 is made very large to increase the hit ratio, although there will be very few references to main memory, the processor may be slowed down by increased access time, even for a cache "hit." It is therefore desirable to maximize the hit ratio for a cache of a given size and architecture.
To explain the method of the invention more completely, an understanding of the structure of cache memory 21 is necessary. Figure 2 shows a cache memory having multiple sets A-N. Each set comprises an array of locations or blocks which are labeled with an index 31. Each block contains data 33 and an address tag 35 which corresponds to the address of the copy of the same data in main memory 13.

In the preferred embodiment, each block of data 33 contains four words. This four-word block is the unit in which data is exchanged between cache memory 21 and the main memory 13, and it is also the unit in which data is indexed, fetched and replaced in cache 21. The block could contain fewer or more words. These parameters are a matter of design choice for the memory system, and the principles of the invention can be applied to a wide range of possible configurations.
In addition to the data 33 and address tag 35, each block in the cache has associated with it two one-bit status flags: "valid" and "dirty". The cache also has a group of LRU status bits for all blocks in the cache corresponding to the same index.

The valid bit 37 is set if and only if that block contains "valid", i.e., up-to-date, data.

The dirty bit 38 is set if the processor 19 has stored to that address in the cache memory since it was brought into the cache. Unless the cache 21 updates main memory 13 every time processor 19 does a store to the cache, the cache will have more up-to-date data for a block than main memory has for the same block address. The dirty bit 38 indicates that main memory 13 must be updated by writing the data currently in the block in cache 21 back into main memory when that block is swapped out of the cache.

The LRU status bits 39 identify whether a block is the least recently used or most eligible for replacement. As stated above, according to the LRU replacement scheme, each time one of the blocks in the cache is accessed, it is marked most recently used and the LRU status bits of the others are adjusted accordingly.
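Gathering the per-block state just described into one hypothetical C layout (the field names follow the reference numerals in the text; the widths and the four-way grouping are assumptions):

    #include <stdint.h>

    #define WORDS_PER_BLOCK 4

    typedef struct {
        uint32_t tag;                     /* address tag 35              */
        uint32_t data[WORDS_PER_BLOCK];   /* data 33                     */
        unsigned valid : 1;               /* valid bit 37: up-to-date    */
        unsigned dirty : 1;               /* dirty bit 38: must be       */
                                          /* written back when swapped   */
    } Block;

    typedef struct {
        Block    way[4];        /* blocks sharing one index, one per set */
        unsigned lru_rank[4];   /* LRU status bits 39 for this index     */
    } IndexGroup;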
Each time processor 19 makes a memory reference, cache 21 is searched to see if it contains a copy of the requested data. If it does not, then the data must be fetched in a block from main memory 13, supplied to processor 19 and stored in cache 21, replacing one of the blocks already there. If the replaced block is "dirty", that is, processor 19 has stored to the block in the cache 21, then it must be written back to the appropriate address in main memory 13 to update main memory, thereby maintaining data integrity.
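Expressed as a sketch, reusing the hypothetical Block type above (write_back_to_memory is an invented helper standing in for the main-memory transfer):

    #include <string.h>

    void write_back_to_memory(uint32_t tag,
                              const uint32_t data[WORDS_PER_BLOCK]);

    /* Replace a block on a miss: a dirty victim is copied back to
     * main memory before being overwritten, preserving data integrity. */
    void replace_block(Block *victim, uint32_t new_tag,
                       const uint32_t new_data[WORDS_PER_BLOCK])
    {
        if (victim->valid && victim->dirty)
            write_back_to_memory(victim->tag, victim->data);
        victim->tag   = new_tag;
        victim->valid = 1;
        victim->dirty = 0;
        memcpy(victim->data, new_data, sizeof victim->data);
    }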
According to the present invention, as shown in Figure 3, the instruction executed by processor 19, which gives rise to the data request the cache 21 must handle, includes a number of bits which comprise a cache control specifier. The cache control specifier is coded to inform the cache about the type of data referenced by the instruction and, thus, which one of a plurality of replacement schemes should be used to swap data out of cache 21.

The instruction 41 provided by the processor 19 includes information from which may be computed an address tag 35' (which is used for searching the cache 21 for a corresponding tag 35), an index 31' (which is used to identify the desired index 31 in cache 21), an operational code field 43 (which is used to control the operation at the cache 21), and a cache control specifier 45 which specifies the method to be used in swapping data from the cache.

In the preferred embodiment of the invention, the instruction can code four such cache specifiers: NORMAL data, STACK data, SEQUENTIAL data and PREFETCH. The cache responds to these specifiers to choose the replacement scheme which is best suited to the type of data identified.
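A hedged sketch of how such an instruction word might be decomposed in C; the bit positions and field widths below are invented for illustration, since the patent does not fix a binary encoding:

    #include <stdint.h>

    /* The four cache control specifiers named in the text. */
    typedef enum {
        SPEC_NORMAL, SPEC_STACK, SPEC_SEQUENTIAL, SPEC_PREFETCH
    } CacheSpec;

    /* Hypothetical 32-bit layout: operational code field 43 in the top
     * bits, cache control specifier 45 in the low two bits, and address
     * bits from which index 31' and tag 35' are computed. */
    static inline unsigned  op_field(uint32_t insn)    { return insn >> 26; }
    static inline CacheSpec spec_field(uint32_t insn)  { return (CacheSpec)(insn & 0x3u); }
    static inline unsigned  index_field(uint32_t addr) { return (addr >> 4) & 0x3Fu; }
    static inline uint32_t  tag_field(uint32_t addr)   { return addr >> 10; }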
For NORMAL data, a NORMAL cache control specifier is used and the cache operates according to the standard replacement scheme, in this case, LRU. Other standard replacement schemes that could be used are First-In-First-Out (FIFO) and random replacement.
A STACK cache control specifier is used for stacked data structures that grow from low numbered addresses in memory. An example is the Last-In-First-Out (LIFO) stack structure 47 illustrated in Figure 4. As shown in Figure 4, a stack pointer 49 keeps track of the address of the top of the stack. The data at addresses below pointer 49, in the area 51, is valid data; data at addresses above pointer 49, in area 53, is invalid. The processor 19 is allowed to access any data currently considered valid, i.e., below stack pointer 49. The addresses in the stack 47 are mapped onto the indexed locations 31 in cache 21, and can be divided into blocks corresponding to the blocks in cache 21, as indicated by the double line divisions 55 shown in Figure 4.
If the reference to the stack 47 is a STORE to the first word in a block and there is a cache "miss", then a new block must be allocated on the stack and in the cache. It is assumed that new data will be written into the selected block before the old data is read. Therefore, there is no need, in this case, to retrieve data from main memory. The cache initializes the address tag, marks the block most recently used and performs the STORE. In addition, to prevent possible security violations, the block is initialized with constant or random data before performing the store.

If the reference to the stack 47 is a LOAD to the first word in a block, then room is being deallocated on the stack and it is assumed that the data in the block is no longer needed. The cache 21, therefore, performs a normal LOAD and marks the block clean and most eligible for replacement, i.e., least recently used.
Using the STACK variation of the standard replacement scheme avoids transferring invalid stack data into the cache and, thus, saves not only the time required to transfer the data to the cache in the first place, but also the time required to remove it from the cache and initialize the addresses later on.
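As a sketch of the STORE-miss path under the STACK specifier, again reusing the hypothetical Block type above (allocate in place, initialize for security, never read main memory):

    #include <stdint.h>
    #include <string.h>

    /* STACK specifier, STORE to the first word of a block, cache miss:
     * allocate the block in the cache without fetching from main memory. */
    void stack_store_first_word_miss(Block *victim, uint32_t block_tag,
                                     uint32_t word0)
    {
        /* Overwrite the old contents with constant data first, so stale
         * data from another owner cannot leak (the security step). */
        memset(victim->data, 0, sizeof victim->data);
        victim->tag     = block_tag;
        victim->valid   = 1;
        victim->dirty   = 1;       /* cache now newer than main memory */
        victim->data[0] = word0;   /* perform the STORE */
        /* the caller also marks this block most recently used */
    }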
The SEQUENTIAL cache control specifier is used for sequential data access patterns such as those involved in block move or in scanning text. When a LOAD or a STORE operation references the last address in a block, it is assumed that it will be the last reference to that block for a long time. Therefore, the cache performs the normal LOAD or STORE and the block is marked most eligible for replacement. If a STORE accesses the first word in a block and there is a cache "miss", it can be assumed that the whole block will be overwritten and there is no need to read from main memory 13. Instead, a block is freed up in the cache, initialized for security and marked most recently used, valid and dirty (similar to STACK store). The operating system must make sure that no other data shares the last block of the sequential data structure, or this procedure can result in the loss of that data.
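A sketch of the SEQUENTIAL demotion in the same style as the LRU example above (simplified: a real cache would adjust the other blocks' status bits as well, and the STORE-miss case proceeds as in the STACK sketch):

    /* SEQUENTIAL specifier: after a normal LOAD or STORE, a reference
     * to the last word of the block demotes it, since it is assumed
     * not to be needed again for a long time. */
    void sequential_after_access(unsigned lru_rank[WAYS], unsigned way,
                                 unsigned word_in_block)
    {
        if (word_in_block == WORDS_PER_BLOCK - 1)
            lru_rank[way] = WAYS - 1;   /* most eligible for replacement */
    }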
The PREFETCH cache control specifier is used to indicate that, in a LOAD or STORE to a block, the next block is very likely to be accessed in the near future. The cache therefore performs a normal LOAD or STORE and fetches the next block if it is not already in the cache. Depending on how the cache is designed, the prefetch might occur near the beginning, middle or end of the block with the specifier.
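And a last sketch for PREFETCH; cache_contains and fetch_block are hypothetical helpers standing in for the cache lookup and the main-memory transfer:

    #include <stdbool.h>
    #include <stdint.h>

    bool cache_contains(uint32_t block_addr);   /* hypothetical helpers */
    void fetch_block(uint32_t block_addr);

    /* PREFETCH specifier: after the normal LOAD or STORE to a block,
     * start fetching the next sequential block if it is absent, so the
     * transfer overlaps with continued execution. */
    void handle_prefetch(uint32_t block_addr)
    {
        uint32_t next = block_addr + 1;
        if (!cache_contains(next))
            fetch_block(next);
    }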


Claims (13)

THE EMBODIMENTS OF THE INVENTION IN WHICH AN EXCLUSIVE PROPERTY OR PRIVILEGE IS CLAIMED ARE DEFINED AS FOLLOWS:
1. A method for replacing a data block in a cache memory of a computer system wherein the computer system comprises a processor for fetching and executing data, a main memory, and a cache memory, the method comprising:
presenting an instruction from the processor, said instruction including a cache control specifier indicating that a new data block should be transferred from main memory to the cache.
2. A computer device comprising:
a processor;
a cache system coupled to the processor, the cache system comprising:
a cache memory; and
an extracting means, coupled to the cache memory, for extracting information embedded in a field of an instruction sent from the processor to the cache system, the extracting means supplying the information to the cache memory for cache management.
3. The computer device as in Claim 2, wherein said information optionally directs said cache memory to prefetch selected instructions.
4. The computer device as in Claim 2, wherein said information optionally directs said cache memory to prefetch selected data.
5. The computer device as in Claim 2, wherein said information optionally directs a replacement of selected data in said cache memory.
6. The computer device as in Claim 2, wherein said information optionally directs a replacement of selected instructions in said cache memory.
7. A method for improving the transfer of information between registers and memory comprising the steps of:
encoding bits in an instruction;
interpreting said bits by a cache controller to identify select information likely to be accessed by future instructions; and storing said select information in a cache memory.
8. The method for improving the transfer of information as in Claim 7 wherein the step for encoding bits in an instruction further comprises encoding said bits to identify a stack memory structure.
9. The method for improving the transfer of information as in Claim 7 wherein the step for using said bits further comprises using said bits to prefetch selected data.
10. The method for improving the transfer of information as in Claim 7 wherein the step for using said bits further comprises using said bits to prefetch selected instructions.
11. The method for improving the transfer of information as in Claim 7 wherein the step for using said bits further comprises using said bits to direct replacement of selected data in said cache memory.
12. The method for improving the transfer of information as in Claim 7 wherein the step for using said bits further comprises using said bits to direct replacement of selected instructions in said cache memory.
13. The method as in Claim 7 further comprising the step of encoding bits in said instruction to facilitate cache memory management.
CA000615858A 1986-06-27 1990-08-31 Cache memory with variable fetch and replacement schemes Expired - Lifetime CA1309509C (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US87964986A 1986-06-27 1986-06-27
CA000525049A CA1279731C (en) 1986-06-27 1986-12-11 Cache memory with variable fetch and replacement schemes
US879,649 1992-05-06

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CA000525049A Division CA1279731C (en) 1986-06-27 1986-12-11 Cache memory with variable fetch and replacement schemes

Publications (1)

Publication Number Publication Date
CA1309509C (en)

Family

ID=25671176

Family Applications (1)

Application Number Title Priority Date Filing Date
CA000615858A Expired - Lifetime CA1309509C (en) 1986-06-27 1990-08-31 Cache memory with variable fetch and replacement schemes

Country Status (1)

Country Link
CA (1) CA1309509C (en)

Similar Documents

Publication Publication Date Title
US4928239A (en) Cache memory with variable fetch and replacement schemes
US4774654A (en) Apparatus and method for prefetching subblocks from a low speed memory to a high speed memory of a memory hierarchy depending upon state of replacing bit in the low speed memory
US5091851A (en) Fast multiple-word accesses from a multi-way set-associative cache memory
EP1066566B1 (en) Shared cache structure for temporal and non-temporal instructions and corresponding method
EP0695996B1 (en) Multi-level cache system
US6470437B1 (en) Updating and invalidating store data and removing stale cache lines in a prevalidated tag cache design
US6640283B2 (en) Apparatus for cache compression engine for data compression of on-chip caches to increase effective cache size
EP0407052B1 (en) Method to increase performance in a multi-level cache system by the use of forced cache misses
US4914582A (en) Cache tag lookaside
US5546559A (en) Cache reuse control system having reuse information field in each cache entry to indicate whether data in the particular entry has higher or lower probability of reuse
US6912623B2 (en) Method and apparatus for multithreaded cache with simplified implementation of cache replacement policy
EP1149342B1 (en) Method and apparatus for managing temporal and non-temporal data in a single cache structure
EP0667580A2 (en) Cache System for a memory
US6625714B1 (en) Parallel distributed function translation lookaside buffer
JPH05210585A (en) Cash management system
US5586296A (en) Cache control system and method for selectively performing a non-cache access for instruction data depending on memory line access frequency
US20040143708A1 (en) Cache replacement policy to mitigate pollution in multicore processors
US6571316B1 (en) Cache memory array for multiple address spaces
US6438655B1 (en) Method and memory cache for cache locking on bank-by-bank basis
EP0250702B1 (en) Cache memory with variable fetch and replacement schemes
US20050270876A1 (en) Selectively changeable line width memory
US7219197B2 (en) Cache memory, processor and cache control method
US6397298B1 (en) Cache memory having a programmable cache replacement scheme
EP0173893B1 (en) Computing system and method providing working set prefetch for level two caches
CA1309509C (en) Cache memory with variable fetch and replacement schemes

Legal Events

Date Code Title Description
MKLA Lapsed