WO1997036234A1 - Cache multi-block touch mechanism for object oriented computer system - Google Patents

Cache multi-block touch mechanism for object oriented computer system

Info

Publication number
WO1997036234A1
Authority
WO
WIPO (PCT)
Prior art keywords
cache
block
instruction
touch instruction
memory
Application number
PCT/US1996/019469
Other languages
French (fr)
Inventor
Mark R. Funk
Steven Raymond Kunkel
Mikko Herman Lipasti
David A. Luick
Robert Ralph Roediger
William Jon Schmidt
Original Assignee
International Business Machines Corporation
Application filed by International Business Machines Corporation
Priority to EP96944764A (published as EP0890148A1)
Publication of WO1997036234A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3802 Instruction prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0862 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/44 Encoding
    • G06F 8/443 Optimisation
    • G06F 8/4441 Reducing the execution time required by the program code
    • G06F 8/4442 Reducing the number of cache misses; Data prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/30003 Arrangements for executing specific machine instructions
    • G06F 9/3004 Arrangements for executing specific machine instructions to perform operations on memory
    • G06F 9/30047 Prefetch instructions; cache control instructions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G06F 9/38 Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F 9/3824 Operand accessing
    • G06F 9/383 Operand prefetching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F 2212/60 Details of cache memory
    • G06F 2212/6028 Prefetching based on hints or prefetch instructions

Definitions

  • This invention generally relates to computer systems.
  • this invention relates to mechanisms for prefetching information into a cache memory within a computer system.
  • the development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era.
  • Two of the more basic components of the EDVAC system which are still found in today's systems are the computer system's memory and processor.
  • the computer system's memory is where the computer system's programs and data are stored
  • the computer system's processor is the entity that is responsible for executing the programs that are stored in its memory.
  • OOP: Object Oriented Programming
  • Cache memory is special because a processor can retrieve information from cache memory much faster than it can from standard memory (called main memory) .
  • Cache memory is significantly more expensive than main memory. Consequently, computer system designers balance the need for speed against the cost of cache memory by keeping the size of cache memory relatively small when compared to that of main memory.
  • cache touch instruction serves as a signal to the memory controller to prefetch information from main memory to cache memory.
  • the compiler inserts one or more cache touch instructions in the instruction stream to prefetch the needed information into the cache. In most cases, more than one cache touch instruction is needed to prefetch enough extended blocks of data into the cache to benefit a system with cache capabilities.
  • the hardware has the capability of prefetching each touched block of data or instructions into the cache memory in parallel with the execution of the instructions following the cache touch instruction. While the use of OOP and cache memory (with the associated use of cache touch instructions) have each improved overall computer system technology, their combined use in a single system has yet to be fully optimized. This lack of optimization exists because object oriented programs do not fully benefit from typical cache touch instructions. Standard cache touch instructions are designed to prefetch only small portions of data at a time, which means that large portions of objects are not brought into cache memory even though those portions may well be needed by instructions that are about to execute on the computer system's processor. More advanced cache instructions have been designed to prefetch larger amounts of data.
  • an advantage of this invention to provide a cache multi-block touch mechanism for object oriented computer systems that is capable of successfully and optimally operating with any type or size of cache memory. It is another advantage of this invention to provide a cache multi-block touch mechanism for object oriented computer systems that successfully prefetches multiple cache lines without the compiler having to issue multiple touch instructions, regardless of the line size of the cache memory. It is a further advantage of the present invention to provide a cache multi-block touch mechanism that allows for both data cache and instruction cache multi-block touch instructions.
  • an object-oriented computer apparatus for generating a first instruction stream executable on a processing unit from a second instruction stream.
  • the computer apparatus comprises a multi-block cache touch instruction generator for generating and inserting a multi-block cache touch instruction into the first instruction stream in at least one location within the first instruction stream where prefetching multiple blocks of object data and code into the cache memory is advantageous.
  • the execution of the multi-block cache touch instruction by the processing unit causes a prefetch of at least one of a plurality of multiple blocks of data and code from a main memory into a plurality of cache lines of a cache memory.
  • These multi-block touch instructions indicate the beginning address of a desired code in main memory and the size of the block to be prefetched.
  • the memory controller will examine the amount of code/data requested and determine how many lines must be brought into cache to satisfy the request. Thus, multiple cache lines may be prefetched with only one touch instruction.
  • FIG. 1 is a block diagram of a computer system according to an embodiment of the present invention
  • FIG. 2 is a block diagram of a typical region of the main memory of FIG. 1;
  • FIG. 3 is a block diagram of a cache that has a line size of 8 bytes;
  • FIG. 4 shows an example of prior art touch instructions that are configured and used for the 8-byte-length cache line of FIG. 3
  • FIG. 5 is a block diagram of a cache that has a line size of 4 bytes
  • FIG. 6 shows an example of prior art touch instructions that are configured for an 8-byte-length cache line but are used in the 4-byte-length cache line of FIG. 5;
  • FIG. 7 is a block diagram of a cache that has a line size of 16 bytes;
  • FIG. 8 shows an example of prior art touch instructions that are configured for an 8-byte-length cache line but are used in the 16-byte-length cache line of FIG. 7;
  • FIG. 9 is a block diagram of the multi-block cache touch instruction used with the multi-block cache touch mechanism of the present invention, according to the preferred embodiment;
  • FIG. 10 is a block diagram of a second cache that has a line size of 8 bytes
  • FIG. 11 shows the results of using the present invention with the cache of FIG. 10;
  • FIG. 12 is a block diagram of a second cache that has a line size of 4 bytes;
  • FIG. 13 shows the results of using the present invention with the cache of FIG. 12;
  • FIG. 14 is a block diagram of a second cache that has a line size of 16 bytes.
  • FIG. 15 shows the results of using the present invention with the cache of FIG. 14. Description of the Preferred Embodiments
  • objects can be thought of as autonomous agents that work together to perform the tasks required by a computer system.
  • a single object represents an individual operation or a group of operations that are performed by a computer system upon information controlled by the object.
  • the operations of objects are called “methods” and the information controlled by objects is called “object data” or just “data.”
  • Objects are created (i.e., "instantiated") as instances of something called a "class.” Classes define the data that will be controlled by their instances and the methods that will provide access to that data.
  • Computer programs are constructed using one or more programming languages. Like words written in English, a programming language is used to write a series of statements that have particular meaning to the drafter (i.e., the programmer) . The programmer first drafts a computer program in human readable form (called source code) prescribed by the programming language, resulting in a source code instruction
  • compiler refers to any mechanism that transforms one representation of a computer program into another representation of that program.
  • the object code within this specification, is a stream of binary instructions (i.e., ones and zeros) that are meaningful to the computer.
  • Compilers generally translate each source code statement in the source code instruction stream into one or more intermediate language instructions, which are then converted into corresponding object code instructions.
  • Special compilers, called optimizing compilers, typically operate on the intermediate language instruction stream to make it perform better (e.g., by eliminating unneeded instructions, etc.).
  • Some optimizing compilers are wholly separate while others are built into a primary compiler (i.e., the compiler that converts the source code statements into object code) to form a multi-pass compiler.
  • multi-pass compilers first operate to convert source code into an instruction stream in an intermediate language understood only by the compiler (i.e., as a first pass or stage) and then operate on the intermediate language instruction stream to optimize it and convert it into object code (i.e., as a second pass or stage) .
  • a compiler may reside within the memory of the computer which will be used to execute the object code, or may reside on a separate computer system. Compilers that reside on one computer system and are used to generate machine code for other computer systems are typically called "cross compilers.” The methods and apparatus discussed herein apply to all types of compilers, including cross compilers.
  • information may be prefetched into a cache memory due to specific requirements of the hardware architecture, or by the processor executing a special command or instruction stream indicating its desire for a prefetch.
  • Modern compilers typically generate cache touch instructions to tell the memory controller (or memory subsystem) when to prefetch information into a cache memory.
  • Prior art compilers must make an assumption based on the cache line size on the target platform. This assumption results in code that runs less efficiently on targets with cache line sizes more or less than the assumed cache line size.
  • FIG. 1 shows a block diagram of the computer system 100 in accordance with a preferred embodiment of the present invention.
  • the computer system 100 of the preferred embodiment is an enhanced IBM AS/400 mid-range computer system.
  • Computer system 100 suitably comprises a processor 110, main memory 120, a memory controller 130, an auxiliary storage interface 140, a terminal interface 150, instruction cache memory 160 and data cache memory 170, all of which are interconnected via a system bus 180.
  • processor 110 main memory 120
  • memory controller 130 an auxiliary storage interface 140
  • terminal interface 150 instruction cache memory 160 and data cache memory 170
  • FIG. 1 is presented to simply illustrate some of the salient features of computer system 100.
  • Processor 110 performs computation and control functions of computer system 100, and comprises a suitable central processing unit.
  • Processor 110 may comprise a single integrated circuit, such as a microprocessor, or may comprise any suitable number of integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of a processor.
  • Processor 110 suitably executes an instruction stream 124 within main memory 120.
  • Auxiliary storage interface 140 is used to allow computer system 100 to store and retrieve information from auxiliary storage, such as magnetic disk (e.g., hard disks or floppy diskettes) or optical storage devices (e.g., CD-ROM).
  • Memory controller 130, through use of a processor separate from processor 110, is responsible for moving requested information from main memory 120 and/or through auxiliary storage interface 140 to instruction cache 160, data cache 170 and/or processor 110. While for the purposes of explanation, memory controller 130 is shown as a separate entity, those skilled in the art understand that, in practice, portions of the function provided by memory controller 130 may actually reside in the circuitry associated with processor 110, main memory 120, instruction cache 160, data cache 170, and/or auxiliary storage interface 140.
  • Terminal interface 150 allows system administrators and computer programmers to communicate with computer system 100, normally through programmable workstations.
  • system 100 depicted in FIG. 1 contains only a single main processor 110 and a single system bus 180, it should be understood that the present invention applies equally to computer systems having multiple processors and/or multiple system buses.
  • system bus 180 of the preferred embodiment is a typical hardwired, multidrop bus, any connection means that supports bi-directional communication could be used.
  • Main memory 120 contains optimizing compiler 122, source code instruction stream 123, machine code instruction stream 124, application programs 126, and operating system 128. It should be understood that main memory 120 will not necessarily contain all parts of all mechanisms shown.
  • portions of application programs 126 and operating system 128 may be loaded into instruction cache 160 for processor 110 to execute, while other files may well be stored on magnetic or optical disk storage devices (not shown) .
  • compiler 122 may generate a machine code instruction stream 124 that is intended to be executed on a different computer system if compiler 122 is a cross-compiler. It is also to be understood that any appropriate computer system memory may be used in place of main memory 120.
  • Instruction cache 160 contains instructions/blocks of code from main memory 120 for processor 110 to readily access and use.
  • data cache 170 contains blocks of data from main memory 120 for processor 110 to readily access and use. It should be understood that even though data cache 170 is separate from instruction cache 160 in the preferred embodiment of the present invention, both caches may be combined to form a single unit.
  • FIG. 2 is a block diagram of a typical region 210 of main memory 120 as used in the present invention.
  • a region in main memory is generally made up of blocks of data 212 (or instruction code) .
  • related blocks of data may be stored in fragmented sections in main memory, they may be thought of conceptually as a contiguous stream of blocks in a region as shown in FIG. 2. Effectively, the region 210 of related data blocks 212 will be a given size, with an address indicating the beginning of the related data.
  • a region of main memory may contain the instructions/data of an entire object's functions, or significant portions of such functions, which will benefit the computer system when it may be prefetched through the mechanism of the present invention.
  • FIG. 3 shows a cache memory that has a line size of 8 bytes, that is, each cache line (0-N) is made up of 8 bytes (0-7) .
  • blocks of data D are prefetched from memory to cache lines through cache touch instructions. Although data D is shown occupying cache lines 0-2, it is to be understood that other cache lines may be used.
  • A1, A2, A3, and A4 are locations addressed by cache touch instructions, and will be discussed later in reference to FIG. 4.
  • FIG. 4 illustrates the touch instructions and corresponding prefetched lines of a prior art system, wherein the compiler correctly assumes the length of the cache line of FIG. 3 to be 8 bytes long.
  • the touch instructions T1-T4 correspond to the touched addresses A1-A4 shown in the cache memory of FIG. 3.
  • the processor will execute a cache touch instruction for each assumed cache line to be prefetched, that is, touch instruction T1 is executed for the beginning address (A1), T2 for 8 bytes after the beginning address (A2), T3 for 16 bytes after the beginning address (A3), and T4 for the end address (A4).
  • Touch instructions T1, T2 and T3 cause the memory subsystem to prefetch cache lines 0, 1 and 2, respectively (i.e., prefetch blocks of data from main memory to occupy cache lines 0, 1 and 2).
  • the end address touch instruction T4 does not prefetch any lines, but has to be issued because the compiler cannot accurately predict the alignment of the data object or blocks of data with respect to cache line boundaries. Hence, a compiler cannot determine the number of cache lines that the object or related blocks of data may straddle and will need a touch instruction to prefetch the last block of data if it happens to fall on a new cache line. In this case, the cache touch instruction T4 is an unnecessary instruction, and uses the processor's resources inefficiently.
  • FIGS. 5-8 demonstrate the detrimental effects of prefetching in prior art systems when the compiler assumes a cache line length that is either greater or less than the actual cache line length.
  • FIG. 5 shows a cache with lines that are 4 bytes long, that is, each cache line (0-N) is made up of 4 bytes (0-3) .
  • Data D is shown occupying cache lines 0-5.
  • A1, A2, A3, and A4 are locations addressed by touch instructions that are used to prefetch data. Both the data D and the touched address locations of FIG. 5 will be discussed in reference to FIG. 6.
  • FIG. 6 shows the touch instructions and prefetched lines of a prior art system used with the cache of FIG. 5, wherein the compiler assumes that the cache line is 8 bytes long.
  • touch instruction T1 is executed for the beginning address (A1), T2 for 8 bytes after the beginning address (A2), T3 for 16 bytes after the beginning address (A3), and T4 for the end address (A4) (i.e., the same touches as shown in FIG. 4 since an 8-byte-long cache line is assumed).
  • Touch instruction T1 prefetches line 0, and then, after advancing 8 bytes, T2 will prefetch the corresponding cache line, which is line 2, not line 1. Again the compiler advances 8 bytes and prefetches the corresponding cache line for instruction T3, which is cache line 4. The touch instruction for the end address will then prefetch cache line 5.
  • FIG. 7 is a block diagram of a cache with lines that are 16 bytes long, that is, each cache line (0-N) is made up of 16 bytes (0-15).
  • Data D is shown occupying cache lines 0 and 1.
  • A1, A2, A3, and A4 are locations addressed by touch instructions that are used to prefetch data. Both the data D and the touched address locations of FIG. 7 are discussed in reference to FIG. 8.
  • FIG. 8 shows the touch instructions and prefetched lines of a prior art system used with the cache of FIG. 7, wherein the compiler assumes that the cache line is 8 bytes long, an assumption that is less than the actual 16-byte-length cache line of FIG. 7.
  • the touched addresses A1-A4 of FIG. 7 correspond with touch instructions T1-T4.
  • the processor will execute a cache touch instruction for each assumed cache line to be prefetched, that is, touch instruction T1 is executed for the beginning address (A1), T2 for 8 bytes after the beginning address (A2), T3 for 16 bytes after the beginning address (A3), and T4 for the end address (A4) (i.e., the same touches as shown in FIGS. 4 and 6 since an 8-byte-long cache line is assumed).
  • FIG. 9 is a block diagram of the multi-block cache touch instruction in accordance with the present invention.
  • the instruction comprises an op code field 310, an address field 312, and a size field 314.
  • the op code field 310 distinguishes the multi-block cache touch instruction from other instructions.
  • the op code field 310 allows the touch instruction to be placed within the hardware instruction set.
  • the addressing field 312 indicates the beginning of the code or data block to be prefetched from memory. Those skilled in the art will appreciate that the addressing field may be generated through a variety of different methods.
  • Some of these methods include, but are not limited to: denoting, in a field of the instruction, a register as containing the starting address; denoting, through two fields in the instruction, registers containing a base address and an offset for constructing the starting address; or, for loading blocks of code, using an offset from the current instruction pointer.
  • For the particular case of an IBM PowerPC processor (and for other RISC architectures), the first of these methods could easily be retrofitted into the existing data cache block touch (dcbt) instruction, placing the size field in the unused target register field.
  • the size field 314 indicates how much memory is to be prefetched into the cache.
  • the size field is measured in units that are independent of the size of a cache line, since platforms with different size cache lines may have to execute the same code. For example, the size field might denote the number of 32-byte memory blocks that are to be transferred.
  • a cache with a 128-byte line size would examine the starting address and the size field and determine how many 128-byte blocks must be brought into the cache to satisfy the prefetch request.
  • a size field entry of 3 (i.e., three 32-byte memory blocks) might then require either one or two cache lines to be prefetched, depending on the location of the starting address within its cache line.
  • FIGS. 10-15 demonstrate how the multi-block cache mechanism and instruction of the present invention operate in contrast to the prior art mechanism and touch instruction as shown in FIGS. 3-8.
  • FIG. 10 illustrates a block diagram of an 8-byte-length cache line, which corresponds to the cache of FIG. 3. However, only one address A1 is touched (in contrast to the four touched addresses shown in FIG. 3) because only one touch instruction T1 is needed, as described in reference to FIG. 11.
  • Each cache line (0-N) is made up of 8 bytes (0-7) and data D is shown occupying cache lines 0-2.
  • FIG. 11 shows the results of using the multi-block cache touch instruction of the present invention with the 8- byte-length cache line of FIG. 10.
  • the touched address A1 of FIG. 10 corresponds to the cache touch instruction T1.
  • the multi-block cache touch instruction generator may be, for example, the compiler or a programmer.
  • the cache management logic will determine the blocks of data or code in a computer system memory to be prefetched and the corresponding number of cache lines to be preloaded. The blocks are then prefetched directly from the computer system memory into the cache lines without any intermediate processing of the blocks, such as with relocation of data or unnecessary manipulations of the blocks, thus optimizing the prefetching of blocks of data or code, or objects for OOP.
  • the size field denotes 19 bytes of data to be prefetched.
  • the cache memory examines the starting address and the size field of the touch instruction and determines that three cache lines worth of data need to be brought into the cache to satisfy the prefetch request. The cache then prefetches lines 0, 1 and 2 from the memory subsystem. Unlike the prior art examples, which needed four cache touch instructions (see FIGS. 3 and 4), the present invention only requires one cache touch instruction.
  • the cache management logic would be augmented with a "sequencer" to control the preloading of multiple cache lines.
  • Upon receiving a request, registers in the sequencer would be initialized with the address of the first cache line to be preloaded and the number of cache lines to load. The sequencer would then process cache lines sequentially as follows (a sketch of such a sequencer appears after this list). The sequencer first determines if the currently addressed memory block is already in the cache. If so, no further action is required for this block, so the sequencer increments its address register by the size of a cache line and decrements the register containing the number of cache lines to process.
  • If not, the sequencer issues a memory transfer request to prefetch the currently addressed memory block into the cache, again followed by incrementing its address register and decrementing the number of cache lines to be processed. This process continues until all requested blocks have been found in the cache or are in the process of being loaded.
  • the present invention may be implemented through a variety of appropriate methods, and is not limited to the method described above.
  • the present invention may also be extended by other features, such as the ability for a processor to issue a second cache request before the first multi-block prefetch has completed.
  • the cache may respond to such a second request in a number of ways.
  • a contiguous effective address range (such as that implied by the multi-block touch instruction's address and size) may not map onto a contiguous real (physical main storage) address range when a page boundary is crossed (for example, a 4 KB page boundary).
  • the memory subsystem then has the option of either: 1) prefetching only within a page; or 2) translating the effective address at the page boundary in order to reach the next real page, and thus prefetching across page boundaries (option 1 is the behavior shown in the sequencer sketch after this list).
  • An additional feature for extending the present invention includes the processor implementing a "history mechanism" that is associated with the touched address A1.
  • a history mechanism would remember the actual cache lines that were subsequently referenced by the processor ("useful") after the prior execution of the multi-block touch to a given location. With this history, a subsequent multi-block touch to that location would only fetch memory blocks into cache blocks that were useful on the prior multi-block touch. That is, the cache management logic of the cache memory would prefetch only a subset of the blocks of data or code corresponding to the multi-block cache touch instruction, where the subset consists of one or more blocks of data or code that were used by the processor after a previous issuance of the multi-block cache touch instruction.
  • One example of using the history mechanism is as follows.
  • suppose a multi-block touch prefetches cache lines 0 through 4, but the processor subsequently references only lines 0 and 3. The history of that multi-block touch instruction would be maintained by the cache management logic of the memory subsystem. Then, a subsequent execution of the multi-block touch instruction would only bring in lines 0 and 3 rather than all of the lines 0, 1, 2, 3, and 4.
  • the history mechanism may be implemented by means of a confirmation vector, which would associate one bit (initially zero) with each prefetched cache line. Each reference by the processor to a cache line would cause that cache line's bit in the confirmation vector to be set. At the next issuance of the multi-block cache touch instruction, only those cache lines with set bits in the confirmation vector would actually be prefetched.
  • the confirmation vector scheme might be further extended to use a consensus vote, in which the history of the previous N issuances of a multi-block cache touch instruction is remembered, and in which a cache line is prefetched only if it was found to be useful after a majority of the previous N issuances (see the history sketch after this list).
  • FIG. 12 illustrates a 4-byte-length cache line corresponding to the cache memory of FIG. 5. That is, each cache line (0-N) is made up of 4 bytes 0-3. Data D is shown occupying cache lines 0-5. A1 is the touched address. Data D and A1 are discussed in reference to FIG. 13.
  • FIG. 13 shows the results of using the multi-block cache touch instruction of the present invention with the 4- byte-length cache line of FIG. 12.
  • the touch address A1 of FIG. 12 corresponds with touch instruction T1.
  • the corresponding prior art mechanism was only able to prefetch lines 0, 2, 4, and 5 when its compiler assumed a cache line 8 bytes long (see FIG. 6). Even if the prior art mechanism had correctly assumed a cache line 4 bytes long, it would have taken six cache touch instructions to prefetch lines 0-5.
  • the multi-block touch mechanism of the present invention prefetches all necessary cache lines (i.e., lines 0, 1, 2, 3, 4 and 5) with only one touch instruction T1.
  • the present invention uses the processor's resources effectively when prefetching data into cache memory.
  • FIG. 14 illustrates a 16-byte-length cache line corresponding to the cache memory of FIG. 7. That is, each cache line (0-N) is made up of 16 bytes 0-15. Data D is shown occupying cache lines 0-1. Unlike FIG. 7, A1 is the only touched address shown, and will be discussed in reference to FIG. 15.
  • FIG. 15 shows the results of using the multi-block touch instruction of the present invention with the 16-byte-length cache line of FIG. 14.
  • the touched address A1 of FIG. 14 corresponds with the touch instruction T1.
  • unnecessary touch instructions are generated because: first, the compiler assumed the incorrect length of a cache line; and second, the prior art mechanism issued an end address touch instruction in case the data was misaligned in the cache.
  • the multi-block touch mechanism of the present invention prefetches cache lines 0 and 1 with only one touch instruction T1. Additionally, the present invention avoids the unnecessary cache touch instructions that result from potential misalignment of the block within the cache. This is possible because the cache, not the compiler, determines the number of cache lines that a block of data or object may straddle by examining the size field of the touch instruction of the present invention. Hence, memory space and processing time are not wasted when prefetching data from a region of memory.
  • the multi-block touch mechanism of the present invention is very efficient and beneficial when prefetching data blocks into a data cache memory.
  • the multi-block touch mechanism may also be used, as efficiently, for prefetching code into an instruction cache memory, and is especially useful for prefetching entire object functions, or significant portions of such functions, into an instruction cache.
  • the multi-block touch instruction is less applicable to small functions and small object methods, though that can be mitigated by grouping related functions and methods that have an affinity to each other in contiguous memory blocks, and performing the multi-block touch on groups of functions and object methods.
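The two hardware mechanisms referenced above can be made concrete with short sketches. First, the sequencer: the following C sketch shows how cache management logic might derive the number of lines from the starting address and size field and then walk the request. All type and function names are invented for illustration, the size field is assumed to count 32-byte units as in the earlier example, and page-boundary handling follows option 1 (prefetching only within a page); this is a sketch under those assumptions, not the patent's circuit.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumed 4 KB page, as in the example above */

/* Hypothetical view of the cache presented to the sequencer. */
typedef struct {
    unsigned line_size;                        /* bytes per cache line    */
    bool (*line_present)(uint64_t line_addr);  /* block already cached?   */
    void (*request_fill)(uint64_t line_addr);  /* start a memory transfer */
} cache_t;

/* Service one multi-block touch; size_units is the instruction's size
 * field, measured in 32-byte blocks independent of the line size. */
static void sequencer_run(const cache_t *c, uint64_t start, unsigned size_units)
{
    uint64_t bytes = (uint64_t)size_units * 32;
    uint64_t addr  = start - (start % c->line_size);  /* first line address */
    /* Lines needed = ceil((offset within first line + bytes) / line size). */
    unsigned nlines = (unsigned)(((start % c->line_size) + bytes
                                  + c->line_size - 1) / c->line_size);

    while (nlines-- > 0) {                /* count register decrements     */
        /* Option 1 above: stop rather than cross a page boundary, since
         * the next effective page may map to a different real page. */
        if (addr / PAGE_SIZE != start / PAGE_SIZE)
            break;
        if (!c->line_present(addr))       /* skip lines already in cache   */
            c->request_fill(addr);
        addr += c->line_size;             /* address register advances     */
    }
}
```

Second, the history mechanism: a minimal sketch of the confirmation vector with the consensus-vote extension. A real design would index a small table of these entries by the touched address A1; the single-entry structure and all names below are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LINES 32  /* assumed upper bound on lines per multi-block touch  */
#define HISTORY_N 3   /* remember the last N issuances for the consensus vote */

typedef struct {
    uint32_t confirm[HISTORY_N]; /* one confirmation vector per issuance;    */
    unsigned cur;                /* bit i set => line i was referenced after */
} touch_history_t;               /* that issuance of the touch               */

/* Called when the processor references the i-th prefetched cache line. */
static void mark_useful(touch_history_t *h, unsigned i)
{
    if (i < MAX_LINES)
        h->confirm[h->cur] |= 1u << i;
}

/* At the next issuance, prefetch line i only if it was useful after a
 * majority of the previous N issuances. */
static bool should_prefetch(const touch_history_t *h, unsigned i)
{
    unsigned votes = 0;
    for (unsigned k = 0; k < HISTORY_N; k++)
        votes += (h->confirm[k] >> i) & 1u;
    return votes * 2 > HISTORY_N;  /* strict majority of the last N */
}

/* Called when the multi-block touch is issued again: rotate to a fresh,
 * all-zero confirmation vector for the new issuance. */
static void new_issuance(touch_history_t *h)
{
    h->cur = (h->cur + 1) % HISTORY_N;
    h->confirm[h->cur] = 0;
}
```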

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An object-oriented computer apparatus (100) generates a first instruction stream (124) by using a multi-block cache touch instruction generator (122) to insert a multi-block cache touch instruction (300) into a second instruction stream (123). The execution of the multi-block cache touch instruction by the processing unit (110) causes a prefetch of at least one of a plurality of multiple blocks of data or code from a main memory (120) into a plurality of cache lines of a cache memory. These multi-block touch instructions indicate the beginning address of a desired block in computer system memory and the amount of the block to be prefetched into cache. In response, the cache will prefetch multiple lines.

Description

Cache Multi-Block Touch Mechanism for Object Oriented Computer System
Field of the Invention This invention generally relates to computer systems.
More specifically, this invention relates to mechanisms for prefetching information into a cache memory within a computer system.
Background of the Invention The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Two of the more basic components of the EDVAC system which are still found in today's systems are the computer system's memory and processor. The computer system's memory is where the computer system's programs and data are stored, and the computer system's processor is the entity that is responsible for executing the programs that are stored in its memory.
Since 1948, overall computer system technology has progressed on several fronts. One area of dramatic improvement involves the evolution of computer programs. The EDVAC system used what was called a "one address" computer programming language. This language (in combination with other factors, such as limited memory, lack of proper tools, etc.) allowed for only the most rudimentary computer programs. By the 1960s, improvements in computer programming languages led to computer programs that were so large and complex that it was difficult to manage and control their development and maintenance.
Therefore, the focus of the 1970s was on developing programming methodologies and environments that could better accommodate the increasing complexity and cost of large computer programs. One such methodology is called Object Oriented Programming (OOP) . Though it has been some time since the fundamental notions of OOP were first developed, OOP systems are becoming more and more prevalent because it is felt that use of OOP can greatly increase the efficiency of computer programmers.
Another area of progress is in the development of a special computer system memory called cache memory. Cache memory is special because a processor can retrieve information from cache memory much faster than it can from standard memory (called main memory) . However, this speed is not without cost. Cache memory is significantly more expensive than main memory. Consequently, computer system designers balance the need for speed against the cost of cache memory by keeping the size of cache memory relatively small when compared to that of main memory.
The key, then, is to make sure that small but fast cache memory always contains the information needed by the processor. However, since cache memory is typically much smaller than main memory, the computer system must be able to move information from the slower main memory into the faster cache memory before the information is needed by the processor. Modern compilers enhance the effectiveness of the cache memory by generating a special instruction known as a cache touch instruction. The cache touch instruction serves as a signal to the memory controller to prefetch information from main memory to cache memory. Thus, when certain data or instructions are needed in the near future, the compiler inserts one or more cache touch instructions in the instruction stream to prefetch the needed information into the cache. In most cases, more than one cache touch instruction is needed to prefetch enough extended blocks of data into the cache to benefit a system with cache capabilities. In the event of a cache miss, the hardware has the capability of prefetching each touched block of data or instructions into the cache memory in parallel with the execution of the instructions following the cache touch instruction. While the use of OOP and cache memory (with the associated use of cache touch instructions) have each improved overall computer system technology, their combined use in a single system has yet to be fully optimized. This lack of optimization exists because object oriented programs do not fully benefit from typical cache touch instructions. Standard cache touch instructions are designed to prefetch only small portions of data at a time, which means that large portions of objects are not brought into cache memory even though those portions may well be needed by instructions that are about to execute on the computer system's processor. More advanced cache instructions have been designed to prefetch larger amounts of data. One such cache touch instruction is presented in the article "Data Relocation and Prefetching for Programs with Large Data Sets" by Yamada et al. The Yamada article discusses prefetching data for numerical applications. This is because numerical applications frequently contain nested loop structures that involve the processing of large arrays of data. The Yamada mechanism uses special hardware that is combined with a process that relocates array elements in memory and then arranges them in sequential fashion within cache memory. Relocating array elements in sequential fashion is valuable for numerical processing applications because array access speeds are improved. However, the Yamada reference would not work well and would be far less than optimal for prefetching objects because of its relocation aspect. Each object would have to be relocated as it was prefetched into cache memory, and then returned back to its original location after computation is finished.
Without a cache touch mechanism that is specifically designed and optimized for use with object-oriented programming environments, the computer industry will never fully benefit from the advantages offered by OOP.
Summary of the Invention
It is, therefore, an advantage of this invention to provide a cache multi-block touch mechanism for object oriented computer systems that is capable of successfully and optimally operating with any type or size of cache memory. It is another advantage of this invention to provide a cache multi-block touch mechanism for object oriented computer systems that successfully prefetches multiple cache lines without the compiler having to issue multiple touch instructions, regardless of the line size of the cache memory. It is a further advantage of the present invention to provide a cache multi-block touch mechanism that allows for both data cache and instruction cache multi-block touch instructions.
According to the present invention, an object- oriented computer apparatus is disclosed for generating a first instruction stream executable on a processing unit from a second instruction stream. The computer apparatus comprises a multi-block cache touch instruction generator for generating and inserting a multi-block cache touch instruction into the first instruction stream in at least one location within the first instruction stream where prefetching multiple blocks of object data and code into the cache memory is advantageous. The execution of the multi-block cache touch instruction by the processing unit causes a prefetch of at least one of a plurality of multiple blocks of data and code from a main memory into a plurality of cache lines of a cache memory. These multi-block touch instructions indicate the beginning address of a desired code in main memory and the size of the block to be prefetched. The memory controller will examine the amount of code/data requested and determine how many lines must be brought into cache to satisfy the request. Thus, multiple cache lines may be prefetched with only one touch instruction.
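By way of a worked example (numbers assumed for illustration, not taken from the claims): if a multi-block touch requests S bytes starting at an address whose offset within its cache line is F, a cache with line size L must bring in ceil((F + S) / L) lines. Three 32-byte blocks (S = 96) starting at offset F = 96 of a 128-byte line give ceil(192 / 128) = 2 lines, while the same request aligned to a line boundary (F = 0) gives ceil(96 / 128) = 1 line.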
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings.
Brief Description of the Drawings
The preferred exemplary embodiments of the present invention will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and: FIG. 1 is a block diagram of a computer system according to an embodiment of the present invention;
FIG. 2 is a block diagram of a typical region of the main memory of FIG. 1; FIG. 3 is a block diagram of a cache that has a line size of 8 bytes;
FIG. 4 shows an example of prior art touch instructions that are configured and used for the 8-byte-length cache line of FIG. 3; FIG. 5 is a block diagram of a cache that has a line size of 4 bytes;
FIG. 6 shows an example of prior art touch instructions that are configured for an 8-byte-length cache line but are used in the 4-byte-length cache line of FIG. 5; FIG. 7 is a block diagram of a cache that has a line size of 16 bytes;
FIG. 8 shows an example of prior art touch instructions that are configured for an 8-byte-length cache line but are used in the 16-byte-length cache line of FIG. 7; FIG. 9 is a block diagram of the multi-block cache touch instruction used with the multi-block cache touch mechanism of the present invention, according to the preferred embodiment;
FIG. 10 is a block diagram of a second cache that has a line size of 8 bytes;
FIG. 11 shows the results of using the present invention with the cache of FIG. 10;
FIG. 12 is a block diagram of a second cache that has a line size of 4 bytes; FIG. 13 shows the results of using the present invention with the cache of FIG. 12;
FIG. 14 is a block diagram of a second cache that has a line size of 16 bytes; and
FIG. 15 shows the results of using the present invention with the cache of FIG. 14. Description of the Preferred Embodiments
Overview
Object Oriented Technology
As discussed in the Background section, objects can be thought of as autonomous agents that work together to perform the tasks required by a computer system. A single object represents an individual operation or a group of operations that are performed by a computer system upon information controlled by the object. The operations of objects are called "methods" and the information controlled by objects is called "object data" or just "data." Objects are created (i.e., "instantiated") as instances of something called a "class." Classes define the data that will be controlled by their instances and the methods that will provide access to that data.
Statements, Instructions, Compilers
Computer programs are constructed using one or more programming languages. Like words written in English, a programming language is used to write a series of statements that have particular meaning to the drafter (i.e., the programmer) . The programmer first drafts a computer program in human readable form (called source code) prescribed by the programming language, resulting in a source code instruction
(or statement) stream. The programmer then uses mechanisms that change the source code of the computer program into a form that can be understood by a computer system (called machine readable form, or object code) . These mechanisms are typically called compilers; however, it should be understood that the term "compiler", as used within this specification, generically refers to any mechanism that transforms one representation of a computer program into another representation of that program.
The object code, within this specification, is a stream of binary instructions (i.e., ones and zeros) that are meaningful to the computer. Compilers generally translate each source code statement in the source code instruction stream into one or more intermediate language instructions, which are then converted into corresponding object code instructions. Special compilers, called optimizing compilers, typically operate on the intermediate language instruction stream to make it perform better (e.g., by eliminating unneeded instructions, etc.). Some optimizing compilers are wholly separate while others are built into a primary compiler (i.e., the compiler that converts the source code statements into object code) to form a multi-pass compiler. In other words, multi-pass compilers first operate to convert source code into an instruction stream in an intermediate language understood only by the compiler (i.e., as a first pass or stage) and then operate on the intermediate language instruction stream to optimize it and convert it into object code (i.e., as a second pass or stage).
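As a concrete illustration of these stages (the function and the lowered forms below are invented examples, not taken from the patent), a single source code statement might pass through an intermediate form before becoming object code:

```c
/* Source code (human-readable form): */
int add_tax(int price, int tax)
{
    return price + tax;
}

/* A compiler might first lower this to intermediate language such as
 *     t1 = param price
 *     t2 = param tax
 *     t3 = t1 + t2
 *     ret t3
 * and then emit object code: binary encodings of machine instructions,
 * shown here as RISC-style mnemonics for readability:
 *     add r3, r3, r4   ; arguments arrive in r3/r4, result returns in r3
 *     blr              ; return
 * An optimizing pass works on the intermediate form, for example deleting
 * the copies through t1 and t2, before the object code is produced. */
```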
A compiler may reside within the memory of the computer which will be used to execute the object code, or may reside on a separate computer system. Compilers that reside on one computer system and are used to generate machine code for other computer systems are typically called "cross compilers." The methods and apparatus discussed herein apply to all types of compilers, including cross compilers.
Cache Prefetch Mechanisms
Information may be prefetched into a cache memory due to specific requirements of the hardware architecture, or by the processor executing a special command or instruction stream indicating its desire for a prefetch. Modern compilers typically generate cache touch instructions to tell the memory controller (or memory subsystem) when to prefetch information into a cache memory. Prior art compilers must make an assumption about the cache line size on the target platform. This assumption results in code that runs less efficiently on targets whose cache line size is larger or smaller than the assumed size.
Detailed Description
FIG. 1 shows a block diagram of the computer system 100 in accordance with a preferred embodiment of the present invention. The computer system 100 of the preferred embodiment is an enhanced IBM AS/400 mid-range computer system. However, those skilled in the art will appreciate that the mechanisms and apparatus of the present invention apply equally to any computer system, regardless of whether the computer system is a complicated multi-user computing apparatus or a single user device such as a personal computer or workstation. Computer system 100 suitably comprises a processor 110, main memory 120, a memory controller 130, an auxiliary storage interface 140, a terminal interface 150, instruction cache memory 160 and data cache memory 170, all of which are interconnected via a system bus 180. Note that various modifications, additions, or deletions may be made to the computer system 100 illustrated in FIG. 1 within the scope of the present invention such as the addition of other peripheral devices; FIG. 1 is presented to simply illustrate some of the salient features of computer system 100.
Processor 110 performs computation and control functions of computer system 100, and comprises a suitable central processing unit. Processor 110 may comprise a single integrated circuit, such as a microprocessor, or may comprise any suitable number of integrated circuit devices and/or circuit boards working in cooperation to accomplish the functions of a processor. Processor 110 suitably executes an instruction stream 124 within main memory 120.
Auxiliary storage interface 140 is used to allow computer system 100 to store and retrieve information from auxiliary storage, such as magnetic disk (e.g., hard disks or floppy diskettes) or optical storage devices (e.g., CD-ROM). Memory controller 130, through use of a processor separate from processor 110, is responsible for moving requested information from main memory 120 and/or through auxiliary storage interface 140 to instruction cache 160, data cache 170 and/or processor 110. While for the purposes of explanation, memory controller 130 is shown as a separate entity, those skilled in the art understand that, in practice, portions of the function provided by memory controller 130 may actually reside in the circuitry associated with processor 110, main memory 120, instruction cache 160, data cache 170, and/or auxiliary storage interface 140.
Terminal interface 150 allows system administrators and computer programmers to communicate with computer system 100, normally through programmable workstations. Although the system 100 depicted in FIG. 1 contains only a single main processor 110 and a single system bus 180, it should be understood that the present invention applies equally to computer systems having multiple processors and/or multiple system buses. Similarly, although the system bus 180 of the preferred embodiment is a typical hardwired, multidrop bus, any connection means that supports bi-directional communication could be used.
It is important to note that while the present invention has been (and will continue to be) described in the context of a fully functional computer system, those skilled in the art will appreciate that the mechanisms and use of such mechanisms of the present invention are capable of being distributed as a program product in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of signal bearing media include: recordable type media such as floppy disks and CD ROMs and transmission type media such as digital and analogue communications links. Main memory 120 contains optimizing compiler 122, source code instruction stream 123, machine code instruction stream 124, application programs 126, and operating system 128. It should be understood that main memory 120 will not necessarily contain all parts of all mechanisms shown. For example, portions of application programs 126 and operating system 128 may be loaded into instruction cache 160 for processor 110 to execute, while other files may well be stored on magnetic or optical disk storage devices (not shown) . In addition, compiler 122 may generate a machine code instruction stream 124 that is intended to be executed on a different computer system if compiler 122 is a cross-compiler. It is also to be understood that any appropriate computer system memory may be used in place of main memory 120.
Application programs 126 have been designed to be object-oriented in nature and thus will comprise objects for object-oriented applications. However, it is important to note that while the present invention has been described in the context of object-oriented applications, those skilled in the art will appreciate that uses of the mechanism of the present invention are also directly applicable to any contiguous blocks or regions of data and/or code in memory. Instruction cache 160 contains instructions/blocks of code from main memory 120 for processor 110 to readily access and use. Similarly, data cache 170 contains blocks of data from main memory 120 for processor 110 to readily access and use. It should be understood that even though data cache 170 is separate from instruction cache 160 in the preferred embodiment of the present invention, both caches may be combined to form a single unit.
The remainder of this specification describes how the present invention improves the performance of computer system 100 by first describing the performance of a computer system under the prior art system and then contrasting the improvements of the computer system's performance when using the multi-block cache touch mechanism of the present invention. Those skilled in the art will appreciate that the present invention applies equally to any suitable computer system.
FIG. 2 is a block diagram of a typical region 210 of main memory 120 as used in the present invention. A region in main memory is generally made up of blocks of data 212 (or instruction code). Although related blocks of data may be stored in fragmented sections in main memory, they may be thought of conceptually as a contiguous stream of blocks in a region as shown in FIG. 2. Effectively, the region 210 of related data blocks 212 will be a given size, with an address indicating the beginning of the related data. For a computer running object oriented code, a region of main memory may contain the instructions/data of an entire object's functions, or significant portions of such functions, which will benefit the computer system when it may be prefetched through the mechanism of the present invention.
FIG. 3 shows a cache memory that has a line size of 8 bytes, that is, each cache line (0-N) is made up of 8 bytes (0-7). As discussed previously, blocks of data D are prefetched from memory to cache lines through cache touch instructions. Although data D is shown occupying cache lines 0-2, it is to be understood that other cache lines may be used. A1, A2, A3, and A4 are locations addressed by cache touch instructions, and will be discussed later in reference to FIG. 4.
FIG. 4 illustrates the touch instructions and corresponding prefetched lines of a prior art system, wherein the compiler correctly assumes the length of the cache line of FIG. 3 to be 8 bytes long. The touch instructions T1-T4 correspond to the touched addresses A1-A4 shown in the cache memory of FIG. 3. The processor will execute a cache touch instruction for each assumed cache line to be prefetched, that is, touch instruction Tl is executed for the beginning address (Al) , T2 for 8 bytes after the beginning address (A2) , T3 for 16 bytes after the beginning address (A3) , and T4 for the end address (A4) . Touch instructions Tl, T2 and T3 cause the memory subsystem to prefetch cache lines 0, 1 and 2, respectively, (i.e., prefetch blocks of data from main memory to occupy cache lines 0, 1 and 2) . The end address touch instruction T4 does not prefetch any lines, but has to be issued because the compiler cannot accurately predict the alignment of the data object or blocks of data with respect to cache line boundaries. Hence, a compiler cannot determine the number of cache lines that the object or related blocks of data may straddle and will need a touch instruction to prefetch the last block of data if it happens to fall on a new cache line. In this case, the cache touch instruction T4 is an unnecessary instruction, and uses the processor's resources inadequately. FIGS. 5 and 6, as described below, illustrate a case where the end cache touch instruction is necessary. FIGS. 5-8 demonstrate the detrimental affects of prefetching in prior art systems when the compiler assumes a cache line length that is either greater or less than the actual cache line length. FIG. 5 shows a cache with lines that are 4 bytes long, that is, each cache line (0-N) is made up of 4 bytes (0-3) . Data D is shown occupying cache lines 0-5. Al, A2, A3, and A4 are locations addressed by touch instructions that are used to prefetch data. Both the data D and the touched address locations of FIG. 5 will be discussed in reference to FIG. 6. FIG. 6 shows the touch instructions and prefetched lines of a prior art system used with the cache of FIG. 5, wherein the compiler assumes that the cache line is 8 bytes long, an assumption that is greater than the actual 4-byte- length cache line of FIG. 5. The touched addresses A1-A4 of FIG. 5 correspond with touch instructions T1-T4. As in the previous example, the processor will execute a cache touch instruction for each assumed cache line to be prefetched, that is, touch instruction Tl is executed for the beginning address
(A1), T2 for 8 bytes after the beginning address (A2), T3 for 16 bytes after the beginning address (A3), and T4 for the end address (A4) (i.e., the same touches as shown in FIG. 4, since an 8-byte cache line is assumed). Touch instruction T1 prefetches line 0, and then, after advancing 8 bytes, T2 prefetches the corresponding cache line, which is line 2, not line 1. The address again advances 8 bytes, and T3 prefetches the corresponding cache line, which is cache line 4. The touch instruction for the end address then prefetches cache line 5. Even though data is shown on all cache lines, since this is the way memory maps into the cache, the blocks of data corresponding to cache lines 1 and 3 were not actually prefetched because they were never "touched" with a cache touch instruction. Thus, not all the related blocks of data needed in cache memory are present there, rendering the prefetch routine less valuable, since lines 1 and 3 will need to be accessed later from main memory. In this example, the end touch instruction T4 was necessary since the last block of data fell on a new cache line.
FIG. 7 is a block diagram of a cache with lines that are 16 bytes long; that is, each cache line (0-N) is made up of 16 bytes (0-15). Data D is shown occupying cache lines 0 and 1. A1, A2, A3, and A4 are locations addressed by touch instructions that are used to prefetch data. Both the data D and the touched address locations of FIG. 7 are discussed in reference to FIG. 8.
FIG. 8 shows the touch instructions and prefetched lines of a prior art system used with the cache of FIG. 7, wherein the compiler assumes that the cache line is 8 bytes long, an assumption that is less than the actual 16-byte cache line length of FIG. 7. The touched addresses A1-A4 of FIG. 7 correspond with touch instructions T1-T4. As in the previous examples, the processor will execute a cache touch instruction for each assumed cache line to be prefetched; that is, touch instruction T1 is executed for the beginning address (A1), T2 for 8 bytes after the beginning address (A2), T3 for 16 bytes after the beginning address (A3), and T4 for the end address (A4) (i.e., the same touches as shown in FIGS. 4 and 6, since an 8-byte cache line is assumed). Touch instruction T1 will prefetch line 0, T2 will prefetch line 1, and, since all the lines have then been prefetched, T3 and T4 perform no useful action. Hence, only two of the four instructions were necessary to prefetch all the data. In this case, memory space and processing time are wasted by the unneeded touch instructions.
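To make these prior art failure modes concrete, the following C sketch (illustrative only and not part of the patent; the 24-byte data size, start address, and function names are assumptions chosen to match the geometry of the figures) simulates a compiler that emits one touch per assumed cache line plus an end-address touch, each of which prefetches only the actual line containing its address:

```c
#include <stdio.h>

/* Illustrative simulation of the prior-art touch scheme: the compiler
 * emits one touch per assumed cache line plus one for the end address,
 * and each touch prefetches only the single actual line containing the
 * touched address. */
static void prior_art_touches(unsigned start, unsigned length,
                              unsigned assumed_line, unsigned actual_line)
{
    unsigned end = start + length - 1;

    for (unsigned a = start; a <= end; a += assumed_line)
        printf("touch at %2u -> actual line %u\n", a, a / actual_line);
    /* End-address touch, issued because the compiler cannot predict
     * the data's alignment with cache line boundaries. */
    printf("touch at %2u -> actual line %u\n", end, end / actual_line);
}

int main(void)
{
    /* FIG. 4 geometry: assumption correct (8-byte lines); the final
     * end-address touch hits line 2 again and is redundant. */
    prior_art_touches(0, 24, 8, 8);
    /* FIG. 6 geometry: assumed 8-byte lines, actual 4-byte lines;
     * actual lines 1 and 3 are never touched. */
    prior_art_touches(0, 24, 8, 4);
    return 0;
}
```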
A Preferred Exemplary Embodiment
FIG. 9 is a block diagram of the multi-block cache touch instruction in accordance with the present invention. The instruction comprises an op code field 310, an address field 312, and a size field 314. The op code field 310 distinguishes the multi-block cache touch instruction from other instructions and allows the touch instruction to be placed within the hardware instruction set. The address field 312 indicates the beginning of the code or data block to be prefetched from memory. Those skilled in the art will appreciate that the address field may be generated through a variety of different methods. Some of these methods include, but are not limited to: denoting, in a field of the instruction, a register as containing the starting address; denoting, through two fields in the instruction, registers containing a base address and an offset for constructing the starting address; or, for loading blocks of code, using an offset from the current instruction pointer. For the particular case of an IBM PowerPC processor (and for other RISC architectures), the first of these methods could easily be retrofitted into an existing data cache block touch (dcbt) instruction, placing the size field in the unused target register field.
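For illustration, one possible register-form encoding of the FIG. 9 instruction might be sketched in C as follows (the field widths and names are assumptions, not taken from the patent or from the PowerPC architecture):

```c
#include <stdint.h>

/* Sketch of one possible 32-bit encoding of the multi-block cache
 * touch instruction of FIG. 9.  The field widths here are illustrative
 * assumptions only. */
struct multi_block_touch {
    uint32_t opcode   : 6;   /* op code field 310: identifies the touch */
    uint32_t addr_reg : 5;   /* address field 312: register holding the
                                starting address of the region          */
    uint32_t size     : 21;  /* size field 314: amount of memory to
                                prefetch, in line-independent units     */
};
```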
The size field 314 indicates how much memory is to be prefetched into the cache. The size field is measured in units that are independent of the size of a cache line, since platforms with different cache line sizes may have to execute the same code. For example, the size field might denote the number of 32-byte memory blocks that are to be transferred. A cache with a 128-byte line size would examine the starting address and the size field and determine how many 128-byte cache lines must be brought into the cache to satisfy the prefetch request. A size field entry of 3 (i.e., three 32-byte memory blocks) might then require either one or two cache lines to be prefetched, depending on the location of the starting address within its cache line.
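As a concrete illustration of this computation, a minimal C sketch (assuming the 32-byte memory-block unit of the example above; the function and names are illustrative, not the patent's) might determine the number of lines to preload as follows:

```c
#include <stdio.h>

/* Sketch: derive how many cache lines a multi-block touch must preload,
 * given a starting address and a size field counted in 32-byte memory
 * blocks (the line-independent unit used in the example above). */
#define MEM_BLOCK_BYTES 32ul

static unsigned lines_to_preload(unsigned long start, unsigned size_field,
                                 unsigned long line_bytes)
{
    unsigned long bytes      = size_field * MEM_BLOCK_BYTES;
    unsigned long first_line = start / line_bytes;
    unsigned long last_line  = (start + bytes - 1) / line_bytes;

    return (unsigned)(last_line - first_line + 1);
}

int main(void)
{
    /* Size field of 3 (96 bytes) on a 128-byte-line cache: one line
     * when aligned, two when the start straddles a line boundary. */
    printf("%u\n", lines_to_preload(0,  3, 128));  /* -> 1 */
    printf("%u\n", lines_to_preload(64, 3, 128));  /* -> 2 */
    return 0;
}
```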
FIGS. 10-15 demonstrate how the multi-block cache mechanism and instruction of the present invention operate, in contrast to the prior art mechanism and touch instructions shown in FIGS. 3-8. FIG. 10 illustrates a block diagram of a cache with 8-byte lines, which corresponds to the cache of FIG. 3. However, only one address A1 is touched (in contrast to the four touched addresses shown in FIG. 3) because only one touch instruction T1 is needed, as described in reference to FIG. 11. Each cache line (0-N) is made up of 8 bytes (0-7), and data D is shown occupying cache lines 0-2.

FIG. 11 shows the results of using the multi-block cache touch instruction of the present invention with the 8-byte cache lines of FIG. 10. The touched address A1 of FIG. 10 corresponds to the cache touch instruction T1. In one embodiment of the invention, a multi-block cache touch instruction generator, such as the compiler or a programmer
(e.g., in an assembler routine), generates the multi-block cache touch instruction with a beginning address and size field. The processor then executes the multi-block cache touch instruction. In response, the cache management logic determines the blocks of data or code in a computer system memory to be prefetched and the corresponding number of cache lines to be preloaded. The blocks are then prefetched directly from the computer system memory into the cache lines without any intermediate processing of the blocks, such as relocation of data or unnecessary manipulation of the blocks, thus optimizing the prefetching of blocks of data or code, or of objects for OOP. In this particular example, the size field denotes 19 bytes of data to be prefetched. The cache memory examines the starting address and the size field of the touch instruction and determines that three cache lines' worth of data need to be brought into the cache to satisfy the prefetch request. The cache then prefetches lines 0, 1 and 2 from the memory subsystem. Unlike the prior art examples, which needed four cache touch instructions (see FIGS. 3 and 4), the present invention requires only one cache touch instruction.
Ultimately, the memory hierarchy is permitted to implement the prefetch request using whatever method is deemed best for overall system performance. For the preferred embodiment of the invention, the cache management logic would be augmented with a "sequencer" to control the preloading of multiple cache lines. Upon receiving a request, registers in the sequencer would be initialized with the address of the first cache line to be preloaded and the number of cache lines to load. The sequencer would then process cache lines sequentially as follows. The sequencer first determines if the currently addressed memory block is already in the cache. If so, no further action is required for this block, so the sequencer increments its address register by the size of a cache line and decrements the register containing the number of cache lines to process. Otherwise, the sequencer issues a memory transfer request to prefetch the currently addressed memory block into the cache, again followed by incrementing its address register and decrementing the number of cache lines to be processed. This process continues until all requested blocks have been found in the cache or are in the process of being loaded; an illustrative sketch of this loop appears after the list below. Those skilled in the art will appreciate that the present invention may be implemented through a variety of appropriate methods, and is not limited to the method described above.

The present invention may also be extended by other features, such as the ability for a processor to issue a second cache request before the first multi-block prefetch has completed. The cache may respond to such a second request by:
1) blocking the processor until the first request completes;
2) aborting the first request and accepting the second request;
3) buffering a finite number of requests; or
4) ignoring the second request.
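Returning to the sequencer described above, the following C sketch (an illustration under assumed names and a stubbed cache interface, not the patent's hardware design) captures the loop the sequencer performs:

```c
#include <stdbool.h>
#include <stdio.h>

/* Stubbed cache interface -- illustrative only. */
static bool cache_contains(unsigned long addr)
{
    (void)addr;
    return false;               /* pretend every line misses */
}

static void issue_memory_transfer(unsigned long addr)
{
    printf("prefetch line at 0x%lx\n", addr);
}

/* Sketch of the sequencer loop described above: skip lines already in
 * the cache, issue a memory transfer for those that are not, and
 * advance one cache line at a time until all requested lines have
 * been handled. */
static void run_sequencer(unsigned long first_line_addr,
                          unsigned lines_to_load, unsigned line_bytes)
{
    unsigned long addr  = first_line_addr;  /* sequencer address register */
    unsigned      count = lines_to_load;    /* sequencer count register   */

    while (count > 0) {
        if (!cache_contains(addr))
            issue_memory_transfer(addr);
        addr  += line_bytes;    /* advance by one cache line     */
        count -= 1;             /* one fewer line left to handle */
    }
}

int main(void)
{
    run_sequencer(0x1000, 3, 128);  /* preload three 128-byte lines */
    return 0;
}
```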
Also, the present invention may be extended by considering other aspects of the memory hierarchy, such as paging considerations. For example, in a system using virtual addresses, a contiguous effective address range (such as that implied by the multi-block touch instruction's address and size) may not map onto a contiguous real (physical main storage) address range when a page boundary is crossed (for example, a 4K-byte page). The memory subsystem then has the option of either: 1) prefetching only within a page; or 2) translating the effective address at the page boundary in order to reach the next real page, and thus prefetching across page boundaries.

An additional feature for extending the present invention is a "history mechanism", implemented by the processor and associated with the touched address A1. A history mechanism would remember the actual cache lines that were subsequently referenced by the processor ("useful") after the prior execution of the multi-block touch to a given location. With this history, a subsequent multi-block touch to that location would fetch only the memory blocks that were useful on the prior multi-block touch. That is, the cache management logic of the cache memory would prefetch only a subset of the blocks of data or code corresponding to the multi-block cache touch instruction, where the subset consists of one or more blocks of data or code that were used by the processor after a previous issuance of the multi-block cache touch instruction.

One example of using the history mechanism is as follows. If a prior execution of a multi-block touch brought in lines 0, 1, 2, 3, and 4, but only lines 0 and 3 were used subsequently in the instruction stream, then the history of that multi-block touch instruction would be maintained by the cache management logic of the memory subsystem. A subsequent execution of the multi-block touch instruction would then bring in only lines 0 and 3 rather than all of lines 0, 1, 2, 3, and 4.

The history mechanism may be implemented by means of a confirmation vector, which would associate one bit (initially zero) with each prefetched cache line. Each reference by the processor to a cache line would cause that cache line's bit in the confirmation vector to be set. At the next issuance of the multi-block cache touch instruction, only those cache lines with set bits in the confirmation vector would actually be prefetched. The confirmation vector scheme might be further extended to use a consensus vote, in which the history of the previous N issuances of a multi-block cache touch instruction is remembered, and in which a cache line is prefetched only if it was found to be useful after a majority of the previous N issuances.
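A minimal C sketch of the confirmation-vector idea (illustrative only; the 32-line vector width, names, and per-touch history structure are assumptions) might look as follows, reproducing the lines-0-and-3 example above:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Sketch of a confirmation vector: one bit per prefetched cache line,
 * initially zero, set when the processor later references that line.
 * On the next issuance of the same multi-block touch, only lines whose
 * bits are set are prefetched. */
struct touch_history {
    uint32_t confirmed;  /* bit i set => line i proved useful last time */
    bool     valid;      /* false until the touch has executed once     */
};

static bool should_prefetch_line(const struct touch_history *h, unsigned line)
{
    if (!h->valid)                /* no history yet: prefetch everything */
        return true;
    return (h->confirmed >> line) & 1u;
}

static void record_reference(struct touch_history *h, unsigned line)
{
    h->confirmed |= 1u << line;   /* processor referenced this line */
}

int main(void)
{
    struct touch_history h = { 0, false };

    /* First issuance prefetches lines 0-4; only 0 and 3 get used. */
    h.valid = true;
    record_reference(&h, 0);
    record_reference(&h, 3);

    /* Second issuance: only lines 0 and 3 are prefetched again. */
    for (unsigned line = 0; line < 5; line++)
        if (should_prefetch_line(&h, line))
            printf("prefetch line %u\n", line);
    return 0;
}
```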
FIG. 12 illustrates a cache with 4-byte lines, corresponding to the cache memory of FIG. 5. That is, each cache line (0-N) is made up of 4 bytes (0-3). Data D is shown occupying cache lines 0-5. A1 is the touched address. Data D and A1 are discussed in reference to FIG. 13.
FIG. 13 shows the results of using the multi-block cache touch instruction of the present invention with the 4-byte cache lines of FIG. 12. The touched address A1 of FIG. 12 corresponds with touch instruction T1. As aforementioned, the corresponding prior art mechanism was only able to prefetch lines 0, 2, 4, and 5 when its compiler assumed a cache line 8 bytes long (see FIG. 6). Even if the prior art mechanism had correctly assumed a cache line 4 bytes long, it would have taken six cache touch instructions to prefetch lines 0-5. By comparison, the multi-block touch mechanism of the present invention prefetches all necessary cache lines (i.e., lines 0, 1, 2, 3, 4 and 5) with only one touch instruction T1. Thus, the present invention uses the processor's resources effectively when prefetching data into cache memory.
FIG. 14 illustrates a cache with 16-byte lines, corresponding to the cache memory of FIG. 7. That is, each cache line (0-N) is made up of 16 bytes (0-15). Data D is shown occupying cache lines 0-1. Unlike FIG. 7, A1 is the only touched address shown, and it will be discussed in reference to FIG. 15.

FIG. 15 shows the results of using the multi-block touch instruction of the present invention with the 16-byte cache lines of FIG. 14. The touched address A1 of FIG. 14 corresponds with the touch instruction T1. In the prior art system shown in FIG. 8, unnecessary touch instructions are generated because, first, the compiler assumed an incorrect cache line length and, second, the prior art mechanism issued an end address touch instruction in case the data was misaligned in the cache.
In contrast, the multi-block touch mechanism of the present invention prefetches cache lines 0 and 1 with only one touch instruction T1. Additionally, the present invention avoids the unnecessary cache touch instructions that result from potential misalignment of the block within the cache. This is possible because the cache, not the compiler, determines the number of cache lines that a block of data or object may straddle, by examining the size field of the touch instruction of the present invention. Hence, memory space and processing time are not wasted when prefetching data from a region of memory.
As shown in FIGS. 10-15, the multi-block touch mechanism of the present invention is very efficient and beneficial when prefetching data blocks into a data cache memory. For a computer running object oriented code, this means data for an entire object, or significant portions of such objects, may be prefetched with a single multi-block touch instruction. The multi-block touch mechanism may also be used just as efficiently for prefetching code into an instruction cache memory, and is especially useful for prefetching entire object functions, or significant portions of such functions, into an instruction cache. The multi-block touch instruction is less applicable to small functions and small object methods, though this can be mitigated by grouping related functions and methods that have an affinity to each other into contiguous memory blocks and performing the multi-block touch on groups of functions and object methods.
While the invention has been particularly shown and described with reference to preferred exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention.

Claims

What is claimed is:

1. A computer apparatus comprising:
(A) a processing unit, said processing unit executing a plurality of instructions that make up a first instruction stream;
(B) a main memory for storing multiple blocks of data and code;
(C) a cache memory coupled to said main memory for prefetching and temporarily storing in cache lines at least one of said multiple blocks stored in said main memory;
(D) a multi-block cache touch mechanism for prefetching at least one of said multiple blocks directly from said main memory into a plurality of cache lines of said cache memory without intermediate processing of said at least one of said multiple blocks when said processing unit executes a multi-block cache touch instruction in said first instruction stream; and
(E) a compiler for generating the first instruction stream from a second instruction stream, the compiler including: a multi-block cache touch instruction generator for generating and inserting said multi-block cache touch instruction into said first instruction stream in at least one location within said first instruction stream where prefetching at least one of said multiple blocks into said cache memory is advantageous.
2. The computer apparatus of claim 1 wherein said multiple blocks of data and code are objects used for object oriented applications.
3. The computer apparatus of claim 1 wherein said cache memory is a data cache memory.
4. The computer apparatus of claim 1 wherein said cache memory is an instruction cache memory.
5. The computer apparatus of claim 1, wherein said cache memory includes a history mechanism for prefetching only a subset of said blocks of data or code corresponding to said multi-block cache touch instruction, wherein said subset consists of one or more blocks of data or code that were referenced by said processing unit after at least one previous issuance of said multi-block cache touch instruction.
6. The computer apparatus of claim 3 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks of data.
7. The computer apparatus of claim 4 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks of code.
8. A computer apparatus for generating a first instruction stream executable on a processing unit from a second instruction stream, the computer apparatus comprising: a multi-block cache touch instruction generator for generating and inserting a multi-block cache touch instruction into said first instruction stream in at least one location within said first instruction stream where prefetching at least one of a plurality of multiple blocks into a cache memory is advantageous, the execution of said multi-block cache touch instruction by said processing unit causing a prefetch of at least one of a plurality of multiple blocks directly from a main memory without intermediate processing of said at least one of said multiple blocks into a plurality of cache lines of a cache memory.
9. The computer apparatus of claim 8 wherein said at least one of a plurality of multiple blocks are objects used for object oriented applications.
10. The computer apparatus of claim 8 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks.
11. A program product comprising: a compiler being used to generate a first instruction stream executable on a processing unit in a computer apparatus from a second instruction stream, the compiler including: a multi-block cache touch instruction generator for generating and inserting a multi-block cache touch instruction into said first instruction stream in at least one location within said first instruction stream where prefetching at least one of a plurality of multiple blocks into a cache memory is advantageous, the execution of said multi-block cache touch instruction by said processing unit causing a prefetch of at least one of a plurality of multiple blocks directly from a main memory without intermediate processing of said multiple blocks into a plurality of cache lines of a cache memory; and signal bearing media bearing said compiler.
12. The program product of claim 11 wherein said multiple blocks of data and code are objects used for object oriented applications.
13. The program product of claim 11 wherein said signal bearing media comprises recordable media.
14. The program product of claim 11 wherein said signal bearing media comprises transmission media.
15. The program product of claim 11 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks.
16. A compiler method for enhancing the performance of a computer program, wherein the compiler generates a first instruction stream executable on a processing unit from a second instruction stream, said compiler method comprising the steps of: determining at least one location within the first instruction stream where performance of the computer program may be enhanced by causing at least one block to be prefetched from a main memory coupled to said processing unit into a cache memory coupled to said processing unit; and inserting a multi-block cache touch instruction at said at least one location, the execution of the multi-block cache touch instruction by said processing unit causing said at least one block to be prefetched directly from said main memory without intermediate processing into a plurality of lines within said cache memory.
17. The compiler method of claim 16 wherein said multiple blocks are objects.
18. The compiler method of claim 16 wherein said cache memory is a data cache memory.
19. The compiler method of claim 16 wherein said cache memory is an instruction cache memory.
20. The compiler method of claim 18 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks of data.
21. The compiler method of claim 19 wherein said multi-block cache touch instruction comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions, and wherein said multi-block cache touch instruction further comprises an address and a size of said at least one of said multiple blocks of code.
22. An object-oriented computer apparatus comprising:
(A) computer system memory for storing multiple blocks of data and code;
(B) a cache memory coupled to said computer system memory for prefetching and temporarily storing in multiple cache lines at least one block from said multiple blocks stored in said computer system memory; and
(C) a processing unit, coupled to said cache memory, for executing a multi-block cache touch instruction, said multi-block cache touch instruction initiating a prefetch of said at least one block from said computer system memory to a plurality of cache lines of said cache memory without intermediate processing of said at least one block.
23. The computer apparatus of claim 22 wherein said multiple blocks are objects.
24. The computer apparatus of claim 22, wherein said multi-block cache touch instruction comprises an address and a size of said at least one block.
25. The computer apparatus of claim 24, wherein said multi-block cache touch instruction further comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions.
26. The computer apparatus of claim 22, wherein said cache memory is an instruction cache memory.
27. The computer apparatus of claim 22, wherein said cache memory is a data cache memory.
28. The computer apparatus of claim 22, wherein said cache memory includes a history mechanism for prefetching only a subset of said blocks of data or code corresponding to said multi-block cache touch instruction, wherein said subset consists of one or more blocks of data or code that were referenced by said processing unit after at least one previous issuance of said multi-block cache touch instruction.
29. A method for preloading multiple cache lines in a cache memory with a single multi-block cache touch instruction comprising the steps of:
(1) executing said multi-block cache touch instruction with a processing unit, and in response to executing said multi-block cache touch instruction, performing the steps of:
(A) determining a number of said multiple cache lines to be preloaded and corresponding blocks of data or code in a computer system memory to be prefetched as a result of said multi-block cache touch instruction; and
(B) prefetching said corresponding blocks of data or code directly from said computer system memory without intermediate processing of said corresponding blocks to said multiple cache lines in said cache memory.
30. The method of claim 29 wherein said multiple blocks are objects.
31. The method of claim 29 further comprising the step of: providing a history mechanism for prefetching only a subset of said blocks of data or code corresponding to said multi-block cache touch instruction, wherein said subset consists of one or more blocks of data or code that were referenced by said processing unit after at least one previous issuance of said multi-block cache touch instruction.
32. The method of claim 29, wherein said multi-block cache touch instruction comprises an address field and a size field.
33. The method of claim 32, wherein said multi-block cache touch instruction further comprises an op code field for distinguishing said multi-block cache touch instruction from other instructions.
34. The method of claim 29, wherein said cache memory is a data cache memory.
35. The method of claim 29, wherein said cache memory is an instruction cache memory.
36. The method of claim 34, wherein said determining step further comprises the steps of: determining an address of said blocks of data to be prefetched from said address field of said multi-block cache touch instruction; and determining said number of said multiple cache lines to be preloaded and said corresponding blocks of data to be prefetched by said size field of said multi-block cache touch instruction.
37. The method of claim 35, wherein said determining step further comprises the steps of: determining an address of said blocks of code to be prefetched from said address field of said multi-block cache touch instruction; and determining said number of said multiple cache lines to be preloaded and said corresponding blocks of code to be prefetched by said size field of said multi-block cache touch instruction.