WO2003034229A1 - Data prefetching in a computer system - Google Patents

Data prefetching in a computer system

Info

Publication number
WO2003034229A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
prefetching
data storage
storage information
job
Prior art date
Application number
PCT/SE2001/002290
Other languages
French (fr)
Inventor
Leif Karl Östen JOHANSSON
Jon Fredrik Helmer Reveman
Original Assignee
Telefonaktiebolaget Lm Ericsson
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson filed Critical Telefonaktiebolaget Lm Ericsson
Priority to EP01979147A priority Critical patent/EP1444584A1/en
Priority to PCT/SE2001/002290 priority patent/WO2003034229A1/en
Publication of WO2003034229A1 publication Critical patent/WO2003034229A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30 - Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38 - Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3824 - Operand accessing
    • G06F9/383 - Operand prefetching
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0862 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches with prefetch
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 - Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/60 - Details of cache memory
    • G06F2212/6028 - Prefetching based on hints or prefetch instructions

Definitions

  • the present invention generally relates to data prefetching in a computer system, and more particularly to a method and system for prefetching data as well as a method and system for supporting such data prefetching.
  • a common way of alleviating this problem, trying to reduce the average memory latency to an acceptable level, is to use one or more levels of small and fast cache as a buffer between the processor and the larger and slower main memory.
  • a cache memory contains copies of blocks of information that are also stored in the main memory.
  • the system first goes to the fast cache to determine if the information is present in the cache. If the information is available in the cache, a so-called cache hit, access to the main memory is not required and the information is taken directly from the cache. If the information is not available in the cache, a so-called cache miss, the data is fetched from the main memory into the cache, possibly overwriting other active data in the cache.
  • while the cache memory has the potential to reduce the average memory latency, the actual performance improvement is naturally dependent on the cache-hit ratio. It is only when the required information is available in the cache, a cache-hit, that the memory latency will be reduced. Whenever the processor needs data and/or instructions that are not available in the cache, the processor "stalls" until the required information is loaded from the main memory, thus wasting valuable processing time.
  • the simplest example of hardware prefetching is the behavior of ordinary caches, which bring an entire cache line from memory even when a single word in the line is referenced, assuming that other words in the same line will be referenced shortly.
  • More advanced conventional hardware techniques perform statistical analysis of the memory access patterns of the processor at run-time to generate appropriate prefetch requests.
  • in software prefetching, also generally known as compiler-assisted prefetching, the compiler analyzes the code to predict references in, for example, loop structures and inserts specific prefetch instructions into the code accordingly.
  • Compiler-assisted prefetching can use program code knowledge to provide prefetches at suitable places in the code, but has no knowledge of the run-time dynamics.
  • conventional hardware prefetching can effectuate prefetches for run-time memory access patterns not visible or available to the compiler, but has no knowledge of the program code flow.
  • U.S. Patent 5,704,053 relates to a compiler that facilitates efficient insertion of explicit data prefetch instructions into loop structures within an application by simple address expression analysis. Analysis and explicit prefetch instruction insertion is performed by the compiler in a low-level optimizer to provide access to more accurate expected loop iteration latency information. In addition, execution profiles from previous runs of an application are exploited in the insertion of prefetch instructions into loops with internal control flow. Cache line reuse patterns across loop iterations are recognized to eliminate unnecessary prefetch instructions. The prefetch insertion algorithm is also integrated with other low-level optimization phases such as loop unrolling, register reassociation and instruction scheduling.
  • U.S. Patent 5,812,996 relates to a database system for improving the execution speed of database queries by optimizing the use of buffer caches.
  • the system includes an optimizer for formulating an optimal strategy for a given query.
  • the optimizer communicates with a buffer manager for determining whether an object of interest exists in its own buffer cache, how much of the cache the object requires, and the optimal I/O size for the cache. Based on this information, the optimizer formulates a query strategy with hints that are passed to the buffer manager.
  • U.S. Patent 5,918,246 relates to data prefetching based on information in a compiler-generated program map.
  • the program map is generated by a compiler when the source code is compiled into object code, and represents the address flow of the compiled program with information of the address location of each branch target that the CPU might encounter during execution. For each application program, the user would have this program map stored with the object file.
  • the operating system will load the program map into a given area of the random access memory, and a special map control unit will utilize the program map in cooperation with a conventional cache controller to effectuate the actual pre-loading of data and instructions to the cache.
  • the present invention overcomes these and other drawbacks of the prior art arrangements.
  • Yet another object of the invention is to provide a method and system for supporting data prefetching in a computer system.
  • the invention is based on the recognition that program code knowledge of the data storage structure used by program procedures can be effectively combined with run-time information in order to generate appropriate prefetch requests.
  • the general idea according to the invention is to combine data storage information generated during program code analysis, for example at compile-time, with one or more run-time arguments to determine a memory address for prefetching data. In this way, efficient data prefetching with a high cache-hit ratio will be accomplished, thus reducing the memory latency and improving the processor utilization.
  • the invention takes advantage of the fact that many computer systems, such as transaction-based systems and database systems, have a queue of jobs to be executed. By peeking into the queue to fetch the relevant information for a given job well in advance of execution of the job, a prefetch can be requested sufficiently early so that the corresponding data will be available when the job is to be executed.
  • the data storage information is generally generated prior to the program execution by program code analysis and stored in the memory system for easy access during run-time.
  • a program code analyzer such as a compiler or code optimizer generates individual data storage information for each of a number of program procedures defined in the program code.
  • the appropriate data storage information to be used for a given job is then accessed based on the program procedure or procedures indicated in the job.
  • the data storage information comprises at least a data area start address together with information concerning which job input argument or arguments are required to pin-point the memory address of the data to be prefetched.
  • the prefetch address determination and the actual prefetch request are preferably executed by operating system software, dedicated hardware or a combination thereof.
  • the invention offers the following advantages: efficient data prefetching; reduced average memory latency; improved processor utilization; and efficient compiler-derived support of data prefetching.
  • Fig. 1 is a schematic block diagram illustrating an example of a computer system implementing a prefetch mechanism according to a preferred embodiment of the invention;
  • Fig. 2 is a schematic diagram illustrating the job queue and the corresponding execution flow related to the exemplary computer system of Fig. 1;
  • Fig. 3 is a schematic diagram illustrating the general principle for generating data storage information according to the invention;
  • Fig. 4 is a schematic diagram illustrating an example of compiler-assisted generation of data storage information according to a preferred embodiment of the invention;
  • Fig. 5 is a schematic diagram illustrating a specific example of the relationship between given data storage information and a data storage structure in the data store.
  • Fig. 1 is a schematic block diagram illustrating an example of a computer system implementing a prefetch mechanism according to a preferred embodiment of the invention.
  • the description is not intended to be complete with regard to the entire computer system, but will concentrate on those parts that are relevant to the invention.
  • the example illustrated in Fig. 1 merely serves as a basis for understanding the basic principles of the invention, and the invention is not limited thereto.
  • the computer system basically comprises a processor 110 and a memory system 120.
  • the computer system also comprises a memory manager 130, a scheduling unit 140 and a prefetch unit 150, implemented in operating system software, dedicated hardware or a combination thereof.
  • the processor 110 and the memory system 120 are interconnected via a conventional communication bus.
  • the memory system 120 comprises a main memory 121 and a faster cache memory 122.
  • the main memory 121 generally comprises a job queue 123 for storing jobs to be executed, a data store 124 for storing data variables and constants, a program store 125 for storing executable program instructions, and a dedicated memory area 126 for storing special data storage information generated during program code analysis.
  • the cache memory 122 generally comprises a data cache 127 and an instruction cache 128.
  • the cache memory 122 may be representative of a so-called on-chip cache provided directly on the processor chip, an off-chip cache provided on a separate chip or both.
  • the performance of a cache is affected by the organization of the cache, and especially the replacement algorithm.
  • the replacement algorithm generally determines the blocks or lines in the relevant cache to which information in the main memory is mapped.
  • the most commonly used replacement algorithms are direct mapping, set-associative and fully associative mapping.
  • although the replacement algorithm determines the blocks or lines in the relevant cache to which selected information in the main memory (or another higher-level cache) is mapped, it is still necessary to determine which blocks of information in the main memory should be copied into the cache in order to maximize the cache-hit ratio and minimize the memory latency.
  • the memory latency will be reduced only when the required information is available in the cache. Whenever the processor needs data and/or instructions that are not available in the cache, the processor stalls until the required information has been loaded from the main memory.
  • the invention proposes a new prefetch mechanism that effectively combines data storage information generated during program code analysis, for example at compile-time, with one or more run-time arguments in order to generate appropriate data prefetch requests.
  • data storage information and run-time information for a given program procedure are combined by means of a generic prefetch function in order to determine a useful prefetch address.
  • the invention has turned out to be particularly applicable in computer systems that operate based on a queue of jobs to be executed. It has been recognized that the queue structure makes it possible to peek into the job queue to fetch relevant information for a given job and request a prefetch of data well in advance of the actual execution of the job. By looking into the queue and generating a prefetch request sufficiently early, the required data will be available in time for execution of the job.
  • the job queue 123 is implemented as a first-in-first-out (FIFO) buffer in which a number of externally and/or internally generated job messages are buffered, awaiting processing by the processor.
  • each job message in the job queue 123 includes program address representative information, input arguments to be used in the execution as well as data storage information related to the given procedure.
  • the program address representative information, simply referred to as program address information, directly or indirectly addresses the program procedure to be executed.
  • the actual program address is generally accessed by means of several table look-ups in different tables.
  • the program address information in the job message typically includes a pointer to a look-up table, which in turn points to another table and so on until the final program address is found.
  • the data storage information, also referred to as data storage structure information, is generated before program execution by proper analysis of the program code. In many applications, it is often convenient to generate individualized data storage information for each of a number of program procedures defined in the program code.
  • the procedure-specific data storage information generally describes the data storage structure related to the program procedure in question.
  • the data storage information is typically stored in the data storage information area 126 for access during run-time.
  • the data storage information is preferably transferred by the operating system (OS) or equivalent from the data storage information area 126 into the job queue 123.
  • the operating system analyzes each job message to be placed in the job queue 123, and detects which program procedure is defined in the job message based on the program address information included in the message. The operating system then adds the corresponding data storage information to the respective job message, and writes the entire job message into the job queue.
  • the data storage information may be loaded directly from the data storage information area 126 based on the program address information for the given job.
  • the scheduling unit 140 schedules the corresponding jobs for execution by the processor 110 by managing the job queue 123 using a special execution pointer.
  • the execution pointer usually points to the head of the job queue, indicating that the job at the head position is to be executed (or currently under execution).
  • the prefetch unit 150 looks ahead in the job queue 123, using a special prefetch pointer, and initiates the prefetch mechanism for a given future job a predetermined number of jobs in advance of execution. First, the prefetch unit 150 loads program address information, input arguments and data storage information for the indicated job from the memory-allocated job queue 123 into the cache, unless this information already resides in the cache. The prefetch unit 150 then combines selected data storage information with at least one of the input arguments for the job according to a given prefetch address function, thus calculating a data prefetch address.
  • the data storage information typically comprises a data area start address together with information concerning which input argument or arguments are required to fully determine the corresponding prefetch address.
  • a prefetch address may be calculated by using the start address to find the relevant area within the data store 124 and pin-pointing the address of the needed data variable or constant by means of the indicated input argument.
  • the prefetch unit 150 communicates the calculated prefetch address to the memory manager 130, which in turn controls the actual transfer of data from the data store 124 into the data cache 127. If the memory manager 130 brings an entire cache line from the main memory 121 when a single word is referenced, it is generally not necessary to find the exact individual memory address for the future data reference. It is merely sufficient to determine the correct memory line or block in which the needed data is located. This relaxes the requirements on the exactness of the prefetch address function.
  • the prefetch unit 150 generally follows the same job queue as the scheduling unit 140, but operates a predetermined number of jobs ahead of the job to be executed.
  • the program address information and the input arguments for the job are already available in the cache.
  • data variables and/or constants to be used in the execution of the job are also available in the data cache. This minimizes the memory latency and thus substantially reduces the number of stall cycles. Simulations have indeed shown that the number of stall cycles due to data store accesses can be reduced by 25-50%, as will be described in detail later on.
  • a prefetch is merely a hint to the memory system to bring the given data into a closer, faster level of memory, such that a later binding load will complete much faster.
  • This kind of prefetch is executed asynchronously with no follow-on dependencies in the code stream, and therefore does not cause any stall cycles.
  • Fig. 2 is a schematic diagram illustrating the job queue and the corresponding execution flow related to the exemplary computer system of Fig. 1.
  • the prefetch for a given future job is initiated a predetermined number K of jobs in advance of the actual execution.
  • in step 202, the data cache block or blocks required in the future execution of job M+K are subsequently calculated based on the obtained data storage information and at least one of the obtained input arguments.
  • in step 203, the actual prefetch of the calculated data cache block or blocks is requested.
  • the prefetch should not be issued too close in time to the actual data reference, since then the prefetched data may not be available in time to minimize or avoid a stall situation.
  • if the prefetch is issued too early, there is a risk that the prefetched line is displaced from the cache before the actual data reference takes place.
  • the so-called look-ahead period is ideally of the same order as the memory access time or slightly longer so that data related to job M+K will be available in the cache when job M+K is ready for execution at the head of the job queue. If the average job execution time is known, it is possible to determine how many jobs in advance of execution the prefetch should be issued. Naturally, the optimal look-ahead period differs from application to application. However, simulations have shown that a significant performance gain can be achieved already with a look-ahead of one or two jobs.
  • the representation of the job queue in Fig. 2 is a snap-shot, and that a similar prefetch of program address information, input arguments and data variables or constants has already been performed for each of the jobs to be executed before job M+K, including job M.
  • the program address information and input arguments required for starting job M as well as data variables and/or constants needed in the execution of job M are ideally available in the cache so that the execution of job M can be initiated in step 204.
  • the results of job M are stored in relevant parts of the memory system. If a new job M+N is generated as a result of the execution of job M, this job is stored in the job queue in step 205.
  • the operating system adds the data storage information corresponding to the new job M+N from the data storage information area 126 into the relevant position of the job queue.
  • job M is shifted out of the job queue and job M+1 is placed at the head of the job queue.
  • the prefetch mechanism according to the invention may be implemented as an operating system routine that is activated between jobs, or executed as an integrated part of the currently executing job M.
  • the prefetch may be executed by dedicated hardware or even a separate processor with software responsible for job scheduling and prefetching. In the latter case, it is generally easier to optimize the prefetch timing to the memory access time of the slower memory, since prefetches may be issued by the separate processor at any suitable time.
  • Fig. 3 is a schematic diagram illustrating the general principle for generating data storage information according to the invention.
  • An input program file 302 is provided to a code analyzer 304, which performs a flow graph analysis or equivalent analysis of the program code.
  • the code analyzer 304 extracts static information concerning the data storage structure related to the program procedure.
  • the code analyzer 304 may extract information regarding the start address to a specific area in the data store towards which the given program procedure operates.
  • one or more run-time arguments are typically required.
  • the code analyzer 304 does not know the values of any run-time arguments, but instead analyzes the code to provide information as to which input argument or arguments are required to pinpoint the address of the needed data within the specified data store area. During run-time, the required input argument(s) can then be accessed and combined with the static information from the code analyzer to determine the relevant data store address.
  • Fig. 4 is a schematic diagram illustrating an example of compiler-assisted generation of data storage information according to a preferred embodiment of the invention.
  • the source code, in the form of an input program file 402, is provided to a compiler or optimizer 404. During compilation, the compiler translates the source code into object code, producing a corresponding output program file 406.
  • the compiler generally generates a compiler help file in the form of a procedure descriptor table 408.
  • This table normally includes a general description of each compiled program procedure indicating the name of the procedure, the number of input arguments, possibly the format of the arguments, and program address information.
  • the compiler 404 also generates individual data storage information for each program procedure by analysis of the corresponding program code, and integrates this information into the procedure descriptor table 408. The data storage information can then be accessed from the procedure descriptor table 408 during run-time and combined with run-time arguments to generate appropriate prefetch requests.
  • Fig. 5 is a schematic diagram illustrating a specific example of the relationship between given data storage information and a data storage structure in the data store.
  • the prefetch mechanism has access to the run-time arguments 510 for a given job as well as data storage information 520 related to a program procedure defined in the given job.
  • the data storage information 520 is represented by a base address number ban as well as an indication arg nr of which input argument or arguments to the procedure are required to determine a prefetch address.
  • the base address number is unique for the given program procedure and acts as a pointer to a base address table 525.
  • the base address table 525 holds data area start addresses, record size values and offset values, and each base address number is associated with a unique data area start address dfn, record size recsize and offset.
  • the data storage parameters dfn, recsize and offset are given directly, eliminating the need for a table look-up in the base address table using the base address number.
  • the dfn value indicates the start address of a given data store area 535 in the data store 530.
  • the recsize value represents the size of a record in the given data storage structure.
  • the offset value indicates which one of the variables within a record is requested.
  • the input argument or arguments indicated by the data storage information are also required.
  • the data storage information 520 points out a certain input argument, which is provided as input to a pointer function p(arg).
  • the pointer function p(arg) may simply be defined as the sum of the relevant input argument arg and an arbitrary constant C. The resulting pointer value indicates in which record the needed data variable is located.
  • the prefetch address can thus be calculated according to the following generic prefetch function (assuming that the data store does not have any index dependencies):
  • prefetch address = dfn(ban) + p(arg) · recsize(ban) + offset(ban)   (1)
  • the dfn value gives the data area start address, the p(arg) value multiplied by the recsize value gives the relevant record, and the offset value finally gives the correct variable within the identified record.
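  • As a minimal C sketch of how the generic prefetch function (1) might be implemented (the table layout, type choices and names below are illustrative assumptions, not taken from the patent):

      #include <stdint.h>

      #define C 0  /* arbitrary constant used by the pointer function p(arg) */

      struct base_address_entry {
          uintptr_t dfn;      /* start address of the data store area */
          uint32_t  recsize;  /* size of one record in the area */
          uint32_t  offset;   /* offset of the requested variable within a record */
      };

      extern struct base_address_entry base_address_table[];

      /* p(arg): maps the selected input argument to a record number. */
      static inline uint32_t p(uint32_t arg) { return arg + C; }

      /* prefetch address = dfn(ban) + p(arg) * recsize(ban) + offset(ban) */
      static inline uintptr_t prefetch_address(uint32_t ban, uint32_t arg)
      {
          const struct base_address_entry *e = &base_address_table[ban];
          return e->dfn + (uintptr_t)p(arg) * e->recsize + e->offset;
      }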
  • the prefetch unit then requests a prefetch of the data variable, or preferably an entire data cache block, from the data store into the data cache based on the calculated address.
  • each set of data storage information is 32 bits, with the following store layout:
  • the cache block determination defined by steps 1-11 is a very straightforward implementation based on simple logical operations and shift operations, and does not involve any logical decisions. This is important for minimizing the overhead for the data prefetch mechanism according to the invention.
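  • The actual 32-bit layout is not reproduced here, so the following packing is purely hypothetical; it only illustrates how such a descriptor can be decoded with shifts and masks alone, without logical decisions:

      #include <stdint.h>

      /* Hypothetical packing (an assumption, not the patent's layout):
         bits 31..16 = ban, bits 15..8 = arg nr, bits 7..0 = flags.
         Extraction uses only shifts and masks, keeping overhead minimal. */
      static inline uint32_t unpack_ban(uint32_t info)   { return info >> 16; }
      static inline uint32_t unpack_argnr(uint32_t info) { return (info >> 8) & 0xFFu; }
      static inline uint32_t unpack_flags(uint32_t info) { return info & 0xFFu; }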
  • a trace-driven simulation was used to study the cache-hit ratio obtained by using the memory address determination algorithm proposed above.
  • the trace was recorded in a live telecommunication exchange based on a 64-bit Alpha processor and included approximately 6 million assembler instructions.
  • the cache block size was 64 bytes.
  • the simulator counted the needed number of data cache blocks for every executed signal in the trace and compared that number with the number of preloaded cache blocks. Table I below shows the results of the simulation.
  • the simulation shows that 48-56% of the cache blocks could be preloaded with the algorithm used by the invention.
  • the percentage of data store accesses to these cache blocks was 58-65%.
  • Calculations and measurements show that nearly 50% of the execution time is stalled due to data store accesses in the Alpha-based processor architecture.
  • the proposed prefetch mechanism could reduce the number of stall cycles by approximately 25-50%. This corresponds to a total capacity gain of about 10-25%, which is a remarkable improvement offered by the invention. In real-life applications, improvements of 5-10% are expected.
  • the invention can be used in any computer system that allows non-binding asynchronous transfer of data from one level of memory to another level of memory. This includes most modern computer systems such as pipelined processor systems, superscalar processor systems, multiprocessor systems and combinations thereof.
  • the invention is particularly applicable to computer systems in which a number of externally and/or internally generated jobs are arranged, explicitly or implicitly, in a queue, and in applications with a high ratio of so-called context switching.
  • most transaction-based systems have a job queue in which jobs are buffered, awaiting processing by the processor or processors within the system.
  • in database applications, a number of requests or queries from various clients are typically buffered for subsequent processing by a database server.
  • the invention is also applicable to process-based computer systems. For example, many commercial operating systems such as Unix and Windows NT work with processes. In a system having an execution model based on processes, incoming signal messages originating from events in the system or from communication messaging are directed to corresponding processes.
  • a process is normally represented by its process control block, which holds the process state when the process is not executing and possibly administrative information required by the operating system.
  • the process state includes program address information and input arguments required for execution of the current job of the process.
  • a process can be either READY, waiting in a ready queue for execution, EXECUTING, meaning that a job is executed based on the current process state of the process control block, or BLOCKED, waiting for a required signal message in a blocked queue.
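  • For illustration, the three process states and a minimal process control block might be declared as follows in C; the field choices are assumptions made only to picture the description above:

      /* Sketch of a process control block holding the process state. */
      enum process_state { READY, EXECUTING, BLOCKED };

      struct process_control_block {
          enum process_state state;  /* READY, EXECUTING or BLOCKED */
          void *program_address;     /* program address information of the current job */
          int   input_args[4];       /* input arguments of the current job */
          /* ... administrative information required by the operating system ... */
      };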
  • a job queue is defined and used by a main executive.
  • the job queue consists of job-queue entries, and each entry includes information about the specific procedure to be executed and the input arguments to the procedure.
  • two simple program procedures are defined.
  • a generic prefetch function is inlined in the main executive. The prefetch function uses information found in the job-queue in combination with procedure- specific data storage information. The data storage information would normally be generated by the compiler.
  • element_number = element_number + 123;
  • next_free_entry = 0; /* Wrap around at end of queue */
  • unsigned int new_function = (element_number >> 1) & 1; /* Pick 2nd lowest bit for new function */ int arg2;
  • element_number = element_number >> 1; if ((element_number > 0) && (element_number < WA_SIZE)) {
  • pf_pointer = (execute_pointer + LOOKAHEAD) % JQ_SIZE;
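  • The fragments above can be tied together as in the following sketch of the main executive; JQ_SIZE, LOOKAHEAD and the pf_pointer arithmetic follow the fragments, while the job-entry layout, prefetch_address and the use of GCC's __builtin_prefetch hint are assumptions:

      #include <stddef.h>
      #include <stdint.h>

      #define JQ_SIZE   256  /* job queue length (assumed) */
      #define LOOKAHEAD 2    /* issue prefetches this many jobs ahead */

      struct job_entry {
          void (*procedure)(int arg1, int arg2);  /* procedure to execute */
          int arg[2];                             /* input arguments */
          uint32_t ban;                           /* data storage information */
          uint32_t arg_nr;                        /* which argument pin-points the data */
      };

      static struct job_entry job_queue[JQ_SIZE];
      static size_t execute_pointer;

      extern uintptr_t prefetch_address(uint32_t ban, uint32_t arg);

      void main_executive(void)
      {
          for (;;) {
              /* Peek LOOKAHEAD jobs ahead of the execute pointer and issue
                 a non-binding prefetch hint for the future job's data. */
              size_t pf_pointer = (execute_pointer + LOOKAHEAD) % JQ_SIZE;
              struct job_entry *pf = &job_queue[pf_pointer];
              __builtin_prefetch((const void *)
                  prefetch_address(pf->ban, (uint32_t)pf->arg[pf->arg_nr]), 0, 1);

              /* Execute the job at the head of the queue. */
              struct job_entry *job = &job_queue[execute_pointer];
              job->procedure(job->arg[0], job->arg[1]);
              execute_pointer = (execute_pointer + 1) % JQ_SIZE;
          }
      }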

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention relates to data prefetching in a computer system and involves a prefetch unit (150) that operates towards a job queue (123), in which a number of jobs are scheduled for execution. The prefetch unit (150) combines data storage information generated during program code analysis, for example at compile-time, with one or more run-time arguments to determine a memory address for prefetching data. By peeking into the queue (123) to fetch relevant information for a given job in advance of execution of the job, a prefetch can be requested sufficiently early so that the corresponding data will be available when the job is to be executed. In this way, efficient data prefetching with a high cache-hit ratio will be accomplished, thus reducing the memory latency and improving the processor utilization.

Description

DATA PREFETCHING IN A COMPUTER SYSTEM
TECHNICAL FIELD OF THE INVENTION
The present invention generally relates to data prefetching in a computer system, and more particularly to a method and system for prefetching data as well as a method and system for supporting such data prefetching.
BACKGROUND OF THE INVENTION
With the ever-increasing demand for faster and more effective computer systems naturally comes the need for faster and more sophisticated electronic components. The computer industry has been extremely successful in developing new and faster processors. The processing speed of state-of-the-art processors has increased at a spectacular rate over the past decades. The access time of memory circuits, however, has not been able to improve at the same rate. In fact, the ratio between memory access time and clock cycle time for execution has increased rapidly during the past 10 to 15 years, and is expected to increase even further in the future. This means that memory will continue to be the limiting factor for overall system performance. The relatively long access time for retrieval of data and/or instructions from main memory generally means that the processor has to spend time merely waiting for information that is required during execution. This waiting time is often referred to as memory latency.
A common way of alleviating this problem, trying to reduce the average memory latency to an acceptable level, is to use one or more levels of small and fast cache as a buffer between the processor and the larger and slower main memory. A cache memory contains copies of blocks of information that are also stored in the main memory. As reads to the main memory are issued in the computer system, the system first goes to the fast cache to determine if the information is present in the cache. If the information is available in the cache, a so-called cache hit, access to the main memory is not required and the information is taken directly from the cache. If the information is not available in the cache, a so-called cache miss, the data is fetched from the main memory into the cache, possibly overwriting other active data in the cache. Similarly, as writes to the main memory are issued, data is written to the cache and copied back to the main memory. As should be understood, the goal of using a cache memory in connection with a main memory is to make the memory system appear to be as large as the main memory and as fast as the cache memory.
While the cache memory has the potential to reduce the average memory latency, the actual performance improvement is naturally dependent on the cache-hit ratio. It is only when the required information is available in the cache, a cache-hit, that the memory latency will be reduced. Whenever the processor needs data and/or instructions that are not available in the cache, the processor "stalls" until the required information is loaded from the main memory, thus wasting valuable processing time.
Various attempts have been made to reduce the time that the processor must spend waiting for data by predicting what information the processor is going to need and prefetching that information into the cache before the information is actually referenced. When the information is finally referenced, it can be found in the cache.
Conventional prefetching is typically divided into two different categories: hardware prefetching and software prefetching.
The simplest example of hardware prefetching is the behavior of ordinary caches, which bring an entire cache line from memory even when a single word in the line is referenced, assuming that other words in the same line will be referenced shortly.
More advanced conventional hardware techniques perform statistical analysis of the memory access patterns of the processor at run-time to generate appropriate prefetch requests. In software prefetching, also generally known as compiler-assisted prefetching, the compiler analyzes the code to predict references in, for example, loop structures and inserts specific prefetch instructions into the code accordingly.
Compiler-assisted prefetching can use program code knowledge to provide prefetches at suitable places in the code, but has no knowledge of the run-time dynamics. In contrast, conventional hardware prefetching can effectuate prefetches for run-time memory access patterns not visible or available to the compiler, but has no knowledge of the program code flow.
U.S. Patent 5,704,053 relates to a compiler that facilitates efficient insertion of explicit data prefetch instructions into loop structures within an application by simple address expression analysis. Analysis and explicit prefetch instruction insertion is performed by the compiler in a low-level optimizer to provide access to more accurate expected loop iteration latency information. In addition, execution profiles from previous runs of an application are exploited in the insertion of prefetch instructions into loops with internal control flow. Cache line reuse patterns across loop iterations are recognized to eliminate unnecessary prefetch instructions. The prefetch insertion algorithm is also integrated with other low-level optimization phases such as loop unrolling, register reassociation and instruction scheduling.
U.S. Patent 5,812,996 relates to a database system for improving the execution speed of database queries by optimizing the use of buffer caches. The system includes an optimizer for formulating an optimal strategy for a given query. The optimizer communicates with a buffer manager for determining whether an object of interest exists in its own buffer cache, how much of the cache the object requires, and the optimal I/O size for the cache. Based on this information, the optimizer formulates a query strategy with hints that are passed to the buffer manager.

U.S. Patent 5,918,246 relates to data prefetching based on information in a compiler-generated program map. The program map is generated by a compiler when the source code is compiled into object code, and represents the address flow of the compiled program with information of the address location of each branch target that the CPU might encounter during execution. For each application program, the user would have this program map stored with the object file. The operating system will load the program map into a given area of the random access memory, and a special map control unit will utilize the program map in cooperation with a conventional cache controller to effectuate the actual pre-loading of data and instructions to the cache.
SUMMARY OF THE INVENTION
The present invention overcomes these and other drawbacks of the prior art arrangements.
It is a general object of the present invention to improve existing cache and prefetch technology in order to reduce the average memory latency and increase the processor utilization.
It is a particular object of the invention to provide a method and system for efficient prefetching of data in a computer system.
Yet another object of the invention is to provide a method and system for supporting data prefetching in a computer system.
These and other objects are met by the invention as defined by the accompanying patent claims.
The invention is based on the recognition that program code knowledge of the data storage structure used by program procedures can be effectively combined with run-time information in order to generate appropriate prefetch requests. The general idea according to the invention is to combine data storage information generated during program code analysis, for example at compile-time, with one or more run-time arguments to determine a memory address for prefetching data. In this way, efficient data prefetching with a high cache-hit ratio will be accomplished, thus reducing the memory latency and improving the processor utilization.
If a prefetch is requested too close in time to the actual data reference, the data may not be available in time to minimize or avoid a processor stall situation. Therefore, it is important that the run-time information used in generating the prefetch address is available sufficiently in advance of the data reference. The invention takes advantage of the fact that many computer systems, such as transaction-based systems and database systems, have a queue of jobs to be executed. By peeking into the queue to fetch the relevant information for a given job well in advance of execution of the job, a prefetch can be requested sufficiently early so that the corresponding data will be available when the job is to be executed.
The data storage information is generally generated prior to the program execution by program code analysis and stored in the memory system for easy access during run-time. Preferably, a program code analyzer such as a compiler or code optimizer generates individual data storage information for each of a number of program procedures defined in the program code. The appropriate data storage information to be used for a given job is then accessed based on the program procedure or procedures indicated in the job. Typically, the data storage information comprises at least a data area start address together with information concerning which job input argument or arguments are required to pin-point the memory address of the data to be prefetched.
The prefetch address determination and the actual prefetch request are preferably executed by operating system software, dedicated hardware or a combination thereof. The invention offers the following advantages: efficient data prefetching; reduced average memory latency; improved processor utilization; and efficient compiler-derived support of data prefetching.
Other advantages offered by the present invention will be appreciated upon reading of the below description of the embodiments of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
The invention, together with further objects and advantages thereof, will be best understood by reference to the following description taken together with the accompanying drawings, in which:
Fig. 1 is a schematic block diagram illustrating an example of a computer system implementing a prefetch mechanism according to a preferred embodiment of the invention;
Fig. 2 is a schematic diagram illustrating the job queue and the corresponding execution flow related to the exemplary computer system of Fig. 1;
Fig. 3 is a schematic diagram illustrating the general principle for generating data storage information according to the invention;
Fig. 4 is a schematic diagram illustrating an example of compiler-assisted generation of data storage information according to a preferred embodiment of the invention;
Fig. 5 is a schematic diagram illustrating a specific example of the relationship between given data storage information and a data storage structure in the data store.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Fig. 1 is a schematic block diagram illustrating an example of a computer system implementing a prefetch mechanism according to a preferred embodiment of the invention. As the skilled person will understand, the description is not intended to be complete with regard to the entire computer system, but will concentrate on those parts that are relevant to the invention. Furthermore, the example illustrated in Fig. 1 merely serves as a basis for understanding the basic principles of the invention, and the invention is not limited thereto.
The computer system basically comprises a processor 110 and a memory system 120. The computer system also comprises a memory manager 130, a scheduling unit 140 and a prefetch unit 150, implemented in operating system software, dedicated hardware or a combination thereof.
The processor 110 and the memory system 120 are interconnected via a conventional communication bus. In the particular example of Fig. 1, the memory system 120 comprises a main memory 121 and a faster cache memory 122. The main memory 121 generally comprises a job queue 123 for storing jobs to be executed, a data store 124 for storing data variables and constants, a program store 125 for storing executable program instructions, and a dedicated memory area 126 for storing special data storage information generated during program code analysis. As in most modern computer systems, the cache memory 122 generally comprises a data cache 127 and an instruction cache 128.
Although a single level of data and instruction cache is illustrated in Fig. 1, it is possible to utilize any number of cache levels. For example, the cache memory 122 may be representative of a so-called on-chip cache provided directly on the processor chip, an off-chip cache provided on a separate chip or both. The performance of a cache is affected by the organization of the cache, and especially the replacement algorithm. The replacement algorithm generally determines the blocks or lines in the relevant cache to which information in the main memory is mapped. The most commonly used replacement algorithms are direct mapping, set-associative and fully associative mapping.
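As a worked illustration of direct mapping (the cache dimensions are assumed, not taken from the patent), the line to which a memory block maps can be computed directly from its address:

    #include <stdint.h>

    #define LINE_SIZE 64   /* bytes per cache line (assumed) */
    #define NUM_LINES 512  /* e.g. a 32 KiB direct-mapped cache */

    /* In direct mapping, each memory block maps to exactly one cache line. */
    static inline uint32_t cache_index(uintptr_t addr)
    {
        return (uint32_t)((addr / LINE_SIZE) % NUM_LINES);
    }

    /* The tag distinguishes the different memory blocks sharing a line. */
    static inline uintptr_t cache_tag(uintptr_t addr)
    {
        return addr / ((uintptr_t)LINE_SIZE * NUM_LINES);
    }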
Although the replacement algorithm determines the blocks or lines in the relevant cache to which selected information in the main memory (or another higher-level cache) is mapped, it is still necessary to determine which blocks of information in the main memory should be copied into the cache in order to maximize the cache-hit ratio and minimize the memory latency. As mentioned in the background section, the memory latency will be reduced only when the required information is available in the cache. Whenever the processor needs data and/or instructions that are not available in the cache, the processor stalls until the required information has been loaded from the main memory.
In order to reduce the memory latency for accessing data variables and/or constants required in the execution, the invention proposes a new prefetch mechanism that effectively combines data storage information generated during program code analysis, for example at compile-time, with one or more run-time arguments in order to generate appropriate data prefetch requests. Preferably, data storage information and run-time information for a given program procedure are combined by means of a generic prefetch function in order to determine a useful prefetch address.
If a prefetch is requested too close in time to the actual execution, the data variables and/or constants may not be available in time for execution. In this respect, the invention has turned out to be particularly applicable in computer systems that operate based on a queue of jobs to be executed. It has been recognized that the queue structure makes it possible to peek into the job queue to fetch relevant information for a given job and request a prefetch of data well in advance of the actual execution of the job. By looking into the queue and generating a prefetch request sufficiently early, the required data will be available in time for execution of the job.
In the computer system of Fig. 1, the job queue 123 is implemented as a first-in-first-out (FIFO) buffer in which a number of externally and/or internally generated job messages are buffered, awaiting processing by the processor. Preferably, each job message in the job queue 123 includes program address representative information, input arguments to be used in the execution as well as data storage information related to the given procedure.
The program address representative information, simply referred to as program address information, directly or indirectly addresses the program procedure to be executed. For example, in systems using dynamically linked code that can be reconfigured during operation, the actual program address is generally accessed by means of several table look-ups in different tables. In this case, the program address information in the job message typically includes a pointer to a look-up table, which in turn points to another table and so on until the final program address is found.
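The chained look-ups can be pictured with the following C sketch; the table layout is a hypothetical assumption made only to illustrate the pointer chain:

    #include <stddef.h>

    /* Each look-up table entry either points to the next table or holds
       the final program address (next_table == NULL at the last level). */
    struct lut_entry {
        struct lut_entry *next_table;
        void *program_address;
    };

    static void *resolve_program_address(struct lut_entry *entry)
    {
        while (entry->next_table != NULL)
            entry = entry->next_table;  /* follow the chain of tables */
        return entry->program_address;
    }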
The data storage information, also referred to as data storage structure information, is generated before program execution by proper analysis of the program code. In many applications, it is often convenient to generate individualized data storage information for each of a number of program procedures defined in the program code. The procedure-specific data storage information generally describes the data storage structure related to the program procedure in question. The data storage information is typically stored in the data storage information area 126 for access during run-time.
In order to facilitate the data storage information access and reduce the number of table look-ups, the data storage information is preferably transferred by the operating system (OS) or equivalent from the data storage information area 126 into the job queue 123. For example, the operating system analyzes each job message to be placed in the job queue 123, and detects which program procedure is defined in the job message based on the program address information included in the message. The operating system then adds the corresponding data storage information to the respective job message, and writes the entire job message into the job queue.
Alternatively, however, the data storage information may be loaded directly from the data storage information area 126 based on the program address information for the given job.
Typically, the scheduling unit 140 schedules the corresponding jobs for execution by the processor 110 by managing the job queue 123 using a special execution pointer. The execution pointer usually points to the head of the job queue, indicating that the job at the head position is to be executed (or currently under execution).
For prefetch purposes, the prefetch unit 150 looks ahead in the job queue 123, using a special prefetch pointer, and initiates the prefetch mechanism for a given future job a predetermined number of jobs in advance of execution. First, the prefetch unit 150 loads program address information, input arguments and data storage information for the indicated job from the memory-allocated job queue 123 into the cache, unless this information already resides in the cache. The prefetch unit 150 then combines selected data storage information with at least one of the input arguments for the job according to a given prefetch address function, thus calculating a data prefetch address. For example, the data storage information typically comprises a data area start address together with information concerning which input argument or arguments are required to fully determine the corresponding prefetch address. In this case, a prefetch address may be calculated by using the start address to find the relevant area within the data store 124 and pin-pointing the address of the needed data variable or constant by means of the indicated input argument. The prefetch unit 150 communicates the calculated prefetch address to the memory manager 130, which in turn controls the actual transfer of data from the data store 124 into the data cache 127. If the memory manager 130 brings an entire cache line from the main memory 121 when a single word is referenced, it is generally not necessary to find the exact individual memory address for the future data reference. It is merely sufficient to determine the correct memory line or block in which the needed data is located. This relaxes the requirements on the exactness of the prefetch address function.
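Because only the containing cache line matters, the prefetch unit can round the calculated address down to a line boundary before handing it to the memory manager. A brief sketch under stated assumptions (prefetch_address and the memory manager interface are illustrative names, not from the patent):

    #include <stdint.h>

    #define CACHE_LINE 64  /* bytes; matches the block size used in the simulation */

    extern uintptr_t prefetch_address(uint32_t ban, uint32_t arg);
    extern void memory_manager_fetch_line(uintptr_t line_addr);  /* assumed interface */

    void request_data_prefetch(uint32_t ban, uint32_t arg)
    {
        /* Combine data storage information with the indicated run-time
           argument, then keep only the cache-line part of the address. */
        uintptr_t addr = prefetch_address(ban, arg);
        memory_manager_fetch_line(addr & ~(uintptr_t)(CACHE_LINE - 1));
    }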
It is important to note that the prefetch unit 150 generally follows the same job queue as the scheduling unit 140, but operates a predetermined number of jobs ahead of the job to be executed.
Once the given job has moved to the head of the job queue 123, ready for execution, the program address information and the input arguments for the job are already available in the cache. In addition, by using the data prefetch mechanism proposed by the invention, data variables and/or constants to be used in the execution of the job are also available in the data cache. This minimizes the memory latency and thus substantially reduces the number of stall cycles. Simulations have indeed shown that the number of stall cycles due to data store accesses can be reduced by 25-50%, as will be described in detail later on.
It is important to understand that we are dealing with so-called non-binding or asynchronous prefetching. This means that the prefetch is merely a hint to the memory system to bring the given data into a closer, faster level of memory, such that a later binding load will complete much faster. This kind of prefetch is executed asynchronously with no follow-on dependencies in the code stream, and therefore does not cause any stall cycles.
Fig. 2 is a schematic diagram illustrating the job queue and the corresponding execution flow related to the exemplary computer system of Fig. 1. Preferably, the prefetch for a given future job is initiated a predetermined number K of jobs in advance of the actual execution. This generally means that if job M is at the head of the job queue, a prefetch for job M+K will be initiated in step 201 by fetching program address information, input arguments and data storage information for job M+K from the job queue. In step 202, the data cache block or blocks required in the future execution of job M+K are subsequently calculated based on the obtained data storage information and at least one of the obtained input arguments. In step 203, the actual prefetch of the calculated data cache block or blocks is requested.
As mentioned earlier, the prefetch should not be issued too close in time to the actual data reference, since then the prefetched data may not be available in time to minimize or avoid a stall situation. On the other hand, if the prefetch is issued too early, there is a risk that the prefetched line is displaced from the cache before the actual data reference takes place. The so-called look-ahead period is ideally of the same order as the memory access time or slightly longer so that data related to job M+K will be available in the cache when job M+K is ready for execution at the head of the job queue. If the average job execution time is known, it is possible to determine how many jobs in advance of execution the prefetch should be issued. Naturally, the optimal look-ahead period differs from application to application. However, simulations have shown that a significant performance gain can be achieved already with a look-ahead of one or two jobs.
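As a rough worked example (the timing figures are illustrative assumptions, not from the patent), the look-ahead K can be sized as

    K = ceil(T_mem / T_job)

so that a memory access time T_mem of about 150 ns and an average job execution time T_job of about 100 ns give K = ceil(1.5) = 2, in line with the observation that a look-ahead of one or two jobs already yields a significant gain.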
It should be understood that the representation of the job queue in Fig. 2 is a snap-shot, and that a similar prefetch of program address information, input arguments and data variables or constants has already been performed for each of the jobs to be executed before job M+K, including job M. Consequently, the program address information and input arguments required for starting job M as well as data variables and/or constants needed in the execution of job M are ideally available in the cache so that the execution of job M can be initiated in step 204. The results of job M are stored in relevant parts of the memory system. If a new job M+N is generated as a result of the execution of job M, this job is stored in the job queue in step 205. Preferably, the operating system adds the data storage information corresponding to the new job M+N from the data storage information area 126 into the relevant position of the job queue. When the execution of job M is completed, job M is shifted out of the job queue and job M+1 is placed at the head of the job queue. The execution flow defined by steps 201 to 205 is then repeated with M = M+1.
The prefetch mechanism according to the invention may be implemented as an operating system routine that is activated between jobs, or executed as an integrated part of the currently executing job M. Alternatively, the prefetch may be executed by dedicated hardware or even a separate processor with software responsible for job scheduling and prefetching. In the latter case, it is generally easier to optimize the prefetch timing to the memory access time of the slower memory, since prefetches may be issued by the separate processor at any suitable time.
Fig. 3 is a schematic diagram illustrating the general principle for generating data storage information according to the invention. An input program file 302 is provided to a code analyzer 304, which performs a flow graph analysis or equivalent analysis of the program code. For each of a number of program procedures, the code analyzer 304 extracts static information concerning the data storage structure related to the program procedure. For example, during analysis of the code related to a given program procedure, the code analyzer 304 may extract information regarding the start address of a specific area in the data store towards which the given program procedure operates. In order to determine the address of a data variable or constant in the specific data store area, one or more run-time arguments are typically required. The code analyzer 304 does not know the values of any run-time arguments, but instead analyzes the code to provide information as to which input argument or arguments are required to pinpoint the address of the needed data within the specified data store area. During run-time, the required input argument(s) can then be accessed and combined with the static information from the code analyzer to determine the relevant data store address.

Fig. 4 is a schematic diagram illustrating an example of compiler-assisted generation of data storage information according to a preferred embodiment of the invention. The source code, in the form of an input program file 402, is provided to a compiler or optimizer 404. During compilation, the compiler translates the source code into object code, producing a corresponding output program file 406. In the same process, the compiler generally generates a compiler help file in the form of a procedure descriptor table 408. This table normally includes a general description of each compiled program procedure, indicating the name of the procedure, the number of input arguments, possibly the format of the arguments, and program address information. In accordance with a preferred embodiment of the invention, the compiler 404 also generates individual data storage information for each program procedure by analysis of the corresponding program code, and integrates this information into the procedure descriptor table 408. The data storage information can then be accessed from the procedure descriptor table 408 during run-time and combined with run-time arguments to generate appropriate prefetch requests. Although it is convenient to integrate the data storage information within the procedure descriptor table 408, it is also possible to generate a separate data storage information table that is accessible during run-time.
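Purely as an illustration, an entry of such a compiler-generated procedure descriptor table could be laid out as in the following C sketch; every field name here is an assumption made for the example and does not reflect any particular compiler:

typedef struct {
    const char  *name;          /* name of the compiled procedure */
    unsigned int num_args;      /* number of input arguments */
    unsigned int program_addr;  /* program address information */
    /* compiler-generated data storage information: */
    unsigned int ban;           /* base address number of the data store area */
    unsigned int arg_nr;        /* which input argument pinpoints the address */
} proc_descriptor;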
For a better understanding of the invention, an example of a memory address determination algorithm will now be described with reference to a given data storage structure.
Fig. 5 is a schematic diagram illustrating a specific example of the relationship between given data storage information and a data storage structure in the data store. The prefetch mechanism has access to the run-time arguments 510 for a given job as well as data storage information 520 related to a program procedure defined in the given job.
In this example, the data storage information 520 is represented by a base address number ban as well as an indication arg nr of which input argument or arguments to the procedure are required to determine a prefetch address. The base address number is unique for the given program procedure and acts as a pointer to a base address table 525. The base address table 525 holds data area start addresses, record size values and offset values, and each base address number is associated with a unique data area start address dfn, record size recsize and offset. By making a lookup in the base address table 525 using the given base address number ban, a unique set of data storage parameters dfn, recsize and offset for use in addressing the data store 530 is obtained. Alternatively, the data storage parameters dfn, recsize and offset are given directly, eliminating the need for a table lookup in the base address table using the base address number. The dfn value indicates the start address of a given data store area 535 in the data store 530. The recsize value represents the size of a record in the given data storage structure. The offset value indicates which of the variables within a record is requested.
In order to pinpoint the memory address of the data variable, the input argument or arguments indicated by the data storage information are also required. In the example of Fig. 5, the data storage information 520 points out a certain input argument, which is provided as input to a pointer function p(arg). For example, the pointer function p(arg) may simply be defined as the sum of the relevant input argument arg and an arbitrary constant C. The resulting pointer value indicates in which record the needed data variable is located.
Based on the base address table information accessed via the given base address number ban in combination with the indicated input argument, the prefetch address can thus be calculated according to the following generic prefetch function (assuming that the data store does not have any index dependencies):
prefetch address = dfn(ban) + p(arg) · recsize(ban) + offset(ban)    (1)

This means that the dfn value gives the data area start address, the p(arg) value multiplied by the recsize value gives the relevant record, and the offset value finally gives the correct variable within the identified record. The prefetch unit then requests a prefetch of the data variable, or preferably an entire data cache block, from the data store into the data cache based on the calculated address.
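As a minimal C sketch of formula (1): the parameters ban, dfn, recsize, offset and the pointer function p(arg) come from the description above, while the type and function names and the table representation are illustrative assumptions.

typedef struct {
    unsigned int dfn;      /* data area start address */
    unsigned int recsize;  /* size of one record */
    unsigned int offset;   /* offset of the variable within a record */
} ba_entry;

extern ba_entry base_address_table[];   /* indexed by ban */

/* The pointer function p(arg): here the sum of the input argument and a
   per-procedure constant C, as described in the text. */
static unsigned int p(int arg, unsigned int C)
{
    return (unsigned int)arg + C;
}

/* Formula (1): prefetch address = dfn(ban) + p(arg)*recsize(ban) + offset(ban) */
unsigned int prefetch_address(unsigned int ban, int arg, unsigned int C)
{
    const ba_entry *e = &base_address_table[ban];
    return e->dfn + p(arg, C) * e->recsize + e->offset;
}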
In the following, a particular implementation of the above address calculation algorithm in a commercial Alpha processor from Compaq will be described. It is assumed that each set of data storage information is 32 bits, with the following store layout:
[C (15 bits), arg nr (5 bits), ban (12 bits)]
If a 64-bit Alpha processor is used, it is thus possible to get data storage information for two cache block predictions per processor load:
[C2 (15 bits), arg nr2 (5 bits), ban2 (12 bits), C1 (15 bits), arg nr1 (5 bits), ban1 (12 bits)]
The following steps may be performed in order to calculate the needed cache blocks:
1. Load 64 bits of data storage information, with the store layout defined above.
2. Extract a base address number ban by logically AND-ing the loaded value with 12 bits corresponding to the decimal value 4095.
3. Use the extracted base address number to load dfn(ban), recsize(ban) and offset(ban).
4. Shift the data storage information 12 steps to the right.
5. Extract an argument number arg nr by logically AND-ing the value from step 4 with 5 bits corresponding to the decimal value 31, and use this value to address the required input argument.
6. Shift the data storage information 5 steps further to the right.
7. Extract a constant C by logically AND-ing the value in step 6 with 15 bits corresponding to the decimal value 32767.
8. Determine a pointer value p(arg) by adding the value of the input argument addressed in step 5 to the constant obtained in step 7.
9. Calculate a cache block prefetch address according to the generic prefetch function defined in formula (1) above.
10. Calculate the next cache block prefetch address by repeating steps 2-9.
11. Load the next 64 bits of data storage information and go to step number 2.
The cache block determination defined by steps 1-11 is a very straightforward implementation based on simple logical operations and shift operations, and does not involve any logical decisions. This is important for minimizing the overhead for the data prefetch mechanism according to the invention.
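Expressed in C, the per-prediction part of steps 1-11 might look as follows. This is a sketch under the 32-bit layout above; it reuses the illustrative prefetch_address() from the previous sketch, and request_prefetch() stands in for an assumed platform-specific prefetch hook.

extern unsigned int prefetch_address(unsigned int ban, int arg, unsigned int C);
extern void request_prefetch(unsigned int addr);  /* assumed platform hook */

void prefetch_from_dsi(unsigned long long dsi, const int *args)
{
    int i;
    for (i = 0; i < 2; i++) {               /* two predictions per load; step 10 */
        unsigned int ban, arg_nr, C;

        ban = (unsigned int)(dsi & 4095);   /* step 2: low 12 bits */
        dsi >>= 12;                         /* step 4 */
        arg_nr = (unsigned int)(dsi & 31);  /* step 5: 5-bit argument number */
        dsi >>= 5;                          /* step 6 */
        C = (unsigned int)(dsi & 32767);    /* step 7: 15-bit constant */
        dsi >>= 15;                         /* advance to the second 32-bit set */

        /* steps 3, 8 and 9: table lookup, p(arg) = arg + C, formula (1) */
        request_prefetch(prefetch_address(ban, args[arg_nr], C));
    }
}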
A trace-driven simulation was used to study the cache-hit ratio obtained by using the memory address determination algorithm proposed above. The trace was recorded in a live telecommunication exchange based on a 64-bit Alpha processor and included approximately 6 million assembler instructions. The cache block size was 64 bytes. The simulator counted the needed number of data cache blocks for every executed signal in the trace and compared that number with the number of preloaded cache blocks. Table I below shows the results of the simulation.
Table I
[Table I compares, for the simulated trace, the number of preloaded data cache blocks with the number of needed cache blocks, together with the corresponding share of data store accesses; see the summary below.]
The simulation shows that 48-56% of the cache blocks could be preloaded with the algorithm used by the invention. The percentage of data store accesses to these cache blocks was 58-65%. Calculations and measurements show that nearly 50% of the execution time is stalled due to data store accesses in the Alpha-based processor architecture. Based on the simulation results, it is estimated that the proposed prefetch mechanism could reduce the number of stall cycles by approximately 25-50%. This corresponds to a total capacity gain of about 10-25%, which is a remarkable improvement offered by the invention. In real-life applications, improvements of at least 5-10% are expected.
The invention can be used in any computer system that allows non-binding asynchronous transfer of data from one level of memory to another level of memory. This includes most modern computer systems such as pipelined processor systems, superscalar processor systems, multiprocessor systems and combinations thereof.
The invention is particularly applicable to computer systems in which a number of externally and/or internally generated jobs are arranged, explicitly or implicitly, in a queue, and in applications with a high rate of so-called context switching. For example, most transaction-based systems have a job queue in which jobs are buffered, awaiting processing by the processor or processors within the system. In database applications, a number of requests or queries from various clients are typically buffered for subsequent processing by a database server.
It is also possible to realize the invention in a job-parallel processor, in which multiple jobs are speculatively executed in parallel, as long as the job prefetch unit operates towards the same job queue and utilizes the same general job allocation principle as the job execution scheduler.
The invention is also applicable to process-based computer systems. For example, many commercial operating systems such as Unix and Windows NT work with processes. In a system having an execution model based on processes, incoming signal messages originating from events in the system or from communication messaging are directed to corresponding processes. A process is normally represented by its process control block, which holds the process state when the process is not executing and possibly administrative information required by the operating system. The process state includes program address information and input arguments required for execution of the current job of the process. A process can be either READY, waiting in a ready queue for execution, EXECUTING, meaning that a job is executed based on the current process state of the process control block, or BLOCKED, waiting for a required signal message in a blocked queue.
In analogy with the previously described embodiments of the invention, it is thus possible to peek into the ready queue and access input arguments for a given job from the process control block in advance of execution of the job. One or more of these input arguments can then be combined with corresponding data storage information in order to generate a prefetch address. For additional information on process-oriented operating systems and execution models, please refer to Operating System Concepts by Silberschatz and Peterson, Addison-Wesley Publ. Co., 1988, pp. 149-185. For the interested reader, additional insight into the invention can be gained from the C code example in Appendix A.
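To make the process-based variant concrete, the following C fragment sketches how the ready queue could be peeked at. The process control block layout, the singly linked queue and the helper functions are assumptions made for this example only; it reuses the illustrative prefetch_address() and the assumed request_prefetch() hook from the sketches above.

extern unsigned int prefetch_address(unsigned int ban, int arg, unsigned int C);
extern void request_prefetch(unsigned int addr);  /* assumed platform hook */

/* Hypothetical process control block holding the saved state of a process. */
typedef struct pcb {
    unsigned int program_addr;  /* program address information */
    int          args[8];       /* input arguments for the current job */
    unsigned int ban;           /* data storage information: base address number */
    unsigned int arg_nr;        /* which input argument pinpoints the address */
    unsigned int C;             /* constant for the pointer function */
    struct pcb  *next;          /* next process in the ready queue */
} pcb;

/* Peek k entries ahead in the ready queue, without dequeuing anything,
   and issue a non-binding prefetch for the data of that process's job. */
void prefetch_ready_queue(const pcb *ready_head, int k)
{
    const pcb *p = ready_head;
    while (k-- > 0 && p != NULL)
        p = p->next;
    if (p != NULL)
        request_prefetch(prefetch_address(p->ban, p->args[p->arg_nr], p->C));
}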
The embodiments described above are merely given as examples, and it should be understood that the present invention is not limited thereto. Further modifications, changes and improvements which retain the basic underlying principles disclosed and claimed herein are within the scope and spirit of the invention.
APPENDIX A
In the following C programming language example, a job queue is defined and used by a main executive. The job queue consists of job-queue entries, and each entry includes information about the specific procedure to be executed and the input arguments to the procedure. In this example, two simple program procedures are defined. A generic prefetch function is inlined in the main executive. The prefetch function uses information found in the job-queue in combination with procedure- specific data storage information. The data storage information would normally be generated by the compiler.
Code layout:
#include <stdio.h>
#include <stdlib.h>

/* Main memory definitions */
#define WA_SIZE (64*1024*1024)   /* 64M work-area entries */
int *work_area;

/* Job queue definitions */
#define JQ_SIZE 256              /* 256 job queue entries */

typedef struct {
    unsigned int func;           /* function code */
    int arg1;                    /* argument 1 for function */
    int arg2;                    /* argument 2 for function */
} jq_entry;

jq_entry jqe[JQ_SIZE];           /* Job queue */

int next_free_entry = 0;         /* Next free entry in queue */
int execute_pointer = 0;         /* Entry in queue to execute */

int LOOKAHEAD;                   /* Lookahead distance for prefetch */

#define FUNC_INCR  0
#define FUNC_DECR  1
#define FUNC_COUNT 2

/* Non-binding prefetch hint; mapped here to the GCC builtin as an
   assumption -- the original relies on a platform-specific macro. */
#define PREFETCH(addr) __builtin_prefetch(addr)

/* Prefetch function structure */
typedef struct {
    unsigned int arg1:1;         /* Use arg1 */
    unsigned int arg2:1;         /* Use arg2 */
    unsigned int pad:6;          /* Pad structure for alignment */
    unsigned int shift_right:8;  /* Divide by 2**shift_right using shift */
    unsigned int constant:16;    /* Add constant */
} prefetch_function;

prefetch_function pf_func[FUNC_COUNT];
/* Example procedure 1: */
void procedure_decrement(int element_number, int arg2)
{
    unsigned int new_function = element_number & 1; /* Pick lowest bit for new function */
    int arg1;

    element_number = element_number + 123;
    if ((element_number > 0) && (element_number < WA_SIZE)) {
        if (work_area[element_number] > arg2) {
            work_area[element_number] = work_area[element_number] - 1;
        }
        arg1 = (work_area[element_number] ^ 257) & (WA_SIZE-1);
        arg2 = arg2 + (work_area[element_number] & 0xffff);
    } else { /* Element number out of range */
        arg1 = element_number;
        if (arg1 < 0) arg1 = arg1 * -1;
        arg1 = (arg1 ^ (257*4011)) & (WA_SIZE-1);
    }

    /* Add new job to queue */
    jqe[next_free_entry].func = new_function;
    jqe[next_free_entry].arg1 = arg1;
    jqe[next_free_entry++].arg2 = arg2;
    if (next_free_entry == JQ_SIZE) next_free_entry = 0; /* Wrap around at end of queue */
}
/* Example procedure 2: */
void procedure_increment(int arg1, int element_number)
{
    unsigned int new_function = (element_number >> 1) & 1; /* Pick 2nd lowest bit for new function */
    int arg2;

    element_number = element_number >> 1;
    if ((element_number > 0) && (element_number < WA_SIZE)) {
        if (work_area[element_number] < arg1) {
            work_area[element_number] = work_area[element_number] + 1;
        }
        arg2 = (arg1 + work_area[element_number]) ^ 0x2048;
        arg1 = (work_area[element_number] ^ 257) & (WA_SIZE-1);
    } else { /* Element number out of range */
        arg2 = element_number;
        if (arg2 < 0) arg2 = arg2 * -1;
        arg2 = (arg2 ^ (247*4011)) & (WA_SIZE-1);
    }

    /* Add new job to queue */
    jqe[next_free_entry].func = new_function;
    jqe[next_free_entry].arg1 = arg1;
    jqe[next_free_entry++].arg2 = arg2;
    if (next_free_entry == JQ_SIZE) next_free_entry = 0; /* Wrap around at end of queue */
}

void init()
{
    int i;
    /* Allocate work_area */
    work_area = (int *) malloc(sizeof(int)*WA_SIZE);
    if (!work_area) {
        perror("Malloc failed.");
        exit(1);
    }

    /* Initialize the work_area */
    for (i = 0; i < WA_SIZE; i++) {
        work_area[i] = random();
    }

    /* Setup the initial job-queue */
    for (i = 0; i < (JQ_SIZE/2); i++) {
        jqe[next_free_entry].func = random() & 1;
        jqe[next_free_entry].arg1 = random() % WA_SIZE;
        jqe[next_free_entry].arg2 = random() % WA_SIZE;
        next_free_entry++;
    }

    /* Setup the procedure-specific data storage information */
    /* This information about each procedure would normally be generated by the compiler */
    pf_func[FUNC_INCR].arg1 = 0;
    pf_func[FUNC_INCR].arg2 = 1;
    pf_func[FUNC_INCR].shift_right = 1;
    pf_func[FUNC_INCR].constant = 0;

    pf_func[FUNC_DECR].arg1 = 1;
    pf_func[FUNC_DECR].arg2 = 0;
    pf_func[FUNC_DECR].shift_right = 0;
    pf_func[FUNC_DECR].constant = 123;
}

int main(void)
{
    int i, pf_pointer;
    char *env_str;

    if ((env_str = getenv("PF_LOOKAHEAD")) == NULL)
        LOOKAHEAD = JQ_SIZE;   /* default when no environment override is set */
    else
        LOOKAHEAD = strtoul(env_str, NULL, 0);

    init();

    /* Main loop of the 'executive' */
#define NUM_JOBS 50000000
    for (i = 0; i < NUM_JOBS; i++) {
        int tmp_func;
        int tmp_val1, tmp_val2, tmp_sum;

        if (LOOKAHEAD) {
            /* Get job queue entry to prefetch */
            pf_pointer = (execute_pointer + LOOKAHEAD) % JQ_SIZE;

            /* Execute prefetch function */
            tmp_func = jqe[pf_pointer].func;
            tmp_val1 = jqe[pf_pointer].arg1;
            tmp_val2 = jqe[pf_pointer].arg2;
            tmp_val1 = pf_func[tmp_func].arg1 ? tmp_val1 : 0;
            tmp_val2 = pf_func[tmp_func].arg2 ? tmp_val2 : 0;
            tmp_sum = tmp_val1 + tmp_val2;
            tmp_sum = tmp_sum >> pf_func[tmp_func].shift_right;
            tmp_sum = tmp_sum + pf_func[tmp_func].constant;
            PREFETCH(&work_area[tmp_sum]);
        }

        /* Execute job */
        switch (jqe[execute_pointer].func) {
        case FUNC_INCR:
            procedure_increment(jqe[execute_pointer].arg1, jqe[execute_pointer].arg2);
            break;
        case FUNC_DECR:
            procedure_decrement(jqe[execute_pointer].arg1, jqe[execute_pointer].arg2);
            break;
        default:
            printf("Invalid entry in job queue\n");
            exit(1);
        }

        /* Increment queue pointers */
        execute_pointer++;
        if (execute_pointer == JQ_SIZE) execute_pointer = 0;
    }

    return 0;
}
Claims

1. A method for prefetching data in a computer system, characterized by:
- determining a prefetch address based on data storage information generated during program code analysis combined with at least one run-time argument; and
- requesting a prefetch of data based on the determined prefetch address.
2. The method for prefetching data according to claim 1, characterized by fetching, from a queue of jobs scheduled for execution by the computer system, said at least one run-time argument as an input argument for a given job in advance of execution of the job.
3. The method for prefetching data according to claim 2, characterized in that said data storage information is associated with a program procedure indicated in said given job, wherein said at least one run-time input argument is related to said program procedure.
4. The method for prefetching data according to claim 2 or 3, characterized by determining individual data storage information for each of a plurality of program procedures by program code analysis, and selecting, for said given job, the appropriate data storage information based on a program procedure indicated in said given job.
5. The method for prefetching data according to any of the claims 2 to 4, characterized in that said data storage information comprises at least a data area start address as well as information concerning which input argument or arguments for said given job are required as said at least one run-time argument to determine the relevant prefetch address.
6. The method for prefetching data according to claim 2, characterized by initiating prefetching of data for a given job a predetermined number of jobs in advance of execution of the job.
7. The method for prefetching data according to claim 1, characterized in that said steps of determining a prefetch address and requesting a prefetch of data are executed by operating system software or by scheduling software/hardware.
8. The method for prefetching data according to claim 1, characterized in that said step of determining said prefetch address comprises the step of processing said data storage information by means of straightforward logic and shift operations.
9. The method for prefetching data according to claim 1, characterized in that said data is prefetched from one level of memory to another level of memory in said computer system.
10. The method for prefetching data according to claim 1, characterized in that said data storage information and said at least one run-time argument are combined in a generic prefetch address function.
11. A method for supporting prefetching of data in a computer system, characterized by:
- determining data storage information for at least one program procedure by analyzing the corresponding program code; and
- storing said data storage information for access during run-time in order to enable determination of a prefetch address based on a combination of said data storage information and at least one run-time argument.
12. The method for supporting prefetching of data according to claim 11, characterized by determining data storage information for each of a plurality of program procedures at compile-time analysis of the program code, and storing the data storage information in a compile-time generated table for access during run-time.
13. The method for supporting prefetching of data according to claim 11 or 12, characterized in that said data storage information comprises at least a data area start address as well as information concerning which run-time argument or arguments are required to determine the relevant prefetch address.
14. A system for prefetching data in a computer system, characterized by:
- means for determining a prefetch address based on data storage information generated during program code analysis combined with at least one run-time argument; and
- means for requesting a prefetch of data based on the determined prefetch address.
15. The system for prefetching data according to claim 14, characterized by means for fetching, from a queue of jobs scheduled for execution by the computer system, at least one run-time argument as an input argument for a given job in advance of execution of the job.
16. The system for prefetching data according to claim 15, characterized in that said data storage information is associated with a program procedure indicated in said given job, wherein said at least one run-time input argument is related to said program procedure.
17. The system for prefetching data according to claim 15 or 16, characterized by determining individual data storage information for each of a plurality of program procedures at compile-time analysis of the program code, and selecting, for said given job, the appropriate data storage information based on a program procedure indicated in said given job.
18. The system for prefetching data according to any of the claims 15 to 17, characterized in that said data storage information comprises at least a data area start address as well as information concerning which input argument or arguments for said given job are required as said at least one run-time argument to determine the relevant prefetch address.
19. The system for prefetching data according to claim 15, characterized in that prefetching of data for a given job is initiated a predetermined number of jobs in advance of execution of the job.
20. The system for prefetching data according to claim 14, characterized in that said determining means and said requesting means are implemented by the operating system.
21. The system for prefetching data according to claim 14, characterized in that said determining means comprises means for processing said data storage information by using straightforward logic and shift operations.
22. The system for prefetching data according to claim 14, characterized in that said data is prefetched from one level of memory to another level of memory in said computer system.
23. The system for prefetching data according to claim 22, characterized in that said data storage information and said at least one run-time argument are combined in a generic prefetch address function.
24. A system for supporting prefetching of data in a computer system, characterized by:
- means for determining data storage information for at least one program procedure by analyzing the corresponding program code; and
- means for storing said data storage information for access during run-time in order to enable determination of a prefetch address based on a combination of said data storage information and at least one run-time argument.
25. The system for supporting prefetching of data according to claim 24, characterized by means for determining data storage information for each of a plurality of program procedures at compile-time analysis of the program code, and means for storing the data storage information in a compile-time generated table for access during run-time.
26. The system for supporting prefetching of data according to claim 24 or 25, characterized in that said data storage information comprises at least a data area start address as well as information concerning which run-time argument or arguments are required to determine the relevant prefetch address.
27. A system for prefetching data in a computer system, characterized by:
- means for fetching, from a queue of jobs scheduled for execution by the computer system, program procedure information and a number of input arguments for a given job in advance of execution of the job;
- means for determining a prefetch address based on compile-time generated data storage information that is specific for the program procedure indicated in said given job, combined with at least one of the input arguments; and
- means for requesting a prefetch of data based on the determined prefetch address.
PCT/SE2001/002290 2001-10-19 2001-10-19 Data prefecthing in a computer system WO2003034229A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP01979147A EP1444584A1 (en) 2001-10-19 2001-10-19 Data prefecthing in a computer system
PCT/SE2001/002290 WO2003034229A1 (en) 2001-10-19 2001-10-19 Data prefecthing in a computer system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SE2001/002290 WO2003034229A1 (en) 2001-10-19 2001-10-19 Data prefecthing in a computer system

Publications (1)

Publication Number Publication Date
WO2003034229A1 true WO2003034229A1 (en) 2003-04-24

Family

ID=20284641

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/SE2001/002290 WO2003034229A1 (en) 2001-10-19 2001-10-19 Data prefecthing in a computer system

Country Status (2)

Country Link
EP (1) EP1444584A1 (en)
WO (1) WO2003034229A1 (en)



Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5761506A (en) * 1996-09-20 1998-06-02 Bay Networks, Inc. Method and apparatus for handling cache misses in a computer system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5778435A (en) * 1996-05-30 1998-07-07 Lucent Technologies, Inc. History-based prefetch cache including a time queue
US6175898B1 (en) * 1997-06-23 2001-01-16 Sun Microsystems, Inc. Method for prefetching data using a micro-TLB

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TIEN-FU CHEN; BAER, J.-L.: "A performance study of software and hardware data prefetching schemes", PROCEEDINGS OF THE 21ST ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, 18-21 April 1994, pages 223-232, XP010098130 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006038664A1 (en) * 2004-10-01 2006-04-13 Sony Computer Entertainment Inc. Dynamic loading and unloading for processing unit
US8285941B2 (en) 2008-02-25 2012-10-09 International Business Machines Corporation Enhancing timeliness of cache prefetching
JP2014225089A (en) * 2013-05-15 2014-12-04 オリンパス株式会社 Arithmetic unit
JP2014225088A (en) * 2013-05-15 2014-12-04 オリンパス株式会社 Arithmetic unit
WO2017016380A1 (en) * 2015-07-28 2017-02-02 Huawei Technologies Co., Ltd. Advance cache allocator
EP3317769A4 (en) * 2015-07-28 2018-07-04 Huawei Technologies Co., Ltd. Advance cache allocator
US10042773B2 (en) 2015-07-28 2018-08-07 Futurewei Technologies, Inc. Advance cache allocator

Also Published As

Publication number Publication date
EP1444584A1 (en) 2004-08-11


Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BY BZ CA CH CN CO CR CU CZ DE DM DZ EC EE ES FI GB GD GE GH HR HU ID IL IN IS JP KE KG KP KR LC LK LR LS LT LU LV MA MD MG MN MW MX MZ NO NZ PH PL PT RO SD SE SG SI SK SL TJ TM TR TT TZ UG US UZ VN YU ZA

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): GH GM KE LS MW MZ SD SL SZ UG ZW AM AZ BY KG KZ MD TJ TM AT BE CH CY DE DK ES FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW MR NE SN TD TG US

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2001979147

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2001979147

Country of ref document: EP

NENP Non-entry into the national phase

Ref country code: JP