US20040236922A1 - Methods and systems for memory allocation - Google Patents

Methods and systems for memory allocation

Info

Publication number
US20040236922A1
Authority
US
United States
Prior art keywords
memory
variable
cache
address range
variables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US10/442,375
Other versions
US6952760B2 (en)
Inventor
Michael Boucher
Theresa Do
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Oracle America Inc
Original Assignee
Sun Microsystems Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems Inc filed Critical Sun Microsystems Inc
Priority to US10/442,375, granted as US6952760B2
Assigned to SUN MICROSYSTEMS, INC. Assignment of assignors' interest (see document for details). Assignors: BOUCHER, MICHAEL; DO, THERESA
Publication of US20040236922A1
Application granted
Publication of US6952760B2
Assigned to Oracle America, Inc. Merger and change of name (see document for details). Assignors: Oracle America, Inc.; ORACLE USA, INC.; SUN MICROSYSTEMS, INC.
Adjusted expiration
Legal status: Expired - Lifetime

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5011: Allocation of resources to service a request, the resources being hardware resources other than CPUs, servers and terminals
    • G06F 9/5016: Allocation of resources to service a request, the resource being the memory
    • G06F 12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02: Addressing or allocation; relocation
    • G06F 12/0223: User address space allocation, e.g. contiguous or non-contiguous base addressing
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/14: Fourier, Walsh or analogous domain transformations, e.g. Laplace, Hilbert, Karhunen-Loeve transforms
    • G06F 17/141: Discrete Fourier transforms
    • G06F 17/142: Fast Fourier transforms, e.g. using a Cooley-Tukey type algorithm
    • G06F 12/08: Addressing or allocation; relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F 12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F 12/0844: Multiple simultaneous or quasi-simultaneous cache accessing
    • G06F 12/0846: Cache with multiple tag or data arrays being simultaneously accessible
    • G06F 12/0848: Partitioned cache, e.g. separate instruction and operand caches
    • G06F 12/0864: Addressing of a memory level using pseudo-associative means, e.g. set-associative or hashing

Definitions

  • Operating system 118 may also include an extended version of the Unix ‘C’ library function madvise( ) to help guide program data closer to the processor that will access it. Accordingly, during FFT computation, variables such as signal samples and workspace data can be stored closer to the processor.
  • In the FFT example described in the detailed description below, madvise( ) can be used to store the arrays INPUT and WORK close to the CPU.
  • The madvise( ) function accepts, as parameters, a starting address, a length, and an advisory flag (e.g., madvise(caddr_t addr, size_t len, int advice)). The advisory flag guides the operating system 118 in locating or relocating the memory referred to in the call to madvise( ). The new advisory flags in the extended madvise( ) function are the following (shown in Table 3 of the detailed description):
  • MADV_ACCESS_DEFAULT resets the kernel's expectation for how the specified address range will be accessed.
  • MADV_ACCESS_LWP tells the kernel that the next LWP (i.e., lightweight process, or thread) to touch the specified address range will access it most heavily. The kernel should try to allocate the memory and other resources for the address range and the LWP accordingly (e.g., closer to the processor and memory that runs the LWP).
  • MADV_ACCESS_MANY tells the kernel that many processes or LWPs will access the specified address range randomly across the machine. The kernel should try to allocate the memory and other resources for the address range accordingly (e.g., by making copies of the data and distributing a copy to each processor that runs an LWP that accesses the address range).
  • The madvise( ) function thus allows program 116 to specify that certain address ranges should be located as closely as possible to the processor that accesses those address ranges.
  • In response to the MADV_ACCESS_LWP flag, for example, operating system 118 may determine which thread has accessed the address range, then relocate the data in the address range so that the data is close to the processor that accesses the data (i.e., the processor that runs the thread). To that end, operating system 118 may take into consideration the aspects of the memory hierarchy explained in the detailed description below and, for example, attempt to place the data so that it will fit into one or more levels of cache, or so that it will start in independent memory banks.
  • Operating system 118 may address the same considerations in response to the MADV_ACCESS_MANY flag. However, operating system 118 addresses those considerations for each of a predetermined number of threads that access the specified memory range after the call to madvise( ). More particularly, operating system 118 may make multiple copies of the data in the memory range and distribute those copies close to the processors that run the individual threads. For example, operating system 118 may migrate pages in which the memory range lies to one or more memory boards.
  • The MADV_ACCESS_DEFAULT flag instructs operating system 118 to disregard any prior flags applied to the specified memory range. In other words, operating system 118 will no longer perform the special processing noted above for that memory range.
  • Program 116 may specify the MADV_ACCESS_DEFAULT flag before freeing a memory block, for example.
  • Operating system 118 also provides a memory advisory library that is useful when the source code of program 116 cannot be modified to include madvise( ) calls. More specifically, the madv library (stored, for example, in an object named madv.so.1) may operate as explained below in Table 4.
    TABLE 4
    The madv.so.1 shared object provides a means by which virtual memory advice can be selectively configured for launched processes and their descendants.
    LD_PRELOAD=$LD_PRELOAD:madv.so.1
    ENVIRONMENT VARIABLES: If the madv.so.1 shared object is specified in the LD_PRELOAD list, the following environment variables are read by the madv shared object to determine which created processes the specified advice applies to.
    MADV=<advice>: Specifies the virtual memory advice to use for all heap, shared memory, and mmap( ) regions in the process address space. This advice is applied to all created processes.
    MADVCFGFILE=<config-file>: <config-file> is, for example, a text file containing one or more madv configuration entries of the form <exec-name> <exec-args>:<advice-opts>. Advice specified in <config-file> takes precedence over that specified by the MADV environment variable. When MADVCFGFILE is not set, advice is taken from the file /etc/madv.conf if it exists.
    <exec-name> specifies the name of a program; it can be a full pathname, a base name, or a pattern string. See the manual pages on sh( ), and the section File Name Generation, for a discussion of pattern matching.
    <exec-args> is an optionally specified pattern string to match against arguments. Advice is set if <exec-args> is not specified or occurs within the arguments to <exec-name>.
    heap=<advice>: The heap is defined to be the brk area (see the manual pages on brk( )). The advice applies to the existing heap and to any additional heap memory allocated in the future.
  • Options mapshared, mapprivate and mapanon take precedence over option map.
  • Option mapanon takes precedence over mapshared and mapprivate.
    MADVERRFILE=<pathname>: By default, error messages are logged via syslog( ) using level LOG_ERR and facility LOG_USER. If MADVERRFILE contains a valid <pathname> (such as /dev/stderr), error messages are logged there instead.
  • NOTES: The advice is inherited; a child process has the same advice as its parent. On exec( ) (see the manual pages on exec( )), the advice is set back to the default system advice unless different advice has been configured via the madv shared object. Advice is applied to mmap( ) regions explicitly created by the user program. Regions established by the run-time linker or by system libraries making direct system calls (e.g., libthread allocations for thread stacks) are not affected.
  • Table 5 shows several examples of how to use the madv library environment variables.
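  • Since Table 5 itself is not reproduced above, the lines below are purely illustrative stand-ins for the kind of usage it describes; the advice value access_lwp and the program name fft_program are assumptions for the illustration, not values quoted from the patent. The first line preloads the library and sets process-wide advice; the second is a configuration entry of the documented <exec-name> <exec-args>:<advice-opts> form.

    LD_PRELOAD=$LD_PRELOAD:madv.so.1 MADV=access_lwp ./fft_program
    fft_program:heap=access_lwp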
  • When operating system 118 recognizes that the MADV environment variables have been set, it responsively creates startup code that executes before program 116 enters its main function. For example, when operating system 118 is the Solaris™ operating system, it may create a function and store it in a section of the executable program or associated library known as a ‘.init’ section. The .init section contains code that executes before the user-written portions of program 116 start. Solaris is manufactured by Sun Microsystems, Inc. of Santa Clara, Calif. Sun, Sun Microsystems, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
  • Operating system 118 reads the MADV environment variables and responsively creates calls to madvise( ) that reflect the environment variable settings.
  • Each madvise( ) function call may specify a starting memory address for a particular memory region called out by the environment variables, the length of that region, and an advisory flag (e.g., MADV_ACCESS_LWP) that causes operating system 118 to respond to memory accesses in the way specified by the environment variable flags.
  • Operating system 118 places the constructed madvise( ) calls in the .init function, and they are therefore executed prior to the first or main routine in program 116.
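  • The generated startup code itself is not shown in the patent, so the following is only a rough C sketch of the mechanism: a function that runs before main( ) (a GCC constructor attribute stands in here for the .init section) reads MADV and issues madvise( ) calls. Region discovery is elided; heap_region is a placeholder for the heap, shared memory, and mmap( ) regions the real library enumerates.

    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/mman.h>

    static char heap_region[4096];   /* placeholder; a real region would be
                                      * page-aligned and discovered at runtime */

    __attribute__((constructor))
    static void apply_madv_env(void) {
        const char *advice = getenv("MADV");
        if (advice == NULL)
            return;
    #ifdef MADV_ACCESS_LWP
        /* Map one assumed advice value onto the extended flag, where the
         * platform provides it. */
        if (strcmp(advice, "access_lwp") == 0)
            madvise((caddr_t)heap_region, sizeof heap_region, MADV_ACCESS_LWP);
    #endif
    }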
  • In this way, memory allocation program 124 allocates memory for program data during FFT computation in a manner that improves data access efficiency compared to typical memory allocation methods used during FFT computation.
  • The memory allocation of methods and systems consistent with the present invention is dynamic: memory addresses are automatically assigned, taking into consideration, for example, the memory architecture of a given data processing system. Because the overhead associated with accessing the program data is reduced, program 116 typically runs faster and produces results more quickly. Further, the power-of-two offset problems associated with typical memory allocation methods are eliminated.
  • Finally, memory allocation program 124 may, in general, consider more than two variables. That is, memory allocation program 124 may determine which combination of variables will fit into the cache memory, and allocate an address range for each variable that causes the variable to map to a different location in the cache than the remaining variables; a sketch of such a combination check follows.
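  • A minimal C sketch of that combination check, assuming line-granular packing in a single cache:

    #include <stddef.h>

    /* Returns nonzero when every variable in the candidate combination can
     * be packed into the cache at once, each rounded up to whole cache
     * lines so the variables can be assigned disjoint line ranges. */
    static int combination_fits(const size_t *sizes, size_t count,
                                size_t line_size, size_t cache_size) {
        size_t total = 0;
        for (size_t i = 0; i < count; i++)
            total += (sizes[i] + line_size - 1) / line_size * line_size;
        return total <= cache_size;
    }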

Abstract

Methods and systems consistent with the present invention allocate memory for program data during fast Fourier transform computation in a way that is favorable for a given access pattern for the program data, and for the memory architecture of a given data processing system. As a result, the overhead associated with accessing the program data is reduced compared to typical memory allocation performed during fast Fourier transform computation. Thus, a fast Fourier transform computing program that manipulates the program data typically runs faster and produces results more quickly.

Description

    FIELD OF THE INVENTION
  • This invention relates to memory allocation in data processing systems. In particular, this invention relates to strategically allocating memory areas for program data during fast Fourier transform processing in order to reduce the overhead associated with accessing the program data. [0001]
  • BACKGROUND OF THE INVENTION
  • Modern computer systems store data throughout a hierarchy of memories. For example, an extremely fast (but typically small) cache memory is commonly provided closest to the system processor (in some instances on the same die as the processor). Beyond the cache memory and external to the processor are memory modules that hold much larger amounts of random access memory (RAM). In addition, most modern operating systems provide a virtual memory subsystem that allows the computer system to treat the enormous capacity of magnetic storage (e.g., disk drives) as additional system memory. [0002]
  • In general, the “closer” the memory is to the processor, the faster the processor may access the data stored in the memory. Thus, the processor quite rapidly executes read and write operations to the cache, and executes somewhat slower read and write operations to the external RAM. The slowest access generally arises from a read or write operation that requires the operating system to access memory space that has been stored on the disk. The access penalties associated with retrieving data stored outside the cache are so severe that program performance can be crippled if the program requires frequent access to those memory areas (and more particularly, through the virtual memory system to the disk). [0003]
  • In the past, there were few approaches available for placing data in memory in order to keep data “close” to the processor. As one example, in non-uniform memory architecture (NUMA) machines (i.e., machines that included multiple memories and processors distributed over multiple distinct system boards), the time to access memory typically varied from one processor to another. This was typically because the physical memory chips were located on boards that took differing amounts of time to reach. If a processor repeatedly made such access requests, the operating system might create a copy of the requested data and place it in a memory on the same system board as the requesting processor. This process, sometimes referred to as page migration, worked only at a very coarse level (i.e., by determining no more than on which board data should reside). There were also systems, however, in which all memory accesses cost the same regardless of location relative to the reading or writing processor. [0004]
  • Another approach, taken by High Performance Fortran (HPF), was to add proprietary extensions to a programming language to give the programmer a small amount of control over data placement in memory. For example, a programmer might be able to specify that an array be distributed in blocks over several boards in a NUMA architecture. However, the language itself was generally unaware of the operating system, the hardware, and their impact on placement of data in memory. Thus, while HPF could also provide some coarse control over data placement, the code was not portable, and the programmer was unduly constrained in choices of programming languages. [0005]
  • Alternatively, a programmer could, by hand, attempt to specify an optimal layout for one or more pieces of program data. For example, a programmer might manually manipulate array sizes so that the array fell into desirable parts of memory. Doing so, however, led to atrocious programmer time and resource costs, and was still not guaranteed to provide an efficient solution over all of the various operating systems, hardware platforms, and process loads under which the program might run. [0006]
  • Further, during computation of fast Fourier transforms (FFTs), conventional memory allocation techniques typically offset program data by power-of-two strides, making it difficult to place program data close to the processor and causing memory conflicts. For example, a typical FFT computing program uses at least two arrays to compute an FFT. The arrays include a first array for storing inputted signal samples and a second array for providing a workspace. If each of the arrays has a size of 1024 words, then based on conventional memory allocation techniques, the arrays are offset by 1024 words. In other words, the arrays are offset in memory by a power-of-two stride of 1024 words (i.e., 2^10). [0007]
  • Offsetting the arrays by 1024 words, however, creates a conflict with a system that is configured, for example, for sequential memory access or for an offset of 512 words. Also, if the program computing the FFT alternates access to the arrays, the alternating access can result in a conflict when the arrays are offset by a power-of-two-word displacement. [0008]
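  • The conflict can be made concrete with the direct-mapped placement rule discussed later in this description. The following minimal C sketch uses hypothetical base addresses and the cache geometry described below with reference to FIG. 3 (8192 bytes, 32-byte lines); it shows that when two arrays are offset by a power-of-two stride that is a multiple of the cache size (as can happen with 1024-word offsets, depending on word size), corresponding elements land on the same cache line and displace each other on every alternating access:

    #include <stdio.h>
    #include <stdint.h>

    /* Geometry of the illustrative direct-mapped cache of FIG. 3:
     * 256 lines x 32 bytes = 8192 bytes. */
    #define LINE_SIZE  32u
    #define NUM_LINES  256u
    #define CACHE_SIZE (LINE_SIZE * NUM_LINES)

    /* Direct-mapped placement: line = (address / line size) mod (line count). */
    static unsigned cache_line(uintptr_t addr) {
        return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
    }

    int main(void) {
        uintptr_t input = 0xC75F4000u;              /* hypothetical base   */
        uintptr_t work  = input + 2u * CACHE_SIZE;  /* power-of-two stride */

        /* Corresponding elements map to the same line, so alternating
         * INPUT/WORK accesses evict each other from the cache. */
        for (unsigned i = 0; i < 4; i++) {
            printf("element %u: INPUT -> line %u, WORK -> line %u\n",
                   i, cache_line(input + i * 8u), cache_line(work + i * 8u));
        }
        return 0;
    }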
  • Therefore, a need has long existed for a memory allocation technique that overcomes the problems noted above and others previously experienced. [0009]
  • SUMMARY OF THE INVENTION
  • Methods and systems consistent with the present invention provide a mechanism that automatically allocates memory for program data during FFT computation in a way that is favorable for the memory system of the data processing system. The methods and systems reduce the overhead associated with accessing the program data. As a result, the FFT computing program that manipulates the program data runs faster and produces results more quickly than typical methods and systems. [0010]
  • Methods and systems consistent with the present invention overcome the shortcomings of the related art by allocating memory for the program data with an offset other than a power-of-two offset. The memory is allocated, for example, by taking into consideration the structure of the memory hierarchy in the data processing system when allocating the memory for the program data. As a result, the program incurs less memory access overhead during its execution. For example, the program may more often find its data in cache rather than swapped out to disk. [0011]
  • According to methods consistent with the present invention, a method for allocating memory during fast Fourier transform calculating is provided in a data processing system. The method includes receiving from a fast Fourier transform calculating program a request to allocate memory for at least first and second variables. The method then determines when both variables will simultaneously fit in a cache memory and, in response, allocates a first address range in the main memory for the first variable and a second address range in the main memory for the second variable. The first address range maps to a different location in the cache memory than the second address range. The method then returns to the fast Fourier transform calculating program memory references for the address ranges. [0012]
  • In accordance with apparatuses consistent with the present invention, a data processing system is provided. The data processing system includes a cache memory, a main memory, and a processor. The main memory includes a memory allocation program for receiving a request to allocate memory for a first variable and a second variable during fast Fourier transform computation, the first variable and the second variable being used by a fast Fourier transform computing program to compute a fast Fourier transform, determining when both variables will simultaneously fit in the cache memory, and in response, allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable. Again, the first address range maps to a different location in the cache memory than the second address range. The processor runs the memory allocation program. [0013]
  • In addition, a computer-readable medium is provided. The computer-readable medium contains instructions that cause a data processing system including a cache memory and a main memory to perform a method for allocating memory during fast Fourier transform computation. The method includes receiving from a fast Fourier transform computing program a request to allocate memory for a first variable and a second variable, determining when both variables will simultaneously fit in the cache memory, and in response, allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable, such that the first address range maps to a different location in the cache memory than the second address range. The method also returns to the fast Fourier transform computing program a first memory reference for the first address range and a second memory reference for the second address range. [0014]
  • Other apparatus, methods, features and advantages of the present invention will be or will become apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the present invention, and be protected by the accompanying claims. [0015]
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 depicts a block diagram of a data processing system suitable for use with methods and systems consistent with the present invention. [0016]
  • FIG. 2 depicts a memory hierarchy for the data processing system shown in FIG. 1 in which a memory allocation program running in the data processing system shown in FIG. 1 allocates space for program variables. [0017]
  • FIG. 3 depicts an example of a direct mapped cache in the memory hierarchy of the data processing system shown in FIG. 1. [0018]
  • FIG. 4 depicts a flow diagram showing processing performed by the memory allocation program running in the data processing system shown in FIG. 1 in order to allocate memory for program variables.[0019]
  • DETAILED DESCRIPTION OF THE INVENTION
  • Reference will now be made in detail to an implementation in accordance with methods, systems, and articles of manufacture consistent with the present invention as illustrated in the accompanying drawings. The same reference numbers may be used throughout the drawings and the following description to refer to the same or like parts. [0020]
  • FIG. 1 depicts a block diagram of a data processing system 100 suitable for use with methods and systems consistent with the present invention. The data processing system 100 comprises a central processing unit (CPU) 102, an input-output (I/O) unit 104, a memory 106, a secondary storage device 108, and a video display 110. The data processing system 100 may further include input devices such as a keyboard 112, a mouse 114 or a speech processor (not illustrated). [0021]
  • The memory 106 contains an FFT computing program 116 that communicates via message passing, function calls, or the like with an operating system 118. The program 116 represents any FFT computing program running on the data processing system 100 that uses memory for storing variables (e.g., a first variable 120 and a second variable 122) or data. Program 116 comprises program code 126 for computing an FFT based on an FFT algorithm. FFT algorithms and program code for computing FFTs are known to one having skill in the art and will not be described in detail herein. The FFT algorithm used by program 116 can be, for example, the “fft.f” subroutine attached hereto in Appendix A, which is incorporated herein by reference. The “fft.f” subroutine is written in Fortran. One having skill in the art will appreciate that program 116 is not limited to being written in Fortran and is not limited to the “fft.f” subroutine. Program 116 can be written in any suitable programming language and include any FFT algorithm suitable for use with methods and systems consistent with the present invention. [0022]
  • The FFT algorithm of program 116 uses at least two arrays while computing an FFT. As an illustrative example, a first array INPUT stores signal samples and a second array WORK provides a workspace, a trigonometric table useful for computing the FFT, and a prime factorization of the number of signal samples. Because the FFT algorithm alternates accesses between the arrays during the FFT computation, the FFT algorithm can run faster if both arrays are kept close to the CPU 102, such as in a cache of the CPU 210 shown in FIG. 2. Thus, as will be described in more detail below, a memory allocation program 124 allocates an address range for each array variable to avoid cache displacement at one or more levels of the cache, if possible. [0023]
  • Operating system 118 includes the memory allocation program 124 that responds to memory allocation requests, for example, from program 116 or from operating system 118. As will be explained in more detail below, the memory allocation program 124 allocates space in the memory 106 for program variables. [0024]
  • Although aspects of methods, systems, and articles of manufacture consistent with the present invention are depicted as being stored in memory 106, one skilled in the art will appreciate that these aspects may be stored on or read from other computer-readable media, such as secondary storage devices, like hard disks, floppy disks, and CD-ROMs; a carrier wave received from a network such as the Internet; or other forms of ROM or RAM either currently known or later developed. Further, although specific components of data processing system 100 are described, one skilled in the art will appreciate that a data processing system suitable for use with methods, systems, and articles of manufacture consistent with the present invention may contain additional or different components. [0025]
  • Referring to FIG. 2, a lower-level block diagram 200 of the memory hierarchy of the data processing system 100 is shown. Closest to a CPU core 202 (e.g., internal logic and control circuits) are two first level cache memories 204 and 206. Cache memory 204 is a data cache that stores, generally, the data most recently used by CPU core 202. Cache memory 206 is a prefetch cache that CPU core 202 uses to prefetch data that it expects to soon need from main memory. First level cache memories 204 and 206 are generally the smallest and fastest caches available to CPU core 202. [0026]
  • A second level cache 208 is also provided. Second level cache 208 is generally larger than first level caches 204 and 206 and is also implemented as an extremely high-speed memory. Generally, however, CPU core 202 needs additional clock cycles to obtain data from second level cache 208. Thus, accessing data from second level cache 208 typically takes more time than accessing data from first level caches 204 and 206. In many processors, the first level cache and second level cache are incorporated onto a single die, or into a single package with multiple dies, that forms a CPU 210. CPUs 212 and 214 are similar to CPU 210. Each of CPUs 210, 212, and 214 is similar to CPU 102. [0027]
  • Continuing with reference to FIG. 2, one or more CPUs 210, 212, and 214 couple to a memory controller 216. Memory controller 216 handles memory access cycles generated by CPUs 210-214. Thus, for example, when CPU 210 needs data that is not found in its cache, memory controller 216 determines where the data may be found, generates memory control signals to retrieve the data, and forwards the data to the CPU. [0028]
  • To that end, memory controller 216 communicates with a main memory 218 and a virtual memory system 220 (which may be implemented, for example, using part of the secondary storage 108). The main memory, as shown in FIG. 2, includes multiple memory banks. In particular, the main memory shown in FIG. 2 includes a first memory bank 222 and a second memory bank 224. Memory banks 222 and 224 are generally independent in the sense that read or write operations to one of the banks do not prevent the memory controller from immediately reading or writing data to the other bank. [0029]
  • Main memory 218 can be implemented with large capacity dynamic RAMs or DRAM modules. In most implementations, however, dynamic RAMs need to be refreshed when data is read out. As a result, two consecutive reads to the same memory bank occur more slowly than two consecutive reads to different memory banks. Thus, it can be advantageous to place the start of variables (e.g., array INPUT and array WORK) needed in sequence in separate memory banks. Because the memory banks can be interleaved, sequential accesses to a block of data will sequentially move through each bank. [0030]
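  • As a minimal illustration of this bank behavior, the C sketch below assumes a two-bank main memory interleaved on a fixed unit (the 64-byte unit is an assumption for the example, not a value taken from the patent) and computes which bank serves a given address; an allocator can then check that two sequentially accessed arrays start in different banks:

    #include <stdint.h>

    #define NUM_BANKS       2u    /* banks 222 and 224 of FIG. 2      */
    #define INTERLEAVE_UNIT 64u   /* assumed interleaving granularity */

    /* Bank that serves a given address under this interleaving scheme. */
    static unsigned memory_bank(uintptr_t addr) {
        return (unsigned)((addr / INTERLEAVE_UNIT) % NUM_BANKS);
    }

    /* Desired property for sequentially accessed arrays INPUT and WORK:
     * memory_bank(input_start) != memory_bank(work_start). */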
  • It is noted that the memory hierarchy illustrated in FIG. 2 is merely illustrative and methods and systems consistent with the present invention are not limited thereto. For example, prefetch data cache 206, second level cache 208, or virtual memory system 220 need not be present in the data processing system. Furthermore, cache memories 204, 206, and 208 and main memory 218 may vary in size, speed, location, and organization. [0031]
  • FIG. 3 depicts a cache memory 300. Cache memory 300 can be, for example, data cache 204, prefetch cache 206, or external cache 208. Cache memory 300 has a particular physical organization that determines how much data will fit in the cache, and where that data will reside. More specifically, each cache memory has a size and an organization. With regard to FIG. 3, cache memory 300 is organized as 256 directly mapped cache lines, each 32 bytes in length. Cache memory 300 thereby has a size of 8192 bytes. Alternate organizations are also possible. For example, cache memory 300 may be a set associative cache or a fully associative cache, or cache memory 300 may have greater or fewer lines or bytes per line. [0032]
  • A cache is directly mapped when a memory block (retrieved from outside the cache) can only be placed in one predetermined line in the cache. A cache is fully associative when a memory block can be placed anywhere in the cache. A cache is set associative when a memory block can be placed in a set of 2 or more lines in the cache. More information on cache organization and operation can be found, for example, in Computer Architecture, A Quantitative Approach, Patterson & Hennessy, Morgan Kaufmann Publishers, Inc. (1990). [0033]
  • A memory address is divided into pieces when determining where a memory block retrieved from the main memory will reside in the cache. One piece is referred to as the block-frame address and represents the higher-order address bits that identify a memory block. The second piece is referred to as the block-offset address and is the lower-order piece of the address that represents data within the memory block. Assuming, for example, a 32 byte cache line (i.e., a 32 byte memory block) and 32 bit addresses, the upper 27 address bits are the block-frame address, while the lower 5 bits represent data within the block. [0034]
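  • As a small worked sketch of this split (assuming the 32-byte lines and 32-bit addresses of the example), the block-frame address and block-offset address can be extracted with a shift and a mask:

    #include <stdint.h>

    #define BLOCK_OFFSET_BITS 5u   /* 32-byte blocks: 2^5 bytes per line */

    /* Upper 27 bits: identify the memory block (block-frame address). */
    static uint32_t block_frame(uint32_t addr) {
        return addr >> BLOCK_OFFSET_BITS;
    }

    /* Lower 5 bits: byte position within the block (block-offset address). */
    static uint32_t block_offset(uint32_t addr) {
        return addr & ((1u << BLOCK_OFFSET_BITS) - 1u);
    }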
  • For a directly mapped cache, the location at which a memory block covered by an address range will be placed in the cache is given, for example, by (block-frame address) modulo (number of cache lines). Table 1 below gives two exemplary mappings for address ranges to cache lines in the direct mapped cache 300. [0035]
    TABLE 1
    Start of address range                     End of address range                       Cache line
    1100 0111 0101 1111 0101 1011 0110 0000    1100 0111 0101 1111 0101 1011 0111 1111    1101 1011 (line 219)
    (0xC75F5B60)                               (0xC75F5B7F)
    1100 0111 0101 1111 0100 1011 0110 0000    1100 0111 0101 1111 0100 1011 0111 1111    0101 1011 (line 91)
    (0xC75F4B60)                               (0xC75F4B7F)
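  • The entries in Table 1 can be checked directly against the modulo rule. The following C sketch reproduces both rows for the 256-line cache 300:

    #include <assert.h>
    #include <stdint.h>

    /* (block-frame address) modulo (number of cache lines). */
    static uint32_t line_of(uint32_t addr) {
        return (addr >> 5) % 256u;
    }

    int main(void) {
        assert(line_of(0xC75F5B60u) == 219u);  /* first row of Table 1  */
        assert(line_of(0xC75F4B60u) ==  91u);  /* second row of Table 1 */
        return 0;
    }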
  • For associative caches, there are multiple locations in which a data block for a given address range may reside. Thus, for example, for a 4-way set associative cache, data blocks covered by four address ranges that would otherwise map to the same cache line may be accommodated in the cache simultaneously. A fifth data block covered by an address range that maps to the same cache line would then displace one of the four data blocks. In general, when a subsequent address range maps to a location in the cache with existing data, that existing data is overwritten or displaced. [0036]
  • When existing data is displaced, additional clock cycles are required to subsequently obtain that data and store it in the cache again so that a program may again manipulate it. For that reason, the memory allocation program 124 allocates, for program variables, address ranges that do not cause displacement in the cache between individual program variables. [0037]
  • FIG. 4 depicts a flow diagram illustrating the exemplary steps performed by memory allocation program 124 for allocating memory. First, memory allocation program 124 receives a memory allocation request from program 116 (Step 402). The memory allocation request may be, for example, a function call or message sent to memory allocation program 124 that asks for memory for one or more variables. For example, the memory allocation request may be a function call that asks for memory for array INPUT and array WORK. To that end, the memory allocation request may specify one or more memory block sizes needed for the variables. Thus, memory allocation program 124 determines sizes for the variables using the information provided with the allocation request (Step 404). [0038]
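  • The patent does not give a concrete signature for this request, so the declaration below is purely hypothetical; it simply captures the shape of Steps 402-404: the caller names the block sizes it needs (e.g., for INPUT and WORK) and later receives one memory reference per block (Step 414). A definition of this sketch appears later in this description.

    #include <stddef.h>

    /* Hypothetical allocation-request interface (not from the patent). */
    int fft_alloc_pair(size_t input_bytes, size_t work_bytes,
                       void **input_out, void **work_out);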
  • Next, memory allocation program 124 determines whether the variables specified will fit into first level cache 204 (Step 406). To do so, memory allocation program 124 determines the size of first level cache 204 and its configuration (e.g., number of cache lines and number of bytes per cache line) by, for example, querying operating system 118 or reading a configuration file. Memory allocation program 124, knowing the sizes of the variables, then determines whether both variables can coexist in one or more cache lines in first level cache 204. In the illustrative example, if the array INPUT is 64 bytes long and the array WORK is 128 bytes long, then for the illustrative cache having 32 byte cache lines, the array INPUT may reside in cache lines 0 and 1, while the array WORK may reside in cache lines 2, 3, 4, and 5. [0039]
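  • The patent leaves the discovery mechanism open (querying operating system 118 or reading a configuration file). As one possibility, on systems whose sysconf( ) exposes cache geometry (these _SC_ names are a glibc extension and are an assumption here, not something the patent prescribes):

    #include <unistd.h>

    /* First level data cache size and line size, where available. */
    static long l1_size(void)      { return sysconf(_SC_LEVEL1_DCACHE_SIZE); }
    static long l1_line_size(void) { return sysconf(_SC_LEVEL1_DCACHE_LINESIZE); }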
  • If memory allocation program 124 determines that the variables are too large to simultaneously fit into the first level cache 204 in Step 406, then memory allocation program 124 determines whether the variables will simultaneously fit into second level cache 208 (Step 408). Again, to make that determination, memory allocation program 124 may determine the size of second level cache 208 and its configuration. [0040]
  • If the variables will fit into either first level cache 204 or second level cache 208, then memory allocation program 124 allocates memory for the variables such that they will map to different locations in the fastest cache (Step 410). The fastest cache is typically the smallest cache closest to CPU core 202. Continuing the illustrative example given above, memory allocation program 124 may allocate address ranges as shown below in Table 2 so that the array INPUT will be placed in cache lines 0 and 1, while the array WORK will be placed in cache lines 2-5. [0041]
     TABLE 2
     Variable           Start of address range                   End of address range                     Cache lines
     INPUT (64 bytes)   1100 0111 0101 1111 0100 0000 0000 0000  1100 0111 0101 1111 0100 0000 0011 1111  0 and 1
                        (0xC75F4000)                             (0xC75F403F)
     WORK (128 bytes)   1100 0111 0101 1111 0100 0000 0100 0000  1100 0111 0101 1111 0100 0000 1011 1111  2-5
                        (0xC75F4040)                             (0xC75F40BF)
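  • As a cross-check, applying the line-index arithmetic sketched above to the Table 2 boundary addresses (still assuming, hypothetically, a 16 KB direct-mapped cache with 32-byte lines, i.e., 512 lines) reproduces the stated placement:
        /* (addr / 32) % 512 for the Table 2 boundary addresses:      */
        /* 0xC75F4000 -> 0x63AFA00 % 512 == 0   INPUT starts, line 0  */
        /* 0xC75F403F -> 0x63AFA01 % 512 == 1   INPUT ends,   line 1  */
        /* 0xC75F4040 -> 0x63AFA02 % 512 == 2   WORK starts,  line 2  */
        /* 0xC75F40BF -> 0x63AFA05 % 512 == 5   WORK ends,    line 5  */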
  • In addition, [0042] memory allocation program 124 may further take into consideration the number and organization of memory banks 222 and 223 in main memory 218 (step 412). As noted above, sequential reads to the same memory bank can be slower than sequential reads to different memory banks. Thus, in addition to selecting address ranges that map the variables into different locations in the cache, memory allocation program 124 may also adjust the memory ranges for variables that are accessed sequentially so that they start in different memory banks. Thus, if program 116 sequentially accesses the variables, memory bank conflict will not hinder the retrieval of the data from main memory.
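  • In the same spirit, a sketch of how an allocator might test bank placement (the interleaving granularity and bank count below are hypothetical; real values would come from the configuration of memory banks 222 and 223):
        #include <stdint.h>

        #define BANK_SIZE 0x100000u  /* bytes per bank (illustrative) */
        #define NUM_BANKS 2u

        /* Bank in which an address falls, assuming banks alternate at
         * BANK_SIZE granularity. The allocator can pad one variable's
         * start address until the two variables begin in different
         * banks, so sequential accesses do not conflict. */
        static unsigned bank_of(uintptr_t addr)
        {
            return (unsigned) ((addr / BANK_SIZE) % NUM_BANKS);
        }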
  • Subsequently, [0043] memory allocation program 124 returns memory references for the allocated memory regions to requesting program 116 (Step 414). For example, memory allocation program 124 may return pointers to the beginning of the allocated memory regions. Program 116 may then store its data (e.g., array INPUT and array WORK) in the allocated memory regions and benefit from having multiple variables reside in the cache simultaneously.
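  • Pulling steps 402-414 together, a minimal sketch of one way such an allocator could behave for two variables at a single cache level (the function and type names are hypothetical, the cache size is assumed to be a power of two, and in practice the geometry would come from operating system 118 or a configuration file, with a second-level check following a first-level failure):
        #include <stddef.h>
        #include <stdlib.h>   /* posix_memalign() */

        typedef struct { size_t cache_size, line_size; } cache_cfg;

        /* If both variables fit in the cache at once, lay them out
         * back-to-back in one cache-aligned block so they occupy
         * disjoint cache lines, then return the two pointers. */
        static int alloc_disjoint(const cache_cfg *c,
                                  size_t size_a, size_t size_b,
                                  void **out_a, void **out_b)
        {
            char  *base;
            size_t lines_a = (size_a + c->line_size - 1) / c->line_size;

            if (lines_a * c->line_size + size_b > c->cache_size)
                return -1;   /* too large; caller tries the next level */

            /* Aligning the block to the cache size makes it start at
             * line 0, as in the Table 2 illustration. */
            if (posix_memalign((void **) &base, c->cache_size,
                               lines_a * c->line_size + size_b) != 0)
                return -1;

            *out_a = base;
            *out_b = base + lines_a * c->line_size;
            return 0;
        }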
  • In another illustrative example, [0044] operating system 118 includes an extended version of the Unix ‘C’ library function madvise( ) in order to help guide program data closer to the processor that will access the program data. Accordingly, during FFT computation, variables such as signal samples and workspace data can be stored closer to the processor. Referring to the illustrative example introduced above, madvise( ) is used to store arrays INPUT and WORK close to the CPU. The madvise( ) function accepts, as parameters, a starting address, a length, and an advisory flag (e.g., madvise(caddr_t addr, size_t len, int advice)). The advisory flag guides the operating system 118 in locating or relocating the memory referred to in the call to madvise( ). In particular, Table 3 shows and explains the new advisory flags in the extended madvise( ) function.
     TABLE 3
     Extension            Explanation
     MADV_ACCESS_DEFAULT  Resets the kernel's expectation for how the
                          specified address range will be accessed.
     MADV_ACCESS_LWP      Tells the kernel that the next LWP (i.e., light-
                          weight process, or thread) to touch the specified
                          address range will access it most heavily. The
                          kernel should try to allocate the memory and other
                          resources for the address range and the LWP
                          accordingly (e.g., closer to the processor and
                          memory that runs the LWP).
     MADV_ACCESS_MANY     Tells the kernel that many processes or LWPs will
                          access the specified address range randomly across
                          the machine. The kernel should try to allocate the
                          memory and other resources for the address range
                          accordingly (e.g., by making copies of the data and
                          distributing a copy to each processor that runs an
                          LWP that accesses the address range).
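  • For illustration, a sketch of how program 116 might issue such advice for the two arrays, assuming the Solaris-style madvise( ) signature and the Table 3 flags given above (the pointer and length names are hypothetical):
        #include <sys/types.h>   /* caddr_t, size_t */
        #include <sys/mman.h>    /* madvise(), MADV_ACCESS_LWP */

        /* Advise the kernel that the next LWP to touch each array will
         * be its heaviest user, so the kernel may place the memory
         * near that LWP's processor. */
        static void place_near_next_lwp(caddr_t input, size_t input_len,
                                        caddr_t work,  size_t work_len)
        {
            (void) madvise(input, input_len, MADV_ACCESS_LWP);
            (void) madvise(work,  work_len,  MADV_ACCESS_LWP);
        }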
  • The madvise( ) function allows [0045] program 116 to specify that certain address ranges should be located as closely as possible to the processor that accesses them. Thus, in response to the MADV_ACCESS_LWP flag, for example, operating system 118 may determine which thread has accessed the address range, then relocate the data in the address range so that the data is close to the processor that accesses it (i.e., the processor that runs the thread). To that end, operating system 118 may take into consideration the aspects of the memory hierarchy explained above and, for example, attempt to place the data so that it will fit into one or more levels of cache, or so that it will start in independent memory banks.
  • Similarly, [0046] operating system 118 may address the same considerations in response to the MADV_ACCESS_MANY flag. However, operating system 118 addresses those considerations for each of a predetermined number of threads that access the specified memory range after the call to madvise( ). More particularly, operating system 118 may make multiple copies of the data in the memory range, and distribute those copies close to the processors that run the individual threads. For example, operating system 118 may migrate pages in which the memory range lies to one or more memory boards.
  • The MADV_ACCESS_DEFAULT flag instructs [0047] operating system 118 to disregard any prior flags applied to the specified memory range. In other words, operating system 118 will no longer perform the special processing noted above for that memory range. Program 116 may specify the MADV_ACCESS_DEFAULT flag before freeing a memory block, for example.
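  • A sketch of that reset-before-free pattern (hypothetical helper, assuming the same Solaris-style madvise( ) signature as above):
        #include <stdlib.h>      /* free() */
        #include <sys/types.h>
        #include <sys/mman.h>    /* madvise(), MADV_ACCESS_DEFAULT */

        /* Drop any placement advice previously applied to the block,
         * then release it. */
        static void reset_advice_and_free(caddr_t addr, size_t len)
        {
            (void) madvise(addr, len, MADV_ACCESS_DEFAULT);
            free((void *) addr);
        }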
  • In yet another illustrative example, [0048] operating system 118 provides a memory advisory library that is useful when program 116 source code cannot be modified to include madvise( ) functions. More specifically, the madv library (stored, for example, in an object named madv.so.1) may operate as explained below in Table 4.
    TABLE 4
     The madv.so.1 shared object provides a means by which virtual memory
     advice can be selectively configured for launched process(es) and
     their descendants. To enable it, the following string is presented in
     the environment:
        LD_PRELOAD=$LD_PRELOAD:madv.so.1
    ENVIRONMENT VARIABLES
       If the madv.so.1 shared object is specified in the LD_PRELOAD
       list, the following environment variables are read by the madv
       shared object to determine the created process(es) to which the
       specified advice applies.
      MADV=<advice>
        MADV specifies the virtual memory advice to use for all heap,
        shared memory, and mmap( ) regions in the process address
        space. This advice is applied to all created processes.
        Values for <advice> correspond to values in <sys/mman.h>
        used in madvise( ) to specify memory access patterns:
        normal
        random
        sequential
        access_lwp
        access_many
        access_default
      MADVCFGFILE=<config-file>
        <config-file> is, for example, a text file which contains one or
        more madv configuration entries of the form:
            <exec-name> <exec-args>:<advice-opts>
        Advice specified in <config-file> takes precedence over
        that specified by the MADV environment variable. When
         MADVCFGFILE is not set, advice is taken from the file
         /etc/madv.conf if it exists.
        <exec-name> specifies the name of a program.
        The corresponding advice is set for newly created processes
        (see the manual pages on getexecname( )) that match the first
        <exec-name> found in the file.
        <exec-name> can be a full pathname, a base name or a pattern
        string. See the manual pages on sh( ), and the section
        File Name Generation, for a discussion of pattern matching.
        <exec-args> is an optionally specified pattern string to
        match against arguments. Advice is set if <exec-args>
        is not specified or occurs within the arguments to <exec-name>.
        <advice-opts> is a comma-separated list specifying the
        advice for various memory region(s):
        madv=<advice>
          Applies to all heap, shared memory, and mmap( ) regions
          in the process address space.
        heap=<advice>
          The heap is defined to be the brk area (see the manual
          pages on brk( )). Applies to the existing heap and for any
          additional heap memory allocated in the future.
        shm=<advice>
        ism=<advice>
        dism=<advice>
           Shared memory segments (see the manual pages on
           shmat( )) attached using any flags, flag SHM_SHARE_MMU,
           or flag SHM_PAGEABLE, respectively. Options
           ism and dism take precedence over option shm.
        map=<advice>
        mapshared=<advice>
        mapprivate=<advice>
        mapanon=<advice>
          Mappings established through mmap(2) using any
          flags, flag MAP_SHARED, flag MAP_PRIVATE, or flag
          MAP_ANON respectively. Options mapshared, mapprivate
          and mapanon take precedence over option map. Option
          mapanon takes precedence over mapshared and mapprivate.
      MADVERRFILE=<pathname>
        By default, error messages are logged via syslog( ) using
        Level LOG_ERR and facility LOG_USER. If
        MADVERRFILE contains a valid <pathname> (such as /dev/
        stderr), error messages will be logged there instead.
    NOTES
      The advice is inherited; a child process has the same advice
      as its parent. On exec( ) (see the manual pages on exec( )),
      the advice is set back to the default system advice unless
      different advice has been configured via the madv shared object.
      Advice is applied to mmap( ) regions explicitly created by the
      user program. Those regions established by the run-time linker
      or by system libraries making direct system calls (e.g. libthread
      allocations for thread stacks) are not affected.
  • Table 5 shows several examples of how to use the madv library environment variables. [0049]
    TABLE 5
    Example 1.
      $ LD_PRELOAD=$LD_PRELOAD:madv.so.1
      $ MADVCFGFILE=madvcfg
      $ export LD_PRELOAD MADVCFGFILE
      $ cat $MADVCFGFILE
        /usr/bin/foo:ism=access_lwp
      The above configuration applies advice to all ISM segments
      for application /usr/bin/foo.
    Example 2.
      $ LD_PRELOAD=$LD_PRELOAD:madv.so.1
      $ MADV=access_many
      $ MADVCFGFILE=madvcfg
      $ export LD_PRELOAD MADV MADVCFGFILE
      $ cat $MADVCFGFILE
         ls:
       Advice will be set for all applications with the
       exception of ‘ls’.
    Example 3.
      Because MADVCFGFILE takes precedence over MADV,
      specifying ‘*’ (pattern match all) for the <exec-name>
      of the last madv configuration entry would cause the same result
      as setting MADV. The following causes the same result as example
      2:
      $ LD_PRELOAD=$LD_PRELOAD:madv.so.1
      $ MADVCFGFILE=madvcfg
      $ export LD_PRELOAD MADVCFGFILE
      $ cat $MADVCFGFILE
         ls:
        *:madv=access_many
    Example 4.
      $ LD_PRELOAD=$LD_PRELOAD:madv.so.1
      $ MADVCFGFILE=madvcfg
      $ export LD_PRELOAD MADVCFGFILE
      $ cat $MADVCFGFILE
        foo*:madv=access_many,heap=sequential,shm=access_lwp
      The above configuration applies one type of advice for
      mmap( ) regions and different advice for heap and shared
      memory regions for a select set of applications with
      exec names that begin with ‘foo’.
    Example 5.
      $ LD_PRELOAD=$LD_PRELOAD:madv.so.1
      $ MADVCFGFILE=madvcfg
      $ export LD_PRELOAD MADVCFGFILE
      $ cat $MADVCFGFILE
        ora* ora1:heap=access_many
      The above configuration applies advice for the heap
       of applications beginning with ora that have ora1
       as an argument.
  • Generally, when operating [0050] system 118 recognizes that the MADV environment variables have been set, operating system 118 responsively creates startup code that executes before program 116 enters its main function. For example, when operating system 118 is the Solaris™ operating system, operating system 118 may create a function and store it in a section of the executable program or associated library known as a ‘.init’ section. The .init section contains code that executes before the user-written portions of the program 116 start. Solaris is manufactured by Sun Microsystems, Inc. of Santa Clara, Calif. Sun, Sun Microsystems, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the United States and other countries.
  • [0051] Operating system 118 reads the MADV environment variables and responsively creates calls to madvise( ) that reflect the environment variable settings. For example, each madvise( ) function call may specify a starting memory address for a particular memory region called out by the environment variables, the length of that region, and an advisory flag (e.g., MADV_ACCESS_LWP) that causes operating system 118 to respond to memory accesses in the way specified by the environment variable flags. Operating system 118 places the constructed madvise( ) calls in the .init function and they are therefore executed prior to the first or main routine in program 116.
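  • The patent does not give the generated code itself; purely as an illustration, a hand-written analogue of such a .init-time function, using the Sun compiler's #pragma init mechanism and the conventional Unix _end linker symbol to locate the heap (all of this is an assumption, not the described startup machinery):
        #include <unistd.h>      /* sbrk() */
        #include <sys/types.h>
        #include <sys/mman.h>    /* madvise(), MADV_ACCESS_LWP */

        extern char _end;        /* linker symbol: end of bss, start of heap */

        #pragma init(apply_madv_env)   /* run before main(), like .init code */

        static void apply_madv_env(void)
        {
            /* Hypothetical translation of MADV=access_lwp from the
             * environment: advise the current brk area before the
             * user-written portions of the program start. */
            caddr_t base = (caddr_t) &_end;
            size_t  len  = (size_t) ((char *) sbrk(0) - (char *) &_end);

            if (len > 0)
                (void) madvise(base, len, MADV_ACCESS_LWP);
        }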
  • The discussion above sets forth one particular example of [0052] operating system 118 interacting with memory advisory environment variables and a memory advisement library during FFT computation. The particular implementation of the environment variables and the library may vary considerably between data processing systems. In other words, methods and systems consistent with the present invention work in conjunction with many different implementations of memory advisory functionality.
  • In many cases, multiple program variables will not fit into cache at the same time in their entirety. As a result, methods and systems consistent with the present invention allocate memory for the variables so that the portion of the data needed by [0053] program 116 at any particular time will reside in the cache with the portions of the other variables that program 116 needs at the same time. In other words, knowing the variable access pattern of an FFT algorithm, methods and systems consistent with the present invention allocate space for the arrays during FFT computation such that the parts of the arrays needed at any particular time can reside in the cache together.
  • In summary, [0054] memory allocation program 124 allocates memory for program data during FFT computation in a manner that improves data access efficiency compared to typical memory allocation methods used during FFT computation. The memory allocation of the methods and systems consistent with the present invention is dynamic, and memory addresses are assigned automatically, taking into consideration, for example, the memory architecture of a given data processing system. Because the overhead associated with accessing the program data is reduced, program 116 typically runs faster and produces results more quickly. Further, the power-of-two offset problems associated with typical memory allocation methods are eliminated.
  • It is noted that although the above-described example considered two variables, [0055] memory allocation program 124 may, in general, consider more than two variables. That is, memory allocation program 124 may determine which combination of variables will fit into the cache memory, and allocate an address range for each variable that causes the variable to map to a different location in the cache than the remaining variables.
  • The foregoing description of an implementation of the invention has been presented for purposes of illustration and description. It is not exhaustive and does not limit the invention to the precise form disclosed. Modifications and variations are possible in light of the above teachings or may be acquired from practicing the invention. For example, the described implementation includes software, but the present invention may be implemented as a combination of hardware and software or in hardware alone. Note also that the implementation may vary between systems. The invention may be implemented with both object-oriented and non-object-oriented programming systems. The claims and their equivalents define the scope of the invention. [0056]

Claims (29)

What is claimed is:
1. A method for allocating memory during fast Fourier transform computation in a data processing system including a first cache memory and a main memory, the method comprising the steps of:
receiving from a fast Fourier transform computing program a request to allocate memory for a first variable and a second variable;
determining when both variables will simultaneously fit in the first cache memory;
when both variables will fit in the first cache memory, allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the first cache memory; and
returning to the fast Fourier transform computing program a first memory reference for the first address range and a second memory reference for the second address range.
2. The method of claim 1 wherein the main memory comprises a first bank of memory and a second bank of memory, and wherein the step of allocating further comprises the step of allocating the first address range to begin in the first bank of memory and allocating the second address range to begin in the second bank of memory.
3. The method of claim 1, wherein:
when both variables will not fit into the first cache memory, allocating the first address range and the second address range such that at least a portion of both the first and second variables will simultaneously reside in the first cache memory.
4. The method of claim 1, wherein the data processing system further includes a second cache memory, and
determining when both variables will not fit in the first cache memory but both variables will fit in the second cache memory, and in response allocating the first address range in the main memory for the first variable and the second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the second cache memory.
5. The method of claim 1, wherein the first cache memory is a direct mapped cache memory.
6. The method of claim 1, wherein the first cache memory is an associative cache memory.
7. The method of claim 1, wherein the first variable comprises a signal sample array and wherein the second variable comprises a table of trigonometric values for use in computing a fast Fourier transform.
8. The method of claim 7, wherein the second variable further comprises fast Fourier transform workspace.
9. A computer-readable medium containing instructions that cause a data processing system including a first cache memory and a main memory to perform a method for allocating memory during fast Fourier transform computation, the method comprising the steps of:
receiving from a fast Fourier transform computing program a request to allocate memory for a first variable and a second variable;
determining when both variables will simultaneously fit in the first cache memory;
when both variables will fit in the first cache memory, allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the first cache memory; and
returning to the fast Fourier transform computing program a first memory reference for the first address range and a second memory reference for the second address range.
10. The computer-readable medium of claim 9 wherein the main memory comprises a first bank of memory and a second bank of memory, and wherein the step of allocating further comprises the step of allocating the first address range to begin in the first bank of memory and allocating the second address range to begin in the second bank of memory.
11. The computer-readable medium of claim 9, wherein:
when both variables will not fit into the first cache memory, allocating the first address range and the second address range such that at least a portion of both the first and second variables will simultaneously reside in the first cache memory.
12. The computer-readable medium of claim 9, wherein the data processing system further includes a second cache memory, and
determining when both variables will not fit in the first cache memory but both variables will fit in the second cache memory, and in response allocating the first address range in the main memory for the first variable and the second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the second cache memory.
13. The computer-readable medium of claim 9, wherein the first cache memory is a direct mapped cache memory.
14. The computer-readable medium of claim 9, wherein the first cache memory is an associative cache memory.
15. The computer-readable medium of claim 9, wherein the first variable comprises a signal sample array and wherein the second variable comprises a table of trigonometric values for use in computing a fast Fourier transform.
16. The computer-readable medium of claim 15, wherein the second variable further comprises fast Fourier transform workspace.
17. A data processing system comprising:
a first cache memory;
a main memory comprising a memory allocation program, the memory allocation program for receiving a request to allocate memory for a first variable and a second variable during fast Fourier transform computation, the first variable and the second variable being used by a fast Fourier transform computing program to compute a fast Fourier transform, determining when both variables will simultaneously fit in the first cache memory, and in response allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the first cache memory; and
a processor that runs the memory allocation program.
18. The data processing system of claim 17, wherein the memory allocation program determines a first cache memory size and a first cache memory organization.
19. The data processing system of claim 17, wherein:
when both variables will not fit into the first cache memory, allocating the first address range and the second address range such that at least a portion of both the first and second variables will simultaneously reside in the first cache memory.
20. The data processing system of claim 17, wherein the main memory comprises a first memory bank and a second memory bank, and wherein:
the memory allocation program further allocates the first address range to begin in the first memory bank and allocates the second address range to begin in the second memory bank.
21. The data processing system of claim 17, wherein the first cache memory is a direct mapped cache memory.
22. The data processing system of claim 17, wherein the first cache memory is a set associative cache memory.
23. A data processing system comprising:
means for receiving from a fast Fourier transform computing program a request to allocate memory for a first variable and a second variable;
means for determining when both variables will simultaneously fit in the first cache memory and responsively allocating a first address range in the main memory for the first variable and a second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the first cache memory; and
means for returning to the fast Fourier transform computing program a first memory reference for the first address range and a second memory reference for the second address range.
24. A method for allocating memory during fast Fourier transform computation in a data processing system including a first level cache, a second level cache, and a main memory comprising a first memory bank and a second memory bank, the method comprising the steps of:
receiving from a fast Fourier transform computing program a request to allocate memory for a first variable and a second variable;
identifying a first variable size for which to allocate memory;
identifying a second variable size for which to allocate memory;
determining when both variables will simultaneously fit in the first level cache by determining a first variable size, a second variable size, a first level cache size and a first level cache organization;
when both variables will simultaneously fit in the first level cache, allocating a first address range in the main memory for the first variable and allocating a second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the first level cache;
when both variables will not simultaneously fit in the first level cache, determining when both variables will simultaneously fit in the second level cache by further determining a second level cache size and a second level cache organization; and
when both variables will simultaneously fit in the second level cache, allocating the first address range in the main memory for the first variable and allocating the second address range in the main memory for the second variable, such that the first variable and the second variable map to different locations in the second level cache; and
when both variables will not simultaneously fit in either the first level cache or the second level cache, then allocating the first address range to begin in the first memory bank and allocating the second address range to begin in the second memory bank.
25. A method according to claim 24, wherein the step of determining the first level cache organization comprises determining a first level cache mapping and a first level cache line size.
26. A method according to claim 24, wherein the step of determining the second level cache organization comprises determining a second level cache mapping and a second level cache line size.
27. A method according to claim 24, wherein the first variable comprises a signal sample array and wherein the second variable comprises a fast Fourier transform workspace.
28. A method according to claim 24, further comprising the step of:
when both variables will not fit into the first cache memory and will not fit into the second cache memory, allocating the first address range and the second address range such that at least a portion of both the first and second variables will simultaneously reside in the first cache memory.
29. A method according to claim 24, further comprising the step of:
when both variables will not fit into the first cache memory and will not fit into the second cache memory, allocating the first address range and the second address range such that at least a portion of both the first and second variables will simultaneously reside in the second cache memory.
US10/442,375 2003-05-21 2003-05-21 Methods and systems for memory allocation Expired - Lifetime US6952760B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US10/442,375 US6952760B2 (en) 2003-05-21 2003-05-21 Methods and systems for memory allocation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/442,375 US6952760B2 (en) 2003-05-21 2003-05-21 Methods and systems for memory allocation

Publications (2)

Publication Number Publication Date
US20040236922A1 true US20040236922A1 (en) 2004-11-25
US6952760B2 US6952760B2 (en) 2005-10-04

Family

ID=33450182

Family Applications (1)

Application Number Title Priority Date Filing Date
US10/442,375 Expired - Lifetime US6952760B2 (en) 2003-05-21 2003-05-21 Methods and systems for memory allocation

Country Status (1)

Country Link
US (1) US6952760B2 (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060155886A1 (en) * 2005-01-11 2006-07-13 Da Silva Dilma M Methods and arrangements to manage on-chip memory to reduce memory latency
US8607018B2 (en) 2012-11-08 2013-12-10 Concurix Corporation Memory usage configuration based on observations
WO2013191720A1 (en) * 2012-06-19 2013-12-27 Concurix Corporation Usage aware numa process scheduling
US8656135B2 (en) 2012-11-08 2014-02-18 Concurix Corporation Optimized memory configuration deployed prior to execution
US8656134B2 (en) 2012-11-08 2014-02-18 Concurix Corporation Optimized memory configuration deployed on executing code
US8700838B2 (en) 2012-06-19 2014-04-15 Concurix Corporation Allocating heaps in NUMA systems
US9043788B2 (en) 2012-08-10 2015-05-26 Concurix Corporation Experiment manager for manycore systems
US9529539B1 (en) * 2015-06-09 2016-12-27 Winbond Electronics Corp. Data allocating apparatus, signal processing apparatus, and data allocating method
US9575813B2 (en) 2012-07-17 2017-02-21 Microsoft Technology Licensing, Llc Pattern matching process scheduler with upstream optimization
US9665474B2 (en) 2013-03-15 2017-05-30 Microsoft Technology Licensing, Llc Relationships derived from trace data
US20220404981A1 (en) * 2021-06-22 2022-12-22 Micron Technology, Inc. Alleviating memory hotspots on systems with multiple memory controllers

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7069543B2 (en) * 2002-09-11 2006-06-27 Sun Microsystems, Inc Methods and systems for software watchdog support
US7133993B1 (en) * 2004-01-06 2006-11-07 Altera Corporation Inferring size of a processor memory address based on pointer usage
US20080288379A1 (en) * 2004-06-29 2008-11-20 Allin Patrick J Construction payment management system and method with automated electronic document generation features
US7469404B2 (en) * 2004-06-30 2008-12-23 Intel Corporation Bank assignment for partitioned register banks
US10706208B1 (en) 2018-08-17 2020-07-07 Synopsis, Inc. Priority aware balancing of memory usage between geometry operation and file storage

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5802544A (en) * 1995-06-07 1998-09-01 International Business Machines Corporation Addressing multiple removable memory modules by remapping slot addresses
US6408368B1 (en) * 1999-06-15 2002-06-18 Sun Microsystems, Inc. Operating system page placement to maximize cache data reuse

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5802544A (en) * 1995-06-07 1998-09-01 International Business Machines Corporation Addressing multiple removable memory modules by remapping slot addresses
US6408368B1 (en) * 1999-06-15 2002-06-18 Sun Microsystems, Inc. Operating system page placement to maximize cache data reuse

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7437517B2 (en) * 2005-01-11 2008-10-14 International Business Machines Corporation Methods and arrangements to manage on-chip memory to reduce memory latency
US20080263284A1 (en) * 2005-01-11 2008-10-23 International Business Machines Corporation Methods and Arrangements to Manage On-Chip Memory to Reduce Memory Latency
US7934061B2 (en) 2005-01-11 2011-04-26 International Business Machines Corporation Methods and arrangements to manage on-chip memory to reduce memory latency
US20060155886A1 (en) * 2005-01-11 2006-07-13 Da Silva Dilma M Methods and arrangements to manage on-chip memory to reduce memory latency
US8700838B2 (en) 2012-06-19 2014-04-15 Concurix Corporation Allocating heaps in NUMA systems
WO2013191720A1 (en) * 2012-06-19 2013-12-27 Concurix Corporation Usage aware numa process scheduling
US9047196B2 (en) 2012-06-19 2015-06-02 Concurix Corporation Usage aware NUMA process scheduling
US9575813B2 (en) 2012-07-17 2017-02-21 Microsoft Technology Licensing, Llc Pattern matching process scheduler with upstream optimization
US9043788B2 (en) 2012-08-10 2015-05-26 Concurix Corporation Experiment manager for manycore systems
US8656134B2 (en) 2012-11-08 2014-02-18 Concurix Corporation Optimized memory configuration deployed on executing code
US8656135B2 (en) 2012-11-08 2014-02-18 Concurix Corporation Optimized memory configuration deployed prior to execution
US8607018B2 (en) 2012-11-08 2013-12-10 Concurix Corporation Memory usage configuration based on observations
US9665474B2 (en) 2013-03-15 2017-05-30 Microsoft Technology Licensing, Llc Relationships derived from trace data
US9529539B1 (en) * 2015-06-09 2016-12-27 Winbond Electronics Corp. Data allocating apparatus, signal processing apparatus, and data allocating method
US20220404981A1 (en) * 2021-06-22 2022-12-22 Micron Technology, Inc. Alleviating memory hotspots on systems with multiple memory controllers
US11740800B2 (en) * 2021-06-22 2023-08-29 Micron Technology, Inc. Alleviating memory hotspots on systems with multiple memory controllers

Also Published As

Publication number Publication date
US6952760B2 (en) 2005-10-04

Similar Documents

Publication Publication Date Title
US6952760B2 (en) Methods and systems for memory allocation
US7743222B2 (en) Methods, systems, and media for managing dynamic storage
US6412053B2 (en) System method and apparatus for providing linearly scalable dynamic memory management in a multiprocessing system
US6026475A (en) Method for dynamically remapping a virtual address to a physical address to maintain an even distribution of cache page addresses in a virtual address space
US5802341A (en) Method for the dynamic allocation of page sizes in virtual memory
US7376808B2 (en) Method and system for predicting the performance benefits of mapping subsets of application data to multiple page sizes
Kistler et al. Automated data-member layout of heap objects to improve memory-hierarchy performance
US6430656B1 (en) Cache and management method using combined software and hardware congruence class selectors
US20060026183A1 (en) Method and system provide concurrent access to a software object
US6366994B1 (en) Cache aware memory allocation
US7493464B2 (en) Sparse matrix
PT590645E (en) PROCESS AND SYSTEM TO REDUCE MEMORANDUM ATTRIBUTION REQUESTS
US6421761B1 (en) Partitioned cache and management method for selectively caching data by type
JP2009528612A (en) Data processing system and data and / or instruction prefetch method
US20030097536A1 (en) System and method for physical memory allocation in advanced operating systems
US6370618B1 (en) Method and system for allocating lower level cache entries for data castout from an upper level cache
US5996055A (en) Method for reclaiming physical pages of memory while maintaining an even distribution of cache page addresses within an address space
US6457107B1 (en) Method and apparatus for reducing false sharing in a distributed computing environment
US6016529A (en) Memory allocation technique for maintaining an even distribution of cache page addresses within a data structure
Asai MCDRAM as High-Bandwidth Memory (HBM) in Knights Landing processors: developers guide
US20070300210A1 (en) Compiling device, list vector area assignment optimization method, and computer-readable recording medium having compiler program recorded thereon
US8185693B2 (en) Cache-line aware collection for runtime environments
JP2000250814A (en) Dynamic memory allocation method for maintaining uniform distribution of cache page address in address space
CN113535392B (en) Memory management method and system for realizing support of large memory continuous allocation based on CMA
WO2017142525A1 (en) Allocating a zone of a shared memory region

Legal Events

Date Code Title Description
AS Assignment

Owner name: SUN MICROSYSTEMS, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BOUCHER, MICHAEL;DO, THERESA;REEL/FRAME:014098/0182

Effective date: 20030516

STCF Information on status: patent grant

Free format text: PATENTED CASE

FPAY Fee payment

Year of fee payment: 4

FPAY Fee payment

Year of fee payment: 8

AS Assignment

Owner name: ORACLE AMERICA, INC., CALIFORNIA

Free format text: MERGER AND CHANGE OF NAME;ASSIGNORS:ORACLE USA, INC.;SUN MICROSYSTEMS, INC.;ORACLE AMERICA, INC.;REEL/FRAME:037280/0132

Effective date: 20100212

FPAY Fee payment

Year of fee payment: 12