WO2019105565A1 - Systems for compiling and executing code within one or more virtual memory pages - Google Patents


Info

Publication number
WO2019105565A1
Authority
WO
WIPO (PCT)
Prior art keywords: virtual memory, page, size, blocks, sub
Prior art date
Application number
PCT/EP2017/081116
Other languages
French (fr)
Inventor
Antonio BARBALACE
Yi Chen
Jani Kokkonen
Alexander SPYRIDAKIS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/081116 priority Critical patent/WO2019105565A1/en
Priority to CN201780096871.XA priority patent/CN111344667B/en
Publication of WO2019105565A1 publication Critical patent/WO2019105565A1/en


Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F8/00: Arrangements for software engineering
                    • G06F8/40: Transformation of program code
                        • G06F8/41: Compilation
                • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
                    • G06F12/02: Addressing or allocation; Relocation
                        • G06F12/0223: User address space allocation, e.g. contiguous or non-contiguous base addressing
                            • G06F12/023: Free address space management
                        • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
                            • G06F12/10: Address translation
                                • G06F12/109: Address translation for multiple virtual address spaces, e.g. segmentation
                                • G06F12/1027: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
                                    • G06F12/1036: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB], for multiple virtual address spaces, e.g. segmentation
                • G06F2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
                    • G06F2212/10: Providing a specific technical effect
                        • G06F2212/1008: Correctness of operation, e.g. memory ordering
                        • G06F2212/1016: Performance improvement
                    • G06F2212/65: Details of virtual memory and virtual address translation
                        • G06F2212/652: Page size control
                        • G06F2212/653: Page colouring
                        • G06F2212/657: Virtual address space management

Definitions

  • the present invention in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
  • Memory resources include, for example, processor caches, including one or more of: L1, L2, L3, and L4 (e.g., L1, L1-L2, L1-L3, and L3-L4) (the highest level is termed last level cache (LLC)), the processor memory bus/ring that interconnects multiple groups/clusters via their LLC, and the memory controller and its (parallel) interconnections to the parallel memory elements (banks).
  • Page-coloring is a software-only technology that requires virtual memory to be implemented.
  • Page-coloring requires physically indexed and tagged caches, at least at the LLC.
  • For memory bandwidth partitioning, page-coloring may require software configuration of bank interleaving.
  • an apparatus for compiling code for runtime execution within a plurality of virtual memory sub-pages of at least one virtual memory page comprises: a compiler executable by a processor, the compiler configured to: receive pre-compilation code for compilation, wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, the size of each physical memory block is the size of a virtual memory sub-page, divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block, compile the plurality of blocks into the plurality of executable binary blocks, and
  • an apparatus for loading code for execution within a plurality of virtual memory sub-pages of at least one virtual memory page comprises: a processor, a memory storing code instructions for execution by the processor, comprising: code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page, code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size, code to load the plurality of blocks of the binary file of the application into the allocated
  • the systems, apparatus, methods, and/or code instructions described herein extend page coloring (i.e., clustering) to huge virtual memory pages.
  • the implementation of the systems, apparatus, methods, and/or code instructions described herein is transparent to executing code (e.g., the program, application).
  • Existing (e.g., legacy) code (e.g., programs, applications) may be used; new programs designed for implementation based on the systems, apparatus, methods, and/or code instructions described herein are not necessarily required.
  • the systems, apparatus, methods, and/or code instructions described herein provide a software-based solution based on modification of the system software (e.g., operating system code, runtime code, compiler code, and/or linker code).
  • the software-based solution does not necessarily require any modification of processing hardware and/or addition of new processing hardware, and may be executed by existing processing hardware, for example, in comparison to other proposed solutions that are based on at least some modification of processing hardware and/or new hardware component(s).
  • Commodity processor instruction set architectures (ISAs)
  • the compiler is further configured to divide a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by the supervisor software within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the compiler is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
  • the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler is further configured to divide the data storage structure into a plurality of sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
  • the compiler is further configured to create a dereferencing data structure for accessing each element of each sub-data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub- page size allocated to the application associated with the data storage structure.
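  • For example, the dereferencing computation described above may be sketched as follows; this is an illustrative Python model, and the class name (SubPagedArray), the 4 kB sub-page size, and the 8-byte element size are assumptions made for the sketch, not part of any claim:

```python
SUB_PAGE_SIZE = 4096          # assumed sub-page size in bytes
ELEM_SIZE = 8                 # assumed element size in bytes
ELEMS_PER_SUB_PAGE = SUB_PAGE_SIZE // ELEM_SIZE


class SubPagedArray:
    """Model of a data storage structure divided into sub-page-sized
    sub-data storage structures, accessed through a dereferencing
    data structure of per-sub-page base addresses."""

    def __init__(self, num_elems, sub_page_bases):
        # sub_page_bases plays the role of the dereferencing data
        # structure: one base address per sub-data storage structure.
        self.num_elems = num_elems
        self.deref = sub_page_bases

    def address_of(self, index):
        # Select the sub-page via the dereferencing table, then add
        # the element's offset within that sub-page.
        sub_page = index // ELEMS_PER_SUB_PAGE
        offset = (index % ELEMS_PER_SUB_PAGE) * ELEM_SIZE
        return self.deref[sub_page] + offset


# A 1000-element array split over two non-contiguous sub-pages.
arr = SubPagedArray(1000, sub_page_bases=[0x10000, 0x30000])
```

  • In this sketch, address_of(512) selects the second sub-page via the dereferencing table, since 512 elements of 8 bytes fill exactly one 4 kB sub-page.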
  • the compiler is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
  • the compiler is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
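  • A minimal sketch of growing the program stack in sub-page-sized divided blocks follows; this is illustrative Python, and the function name push_frame, the downward-growing layout, and the block addresses are assumptions of the sketch:

```python
SUB_PAGE_SIZE = 4096  # assumed sub-page size in bytes


def push_frame(sp, frame_size, block_bases):
    """Return a new stack pointer after pushing a frame of frame_size
    bytes. block_bases lists the base addresses of the sub-page-sized
    blocks allocated to the program stack, in the order they are used;
    the stack is assumed to grow downward within each block."""
    # Find the block the current stack pointer lives in.
    blk = next(i for i, b in enumerate(block_bases)
               if b <= sp <= b + SUB_PAGE_SIZE)
    new_sp = sp - frame_size
    if new_sp >= block_bases[blk]:
        return new_sp                     # frame fits in current block
    # Otherwise the frame starts at the top of the next allocated
    # block, i.e. the stack pointer jumps by an offset to the next
    # block of the cluster rather than growing contiguously.
    top = block_bases[blk + 1] + SUB_PAGE_SIZE
    return top - frame_size


blocks = [0x7000_0000, 0x7000_9000]       # non-contiguous stack blocks
sp = blocks[0] + SUB_PAGE_SIZE            # stack grows downward
sp = push_frame(sp, 256, blocks)
```

  • A frame that would cross the end of the current block is instead placed at the top of the next block, which keeps every frame within a single sub-page-sized block.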
  • the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor.
  • each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
  • the apparatus further comprising code to dynamically move at least one of the plurality of binary blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move.
  • the apparatus further comprises code to populate data of a dereferencing data structure for accessing each element of sub-data storage structures of a data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
  • the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
  • the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page, at a respective virtual memory sub-page mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
  • the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that make up the size of a virtual memory page.
  • FIG. 1 is a schematic depicting how page colors are arranged in a physical address space, to help in understanding the technical problem addressed by some implementations of the present invention
  • FIG. 2 is a schematic depicting an application using virtual pages of three different colors, to help in understanding the technical problem addressed by some implementations of the present invention
  • FIG. 3 is a schematic depicting an example of an application that uses virtual memory paging with at least one huge virtual memory page, in accordance with some embodiments of the present invention
  • FIG. 4 is a schematic of a block diagram of a system that includes a computing device for compiling code for runtime execution within virtual memory sub-pages and/or for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention
  • FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention
  • FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention
  • FIG. 7 is a schematic depicting division of an example .text section into multiple sub-functions, in accordance with some embodiments of the present invention
  • FIG. 8 is a schematic depicting a dereferencing table for accessing each element of sub-arrays which are obtained by dividing an array, in accordance with some embodiments of the present invention
  • FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure, in accordance with some embodiments of the present invention.
  • FIG. 10 is a schematic depicting additional exemplary components of a compiler and a linker for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention
  • FIG. 11 is a schematic depicting additional exemplary components of a runtime and/or operating system and/or memory management for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention
  • FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention.
  • FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
  • The words cluster (or clustering) and color (or coloring) are interchangeable. For example, each cluster is assigned a certain color.
  • A huge virtual memory page refers to a virtual memory page that is larger than the physical memory page size defined by the hardware implementation. It is noted that different implementations may refer to huge pages with other terms, for example, large pages.
  • A standard size virtual memory page refers to a virtual memory page defined by the hardware as the minimum unit of translation.
  • the size of each physical memory block is the size of a virtual memory sub-page.
  • huge virtual memory page, standard virtual memory page, and virtual memory page are sometimes interchangeable.
  • An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for compiling pre-compilation code for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of the pre-compilation code when compiled and loaded into a memory, is at least the size of one virtual memory sub-page.
  • the virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page.
  • the size of each physical memory block is the size of a virtual memory sub-page.
  • the pre-compilation code is divided into blocks, such that each block when compiled into a respective executable binary block is less than or equal to the size of a virtual memory sub-page (of the virtual memory page corresponding to the size of one physical memory block).
  • the blocks are compiled into executable binary blocks.
  • the executable binary blocks are linked into a program.
  • the program includes a designation of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page.
  • the supervisor software loads the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page and allocated clusters of physical memory blocks.
  • Each block has a size corresponding to a virtual memory sub-page size, for example, 4 kilobytes (kB), which is the smallest page size available for processors based on the x86 architecture.
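  • The division step may be sketched as follows; this is illustrative Python, and padding the final block to the block boundary (so that each block can occupy its own sub-page) is an assumption of the sketch:

```python
SUB_PAGE_SIZE = 4096  # 4 kB: smallest page size on x86 processors


def divide_into_blocks(binary: bytes, block_size: int = SUB_PAGE_SIZE):
    """Split a compiled binary into executable binary blocks, each no
    larger than one virtual memory sub-page, zero-padding the last
    block so every block fills exactly one sub-page."""
    blocks = []
    for off in range(0, len(binary), block_size):
        blk = binary[off:off + block_size]
        blocks.append(blk.ljust(block_size, b"\x00"))
    return blocks


# e.g. 10000 bytes of compiled code (NOPs here) become three blocks.
code_blocks = divide_into_blocks(b"\x90" * 10000)
```

  • Each resulting block can then be loaded into any physical memory block of the allocated clusters, since no block spans a sub-page boundary.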
  • An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for loading code for execution within virtual memory sub-pages of virtual memory page(s).
  • a binary file of an application divided into blocks is identified.
  • a size of each block is less than or equal to a size of a virtual memory sub-page.
  • An initial allocation of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size is retrieved for the application.
  • An allocation of virtual memory page(s) for the application is received.
  • the size of the virtual memory page is mapped to an equal size of contiguous physical memory areas.
  • the virtual memory page includes virtual memory sub-pages mapped to the clusters of physical memory blocks.
  • each block corresponds to the size of a virtual memory sub-page.
  • the blocks of the binary file of the application are loaded into the allocated virtual memory page(s).
  • the blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
  • Virtual memory sub-pages mapped to respective clusters of memory blocks may be located non-contiguously within the virtual memory page.
  • Virtual memory sub-pages of different clusters may be contiguous with one another, optionally in a repeating pattern, for example for three defined clusters arranged as: 1, 2, 3, 1, 2, 3, 1, 2, 3.
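  • For such a repeating arrangement, the cluster of a given sub-page follows directly from its index within the virtual memory page, for example (illustrative Python; the 1-based cluster ids match the 1, 2, 3 example above):

```python
def subpage_cluster(subpage_index: int, num_clusters: int = 3) -> int:
    """Cluster (color) of a sub-page when clusters repeat in a fixed
    pattern 1, 2, 3, 1, 2, 3, ... across the virtual memory page.
    Returns a 1-based cluster id to match the example in the text."""
    return subpage_index % num_clusters + 1


# The first nine sub-pages reproduce the repeating pattern.
pattern = [subpage_cluster(i) for i in range(9)]
```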
  • Software-based page coloring is incompatible with hardware-based huge pages because page coloring is designed to operate according to the smallest predefined and/or standard page granularity. Based on existing technology, an attempt to extend the technique of software-based page coloring to hardware-based huge-page coloring may result either in an extremely small number of colors, or no colors at all, which effectively eliminates any potential benefits of implementing coloring.
  • the apparatus, system, methods, and/or code instructions (stored in a data storage device executed by one or more processors) described herein effectively implement a combination of coloring and huge pages in a manner that improves performance and/or deterministic execution of applications running concurrently on the same computer device.
  • Cache Allocation Technology (CAT) of Intel®.
  • CAT is designed to transparently support huge pages.
  • CAT cannot be easily controlled and/or generally implemented, since the solution is designed specifically for the processors produced by Intel® based on the x86 architecture.
  • CAT cannot scale to a high number of applications.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • a network for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a schematic depicting how page colors (i.e., clusters) are arranged in a physical address space 102, to help in understanding the technical problem addressed by some implementations of the present invention.
  • FIG. 1 depicts the traditional page coloring (i.e., clustering) approach that uses virtual memory to group physically scattered memory pages of the same color together within the same virtual address range. Page colors are periodically repeated. For example, one set of virtual memory pages, of the standard page size defined by the processor (e.g., 4 kB in x86 architectures), is assigned physical memory pages having the color blue (e.g., cluster 1) 104.
  • Another set of virtual memory pages are assigned physical memory pages having the color green (e.g., cluster 2) 106. It is noted that the labels blue and green are meant as tags to identify the clusters, and do not reflect actual colors of the memory. The colors blue and green are periodically repeated. Pages with the same color have a constant offset 108 in the physical address space.
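  • In this traditional scheme, the color of a page is determined by its physical page frame number, so pages of the same color recur at a constant stride, for example (illustrative Python; the 4 kB page size and the two-color setup are assumptions of the sketch):

```python
PAGE_SIZE = 4096   # standard x86 page size
NUM_COLORS = 2     # e.g. blue (cluster 1) and green (cluster 2)


def page_color(phys_addr: int) -> int:
    """Color (cluster) of the physical page containing phys_addr,
    derived from the physical page frame number."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS


# Pages of the same color sit at a constant offset in physical memory.
COLOR_STRIDE = NUM_COLORS * PAGE_SIZE
```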
  • FIG. 2 is a schematic depicting an application (App 1) using virtual pages of three different colors (i.e., clusters), blue 282, green, 284, and yellow 286, to help in understanding the technical problem addressed by some implementations of the present invention.
  • a virtual memory subsystem (component implemented in hardware and/or software) enables the application to organize strictly disposed physical pages of a physical address space 288 into linear (virtual) memory ranges of a virtual address space 290. Note that a specific page color organization is shown in FIG. 2, but it is to be understood that there are multiple possible organizations.
  • FIG. 3 is a schematic depicting an example of an application (App 1) that uses paging with at least one virtual memory page coloring (i.e., clustering) within a huge virtual memory page, in accordance with some embodiments of the present invention.
  • a huge page 302 within a physical address space 304 may be located anywhere within the application’s assigned virtual address space 306.
  • The colored sub-pages (e.g., one set 308 depicted for clarity).
  • FIG. 4 is a schematic of a block diagram of a system 400 that includes a computing device 402 for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s) of a virtual memory 404 and/or for loading code for execution within virtual memory sub-pages of virtual memory page(s) of virtual memory 404, in accordance with some embodiments of the present invention.
  • FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • the methods of FIG. 5 and/or FIG. 6 may be implemented by code stored in data storage device 412 executed by processor(s) 406.
  • Data storage device 412 may be implemented as random access memory (RAM), or code may be moved from data storage device 412 to RAM for execution by processor(s) 406.
  • the method of FIG. 5 may be implemented by compiler code 412A and/or linker code 412B.
  • the method of FIG. 6 may be implemented by loading code 412C, for example, supervisor code, application loader, and/or library loader.
  • The terms supervisor (e.g., code, software) and loading code may be interchanged.
  • compiler code 412A and linker code 412B may be implemented as a single component referred to herein as compiler. Alternatively, the compiler and linker are implemented as distinct components.
  • computing device 402 may compile code (or re-compile previously compiled code) for runtime execution within virtual memory sub-pages of virtual memory page(s), and load the compiled code for execution within virtual memory sub-pages of virtual memory page(s).
  • one computing device 402 performs the compilation of the code, for example, for locally stored code, for code transmitted by client terminal(s) and/or server(s), and/or providing remote services to client terminal(s) and/or server(s) (e.g., via a software interface such as an application programming interface (API), software development kit (SDK), a web site interface, and/or an application interface that is loaded on the client terminal and/or server).
  • the compiled code may be provided for execution within virtual memory sub-pages of virtual memory page(s) of another computing device, for example, by the client terminal(s) and/or server(s) that provided the code for compilation, and/or by another client terminal and/or server that receive the compiled code for local execution.
  • processor(s) 406 includes a paging mechanism 416 that maps between virtual memory 404 and physical memory 408. It is noted that virtual memory 404 represents an abstraction and/or a virtual component, since virtual memory 404 does not represent an actual physical virtual memory device. Paging mechanism 416 may be implemented in hardware. When an implementation of processor(s) lacks a paging mechanism, the virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of contiguous physical memory blocks that make up the size of a virtual memory page. Optionally, the physical memory block offset to the beginning of the contiguous physical memory blocks is the same as the offset that the virtual memory sub-page has to the beginning of the virtual memory page.
  • Virtual memory sub-pages are physical memory blocks.
  • Virtual memory pages are a collection of contiguous physical memory blocks.
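The offset-preserving mapping described above may be sketched as follows. This is an illustrative Python sketch of the mechanism, not an implementation from the specification; the constants `PAGE_SIZE` and `SUB_PAGE_SIZE` are assumed example values (a 2 MB huge page divided into 4 kB blocks).

```python
# Sketch: when the processor lacks a paging mechanism, a virtual memory
# sub-page keeps the same offset inside the contiguous physical region
# that backs the virtual memory page.
PAGE_SIZE = 2 * 1024 * 1024      # illustrative: one 2 MB huge page
SUB_PAGE_SIZE = 4 * 1024         # illustrative: 4 kB physical memory blocks

def physical_address(region_base, virtual_address, page_base):
    """Map a virtual address to physical memory, preserving the offset
    relative to the start of the contiguous physical memory blocks."""
    offset_in_page = virtual_address - page_base
    assert 0 <= offset_in_page < PAGE_SIZE
    return region_base + offset_in_page
```

For example, a virtual address 0x3000 bytes into a page maps 0x3000 bytes into the backing physical region, so the sub-page offset within the page is preserved.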
  • the systems, apparatus, methods, and/or code instructions described herein enable page coloring without necessarily requiring a virtual memory subsystem.
  • Computing device 402 may be implemented as, for example, one or more of: a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, a desktop computer, and an internet of things (IoT) device.
  • Processor(s) 406 may be implemented as, for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), microprocessing unit (MPU), processors for interfacing with other units, and/or specialized hardware accelerators.
  • Processor(s) 406 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures).
  • Physical memory device(s) 408 and/or data storage device 412 are implemented as, for example, one or more of: a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
  • paging mechanism 416 is the memory component that creates virtual memory 404 from physical memory 408 and/or data storage device 412.
  • Computing device 402 may be in communication with a user interface 414 that presents data and/or includes a mechanism for entry of data, for example, one or more of: a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone.
  • User interface 414 may be used to configure parameters, for example, define the size of each virtual memory sub- page, and/or define the number of available clusters.
  • FIG. 5 is a flowchart of a method for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of each virtual memory sub-page is at least as large as a predefined size of a physical memory block associated with the processor.
  • high-level languages (e.g., C/C++, Fortran, Java, Python, and the like)
  • the machine code is outputted by the compiler. Modifications to the machine code based on the method described with reference to FIG. 5 are transparent to the programmer. It is noted that the compiler assumes that the application will be run on virtual memory.
  • pre-compilation code is received for compilation by the compiler.
  • the pre-compilation code may include source code, i.e., text-based code written by a programmer.
  • the pre-compilation code may include object code that is already compiled but not yet linked.
  • the pre-compilation code may include an internal representation of the code within the compiler.
  • the source code may be written in different programming languages.
  • the pre-compilation code may be new code for a first-time compilation, or may include old code (e.g., legacy application) that has been previously compiled but is now being re-compiled for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page.
  • the virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page.
  • the size of each physical memory block is the size of a virtual memory sub-page
  • the pre-compilation code which cannot fit into one virtual memory sub-page when compiled, is divided into blocks.
  • Each block, when compiled into a respective executable binary block, has a size less than or equal to the size of a virtual memory sub-page of the virtual memory page corresponding to the size of one physical memory block.
  • Each binary block is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
  • Blocks may be relocated at runtime, by moving each block from one area of physical memory to another area of the physical memory. Since each block is mapped to a virtual memory sub-page, a block is moved from one virtual memory sub-page to another virtual memory sub-page. Blocks may be moved to a contiguous virtual memory sub-page, or another virtual memory sub-page that is non-contiguous. For example, a block in virtual memory sub-page labeled as 1234 may be moved to virtual memory sub-page 1235, or 123456789.
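The runtime relocation described above may be sketched as follows. This is a minimal illustration, assuming a 4 kB sub-page size and a simple block-to-sub-page map; the names are not from the specification.

```python
# Sketch: each compiled block fits in one virtual memory sub-page, so
# relocating it means copying the block from one sub-page-sized slot to
# another (contiguous or not) and updating the block -> sub-page mapping.
SUB_PAGE_SIZE = 4096  # illustrative 4 kB sub-page

def relocate_block(memory, block_map, block_id, dst_sub_page):
    """Move one relocatable binary block, in its entirety, to another
    sub-page slot, then remap the block to its new sub-page."""
    src = block_map[block_id] * SUB_PAGE_SIZE
    dst = dst_sub_page * SUB_PAGE_SIZE
    memory[dst:dst + SUB_PAGE_SIZE] = memory[src:src + SUB_PAGE_SIZE]
    block_map[block_id] = dst_sub_page
```

Because each block is a continuous segment of code no larger than one sub-page, the copy never splits a block across sub-pages.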
  • a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code is divided into multiple sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks.
  • the executable binary blocks of the divided function of the .text when loaded into memory for program execution as described with reference to FIG. 6, are placed by the loading code (e.g., supervisor software) within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • FIG. 7 is a schematic depicting division of an example .text section 702 into multiple sub-functions, in accordance with some embodiments of the present invention.
  • .text section 702 includes three functions, fun_a(), fun_b(), and fun_c().
  • Schematic 704 depicts a standard implementation based on existing methods, in which .text section 702 is placed into physical memory as a continuous set of code spanning across multiple corresponding virtual memory sub-pages (one virtual memory sub-page marked 706 for clarity). Functions, fun_a(), fun_b(), and fun_c() are stored contiguously.
  • Schematic 708 depicts a division of .text 702 into three sub-functions fun_a(), fun_b(), and fun_c(), where each .text portion of each sub-function (text_a, text_b, and text_c) is placed in a common cluster (i.e., color) 710 of physical memory.
  • the size of each .text section of each function is smaller than one virtual memory sub-page.
  • the entire .text segment is divided into blocks each smaller than or equal to the size of one virtual memory sub-page when compiled.
  • a single function cannot exceed the size of one virtual memory sub-page; function outlining may be used for support. It is noted that both LLVM and GCC (the most widely used compiler toolchains) already implement function outlining.
  • functions that are each smaller than the size of one virtual memory sub-page when compiled are arranged to fit entirely within one virtual memory sub-page when compiled.
  • the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled.
  • the data storage structure is divided into multiple sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
  • Exemplary data structures include: array and vector.
  • a dereferencing data structure (e.g., implemented as a table) stores data for accessing each element of each sub-data storage structure.
  • the dereferencing data structure may be created and/or the data may be stored within an existing dereferencing data structure.
  • the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
  • FIG. 8 is a schematic depicting a dereferencing table 802 (also referred to as subcolor_array) for accessing each element of sub-arrays (one sub-array 804 depicted for clarity) which are obtained by dividing an array, in accordance with some embodiments of the present invention.
  • the array is stored in a virtual memory page 806, optionally a huge page.
  • Each sub-array 804 is less than or equal to the size of one virtual memory sub-page (one sub-page 808 depicted for clarity) of virtual memory page 806.
  • FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure (the last 4 lines), in accordance with some embodiments of the present invention.
  • the code represents a possible ASM translation. Different ISAs may enable faster data access.
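The table-based access above may be sketched as follows, in Python for clarity rather than as the generated machine code of FIG. 9. The element capacity per sub-page is an illustrative assumption (e.g., a 4 kB sub-page holding 4-byte elements).

```python
# Sketch of the dereferencing table (subcolor_array in FIG. 8): a large
# array is split into sub-arrays, each fitting one sub-page, and element
# access goes through the table, which supplies the sub-array base.
ELEMS_PER_SUBPAGE = 1024  # illustrative: 4 kB sub-page / 4-byte elements

def split_array(data):
    """Divide an array into sub-arrays that each fit one sub-page,
    returning the dereferencing table (list of sub-arrays)."""
    return [data[i:i + ELEMS_PER_SUBPAGE]
            for i in range(0, len(data), ELEMS_PER_SUBPAGE)]

def deref(table, index):
    """Access element `index` through the dereferencing table: select
    the sub-array, then apply the offset within it."""
    return table[index // ELEMS_PER_SUBPAGE][index % ELEMS_PER_SUBPAGE]
```

The division and modulo correspond to the offset arithmetic that the compiler-generated code of FIG. 9 performs at runtime.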
  • a program stack is accessed and/or managed by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
  • the code outputted by the compiler may be modified for accessing and/or managing the stack.
  • a new program stack frame that updates a program stack pointer that points to each divided block is added.
  • the new program stack frame is added by adding an offset according to the size of the virtual memory sub-pages storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
  • the stack may be allocated for a certain set of page colors.
  • the caller function checks for the stack size. Since the argument sizes are already known at compile time, the caller code may decide to insert the new stack frame described herein after calculating the new stack position. Then, at the new location, the argument for the caller is laid out. At that point the caller code may pass the execution to the needed function, by updating the stack pointer.
  • When the called function returns, the called function code saves the return value for the caller. Eventually unsaved caller registers are restored, and at the time of unrolling to the previous stack frame the called function code notices the proposed additional stack frame. Because of the new stack frame, the returning function adjusts the stack frame pointer before giving back control to the calling function.
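The blocked stack growth described above may be sketched as follows. This is an illustrative model only, assuming a 4 kB sub-page and tracking frame sizes rather than real register state.

```python
# Sketch: the program stack grows in sub-page-sized blocks. When a new
# frame would cross a sub-page boundary, an extra stack frame redirects
# the stack pointer into a fresh block; unwinding notices the extra
# frame and returns to the previous block.
SUB_PAGE_SIZE = 4096  # illustrative 4 kB sub-page

class BlockedStack:
    def __init__(self):
        self.blocks = [[]]   # frame sizes per block
        self.used = [0]      # bytes used per block

    def push_frame(self, frame_size):
        assert frame_size <= SUB_PAGE_SIZE
        if self.used[-1] + frame_size > SUB_PAGE_SIZE:
            # insert the "new stack frame": continue in a fresh block
            self.blocks.append([])
            self.used.append(0)
        self.used[-1] += frame_size
        self.blocks[-1].append(frame_size)

    def pop_frame(self):
        size = self.blocks[-1].pop()
        self.used[-1] -= size
        if not self.blocks[-1] and len(self.blocks) > 1:
            # unwinding adjusts back to the previous block
            self.blocks.pop()
            self.used.pop()
        return size
```

Since argument sizes are known at compile time, a real compiler can decide at the call site whether the extra frame is needed, as described above.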
  • the blocks are compiled into executable binary blocks.
  • .text section divided into blocks may be compiled with one .text section per function, which may enable quick re-coloring.
  • a table storing the relocation data may be created for future re-coloring.
  • a designation may be included, of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page(s) by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page(s) and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the designation may be stored, for example, as metadata within the program, by a specialized data structure external to the program (e.g., table indicating whether the program is associated with the designation), and/or a value in a field stored by the program indicative of the designation.
  • the program is provided for execution.
  • the program may be, for example, locally stored in a data storage device, and/or transmitted to another computing device (e.g., a client terminal that provided the pre-compilation code, and/or another client terminal).
  • FIG. 10 is a schematic depicting additional exemplary components of compiler 412A and linker 412B (as described with reference to FIG. 4) for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
  • the components may represent a modification of the traditional compilation and/or traditional static and/or dynamic linking process for each of the main application parts.
  • Additional and/or modified components of compiler 412A include:
  • Functional outliner 1002 for dividing a function (e.g., of a .text section) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, as described herein.
  • Scattered data structure support 1004 for dividing the data storage structure into sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled, as described herein.
  • Stack support 1006 for accessing and/or managing a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page, as described herein.
  • Defaults 1008 adds new defaults to the compiler, such as the default compilation methods that may include or exclude the support for page coloring in huge pages.
  • Additional and/or modified components of linker 412B include:
  • Loader hooks 1014 creates additional handles for the loader to help functionalities such as re-coloring and/or runtime coloring.
  • Metadata generation 1016 for including a designation of divided executable binary blocks for appropriate loading of the program by supervisor software.
  • FIG. 6 is a flowchart of a method for execution of the program within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • instructions to load an application for execution are received. For example, a user clicks on an icon associated with the application, and/or another process triggers loading of the application.
  • a binary file of the application divided into blocks is identified, for example, based on an analysis of the designation associated with the application (as described with reference to act 508 of FIG. 5).
  • the size of each block of the divided application is less than or equal to a size of a virtual memory sub-page.
  • Each physical memory block is of a size corresponding to a virtual memory sub-page size allocated for the application.
  • an allocation of virtual memory page(s) for the application is received.
  • the size of the virtual memory page(s) is mapped to an equal size of contiguous physical memory area(s).
  • the virtual memory page(s) include virtual memory sub-pages mapped to the clusters of physical memory blocks. Each physical memory block has a size corresponding to the size of a virtual memory sub-page.
  • the binary loader may allocate a virtual memory page (e.g., huge page) for the .text.
  • the binary loader issues a request to the supervisor code for the allocated color(s).
  • the loader may be placed at any virtual memory sub-page of any color, since the user-space loader is executed once during initialization.
  • the .text code may be stored in a virtual memory page (e.g., huge page) thus preserving the coloring.
  • the code may be re-linked, including symbols.
  • the loader may be modified to perform a runtime re-linking based on the selected colors during a re-coloring phase.
  • the application loader may implement a memory allocator supporting page coloring.
  • Page colors allocated to the application may be dynamically updated at run-time.
  • the application address space may be dynamically updated to allocate additional virtual memory pages (e.g., huge pages) to the application.
  • the blocks of the binary file of the application are loaded into the allocated virtual memory page(s).
  • the blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
  • Each application is loaded with a limited number of the allocated page colors. Different applications are assigned different colors selected from all the available colors, to enable multiple applications to be loaded simultaneously.
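The per-application color assignment described above may be sketched as follows. The number of colors and the round-robin policy are illustrative assumptions; any policy that keeps the color sets disjoint would do.

```python
# Sketch: each application is loaded with a limited subset of the
# available page colors, and different applications receive disjoint
# subsets so they can be loaded simultaneously.
NUM_COLORS = 16  # illustrative number of available colors

def assign_colors(apps, colors_per_app):
    """Give each application its own disjoint subset of colors."""
    assert len(apps) * colors_per_app <= NUM_COLORS
    allocation, next_color = {}, 0
    for app in apps:
        allocation[app] = list(range(next_color, next_color + colors_per_app))
        next_color += colors_per_app
    return allocation
```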
  • the dereferencing data structure is populated with data for accessing each element of sub-data storage structures of the data storage structure.
  • the loader may populate the dereferencing table based on the application’s assigned colors.
  • the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks. Each physical memory block is of a size corresponding to the virtual memory sub-page size allocated to the application associated with the data storage structure.
  • Each sub-data storage structure may be placed on page boundaries.
  • a program stack is grown in blocks. Each block has a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime.
  • the program stack is grown according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset.
  • the offset is computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks.
  • the application includes sub-functions, obtained by dividing a function that is larger than the size of one virtual memory sub-page, that are each smaller than or equal to the size of one virtual memory sub-page.
  • the sub-functions are stored at respective virtual memory sub-pages mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the location of each sub-function is stored in a mapping data structure for runtime execution of the function.
  • the application is executing. Re-coloring of the application may be performed at runtime.
  • one or more of the binary blocks are dynamically moved from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster.
  • the mapping between virtual memory sub-pages and clusters of physical memory blocks is updated according to the dynamic move.
  • Runtime relocation of the dereferencing data structure may be performed when no pointers to the actual elements of the data structure are stored by the code. For example, the code is prevented from saving pointers to the data structure elements. Access to the data structure elements is provided via indexes.
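Why index-based access permits this runtime relocation may be sketched as follows: since code never holds raw pointers into the data structure, moving a sub-array and updating the dereferencing table is invisible to readers. This is an illustrative model with a tiny assumed sub-page capacity; the relocation is simulated as a fresh copy.

```python
# Sketch: access to data structure elements goes through indexes and
# the dereferencing table, so a sub-array can be relocated at runtime
# without invalidating any pointer held by the code.
ELEMS = 4  # illustrative tiny sub-page capacity, in elements

def deref(table, index):
    """Index-based access through the dereferencing table."""
    return table[index // ELEMS][index % ELEMS]

def recolor(table, sub_idx):
    """Relocate one sub-array (simulated as a fresh copy at a new
    location) and update the dereferencing table entry."""
    table[sub_idx] = list(table[sub_idx])
```

After `recolor`, the same index still resolves to the same element, because only the table entry changed.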
  • FIG. 11 is a schematic depicting additional exemplary components of a runtime 1102 and/or operating system 1104 and/or memory management 1106 for loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • Additional and/or modified components of runtime 1102 include:
  • Load-time and run-time symbol relocation 1108 for dynamically moving binary block(s) from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster and updating a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the move.
  • Additional and/or modified components of executable binary loader 1112 of operating system 1104 include:
  • Additional and/or modified components of memory management 1106 include:
  • Coloring allocator 1116 that performs an allocation of virtual memory page(s) for the application according to clusters, as described herein.
  • FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention. It is noted that the method is not necessarily limiting.
  • the division is performed for a source program 202 by a compiler 204 to create an object code program 206, as follows:
  • a parser unit parses source program 202, for example, according to common compiler practice.
  • an intermediate code conversion unit performs intermediate code conversion, for example, according to common compiler practice.
  • an optimization unit 223 performs optimization of the intermediate code, for example, according to common compiler practice.
  • a code generation unit generates code, for example, according to common compiler practice.
  • functions larger than the size of one virtual memory sub-page are divided into sub-functions that are each smaller than the size of one virtual memory sub-page.
  • LLVM and GCC are exemplary compiler frameworks that are production quality and commonly used in software development.
  • LLVM and GCC implement function outlining.
  • An example implementation of outlining is in the framework called OpenMP.
  • An example of code that can be outlined is a loop.
  • the compilation outputs an object file 206 with one section per sub-function and associated relocatable code (i.e., relocations). Relocation symbols may be defined in the .reloc section.
  • the jump tables define how the blocks which are loaded into non-contiguous memory areas are linked to one another.
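The jump-table linkage described above may be sketched as follows. This is an illustrative model: block identifiers and addresses are made up for the example, and a real jump table would hold machine addresses emitted by the linker.

```python
# Sketch: blocks loaded into non-contiguous memory areas are linked to
# one another through a jump table, which records where each block's
# sub-page was actually loaded.
def build_jump_table(block_addresses):
    """Map each block id to the address its sub-page was loaded at."""
    return dict(block_addresses)

def call_target(jump_table, block_id, offset_in_block):
    """Resolve a cross-block call: table lookup plus intra-block offset."""
    return jump_table[block_id] + offset_in_block
```

Call sites thus remain valid even when a block is reloaded at a different sub-page, because only the table entry changes.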
  • the sectioning unit helps the compiler divide code and/or data objects in the size of a sub-page.
  • a packing unit of linker 208 packs code functions from object code program 206 in the minimum size according to one virtual memory sub-page (e.g., 4 kB).
  • the information about functions may be maintained or discarded.
  • the packing creates the order in which functions are placed by the supervisor software within a cluster of virtual memory sub-pages. Padding may be applied to avoid function spanning across multiple virtual memory sub-pages (e.g., over 4 kB).
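The packing-with-padding step described above may be sketched as follows, assuming the 4 kB sub-page size mentioned in the text. This is an illustrative next-fit placement, not the specification's exact algorithm.

```python
# Sketch: functions are placed in order into sub-page-sized slots,
# padding to the next sub-page boundary whenever a function would
# otherwise span across multiple virtual memory sub-pages.
SUB_PAGE_SIZE = 4096  # 4 kB, as in the text

def pack_functions(sizes):
    """Return the start offset of each function, padding so that no
    function crosses a sub-page boundary."""
    offsets, cursor = [], 0
    for size in sizes:
        assert size <= SUB_PAGE_SIZE  # guaranteed by function outlining
        room = SUB_PAGE_SIZE - (cursor % SUB_PAGE_SIZE)
        if size > room:               # pad: start on the next sub-page
            cursor += room
        offsets.append(cursor)
        cursor += size
    return offsets
```

The resulting order is the order in which the supervisor software later places the functions within a cluster of virtual memory sub-pages.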
  • a jump table generation unit computes the jump table.
  • a linking generation unit performs linking according to standard linker practice; the linking generation unit assumes a single continuous .text section, since each program block is continuous in the virtual address space without overlapping one another, as defined by the jump table.
  • a relocation and symbol generation unit saves the relocation information in the executable binary 212, along with symbol information.
  • an additional metadata generation unit adds a tag to the executable binary 212, that acts as an indication to the program loader that executable binary 212 has been compiled to support page coloring with huge pages, and is therefore amenable to load-time block relocation.
  • FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. It is noted that the .text section is described as one example, with the operation principles of the method applicable to other executable binary sections.
  • Machine code 212 is received by supervisor software 214.
  • Machine code 212 is created based on the method described with reference to FIG. 12.
  • Executable binary loader 216 of supervisor software 214 may exist as part of the operating system (OS), and/or loaded by the OS in the same address space of the application.
  • the implementation depicted herein (which is not necessarily limiting) is based on executable binary loader 216 implemented within the OS.
  • the executable binary loader 216 performs the following:
  • the header parsing unit reads the set of headers that describe the executable binary file and parses the content of the description, in accordance with standard supervisor software practice and/or executable binary loader practice.
  • the binary file is checked for the tag that indicates that the binary has been compiled for page coloring with (optionally huge) virtual memory pages (e.g., the tag is created by the compiler to distinguish the type of compilation, for example, as described with reference to act 231 of FIG. 12). It is noted that the tag is an exemplary implementation and not necessarily limiting.
  • the binary file may further be checked to verify that no code function is larger than the size of one virtual memory sub-page (e.g., 4 kB).
  • the binary file may further be checked to verify that the relocation symbols are available in the executable.
  • the generate page color allocation unit determines the colors that are to be assigned to the .text section.
  • the page/huge page memory allocation unit allocates a certain number of virtual memory pages (e.g., huge pages) for the binary and loads the entire .text section at the beginning of the allocated memory.
  • the function/data relocation unit moves each .text section block (of size of one virtual memory sub-page or less, e.g., 4 kB) to a virtual memory page which respects the assigned coloring, saving the offset for each page.
  • a scheduler unit schedules execution of the application, according to common supervisor software practice.
  • a runtime binary loader 218 of program 220 performs symbol relocations according to common runtime binary loader practice.
  • runtime binary loader 218 uses the relocation information to pass through the entire .text section to change function pointers at runtime by generating runtime jump tables.
  • when the start address of the program is changed due to coloring, the start address is updated.
  • control is passed to the application that begins running.
  • the terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
  • The term "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • The term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.


Abstract

There is provided a compiler configured to: receive pre-compilation code for compilation, wherein the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page corresponding to one physical memory block that is mapped to a virtual memory page, divide the pre-compilation code into blocks that, when compiled into respective executable binary blocks, are each less than or equal to the size of a virtual memory sub-page, compile the blocks into executable binary blocks; and link the executable binary blocks into a program and include a designation of the executable binary blocks for loading of the program by supervisor software into allocated virtual memory page(s) by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.

Description

SYSTEMS FOR COMPILING AND EXECUTING CODE WITHIN ONE OR MORE VIRTUAL MEMORY PAGES
BACKGROUND
The present invention, in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
In the framework of multiprocessors/multicore processors which have a high number of cores, and/or software that precludes the co-execution of multiple logical execution units (tasks), sharing access among execution entities to memory resources is increasingly important for performance and energy efficiency reasons. Memory resources include, for example, processor caches, including one or more of: L1, L2, L3, and L4 (e.g., L1, L1-L2, L1-L3, and L3-L4) (the highest level is termed last level cache (LLC)), the processor memory bus/ring that interconnects multiple groups/clusters via their LLC, and the memory controller and its (parallel) interconnections to the parallel memory elements (banks).
In order to partition the usage of memory resources among different execution entities, different techniques have been introduced, including page-coloring, a software-only technology that requires virtual memory to be implemented. In the case of cache partitioning, page-coloring requires physically indexed and tagged caches, at least at the LLC. In the case of memory bandwidth partitioning, page-coloring may require software configuration of bank interleaving.
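How a page color is derived from a physical address in a physically indexed cache may be sketched as follows. The bit positions are illustrative assumptions (4 kB blocks, 16 colors); real values depend on the cache geometry of the particular processor.

```python
# Sketch: in a physically indexed LLC, the cache-set index bits just
# above the block offset select the page color, so controlling which
# physical blocks back a page controls which cache sets it occupies.
BLOCK_BITS = 12  # illustrative: 4 kB blocks
COLOR_BITS = 4   # illustrative: 16 colors from LLC set-index bits

def page_color(physical_address):
    """Color = set-index bits just above the 4 kB block offset."""
    return (physical_address >> BLOCK_BITS) & ((1 << COLOR_BITS) - 1)
```

Two physical blocks with the same color compete for the same cache sets; assigning disjoint colors to different execution entities partitions the cache between them.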
SUMMARY
It is an object of the present invention to provide an apparatus, systems, methods, and/or code instructions for compiling code for runtime execution within virtual memory sub-pages of a virtual memory page(s), and/or for loading code for execution within virtual memory sub-pages of a virtual memory page(s).
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, an apparatus for compiling code for runtime execution within a plurality of virtual memory sub-pages of at least one virtual memory page, comprises: a compiler executable by a processor, the compiler configured to: receive pre-compilation code for compilation, wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, the size of each physical memory block is the size of a virtual memory sub-page, divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block, compile the plurality of blocks into the plurality of executable binary blocks, and link the plurality of executable binary blocks into a program and include a designation of the plurality of executable binary blocks for loading of the program by supervisor software into an allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and allocated plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
According to a second aspect, an apparatus for loading code for execution within a plurality of virtual memory sub-pages of at least one virtual memory page, comprises: a processor, a memory storing code instructions for execution by the processor, comprising: code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page, code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size, code to load the plurality of blocks of the binary file of the application into the allocated at least one virtual memory page, wherein the plurality of blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated plurality of clusters of physical memory blocks.
The systems, apparatus, methods, and/or code instructions described herein extend page coloring (i.e., clustering) to huge virtual memory pages. The implementation of the systems, apparatus, methods, and/or code instructions described herein is transparent to executing code (e.g., the program, application).
The implementation of the systems, apparatus, methods, and/or code instructions described herein is based on system software.
Existing (e.g., legacy) code (e.g., programs, applications) may be re-compiled to utilize the implementation based on the systems, apparatus, methods, and/or code instructions described herein. New programs designed for implementation based on the systems, apparatus, methods, and/or code instructions described herein are not necessarily required.
The systems, apparatus, methods, and/or code instructions described herein provide a software-based solution based on modification of the system software (e.g., operating system code, runtime code, compiler code, and/or linker code). The software-based solution does not necessarily require any modification of processing hardware and/or addition of new processing hardware, and may be executed by existing processing hardware, for example, in comparison to other proposed solutions that are based on at least some modification of processing hardware and/or new hardware component(s).
The systems, apparatus, methods, and/or code instructions described herein, being based on software without requiring modification of hardware and/or new hardware provides scalability, for example, in comparison to other attempts based on new and/or modified hardware that are limited in scalability by the hardware. The problem of scalability may be further explained as follows. Commodity processor instruction set architectures (ISAs) offer only a limited/fixed number of ways for the last level cache (LLC) independently of the number of cores available on the multicore processor, hence the scalability issue (note that there exist other approaches that rely on hardware extensions to partition the cache to different physical or logical execution units, an example is way-partitioning).
In a further implementation form of the first aspect, the compiler is further configured to divide a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by the supervisor software within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size. In a further implementation form of the first aspect, the compiler is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
In a further implementation form of the first aspect, the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler is further configured to divide the data storage structure into a plurality of sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
In a further implementation form of the first aspect, the compiler is further configured to create a dereferencing data structure for accessing each element of each sub-data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
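The dereferencing data structure described in this implementation form can be pictured as a table of per-sub-structure base addresses, with each element access split into a table index and an intra-sub-structure offset. The following Python sketch is purely illustrative (the class and method names, and the one-address-unit element size, are assumptions for clarity, not part of the claims):

```python
class SplitArray:
    """Array divided into sub-page-sized sub-arrays, accessed through a
    dereferencing table of per-sub-array base addresses (an element size
    of one address unit is assumed for simplicity)."""

    def __init__(self, elems_per_subpage, bases):
        self.n = elems_per_subpage
        self.bases = bases  # base address of each sub-array

    def address_of(self, i):
        # The table index selects the sub-array; the remainder is the
        # offset added within it, mirroring the runtime offset addition
        # performed by the dereferencing data structure.
        return self.bases[i // self.n] + (i % self.n)

a = SplitArray(512, bases=[0x1000, 0x9000, 0x3000])
assert a.address_of(0) == 0x1000           # first element of sub-array 0
assert a.address_of(512) == 0x9000         # first element of sub-array 1
assert a.address_of(1030) == 0x3000 + 6    # element 6 of sub-array 2
```

Note that the sub-array base addresses need not be contiguous, which is precisely what allows each sub-structure to reside in a physical memory block of its assigned cluster.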
In a further implementation form of the first aspect, the compiler is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
In a further implementation form of the first aspect, the compiler is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
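The stack-growth scheme of these implementation forms can be sketched as follows: a new frame advances the stack pointer within the current sub-page-sized stack block, and jumps to the base of the next allocated block when the frame would cross a block boundary. All names, the 4 kB block size, and the upward-growing stack are illustrative assumptions:

```python
SUBPAGE = 4096  # assumed sub-page-sized stack block

def push_frame(stack_pointer, frame_size, next_block_base):
    """Advance the stack pointer by one frame; when the frame would cross
    the current sub-page-sized stack block, continue at the base of the
    next allocated block instead (an upward-growing stack is assumed)."""
    if (stack_pointer % SUBPAGE) + frame_size > SUBPAGE:
        stack_pointer = next_block_base  # offset into the next block
    return stack_pointer + frame_size

sp = push_frame(0x7000_0100, 256, 0x7100_0000)
assert sp == 0x7000_0200   # frame fits within the current block
sp = push_frame(0x7000_0F00, 512, 0x7100_0000)
assert sp == 0x7100_0200   # frame relocated to the next allocated block
```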
In a further implementation form of the first aspect, the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor.
In a further implementation form of the first aspect, each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
In a further implementation form of the second aspect, the apparatus further comprises code to dynamically move at least one of the plurality of binary blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move. In a further implementation form of the second aspect, the apparatus further comprises code to populate data of a dereferencing data structure for accessing each element of sub-data storage structures of a data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
In a further implementation form of the second aspect, the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
In a further implementation form of the second aspect, the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page, at a respective virtual memory sub-page mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
In a further implementation form of the second aspect, in an implementation of the processor lacking a paging mechanism, the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that make up the size of a virtual memory page.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a schematic depicting how page colors are arranged in a physical address space, to help in understanding the technical problem addressed by some implementations of the present invention;
FIG. 2 is a schematic depicting an application using virtual pages of three different colors, to help in understanding the technical problem addressed by some implementations of the present invention;
FIG. 3 is a schematic depicting an example of an application that uses virtual memory paging with at least one huge virtual memory page, in accordance with some embodiments of the present invention;
FIG. 4 is a schematic of a block diagram of a system that includes a computing device for compiling code for runtime execution within virtual memory sub-pages and/or for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention;
FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention;
FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention;
FIG. 7 is a schematic depicting division of an example .text section into multiple sub-functions, in accordance with some embodiments of the present invention; FIG. 8 is a schematic depicting a dereferencing table for accessing each element of sub-arrays which are obtained by dividing an array, in accordance with some embodiments of the present invention;
FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure, in accordance with some embodiments of the present invention;
FIG. 10 is a schematic depicting additional exemplary components of a compiler and a linker for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention;
FIG. 11 is a schematic depicting additional exemplary components of a runtime and/or operating system and/or memory management for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention;
FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention; and
FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
As used herein, the term cluster or clustering and the word color or coloring are interchangeable. For example, each cluster is assigned a certain color.
As used herein, the term huge virtual memory page refers to a virtual memory page that is larger than the standard physical memory page size defined by the hardware implementation. It is noted that different implementations may refer to huge pages with other terms, for example, large pages.
As used herein, the term standard size virtual memory page refers to a virtual memory page defined by the hardware as the minimum unit of translation. The size of each physical memory block is the size of a virtual memory sub-page. The terms huge virtual memory page, standard virtual memory page, and virtual memory page are sometimes used interchangeably.
An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for compiling pre-compilation code for runtime execution within virtual memory sub-pages of virtual memory page(s). The size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page. The pre-compilation code is divided into blocks, such that each block when compiled into a respective executable binary block is less than or equal to the size of a virtual memory sub-page (of the virtual memory page corresponding to the size of one physical memory block). The blocks are compiled into executable binary blocks. The executable binary blocks are linked into a program. The program includes a designation of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page. The supervisor software loads the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page and allocated clusters of physical memory blocks. Each block is of a size corresponding to a virtual memory sub-page size, for example, 4 kilobytes (kB), which is the smallest page size available for processors based on the x86 architecture.
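The division of the compiled code into sub-page-sized blocks described above amounts to a simple ceiling split. The following Python sketch illustrates only the arithmetic (the function name and the 4 kB sub-page size are illustrative assumptions, not part of the claims):

```python
SUBPAGE_SIZE = 4 * 1024  # 4 kB, the smallest x86 page size (assumed here)

def divide_into_blocks(code: bytes, subpage_size: int = SUBPAGE_SIZE):
    """Split compiled code into blocks no larger than one virtual memory
    sub-page, so that each block fits one physical memory block of a cluster."""
    return [code[i:i + subpage_size] for i in range(0, len(code), subpage_size)]

blocks = divide_into_blocks(bytes(10_000))
assert len(blocks) == 3                              # ceil(10000 / 4096)
assert all(len(b) <= SUBPAGE_SIZE for b in blocks)   # each block fits a sub-page
```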
An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for loading code for execution within virtual memory sub-pages of virtual memory page(s). A binary file of an application divided into blocks is identified. A size of each block is less than or equal to a size of a virtual memory sub-page. An initial allocation of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size is retrieved for the application. An allocation of virtual memory page(s) for the application is received. The size of the virtual memory page is mapped to an equal size of contiguous physical memory areas. The virtual memory page includes virtual memory sub-pages mapped to the clusters of physical memory blocks. The size of each block corresponds to the size of a virtual memory sub-page. The blocks of the binary file of the application are loaded into the allocated virtual memory page(s). The blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
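The loading step of the second aspect can be pictured as drawing, for each virtual memory sub-page, a free physical memory block from the cluster that the mapping assigns to that sub-page. A minimal Python sketch (the function name and the dictionary-based bookkeeping are illustrative assumptions; the actual memory copy is only indicated by a comment):

```python
def load_blocks(blocks, free_blocks_by_cluster, cluster_of_subpage):
    """Place each executable binary block into a free physical memory block
    of the cluster mapped to its virtual memory sub-page; return the
    resulting sub-page -> physical block mapping."""
    mapping = {}
    for subpage_index, block in enumerate(blocks):
        cluster = cluster_of_subpage[subpage_index]
        phys_block = free_blocks_by_cluster[cluster].pop(0)
        # The copy of `block` into the physical block would happen here.
        mapping[subpage_index] = phys_block
    return mapping

free = {1: [0x1000, 0x4000], 2: [0x2000, 0x5000]}
m = load_blocks([b"b0", b"b1", b"b2"], free, {0: 1, 1: 2, 2: 1})
assert m == {0: 0x1000, 1: 0x2000, 2: 0x4000}
```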
Virtual memory sub-pages mapped to respective clusters of memory blocks may be located non-contiguously within the virtual memory page. Virtual memory sub-pages of different clusters may be contiguous with one another, optionally in a repeating pattern, for example, for three defined clusters arranged as: 1, 2, 3, 1, 2, 3, 1, 2, 3.
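The repeating arrangement of clusters across sub-pages reduces to modular arithmetic. A minimal sketch, assuming sub-pages are numbered from zero within the virtual memory page (the function name is an illustrative assumption):

```python
def subpage_cluster(subpage_index: int, num_clusters: int) -> int:
    """Cluster (color) of a sub-page under a periodically repeating pattern."""
    return subpage_index % num_clusters

# Three defined clusters repeat as 1, 2, 3, 1, 2, 3, 1, 2, 3 (1-based labels).
pattern = [subpage_cluster(i, 3) + 1 for i in range(9)]
assert pattern == [1, 2, 3, 1, 2, 3, 1, 2, 3]
```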
The apparatus and/or system described herein address the technical problem of combining software-based memory page clustering (also referred to herein as page coloring) and hardware-based huge memory pages in an efficient and operable manner. Such a combination is not practically possible with current hardware and software architectures. A brief explanation of the current state of the art and the resulting incompatibility of software-based page coloring and hardware-based huge memory pages is now provided.
Current multicore/multiprocessor computers are ubiquitous. Such computer architectures provide improved performance compared to their predecessors by enabling the parallel execution of software on multiple hardware computer devices. However, to enable multiple computer devices to share the same data, which resides in memory, all of the computer devices need to access the same memory locations, which are usually mediated by the (hardware) last level cache. When the last level cache is shared among computer devices, performance issues result due to unfair usage of the last level cache by the different software applications running atop the computer devices (i.e., cores or CPUs). The unfair usage may degrade performance of each application, especially in cases where application code is memory-bound (i.e., a large number of memory accesses are performed) and the memory access pattern is characterized by temporal locality. Page coloring techniques, which are currently implemented as pure software, are used to fairly share the last level cache and reduce application interference.
Generally, current software applications use virtual memory provided by a paging mechanism of the computing device. The minimum granularity of virtual-to-physical translation is a standard page. The standard page is a small page; there can be other small pages defined by the hardware. When an application operates on a wide memory area, the usage of small pages (which may be greater than the size of a standard page) significantly impacts performance due to the high cost of virtual memory translations. A high number of page translations may result in a high number of misses in the TLB cache, requiring numerous memory accesses to fetch each translation (i.e., an operation termed a page walk). Hardware huge pages are implemented to solve the described problem by reducing TLB misses. Software-based page coloring is incompatible with hardware-based huge pages because page coloring is designed to operate at the smallest predefined and/or standard page granularity. Based on existing technology, an attempt to extend the technique of software-based page coloring to hardware-based huge-page coloring may result either in an extremely small number of colors, or no colors at all, which effectively eliminates any potential benefits of implementing coloring.
The apparatus, system, methods, and/or code instructions (stored in a data storage device executed by one or more processors) described herein effectively implement a combination of coloring and huge pages in a manner that improves performance and/or deterministic execution of applications running concurrently on the same computer device.
A brief discussion of other attempts at combining coloring and huge pages is now provided, to help understand the addressed technical problem and described solution. One described strategy is to implement a hardware-based solution to the problem. However, such hardware-only solutions require manufacturing of new hardware processors designed to enable the exploitation of page coloring combined with huge pages. In general, such solutions are complicated and not practical for implementation due to technical difficulty in design and/or manufacturing. Moreover, such solutions are not generic enough to cover expected possible application demands.
Another attempt at addressing the technical problem of combining page coloring and huge pages in an operable manner is termed Cache Allocation Technology (CAT) of Intel®. CAT is designed to transparently support huge pages. However, CAT cannot be easily controlled and/or generally implemented, since the solution is designed specifically for the processors produced by Intel® based on the x86 architecture. Moreover, CAT cannot scale to a high number of applications.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, which is a schematic depicting how page colors (i.e., clusters) are arranged in a physical address space 102, to help in understanding the technical problem addressed by some implementations of the present invention. FIG. 1 depicts the traditional page coloring (i.e., clustering) approach that uses virtual memory to group physically scattered memory pages of the same color together within the same virtual address range. Page colors are periodically repeated. For example, one set of virtual memory pages, corresponding to physical memory blocks of page size, is assigned physical memory pages having the color blue (e.g., cluster 1) 104. It is noted that the page size is a standard page size as defined by the processor (e.g., 4 kB in x86 architectures). Another set of virtual memory pages is assigned physical memory pages having the color green (e.g., cluster 2) 106. It is noted that the labels blue and green are meant as tags to identify the clusters, and do not reflect actual colors of the memory. The colors blue and green are periodically repeated. Pages with the same color have a constant offset 108 in the physical address space.
Reference is now made to FIG. 2, which is a schematic depicting an application (App 1) using virtual pages of three different colors (i.e., clusters), blue 282, green 284, and yellow 286, to help in understanding the technical problem addressed by some implementations of the present invention. A virtual memory subsystem (a component implemented in hardware and/or software) enables the application to organize scattered physical pages of a physical address space 288 into linear (virtual) memory ranges of a virtual address space 290. Note that a specific page color organization is shown in FIG. 2, but it is to be understood that there are multiple possible organizations.
One technical problem with the implementation of virtual memory page coloring with virtual memory huge pages is that the coloring associates one color per page, independently of the size of the page. Therefore, with virtual memory pages one page corresponds to one color, and with virtual memory huge pages, one huge page corresponds to one color. Because a huge virtual memory page incorporates multiple standard pages, pages of multiple colors are integrated into a single huge page that is mapped 1:1 from physical memory to virtual memory. This implies that in a system in which some applications use virtual memory pages but also virtual memory huge pages, page coloring is incompatible with virtual huge page coloring, due to the fact that the huge page integrates multiple pages of all possible colors.
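The incompatibility can be checked numerically. Assuming 4 kB standard pages, 2 MB huge pages (typical for x86), and an illustrative count of 64 colors derived from the physical page frame number (the exact count depends on the cache geometry), a single huge page necessarily contains standard pages of every color:

```python
PAGE_SHIFT = 12                    # 4 kB standard page
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # 2 MB huge page (x86)
NUM_COLORS = 64                    # illustrative, cache-geometry dependent

def page_color(phys_addr: int) -> int:
    """Color of the 4 kB page containing a physical address."""
    return (phys_addr >> PAGE_SHIFT) % NUM_COLORS

base = 0x4000_0000  # assumed huge-page-aligned physical address
colors = {page_color(base + i * 4096) for i in range(HUGE_PAGE_SIZE // 4096)}
assert len(colors) == NUM_COLORS   # one huge page spans every color
```

Since the huge page is translated as a single unit, the operating system cannot choose which of those colors the application receives, which is why naive huge-page coloring degenerates to one color per huge page.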
Reference is now made to FIG. 3, which is a schematic depicting an example of an application (App 1) that uses paging with at least one virtual memory page coloring (i.e., clustering) within a huge virtual memory page, in accordance with some embodiments of the present invention. A huge page 302 within a physical address space 304 may be located anywhere within the application’s assigned virtual address space 306. However, the colored sub-pages (e.g., one set 308 depicted for clarity) are fixed within huge page 302.
Reference is now made to FIG. 4, which is a schematic of a block diagram of a system 400 that includes a computing device 402 for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s) of a virtual memory 404 and/or for loading code for execution within virtual memory sub-pages of virtual memory page(s) of virtual memory 404, in accordance with some embodiments of the present invention. Reference is also made to FIG. 5, which is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention. Reference is also made to FIG. 6, which is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention. The methods of FIG. 5 and/or FIG. 6 may be implemented by code stored in data storage device 412 executed by processor(s) 406. Data storage device 412 may be implemented as random access memory (RAM), or code may be moved from data storage device 412 to RAM for execution by processor(s) 406. For example, the method of FIG. 5 may be implemented by compiler code 412A and/or linker code 412B. The method of FIG. 6 may be implemented by loading code 412C, for example, supervisor code, application loader, and/or library loader.
As used herein, the term supervisor (e.g., code, software) and loading code may be interchanged.
It is noted that compiler code 412A and linker code 412B may be implemented as a single component referred to herein as a compiler. Alternatively, the compiler and linker are implemented as distinct components.
It is noted that different architectures of computing device 402 may be implemented. For example, the same computing device 402 may compile code (or re-compile previously compiled code) for runtime execution within virtual memory sub-pages of virtual memory page(s), and load the compiled code for execution within virtual memory sub-pages of virtual memory page(s). Alternatively, one computing device 402 performs the compilation of the code, for example, for locally stored code, for code transmitted by client terminal(s) and/or server(s), and/or providing remote services to client terminal(s) and/or server(s) (e.g., via a software interface such as an application programming interface (API), software development kit (SDK), a web site interface, and an application interface that is loaded on the client terminal and/or server). The compiled code may be provided for execution within virtual memory sub-pages of virtual memory page(s) of another computing device, for example, by the client terminal(s) and/or server(s) that provided the code for compilation, and/or by another client terminal and/or server that receives the compiled code for local execution.
Optionally, processor(s) 406 includes a paging mechanism 416 that maps between virtual memory 404 and physical memory 408. It is noted that virtual memory 404 represents an abstraction and/or a virtual component, since virtual memory 404 does not represent an actual physical virtual memory device. Paging mechanism 416 may be implemented in hardware. When an implementation of processor(s) lacks a paging mechanism, the virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of contiguous physical memory blocks that make up the size of a virtual memory page. Optionally, the physical memory block offset to the beginning of the contiguous physical memory blocks is the same as the offset that the virtual memory sub-page has to the beginning of the virtual memory page. In a processor without a paging mechanism there is no virtual page concept. Virtual memory sub-pages are physical memory blocks. Virtual memory pages are a collection of contiguous physical memory blocks. The systems, apparatus, methods, and/or code instructions described herein enable page coloring without necessarily requiring a virtual memory subsystem.
Computing device 402 may be implemented as, for example, one or more of: a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, a desktop computer, and an internet of things (IoT) device.
Processor(s) 406 may be implemented as, for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), microprocessing unit(s) (MPU), processors for interfacing with other units, and/or specialized hardware accelerators. Processor(s) 406 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures).
Physical memory device(s) 408 and/or data storage device 412 are implemented, for example, as one or more of: a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
It is noted that paging mechanism 416 is the memory component that creates virtual memory 404 from physical memory 408 and/or data storage device 412.
Computing device 402 may be in communication with a user interface 414 that presents data and/or includes a mechanism for entry of data, for example, one or more of: a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone. User interface 414 may be used to configure parameters, for example, define the size of each virtual memory sub-page, and/or define the number of available clusters.
Reference is now made to FIG. 5, which is a flowchart of a method for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s). The size of each virtual memory sub-page is at least as large as a predefined size of a physical memory block associated with the processor. It is noted that in high-level languages (e.g., C/C++, Fortran, Java, Python, and the like) the machine code is outputted by the compiler. Modifications to the machine code based on the method described with reference to FIG. 5 are transparent to the programmer. It is noted that the compiler assumes that the application will be run on virtual memory.
At 502, pre-compilation code is received for compilation by the compiler. The pre-compilation code may include source code, i.e., text-based code written by a programmer. The pre-compilation code may include object code that is already compiled but not yet linked. The pre-compilation code may include an internal representation of the code within the compiler. The source code may be written in different programming languages. The pre-compilation code may be new code for a first-time compilation, or may include old code (e.g., a legacy application) that has been previously compiled but is now being re-compiled for runtime execution within virtual memory sub-pages of virtual memory page(s).
The size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page.
At 504, the pre-compilation code, which cannot fit into one virtual memory sub-page when compiled, is divided into blocks. Each block, when compiled into a respective executable binary block, has a size less than or equal to the size of a virtual memory sub-page of the virtual memory page corresponding to the size of one physical memory block.
Each binary block is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page. Blocks may be relocated at runtime by moving each block from one area of physical memory to another area of the physical memory. Since each block is mapped to a virtual memory sub-page, a block is moved from one virtual memory sub-page to another virtual memory sub-page. Blocks may be moved to a contiguous virtual memory sub-page, or to another virtual memory sub-page that is non-contiguous. For example, a block in the virtual memory sub-page labeled 1234 may be moved to virtual memory sub-page 1235, or to virtual memory sub-page 123456789.
Non-limiting methods for dividing some exemplary data structures that cannot fit into one virtual memory sub-page when compiled are now discussed. It is to be understood that other data structures not explicitly discussed herein may be divided based on similar principles.
Optionally, a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, is divided into multiple sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks. The executable binary blocks of the divided function of the .text, when loaded into memory for program execution as described with reference to FIG. 6, are placed by the loading code (e.g., supervisor software) within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
Reference is now made to FIG. 7, which is a schematic depicting division of an example .text section 702 into multiple sub-functions, in accordance with some embodiments of the present invention. .text section 702 includes three functions, fun_a(), fun_b(), and fun_c(). Schematic 704 depicts a standard implementation based on existing methods, in which .text section 702 is placed into physical memory as a continuous set of code spanning across multiple corresponding virtual memory sub-pages (one virtual memory sub-page marked 706 for clarity). Functions fun_a(), fun_b(), and fun_c() are stored contiguously. Schematic 708 depicts a division of .text 702 into three sub-functions fun_a(), fun_b(), and fun_c(), where the .text portion of each sub-function (text_a, text_b, and text_c) is placed in a common cluster (i.e., color) 710 of physical memory. The size of each .text section of each function is smaller than one virtual memory sub-page.
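The notion of a cluster (color) of sub-pages can be illustrated with a short sketch. This is a simulation only: the 4 kB sub-page size, the 2 MB huge-page size, and the round-robin striping of colors across sub-pages are assumptions borrowed from classic page coloring, not requirements of the embodiments:

```python
SUB_PAGE = 4096               # assumed sub-page size in bytes
HUGE_PAGE = 2 * 1024 * 1024   # assumed huge-page size in bytes

def cluster_subpages(color, n_colors):
    """Return the sub-page indices of one cluster (color) within a huge
    page, assuming colors stripe round-robin across the sub-pages."""
    n_subpages = HUGE_PAGE // SUB_PAGE   # 512 sub-pages per huge page
    return [i for i in range(n_subpages) if i % n_colors == color]
```

With 4 colors, cluster 1 of a 2 MB huge page contains 128 sub-pages: indices 1, 5, 9, and so on; blocks text_a, text_b, and text_c of FIG. 7 would all be loaded into sub-pages of one such list.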
Returning now to act 504 of FIG. 5, it is noted that functions (e.g., of a .text section) smaller than one virtual memory sub-page are relocatable, and do not necessarily require division.
Optionally, the entire .text segment is divided into blocks each smaller than or equal to the size of one virtual memory sub-page when compiled. A single function cannot exceed the size of one virtual memory sub-page; function outlining may be used to support this requirement. It is noted that both LLVM and GCC (the most widely used compiler toolchains) already implement function outlining.
Optionally, functions that are each smaller than the size of one virtual memory sub-page when compiled are arranged to fit entirely within one virtual memory sub-page when compiled.
Optionally, the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled. The data storage structure is divided into multiple sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled. Exemplary data structures include: array and vector.
Optionally, a dereferencing data structure (e.g., implemented as a table) stores data for accessing each element of each sub-data storage structure. The dereferencing data structure may be created and/or the data may be stored within an existing dereferencing data structure. The dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to a virtual memory sub-page size, allocated to the application associated with the data storage structure.
Reference is now made to FIG. 8, which is a schematic depicting a dereferencing table 802 (also referred to as subcolor_array) for accessing each element of sub-arrays (one sub-array 804 depicted for clarity) which are obtained by dividing an array, in accordance with some embodiments of the present invention. The array is stored in a virtual memory page 806, optionally a huge page. Each sub-array 804 is less than or equal to the size of one virtual memory sub-page (one sub-page 808 depicted for clarity) of virtual memory page 806.
Reference is now made to FIG. 9, which is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure (the last 4 lines), in accordance with some embodiments of the present invention. The code represents a possible ASM translation. Different ISAs may enable faster data access.
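The extra dereference generated by the compiler can also be sketched in Python. This is an illustrative simulation of the address arithmetic of FIG. 9; the 4 kB sub-page size, the 4-byte element size, and the subcolor_array of sub-array base addresses are assumptions for the example:

```python
SUB_PAGE = 4096          # assumed sub-page size in bytes
ELEM_SIZE = 4            # assumed element size (e.g., a 32-bit int)
ELEMS_PER_SUB = SUB_PAGE // ELEM_SIZE   # 1024 elements per sub-array

def access(subcolor_array, index):
    """Translate a flat array index into an address, via the
    dereferencing table: pick the sub-array, then add the byte offset."""
    sub = index // ELEMS_PER_SUB                 # which sub-array
    offset = (index % ELEMS_PER_SUB) * ELEM_SIZE # offset within it
    return subcolor_array[sub] + offset          # base address + offset
```

Because the sub-array bases in the table may point to non-contiguous sub-pages, the flat index remains valid even after the loader scatters the sub-arrays across a cluster of colors.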
A new programming language keyword _colored may be introduced to force heap-allocated data structures (whose size may be unknown at compilation time) to be accessed as described herein. For example, for an array of integers, _colored int* a = malloc(4096*sizeof(int)), which is implementable in the C/C++ programming language. The keyword may be implemented accordingly for each programming language.
Returning now to act 504 of FIG. 5, a program stack is accessed and/or managed by incrementing the program stack in divided blocks, each having a size smaller than or equal to one virtual memory sub-page. The code outputted by the compiler may be modified for accessing and/or managing the stack. A new program stack frame that updates a program stack pointer that points to each divided block is added. The new program stack frame is added by adding an offset according to the size of the virtual memory sub-pages storing the program stack during runtime and the clusters of physical memory blocks allocated to the application associated with the data storage structure. The stack may be allocated to a certain set of page colors.
An exemplary, not necessarily limiting, implementation based on the program stack described herein is now provided. When the application code calls a new function, the caller function checks the stack size. Since the argument sizes are already known at compile time, the caller code may decide to insert the new stack frame described herein after calculating the new stack position. Then, at the new location, the arguments for the called function are laid out. At that point the caller code may pass the execution to the needed function by updating the stack pointer.
When the called function returns, the called function code saves the return value for the caller. Eventually, caller-saved registers are restored and, when unwinding to the previous stack frame, the called function code notices the proposed additional stack frame. Because of the new stack frame, the returning function adjusts the stack frame pointer before giving back control to the calling function.
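The caller-side boundary check described above may be sketched as follows. This is an illustrative simulation only: the stack grows upward for simplicity, and the 4 kB block size and the free_blocks list of pre-allocated colored stack blocks are assumptions:

```python
SUB_PAGE = 4096   # assumed sub-page (stack block) size in bytes

def push_frame(stack_top, frame_size, free_blocks):
    """If the new frame would cross the current sub-page boundary, start
    it at the base of the next allocated stack block instead (mimicking
    the inserted stack frame that updates the stack pointer).
    Returns (new_stack_top, relocated_flag)."""
    block_base = stack_top - (stack_top % SUB_PAGE)
    if stack_top + frame_size > block_base + SUB_PAGE:
        new_base = free_blocks.pop(0)   # next block of the app's colors
        return new_base + frame_size, True
    return stack_top + frame_size, False
```

A frame that fits stays in the current block; a frame that would span the boundary is placed at the base of a fresh block, which need not be contiguous with the previous one.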
At 506, the blocks are compiled into executable binary blocks.
Functions (e.g., .text section) divided into blocks may be compiled with one .text section per function, which may enable quick re-coloring. A table storing the relocation data may be created for future re-coloring.
At 508, the executable binary blocks are linked into a program. A designation of the executable binary blocks may be included for loading of the program by supervisor software into allocated virtual memory page(s), by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page(s) and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size. The designation may be stored, for example, as metadata within the program, by a specialized data structure external to the program (e.g., a table indicating whether the program is associated with the designation), and/or as a value in a field stored by the program indicative of the designation.
At 510, the program is provided for execution. The program may be, for example, locally stored in a data storage device, and/or transmitted to another computing device (e.g., a client terminal that provided the pre-compilation code, and/or another client terminal).
Reference is now made to FIG. 10, which is a schematic depicting additional exemplary components of compiler 412A and linker 412B (as described with reference to FIG. 4) for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. The components may represent a modification of the traditional compilation and/or traditional static and/or dynamic linking process for each of the main application parts.
Additional and/or modified components of compiler 412A include:
* Functional outliner 1002 for dividing a function (e.g., of a .text section) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, as described herein.
* Scattered data structure support 1004 for dividing the data storage structure into sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled, as described herein.
* Stack support 1006 for accessing and/or managing a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page, as described herein.
* Defaults 1008 adds new defaults to the compiler, such as the default compilation methods that may include or exclude the support for page coloring in huge pages.
Additional and/or modified components of linker 412B include:
* Function/data packing in predefined page sizes (e.g., 4 kB) 1010 for arranging functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled, as described herein.
* Relocations and dereferencing tables 1012 for creating a dereferencing data structure for accessing each element of each sub-data storage structure and/or relocating a binary block in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page, as described herein.
* Loader hooks 1014 creates additional handles for the loader to help functionalities such as re-coloring and/or runtime coloring.
* Metadata generation 1016 for including a designation of divided executable binary blocks for appropriate loading of the program by supervisor software.
Reference is now made to FIG. 6, which is a flowchart of a method for execution of the program within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
At 602, instructions to load an application for execution are received. For example, a user clicks on an icon associated with the application, and/or another process triggers loading of the application.
At 604, a binary file of the application divided into blocks is identified, for example, based on an analysis of the designation associated with the application (as described with reference to act 508 of FIG. 5).
The size of each block of the divided application is less than or equal to a size of a virtual memory sub-page.
At 606, an initial allocation of clusters of physical memory blocks is received. Each physical memory block is of a size corresponding to a virtual memory sub-page size allocated for the application.
At 608, an allocation of virtual memory page(s) for the application is received. The size of the virtual memory page(s) is mapped to an equal size of contiguous physical memory area(s). The virtual memory page(s) include virtual memory sub-pages mapped to the clusters of physical memory blocks. Each physical memory block has a size corresponding to the size of a virtual memory sub-page.
At load time, the binary loader may allocate a virtual memory page (e.g., huge page) for the .text. The binary loader issues a request to the supervisor code for the allocated color(s). For a user-space loader, the loader may be placed at any virtual memory sub-page of any color, since the user-space loader is executed once during initialization. After allocation, the .text code may be stored in a virtual memory page (e.g., huge page), thus preserving the coloring. The code may be re-linked, including symbols. The loader may be modified to perform a runtime re-linking based on the selected colors during a re-coloring phase.
The application loader may implement a memory allocator supporting page coloring.
Page colors allocated to the application may be dynamically updated at run-time. The application address space may be dynamically updated to allocate additional virtual memory pages (e.g., huge pages) to the application.
At 610, the blocks of the binary file of the application are loaded into the allocated virtual memory page(s). The blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
Each application is loaded with a limited number of the allocated page colors. Different applications are assigned different colors selected from all the available colors, to enable multiple applications to be loaded simultaneously.
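The assignment of disjoint color subsets to concurrently loaded applications may be sketched as follows. This is an illustrative simulation; the application names and the per-application color count are arbitrary assumptions:

```python
def assign_colors(available_colors, per_app, app_names):
    """Give each application a disjoint subset of the available page
    colors, so that several applications can be loaded simultaneously
    without sharing cache-conflicting sub-pages."""
    pool = list(available_colors)
    assignment = {}
    for app in app_names:
        if len(pool) < per_app:
            raise RuntimeError("not enough free colors")
        assignment[app], pool = pool[:per_app], pool[per_app:]
    return assignment
```

Because the subsets are disjoint, the physical memory blocks backing one application's sub-pages never belong to another application's clusters.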
Optionally, when a data structure is divided into multiple sub-data storage structures (as described with reference to act 504 of FIG. 5), the dereferencing data structure is populated with data for accessing each element of the sub-data storage structures of the data storage structure. The loader may populate the dereferencing table based on the application's assigned colors. The dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to the virtual memory sub-page size allocated to the application associated with the data storage structure. Each sub-data storage structure may be placed on page boundaries.
Optionally, a program stack is grown in blocks, each block having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime. The program stack is grown according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset. The offset is computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to the virtual memory sub-page size allocated to the program stack.
Optionally, the application includes sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page. The sub-functions are stored at respective virtual memory sub-pages mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size. The location of each sub-function is stored in a mapping data structure for runtime execution of the function.
At 612, the application is executed. Re-coloring of the application may be performed at runtime.
Optionally, one or more of the binary blocks are dynamically moved from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster. The mapping between virtual memory sub-pages and clusters of physical memory blocks is updated according to the dynamic move.
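The dynamic move and the mapping update may be sketched as follows. This is illustrative only; blocks and sub-pages are modeled as dictionary entries rather than real memory:

```python
def move_block(phys_blocks, mapping, block_id, dst_subpage):
    """Relocate one binary block at runtime (re-coloring): copy its bytes
    from the physical block behind the old sub-page to the one behind the
    new sub-page, then update the block -> sub-page mapping."""
    src_subpage = mapping[block_id]
    phys_blocks[dst_subpage] = phys_blocks.pop(src_subpage)
    mapping[block_id] = dst_subpage
```

Since each block is a continuous, relocatable segment, no fix-up inside the block itself is needed; only the mapping (and, in the full scheme, the runtime jump tables) must be updated.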
Runtime relocation of the dereferencing data structure may be performed when no pointers to the actual elements of the data structure are stored by the code. For example, the code is prevented from saving pointers to the data structure elements. Access to the data structure elements is provided via indexes.
Reference is now made to FIG. 11, which is a schematic depicting additional exemplary components of a runtime 1102 and/or operating system 1104 and/or memory management 1106 for loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
Additional and/or modified components of runtime 1102 include:
* Load-time and run-time symbol relocation 1108 for dynamically moving binary block(s) from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster and updating a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the move.
* Huge-page support 1110 for identifying a binary file of the application divided into blocks, as described herein.
Additional and/or modified components of executable binary loader 1112 of operating system 1104 include:
* New executable with compiler coloring binary loader 1114 for loading the blocks of the binary file of the application into the allocated virtual memory page(s).
Additional and/or modified components of memory management 1106 include:
* Coloring allocator 1116 that performs an allocation of virtual memory page(s) for the application according to clusters, as described herein.
Reference is now made to FIG. 12, which is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention. It is noted that the method is not necessarily limiting.
The division is performed for a source program 202 by a compiler 204 to create an object code program 206, as follows:
At 221, a parser unit parses source program 202, for example, according to common compiler practice.
At 222, an intermediate code conversion unit performs intermediate code conversion, for example, according to common compiler practice.
At 223, an optimization unit performs optimization of the intermediate code, for example, according to common compiler practice.
At 224, a code generation unit generates code, for example, according to common compiler practice.
At 225, functions larger than the size of one virtual memory sub-page (e.g., 4 kB) are divided into sub-functions that are each smaller than the size of one virtual memory sub-page.
LLVM and GCC are exemplary compiler frameworks that are production quality and commonly used in software development. LLVM and GCC implement function outlining. An example implementation of outlining is in the framework called OpenMP. An example of code that can be outlined is a loop.
At 226, the compilation outputs an object file 206 with one section per sub-function and associated relocatable code (i.e., relocations). Relocation symbols may be defined in the .reloc section. The jump tables define how the blocks which are loaded into non-contiguous memory areas are linked to one another. The sectioning unit helps the compiler divide code and/or data objects in the size of a sub-page.
At 227, a packing unit of linker 208 (and/or a pre-linker tool) packs code functions from object code program 206 into units of at most the size of one virtual memory sub-page (e.g., 4 kB). The information about functions may be maintained or discarded. The packing creates the order in which functions are placed by the supervisor software within a cluster of virtual memory sub-pages. Padding may be applied to avoid a function spanning across multiple virtual memory sub-pages (e.g., over 4 kB).
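The packing-with-padding step may be sketched as follows. This is illustrative only; the 4 kB sub-page size is an assumption, and a real packing unit may also reorder functions, which this sketch does not:

```python
SUB_PAGE = 4096   # assumed sub-page size in bytes

def pack_functions(func_sizes):
    """Place functions back to back inside sub-pages, padding the tail of
    a sub-page whenever the next function would span a boundary.
    Returns a (sub_page_index, offset) placement per function."""
    placements, page, used = [], 0, 0
    for size in func_sizes:
        assert size <= SUB_PAGE, "functions were already outlined to fit"
        if used + size > SUB_PAGE:   # pad the remainder of this sub-page
            page, used = page + 1, 0
        placements.append((page, used))
        used += size
    return placements
```

The invariant that no function crosses a sub-page boundary is what later lets the loader move each sub-page-sized block independently to any sub-page of the assigned colors.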
At 228, a jump table generation unit computes the jump table.
At 229, a linking generation unit performs linking according to standard linker practice; the linking generation unit assumes one single continuous .text section, since each program block is continuous in the virtual address space without overlapping one another, as defined by the jump table.
At 230, a relocation and symbol generation unit saves the relocation information in the executable binary 212, along with symbol information.
At 231, an additional metadata generation unit adds a tag to the executable binary 212, which acts as an indication to the program loader that executable binary 212 has been compiled to support page coloring with huge pages, and is therefore amenable to load-time block relocation.
Reference is now made to FIG. 13, which is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. It is noted that the .text section is described as one example, with the operation principles of the method applicable to other executable binary sections.
Machine code 212 is received by supervisor software 214. Machine code 212 is created based on the method described with reference to FIG. 12.
Executable binary loader 216 of supervisor software 214 may exist as part of the operating system (OS), and/or be loaded by the OS in the same address space of the application. The implementation depicted herein (which is not necessarily limiting) is based on executable binary loader 216 implemented within the OS.
The executable binary loader 216 performs the following:
At 239, the header parsing unit reads the set of headers that describe the executable binary file and parses the content of the description, in accordance with standard supervisor software practice and/or executable binary loader practice.
At 238, the binary file is checked for the tag that indicates that the binary has been compiled for page coloring with (optionally huge) virtual memory pages (e.g., the tag is created by the compiler to distinguish the type of compilation, for example, as described with reference to act 231 of FIG. 12). It is noted that the tag is an exemplary implementation and not necessarily limiting. The binary file may further be checked to verify that no code function is larger than the size of one virtual memory sub-page (e.g., 4 kB). The binary file may further be checked to verify that the relocation symbols are available in the executable.
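The load-time checks of this act may be sketched as follows. This is illustrative; the metadata dictionary and its keys are invented names standing in for the executable's headers and tag:

```python
SUB_PAGE = 4096   # assumed sub-page size in bytes

def check_colored_binary(metadata):
    """Loader-side checks sketched from the text: the coloring tag must
    be present, no function may exceed one sub-page, and relocation
    symbols must be available in the executable."""
    return (bool(metadata.get("colored_tag"))
            and all(s <= SUB_PAGE for s in metadata.get("function_sizes", []))
            and bool(metadata.get("relocations")))
```

A binary failing any of these checks would be loaded by the conventional path instead of the coloring-aware one.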
At 237, the generate page color allocation unit determines the colors that are to be assigned to the .text section.
At 236, based on the total size of the .text section and the number of assigned colors, the page/huge page memory allocation unit allocates a certain number of virtual memory pages (e.g., huge pages) for the binary and loads the entire .text section at the beginning of the allocated memory.
At 235, the function/data relocation unit moves each .text section block (of size of one virtual memory sub-page or less, e.g., 4 kB) to a virtual memory page which respects the assigned coloring, saving the offset for each page.
At 234, a scheduler unit schedules execution of the application, according to common supervisor software practice.
At 233, a runtime binary loader 218 of program 220 performs symbol relocations according to common runtime binary loader practice.
At 232, runtime binary loader 218 uses the relocation information to pass through the entire .text section to change function pointers at runtime by generating runtime jump tables. When the start address of the program is changed due to coloring, the start address is updated.
At 222, the control is passed to the application, which begins running.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant compilers, linkers, and operating systems will be developed and the scope of the terms compiler, linker, and operating system is intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of". The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. An apparatus (402) for compiling code for runtime execution within at least one virtual memory sub-page of at least one virtual memory page, the apparatus comprising:
a compiler (412A) executable by a processor (406), the compiler (412A) configured to: receive pre-compilation code for compilation,
wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, wherein the size of each physical memory block is the size of a virtual memory sub-page;
divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block;
compile the plurality of blocks into the plurality of executable binary blocks; and link (412B) the plurality of executable binary blocks into a program and include a designation of the plurality of executable binary blocks for loading of the program by supervisor software into an allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and allocated plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
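By way of a non-limiting illustration of the block-sizing constraint in claim 1, the following C sketch assumes a hypothetical 512-byte sub-page (one physical memory block) and shows the arithmetic for deciding how many sub-page-sized blocks a compiled object requires and whether a given binary block is admissible; the sizes and names are illustrative, not taken from the claims:

```c
#include <stddef.h>

/* Hypothetical geometry: a 4 KiB virtual page split into 512-byte
 * sub-pages, each backed by exactly one physical memory block. */
#define PAGE_SIZE    4096u
#define SUBPAGE_SIZE  512u

/* Number of sub-page-sized blocks needed for `nbytes` of compiled
 * output, rounded up. */
static unsigned blocks_needed(size_t nbytes) {
    return (unsigned)((nbytes + SUBPAGE_SIZE - 1) / SUBPAGE_SIZE);
}

/* A compiled block may be loaded only if it fits one sub-page. */
static int block_fits(size_t block_bytes) {
    return block_bytes <= SUBPAGE_SIZE;
}
```

A compiler dividing pre-compilation code as in claim 1 would keep splitting until `block_fits` holds for every emitted block.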
2. The apparatus (402) according to claim 1, wherein the compiler (412A) is further configured to divide a function of a .text section (702) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by supervisor software (412C) within a cluster (710) of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
3. The apparatus (402) according to claim 1, wherein the compiler (412A) is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
4. The apparatus (402) according to any of the previous claims, wherein the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler (412A) is further configured to divide the data storage structure into a plurality of sub-data storage structures (804) each smaller than the size of one virtual memory sub-page (808) when compiled.
5. The apparatus (402) according to claim 4, wherein the compiler (412A) is further configured to create a dereferencing data structure (802) for accessing each element of each sub-data storage structure (804), wherein the dereferencing data structure (802) adds an offset according to the size of the virtual memory sub-pages (808) of the virtual memory page (806) storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
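The dereferencing data structure of claim 5 can be illustrated (all names and sizes below are hypothetical, not drawn from the disclosure) as a table of per-sub-page base pointers: logical element i is reached by selecting sub-array i / ELEMS_PER_SUBPAGE and adding the remaining in-block offset:

```c
#include <stddef.h>

/* Illustrative sub-page of 512 bytes holding ints. */
#define SUBPAGE_SIZE 512u
#define ELEMS_PER_SUBPAGE (SUBPAGE_SIZE / sizeof(int))   /* 128 on LP64 */

typedef struct {
    int *subpage_base[8];   /* one base pointer per allocated block */
} deref_table_t;

/* Translate a logical index into an address inside the right block:
 * pick the sub-array, then add the offset within it. */
static int *elem_addr(deref_table_t *t, size_t i) {
    return t->subpage_base[i / ELEMS_PER_SUBPAGE] + (i % ELEMS_PER_SUBPAGE);
}

/* Self-check: two backing arrays stand in for two physical blocks. */
static int deref_demo(void) {
    static int a[ELEMS_PER_SUBPAGE], b[ELEMS_PER_SUBPAGE];
    deref_table_t t = { { a, b } };
    *elem_addr(&t, 0) = 7;                    /* lands in a[0] */
    *elem_addr(&t, ELEMS_PER_SUBPAGE) = 9;    /* lands in b[0] */
    return a[0] == 7 && b[0] == 9;
}
```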
6. The apparatus (402) according to any of the previous claims, wherein the compiler (412A) is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
7. The apparatus (402) according to claim 6, wherein the compiler (412A) is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
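The sub-page-granular stack management of claims 6 and 7 can be sketched as follows. For simplicity the sketch uses an upward-growing bump pointer (real stacks typically grow downward), and the 512-byte sub-page size is an assumption: a new frame that would straddle a sub-page boundary is pushed forward by an offset so it starts in the next sub-page block.

```c
#include <stddef.h>

#define SUBPAGE_SIZE 512u

/* Advance the stack pointer by one frame, never letting a frame
 * straddle a sub-page boundary: if it would cross, first add the
 * offset to the next sub-page (the offset claim 7 describes). */
static size_t push_frame(size_t sp, size_t frame_bytes) {
    size_t used = sp % SUBPAGE_SIZE;
    if (used + frame_bytes > SUBPAGE_SIZE)   /* frame would straddle */
        sp += SUBPAGE_SIZE - used;           /* skip to next sub-page */
    return sp + frame_bytes;
}
```

For instance, a 100-byte frame pushed at offset 500 of a sub-page is relocated to start at the next 512-byte boundary, so each frame stays within one physical memory block.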
8. The apparatus (402) according to any of the previous claims, wherein the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor (406).
9. The apparatus (402) according to any of the previous claims, wherein each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
10. An apparatus (402) for loading code for execution within at least one virtual memory sub-page of at least one virtual memory page, the apparatus (402) comprising:
a processor (406);
a memory (412) storing code instructions (412C) for execution by the processor (406), comprising:
code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page,
code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size,
code to load the plurality of blocks of the binary file of the application into the allocated at least one virtual memory page, wherein the plurality of blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated plurality of clusters of physical memory blocks.
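The loading step of claim 10 — placing each binary block into the physical block that the sub-page mapping assigns to it — might look like the sketch below; the fixed-size map, the 512-byte block size, and all identifiers are illustrative:

```c
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE 512u
#define NBLOCKS 4u

typedef struct {
    unsigned char *phys_block[NBLOCKS]; /* allocated clusters, one per sub-page */
} subpage_map_t;

/* Copy block i of the binary into the physical block mapped to
 * virtual memory sub-page i; the last block may be short. */
static void load_blocks(subpage_map_t *map,
                        const unsigned char *binary, size_t nbytes) {
    for (unsigned i = 0; (size_t)i * SUBPAGE_SIZE < nbytes && i < NBLOCKS; i++) {
        size_t off = (size_t)i * SUBPAGE_SIZE;
        size_t n = nbytes - off < SUBPAGE_SIZE ? nbytes - off : SUBPAGE_SIZE;
        memcpy(map->phys_block[i], binary + off, n);
    }
}

/* Self-check: a 600-byte "binary" spans two sub-page blocks. */
static int demo_load(void) {
    static unsigned char b0[512], b1[512];
    static unsigned char bin[600];
    subpage_map_t m = { { b0, b1 } };
    bin[0] = 0xAA;              /* first byte, first block  */
    bin[599] = 0xBB;            /* last byte, second block  */
    load_blocks(&m, bin, sizeof bin);
    return b0[0] == 0xAA && b1[599 - 512] == 0xBB;
}
```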
11. The apparatus (402) according to claim 10, further comprising code to dynamically move at least one of the plurality of blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move.
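The dynamic move of claim 11 amounts to copying one whole block and retargeting one entry of the virtual-sub-page-to-physical-block map. A minimal sketch with a hypothetical four-entry layout (all structure names are illustrative):

```c
#include <string.h>

#define SUBPAGE_SIZE 512u

typedef struct {
    unsigned char *phys_block[4];  /* physical memory blocks */
    unsigned vmap[4];              /* virtual sub-page -> physical block index */
} layout_t;

/* Move virtual sub-page `vsub` to physical block `new_phys`, then
 * update the mapping so the virtual address stays valid. */
static void move_subpage(layout_t *l, unsigned vsub, unsigned new_phys) {
    unsigned old = l->vmap[vsub];
    memcpy(l->phys_block[new_phys], l->phys_block[old], SUBPAGE_SIZE);
    l->vmap[vsub] = new_phys;   /* remap: same virtual sub-page, new block */
}

/* Self-check: content written before the move is visible after it. */
static int demo_move(void) {
    static unsigned char p0[512], p1[512];
    layout_t l = { { p0, p1 }, { 0 } };
    p0[0] = 0x5A;
    move_subpage(&l, 0, 1);
    return l.vmap[0] == 1 && p1[0] == 0x5A;
}
```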
12. The apparatus (402) according to any of claims 10-11, wherein the apparatus (402) further comprises code to populate data of a dereferencing data structure (802) for accessing each element of sub-data storage structures (804) of a data storage structure, wherein the dereferencing data structure (802) adds an offset according to the size of the virtual memory sub-pages (808) of the virtual memory page (806) storing the sub-data structures (804) of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
13. The apparatus (402) according to any of claims 10-12, wherein the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
14. The apparatus (402) according to any of claims 10-13, wherein the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function (702) that is larger than the size of one virtual memory sub-page, at respective virtual memory sub-page mapped to a cluster (710) of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
15. The apparatus (402) according to any of the previous claims, wherein in an implementation of the processor (406) lacking a paging mechanism (416) the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that makes up the size of a virtual memory page.
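In the paging-free configuration of claim 15, a virtual page is simply a run of contiguous physical blocks, so a sub-page address resolves by plain arithmetic rather than a page-table walk. A one-line sketch under an assumed 512-byte sub-page:

```c
#include <stddef.h>

#define SUBPAGE_SIZE 512u

/* Without a paging mechanism, sub-page i of a page whose physical
 * base is `page_phys_base` starts at a fixed offset from that base. */
static size_t subpage_phys_addr(size_t page_phys_base, unsigned subpage_idx) {
    return page_phys_base + (size_t)subpage_idx * SUBPAGE_SIZE;
}
```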
PCT/EP2017/081116 2017-12-01 2017-12-01 Systems for compiling and executing code within one or more virtual memory pages WO2019105565A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/081116 WO2019105565A1 (en) 2017-12-01 2017-12-01 Systems for compiling and executing code within one or more virtual memory pages
CN201780096871.XA CN111344667B (en) 2017-12-01 2017-12-01 System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages


Publications (1)

Publication Number Publication Date
WO2019105565A1 2019-06-06

Family

ID=60569915


Country Status (2)

Country Link
CN (1) CN111344667B (en)
WO (1) WO2019105565A1 (en)


Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116382785B (en) * 2023-06-01 2023-09-12 紫光同芯微电子有限公司 Method and device for data processing, computing equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283125A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Dynamic selection of memory virtualization techniques
US20080184195A1 (en) * 2007-01-26 2008-07-31 Oracle International Corporation Code generation in the presence of paged memory

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102122268B (en) * 2010-01-07 2013-01-23 华为技术有限公司 Virtual machine memory allocation access method, device and system
CN103902459B (en) * 2012-12-25 2017-07-28 华为技术有限公司 Determine the method and relevant device of shared virtual memory page management pattern
CN104516826B (en) * 2013-09-30 2017-11-17 华为技术有限公司 The corresponding method and device of a kind of virtual big page and the big page of physics
CN105740042B (en) * 2016-01-15 2019-07-02 北京京东尚科信息技术有限公司 The management method and management system of virutal machine memory


Cited By (4)

Publication number Priority date Publication date Assignee Title
CN113821272A (en) * 2021-09-23 2021-12-21 武汉深之度科技有限公司 Application program running method, computing device and storage medium
CN113821272B (en) * 2021-09-23 2023-09-12 武汉深之度科技有限公司 Application program running method, computing device and storage medium
CN116560667A (en) * 2023-07-11 2023-08-08 安元科技股份有限公司 Splitting scheduling system and method based on precompiled delay execution
CN116560667B (en) * 2023-07-11 2023-10-13 安元科技股份有限公司 Splitting scheduling system and method based on precompiled delay execution

Also Published As

Publication number Publication date
CN111344667A (en) 2020-06-26
CN111344667B (en) 2021-10-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17808448; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17808448; Country of ref document: EP; Kind code of ref document: A1)