CN111344667B - System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages - Google Patents


Info

Publication number
CN111344667B
CN111344667B (application number CN201780096871.XA)
Authority
CN
China
Prior art keywords
page
virtual memory
size
sub
blocks
Prior art date
Legal status
Active
Application number
CN201780096871.XA
Other languages
Chinese (zh)
Other versions
CN111344667A (en)
Inventor
安东尼奥·巴巴拉斯
陈熠
亚尼·科科宁
亚历山大·斯皮里达基斯
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN111344667A
Application granted
Publication of CN111344667B

Classifications

    • G06F8/41 — Transformation of program code; Compilation
    • G06F12/023 — User address space allocation; Free address space management
    • G06F12/109 — Address translation for multiple virtual address spaces, e.g. segmentation
    • G06F12/1036 — Address translation using associative or pseudo-associative means, e.g. translation look-aside buffer [TLB], for multiple virtual address spaces
    • G06F2212/1008 — Correctness of operation, e.g. memory ordering
    • G06F2212/1016 — Performance improvement
    • G06F2212/652 — Page size control
    • G06F2212/653 — Page colouring
    • G06F2212/657 — Virtual address space management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

A compiler is provided for: receiving pre-compiled code for compilation, the pre-compiled code being at least the size of one virtual memory sub-page when compiled and loaded into memory, the virtual memory sub-page corresponding to one physical memory block mapped to a virtual memory page; dividing the pre-compiled code into blocks such that, when compiled into corresponding executable binary blocks, each executable binary block is less than or equal to the size of a virtual memory sub-page; compiling the blocks into executable binary blocks; and linking the executable binary blocks into a program that includes a designation for hypervisor software to load the program into allocated virtual memory pages, the executable binary blocks being loaded into physical memory blocks according to a mapping between the virtual memory sub-pages and clusters of allocated physical memory blocks, wherein the size of each physical memory block corresponds to the virtual memory sub-page size.

Description

System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages
Background
The present invention, in some embodiments thereof, relates to virtual memory management and, more particularly, but not by way of limitation, to systems and methods for clustering sub-pages of a virtual memory page.
In multi-processor/multi-core processor frameworks with a large number of cores, and/or software that permits multiple logical execution units (tasks) to execute together, it is increasingly important to share access to memory resources among execution entities for performance and energy-efficiency reasons. Memory resources include, for example, the processor caches, comprising one or more of the levels L1, L2, L3, and L4 (e.g., L1 only, L1–L2, L1–L3, or L1–L4), the highest level being called the Last Level Cache (LLC); the processor memory bus/ring that interconnects the LLC groups/clusters; and the memory controllers and their (parallel) interconnects to parallel memory elements (banks).
To divide the use of memory resources among different execution entities, different techniques have been introduced, including page coloring, a pure software technique that requires the implementation of virtual memory. For cache partitioning, page coloring requires at least a physically indexed and physically tagged LLC. For memory bandwidth partitioning, page coloring may require software configuration of memory interleaving.
Disclosure of Invention
It is an object of the present invention to provide an apparatus, system, method and/or code instructions for compiling code for execution at runtime within a virtual memory sub-page of a virtual memory page and/or for loading code for execution within a virtual memory sub-page of a virtual memory page.
The above and other objects are achieved by the features of the independent claims. Further embodiments are evident from the dependent claims, the description and the drawings.
According to a first aspect, an apparatus for compiling code for execution at runtime within a plurality of virtual memory sub-pages of at least one virtual memory page comprises a compiler executable by a processor, the compiler configured to: receive pre-compiled code for compilation, wherein the size of the pre-compiled code, when compiled and loaded into memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks mapped to a virtual memory page, and the size of each physical memory block is the size of a virtual memory sub-page; partition the pre-compiled code into a plurality of blocks such that each block, when compiled into a corresponding executable binary block of a plurality of executable binary blocks, is less than or equal to the size of one virtual memory sub-page, the size of the one virtual memory sub-page corresponding to the size of one physical memory block; compile the plurality of blocks into the plurality of executable binary blocks; and link the plurality of executable binary blocks into a program, including a designation of the plurality of executable binary blocks for loading the program, by hypervisor software, into the allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and clusters of allocated physical memory blocks, wherein the size of each physical memory block corresponds to the size of a virtual memory sub-page.
According to a second aspect, an apparatus for loading code for execution within a plurality of virtual memory sub-pages of at least one virtual memory page comprises a processor and a memory storing code instructions executable by the processor, including: code for identifying a binary file of an application program divided into a plurality of blocks, wherein the size of each block is less than or equal to the size of a virtual memory sub-page; code for retrieving an initial allocation of a plurality of clusters of physical memory blocks for the application program, the size of each physical memory block corresponding to the virtual memory sub-page size; code for receiving an allocation of at least one virtual memory page for the application program, wherein the at least one virtual memory page maps to a contiguous physical memory region of the same size, and wherein a virtual memory page includes a plurality of virtual memory sub-pages that map to the plurality of clusters of physical memory blocks, the size of each physical memory block corresponding to the size of a virtual memory sub-page; and code for loading the plurality of blocks of the binary file of the application program into the allocated at least one virtual memory page by loading the plurality of blocks into the physical memory region according to the mapping between the virtual memory sub-pages and the plurality of clusters of allocated physical memory blocks.
Systems, apparatus, methods, and/or code instructions described herein extend page coloring (i.e., clustering) to very large virtual memory pages.
Implementations of the systems, apparatus, methods, and/or code instructions described herein are transparent to executing code (e.g., programs, applications).
Implementation of the systems, apparatus, methods, and/or code instructions described herein is based on system software.
Existing (e.g., legacy) code (e.g., programs, applications) may be recompiled to take advantage of embodiments of the systems, apparatus, methods, and/or code instructions described herein. No new programs need to be designed to implement the systems, apparatus, methods, and/or code instructions described herein.
The systems, apparatus, methods, and/or code instructions described herein provide software-based solutions based on modifications to the system software (e.g., operating system code, runtime code, compiler code, and/or linker code). The software-based solution does not require any modification of the processing hardware and/or addition of new processing hardware and may be performed by existing processing hardware, for example, in comparison to other proposed solutions based on at least some modifications of the processing hardware and/or new hardware components.
The systems, apparatus, methods, and/or code instructions described herein are software-based and do not require modified and/or new hardware; they therefore provide scalability, in contrast to other attempts based on new and/or modified hardware, whose scalability is limited by the hardware itself. The scalability problem can be further explained as follows. Commodity processor Instruction Set Architectures (ISAs) provide only a limited/fixed number of ways in the Last Level Cache (LLC), regardless of the number of cores available on the multi-core processor, which raises scalability issues. (Note that other approaches rely on hardware extensions to partition the cache among different physical or logical execution units; one example is way partitioning.)
In a further embodiment of the first aspect, the compiler is further configured to divide a function of a text segment of the pre-compiled code into a plurality of sub-functions, wherein the size of the function, when compiled into executable code, is larger than the size of one virtual memory sub-page, and the size of each sub-function, when the plurality of sub-functions are compiled into executable code, is less than or equal to the size of one virtual memory sub-page, wherein the executable binary blocks of the divided function are placed by hypervisor software in a cluster of virtual memory sub-pages of a virtual memory page, the virtual memory sub-pages being mapped to a corresponding cluster of physical memory blocks, the size of each physical memory block corresponding to the size of a virtual memory sub-page.
In a further embodiment of the first aspect, the compiler is further configured to arrange a plurality of functions, each smaller than the size of one virtual memory sub-page when compiled, so that together they fit completely within a single virtual memory sub-page.
In a further embodiment of the first aspect, the pre-compiled code includes a data storage structure that is larger than the size of one virtual memory sub-page when compiled, and the compiler is further configured to partition the data storage structure into a plurality of sub-data storage structures, each smaller than the size of one virtual memory sub-page when compiled.
In a further embodiment of the first aspect, the compiler is further configured to create a dereference data structure for accessing each element of each sub-data storage structure, wherein the dereference data structure adds an offset computed at runtime from the size of the virtual memory sub-pages of the virtual memory page storing the data structure and from the cluster of physical memory blocks allocated to the application associated with the data storage structure, the size of each physical memory block corresponding to the virtual memory sub-page size.
In a further embodiment of the first aspect, the compiler is further configured to access and manage a program stack by incrementing the program stack in blocks, each of the blocks having a size less than or equal to a size of a virtual memory sub-page.
In a further implementation of the first aspect, the compiler is further configured to add a new program stack frame, wherein adding the new program stack frame updates the program stack pointer to the corresponding partition by adding an offset computed at runtime from the size of the virtual memory sub-pages of the virtual memory page storing the program stack and from the cluster of physical memory blocks allocated to the associated application.
In a further embodiment of the first aspect, the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor.
In a further embodiment of the first aspect, each binary block of the plurality of binary blocks is collectively repositionable as a contiguous segment of code from one virtual memory sub-page to another virtual memory sub-page.
In a further embodiment of the second aspect, the apparatus further comprises code for dynamically moving at least one of the plurality of blocks from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster, and for updating the mapping between the virtual memory sub-pages and the clusters of physical memory blocks according to the dynamic movement.
In a further embodiment of the second aspect, code is provided for populating a dereference data structure for accessing each element of the sub-data storage structures of a data storage structure, wherein the dereference data structure adds an offset computed at runtime from the size of the virtual memory sub-pages of the virtual memory page storing the sub-data storage structures and from the cluster of physical memory blocks allocated to the application associated with the data storage structure, the size of each physical memory block corresponding to the virtual memory sub-page size.
In a further implementation of the second aspect, the application program includes compiled code for expanding the program stack in blocks as new program stack frames are added, the size of each block being less than or equal to the size of one virtual memory sub-page of the virtual memory page in which the program and the program stack are stored at runtime. Adding a new program stack frame updates the program stack pointer to point to the corresponding program stack block at an offset computed at runtime from the size of the virtual memory sub-pages storing the program stack and from the cluster of physical memory blocks allocated to the program stack, the size of each physical memory block corresponding to the virtual memory sub-page size.
In a further implementation of the second aspect, the application program includes compiled code for storing a plurality of sub-functions of a function in respective virtual memory sub-pages mapped to a cluster of physical memory blocks, the size of the function being larger than the size of one virtual memory sub-page, the size of each sub-function being less than or equal to the size of one virtual memory sub-page, and the size of each physical memory block corresponding to the virtual memory sub-page size; and for storing the location of each of the plurality of sub-functions in a mapping data structure used when executing the function at runtime.
In a further embodiment of the second aspect, for processors lacking a paging mechanism, the at least one virtual memory sub-page is a segment of virtual memory mapped to one physical memory block that is one of a plurality of contiguous physical memory blocks making up a virtual memory page.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the present invention, exemplary methods and/or materials are described below. In case of conflict, the present patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be necessarily limiting.
Drawings
Some embodiments of the invention are described herein, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of the embodiments of the present invention. Thus, it will be apparent to one skilled in the art from the description of the figures how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a schematic diagram depicting how page colors are arranged in a physical address space to help understand the technical problem addressed by some embodiments of the present invention;
FIG. 2 is a schematic diagram depicting an application using three different colored virtual pages to help understand the technical problem addressed by some embodiments of the present invention;
FIG. 3 is a schematic diagram depicting an example of an application program using virtual memory paging with at least one super virtual memory page, in accordance with some embodiments of the invention;
FIG. 4 is a schematic diagram of a block diagram of a system including a computing device to compile code for execution at runtime within a virtual memory sub-page and/or to load code for execution within a virtual memory sub-page in accordance with some embodiments of the invention;
FIG. 5 is a flow diagram of a method of compiling code for execution at runtime within a virtual memory sub-page of a virtual memory page in accordance with some embodiments of the invention;
FIG. 6 is a flow diagram of a method of loading code for execution within a virtual memory sub-page of a virtual memory page in accordance with some embodiments of the invention;
FIG. 7 is a diagram depicting a division of an example text segment into a plurality of sub-functions, according to some embodiments of the invention;
FIG. 8 is a schematic diagram depicting a de-reference table for accessing each element of a sub-array obtained by partitioning the array, in accordance with some embodiments of the present invention;
FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by a compiler to allow data access to one element of each sub data storage structure, according to some embodiments of the invention;
FIG. 10 is a schematic diagram depicting additional exemplary components of a compiler and a linker for compiling code for execution at runtime within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention;
FIG. 11 is a schematic diagram depicting additional exemplary components of a runtime and/or operating system and/or memory management for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the invention;
FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a text segment of precompiled code into sub-functions each of a size less than or equal to one virtual memory sub-page when compiled, in accordance with some embodiments of the invention; and
FIG. 13 is a flow diagram of an exemplary method for executing a text segment of an executable binary file within a virtual memory sub-page of one or more virtual memory pages in accordance with some embodiments of the invention.
Detailed Description
The present invention, in some embodiments thereof, relates to virtual memory management and, more particularly, but not by way of limitation, to systems and methods for clustering sub-pages of a virtual memory page.
As used herein, the terms clustering (or clustering technique) and coloring (or shading) are used interchangeably. For example, each cluster is assigned a particular color.
As used herein, the term very large virtual memory page refers to a virtual memory page that is larger than the physical memory page size defined by the hardware. It should be noted that different implementations may refer to very large pages using other terminology, such as huge pages or large pages.
As used herein, the term standard-size virtual memory page refers to a virtual memory page whose size is the minimum translation granularity defined by the hardware. The size of each physical memory block is the size of a virtual memory sub-page.
The terms very large virtual memory page, standard virtual memory page, and virtual memory page are sometimes used interchangeably.
An aspect of some embodiments of the invention relates to an apparatus, system, method, and/or code instructions (stored in a data storage device, executable by one or more hardware processors) for compiling pre-compiled code for execution at runtime within virtual memory sub-pages of a virtual memory page. When compiled and loaded into memory, the size of the pre-compiled code is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks mapped to the virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page. The pre-compiled code is partitioned into blocks such that each block, when compiled into a corresponding executable binary block, is less than or equal to the size of a virtual memory sub-page, which corresponds to the size of one physical memory block. The blocks are compiled into executable binary blocks, and the executable binary blocks are linked into a program. The program includes a designation of the executable binary blocks for loading the program, by hypervisor software, into an allocated virtual memory page. The hypervisor software loads the executable binary blocks into the physical memory blocks according to the mapping between the virtual memory sub-pages of the virtual memory page and the clusters of allocated physical memory blocks. The size of each block corresponds to the virtual memory sub-page size, for example 4 kilobytes (KB), based on the minimum page size available on x86-architecture processors.
An aspect of some embodiments of the invention relates to an apparatus, system, method, and/or code instructions (stored in a data storage device, executable by one or more hardware processors) for loading code for execution within virtual memory sub-pages of a virtual memory page. A binary file of an application divided into blocks is identified; the size of each block is less than or equal to the size of a virtual memory sub-page. An initial allocation of a cluster of physical memory blocks is retrieved for the application, each physical memory block having a size corresponding to the virtual memory sub-page size. An allocation of virtual memory pages for the application is received; the virtual memory pages map to contiguous physical memory regions of equal size. The virtual memory pages comprise virtual memory sub-pages that map to the clusters of physical memory blocks, the size of each block corresponding to the size of a virtual memory sub-page. The blocks of the application's binary file are loaded into the allocated virtual memory pages, i.e., into the physical memory region, according to the mapping between the virtual memory sub-pages and the clusters of allocated physical memory blocks.
The virtual memory sub-pages may be located non-contiguously within virtual memory pages that are mapped to corresponding clusters of memory blocks. The virtual memory sub-pages of different clusters may be contiguous with each other, optionally in a repeating pattern, for example arranged as: 1, 2, 3, 1, 2, 3.
The apparatus and/or systems described herein address the technical problem of combining software-based memory page clustering (also referred to herein as page coloring) with hardware-based very large memory pages in an efficient and operable manner. With current hardware and software architectures, such a combination is virtually impossible. A brief description of the prior art, and of the resulting incompatibility of software-based page coloring with hardware-based very large memory pages, is now provided.
Current multi-core/multi-processor computers are ubiquitous. By executing software in parallel on multiple hardware compute units, this computer architecture provides improved performance over its predecessors. However, in order for multiple compute units to share the same data residing in memory, all compute units need to access the same memory locations, which is typically mediated by a (hardware) last-level cache. When the last-level cache is shared among compute units, performance problems result from unfair use of the last-level cache by the different software applications running on the compute units (i.e., cores or CPUs). Unfair use can degrade the performance of each application, especially where the application code is memory-bound (i.e., performs a large number of memory accesses) and the memory access pattern is characterized by temporal locality. Page coloring techniques are currently implemented in pure software to share the last-level cache fairly and reduce application interference.
Typically, current software applications use virtual memory provided by the paging mechanism of the computing device. The minimum granularity of virtual-to-physical translation is a (standard) page, the smallest page size defined by the hardware. When an application operates on a wide memory region, the use of small pages severely impacts performance due to the high cost of virtual memory translation: a large number of page translations results in many misses in the TLB cache, each requiring multiple memory accesses to retrieve the translation (an operation called a page walk). Hardware very large pages were introduced to address this problem by reducing TLB misses.
Software-based page coloring is incompatible with hardware-based very large pages because page coloring is designed to operate at a minimum predefined and/or standard page granularity. With the prior art, attempts to extend software-based page coloring techniques to hardware-based very large pages result in few or no colors, which virtually eliminates any potential benefit of coloring.
The apparatus, systems, methods, and/or code instructions described herein (stored in a data store executed by one or more processors) effectively implement the combination of coloring and very large pages in a manner that improves performance and/or deterministic execution of applications running in parallel on the same computer device.
A brief discussion of other attempts to combine coloring and very large pages is now provided to help understand the technical problem solved and the solution described. One described strategy is a hardware-based solution. However, a hardware-only solution requires the manufacture of new hardware processors designed to employ page coloring in combination with very large pages. Such solutions are often complex and impractical to implement due to technical difficulties in design and/or manufacture. Furthermore, these solutions are not versatile enough to meet anticipated application requirements.
Another attempt to solve the technical problem of combining page coloring and huge pages in an operable manner is known as Intel® Cache Allocation Technology (CAT). CAT is designed to transparently support huge pages. However, CAT cannot be easily controlled and/or implemented in general, as the solution is specific to x86-based processors produced by Intel®. Furthermore, CAT cannot be extended to a large number of applications.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions embodied therewith for causing a processor to perform various aspects of the present invention.
The computer readable storage medium may be a tangible device capable of holding and storing instructions for use by an instruction execution device. The computer readable storage medium may be, for example but not limited to: electronic storage, magnetic storage, optical storage, electromagnetic storage, semiconductor storage, or any suitable combination of the foregoing.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a corresponding computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network.
The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, including, for example, a programmable logic circuit, a field-programmable gate array (FPGA), or a Programmable Logic Array (PLA), may execute computer-readable program instructions to perform aspects of the present invention by personalizing the electronic circuit with state information of the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Reference is now made to fig. 1, which is a schematic diagram illustrating how page colors (i.e., clusters) are arranged in a physical address space 102, to aid in understanding the technical problem addressed by some embodiments of the present invention. FIG. 1 depicts a conventional page coloring (i.e., clustering) approach that uses virtual memory to group physically dispersed memory pages of the same color together within the same virtual address range. The page colors repeat periodically. For example, one set of virtual memory pages, each having the standard page size defined by the processor (e.g., 4 kB on an x86 architecture), is allocated to physical memory pages having a blue color (e.g., cluster 1) 104. Another set of virtual memory pages is allocated to physical memory pages with a green color (e.g., cluster 2) 106. Note that the labels blue and green identify clusters and do not reflect an actual color of memory. Blue and green repeat periodically. Pages with the same color have a constant offset 108 in the physical address space.
Referring now to fig. 2, a schematic diagram is shown depicting an application (App1) that uses virtual pages of three different colors (i.e., clusters), blue 282, green 284, and yellow 286, to help understand the technical problems addressed by some embodiments of the present invention. The virtual memory subsystem (a component implemented in hardware and/or software) enables an application to organize physically scattered pages of the physical address space 288 into linear (virtual) memory ranges of the virtual address space 290. It should be noted that a particular page color organization is shown in FIG. 2, but it should be understood that many other organizations are possible.
One technical problem with embodiments in which virtual memory page coloring is combined with virtual memory huge pages is that coloring assigns one color per page, regardless of page size. Thus, for a virtual memory page, one page corresponds to one color, and for a virtual memory huge page, one huge page corresponds to one color. Because a huge virtual memory page comprises multiple standard pages, pages of all possible colors are integrated, through the 1:1 mapping from physical memory to virtual memory, into a single huge page. This means that in systems where applications use virtual memory pages as well as virtual memory huge pages, page coloring is not compatible with huge page coloring, because a huge page integrates multiple pages of all possible colors.
Referring now to fig. 3, which is a schematic diagram of an example of an application program (App1) that uses at least one virtual memory page color (i.e., cluster) within huge virtual memory pages, according to some embodiments of the present invention. The huge page 302 within the physical address space 304 may be located anywhere within the application's assigned virtual address space 306, while the colored sub-pages (one set 308 depicted for clarity) are fixed within the huge page 302.
Referring now to fig. 4, a block diagram of a system 400 according to some embodiments of the invention is shown, the system 400 including a computing device 402 for compiling code for execution at runtime within virtual memory sub-pages of virtual memory 404 and/or for loading code for execution within virtual memory sub-pages of virtual memory 404. Referring additionally to fig. 5, a flowchart of a method of compiling code for execution at runtime within a virtual memory sub-page of a virtual memory page is shown, according to some embodiments of the present invention. Referring additionally to fig. 6, a flowchart of a method of loading code for execution within a virtual memory sub-page of a virtual memory page is shown, in accordance with some embodiments of the present invention. The methods of fig. 5 and/or 6 may be implemented by code stored in data storage 412 and executed by processor 406. The data storage 412 may be implemented as Random Access Memory (RAM), or code may be moved from the data storage 412 to RAM for execution by the processor 406. For example, the method of FIG. 5 may be implemented by compiler code 412A and/or linker code 412B. The method of FIG. 6 may be implemented by loader code 412C, such as hypervisor code, an application loader, and/or a library loader.
As used herein, the terms hypervisor (e.g., code, software) and loader code are used interchangeably.
It should be noted that compiler code 412A and linker code 412B may be implemented as a single component, referred to herein as a compiler. Alternatively, the compiler and the linker may be implemented as separate components.
It should be noted that different architectures of the computing device 402 may be implemented. For example, the same computing device 402 may compile code (or recompile previously compiled code) for execution at runtime within a virtual memory sub-page of a virtual memory page, and also load the compiled code for execution. Alternatively, one computing device 402 performs compilation of code, e.g., of locally stored code, and/or of code received from client terminals and/or servers as a remote service (e.g., via an Application Programming Interface (API), a Software Development Kit (SDK), and/or a website interface). The compiled code may be provided for execution within a virtual memory sub-page of a virtual memory page of another computing device, e.g., to the client terminal and/or server that provided the code for compilation, and/or to another client terminal and/or server that receives the compiled code for local execution.
Optionally, processor 406 includes paging mechanism 416, and paging mechanism 416 is mapped between virtual memory 404 and physical memory 408. It should be noted that virtual memory 404 represents abstract and/or virtual components, as virtual memory 404 does not represent an actual physical virtual memory device. Paging mechanism 416 may be implemented in hardware. When a processor implementation lacks a paging mechanism, virtual memory sub-pages are mapped to a physical memory block, where the virtual memory sub-pages are part of a virtual memory page and the physical memory block is part of a contiguous physical memory block that constitutes the virtual memory page size. Optionally, the offset from the physical memory block to the beginning of the contiguous physical memory block is the same as the offset from the virtual memory sub-page to the beginning of the virtual memory page. In processors without paging mechanisms, there is no virtual paging concept. The virtual memory sub-pages are physical memory blocks. A virtual memory page is a collection of contiguous physical memory blocks. The systems, apparatus, methods, and/or code instructions described herein allow page shading and do not require a virtual memory subsystem.
Computing device 402 may be implemented, for example, as one or more of the following: a single computing device (e.g., a client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet, a wearable computing device, an eyewear computing device, a watch computing device, a desktop computer, and an Internet of Things (IoT) device.
The processor 406 may be implemented, for example, as a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a custom circuit, a Microprocessor (MPU), a processor for interfacing with other units, and/or a dedicated hardware accelerator. The processors 406 may be implemented as a single processor, multi-core processors, and/or clusters of processors arranged for parallel processing (which may include homogeneous and/or heterogeneous processor architectures).
The physical memory device 408 and/or the data storage device 412 are implemented, for example, as one or more of a Random Access Memory (RAM), a read-only memory (ROM), and/or a storage device such as a non-volatile memory, a magnetic medium, a semiconductor memory device, a hard disk drive, a removable memory, and an optical medium (e.g., DVD, CD-ROM).
Note that paging mechanism 416 is a memory component, and that paging mechanism 416 creates virtual memory 404 from physical memory 408 and/or data storage device 412.
The computing device 402 may communicate with a user interface 414 that presents data and/or include mechanisms for inputting data, such as one or more of a touch screen, a display, a keyboard, a mouse, voice-activated software, and a microphone. The user interface 414 may be used to configure parameters, such as defining the size of each virtual memory sub-page, and/or defining the number of clusters available.
Turning now to FIG. 5, a flowchart of a method for compiling code for execution at runtime within a virtual memory sub-page of a virtual memory page is shown. The size of each virtual memory sub-page is at least as large as a predefined size of a physical memory block associated with the processor. It should be noted that for high-level languages (e.g., C/C++, Fortran, Java, Python, etc.), the machine code is output by a compiler. The modification of the machine code based on the method described with reference to fig. 5 is transparent to the programmer. It should be noted that the compiler assumes that the application will run on virtual memory.
At 502, pre-compiled code is received for compilation by the compiler. The pre-compiled code may include text-based code written by a programmer. The pre-compiled code may include object code that has been compiled but not yet linked. The pre-compiled code may include the compiler's internal representation of the code. The source code may be written in different programming languages. The pre-compiled code may be new code for a first compilation, or may include old code (e.g., legacy applications) that has been previously compiled but is now being recompiled for execution within the virtual memory sub-pages of the virtual memory page.
When compiled and loaded into memory, the size of the pre-compiled code is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks, which are mapped to the virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page.
At 504, pre-compiled code that cannot fit into a virtual memory sub-page when compiled is divided into blocks. Each block, when compiled into a corresponding binary executable block, is smaller than or equal to the size of a physical memory block, and the size of a virtual memory sub-page of the virtual memory page corresponds to the size of the physical memory block.
Each binary block as a whole may be relocated into contiguous code segments from one virtual memory sub-page to another. Blocks may be relocated at run-time by moving each block from one region of physical memory to another region of physical memory. Because each block is mapped to a virtual memory sub-page, the block is moved from one virtual memory sub-page to another. The block may be moved to a contiguous virtual memory sub-page, or another virtual memory sub-page that is not contiguous. For example, a block in a virtual memory sub-page labeled 1234 may be moved to virtual memory sub-page 1235 or 123456789.
Exemplary data structures that do not fit into a virtual memory sub-page at compile time, and exemplary non-limiting methods for partitioning them, are now discussed. It should be understood that other data structures not explicitly discussed herein may be partitioned based on similar principles.
When the code is compiled into executable binary blocks, the size of each sub-function is smaller than or equal to the size of one virtual memory sub-page. When loaded into memory for execution, as described with reference to fig. 6, the executable binary blocks of the divided text functions are placed by the loader code (e.g., hypervisor software) within a cluster of virtual memory sub-pages of the virtual memory page that is mapped to a corresponding cluster of physical memory blocks, the size of each physical memory block corresponding to the size of one virtual memory sub-page.
Referring to FIG. 7, a diagram illustrating an example text segment 702 divided into a plurality of sub-functions is shown, according to some embodiments of the present invention. The text segment 702 includes three functions fun_a(), fun_b(), and fun_c(). The diagram 704 depicts a standard implementation based on existing methods, where the text segment 702 is placed in physical memory as a contiguous set of code spanning multiple corresponding virtual memory sub-pages (one virtual memory sub-page labeled 706 for clarity). Functions fun_a(), fun_b(), and fun_c() are stored contiguously. The diagram 708 depicts the division of the text segment 702 into three sub-functions, fun_a(), fun_b(), and fun_c(), where the text segment of each sub-function (.text_a, .text_b, and .text_c) is placed in a common cluster (i.e., color) 710 of physical memory. The size of each text segment of each function is less than the size of one virtual memory sub-page.
Returning now to act 504 of fig. 5, it is noted that functions smaller than one virtual memory sub-page (e.g., their .text segment) are relocatable as a whole and need not be partitioned.
Optionally, the entire text segment is partitioned at compile time into blocks, each of which is less than or equal to the size of one virtual memory sub-page. A single function cannot exceed the size of one virtual memory sub-page; this may be supported, for example, by a function-splitting mechanism. It should be noted that both LLVM and GCC (the most widely used compiler toolchains) implement such mechanisms.
Optionally, each function that is smaller than the size of one virtual memory sub-page is laid out at compile time to fit completely within one virtual memory sub-page.
Optionally, the pre-compiled code includes data storage structures that are larger than one virtual memory sub-page size at compile time. The data storage structure is partitioned into a plurality of sub-data storage structures, each sub-data storage structure being smaller than the size of a virtual memory sub-page at compile time. Exemplary data structures include: an array and a vector.
Optionally, a different data structure (e.g., implemented as a table) stores data for accessing each element of each sub-data storage structure. A dereference data structure can be created, and/or the data can be stored within an existing dereference data structure. The dereference data structure adds an offset based on the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime, and on the cluster of physical memory blocks allocated to the application associated with the data storage structure, where the size of each physical memory block corresponds to the virtual memory sub-page size.
Reference is now made to fig. 8, which is a schematic diagram depicting a dereference table 802 (also referred to as a subcolor_array) for accessing each element of a sub-array (one sub-array 804 is depicted for clarity) obtained by partitioning the array, in accordance with some embodiments of the invention. The array is stored in a virtual memory page 806, optionally a huge page. Each sub-array 804 is less than or equal to the size of one virtual memory sub-page (one sub-page 808 depicted for clarity) of the virtual memory page 806.
Reference is now made to FIG. 9, which is an example of code (e.g., native code, pseudo-assembly code) generated by the compiler to allow data access to one element of a sub-data storage structure (last 4 lines), in accordance with some embodiments of the present invention. The code represents one possible assembly translation; different ISAs may enable faster data access.
A new programming language keyword, referred to herein as, e.g., _colored, may be introduced to force colored access to heap-allocated data structures (whose size is unknown at compile time), as described herein. For example, for an integer array in the C/C++ programming language: _colored int *a = malloc(4096 * sizeof(int)). A corresponding keyword may be implemented for each programming language.
Returning now to act 504 of FIG. 5, the program stack is accessed and/or managed by growing the program stack in chunks, each chunk having a size less than or equal to one virtual memory sub-page. Code output by the compiler may be modified to access and/or manage the stack. A new program stack frame is added that updates the program stack pointer to each chunk. New program stack frames are added by adding an offset based on the size of the virtual memory sub-pages storing the program stack at runtime, and on the cluster of physical memory blocks allocated to the application associated with the data storage structure. The stack may thereby be located within a set of page colors.
An exemplary and not necessarily limiting implementation of the program stack described herein is now provided. When the application code calls a new function, the calling function checks the stack size. Since the size of the local variables and arguments is already known at compile time, after calculating the new stack position, the calling code may decide to insert the new stack frame described herein. The call arguments are then set up at the new location. At this point, the calling code may pass execution to the desired function by updating the stack pointer.
When the called function returns, the called-function code saves the return value for the caller. Finally, the callee-saved registers are restored, and the additional stack frame proposed herein is recognized by the called-function code when unwinding to the previous stack frame. Due to the new stack frame, the returning function adjusts the stack frame pointer before returning control to the calling function.
At 506, the block is compiled into an executable binary block.
Functions (e.g., .text segments) that are divided into blocks may be compiled with one .text section per function, which may enable fast recoloring. A table storing relocation data may be created for future recoloring.
At 508, the executable binary blocks are linked into a program. Linking may include designating the binary executable blocks for loading into an allocated virtual memory page, by loading each binary executable block into a physical memory block according to the mapping between the virtual memory sub-pages of the virtual memory page and the cluster of allocated physical memory blocks. The designation may be stored as metadata within the program, e.g., in a dedicated data structure external to the program (e.g., a table indicating whether the program is associated with the designation) and/or in a value of a field stored within the designated program.
At 510, a program is provided for execution. The program may be stored locally in a data storage device, for example, and/or transmitted to another computing device (e.g., a client terminal that provides precompiled code, and/or another client terminal).
Reference is now made to fig. 10, which is a schematic diagram depicting additional exemplary components of a compiler 412A and a linker 412B (as described with reference to fig. 4) for compiling code for execution while running within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. The components may represent conventional compilation of each host application component and/or modification of conventional static and/or dynamic linking processes.
Additional and/or modified components of compiler 412A include:
function schema editor 1002, the function schema editor 1002 for partitioning functions (e.g., of text segments) of pre-compiled code larger than a size of one virtual memory sub-page into sub-functions each smaller than or equal to the size of one virtual memory sub-page when compiled into executable code as described herein.
Discrete data structure support 1004, the discrete data structure support 1004 for, when compiled, partitioning a data storage structure into sub-data storage structures, each sub-data storage structure being smaller than a size of one virtual memory sub-page, as described herein.
Stack support 1006, the stack support 1006 for accessing and/or managing a program stack by growing the program stack in chunks, each chunk having a size less than or equal to the size of one virtual memory sub-page, as described herein.
Defaults 1008, the defaults 1008 adding new default values to the compiler, e.g., whether or not to enable by default the compilation method that supports page coloring within huge pages.
The components added and/or modified in linker 412B include:
as described herein, functions/data are packed with a predefined page size (e.g., 4kB)1010 for use in laying out functions each smaller than the size of one virtual memory sub-page at compile time to fit completely within one virtual memory sub-page at compile time.
A relocation and dereferencing table 1012, the relocation and dereferencing table 1012 for creating dereferencing data structures for accessing each element of each sub-data storage structure and/or relocating the binary as a whole into contiguous code segments from one virtual memory sub-page to another as described herein.
Loader hooks 1014, the loader hooks 1014 creating additional hooks for the loader to facilitate functions such as recoloring and/or runtime coloring.
Metadata generation 1016, the metadata generation 1016 to include specification of the partitioned executable binary blocks for proper loading of the program by the hypervisor software.
Reference is now made to FIG. 6, which is a flow diagram illustrating a method for executing programs within virtual memory sub-pages of a virtual memory page in accordance with some embodiments of the present invention.
At 602, an instruction to load an executing application is received. For example, the user clicks on an icon associated with the application, and/or another process triggers the loading of the application.
At 604, a binary of the application divided into blocks is identified, e.g., based on analyzing the designation associated with the application (see act 508 described with reference to FIG. 5).
The size of each block of the partitioned application is less than or equal to the size of a virtual memory sub-page.
At 606, an initial allocation of a cluster of physical memory blocks is received. The size of each physical memory block corresponds to the virtual memory sub-page size allocated to the application.
At 608, an allocation of virtual memory pages for an application is received. The size of the virtual memory pages is mapped to contiguous physical memory regions of equal size. The virtual memory pages comprise virtual memory sub-pages that are mapped to clusters of physical memory blocks. Each physical memory block has a size corresponding to the size of a virtual memory sub-page.
At load time, the binary loader may assign virtual memory pages (e.g., huge pages) to the text segment. The binary loader issues a request to the hypervisor code for the assigned colors. A user-space loader, because it is executed only once during initialization, can be placed at any virtual memory sub-page of any color. After allocation, the text code may be stored in the virtual memory page (e.g., huge page), thereby preserving the coloring. The code may be re-linked, including symbols. The loader may be modified to perform runtime re-linking based on the colors selected during the recoloring phase.
The application loader may implement a memory allocator that supports page coloring.
The page colors assigned to the application may be dynamically updated at runtime. The application address space may be dynamically updated to allocate additional virtual memory pages (e.g., colored pages) to the application.
At 610, blocks of the binary file of the application are loaded into the allocated virtual memory pages. The blocks are loaded into the physical memory regions according to the mapping between the virtual memory sub-pages and the cluster of allocated physical memory blocks.
Each application is loaded with a limited number of assigned page colors. Different colors are selected from all available colors for assignment to different applications so that multiple applications can be loaded simultaneously.
Optionally, when the data structure is divided into a plurality of sub-data storage structures (see act 504 described with reference to FIG. 5), the dereference data structure is populated with data for accessing each element of the sub-data storage structures of the data structure. The loader may populate the dereference table based on the colors assigned to the application. The dereference data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures during runtime, and according to the cluster of physical memory blocks. The size of each physical memory block corresponds to the virtual memory sub-page size allocated to the application associated with the data storage structure. Each sub-data storage structure may be placed on a page boundary.
Optionally, the program stack is extended in blocks. Each block has a size less than or equal to the size of one virtual memory sub-page of the virtual memory page in which the program and the program stack are stored during runtime. The program stack is extended according to the new program stack frame added, which updates the program stack pointer to point to the corresponding program stack block with the offset. The offset is calculated from the size of the virtual memory sub-page of the virtual memory page and the cluster of physical memory blocks storing the program stack block of the program stack during runtime. The size of each physical memory block corresponds to the size of the virtual memory sub-page allocated to the program stack.
Optionally, the application program includes sub-functions of a function that is larger than the size of one virtual memory sub-page, each sub-function being smaller than or equal to the size of one virtual memory sub-page. The sub-functions are stored at respective virtual memory sub-pages mapped to the cluster of physical memory blocks, the size of each physical memory block corresponding to the size of one virtual memory sub-page. The location of each sub-function is stored in a mapping data structure for runtime execution of the function.
At 612, the application is executed. Recoloring of the application may be performed at runtime.
Optionally, one or more binary blocks are dynamically moved from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster. The mapping between the virtual memory sub-pages and the clusters of physical memory blocks is updated according to the dynamic movement.
Runtime relocation of the dereference data structure may be performed when the code does not hold pointers to the actual elements of the data structure. For example, code is prevented from holding pointers to data structure elements; access to the data structure elements is provided by index instead.
Reference is now made to fig. 11, which is a diagram depicting additional exemplary components of a runtime 1102 and/or an operating system 1104 and/or a memory management 1106 for loading code for execution within virtual memory sub-pages of a virtual memory page, in accordance with some embodiments of the invention.
Additional and/or modified components of the runtime 1102 include:
Load-time and runtime symbol relocation 1108, which dynamically moves a binary block from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster and updates the mapping between the virtual memory sub-pages and the clusters of physical memory blocks according to the movement.
Extra large page support 1110, which identifies binary files of applications divided into blocks as described herein.
Additional and/or modified components of the operating system 1104 that may execute the binary loader 1112 include:
New executable coloring binary loader with compiler 1114, which is used to load blocks of the application's binary file into the allocated virtual memory pages.
Additional and/or modified components of memory management 1106 include:
a coloring allocator 1116, the coloring allocator 1116 allocating virtual memory pages for the applications according to the clusters as described herein.
Turning now to FIG. 12, a flowchart is presented depicting an exemplary implementation of dividing a function of a text segment of precompiled code into sub-functions, each of which is less than or equal to the size of a virtual memory sub-page when compiled, in accordance with some embodiments of the present invention. It should be noted that this method is optional.
The compiler 204 performs the partitioning of the source program 202 to create the object code program 206 as follows:
at 221, the parser unit parses the source program 202, for example, according to common compiler practices.
At 222, the intermediate transcoding unit performs intermediate transcoding, e.g., according to common compiler practice.
At 223, the optimization unit performs optimization of the intermediate code, for example, according to common compiler practice.
At 224, a code generation unit generates code, for example, according to common compiler practice.
At 225, functions larger than the size of one virtual memory sub-page (e.g., 4kB) are partitioned into sub-functions each smaller than the size of one virtual memory sub-page.
LLVM and GCC are exemplary production-quality compiler frameworks that are commonly used for software development. LLVM and GCC implement function outlining. One exemplary use of outlining is in the OpenMP framework. An example of code that can be outlined is a loop.
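A schematic Python illustration of outlining (the function names are hypothetical, and real outlining is performed by the compiler on intermediate code, not on source text):

```python
# Before outlining: one large function body containing a loop.
def sum_of_squares(xs):
    total = 0
    for x in xs:            # the loop body is a candidate for outlining
        total += x * x
    return total

# After outlining: the loop is moved into its own sub-function, which the
# compiler can then place in a separate sub-page-sized block.
def _outlined_loop(xs):
    total = 0
    for x in xs:
        total += x * x
    return total

def sum_of_squares_outlined(xs):
    return _outlined_loop(xs)   # original function reduced to a call
```

Both versions compute the same result; the outlined form merely splits one oversized function into blocks that each fit in a sub-page.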
At 226, the output object files 206 are compiled, each object file having a section for each sub-function and the associated relocatable code (i.e., relocations). The relocation symbols may be defined in the reloc section. The jump table defines how blocks loaded into noncontiguous storage areas are linked to each other. The segmentation unit helps the compiler divide the code and/or data objects into blocks of the sub-page size.
At 227, the packing unit (and/or pre-linker tool) of linker 208 packs the code functions from object code program 206 at a minimum granularity of one virtual memory sub-page (e.g., 4 kB). Information about the functions may be kept or discarded. The packing establishes the order in which the hypervisor software places the functions within the cluster of virtual memory sub-pages. Padding may be employed to avoid functions spanning multiple virtual memory sub-pages (e.g., crossing a 4 kB boundary).
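The packing step can be sketched as a first-fit placement into sub-page-sized bins; the function names, sizes, and the first-fit policy are illustrative assumptions:

```python
SUB_PAGE_SIZE = 4096   # assumed sub-page size

def pack_functions(funcs):
    """First-fit packing of (name, size) functions into sub-page-sized bins,
    leaving padding implicitly so that no function spans a sub-page boundary."""
    free = []      # remaining room in each bin (one bin = one sub-page)
    placed = {}    # name -> (sub_page_index, offset)
    for name, size in funcs:
        assert size <= SUB_PAGE_SIZE   # oversized functions were already split
        for i, room in enumerate(free):
            if size <= room:
                placed[name] = (i, SUB_PAGE_SIZE - room)
                free[i] -= size
                break
        else:                          # open a new sub-page-sized bin
            placed[name] = (len(free), 0)
            free.append(SUB_PAGE_SIZE - size)
    return placed
```

The residual room in each bin is the padding mentioned above; any function that would straddle a boundary is simply pushed to the next bin.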
At 228, the jump table generation unit calculates a jump table.
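One possible shape of such a jump table, sketched in Python under the assumption that the blocks are laid out contiguously at link time; the function and parameter names are illustrative:

```python
def build_jump_table(block_sizes, block_bases):
    """Map each block's link-time start address (blocks laid out contiguously
    and non-overlapping in the virtual address space) to its actual load
    address, so control transfers across noncontiguous blocks can be fixed up."""
    table = {}
    link_addr = 0
    for size, base in zip(block_sizes, block_bases):
        table[link_addr] = base
        link_addr += size
    return table
```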
At 229, the link generation unit performs the linking according to standard linker practice; the link generation unit assumes a single contiguous text section, because each block is contiguous in the virtual address space and the blocks do not overlap each other, as defined by the jump table.
At 230, the relocation and symbol generation unit saves the relocation information along with the symbol information in executable binary 212.
At 231, the additional metadata generation unit adds a tag to executable binary 212 as an indication to the program loader that executable binary 212 is compiled to support page coloring with extra large pages and is therefore amenable to load-time block relocation.
Turning now to FIG. 13, a flowchart of an exemplary method for executing a text segment of an executable binary file within a virtual memory sub-page of one or more virtual memory pages is depicted in accordance with some embodiments of the present invention. It should be noted that a text segment is described as an example, and the operating principle of the method is applicable to other executable binary segments.
The hypervisor software 214 receives the machine code 212. The machine code 212 is created based on the method described with reference to fig. 12.
Executable binary loader 216 of hypervisor software 214 may exist as part of an Operating System (OS) and/or be loaded by the OS into the same address space as the application. The implementation described herein (which is not necessarily limiting) is based on an executable binary loader 216 implemented within the OS.
Executable loader 216 performs the following steps:
at 239, the header parsing unit reads the set of headers describing the executable binary file and parses the contents of the description according to standard hypervisor software practices and/or executable binary loader practices.
At 238, the binary file is checked for a tag indicating that the binary has been compiled for page coloring with (optionally extra large) virtual memory pages (e.g., the tag is created by the compiler to distinguish between compilation types, as in act 231 described with reference to FIG. 12). It should be noted that the tag is an exemplary embodiment and is not necessarily limiting. The binary file may be further checked to verify that no code function is larger than the size of one virtual memory sub-page (e.g., 4 kB). The binary file may be further checked to verify that the relocation symbols are available for execution.
At 237, the generate page color assignment unit determines the color to assign to the text segment.
At 236, based on the total size of the text segment and the number of colors allocated, the page/superpage memory allocation unit allocates a certain amount of virtual memory (e.g., a superpage) to the binary and loads the entire text segment at the beginning of the allocated memory.
At 235, the function/data relocation unit moves each text segment block (each having a size less than or equal to one virtual memory sub-page, e.g., 4 kB) to a virtual memory page of the assigned color, saving an offset for each page.
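A hedged sketch of placing blocks only into the sub-pages of the assigned colors, assuming 8 page colors, a 4 kB sub-page, and a simple striping policy (the policy and all names are illustrative, not prescribed by the embodiments):

```python
N_COLORS = 8            # assumed total number of page colors
SUB_PAGE_SIZE = 4096    # assumed sub-page size

def place_block(i, app_colors, page_base):
    """Virtual address for text-segment block i inside one superpage, using
    only those sub-pages whose (index mod N_COLORS) is one of the colors
    assigned to the application."""
    color = app_colors[i % len(app_colors)]
    stripe = i // len(app_colors)      # how many full color rounds so far
    slot = color + stripe * N_COLORS   # sub-page index within the superpage
    return page_base + slot * SUB_PAGE_SIZE
```

The per-page offsets saved at this act are exactly the `slot * SUB_PAGE_SIZE` values computed here.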
At 234, the scheduler unit schedules execution of the application according to the common hypervisor software practice.
At 233, the runtime binary loader 218 of the program 220 performs symbol relocation according to the common runtime binary loader practice.
At 232, the runtime binary loader 218 uses the relocation information to traverse the entire text segment to change the runtime function pointer by generating a runtime jump table. When the start address of the program changes due to coloring, the start address is updated.
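The pointer rewrite of this act can be sketched as a lookup through the runtime jump table; `relocs` and its contents are illustrative assumptions:

```python
def rewrite_function_pointers(relocs, jump_table):
    """Patch each recorded pointer site with the relocated target address.
    relocs maps a code site to the original link-time target address;
    jump_table maps link-time addresses to actual load addresses."""
    return {site: jump_table[target] for site, target in relocs.items()}
```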
At 222, control is passed to the application, which starts running.
Other systems, methods, features and advantages of the disclosure will be, or will become, apparent to one with skill in the art upon examination of the following figures and detailed description. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The description of the various embodiments of the present invention is provided for purposes of illustration and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant compilers, linkers, and operating systems will be developed, and the scope of the terms compiler, linker, and operating system is intended to include all such new technologies a priori.
The term "about" as used herein means ±10%.
The terms "comprising," "including," "having," and variations thereof mean "including but not limited to." These terms encompass the terms "consisting of" and "consisting essentially of."
The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, provided that the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular forms "a", "an" and "the" include plural referents unless the context clearly dictates otherwise. For example, the term "compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless these features are mutually inconsistent.
Throughout this application, various embodiments of the present invention may be presented in a range format. It is to be understood that the description of the range format is merely for convenience and brevity and should not be construed as a fixed limitation on the scope of the present invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible sub-ranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed sub-ranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6, etc., as well as individual numbers within the range, such as 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any number (fractional or integer) within the indicated range. The phrases "in the range of" a first indicated number and a second indicated number and "ranging from" a first indicated number "to" a second indicated number are used interchangeably herein and are meant to include the first and second indicated numbers and all fractions and integers therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination, or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those features.
All publications, patents and patent specifications mentioned in this specification are herein incorporated in the specification by reference, and likewise, each individual publication, patent or patent specification is specifically and individually incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims (14)

1. An apparatus (402) for compiling code for execution at runtime within at least one virtual memory sub-page of at least one virtual memory page, the apparatus comprising:
a compiler (412A), the compiler (412A) executable by a processor (406), the compiler (412A) to receive pre-compiled code for compilation, wherein a size of the pre-compiled code, when compiled and loaded into memory, is a size of at least one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks mapped to a virtual memory page, each physical memory block size being the size of a virtual memory sub-page; if the processor (406) includes a paging mechanism, virtual memory is created for the paging mechanism from at least one of physical memory and a data storage device; if the processor (406) does not include the paging mechanism, the virtual memory sub-page is a physical memory block, the virtual memory page is a set of contiguous physical memory blocks;
partitioning said precompiled code into a plurality of blocks, such that each block of said plurality of blocks, when compiled into a corresponding executable binary block of a plurality of executable binary blocks, is less than or equal to said size of a virtual memory sub-page of said at least one virtual memory sub-page, said size of said one virtual memory sub-page corresponding to said size of one physical memory block;
compiling the plurality of blocks into the plurality of executable binary blocks; and
linking (412B) the plurality of executable binary blocks into a program, and designating the plurality of executable binary blocks for loading, by hypervisor software, of the program into the allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and clusters of the allocated plurality of physical memory blocks, wherein the size of each of the physical memory blocks corresponds to the size of a virtual memory sub-page.
2. The apparatus (402) of claim 1, wherein the compiler (412A) is further configured to: dividing a function of a text segment (702) of the pre-compiled code into a plurality of sub-functions, the size of the pre-compiled code segment being greater than the size of a virtual memory sub-page when compiled into executable code, the size of each sub-function being less than or equal to the size of a virtual memory sub-page when the plurality of sub-functions are compiled into executable code, wherein the executable binary blocks of the text-divided function are placed by hypervisor software (412C) within a cluster (710) of virtual memory sub-pages of a virtual page memory, the virtual memory sub-pages of the virtual memory pages being mapped to a corresponding cluster of physical memory blocks, the size of each of the physical memory blocks corresponding to a virtual memory sub-page size.
3. The apparatus (402) of claim 1, wherein said compiler (412A) is further configured to arrange a plurality of functions to fit completely within a virtual memory sub-page at compile time, wherein a size of each of said functions is smaller than said size of a virtual memory sub-page at compile time.
4. The apparatus (402) of any of the preceding claims, wherein the precompiled code comprises a data storage structure that is larger than the size of one virtual memory sub-page (808) at compile time, and wherein the compiler (412A) is further configured to partition the data storage structure into a plurality of sub-data storage structures (804), each of which is smaller than the size of one virtual memory sub-page (808) at compile time.
5. The apparatus (402) of claim 4, wherein the compiler (412A) is further configured to: creating a dereference data structure (802) for accessing each element of each sub data storage structure (804), wherein the dereference data structure (802) adds an offset according to the size of a virtual memory sub-page (808) of a virtual memory page (806) that stores the data structure during runtime, and a cluster of physical memory blocks, wherein the size of each of the physical memory blocks corresponds to a virtual memory sub-page size allocated to an application associated with the data storage structure.
6. The apparatus (402) of any of claims 1, 2, 3, 5, the compiler (412A) further configured to access and manage a program stack by incrementing the program stack in partitions, each of the partitions having a size less than or equal to a size of one virtual memory sub-page.
7. The apparatus (402) of claim 6, wherein said compiler (412A) is further configured to add a new program stack frame, said new program stack updating a program stack pointer to each chunk by adding an offset based on said size of said virtual memory sub-page storing said virtual memory pages of said program stack during runtime and a cluster of physical memory blocks allocated to an application associated with a data storage structure.
8. The apparatus (402) of any of claims 1-3, 5, 7, wherein the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor (406).
9. The apparatus (402) of any of claims 1-3, 5, 7, wherein each binary block in the plurality of binary blocks is entirely relocatable as a contiguous code segment from one virtual memory sub-page to another virtual memory sub-page.
10. An apparatus (402) for loading execution code within at least one virtual memory sub-page of at least one virtual memory page, the apparatus (402) comprising: a processor (406);
a memory (412), the memory (412) storing code instructions (412C) for execution by the processor (406), comprising:
code for identifying a binary file of an application that is partitioned into a plurality of blocks, wherein a size of each block in the plurality of blocks is less than or equal to a size of a virtual memory sub-page,
code for retrieving an initial allocation of a plurality of clusters in physical memory blocks, each of the physical memory blocks having a size corresponding to a virtual memory sub-page size of the application,
code for receiving an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page maps into a contiguous physical memory region of the same size, wherein a virtual memory page comprises at least a plurality of virtual memory sub-pages, the virtual memory sub-pages map into the plurality of clusters of physical memory blocks, each of the physical memory blocks having a size corresponding to a size of a virtual memory sub-page,
code for loading the plurality of blocks of the binary file of the application program to the allocated at least one virtual memory page, wherein the plurality of blocks are loaded to a physical memory region according to a mapping between the virtual memory sub-page and a plurality of clusters of the allocated physical memory block; if the processor (406) includes a paging mechanism, virtual memory is created for the paging mechanism from at least one of physical memory and a data storage device; if the processor (406) does not include the paging mechanism, the virtual memory sub-page is a physical memory block and the virtual memory page is a set of contiguous physical memory blocks.
11. The apparatus (402) of claim 10, further comprising: code for dynamically moving at least one of the plurality of blocks from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster, and for updating the mapping between the virtual memory sub-pages and the clusters of physical memory blocks according to the dynamic movement.
12. The apparatus (402) according to claim 10 or 11, wherein the apparatus (402) further comprises: code for populating data of a dereference data structure (802) for accessing each element of a sub-data storage structure (804) of a data storage structure, wherein the dereference data structure (802) adds an offset based on a size of a virtual memory sub-page (808) of a virtual memory page (806) of the sub-data storage structure (804) that stores the data structure to which it belongs during runtime, and a cluster of physical memory blocks, the size of each of the physical memory blocks corresponding to a virtual memory sub-page size allocated to the application program associated with the data storage structure.
13. The apparatus (402) according to claim 10 or 11, wherein the application program comprises compiled code for expanding a program stack in blocks according to an added new program stack frame, the size of each of the blocks being smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack at runtime, the new program stack frame updating a program stack pointer to point to the corresponding program stack block with an offset calculated according to the size of the virtual memory sub-page of the virtual memory page storing the program stack blocks of the program stack during runtime and the cluster of the physical memory blocks, the size of each physical memory block corresponding to the virtual memory sub-page size allocated to the program stack.
14. The apparatus (402) of claim 10 or 11, wherein the application program comprises compiled code for storing a plurality of sub-functions in respective virtual memory sub-pages of a cluster (710) mapped to a physical memory block, each of the plurality of sub-functions being smaller than or equal to the size of the one virtual memory sub-page of a function (702), the size of the function (702) being larger than the size of the one virtual memory sub-page, the size of the cluster (710) of physical memory blocks corresponding to the virtual memory sub-page size, and storing a location of each of the plurality of sub-functions in a mapping data structure for runtime execution of a function.
CN201780096871.XA 2017-12-01 2017-12-01 System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages Active CN111344667B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/081116 WO2019105565A1 (en) 2017-12-01 2017-12-01 Systems for compiling and executing code within one or more virtual memory pages

Publications (2)

Publication Number Publication Date
CN111344667A CN111344667A (en) 2020-06-26
CN111344667B true CN111344667B (en) 2021-10-15

Family

ID=60569915

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780096871.XA Active CN111344667B (en) 2017-12-01 2017-12-01 System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages

Country Status (2)

Country Link
CN (1) CN111344667B (en)
WO (1) WO2019105565A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821272B (en) * 2021-09-23 2023-09-12 武汉深之度科技有限公司 Application program running method, computing device and storage medium
CN116382785B (en) * 2023-06-01 2023-09-12 紫光同芯微电子有限公司 Method and device for data processing, computing equipment and storage medium
CN116560667B (en) * 2023-07-11 2023-10-13 安元科技股份有限公司 Splitting scheduling system and method based on precompiled delay execution

Citations (5)

Publication number Priority date Publication date Assignee Title
CN102122268A (en) * 2010-01-07 2011-07-13 华为技术有限公司 Virtual machine memory allocation access method, device and system
US8341609B2 (en) * 2007-01-26 2012-12-25 Oracle International Corporation Code generation in the presence of paged memory
CN103902459A (en) * 2012-12-25 2014-07-02 华为技术有限公司 Method and associated equipment for determining management mode of shared virtual memory page
CN104516826A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for correspondence of large virtual pages and large physical pages
CN105740042A (en) * 2016-01-15 2016-07-06 北京京东尚科信息技术有限公司 Management method and management system for memories of virtual machines

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US7752417B2 (en) * 2006-06-05 2010-07-06 Oracle America, Inc. Dynamic selection of memory virtualization techniques

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US8341609B2 (en) * 2007-01-26 2012-12-25 Oracle International Corporation Code generation in the presence of paged memory
CN102122268A (en) * 2010-01-07 2011-07-13 华为技术有限公司 Virtual machine memory allocation access method, device and system
CN103902459A (en) * 2012-12-25 2014-07-02 华为技术有限公司 Method and associated equipment for determining management mode of shared virtual memory page
CN104516826A (en) * 2013-09-30 2015-04-15 华为技术有限公司 Method and device for correspondence of large virtual pages and large physical pages
CN105740042A (en) * 2016-01-15 2016-07-06 北京京东尚科信息技术有限公司 Management method and management system for memories of virtual machines

Non-Patent Citations (1)

Title
Antonio Barbalace et al.; "It's Time to Think About an Operating System for Near Data Processing Architectures"; Proceedings of the 16th Workshop on Hot Topics in Operating Systems; May 2017; pp. 56-61; https://doi.org/10.1145/3102980.3102990 *

Also Published As

Publication number Publication date
WO2019105565A1 (en) 2019-06-06
CN111344667A (en) 2020-06-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant