WO2019105565A1 - Systems for compiling and executing code within one or more virtual memory pages - Google Patents


Info

Publication number
WO2019105565A1
Authority
WO
WIPO (PCT)
Prior art keywords: virtual memory, page, size, blocks, sub
Prior art date
Application number
PCT/EP2017/081116
Other languages
French (fr)
Inventor
Antonio BARBALACE
Yi Chen
Jani Kokkonen
Alexander SPYRIDAKIS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/081116 priority Critical patent/WO2019105565A1/en
Priority to CN201780096871.XA priority patent/CN111344667B/en
Publication of WO2019105565A1 publication Critical patent/WO2019105565A1/en


Classifications

    • G: PHYSICS
        • G06: COMPUTING; CALCULATING OR COUNTING
            • G06F: ELECTRIC DIGITAL DATA PROCESSING
                • G06F8/00: Arrangements for software engineering
                    • G06F8/40: Transformation of program code
                        • G06F8/41: Compilation
                • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
                    • G06F12/02: Addressing or allocation; Relocation
                        • G06F12/0223: User address space allocation, e.g. contiguous or non-contiguous base addressing
                            • G06F12/023: Free address space management
                        • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
                            • G06F12/10: Address translation
                                • G06F12/109: Address translation for multiple virtual address spaces, e.g. segmentation
                                • G06F12/1027: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
                                    • G06F12/1036: Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB], for multiple virtual address spaces, e.g. segmentation
                • G06F2212/00: Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
                    • G06F2212/10: Providing a specific technical effect
                        • G06F2212/1008: Correctness of operation, e.g. memory ordering
                        • G06F2212/1016: Performance improvement
                    • G06F2212/65: Details of virtual memory and virtual address translation
                        • G06F2212/652: Page size control
                        • G06F2212/653: Page colouring
                        • G06F2212/657: Virtual address space management

Definitions

  • the present invention in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
  • Memory resources include, for example, processor caches, including one or more of: L1, L2, L3, and L4 (e.g., L1, L1-L2, L1-L3, and L3-L4) (the highest level is termed last level cache (LLC)), the processor memory bus/ring that interconnects multiple groups/clusters via their LLC, and the memory controller and its (parallel) interconnections to the parallel memory elements (banks).
  • Page-coloring is a software-only technology that requires virtual memory to be implemented.
  • Page-coloring requires physically indexed and tagged caches, at least at the LLC.
  • For memory bandwidth partitioning, page-coloring may require software configuration of bank interleaving.
  • an apparatus for compiling code for runtime execution within a plurality of virtual memory sub-pages of at least one virtual memory page comprises: a compiler executable by a processor, the compiler configured to: receive pre-compilation code for compilation, wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, the size of each physical memory block is the size of a virtual memory sub-page, divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block, compile the plurality of blocks into the plurality of executable binary blocks, and
  • an apparatus for loading code for execution within a plurality of virtual memory sub-pages of at least one virtual memory page comprises: a processor, a memory storing code instructions for execution by the processor, comprising: code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page, code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size, code to load the plurality of blocks of the binary file of the application into the allocated
  • the systems, apparatus, methods, and/or code instructions described herein extend page coloring (i.e., clustering) to huge virtual memory pages.
  • the implementation of the systems, apparatus, methods, and/or code instructions described herein is transparent to executing code (e.g., the program, application).
  • Existing (e.g., legacy) code (e.g., programs, applications) may be used; new programs designed for implementation based on the systems, apparatus, methods, and/or code instructions described herein are not necessarily required.
  • the systems, apparatus, methods, and/or code instructions described herein provide a software-based solution based on modification of the system software (e.g., operating system code, runtime code, compiler code, and/or linker code).
  • the software-based solution does not necessarily require any modification of processing hardware and/or addition of new processing hardware, and may be executed by existing processing hardware, for example, in comparison to other proposed solutions that are based on at least some modification of processing hardware and/or new hardware component(s).
  • Commodity processor instruction set architectures (ISAs)
  • the compiler is further configured to divide a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by the supervisor software within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the compiler is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
  • the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler is further configured to divide the data storage structure into a plurality of sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
  • the compiler is further configured to create a dereferencing data structure for accessing each element of each sub-data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub- page size allocated to the application associated with the data storage structure.
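  • For example, the dereferencing computation described above may be sketched as follows; this is an illustrative Python model, and the class name (SubPagedArray), the 4 kB sub-page size, and the 8-byte element size are assumptions made for the sketch, not part of any claim:

```python
SUB_PAGE_SIZE = 4096          # assumed sub-page size in bytes
ELEM_SIZE = 8                 # assumed element size in bytes
ELEMS_PER_SUB_PAGE = SUB_PAGE_SIZE // ELEM_SIZE


class SubPagedArray:
    """Model of a data storage structure divided into sub-page-sized
    sub-data storage structures, accessed through a dereferencing
    data structure of per-sub-page base addresses."""

    def __init__(self, num_elems, sub_page_bases):
        # sub_page_bases plays the role of the dereferencing data
        # structure: one base address per sub-data storage structure.
        self.num_elems = num_elems
        self.deref = sub_page_bases

    def address_of(self, index):
        # Select the sub-page via the dereferencing table, then add
        # the element's offset within that sub-page.
        sub_page = index // ELEMS_PER_SUB_PAGE
        offset = (index % ELEMS_PER_SUB_PAGE) * ELEM_SIZE
        return self.deref[sub_page] + offset


# A 1000-element array split over two non-contiguous sub-pages.
arr = SubPagedArray(1000, sub_page_bases=[0x10000, 0x30000])
```

  • In this sketch, address_of(512) selects the second sub-page via the dereferencing table, since 512 elements of 8 bytes fill exactly one 4 kB sub-page.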
  • the compiler is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
  • the compiler is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
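  • A minimal sketch of growing the program stack in sub-page-sized divided blocks follows; this is illustrative Python, and the function name push_frame, the downward-growing layout, and the block addresses are assumptions of the sketch:

```python
SUB_PAGE_SIZE = 4096  # assumed sub-page size in bytes


def push_frame(sp, frame_size, block_bases):
    """Return a new stack pointer after pushing a frame of frame_size
    bytes. block_bases lists the base addresses of the sub-page-sized
    blocks allocated to the program stack, in the order they are used;
    the stack is assumed to grow downward within each block."""
    # Find the block the current stack pointer lives in.
    blk = next(i for i, b in enumerate(block_bases)
               if b <= sp <= b + SUB_PAGE_SIZE)
    new_sp = sp - frame_size
    if new_sp >= block_bases[blk]:
        return new_sp                     # frame fits in current block
    # Otherwise the frame starts at the top of the next allocated
    # block, i.e. the stack pointer jumps by an offset to the next
    # block of the cluster rather than growing contiguously.
    top = block_bases[blk + 1] + SUB_PAGE_SIZE
    return top - frame_size


blocks = [0x7000_0000, 0x7000_9000]       # non-contiguous stack blocks
sp = blocks[0] + SUB_PAGE_SIZE            # stack grows downward
sp = push_frame(sp, 256, blocks)
```

  • A frame that would cross the end of the current block is instead placed at the top of the next block, which keeps every frame within a single sub-page-sized block.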
  • the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor.
  • each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
  • the apparatus further comprising code to dynamically move at least one of the plurality of binary blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move.
  • the apparatus further comprises code to populate data of a dereferencing data structure for accessing each element of sub-data storage structures of a data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
  • the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
  • the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page, at a respective virtual memory sub-page mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
  • the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that make up the size of a virtual memory page.
  • FIG. 1 is a schematic depicting how page colors are arranged in a physical address space, to help in understanding the technical problem addressed by some implementations of the present invention
  • FIG. 2 is a schematic depicting an application using virtual pages of three different colors, to help in understanding the technical problem addressed by some implementations of the present invention
  • FIG. 3 is a schematic depicting an example of an application that uses virtual memory paging with at least one huge virtual memory page, in accordance with some embodiments of the present invention
  • FIG. 4 is a schematic of a block diagram of a system that includes a computing device for compiling code for runtime execution within virtual memory sub-pages and/or for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention
  • FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention
  • FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention
  • FIG. 7 is a schematic depicting division of an example .text section into multiple sub-functions, in accordance with some embodiments of the present invention
  • FIG. 8 is a schematic depicting a dereferencing table for accessing each element of sub-arrays which are obtained by dividing an array, in accordance with some embodiments of the present invention
  • FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure, in accordance with some embodiments of the present invention.
  • FIG. 10 is a schematic depicting additional exemplary components of a compiler and a linker for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention
  • FIG. 11 is a schematic depicting additional exemplary components of a runtime and/or operating system and/or memory management for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention
  • FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention.
  • FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
  • the present invention in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
  • The words cluster (or clustering) and color (or coloring) are interchangeable. For example, each cluster is assigned a certain color.
  • A huge virtual memory page refers to a virtual memory page that is larger than the physical memory page size defined by the hardware implementation. It is noted that different implementations may refer to huge pages with other terms, for example, large pages.
  • A standard size virtual memory page refers to a virtual memory page defined by the hardware as the minimum unit of translation.
  • the size of each physical memory block is the size of a virtual memory sub-page.
  • huge virtual memory page, standard virtual memory page, and virtual memory page are sometimes interchangeable.
  • An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for compiling pre-compilation code for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of the pre-compilation code when compiled and loaded into a memory, is at least the size of one virtual memory sub-page.
  • the virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page.
  • the size of each physical memory block is the size of a virtual memory sub-page.
  • the pre-compilation code is divided into blocks, such that each block when compiled into a respective executable binary block is less than or equal to the size of a virtual memory sub-page (of the virtual memory page corresponding to the size of one physical memory block).
  • the blocks are compiled into executable binary blocks.
  • the executable binary blocks are linked into a program.
  • the program includes a designation of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page.
  • the supervisor software loads the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page and allocated clusters of physical memory blocks.
  • Each block has a size corresponding to a virtual memory sub-page size, for example, 4 kilobytes (kB), which is the smallest page size available for processors based on the x86 architecture.
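  • The division step may be sketched as follows; this is illustrative Python, and padding the final block to the block boundary (so that each block can occupy its own sub-page) is an assumption of the sketch:

```python
SUB_PAGE_SIZE = 4096  # 4 kB: smallest page size on x86 processors


def divide_into_blocks(binary: bytes, block_size: int = SUB_PAGE_SIZE):
    """Split a compiled binary into executable binary blocks, each no
    larger than one virtual memory sub-page, zero-padding the last
    block so every block fills exactly one sub-page."""
    blocks = []
    for off in range(0, len(binary), block_size):
        blk = binary[off:off + block_size]
        blocks.append(blk.ljust(block_size, b"\x00"))
    return blocks


# e.g. 10000 bytes of compiled code (NOPs here) become three blocks.
code_blocks = divide_into_blocks(b"\x90" * 10000)
```

  • Each resulting block can then be loaded into any physical memory block of the allocated clusters, since no block spans a sub-page boundary.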
  • An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for loading code for execution within virtual memory sub-pages of virtual memory page(s).
  • a binary file of an application divided into blocks is identified.
  • a size of each block is less than or equal to a size of a virtual memory sub-page.
  • An initial allocation of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size is retrieved for the application.
  • An allocation of virtual memory page(s) for the application is received.
  • the size of the virtual memory page is mapped to an equal size of contiguous physical memory areas.
  • the virtual memory page includes virtual memory sub-pages mapped to the clusters of physical memory blocks.
  • each block corresponds to the size of a virtual memory sub-page.
  • the blocks of the binary file of the application are loaded into the allocated virtual memory page(s).
  • the blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
  • Virtual memory sub-pages mapped to respective clusters of memory blocks may be located non-contiguously within the virtual memory page.
  • Virtual memory sub-pages of different clusters may be contiguous with one another, optionally in a repeating pattern, for example for three defined clusters arranged as: 1, 2, 3, 1, 2, 3, 1, 2, 3.
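  • For such a repeating arrangement, the cluster of a given sub-page follows directly from its index within the virtual memory page, for example (illustrative Python; the 1-based cluster ids match the 1, 2, 3 example above):

```python
def subpage_cluster(subpage_index: int, num_clusters: int = 3) -> int:
    """Cluster (color) of a sub-page when clusters repeat in a fixed
    pattern 1, 2, 3, 1, 2, 3, ... across the virtual memory page.
    Returns a 1-based cluster id to match the example in the text."""
    return subpage_index % num_clusters + 1


# The first nine sub-pages reproduce the repeating pattern.
pattern = [subpage_cluster(i) for i in range(9)]
```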
  • Software-based page coloring is incompatible with hardware-based huge pages because page coloring is designed to operate according to the smallest predefined and/or standard page granularity. Based on existing technology, an attempt to extend the technique of software-based page coloring to hardware-based huge-page coloring may result either in an extremely small number of colors, or no colors at all, which effectively eliminates any potential benefits of implementing coloring.
  • the apparatus, system, methods, and/or code instructions (stored in a data storage device executed by one or more processors) described herein effectively implement a combination of coloring and huge pages in a manner that improves performance and/or deterministic execution of applications running concurrently on the same computer device.
  • Cache Allocation Technology (CAT) of Intel®.
  • CAT is designed to transparently support huge pages.
  • CAT cannot be easily controlled and/or generally implemented, since the solution is designed specifically for the processors produced by Intel® based on the x86 architecture.
  • CAT cannot scale to a high number of applications.
  • the present invention may be a system, a method, and/or a computer program product.
  • the computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
  • the computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device.
  • the computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • a network for example, the Internet, a local area network, a wide area network and/or a wireless network.
  • the computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
  • electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the block may occur out of the order noted in the figures.
  • two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • FIG. 1 is a schematic depicting how page colors (i.e., clusters) are arranged in a physical address space 102, to help in understanding the technical problem addressed by some implementations of the present invention.
  • FIG. 1 depicts the traditional page coloring (i.e., clustering) approach that uses virtual memory to group physically scattered memory pages of the same color together within the same virtual address range. Page colors are periodically repeated. For example, one set of virtual memory pages, of the standard page size defined by the processor (e.g., 4 kB in x86 architectures), is assigned physical memory pages having the color blue (e.g., cluster 1) 104.
  • Another set of virtual memory pages are assigned physical memory pages having the color green (e.g., cluster 2) 106. It is noted that the labels blue and green are meant as tags to identify the clusters, and do not reflect actual colors of the memory. The colors blue and green are periodically repeated. Pages with the same color have a constant offset 108 in the physical address space.
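  • In this traditional scheme, the color of a page is determined by its physical page frame number, so pages of the same color recur at a constant stride, for example (illustrative Python; the 4 kB page size and the two-color setup are assumptions of the sketch):

```python
PAGE_SIZE = 4096   # standard x86 page size
NUM_COLORS = 2     # e.g. blue (cluster 1) and green (cluster 2)


def page_color(phys_addr: int) -> int:
    """Color (cluster) of the physical page containing phys_addr,
    derived from the physical page frame number."""
    return (phys_addr // PAGE_SIZE) % NUM_COLORS


# Pages of the same color sit at a constant offset in physical memory.
COLOR_STRIDE = NUM_COLORS * PAGE_SIZE
```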
  • FIG. 2 is a schematic depicting an application (App 1) using virtual pages of three different colors (i.e., clusters), blue 282, green, 284, and yellow 286, to help in understanding the technical problem addressed by some implementations of the present invention.
  • a virtual memory subsystem (component implemented in hardware and/or software) enables the application to organize strictly disposed physical pages of a physical address space 288 into linear (virtual) memory ranges of a virtual address space 290. Note that a specific page color organization is shown in FIG. 2, but it is to be understood that there are multiple possible organizations.
  • FIG. 3 is a schematic depicting an example of an application (App 1) that uses paging with at least one virtual memory page coloring (i.e., clustering) within a huge virtual memory page, in accordance with some embodiments of the present invention.
  • a huge page 302 within a physical address space 304 may be located anywhere within the application’s assigned virtual address space 306.
  • The colored sub-pages (e.g., one set 308 depicted for clarity).
  • FIG. 4 is a schematic of a block diagram of a system 400 that includes a computing device 402 for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s) of a virtual memory 404 and/or for loading code for execution within virtual memory sub-pages of virtual memory page(s) of virtual memory 404, in accordance with some embodiments of the present invention.
  • FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • the methods of FIG. 5 and/or FIG. 6 may be implemented by code stored in data storage device 412 executed by processor(s) 406.
  • Data storage device 412 may be implemented as random access memory (RAM), or code may be moved from data storage device 412 to RAM for execution by processor(s) 406.
  • the method of FIG. 5 may be implemented by compiler code 412A and/or linker code 412B.
  • the method of FIG. 6 may be implemented by loading code 412C, for example, supervisor code, application loader, and/or library loader.
  • The terms supervisor (e.g., code, software) and loading code may be interchanged.
  • compiler code 412A and linker code 412B may be implemented as a single component referred to herein as compiler. Alternatively, the compiler and linker are implemented as distinct components.
  • computing device 402 may compile code (or re-compile previously compiled code) for runtime execution within virtual memory sub-pages of virtual memory page(s), and load the compiled code for execution within virtual memory sub-pages of virtual memory page(s).
  • one computing device 402 performs the compilation of the code, for example, for locally stored code, for code transmitted by client terminal(s) and/or server(s), and/or providing remote services to client terminal(s) and/or server(s) (e.g., via a software interface such as an application programming interface (API), software development kit (SDK), a web site interface, and/or an application interface that is loaded on the client terminal and/or server).
  • the compiled code may be provided for execution within virtual memory sub-pages of virtual memory page(s) of another computing device, for example, by the client terminal(s) and/or server(s) that provided the code for compilation, and/or by another client terminal and/or server that receive the compiled code for local execution.
  • processor(s) 406 includes a paging mechanism 416 that maps between virtual memory 404 and physical memory 408. It is noted that virtual memory 404 represents an abstraction and/or a virtual component, since virtual memory 404 does not represent an actual physical virtual memory device. Paging mechanism 416 may be implemented in hardware. When an implementation of processor(s) lacks a paging mechanism, the virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of contiguous physical memory blocks that make up the size of a virtual memory page. Optionally, the physical memory block offset to the beginning of the contiguous physical memory blocks is the same as the offset that the virtual memory sub-page has to the beginning of the virtual memory page.
  • Virtual memory sub-pages are physical memory blocks.
  • Virtual memory pages are a collection of contiguous physical memory blocks.
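The offset-preserving mapping described above may be sketched as follows. This is an illustrative Python sketch of the mechanism, not an implementation from the specification; the constants `PAGE_SIZE` and `SUB_PAGE_SIZE` are assumed example values (a 2 MB huge page divided into 4 kB blocks).

```python
# Sketch: when the processor lacks a paging mechanism, a virtual memory
# sub-page keeps the same offset inside the contiguous physical region
# that backs the virtual memory page.
PAGE_SIZE = 2 * 1024 * 1024      # illustrative: one 2 MB huge page
SUB_PAGE_SIZE = 4 * 1024         # illustrative: 4 kB physical memory blocks

def physical_address(region_base, virtual_address, page_base):
    """Map a virtual address to physical memory, preserving the offset
    relative to the start of the contiguous physical memory blocks."""
    offset_in_page = virtual_address - page_base
    assert 0 <= offset_in_page < PAGE_SIZE
    return region_base + offset_in_page
```

For example, a virtual address 0x3000 bytes into a page maps 0x3000 bytes into the backing physical region, so the sub-page offset within the page is preserved.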
  • the systems, apparatus, methods, and/or code instructions described herein enable page coloring without necessarily requiring a virtual memory subsystem.
  • Computing device 402 may be implemented as, for example, one or more of: a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, a desktop computer, and an internet of things (IoT) device.
  • Processor(s) 406 may be implemented as, for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), microprocessing unit (MPU), processors for interfacing with other units, and/or specialized hardware accelerators.
  • Processor(s) 406 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures).
  • Physical memory device(s) 408 and/or data storage device 412 are implemented as, for example, one or more of: a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
  • paging mechanism 416 is the memory component that creates virtual memory 404 from physical memory 408 and/or data storage device 412.
  • Computing device 402 may be in communication with a user interface 414 that presents data and/or includes a mechanism for entry of data, for example, one or more of: a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone.
  • User interface 414 may be used to configure parameters, for example, define the size of each virtual memory sub- page, and/or define the number of available clusters.
  • FIG. 5 is a flowchart of a method for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of each virtual memory sub-page is at least as large as a predefined size of a physical memory block associated with the processor.
  • high-level languages (e.g., C/C++, Fortran, Java, Python, and the like)
  • the machine code is outputted by the compiler. Modifications to the machine code based on the method described with reference to FIG. 5 are transparent to the programmer. It is noted that the compiler assumes that the application will be run on virtual memory.
  • pre-compilation code is received for compilation by the compiler.
  • the pre-compilation code may include source code, i.e., text-based code written by a programmer.
  • the pre-compilation code may include object code that is already compiled but not yet linked.
  • the pre-compilation code may include an internal representation of the code within the compiler.
  • the source code may be written in different programming languages.
  • the pre-compilation code may be new code for a first-time compilation, or may include old code (e.g., legacy application) that has been previously compiled but is now being re-compiled for runtime execution within virtual memory sub-pages of virtual memory page(s).
  • the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page.
  • the virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page.
  • the size of each physical memory block is the size of a virtual memory sub-page
  • the pre-compilation code which cannot fit into one virtual memory sub-page when compiled, is divided into blocks.
  • Each block, when compiled into a respective executable binary block, has a size less than or equal to the size of a virtual memory sub-page of the virtual memory page corresponding to the size of one physical memory block.
  • Each binary block is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
  • Blocks may be relocated at runtime, by moving each block from one area of physical memory to another area of the physical memory. Since each block is mapped to a virtual memory sub-page, a block is moved from one virtual memory sub-page to another virtual memory sub-page. Blocks may be moved to a contiguous virtual memory sub-page, or another virtual memory sub-page that is non-contiguous. For example, a block in virtual memory sub-page labeled as 1234 may be moved to virtual memory sub-page 1235, or 123456789.
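The runtime relocation described above may be sketched as follows. This is a minimal illustration, assuming a 4 kB sub-page size and a simple block-to-sub-page map; the names are not from the specification.

```python
# Sketch: each compiled block fits in one virtual memory sub-page, so
# relocating it means copying the block from one sub-page-sized slot to
# another (contiguous or not) and updating the block -> sub-page mapping.
SUB_PAGE_SIZE = 4096  # illustrative 4 kB sub-page

def relocate_block(memory, block_map, block_id, dst_sub_page):
    """Move one relocatable binary block, in its entirety, to another
    sub-page slot, then remap the block to its new sub-page."""
    src = block_map[block_id] * SUB_PAGE_SIZE
    dst = dst_sub_page * SUB_PAGE_SIZE
    memory[dst:dst + SUB_PAGE_SIZE] = memory[src:src + SUB_PAGE_SIZE]
    block_map[block_id] = dst_sub_page
```

Because each block is a continuous segment of code no larger than one sub-page, the copy never splits a block across sub-pages.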
  • a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code is divided into multiple sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks.
  • the executable binary blocks of the divided function of the .text when loaded into memory for program execution as described with reference to FIG. 6, are placed by the loading code (e.g., supervisor software) within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • FIG. 7 is a schematic depicting division of an example .text section 702 into multiple sub-functions, in accordance with some embodiments of the present invention.
  • .text section 702 includes three functions, fun_a(), fun_b(), and fun_c().
  • Schematic 704 depicts a standard implementation based on existing methods, in which .text section 702 is placed into physical memory as a continuous set of code spanning across multiple corresponding virtual memory sub-pages (one virtual memory sub-page marked 706 for clarity). Functions, fun_a(), fun_b(), and fun_c() are stored contiguously.
  • Schematic 708 depicts a division of .text 702 into three sub-functions fun_a(), fun_b(), and fun_c(), where each .text portion of each sub-function (text_a, text_b, and text_c) is placed in a common cluster (i.e., color) 710 of physical memory.
  • the size of each .text section of each function is smaller than one virtual memory sub-page.
  • the entire .text segment is divided into blocks each smaller than or equal to the size of one virtual memory sub-page when compiled.
  • a single function cannot exceed the size of one virtual memory sub-page; function outlining may be used for support. It is noted that both LLVM and GCC (the most widely used compiler toolchains) already implement function outlining.
  • functions that are each smaller than the size of one virtual memory sub-page when compiled are arranged to fit entirely within one virtual memory sub-page when compiled.
  • the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled.
  • the data storage structure is divided into multiple sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
  • Exemplary data structures include: array and vector.
  • a dereferencing data structure (e.g., implemented as a table) stores data for accessing each element of each sub-data storage structure.
  • the dereferencing data structure may be created and/or the data may be stored within an existing dereferencing data structure.
  • the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
  • FIG. 8 is a schematic depicting a dereferencing table 802 (also referred to as subcolor_array) for accessing each element of sub-arrays (one sub-array 804 depicted for clarity) which are obtained by dividing an array, in accordance with some embodiments of the present invention.
  • the array is stored in a virtual memory page 806, optionally a huge page.
  • Each sub-array 804 is less than or equal to the size of one virtual memory sub-page (one sub-page 808 depicted for clarity) of virtual memory page 806.
  • FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure (the last 4 lines), in accordance with some embodiments of the present invention.
  • the code represents a possible ASM translation. Different ISAs may enable faster data access.
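The table-based access above may be sketched as follows, in Python for clarity rather than as the generated machine code of FIG. 9. The element capacity per sub-page is an illustrative assumption (e.g., a 4 kB sub-page holding 4-byte elements).

```python
# Sketch of the dereferencing table (subcolor_array in FIG. 8): a large
# array is split into sub-arrays, each fitting one sub-page, and element
# access goes through the table, which supplies the sub-array base.
ELEMS_PER_SUBPAGE = 1024  # illustrative: 4 kB sub-page / 4-byte elements

def split_array(data):
    """Divide an array into sub-arrays that each fit one sub-page,
    returning the dereferencing table (list of sub-arrays)."""
    return [data[i:i + ELEMS_PER_SUBPAGE]
            for i in range(0, len(data), ELEMS_PER_SUBPAGE)]

def deref(table, index):
    """Access element `index` through the dereferencing table: select
    the sub-array, then apply the offset within it."""
    return table[index // ELEMS_PER_SUBPAGE][index % ELEMS_PER_SUBPAGE]
```

The division and modulo correspond to the offset arithmetic that the compiler-generated code of FIG. 9 performs at runtime.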
  • a program stack is accessed and/or managed by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
  • the code outputted by the compiler may be modified for accessing and/or managing the stack.
  • a new program stack frame that updates a program stack pointer that points to each divided block is added.
  • the new program stack frame is added by adding an offset according to the size of the virtual memory sub-pages storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
  • the stack may be allocated for a certain set of page colors.
  • the caller function checks for the stack size. Since the argument sizes are already known at compile time, the caller code may decide to insert the new stack frame described herein after calculating the new stack position. Then, at the new location, the argument for the caller is laid out. At that point the caller code may pass the execution to the needed function, by updating the stack pointer.
  • When the called function returns, the called function code saves the return value for the caller. Eventually unsaved caller registers are restored, and at the time of unrolling to the previous stack frame the called function code notices the proposed additional stack frame. Because of the new stack frame, the returning function adjusts the stack frame pointer before giving back control to the calling function.
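The blocked stack growth described above may be sketched as follows. This is an illustrative model only, assuming a 4 kB sub-page and tracking frame sizes rather than real register state.

```python
# Sketch: the program stack grows in sub-page-sized blocks. When a new
# frame would cross a sub-page boundary, an extra stack frame redirects
# the stack pointer into a fresh block; unwinding notices the extra
# frame and returns to the previous block.
SUB_PAGE_SIZE = 4096  # illustrative 4 kB sub-page

class BlockedStack:
    def __init__(self):
        self.blocks = [[]]   # frame sizes per block
        self.used = [0]      # bytes used per block

    def push_frame(self, frame_size):
        assert frame_size <= SUB_PAGE_SIZE
        if self.used[-1] + frame_size > SUB_PAGE_SIZE:
            # insert the "new stack frame": continue in a fresh block
            self.blocks.append([])
            self.used.append(0)
        self.used[-1] += frame_size
        self.blocks[-1].append(frame_size)

    def pop_frame(self):
        size = self.blocks[-1].pop()
        self.used[-1] -= size
        if not self.blocks[-1] and len(self.blocks) > 1:
            # unwinding adjusts back to the previous block
            self.blocks.pop()
            self.used.pop()
        return size
```

Since argument sizes are known at compile time, a real compiler can decide at the call site whether the extra frame is needed, as described above.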
  • the blocks are compiled into executable binary blocks.
  • .text section divided into blocks may be compiled with one .text section per function, which may enable quick re-coloring.
  • a table storing the relocation data may be created for future re-coloring.
  • a designation may be included, of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page(s) by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page(s) and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the designation may be stored, for example, as metadata within the program, by a specialized data structure external to the program (e.g., table indicating whether the program is associated with the designation), and/or a value in a field stored by the program indicative of the designation.
  • the program is provided for execution.
  • the program may be, for example, locally stored in a data storage device, and/or transmitted to another computing device (e.g., a client terminal that provided the pre-compilation code, and/or another client terminal).
  • FIG. 10 is a schematic depicting additional exemplary components of compiler 412A and linker 412B (as described with reference to FIG. 4) for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
  • the components may represent a modification of the traditional compilation and/or traditional static and/or dynamic linking process for each of the main application parts.
  • Additional and/or modified components of compiler 412A include:
  • Functional outliner 1002 for dividing a function (e.g., of a .text section) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, as described herein.
  • Scattered data structure support 1004 for dividing the data storage structure into sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled, as described herein.
  • Stack support 1006 for accessing and/or managing a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page, as described herein.
  • Defaults 1008 adds new defaults to the compiler, such as the default compilation methods that may include or exclude the support for page coloring in huge pages.
  • Additional and/or modified components of linker 412B include:
  • Loader hooks 1014 creates additional handles for the loader to help functionalities such as re-coloring and/or runtime coloring.
  • Metadata generation 1016 for including a designation of divided executable binary blocks for appropriate loading of the program by supervisor software.
  • FIG. 6 is a flowchart of a method for execution of the program within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • instructions to load an application for execution are received. For example, a user clicks on an icon associated with the application, and/or another process triggers loading of the application.
  • a binary file of the application divided into blocks is identified, for example, based on an analysis of the designation associated with the application (as described with reference to act 508 of FIG. 5).
  • the size of each block of the divided application is less than or equal to a size of a virtual memory sub-page.
  • Each physical memory block is of a size corresponding to a virtual memory sub-page size allocated for the application.
  • an allocation of virtual memory page(s) for the application is received.
  • the size of the virtual memory page(s) is mapped to an equal size of contiguous physical memory area(s).
  • the virtual memory page(s) include virtual memory sub-pages mapped to the clusters of physical memory blocks. Each physical memory block has a size corresponding to the size of a virtual memory sub-page.
  • the binary loader may allocate a virtual memory page (e.g., huge page) for the .text.
  • the binary loader issues a request to the supervisor code for the allocated color(s).
  • the loader may be placed at any virtual memory sub-page of any color, since the user-space loader is executed once during initialization.
  • the .text code may be stored in a virtual memory page (e.g., huge page) thus preserving the coloring.
  • the code may be re-linked, including symbols.
  • the loader may be modified to perform a runtime re-linking based on the selected colors during a re-coloring phase.
  • the application loader may implement a memory allocator supporting page coloring.
  • Page colors allocated to the application may be dynamically updated at run-time.
  • the application address space may be dynamically updated to allocate additional virtual memory pages (e.g., huge pages) to the application.
  • the blocks of the binary file of the application are loaded into the allocated virtual memory page(s).
  • the blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
  • Each application is loaded with a limited number of the allocated page colors. Different applications are assigned different colors selected from all the available colors, to enable multiple applications to be loaded simultaneously.
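The per-application color assignment described above may be sketched as follows. The number of colors and the round-robin policy are illustrative assumptions; any policy that keeps the color sets disjoint would do.

```python
# Sketch: each application is loaded with a limited subset of the
# available page colors, and different applications receive disjoint
# subsets so they can be loaded simultaneously.
NUM_COLORS = 16  # illustrative number of available colors

def assign_colors(apps, colors_per_app):
    """Give each application its own disjoint subset of colors."""
    assert len(apps) * colors_per_app <= NUM_COLORS
    allocation, next_color = {}, 0
    for app in apps:
        allocation[app] = list(range(next_color, next_color + colors_per_app))
        next_color += colors_per_app
    return allocation
```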
  • the dereferencing data structure is populated with data for accessing each element of sub-data storage structures of the data storage structure.
  • the loader may populate the dereferencing table based on the application’s assigned colors.
  • the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks. Each physical memory block is of a size corresponding to the virtual memory sub-page size allocated to the application associated with the data storage structure.
  • Each sub-data storage structure may be placed on page boundaries.
  • a program stack is grown in blocks. Each block has a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime.
  • the program stack is grown according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset.
  • the offset is computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks.
  • the application includes sub-functions, obtained by dividing a function that is larger than the size of one virtual memory sub-page, that are each smaller than or equal to the size of one virtual memory sub-page.
  • the sub-functions are stored at respective virtual memory sub-pages mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
  • the location of each sub-function is stored in a mapping data structure for runtime execution of the function.
  • the application is executing. Re-coloring of the application may be performed at runtime.
  • one or more of the binary blocks are dynamically moved from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster.
  • the mapping between virtual memory sub-pages and clusters of physical memory blocks is updated according to the dynamic move.
  • Runtime relocation of the dereferencing data structure may be performed when no pointers to the actual elements of the data structure are stored by the code. For example, the code is prevented from saving pointers to the data structure elements. Access to the data structure elements is provided via indexes.
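Why index-based access permits this runtime relocation may be sketched as follows: since code never holds raw pointers into the data structure, moving a sub-array and updating the dereferencing table is invisible to readers. This is an illustrative model with a tiny assumed sub-page capacity; the relocation is simulated as a fresh copy.

```python
# Sketch: access to data structure elements goes through indexes and
# the dereferencing table, so a sub-array can be relocated at runtime
# without invalidating any pointer held by the code.
ELEMS = 4  # illustrative tiny sub-page capacity, in elements

def deref(table, index):
    """Index-based access through the dereferencing table."""
    return table[index // ELEMS][index % ELEMS]

def recolor(table, sub_idx):
    """Relocate one sub-array (simulated as a fresh copy at a new
    location) and update the dereferencing table entry."""
    table[sub_idx] = list(table[sub_idx])
```

After `recolor`, the same index still resolves to the same element, because only the table entry changed.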
  • FIG. 11 is a schematic depicting additional exemplary components of a runtime 1102 and/or operating system 1104 and/or memory management 1106 for loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
  • Additional and/or modified components of runtime 1102 include:
  • Load-time and run-time symbol relocation 1108 for dynamically moving binary block(s) from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster and updating a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the move.
  • Additional and/or modified components of executable binary loader 1112 of operating system 1104 include:
  • Additional and/or modified components of memory management 1106 include:
  • Coloring allocator 1116 that performs an allocation of virtual memory page(s) for the application according to clusters, as described herein.
  • FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention. It is noted that the method is not necessarily limiting.
  • the division is performed for a source program 202 by a compiler 204 to create an object code program 206, as follows:
  • a parser unit parses source program 202, for example, according to common compiler practice.
  • an intermediate code conversion unit performs intermediate code conversion, for example, according to common compiler practice.
  • an optimization unit 223 performs optimization of the intermediate code, for example, according to common compiler practice.
  • a code generation unit generates code, for example, according to common compiler practice.
  • functions larger than the size of one virtual memory sub-page are divided into sub-functions that are each smaller than the size of one virtual memory sub-page.
  • LLVM and GCC are exemplary compiler frameworks that are production quality and commonly used in software development.
  • LLVM and GCC implement function outlining.
  • An example implementation of outlining is in the framework called OpenMP.
  • An example of code that can be outlined is a loop.
  • the compilation outputs an object file 206 with one section per sub-function and associated relocatable code (i.e., relocations). Relocation symbols may be defined in the .reloc section.
  • the jump tables define how the blocks which are loaded into non-contiguous memory areas are linked to one another.
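The jump-table linkage described above may be sketched as follows. This is an illustrative model: block identifiers and addresses are made up for the example, and a real jump table would hold machine addresses emitted by the linker.

```python
# Sketch: blocks loaded into non-contiguous memory areas are linked to
# one another through a jump table, which records where each block's
# sub-page was actually loaded.
def build_jump_table(block_addresses):
    """Map each block id to the address its sub-page was loaded at."""
    return dict(block_addresses)

def call_target(jump_table, block_id, offset_in_block):
    """Resolve a cross-block call: table lookup plus intra-block offset."""
    return jump_table[block_id] + offset_in_block
```

Call sites thus remain valid even when a block is reloaded at a different sub-page, because only the table entry changes.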
  • the sectioning unit helps the compiler divide code and/or data objects in the size of a sub-page.
  • a packing unit of linker 208 packs code functions from object code program 206 in the minimum size according to one virtual memory sub-page (e.g., 4 kB).
  • the information about functions may be maintained or discarded.
  • the packing creates the order in which functions are placed by the supervisor software within a cluster of virtual memory sub-pages. Padding may be applied to avoid function spanning across multiple virtual memory sub-pages (e.g., over 4 kB).
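The packing-with-padding step described above may be sketched as follows, assuming the 4 kB sub-page size mentioned in the text. This is an illustrative next-fit placement, not the specification's exact algorithm.

```python
# Sketch: functions are placed in order into sub-page-sized slots,
# padding to the next sub-page boundary whenever a function would
# otherwise span across multiple virtual memory sub-pages.
SUB_PAGE_SIZE = 4096  # 4 kB, as in the text

def pack_functions(sizes):
    """Return the start offset of each function, padding so that no
    function crosses a sub-page boundary."""
    offsets, cursor = [], 0
    for size in sizes:
        assert size <= SUB_PAGE_SIZE  # guaranteed by function outlining
        room = SUB_PAGE_SIZE - (cursor % SUB_PAGE_SIZE)
        if size > room:               # pad: start on the next sub-page
            cursor += room
        offsets.append(cursor)
        cursor += size
    return offsets
```

The resulting order is the order in which the supervisor software later places the functions within a cluster of virtual memory sub-pages.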
  • a jump table generation unit computes the jump table.
  • a linking generation unit performs linking according to standard linker practice; the linking generation unit assumes a single continuous .text section, since each program block is continuous in the virtual address space without overlapping one another, as defined by the jump table.
  • a relocation and symbol generation unit saves the relocation information in the executable binary 212, along with symbol information.
  • an additional metadata generation unit adds a tag to the executable binary 212, that acts as an indication to the program loader that executable binary 212 has been compiled to support page coloring with huge pages, and is therefore amenable to load-time block relocation.
  • FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. It is noted that the .text section is described as one example, with the operation principles of the method applicable to other executable binary sections.
  • Machine code 212 is received by supervisor software 214.
  • Machine code 212 is created based on the method described with reference to FIG. 12.
  • Executable binary loader 216 of supervisor software 214 may exist as part of the operating system (OS), and/or loaded by the OS in the same address space of the application.
  • the implementation depicted herein (which is not necessarily limiting) is based on executable binary loader 216 implemented within the OS.
  • the executable binary loader 216 performs the following:
  • the header parsing unit reads the set of headers that describe the executable binary file and parses the content of the description, in accordance with standard supervisor software practice and/or executable binary loader practice.
  • the binary file is checked for the tag that indicates that the binary has been compiled for page coloring with (optionally huge) virtual memory pages (e.g., the tag is created by the compiler to distinguish the type of compilation, for example, as described with reference to act 231 of FIG. 12). It is noted that the tag is an exemplary implementation and not necessarily limiting.
  • the binary file may further be checked to verify that no code function is larger than the size of one virtual memory sub-page (e.g., 4 kB).
  • the binary file may further be checked to verify that the relocation symbols are available in the executable.
  • the generate page color allocation unit determines the colors that are to be assigned to the .text section.
  • the page/huge page memory allocation unit allocates a certain number of virtual memory pages (e.g., huge pages) for the binary and loads the entire .text section at the beginning of the allocated memory.
  • the function/data relocation unit moves each .text section block (of size of one virtual memory sub-page or less, e.g., 4 kB) to a virtual memory page which respects the assigned coloring, saving the offset for each page.
  • a scheduler unit schedules execution of the application, according to common supervisor software practice.
  • a runtime binary loader 218 of program 220 performs symbol relocations according to common runtime binary loader practice.
  • runtime binary loader 218 uses the relocation information to pass through the entire .text section to change function pointers at runtime by generating runtime jump tables.
  • when the start address of the program is changed due to coloring, the start address is updated.
  • control is passed to the application that begins running.
  • the terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of".
  • The term "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
  • The term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
  • range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.


Abstract

There is provided a compiler configured to: receive pre-compilation code for compilation, wherein the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page corresponding to one physical memory block that is mapped to a virtual memory page, divide the pre-compilation code into blocks that, when compiled into respective executable binary blocks, are each less than or equal to the size of a virtual memory sub-page, compile the blocks into executable binary blocks; and link the executable binary blocks into a program and include a designation of the executable binary blocks for loading of the program by supervisor software into allocated virtual memory page(s) by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.

Description

SYSTEMS FOR COMPILING AND EXECUTING CODE WITHIN ONE OR MORE VIRTUAL MEMORY PAGES
BACKGROUND
The present invention, in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
In the framework of multiprocessors/multicore processors which have a high number of cores, and/or software that precludes the co-execution of multiple logical execution units (tasks), sharing access among execution entities to memory resources is increasingly important for performance and energy efficiency reasons. Memory resources include, for example, processor caches, including one or more of: L1, L2, L3, and L4 (e.g., L1, L1-L2, L1-L3, and L3-L4) (the highest level is termed last level cache (LLC)), the processor memory bus/ring that interconnects multiple groups/clusters via their LLC, and the memory controller and its (parallel) interconnections to the parallel memory elements (banks).
In order to partition the usage of memory resources among different execution entities, different techniques have been introduced, including page-coloring, a software-only technology that requires virtual memory to be implemented. In the case of cache partitioning, page-coloring requires physically indexed and tagged caches, at least at the LLC. In the case of memory bandwidth partitioning, page-coloring may require software configuration of bank interleaving.
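How a page color is derived from a physical address in a physically indexed cache may be sketched as follows. The bit positions are illustrative assumptions (4 kB blocks, 16 colors); real values depend on the cache geometry of the particular processor.

```python
# Sketch: in a physically indexed LLC, the cache-set index bits just
# above the block offset select the page color, so controlling which
# physical blocks back a page controls which cache sets it occupies.
BLOCK_BITS = 12  # illustrative: 4 kB blocks
COLOR_BITS = 4   # illustrative: 16 colors from LLC set-index bits

def page_color(physical_address):
    """Color = set-index bits just above the 4 kB block offset."""
    return (physical_address >> BLOCK_BITS) & ((1 << COLOR_BITS) - 1)
```

Two physical blocks with the same color compete for the same cache sets; assigning disjoint colors to different execution entities partitions the cache between them.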
SUMMARY
It is an object of the present invention to provide an apparatus, systems, methods, and/or code instructions for compiling code for runtime execution within virtual memory sub-pages of a virtual memory page(s), and/or for loading code for execution within virtual memory sub-pages of a virtual memory page(s).
The foregoing and other objects are achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.
According to a first aspect, an apparatus for compiling code for runtime execution within a plurality of virtual memory sub-pages of at least one virtual memory page, comprises: a compiler executable by a processor, the compiler configured to: receive pre-compilation code for compilation, wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, the size of each physical memory block is the size of a virtual memory sub-page, divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block, compile the plurality of blocks into the plurality of executable binary blocks, and link the plurality of executable binary blocks into a program and include a designation of the plurality of executable binary blocks for loading of the program by supervisor software into an allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and allocated plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
According to a second aspect, an apparatus for loading code for execution within a plurality of virtual memory sub-pages of at least one virtual memory page, comprises: a processor, a memory storing code instructions for execution by the processor, comprising: code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page, code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size, code to load the plurality of blocks of the binary file of the application into the allocated at least one virtual memory page, wherein the plurality of blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated plurality of clusters of physical memory blocks.
The systems, apparatus, methods, and/or code instructions described herein extend page coloring (i.e., clustering) to huge virtual memory pages. The implementation of the systems, apparatus, methods, and/or code instructions described herein is transparent to executing code (e.g., the program, application).
The implementation of the systems, apparatus, methods, and/or code instructions described herein is based on system software.
Existing (e.g., legacy) code (e.g., programs, applications) may be re-compiled to utilize the implementation based on the systems, apparatus, methods, and/or code instructions described herein. New programs designed for implementation based on the systems, apparatus, methods, and/or code instructions described herein are not necessarily required.
The systems, apparatus, methods, and/or code instructions described herein provide a software-based solution based on modification of the system software (e.g., operating system code, runtime code, compiler code, and/or linker code). The software-based solution does not necessarily require any modification of processing hardware and/or addition of new processing hardware, and may be executed by existing processing hardware, for example, in comparison to other proposed solutions that are based on at least some modification of processing hardware and/or new hardware component(s).
The systems, apparatus, methods, and/or code instructions described herein, being based on software without requiring modification of hardware and/or new hardware provides scalability, for example, in comparison to other attempts based on new and/or modified hardware that are limited in scalability by the hardware. The problem of scalability may be further explained as follows. Commodity processor instruction set architectures (ISAs) offer only a limited/fixed number of ways for the last level cache (LLC) independently of the number of cores available on the multicore processor, hence the scalability issue (note that there exist other approaches that rely on hardware extensions to partition the cache to different physical or logical execution units, an example is way-partitioning).
In a further implementation form of the first aspect, the compiler is further configured to divide a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by the supervisor software within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size. In a further implementation form of the first aspect, the compiler is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
In a further implementation form of the first aspect, the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler is further configured to divide the data storage structure into a plurality of sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled.
In a further implementation form of the first aspect, the compiler is further configured to create a dereferencing data structure for accessing each element of each sub-data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
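The dereferencing data structure described in this implementation form can be pictured as a table of per-sub-structure base addresses, with each element access split into a table index and an intra-sub-structure offset. The following Python sketch is purely illustrative (the class and method names, and the one-address-unit element size, are assumptions for clarity, not part of the claims):

```python
class SplitArray:
    """Array divided into sub-page-sized sub-arrays, accessed through a
    dereferencing table of per-sub-array base addresses (an element size
    of one address unit is assumed for simplicity)."""

    def __init__(self, elems_per_subpage, bases):
        self.n = elems_per_subpage
        self.bases = bases  # base address of each sub-array

    def address_of(self, i):
        # The table index selects the sub-array; the remainder is the
        # offset added within it, mirroring the runtime offset addition
        # performed by the dereferencing data structure.
        return self.bases[i // self.n] + (i % self.n)

a = SplitArray(512, bases=[0x1000, 0x9000, 0x3000])
assert a.address_of(0) == 0x1000           # first element of sub-array 0
assert a.address_of(512) == 0x9000         # first element of sub-array 1
assert a.address_of(1030) == 0x3000 + 6    # element 6 of sub-array 2
```

Note that the sub-array base addresses need not be contiguous, which is precisely what allows each sub-structure to reside in a physical memory block of its assigned cluster.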
In a further implementation form of the first aspect, the compiler is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
In a further implementation form of the first aspect, the compiler is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
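The stack-growth scheme of these implementation forms can be sketched as follows: a new frame advances the stack pointer within the current sub-page-sized stack block, and jumps to the base of the next allocated block when the frame would cross a block boundary. All names, the 4 kB block size, and the upward-growing stack are illustrative assumptions:

```python
SUBPAGE = 4096  # assumed sub-page-sized stack block

def push_frame(stack_pointer, frame_size, next_block_base):
    """Advance the stack pointer by one frame; when the frame would cross
    the current sub-page-sized stack block, continue at the base of the
    next allocated block instead (an upward-growing stack is assumed)."""
    if (stack_pointer % SUBPAGE) + frame_size > SUBPAGE:
        stack_pointer = next_block_base  # offset into the next block
    return stack_pointer + frame_size

sp = push_frame(0x7000_0100, 256, 0x7100_0000)
assert sp == 0x7000_0200   # frame fits within the current block
sp = push_frame(0x7000_0F00, 512, 0x7100_0000)
assert sp == 0x7100_0200   # frame relocated to the next allocated block
```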
In a further implementation form of the first aspect, the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor.
In a further implementation form of the first aspect, each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
In a further implementation form of the second aspect, the apparatus further comprises code to dynamically move at least one of the plurality of binary blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move. In a further implementation form of the second aspect, the apparatus further comprises code to populate data of a dereferencing data structure for accessing each element of sub-data storage structures of a data storage structure, wherein the dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
In a further implementation form of the second aspect, the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
In a further implementation form of the second aspect, the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page, at a respective virtual memory sub-page mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
In a further implementation form of the second aspect, in an implementation of the processor lacking a paging mechanism, the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that make up the size of a virtual memory page.
Unless otherwise defined, all technical and/or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the invention pertains. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of embodiments of the invention, exemplary methods and/or materials are described below. In case of conflict, the patent specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and are not intended to be necessarily limiting.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
Some embodiments of the invention are herein described, by way of example only, with reference to the accompanying drawings. With specific reference now to the drawings in detail, it is stressed that the particulars shown are by way of example and for purposes of illustrative discussion of embodiments of the invention. In this regard, the description taken with the drawings makes apparent to those skilled in the art how embodiments of the invention may be practiced.
In the drawings:
FIG. 1 is a schematic depicting how page colors are arranged in a physical address space, to help in understanding the technical problem addressed by some implementations of the present invention;
FIG. 2 is a schematic depicting an application using virtual pages of three different colors, to help in understanding the technical problem addressed by some implementations of the present invention;
FIG. 3 is a schematic depicting an example of an application that uses virtual memory paging with at least one huge virtual memory page, in accordance with some embodiments of the present invention;
FIG. 4 is a schematic of a block diagram of a system that includes a computing device for compiling code for runtime execution within virtual memory sub-pages and/or for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention;
FIG. 5 is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention;
FIG. 6 is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention;
FIG. 7 is a schematic depicting division of an example .text section into multiple sub-functions, in accordance with some embodiments of the present invention; FIG. 8 is a schematic depicting a dereferencing table for accessing each element of sub-arrays which are obtained by dividing an array, in accordance with some embodiments of the present invention;
FIG. 9 is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure, in accordance with some embodiments of the present invention;
FIG. 10 is a schematic depicting additional exemplary components of a compiler and a linker for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention;
FIG. 11 is a schematic depicting additional exemplary components of a runtime and/or operating system and/or memory management for loading code for execution within virtual memory sub-pages, in accordance with some embodiments of the present invention;
FIG. 12 is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention; and
FIG. 13 is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention.
DETAILED DESCRIPTION
The present invention, in some embodiments thereof, relates to virtual memory management and, more specifically, but not exclusively, to systems and methods for clustering sub-pages of virtual memory pages.
As used herein, the term cluster or clustering and the word color or coloring are interchangeable. For example, each cluster is assigned a certain color.
As used herein, the term huge virtual memory page refers to a virtual memory page that is larger than the standard physical memory page size defined by the hardware implementation. It is noted that different implementations may refer to huge pages with other terms, for example, large pages.
As used herein, the term standard size virtual memory page refers to a virtual memory page defined by the hardware as the minimum unit of translation. The size of each physical memory block is the size of a virtual memory sub-page. The terms huge virtual memory page, standard virtual memory page, and virtual memory page are sometimes used interchangeably.
An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for compiling pre-compilation code for runtime execution within virtual memory sub-pages of virtual memory page(s). The size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page. The pre-compilation code is divided into blocks, such that each block when compiled into a respective executable binary block is less than or equal to the size of a virtual memory sub-page (of the virtual memory page corresponding to the size of one physical memory block). The blocks are compiled into executable binary blocks. The executable binary blocks are linked into a program. The program includes a designation of the executable binary blocks for loading of the program by supervisor software into an allocated virtual memory page. The supervisor software loads the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page and allocated clusters of physical memory blocks. Each block is of a size corresponding to a virtual memory sub-page size, for example, 4 kilobytes (kB), which is the smallest page size available for processors based on the x86 architecture.
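The division of the compiled code into sub-page-sized blocks described above amounts to a simple ceiling split. The following Python sketch illustrates only the arithmetic (the function name and the 4 kB sub-page size are illustrative assumptions, not part of the claims):

```python
SUBPAGE_SIZE = 4 * 1024  # 4 kB, the smallest x86 page size (assumed here)

def divide_into_blocks(code: bytes, subpage_size: int = SUBPAGE_SIZE):
    """Split compiled code into blocks no larger than one virtual memory
    sub-page, so that each block fits one physical memory block of a cluster."""
    return [code[i:i + subpage_size] for i in range(0, len(code), subpage_size)]

blocks = divide_into_blocks(bytes(10_000))
assert len(blocks) == 3                              # ceil(10000 / 4096)
assert all(len(b) <= SUBPAGE_SIZE for b in blocks)   # each block fits a sub-page
```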
An aspect of some embodiments of the present invention relates to an apparatus, systems, methods, and/or code instructions (stored in a data storage device executable by one or more hardware processors) for loading code for execution within virtual memory sub-pages of virtual memory page(s). A binary file of an application divided into blocks is identified. A size of each block is less than or equal to a size of a virtual memory sub-page. An initial allocation of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size is retrieved for the application. An allocation of virtual memory page(s) for the application is received. The size of the virtual memory page is mapped to an equal size of contiguous physical memory areas. The virtual memory page includes virtual memory sub-pages mapped to the clusters of physical memory blocks. The size of each block corresponds to the size of a virtual memory sub-page. The blocks of the binary file of the application are loaded into the allocated virtual memory page(s). The blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
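The loading step of the second aspect can be pictured as drawing, for each virtual memory sub-page, a free physical memory block from the cluster that the mapping assigns to that sub-page. A minimal Python sketch (the function name and the dictionary-based bookkeeping are illustrative assumptions; the actual memory copy is only indicated by a comment):

```python
def load_blocks(blocks, free_blocks_by_cluster, cluster_of_subpage):
    """Place each executable binary block into a free physical memory block
    of the cluster mapped to its virtual memory sub-page; return the
    resulting sub-page -> physical block mapping."""
    mapping = {}
    for subpage_index, block in enumerate(blocks):
        cluster = cluster_of_subpage[subpage_index]
        phys_block = free_blocks_by_cluster[cluster].pop(0)
        # The copy of `block` into the physical block would happen here.
        mapping[subpage_index] = phys_block
    return mapping

free = {1: [0x1000, 0x4000], 2: [0x2000, 0x5000]}
m = load_blocks([b"b0", b"b1", b"b2"], free, {0: 1, 1: 2, 2: 1})
assert m == {0: 0x1000, 1: 0x2000, 2: 0x4000}
```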
Virtual memory sub-pages mapped to respective clusters of memory blocks may be located non-contiguously within the virtual memory page. Virtual memory sub-pages of different clusters may be contiguous with one another, optionally in a repeating pattern, for example, for three defined clusters arranged as: 1, 2, 3, 1, 2, 3, 1, 2, 3.
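The repeating arrangement of clusters across sub-pages reduces to modular arithmetic. A minimal sketch, assuming sub-pages are numbered from zero within the virtual memory page (the function name is an illustrative assumption):

```python
def subpage_cluster(subpage_index: int, num_clusters: int) -> int:
    """Cluster (color) of a sub-page under a periodically repeating pattern."""
    return subpage_index % num_clusters

# Three defined clusters repeat as 1, 2, 3, 1, 2, 3, 1, 2, 3 (1-based labels).
pattern = [subpage_cluster(i, 3) + 1 for i in range(9)]
assert pattern == [1, 2, 3, 1, 2, 3, 1, 2, 3]
```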
The apparatus and/or system described herein address the technical problem of combining software-based memory page clustering (also referred to herein as page coloring) and hardware-based huge memory pages in an efficient and operable manner. Such a combination is not practically possible with current hardware and software architectures. A brief explanation of the current state of the art and the resulting incompatibility of software-based page coloring and hardware-based huge memory pages is now provided.
Current multicore/multiprocessor computers are ubiquitous. Such computer architectures provide improved performance compared to their predecessors by enabling the parallel execution of software on multiple hardware computer devices. However, to enable multiple computer devices to share the same data, which resides in memory, all of the computer devices need to access the same memory locations, which are usually mediated by the (hardware) last level cache. When the last level cache is shared among computer devices, performance issues result due to unfair usage of the last level cache by the different software applications running atop the computer devices (i.e., cores or CPUs). The unfair usage may degrade performance of each application, especially in cases where application code is memory-bound (i.e., a large number of memory accesses are performed) and the memory access pattern is characterized by temporal locality. Page coloring techniques, which are currently implemented as pure software, are used to fairly share the last level cache and reduce application interference.
Generally, current software applications use virtual memory provided by a paging mechanism of the computing device. The minimum granularity of virtual-to-physical translation is a standard page. The standard page is a small page; there can be other small pages defined by the hardware. When an application operates on a wide memory area, the usage of small pages (which may be greater than the size of a standard page) significantly impacts performance due to the high cost of virtual memory translations. A high number of page translations may result in a high number of misses in the TLB cache, requiring numerous memory accesses to fetch each translation (i.e., an operation termed a page walk). Hardware huge pages are implemented to solve the described problem by reducing TLB misses. Software-based page coloring is incompatible with hardware-based huge pages because page coloring is designed to operate at the smallest predefined and/or standard page granularity. Based on existing technology, an attempt to extend the technique of software-based page coloring to hardware-based huge-page coloring may result either in an extremely small number of colors, or no colors at all, which effectively eliminates any potential benefits of implementing coloring.
The apparatus, system, methods, and/or code instructions (stored in a data storage device executed by one or more processors) described herein effectively implement a combination of coloring and huge pages in a manner that improves performance and/or deterministic execution of applications running concurrently on the same computer device.
A brief discussion of other attempts at combining coloring and huge pages is now provided, to help understand the addressed technical problem and described solution. One described strategy is to implement a hardware-based solution to the problem. However, such hardware-only solutions require manufacturing of new hardware processors designed to enable the exploitation of page coloring combined with huge pages. In general, such solutions are complicated and not practical for implementation due to technical difficulty in design and/or manufacturing. Moreover, such solutions are not generic enough to cover expected possible application demands.
Another attempt at addressing the technical problem of combining page coloring and huge pages in an operable manner is termed Cache Allocation Technology (CAT) of Intel®. CAT is designed to transparently support huge pages. However, CAT cannot be easily controlled and/or generally implemented, since the solution is designed specifically for the processors produced by Intel® based on the x86 architecture. Moreover, CAT cannot scale to a high number of applications.
Before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not necessarily limited in its application to the details of construction and the arrangement of the components and/or methods set forth in the following description and/or illustrated in the drawings and/or the Examples. The invention is capable of other embodiments or of being practiced or carried out in various ways.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention. The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network.
The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.
The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.
Reference is now made to FIG. 1, which is a schematic depicting how page colors (i.e., clusters) are arranged in a physical address space 102, to help in understanding the technical problem addressed by some implementations of the present invention. FIG. 1 depicts the traditional page coloring (i.e., clustering) approach that uses virtual memory to group physically scattered memory pages of the same color together within the same virtual address range. Page colors are periodically repeated. For example, one set of virtual memory pages, corresponding to physical memory blocks of page size, is assigned physical memory pages having the color blue (e.g., cluster 1) 104. It is noted that the page size is a standard page size as defined by the processor (e.g., 4 kB in x86 architectures). Another set of virtual memory pages is assigned physical memory pages having the color green (e.g., cluster 2) 106. It is noted that the labels blue and green are meant as tags to identify the clusters, and do not reflect actual colors of the memory. The colors blue and green are periodically repeated. Pages with the same color have a constant offset 108 in the physical address space.
Reference is now made to FIG. 2, which is a schematic depicting an application (App 1) using virtual pages of three different colors (i.e., clusters), blue 282, green 284, and yellow 286, to help in understanding the technical problem addressed by some implementations of the present invention. A virtual memory subsystem (a component implemented in hardware and/or software) enables the application to organize scattered physical pages of a physical address space 288 into linear (virtual) memory ranges of a virtual address space 290. Note that a specific page color organization is shown in FIG. 2, but it is to be understood that there are multiple possible organizations.
One technical problem with the implementation of virtual memory page coloring with virtual memory huge pages is that the coloring associates one color per page, independently of the size of the page. Therefore, with virtual memory pages one page corresponds to one color, and with virtual memory huge pages, one huge page corresponds to one color. Because a huge virtual memory page incorporates multiple standard pages, pages of multiple colors are integrated into a single huge page that is mapped 1:1 from physical memory to virtual memory. This implies that in a system in which some applications use virtual memory pages but also virtual memory huge pages, page coloring is incompatible with virtual huge page coloring, due to the fact that the huge page integrates multiple pages of all possible colors.
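The incompatibility can be checked numerically. Assuming 4 kB standard pages, 2 MB huge pages (typical for x86), and an illustrative count of 64 colors derived from the physical page frame number (the exact count depends on the cache geometry), a single huge page necessarily contains standard pages of every color:

```python
PAGE_SHIFT = 12                    # 4 kB standard page
HUGE_PAGE_SIZE = 2 * 1024 * 1024   # 2 MB huge page (x86)
NUM_COLORS = 64                    # illustrative, cache-geometry dependent

def page_color(phys_addr: int) -> int:
    """Color of the 4 kB page containing a physical address."""
    return (phys_addr >> PAGE_SHIFT) % NUM_COLORS

base = 0x4000_0000  # assumed huge-page-aligned physical address
colors = {page_color(base + i * 4096) for i in range(HUGE_PAGE_SIZE // 4096)}
assert len(colors) == NUM_COLORS   # one huge page spans every color
```

Since the huge page is translated as a single unit, the operating system cannot choose which of those colors the application receives, which is why naive huge-page coloring degenerates to one color per huge page.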
Reference is now made to FIG. 3, which is a schematic depicting an example of an application (App 1) that uses paging with at least one virtual memory page coloring (i.e., clustering) within a huge virtual memory page, in accordance with some embodiments of the present invention. A huge page 302 within a physical address space 304 may be located anywhere within the application’s assigned virtual address space 306. However, the colored sub-pages (e.g., one set 308 depicted for clarity) are fixed within huge page 302.
Reference is now made to FIG. 4, which is a schematic of a block diagram of a system 400 that includes a computing device 402 for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s) of a virtual memory 404 and/or for loading code for execution within virtual memory sub-pages of virtual memory page(s) of virtual memory 404, in accordance with some embodiments of the present invention. Reference is also made to FIG. 5, which is a flowchart of a method of compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention. Reference is also made to FIG. 6, which is a flowchart of a method of loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention. The methods of FIG. 5 and/or FIG. 6 may be implemented by code stored in data storage device 412 executed by processor(s) 406. Data storage device 412 may be implemented as random access memory (RAM), or code may be moved from data storage device 412 to RAM for execution by processor(s) 406. For example, the method of FIG. 5 may be implemented by compiler code 412A and/or linker code 412B. The method of FIG. 6 may be implemented by loading code 412C, for example, supervisor code, application loader, and/or library loader.
As used herein, the term supervisor (e.g., code, software) and loading code may be interchanged.
It is noted that compiler code 412A and linker code 412B may be implemented as a single component referred to herein as a compiler. Alternatively, the compiler and linker are implemented as distinct components.
It is noted that different architectures of computing device 402 may be implemented. For example, the same computing device 402 may compile code (or re-compile previously compiled code) for runtime execution within virtual memory sub-pages of virtual memory page(s), and load the compiled code for execution within virtual memory sub-pages of virtual memory page(s). Alternatively, one computing device 402 performs the compilation of the code, for example, for locally stored code, for code transmitted by client terminal(s) and/or server(s), and/or providing remote services to client terminal(s) and/or server(s) (e.g., via a software interface such as an application programming interface (API), software development kit (SDK), a web site interface, and an application interface that is loaded on the client terminal and/or server). The compiled code may be provided for execution within virtual memory sub-pages of virtual memory page(s) of another computing device, for example, by the client terminal(s) and/or server(s) that provided the code for compilation, and/or by another client terminal and/or server that receives the compiled code for local execution.
Optionally, processor(s) 406 includes a paging mechanism 416 that maps between virtual memory 404 and physical memory 408. It is noted that virtual memory 404 represents an abstraction and/or a virtual component, since virtual memory 404 does not represent an actual physical virtual memory device. Paging mechanism 416 may be implemented in hardware. When an implementation of processor(s) lacks a paging mechanism, the virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of contiguous physical memory blocks that make up the size of a virtual memory page. Optionally, the physical memory block offset to the beginning of the contiguous physical memory blocks is the same as the offset that the virtual memory sub-page has to the beginning of the virtual memory page. In a processor without a paging mechanism there is no virtual page concept. Virtual memory sub-pages are physical memory blocks. Virtual memory pages are a collection of contiguous physical memory blocks. The systems, apparatus, methods, and/or code instructions described herein enable page coloring without necessarily requiring a virtual memory subsystem.
Computing device 402 may be implemented as, for example, one or more of: a single computing device (e.g., client terminal), a group of computing devices arranged in parallel, a network server, a web server, a storage server, a local server, a remote server, a client terminal, a mobile device, a stationary device, a kiosk, a smartphone, a laptop, a tablet computer, a wearable computing device, a glasses computing device, a watch computing device, a desktop computer, and an internet of things (IoT) device.
Processor(s) 406 may be implemented as, for example, central processing unit(s) (CPU), graphics processing unit(s) (GPU), field programmable gate array(s) (FPGA), digital signal processor(s) (DSP), application specific integrated circuit(s) (ASIC), customized circuit(s), microprocessing unit(s) (MPU), processors for interfacing with other units, and/or specialized hardware accelerators. Processor(s) 406 may be implemented as a single processor, a multi-core processor, and/or a cluster of processors arranged for parallel processing (which may include homogenous and/or heterogeneous processor architectures).
Physical memory device(s) 408 and/or data storage device 412 are implemented, for example, as one or more of: a random access memory (RAM), read-only memory (ROM), and/or a storage device, for example, non-volatile memory, magnetic media, semiconductor memory devices, hard drive, removable storage, and optical media (e.g., DVD, CD-ROM).
It is noted that paging mechanism 416 is the memory component that creates virtual memory 404 from physical memory 408 and/or data storage device 412.
Computing device 402 may be in communication with a user interface 414 that presents data and/or includes a mechanism for entry of data, for example, one or more of: a touch-screen, a display, a keyboard, a mouse, voice activated software, and a microphone. User interface 414 may be used to configure parameters, for example, define the size of each virtual memory sub-page, and/or define the number of available clusters.
Reference is now made to FIG. 5, which is a flowchart of a method for compiling code for runtime execution within virtual memory sub-pages of virtual memory page(s). The size of each virtual memory sub-page is at least as large as a predefined size of a physical memory block associated with the processor. It is noted that in high-level languages (e.g., C/C++, Fortran, Java, Python, and the like) the machine code is outputted by the compiler. Modifications to the machine code based on the method described with reference to FIG. 5 are transparent to the programmer. It is noted that the compiler assumes that the application will be run on virtual memory.
At 502, pre-compilation code is received for compilation by the compiler. The pre-compilation code may include source code, i.e., text-based code written by a programmer. The pre-compilation code may include object code that is already compiled but not yet linked. The pre-compilation code may include an internal representation of the code within the compiler. The source code may be written in different programming languages. The pre-compilation code may be new code for a first-time compilation, or may include old code (e.g., a legacy application) that has been previously compiled but is now being re-compiled for runtime execution within virtual memory sub-pages of virtual memory page(s).
The size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page. The virtual memory sub-page corresponds to one of multiple physical memory blocks that are mapped to a virtual memory page. The size of each physical memory block is the size of a virtual memory sub-page.
At 504, the pre-compilation code, which cannot fit into one virtual memory sub-page when compiled, is divided into blocks. Each block, when compiled into a respective executable binary block, has a size less than or equal to the size of a virtual memory sub-page of the virtual memory page corresponding to the size of one physical memory block.
Each binary block is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page. Blocks may be relocated at runtime by moving each block from one area of physical memory to another area of the physical memory. Since each block is mapped to a virtual memory sub-page, a block is moved from one virtual memory sub-page to another virtual memory sub-page. Blocks may be moved to a contiguous virtual memory sub-page, or to another virtual memory sub-page that is non-contiguous. For example, a block in the virtual memory sub-page labeled 1234 may be moved to virtual memory sub-page 1235, or to virtual memory sub-page 123456789.
Non-limiting methods for dividing some exemplary data structures that cannot fit into one virtual memory sub-page when compiled are now discussed. It is to be understood that other data structures not explicitly discussed herein may be divided based on similar principles.
Optionally, a function of a .text section of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, is divided into multiple sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks. The executable binary blocks of the divided function of the .text, when loaded into memory for program execution as described with reference to FIG. 6, are placed by the loading code (e.g., supervisor software) within a cluster of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
Reference is now made to FIG. 7, which is a schematic depicting division of an example .text section 702 into multiple sub-functions, in accordance with some embodiments of the present invention. .text section 702 includes three functions, fun_a(), fun_b(), and fun_c(). Schematic 704 depicts a standard implementation based on existing methods, in which .text section 702 is placed into physical memory as a continuous set of code spanning across multiple corresponding virtual memory sub-pages (one virtual memory sub-page marked 706 for clarity). Functions fun_a(), fun_b(), and fun_c() are stored contiguously. Schematic 708 depicts a division of .text 702 into three sub-functions fun_a(), fun_b(), and fun_c(), where the .text portion of each sub-function (text_a, text_b, and text_c) is placed in a common cluster (i.e., color) 710 of physical memory. The size of each .text section of each function is smaller than one virtual memory sub-page.
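The notion of a cluster (color) of sub-pages can be illustrated with a short sketch. This is a simulation only: the 4 kB sub-page size, the 2 MB huge-page size, and the round-robin striping of colors across sub-pages are assumptions borrowed from classic page coloring, not requirements of the embodiments:

```python
SUB_PAGE = 4096               # assumed sub-page size in bytes
HUGE_PAGE = 2 * 1024 * 1024   # assumed huge-page size in bytes

def cluster_subpages(color, n_colors):
    """Return the sub-page indices of one cluster (color) within a huge
    page, assuming colors stripe round-robin across the sub-pages."""
    n_subpages = HUGE_PAGE // SUB_PAGE   # 512 sub-pages per huge page
    return [i for i in range(n_subpages) if i % n_colors == color]
```

With 4 colors, cluster 1 of a 2 MB huge page contains 128 sub-pages: indices 1, 5, 9, and so on; blocks text_a, text_b, and text_c of FIG. 7 would all be loaded into sub-pages of one such list.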
Returning now to act 504 of FIG. 5, it is noted that functions (e.g., of a .text section) smaller than one virtual memory sub-page are relocatable, and do not necessarily require division.
Optionally, the entire .text segment is divided into blocks each smaller than or equal to the size of one virtual memory sub-page when compiled. A single function cannot exceed the size of one virtual memory sub-page; function outlining may be used to support this requirement. It is noted that both LLVM and GCC (the most widely used compiler toolchains) already implement function outlining.
Optionally, functions that are each smaller than the size of one virtual memory sub-page when compiled are arranged to fit entirely within one virtual memory sub-page when compiled.
Optionally, the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled. The data storage structure is divided into multiple sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled. Exemplary data structures include: array and vector.
Optionally, a dereferencing data structure (e.g., implemented as a table) stores data for accessing each element of each sub-data storage structure. The dereferencing data structure may be created and/or the data may be stored within an existing dereferencing data structure. The dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the data structure during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to a virtual memory sub-page size, allocated to the application associated with the data storage structure.
Reference is now made to FIG. 8, which is a schematic depicting a dereferencing table 802 (also referred to as subcolor_array) for accessing each element of sub-arrays (one sub-array 804 depicted for clarity) which are obtained by dividing an array, in accordance with some embodiments of the present invention. The array is stored in a virtual memory page 806, optionally a huge page. Each sub-array 804 is less than or equal to the size of one virtual memory sub-page (one sub-page 808 depicted for clarity) of virtual memory page 806.
Reference is now made to FIG. 9, which is an example of code (e.g., native code, pseudo assembly code) generated by the compiler to enable data access to one element of each sub-data storage structure (the last 4 lines), in accordance with some embodiments of the present invention. The code represents a possible ASM translation. Different ISAs may enable faster data access.
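The extra dereference generated by the compiler can also be sketched in Python. This is an illustrative simulation of the address arithmetic of FIG. 9; the 4 kB sub-page size, the 4-byte element size, and the subcolor_array of sub-array base addresses are assumptions for the example:

```python
SUB_PAGE = 4096          # assumed sub-page size in bytes
ELEM_SIZE = 4            # assumed element size (e.g., a 32-bit int)
ELEMS_PER_SUB = SUB_PAGE // ELEM_SIZE   # 1024 elements per sub-array

def access(subcolor_array, index):
    """Translate a flat array index into an address, via the
    dereferencing table: pick the sub-array, then add the byte offset."""
    sub = index // ELEMS_PER_SUB                 # which sub-array
    offset = (index % ELEMS_PER_SUB) * ELEM_SIZE # offset within it
    return subcolor_array[sub] + offset          # base address + offset
```

Because the sub-array bases in the table may point to non-contiguous sub-pages, the flat index remains valid even after the loader scatters the sub-arrays across a cluster of colors.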
A new programming language keyword _colored may be introduced to force heap-allocated data structures (whose size may be unknown at compilation time) to be accessed as described herein. For example, for an array of integers, _colored int* a = malloc(4096*sizeof(int)), which is implementable in the C/C++ programming language. The keyword may be implemented accordingly for each programming language.
Returning now to act 504 of FIG. 5, a program stack is accessed and/or managed by incrementing the program stack in divided blocks, each having a size smaller than or equal to one virtual memory sub-page. The code outputted by the compiler may be modified for accessing and/or managing the stack. A new program stack frame that updates a program stack pointer that points to each divided block is added. The new program stack frame is added by adding an offset according to the size of the virtual memory sub-pages storing the program stack during runtime and the clusters of physical memory blocks allocated to the application associated with the data storage structure. The stack may be allocated to a certain set of page colors.
An exemplary, not necessarily limiting, implementation based on the program stack described herein is now provided. When the application code calls a new function, the caller function checks the stack size. Since the argument sizes are already known at compile time, the caller code may decide to insert the new stack frame described herein after calculating the new stack position. Then, at the new location, the arguments for the called function are laid out. At that point the caller code may pass the execution to the needed function by updating the stack pointer.
When the called function returns, the called function code saves the return value for the caller. Eventually, caller-saved registers are restored and, when unwinding to the previous stack frame, the called function code notices the proposed additional stack frame. Because of the new stack frame, the returning function adjusts the stack frame pointer before giving back control to the calling function.
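The caller-side boundary check described above may be sketched as follows. This is an illustrative simulation only: the stack grows upward for simplicity, and the 4 kB block size and the free_blocks list of pre-allocated colored stack blocks are assumptions:

```python
SUB_PAGE = 4096   # assumed sub-page (stack block) size in bytes

def push_frame(stack_top, frame_size, free_blocks):
    """If the new frame would cross the current sub-page boundary, start
    it at the base of the next allocated stack block instead (mimicking
    the inserted stack frame that updates the stack pointer).
    Returns (new_stack_top, relocated_flag)."""
    block_base = stack_top - (stack_top % SUB_PAGE)
    if stack_top + frame_size > block_base + SUB_PAGE:
        new_base = free_blocks.pop(0)   # next block of the app's colors
        return new_base + frame_size, True
    return stack_top + frame_size, False
```

A frame that fits stays in the current block; a frame that would span the boundary is placed at the base of a fresh block, which need not be contiguous with the previous one.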
At 506, the blocks are compiled into executable binary blocks.
Functions (e.g., .text section) divided into blocks may be compiled with one .text section per function, which may enable quick re-coloring. A table storing the relocation data may be created for future re-coloring.
At 508, the executable binary blocks are linked into a program. A designation of the executable binary blocks may be included for loading of the program by supervisor software into allocated virtual memory page(s), by loading the executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the virtual memory page(s) and allocated clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size. The designation may be stored, for example, as metadata within the program, by a specialized data structure external to the program (e.g., a table indicating whether the program is associated with the designation), and/or as a value in a field stored by the program indicative of the designation.
At 510, the program is provided for execution. The program may be, for example, locally stored in a data storage device, and/or transmitted to another computing device (e.g., a client terminal that provided the pre-compilation code, and/or another client terminal).
Reference is now made to FIG. 10, which is a schematic depicting additional exemplary components of compiler 412A and linker 412B (as described with reference to FIG. 4) for compiling code for runtime execution within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. The components may represent a modification of the traditional compilation and/or traditional static and/or dynamic linking process for each of the main application parts.
Additional and/or modified components of compiler 412A include:
* Functional outliner 1002 for dividing a function (e.g., of a .text section) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, as described herein.
* Scattered data structure support 1004 for dividing the data storage structure into sub-data storage structures each smaller than the size of one virtual memory sub-page when compiled, as described herein.
* Stack support 1006 for accessing and/or managing a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page, as described herein.
* Defaults 1008 adds new defaults to the compiler, such as the default compilation methods that may include or exclude the support for page coloring in huge pages.
Additional and/or modified components of linker 412B include:
* Function/data packing in predefined page sizes (e.g., 4 kB) 1010 for arranging functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled, as described herein.
* Relocations and dereferencing tables 1012 for creating a dereferencing data structure for accessing each element of each sub-data storage structure and/or relocating a binary block in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page, as described herein.
* Loader hooks 1014 creates additional handles for the loader to help functionalities such as re-coloring and/or runtime coloring.
* Metadata generation 1016 for including a designation of divided executable binary blocks for appropriate loading of the program by supervisor software.
Reference is now made to FIG. 6, which is a flowchart of a method for execution of the program within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
At 602, instructions to load an application for execution are received. For example, a user clicks on an icon associated with the application, and/or another process triggers loading of the application.
At 604, a binary file of the application divided into blocks is identified, for example, based on an analysis of the designation associated with the application (as described with reference to act 508 of FIG. 5).
The size of each block of the divided application is less than or equal to a size of a virtual memory sub-page.
At 606, an initial allocation of clusters of physical memory blocks is received. Each physical memory block is of a size corresponding to a virtual memory sub-page size allocated for the application.
At 608, an allocation of virtual memory page(s) for the application is received. The size of the virtual memory page(s) is mapped to an equal size of contiguous physical memory area(s). The virtual memory page(s) include virtual memory sub-pages mapped to the clusters of physical memory blocks. Each physical memory block has a size corresponding to the size of a virtual memory sub-page.
At load time, the binary loader may allocate a virtual memory page (e.g., huge page) for the .text. The binary loader issues a request to the supervisor code for the allocated color(s). For a user-space loader, the loader may be placed at any virtual memory sub-page of any color, since the user-space loader is executed once during initialization. After allocation, the .text code may be stored in a virtual memory page (e.g., huge page), thus preserving the coloring. The code may be re-linked, including symbols. The loader may be modified to perform a runtime re-linking based on the selected colors during a re-coloring phase.
The application loader may implement a memory allocator supporting page coloring.
Page colors allocated to the application may be dynamically updated at run-time. The application address space may be dynamically updated to allocate additional virtual memory pages (e.g., huge pages) to the application.
At 610, the blocks of the binary file of the application are loaded into the allocated virtual memory page(s). The blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated clusters of physical memory blocks.
Each application is loaded with a limited number of the allocated page colors. Different applications are assigned different colors selected from all the available colors, to enable multiple applications to be loaded simultaneously.
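The assignment of disjoint color subsets to concurrently loaded applications may be sketched as follows. This is an illustrative simulation; the application names and the per-application color count are arbitrary assumptions:

```python
def assign_colors(available_colors, per_app, app_names):
    """Give each application a disjoint subset of the available page
    colors, so that several applications can be loaded simultaneously
    without sharing cache-conflicting sub-pages."""
    pool = list(available_colors)
    assignment = {}
    for app in app_names:
        if len(pool) < per_app:
            raise RuntimeError("not enough free colors")
        assignment[app], pool = pool[:per_app], pool[per_app:]
    return assignment
```

Because the subsets are disjoint, the physical memory blocks backing one application's sub-pages never belong to another application's clusters.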
Optionally, when a data structure is divided into multiple sub-data storage structures (as described with reference to act 504 of FIG. 5), the dereferencing data structure is populated with data for accessing each element of the sub-data storage structures of the data storage structure. The loader may populate the dereferencing table based on the application's assigned colors. The dereferencing data structure adds an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the sub-data structures of the data structure during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to the virtual memory sub-page size allocated to the application associated with the data storage structure. Each sub-data storage structure may be placed on page boundaries.
Optionally, a program stack is grown in blocks, each block having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime. The program stack is grown according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset. The offset is computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime, and according to the clusters of physical memory blocks, each of a size corresponding to the virtual memory sub-page size allocated to the program stack.
Optionally, the application includes sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function that is larger than the size of one virtual memory sub-page. The sub-functions are stored at respective virtual memory sub-pages mapped to a cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size. The location of each sub-function is stored in a mapping data structure for runtime execution of the function.
At 612, the application is executed. Re-coloring of the application may be performed at runtime.
Optionally, one or more of the binary blocks are dynamically moved from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster. The mapping between virtual memory sub-pages and clusters of physical memory blocks is updated according to the dynamic move.
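The dynamic move and the mapping update may be sketched as follows. This is illustrative only; blocks and sub-pages are modeled as dictionary entries rather than real memory:

```python
def move_block(phys_blocks, mapping, block_id, dst_subpage):
    """Relocate one binary block at runtime (re-coloring): copy its bytes
    from the physical block behind the old sub-page to the one behind the
    new sub-page, then update the block -> sub-page mapping."""
    src_subpage = mapping[block_id]
    phys_blocks[dst_subpage] = phys_blocks.pop(src_subpage)
    mapping[block_id] = dst_subpage
```

Since each block is a continuous, relocatable segment, no fix-up inside the block itself is needed; only the mapping (and, in the full scheme, the runtime jump tables) must be updated.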
Runtime relocation of the dereferencing data structure may be performed when no pointers to the actual elements of the data structure are stored by the code. For example, the code is prevented from saving pointers to the data structure elements. Access to the data structure elements is provided via indexes.
Reference is now made to FIG. 11, which is a schematic depicting additional exemplary components of a runtime 1102 and/or operating system 1104 and/or memory management 1106 for loading code for execution within virtual memory sub-pages of virtual memory page(s), in accordance with some embodiments of the present invention.
Additional and/or modified components of runtime 1102 include:
* Load-time and run-time symbol relocation 1108 for dynamically moving binary block(s) from a first virtual memory sub-page of a first cluster to a second virtual memory sub-page of a second cluster and updating a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the move.
* Huge-page support 1110 for identifying a binary file of the application divided into blocks, as described herein.
Additional and/or modified components of executable binary loader 1112 of operating system 1104 include:
* New executable with compiler coloring binary loader 1114 for loading the blocks of the binary file of the application into the allocated virtual memory page(s).
Additional and/or modified components of memory management 1106 include:
* Coloring allocator 1116 that performs an allocation of virtual memory page(s) for the application according to clusters, as described herein.
Reference is now made to FIG. 12, which is a flowchart depicting an exemplary implementation of dividing a function of a .text section of the pre-compilation code when compiled into sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled, in accordance with some embodiments of the present invention. It is noted that the method is not necessarily limiting.
The division is performed for a source program 202 by a compiler 204 to create an object code program 206, as follows:
At 221, a parser unit parses source program 202, for example, according to common compiler practice.
At 222, an intermediate code conversion unit performs intermediate code conversion, for example, according to common compiler practice.
At 223, an optimization unit performs optimization of the intermediate code, for example, according to common compiler practice.
At 224, a code generation unit generates code, for example, according to common compiler practice.
At 225, functions larger than the size of one virtual memory sub-page (e.g., 4 kB) are divided into sub-functions that are each smaller than the size of one virtual memory sub-page.
LLVM and GCC are exemplary compiler frameworks that are production quality and commonly used in software development. LLVM and GCC implement function outlining. An example implementation of outlining is in the framework called OpenMP. An example of code that can be outlined is a loop.
At 226, the compilation outputs an object file 206 with one section per sub-function and associated relocatable code (i.e., relocations). Relocation symbols may be defined in the .reloc section. The jump tables define how the blocks which are loaded into non-contiguous memory areas are linked to one another. The sectioning unit helps the compiler divide code and/or data objects in the size of a sub-page.
At 227, a packing unit of linker 208 (and/or a pre-linker tool) packs code functions from object code program 206 into units of at most the size of one virtual memory sub-page (e.g., 4 kB). The information about functions may be maintained or discarded. The packing creates the order in which functions are placed by the supervisor software within a cluster of virtual memory sub-pages. Padding may be applied to avoid a function spanning across multiple virtual memory sub-pages (e.g., over 4 kB).
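The packing-with-padding step may be sketched as follows. This is illustrative only; the 4 kB sub-page size is an assumption, and a real packing unit may also reorder functions, which this sketch does not:

```python
SUB_PAGE = 4096   # assumed sub-page size in bytes

def pack_functions(func_sizes):
    """Place functions back to back inside sub-pages, padding the tail of
    a sub-page whenever the next function would span a boundary.
    Returns a (sub_page_index, offset) placement per function."""
    placements, page, used = [], 0, 0
    for size in func_sizes:
        assert size <= SUB_PAGE, "functions were already outlined to fit"
        if used + size > SUB_PAGE:   # pad the remainder of this sub-page
            page, used = page + 1, 0
        placements.append((page, used))
        used += size
    return placements
```

The invariant that no function crosses a sub-page boundary is what later lets the loader move each sub-page-sized block independently to any sub-page of the assigned colors.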
At 228, a jump table generation unit computes the jump table.
At 229, a linking generation unit performs linking according to standard linker practice; the linking generation unit assumes one single continuous .text section, since each program block is continuous in the virtual address space without overlapping one another, as defined by the jump table.
At 230, a relocation and symbol generation unit saves the relocation information in the executable binary 212, along with symbol information.
At 231, an additional metadata generation unit adds a tag to the executable binary 212, which acts as an indication to the program loader that executable binary 212 has been compiled to support page coloring with huge pages, and is therefore amenable to load-time block relocation.
Reference is now made to FIG. 13, which is a flowchart of an exemplary method for execution of a .text section of an executable binary file within virtual memory sub-pages of one or more virtual memory pages, in accordance with some embodiments of the present invention. It is noted that the .text section is described as one example, with the operation principles of the method applicable to other executable binary sections.
Machine code 212 is received by supervisor software 214. Machine code 212 is created based on the method described with reference to FIG. 12.
Executable binary loader 216 of supervisor software 214 may exist as part of the operating system (OS), and/or be loaded by the OS in the same address space of the application. The implementation depicted herein (which is not necessarily limiting) is based on executable binary loader 216 implemented within the OS.
The executable binary loader 216 performs the following:
At 239, the header parsing unit reads the set of headers that describe the executable binary file and parses the content of the description, in accordance with standard supervisor software practice and/or executable binary loader practice.
At 238, the binary file is checked for the tag that indicates that the binary has been compiled for page coloring with (optionally huge) virtual memory pages (e.g., the tag is created by the compiler to distinguish the type of compilation, for example, as described with reference to act 231 of FIG. 12). It is noted that the tag is an exemplary implementation and not necessarily limiting. The binary file may further be checked to verify that no code function is larger than the size of one virtual memory sub-page (e.g., 4 kB). The binary file may further be checked to verify that the relocation symbols are available in the executable.
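The load-time checks of this act may be sketched as follows. This is illustrative; the metadata dictionary and its keys are invented names standing in for the executable's headers and tag:

```python
SUB_PAGE = 4096   # assumed sub-page size in bytes

def check_colored_binary(metadata):
    """Loader-side checks sketched from the text: the coloring tag must
    be present, no function may exceed one sub-page, and relocation
    symbols must be available in the executable."""
    return (bool(metadata.get("colored_tag"))
            and all(s <= SUB_PAGE for s in metadata.get("function_sizes", []))
            and bool(metadata.get("relocations")))
```

A binary failing any of these checks would be loaded by the conventional path instead of the coloring-aware one.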
At 237, the generate page color allocation unit determines the colors that are to be assigned to the .text section.
At 236, based on the total size of the .text section and the number of assigned colors, the page/huge page memory allocation unit allocates a certain number of virtual memory pages (e.g., huge pages) for the binary and loads the entire .text section at the beginning of the allocated memory.
At 235, the function/data relocation unit moves each .text section block (of size of one virtual memory sub-page or less, e.g., 4 kB) to a virtual memory page which respects the assigned coloring, saving the offset for each page.
At 234, a scheduler unit schedules execution of the application, according to common supervisor software practice.
At 233, a runtime binary loader 218 of program 220 performs symbol relocations according to common runtime binary loader practice.
At 232, runtime binary loader 218 uses the relocation information to pass through the entire .text section to change function pointers at runtime by generating runtime jump tables. When the start address of the program is changed due to coloring, the start address is updated.
At 222, the control is passed to the application, which begins running.
Other systems, methods, features, and advantages of the present disclosure will be or become apparent to one with skill in the art upon examination of the following drawings and detailed description. It is intended that all such additional systems, methods, features, and advantages be included within this description, be within the scope of the present disclosure, and be protected by the accompanying claims.
The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
It is expected that during the life of a patent maturing from this application many relevant compilers, linkers, and operating systems will be developed and the scope of the terms compiler, linker, and operating system is intended to include all such new technologies a priori.
As used herein the term "about" refers to ± 10 %.
The terms "comprises", "comprising", "includes", "including", "having" and their conjugates mean "including but not limited to". This term encompasses the terms "consisting of" and "consisting essentially of". The phrase "consisting essentially of" means that the composition or method may include additional ingredients and/or steps, but only if the additional ingredients and/or steps do not materially alter the basic and novel characteristics of the claimed composition or method.
As used herein, the singular form "a", "an" and "the" include plural references unless the context clearly dictates otherwise. For example, the term "a compound" or "at least one compound" may include a plurality of compounds, including mixtures thereof.
The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments.
The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". Any particular embodiment of the invention may include a plurality of "optional" features unless such features conflict.
Throughout this application, various embodiments of this invention may be presented in a range format. It should be understood that the description in range format is merely for convenience and brevity and should not be construed as an inflexible limitation on the scope of the invention. Accordingly, the description of a range should be considered to have specifically disclosed all the possible subranges as well as individual numerical values within that range. For example, description of a range such as from 1 to 6 should be considered to have specifically disclosed subranges such as from 1 to 3, from 1 to 4, from 1 to 5, from 2 to 4, from 2 to 6, from 3 to 6 etc., as well as individual numbers within that range, for example, 1, 2, 3, 4, 5, and 6. This applies regardless of the breadth of the range.
Whenever a numerical range is indicated herein, it is meant to include any cited numeral (fractional or integral) within the indicated range. The phrases "ranging/ranges between" a first indicated number and a second indicated number and "ranging/ranges from" a first indicated number "to" a second indicated number are used herein interchangeably and are meant to include the first and second indicated numbers and all the fractional and integral numerals therebetween.
It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable subcombination or as suitable in any other described embodiment of the invention. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.
All publications, patents and patent applications mentioned in this specification are herein incorporated in their entirety by reference into the specification, to the same extent as if each individual publication, patent or patent application was specifically and individually indicated to be incorporated herein by reference. In addition, citation or identification of any reference in this application shall not be construed as an admission that such reference is available as prior art to the present invention. To the extent that section headings are used, they should not be construed as necessarily limiting.

Claims

WHAT IS CLAIMED IS:
1. An apparatus (402) for compiling code for runtime execution within at least one virtual memory sub-page of at least one virtual memory page, the apparatus comprising:
a compiler (412A) executable by a processor (406), the compiler (412A) configured to: receive pre-compilation code for compilation,
wherein the size of the pre-compilation code, when compiled and loaded into a memory, is at least the size of one virtual memory sub-page, wherein the at least one virtual memory sub-page corresponds to one of a plurality of physical memory blocks that are mapped to a virtual memory page, wherein the size of each physical memory block is the size of a virtual memory sub-page;
divide the pre-compilation code into a plurality of blocks such that each block of the plurality of blocks when compiled into a respective executable binary block of a plurality of executable binary blocks is less than or equal to the size of a virtual memory sub-page of the at least one virtual memory page corresponding to the size of one physical memory block;
compile the plurality of blocks into the plurality of executable binary blocks; and link (412B) the plurality of executable binary blocks into a program and include a designation of the plurality of executable binary blocks for loading of the program by supervisor software into an allocated at least one virtual memory page by loading the plurality of executable binary blocks into physical memory blocks according to a mapping between virtual memory sub-pages of the at least one virtual memory page and allocated plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
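By way of a non-limiting illustration of the block-sizing constraint in claim 1, the following C sketch assumes a hypothetical 512-byte sub-page (one physical memory block) and shows the arithmetic for deciding how many sub-page-sized blocks a compiled object requires and whether a given binary block is admissible; the sizes and names are illustrative, not taken from the claims:

```c
#include <stddef.h>

/* Hypothetical geometry: a 4 KiB virtual page split into 512-byte
 * sub-pages, each backed by exactly one physical memory block. */
#define PAGE_SIZE    4096u
#define SUBPAGE_SIZE  512u

/* Number of sub-page-sized blocks needed for `nbytes` of compiled
 * output, rounded up. */
static unsigned blocks_needed(size_t nbytes) {
    return (unsigned)((nbytes + SUBPAGE_SIZE - 1) / SUBPAGE_SIZE);
}

/* A compiled block may be loaded only if it fits one sub-page. */
static int block_fits(size_t block_bytes) {
    return block_bytes <= SUBPAGE_SIZE;
}
```

A compiler dividing pre-compilation code as in claim 1 would keep splitting until `block_fits` holds for every emitted block.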
2. The apparatus (402) according to claim 1, wherein the compiler (412A) is further configured to divide a function of a .text section (702) of the pre-compilation code that is larger than the size of one virtual memory sub-page when compiled into executable code, into a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page when compiled into executable binary blocks, wherein the executable binary blocks of the divided function of the .text are placed by supervisor software (412C) within a cluster (710) of virtual memory sub-pages of a virtual memory page that map to a corresponding cluster of physical memory blocks each of a size corresponding to a virtual memory sub-page size.
3. The apparatus (402) according to claim 1, wherein the compiler (412A) is further configured to arrange a plurality of functions that are each smaller than the size of one virtual memory sub-page when compiled, to fit entirely within one virtual memory sub-page when compiled.
4. The apparatus (402) according to any of the previous claims, wherein the pre-compilation code includes a data storage structure larger than the size of one virtual memory sub-page when compiled, and wherein the compiler (412A) is further configured to divide the data storage structure into a plurality of sub-data storage structures (804) each smaller than the size of one virtual memory sub-page (808) when compiled.
5. The apparatus (402) according to claim 4, wherein the compiler (412A) is further configured to create a dereferencing data structure (802) for accessing each element of each sub-data storage structure (804), wherein the dereferencing data structure (802) adds an offset according to the size of the virtual memory sub-pages (808) of the virtual memory page (806) storing the data structure during runtime and clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
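The dereferencing data structure of claim 5 can be illustrated (all names and sizes below are hypothetical, not drawn from the disclosure) as a table of per-sub-page base pointers: logical element i is reached by selecting sub-array i / ELEMS_PER_SUBPAGE and adding the remaining in-block offset:

```c
#include <stddef.h>

/* Illustrative sub-page of 512 bytes holding ints. */
#define SUBPAGE_SIZE 512u
#define ELEMS_PER_SUBPAGE (SUBPAGE_SIZE / sizeof(int))   /* 128 on LP64 */

typedef struct {
    int *subpage_base[8];   /* one base pointer per allocated block */
} deref_table_t;

/* Translate a logical index into an address inside the right block:
 * pick the sub-array, then add the offset within it. */
static int *elem_addr(deref_table_t *t, size_t i) {
    return t->subpage_base[i / ELEMS_PER_SUBPAGE] + (i % ELEMS_PER_SUBPAGE);
}

/* Self-check: two backing arrays stand in for two physical blocks. */
static int deref_demo(void) {
    static int a[ELEMS_PER_SUBPAGE], b[ELEMS_PER_SUBPAGE];
    deref_table_t t = { { a, b } };
    *elem_addr(&t, 0) = 7;                    /* lands in a[0] */
    *elem_addr(&t, ELEMS_PER_SUBPAGE) = 9;    /* lands in b[0] */
    return a[0] == 7 && b[0] == 9;
}
```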
6. The apparatus (402) according to any of the previous claims, wherein the compiler (412A) is further configured to access and manage a program stack by incrementing the program stack in divided blocks each having a size smaller than or equal to one virtual memory sub-page.
7. The apparatus (402) according to claim 6, wherein the compiler (412A) is further configured to add a new program stack frame that updates a program stack pointer that points to each divided block by adding an offset according to the size of the virtual memory sub-pages of the virtual memory page storing the program stack during runtime and clusters of physical memory blocks allocated to the application associated with the data storage structure.
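The sub-page-granular stack management of claims 6 and 7 can be sketched as follows. For simplicity the sketch uses an upward-growing bump pointer (real stacks typically grow downward), and the 512-byte sub-page size is an assumption: a new frame that would straddle a sub-page boundary is pushed forward by an offset so it starts in the next sub-page block.

```c
#include <stddef.h>

#define SUBPAGE_SIZE 512u

/* Advance the stack pointer by one frame, never letting a frame
 * straddle a sub-page boundary: if it would cross, first add the
 * offset to the next sub-page (the offset claim 7 describes). */
static size_t push_frame(size_t sp, size_t frame_bytes) {
    size_t used = sp % SUBPAGE_SIZE;
    if (used + frame_bytes > SUBPAGE_SIZE)   /* frame would straddle */
        sp += SUBPAGE_SIZE - used;           /* skip to next sub-page */
    return sp + frame_bytes;
}
```

For instance, a 100-byte frame pushed at offset 500 of a sub-page is relocated to start at the next 512-byte boundary, so each frame stays within one physical memory block.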
8. The apparatus (402) according to any of the previous claims, wherein the size of the virtual memory sub-page is at least as large as a predefined standard size of a physical memory block associated with the processor (406).
9. The apparatus (402) according to any of the previous claims, wherein each binary block of the plurality of binary blocks is relocatable in its entirety as a continuous segment of code from one virtual memory sub-page to another virtual memory sub-page.
10. An apparatus (402) for loading code for execution within at least one virtual memory sub-page of at least one virtual memory page, the apparatus (402) comprising:
a processor (406);
a memory (412) storing code instructions (412C) for execution by the processor (406), comprising:
code to identify a binary file of an application divided into a plurality of blocks, where a size of each block of the plurality of blocks is less than or equal to a size of a virtual memory sub-page,
code to retrieve an initial allocation of a plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size for the application, code to receive an allocation of at least one virtual memory page for the application, wherein the size of the at least one virtual memory page is mapped to an equal size of contiguous physical memory areas, wherein the at least one virtual memory page includes a plurality of virtual memory sub-pages mapped to the plurality of clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size,
code to load the plurality of blocks of the binary file of the application into the allocated at least one virtual memory page, wherein the plurality of blocks are loaded into physical memory areas according to the mapping between the virtual memory sub-pages and the allocated plurality of clusters of physical memory blocks.
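The loading step of claim 10 — placing each binary block into the physical block that the sub-page mapping assigns to it — might look like the sketch below; the fixed-size map, the 512-byte block size, and all identifiers are illustrative:

```c
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE 512u
#define NBLOCKS 4u

typedef struct {
    unsigned char *phys_block[NBLOCKS]; /* allocated clusters, one per sub-page */
} subpage_map_t;

/* Copy block i of the binary into the physical block mapped to
 * virtual memory sub-page i; the last block may be short. */
static void load_blocks(subpage_map_t *map,
                        const unsigned char *binary, size_t nbytes) {
    for (unsigned i = 0; (size_t)i * SUBPAGE_SIZE < nbytes && i < NBLOCKS; i++) {
        size_t off = (size_t)i * SUBPAGE_SIZE;
        size_t n = nbytes - off < SUBPAGE_SIZE ? nbytes - off : SUBPAGE_SIZE;
        memcpy(map->phys_block[i], binary + off, n);
    }
}

/* Self-check: a 600-byte "binary" spans two sub-page blocks. */
static int demo_load(void) {
    static unsigned char b0[512], b1[512];
    static unsigned char bin[600];
    subpage_map_t m = { { b0, b1 } };
    bin[0] = 0xAA;              /* first byte, first block  */
    bin[599] = 0xBB;            /* last byte, second block  */
    load_blocks(&m, bin, sizeof bin);
    return b0[0] == 0xAA && b1[599 - 512] == 0xBB;
}
```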
11. The apparatus (402) according to claim 10, further comprising code to dynamically move at least one of the plurality of blocks from a first virtual memory sub-page of a first cluster to a second memory sub-page of a second cluster, and update a mapping between virtual memory sub-pages and clusters of physical memory blocks according to the dynamic move.
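The dynamic move of claim 11 amounts to copying one whole block and retargeting one entry of the virtual-sub-page-to-physical-block map. A minimal sketch with a hypothetical four-entry layout (all structure names are illustrative):

```c
#include <string.h>

#define SUBPAGE_SIZE 512u

typedef struct {
    unsigned char *phys_block[4];  /* physical memory blocks */
    unsigned vmap[4];              /* virtual sub-page -> physical block index */
} layout_t;

/* Move virtual sub-page `vsub` to physical block `new_phys`, then
 * update the mapping so the virtual address stays valid. */
static void move_subpage(layout_t *l, unsigned vsub, unsigned new_phys) {
    unsigned old = l->vmap[vsub];
    memcpy(l->phys_block[new_phys], l->phys_block[old], SUBPAGE_SIZE);
    l->vmap[vsub] = new_phys;   /* remap: same virtual sub-page, new block */
}

/* Self-check: content written before the move is visible after it. */
static int demo_move(void) {
    static unsigned char p0[512], p1[512];
    layout_t l = { { p0, p1 }, { 0 } };
    p0[0] = 0x5A;
    move_subpage(&l, 0, 1);
    return l.vmap[0] == 1 && p1[0] == 0x5A;
}
```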
12. The apparatus (402) according to any of claims 10-11, wherein the apparatus (402) further comprises code to populate data of a dereferencing data structure (802) for accessing each element of sub-data storage structures (804) of a data storage structure, wherein the dereferencing data structure (802) adds an offset according to the size of the virtual memory sub-pages (808) of the virtual memory page (806) storing the sub-data structures (804) of the data structure during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the application associated with the data storage structure.
13. The apparatus (402) according to any of claims 10-12, wherein the application includes compiled code for growing a program stack in blocks each having a size smaller than or equal to the size of one virtual memory sub-page of the virtual memory pages storing the program and the program stack during runtime, according to an added new program stack frame that updates a program stack pointer to point to the respective program stack blocks with an offset computed according to the size of the virtual memory sub-pages of the virtual memory pages storing the program stack blocks of the program stack during runtime and the clusters of physical memory blocks each of a size corresponding to a virtual memory sub-page size allocated to the program stack.
14. The apparatus (402) according to any of claims 10-13, wherein the application includes compiled code for storing a plurality of sub-functions that are each smaller than or equal to the size of one virtual memory sub-page of a function (702) that is larger than the size of one virtual memory sub-page, at respective virtual memory sub-page mapped to a cluster (710) of physical memory blocks each of a size corresponding to a virtual memory sub-page size, and storing the location of each of the plurality of sub-functions in a mapping data structure for runtime execution of the function.
15. The apparatus (402) according to any of the previous claims, wherein in an implementation of the processor (406) lacking a paging mechanism (416) the at least one virtual memory sub-page, which is part of a virtual memory page, is mapped to one physical memory block which is part of a plurality of contiguous physical memory blocks that makes up the size of a virtual memory page.
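In the paging-free configuration of claim 15, a virtual page is simply a run of contiguous physical blocks, so a sub-page address resolves by plain arithmetic rather than a page-table walk. A one-line sketch under an assumed 512-byte sub-page:

```c
#include <stddef.h>

#define SUBPAGE_SIZE 512u

/* Without a paging mechanism, sub-page i of a page whose physical
 * base is `page_phys_base` starts at a fixed offset from that base. */
static size_t subpage_phys_addr(size_t page_phys_base, unsigned subpage_idx) {
    return page_phys_base + (size_t)subpage_idx * SUBPAGE_SIZE;
}
```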
PCT/EP2017/081116 2017-12-01 2017-12-01 Systems for compiling and executing code within one or more virtual memory pages WO2019105565A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/081116 WO2019105565A1 (en) 2017-12-01 2017-12-01 Systems for compiling and executing code within one or more virtual memory pages
CN201780096871.XA CN111344667B (en) 2017-12-01 2017-12-01 System and method for compiling and executing code within virtual memory sub-pages of one or more virtual memory pages


Publications (1)

Publication Number Publication Date
WO2019105565A1 2019-06-06

Family

ID=60569915


Country Status (2)

Country Link
CN (1) CN111344667B (en)
WO (1) WO2019105565A1 (en)


Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN116382785B (en) * 2023-06-01 2023-09-12 紫光同芯微电子有限公司 Method and device for data processing, computing equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070283125A1 (en) * 2006-06-05 2007-12-06 Sun Microsystems, Inc. Dynamic selection of memory virtualization techniques
US20080184195A1 (en) * 2007-01-26 2008-07-31 Oracle International Corporation Code generation in the presence of paged memory

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN102122268B (en) * 2010-01-07 2013-01-23 华为技术有限公司 Virtual machine memory allocation access method, device and system
CN103902459B (en) * 2012-12-25 2017-07-28 华为技术有限公司 Determine the method and relevant device of shared virtual memory page management pattern
CN104516826B (en) * 2013-09-30 2017-11-17 华为技术有限公司 The corresponding method and device of a kind of virtual big page and the big page of physics
CN105740042B (en) * 2016-01-15 2019-07-02 北京京东尚科信息技术有限公司 The management method and management system of virutal machine memory


Cited By (4)

Publication number Priority date Publication date Assignee Title
CN113821272A (en) * 2021-09-23 2021-12-21 武汉深之度科技有限公司 Application program running method, computing device and storage medium
CN113821272B (en) * 2021-09-23 2023-09-12 武汉深之度科技有限公司 Application program running method, computing device and storage medium
CN116560667A (en) * 2023-07-11 2023-08-08 安元科技股份有限公司 Splitting scheduling system and method based on precompiled delay execution
CN116560667B (en) * 2023-07-11 2023-10-13 安元科技股份有限公司 Splitting scheduling system and method based on precompiled delay execution

Also Published As

Publication number Publication date
CN111344667A (en) 2020-06-26
CN111344667B (en) 2021-10-15


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 17808448; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 17808448; Country of ref document: EP; Kind code of ref document: A1)