WO2019076442A1 - Computing system for unified memory access - Google Patents

Publication number
WO2019076442A1
WO2019076442A1 PCT/EP2017/076477 EP2017076477W WO2019076442A1 WO 2019076442 A1 WO2019076442 A1 WO 2019076442A1 EP 2017076477 W EP2017076477 W EP 2017076477W WO 2019076442 A1 WO2019076442 A1 WO 2019076442A1
Authority
WO
WIPO (PCT)
Prior art keywords
processing unit
memory
application
requirements information
memory segment
Prior art date
Application number
PCT/EP2017/076477
Other languages
French (fr)
Inventor
Antonio BARBALACE
Antonios ILIOPOULOS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to EP17787381.7A (EP3695316A1)
Priority to CN202111316767.1A (CN114153751A)
Priority to CN201780096058.2A (CN111247512B)
Priority to PCT/EP2017/076477 (WO2019076442A1)
Publication of WO2019076442A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 Cache access modes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/451 Code distribution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/0223 User address space allocation, e.g. contiguous or non contiguous base addressing
    • G06F12/023 Free address space management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/16 Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605 Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • G06F13/1652 Handling requests for interconnection or transfer for access to memory bus based on arbitration in a multiprocessor architecture
    • G06F13/1663 Access to shared memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/45 Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
    • G06F8/453 Data distribution
    • G06F8/454 Consistency
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 Techniques for rebalancing the load in a distributed system
    • G06F9/5088 Techniques for rebalancing the load in a distributed system involving task migration

Definitions

  • the present invention relates to a computing system and corresponding method for unified memory access.
  • the system and method of the present invention affect the way an operating system (OS) allocates a shared memory in a multi-processor system, based on requirements information.
  • the requirements information preferably includes executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of an application.
  • Emerging computer architectures are characterized by an increasingly heterogeneous memory and processor subsystem, mainly due to novel memory technologies, low-latency load/store access interconnects, and the resurgence of near data processing (NDP) that introduces processing units alongside the memory hierarchy.
  • Addressable memory is composed of on-chip memory, off-chip memory (e.g. conventional DIMM modules), or remote-machine memories.
  • SCM storage class memory
  • addressable memory can be volatile or persistent.
  • New interconnects including CCIX, Gen-Z, OpenCAPI, or serial memory bus are under development in order to boost intra-machine low-latency memory-mapped communication.
  • the unifying feature behind these technologies is the provision of a shared memory interconnect across all components within a system, either on a single-node basis or at rack scale.
  • Some of them also aim to provide hardware-based cache coherence.
  • NDP is regaining traction. NDP is the co-location of processing and memory, in the form of processing in (main) memory (PIM) or in-storage processing (ISC).
  • FIG. 8 shows a scenario in which a main memory is accessed by near data processors, CPUs that are attached to a coherent interconnect, accelerators interconnected to CPUs via a peripheral bus, and remote processing units (e.g. near data processors, CPU, or GPU) connected via a RDMA-enabled interface.
  • Accelerator-CPU-NIC setups, in which an accelerator (FPGA, GPU, or Xeon-Phi), a CPU, and a NIC share a common memory area, are a sub-scenario of the aforementioned technology.
  • the number of processing units accessing the same memory area steadily increases with NDP and rack-scale computing.
  • all memory is becoming load/store accessible, both inter- and intra-machine.
  • the present invention aims to improve the conventional systems and methods.
  • the present invention thus has the object of providing a system that overcomes the drawbacks of heterogeneity in emerging computer architectures.
  • a programmer is freed from having to design an application to target a specific heterogeneous platform and thus, memory access in a heterogeneous environment can be unified.
  • This also enables backwards compatibility of legacy applications, which do not have to be redesigned in order to make use of the benefits of a heterogeneous platform.
  • An operating system is further enabled to make run-time decisions and actively exploit the benefits of heterogeneous devices (for example, by scheduling and transparently migrating processes when and where this would lead to better performance and efficiency), without requiring explicit involvement at application development time.
  • the present invention proposes "memory contracts" (which can also be referred to as "requirements information") as a system software solution that is created at compile-time by a compiler or linker and is enforced at run-time under the management of an operating system.
  • checks can be performed at run-time, e.g. about consistency, protection, or coherence guarantees that the code that acts on a memory area has to comply with.
  • the memory contracts can include an enhanced executable binary format, extended to maintain metadata sections that include memory consistency, ISA and application binary interface (ABI) requirements.
  • a conventional OS binary loader can be enriched in order to recognize the metadata sections and load them at run-time.
  • an OS is enabled to dynamically select matching contracts at run-time for the variety of possible processing elements present in a heterogeneous computing architecture, and in addition to migrate tasks (processes, threads) to any processing unit transparently to the user, so as to achieve uniform access to memory.
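  • The contract selection described above can be illustrated with a minimal sketch. It assumes that a contract records an ISA, an ABI and a memory consistency model (the metadata named in the text) and that a processing unit matches when all three are equal. The class names, field names and example values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryContract:
    """Hypothetical metadata section attached to one compiled code variant."""
    isa: str          # instruction set the code was compiled for
    abi: str          # application binary interface used at compile time
    consistency: str  # memory consistency model the code assumes


@dataclass(frozen=True)
class ProcessingUnit:
    """Hypothetical description of one processing element known to the OS."""
    name: str
    isa: str
    abi: str
    consistency: str  # memory consistency model the unit provides


def matching_units(contract, units):
    """Return the units on which code under this contract can run unchanged.

    Assumption: a unit matches only if ISA, ABI and consistency model are equal.
    """
    return [
        u for u in units
        if u.isa == contract.isa
        and u.abi == contract.abi
        and u.consistency == contract.consistency
    ]


units = [
    ProcessingUnit("cpu0", "x86_64", "sysv", "tso"),
    ProcessingUnit("ndp0", "aarch64", "aapcs64", "relaxed"),
]
contract = MemoryContract("aarch64", "aapcs64", "relaxed")
print([u.name for u in matching_units(contract, units)])
```

  • In this sketch a task carrying the example contract could only be placed on the hypothetical unit "ndp0"; a real loader would read the contract from the metadata sections of the executable rather than from literals.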
  • the present invention thereby solves the following problems existing in the prior art: Application writers are relieved of having to adopt a specific programming model in order to use heterogeneous resources. Applications are enabled to take advantage of several available heterogeneous resources, instead of targeting a specific resource. An operating system is enabled to transparently make dynamic decisions at run-time, in order to better exploit the available heterogeneous resources.
  • a first aspect of the present invention provides a computing system for unified memory access, comprising a first processing unit and a second processing unit, a shared memory including a first memory segment and a second memory segment, an operating system, operated at least partly by the first processing unit, and an application, operated at least partly by the operating system, wherein the first processing unit and the second processing unit are connected to the shared memory, wherein the operating system is configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on requirements information comprised in the operating system and/or the application, to allocate the first memory segment to at least a part of the application, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application.
  • As memory access can be performed according to the requirements information comprised in the operating system and/or the application, a programmer is freed from having to design an application to target a specific heterogeneous platform. Further, the operating system is enabled to make run-time decisions regarding memory access and actively exploit the benefits of heterogeneous devices. The OS is enabled to effectively and efficiently control heterogeneous processing units and/or heterogeneous memory segments when running an application.
  • the first processing unit and the second processing unit are of different processing unit architecture. This ensures that memory access can in particular be unified in heterogeneous computing systems, more specifically in those in which the first and the second processing unit are of different architectures.
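  • As an illustration of allocating a memory segment according to a required type and/or state, the following minimal sketch may help. The text does not define how type and state are encoded, so the labels "ram"/"scm" and "volatile"/"persistent", the class `Segment` and the function `allocate` are assumptions made only for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    """Hypothetical record for one segment of the shared memory."""
    name: str
    kind: str                    # assumed encoding of the "type", e.g. "ram" or "scm"
    state: str                   # assumed encoding of the "state", e.g. "volatile"
    owner: Optional[str] = None  # application part the segment is allocated to


def allocate(segments, part, kind=None, state=None):
    """Give the first free segment that meets the stated requirements to `part`.

    A requirement left as None is treated as "don't care" (type and/or state).
    Returns the allocated segment, or None if nothing suitable is free.
    """
    for seg in segments:
        if seg.owner is not None:
            continue
        if kind is not None and seg.kind != kind:
            continue
        if state is not None and seg.state != state:
            continue
        seg.owner = part
        return seg
    return None


segments = [
    Segment("seg104", "ram", "volatile"),
    Segment("seg105", "scm", "persistent"),
]
chosen = allocate(segments, "Code A", state="persistent")
print(chosen.name if chosen else None)
```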
  • the requirements information further comprises 1st requirements information relating to properties of executable binary code of at least a part of the application
  • the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 1st requirements information, to allocate the first memory segment to at least a part of the application.
  • the 1st requirements information is executable binary code and comprises information regarding an application binary interface, ABI, used to compile at least a part of the application, and/or a format used to compile at least a part of the application, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application, and/or a security policy.
  • the requirements information comprises 2nd requirements information relating to an executable binary code of at least one predefined code segment of the application
  • the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 2nd requirements information, to allocate the first memory segment to at least a part of the application.
  • the 2nd requirements information is executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application, and/or information regarding a memory model a predefined code segment of the application is compiled for, and/or a security policy for each memory segment the application can access.
  • the requirements information comprises 3rd requirements information relating to a connection between the shared memory and at least one of the first processing unit and the second processing unit
  • the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 3rd requirements information, to allocate the first memory segment to at least a part of the application. This ensures that memory access can in particular be unified by considering information regarding a connection between the shared memory and at least one of the first processing unit and the second processing unit, when operating the computing system.
  • the 3rd requirements information is created by the operating system and comprises information regarding a cache coherency guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory.
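  • A minimal sketch of how an OS might consult such connection information is given below. It assumes one record per (processing unit, memory segment) pair holding the three properties named above, and it picks the lowest-latency segment that satisfies a coherence requirement. The record layout, the names and the selection rule are illustrative assumptions, not the method of the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical entry describing one unit-to-segment connection."""
    unit: str
    segment: str
    cache_coherent: bool  # cache coherency guarantee on this connection
    latency_ns: int       # memory access latency on this connection
    hw_protection: str    # type of hardware protection, "none" if absent


def best_segment(links, unit, need_coherence):
    """Return the lowest-latency segment reachable from `unit`.

    If `need_coherence` is True, only cache-coherent connections qualify.
    Returns None when no connection satisfies the requirement.
    """
    candidates = [
        link for link in links
        if link.unit == unit and (link.cache_coherent or not need_coherence)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda link: link.latency_ns).segment


links = [
    Link("cpu", "seg104", True, 90, "mmu"),
    Link("cpu", "seg105", False, 60, "none"),
    Link("ndp", "seg105", True, 20, "none"),
]
print(best_segment(links, "cpu", need_coherence=True))
```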
  • the operating system is further configured to, if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application does not comply with a requirement in the requirements information, adjust a configuration of at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application, based on the requirements information, to allocate at least one of the first memory segment and the second memory segment to at least a part of the application.
  • the operating system is further configured to, if the first processing unit does not comply with a requirement in the requirements information, migrate at least a part of the application from being operated by means of the first processing unit to be operated by means of the second processing unit, and to control the second processing unit to allocate the first memory segment to at least a part of the application, based on the requirements information.
  • the operating system is further configured to, if a predefined part of executable binary code in the application does not comply with a requirement in the requirements information, exchange the predefined part of executable binary code in the application with precompiled executable binary code that complies with the requirements, and to allocate the first memory segment to at least a part of the application, based on the requirements information and the precompiled executable binary code.
  • the operating system is further configured to, if the first memory segment does not comply with a requirement in the requirements information, control the at least one of the first processing unit and the second processing unit, and the shared memory, to allocate the second memory segment to at least a part of the application, based on the requirements information.
  • the operating system is further configured to, if the first processing unit, the first memory segment and a predefined part of executable binary code in the application do not comply with a requirement in the requirements information, allocate the first memory segment to at least a part of the application by means of software memory emulation, based on the requirements information.
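  • The alternative reactions to a non-compliant element listed above (migration, exchange of binary code, allocation of the second memory segment, software memory emulation) can be summarised in a small decision sketch. The action labels and the way individual reactions are combined when two elements fail are assumptions for illustration; the text only specifies the individual cases.

```python
def plan_actions(unit_ok, code_ok, segment_ok):
    """Map compliance results to a list of hypothetical OS actions.

    unit_ok    -- the first processing unit meets the requirements
    code_ok    -- the predefined part of the binary code meets them
    segment_ok -- the first memory segment meets them
    """
    if unit_ok and code_ok and segment_ok:
        return ["allocate_first_segment"]
    if not (unit_ok or code_ok or segment_ok):
        # Nothing complies: fall back to software memory emulation.
        return ["software_memory_emulation"]
    actions = []
    if not unit_ok:
        actions.append("migrate_to_second_unit")
    if not code_ok:
        actions.append("exchange_with_precompiled_code")
    # Use the second segment only when the first one does not comply.
    actions.append(
        "allocate_first_segment" if segment_ok else "allocate_second_segment"
    )
    return actions


print(plan_actions(unit_ok=False, code_ok=True, segment_ok=True))
```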
  • the at least two memory segments are of different memory segment architecture.
  • a second aspect of the present invention provides a method for operating a computing system for unified memory access that comprises a first processing unit and a second processing unit, a shared memory including a first memory segment and a second memory segment, an operating system, operated at least partly by the first processing unit, and an application, operated at least partly by the operating system, wherein the first processing unit and the second processing unit are connected to the shared memory, the method comprising the steps of controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on requirements information comprised in the operating system and/or the application, to allocate the first memory segment to at least a part of the application, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application.
  • the first processing unit and the second processing unit are of different processing unit architecture.
  • the requirements information further comprises 1st requirements information relating to properties of executable binary code of at least a part of the application
  • the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 1st requirements information, to allocate the first memory segment to at least a part of the application.
  • the 1st requirements information is executable binary code and comprises information regarding an application binary interface, ABI, used to compile at least a part of the application, and/or a format used to compile at least a part of the application, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application, and/or a security policy.
  • the requirements information comprises 2nd requirements information relating to an executable binary code of at least one predefined code segment of the application
  • the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 2nd requirements information, to allocate the first memory segment to at least a part of the application.
  • the 2nd requirements information is executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application, and/or information regarding a memory model a predefined code segment of the application is compiled for, and/or a security policy for each memory segment the application can access.
  • the requirements information comprises 3rd requirements information relating to a connection between the shared memory and at least one of the first processing unit and the second processing unit
  • the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 3rd requirements information, to allocate the first memory segment to at least a part of the application.
  • the 3rd requirements information is created by the operating system and comprises information regarding a cache coherency guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory.
  • the method further includes, if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application does not comply with a requirement in the requirements information, adjusting, by the operating system, a configuration of at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application, based on the requirements information, to allocate at least one of the first memory segment and the second memory segment to at least a part of the application.
  • the method further includes, if the first processing unit does not comply with a requirement in the requirements information, migrating, by the operating system, at least a part of the application from being operated by means of the first processing unit to be operated by means of the second processing unit, and controlling, by the operating system, the second processing unit to allocate the first memory segment to at least a part of the application, based on the requirements information.
  • the method further includes, if a predefined part of executable binary code in the application does not comply with a requirement in the requirements information, exchanging, by the operating system, the predefined part of executable binary code in the application with precompiled executable binary code that complies with the requirements information, and allocating the first memory segment to at least a part of the application, based on the requirements information and the precompiled executable binary code.
  • the method further includes, if the first memory segment does not comply with a requirement in the requirements information, controlling, by the operating system, the at least one of the first processing unit and the second processing unit, and the shared memory, to allocate the second memory segment to at least a part of the application, based on the requirements information.
  • the method further includes, if the first processing unit, the first memory segment and a predefined part of executable binary code in the application do not comply with a requirement in the requirements information, allocating, by the operating system, the first memory segment to at least a part of the application by means of software memory emulation, based on the requirements information.
  • the at least two memory segments are of different memory segment architecture.
  • Fig. 1 shows a computing system according to an embodiment of the present invention
  • Fig. 2 shows a computing system according to an embodiment of the present invention
  • Fig. 3 shows a computing system according to an embodiment of the present invention in more detail
  • Fig. 4 shows a schematic view of an ELF format and of a PE/COFF format
  • Fig. 5 shows a schematic view of an OS process descriptor according to the present invention
  • Fig. 6 shows a flowchart of an operating manner for unifying memory access of an OS kernel
  • Fig. 7 shows a schematic overview of a method according to an embodiment of the invention
  • Fig. 8 shows a computing system according to the prior art.
  • Fig. 1 shows a computing system 100 according to an embodiment of the present invention.
  • the computing system 100 allows for unified memory access and comprises a first processing unit 101 and a second processing unit 102, as well as a shared memory 103 including a first memory segment 104 and a second memory segment 105.
  • Each of the processing units 101, 102 can, for example, be one of a CPU, a CPU core, a GPU, a GPU core, a near data processor, a CPU that is attached to a coherent interconnect, an accelerator interconnected to a CPU via a peripheral bus, a remote processing unit (e.g. near data processor, CPU, or GPU) connected via an RDMA-enabled interface, or a kernel, e.g. a kernel of an OS.
  • the computing system 100 can include an arbitrary number of processing units, as long as it includes at least the first processing unit 101 and the second processing unit 102, e.g. according to the above definition.
  • the first processing unit 101 and the second processing unit 102 optionally can be of different processing unit architecture. This can include that the first processing unit 101 is a first entity, selected from the above list, and the second processing unit 102 is a different entity, selected from the above list. This also can include that the first processing unit 101 and the second processing unit 102 are binary incompatible, e.g. because they operate according to different ISAs.
  • the shared memory 103 includes the first memory segment 104 and the second memory segment 105.
  • Each memory segment 104, 105 can either be classic main memory, such as a random access memory (RAM), or storage class memory.
  • Each memory segment 104, 105 can be volatile or non-volatile, as well as coherent or non-coherent. More specifically, each of the memory segments 104, 105 can be implemented on-chip, off-chip (e.g. by conventional DIMM modules), by a conventional coherency interconnect, or by inter-machine memories. Each link to a memory segment 104, 105 can be implemented using cache-coherent or non-coherent interconnects, or with new technologies, e.g. CCIX, Gen-Z, OpenCAPI, or a serial memory bus.
  • the first memory segment 104 and the second memory segment 105 can optionally be of different memory segment architecture, e.g. each chip providing one or more memory segments can be built according to a different memory technology and/or design, and/or can comprise multiple sections of different memory technology and/or design.
  • This can include that the first memory segment 104 is a first entity selected from the above list and that the second memory segment 105 is a second, different entity selected from the above list.
  • This can also include that the first memory segment 104 and the second memory segment 105 are binary incompatible, e.g. because they operate according to different ISAs.
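  • The segment properties discussed above (volatile or non-volatile, coherent or non-coherent, on-chip or off-chip) lend themselves to a bit-flag description. The following sketch is one possible encoding, assumed purely for illustration; the flag names and the helper `satisfies` are hypothetical.

```python
from enum import Flag, auto


class SegProps(Flag):
    """Hypothetical property flags of a memory segment (absence means the opposite)."""
    PERSISTENT = auto()  # non-volatile; unset means volatile
    COHERENT = auto()    # reachable over a cache-coherent link
    ON_CHIP = auto()     # implemented on-chip; unset means off-chip or remote


def satisfies(provided, required):
    """True if every required property is among the provided ones."""
    return (provided & required) == required


seg104 = SegProps.COHERENT | SegProps.ON_CHIP  # e.g. volatile on-chip memory
seg105 = SegProps.PERSISTENT                   # e.g. non-coherent storage class memory

print(satisfies(seg104, SegProps.COHERENT))
print(satisfies(seg105, SegProps.PERSISTENT | SegProps.COHERENT))
```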
  • the shared memory 103 can be configured to enable simultaneous access of multiple operating systems, and/or multiple processing units, and/or multiple applications, to the shared memory 103, preferably to a same memory segment 104, 105 in the shared memory, at a same time.
  • the computing system 100 further comprises an OS 106.
  • the OS 106 can be a conventional single- or multi-tasking and/or single- or multi-user OS, such as Linux, BSD, Windows, or Apple OSX.
  • the OS can also be a distributed, a templated, an embedded, a real-time, or a library OS.
  • the OS 106 can also exclusively be a single kernel.
  • the OS 106 can also comprise multiple kernels.
  • the computing system 100 can be operated using a single OS 106, but also using multiple OSes 106, as long as there is at least one OS 106.
  • the OS 106 can in particular be binary incompatible with the first processing unit 101 and/or the second processing unit 102, i.e. operated according to a different ISA.
  • the OSes 106 can be binary incompatible with each other.
  • Fig. 1 in particular shows an OS 106 that operates on multiple processing units such as near data processors, CPUs, accelerators, and remote units. These multiple processing units also comprise the first processing unit 101 and the second processing unit 102.
  • the OS 106 is operated at least partly by the first processing unit 101.
  • the second processing unit 102 can be controlled by the OS 106, to optimize a configuration of the computing system 100, to enable unified memory access.
  • the computing system 100 comprises multiple OSes 106.
  • the computing system 100 further comprises an application 107.
  • the application 107 is at least partly operated by the OS 106. This includes that the application can, e.g., be a distributed application, which is operated by means of OSes, one of those being the OS 106. More specifically, the application 107 comprises at least one of the application parts "Code A", "Code B", "Code C" or "Code D", as shown in Fig. 1 or Fig.
  • the application 107 can be operated by one OS on multiple processing units, and even by multiple kernels, OSes or runtimes on multiple processing units.
  • the application 107 acquires memory access, e.g. by trying to allocate a memory segment 104, 105 in the shared memory. Acquiring memory access can also involve operation of the first and/or the second processing unit 101, 102, the shared memory 103, one of its memory segments 104, 105, or the OS 106, to enable unified memory access.
  • the first processing unit 101 and the second processing unit 102 are connected to the shared memory 103, e.g. by means of a bus, more specifically, via a bus that supports the load/store interface.
  • the bus may offer different types of consistency for each different interconnect segment.
  • the OS 106 is configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on requirements information 108 comprised in the operating system 106 and/or the application 107, to allocate the first memory segment 104 to at least a part of the application 107.
  • the requirements information 108 further can be maintained by means of software, e.g. by means of the OS 106 or the application 107.
  • the OS considers the requirements information 108 and performs predefined actions, based on the requirements information 108, to control at least one of the first processing unit 101, the second processing unit 102, or the shared memory 103 to access the shared memory 103, more specifically to allocate the first memory segment 104 to at least a part of the application 107, that needs memory to be allocated.
  • the requirements information 108 comprises executable binary code that comprises information regarding a type and/or a state of a memory segment 104, 105 required by at least a part of the application 107.
  • the OS 106 can load the executable binary code that also provides supplementary information for each program section of the application 107.
  • the requirements information 108 can also be called memory contract.
  • the memory contract can define for each code (.text) subsection of the application 107 a minimal requirement that is required for correct memory access.
  • the requirements information 108, more specifically the executable binary code, can be created by a compiler that automatically generates the executable binary code transparently to a program developer, who can nevertheless use additional language pragmas or a new OS API to modify the memory contracts at compile-time or run-time, respectively.
  • the requirements information 108 can also include 1st, 2nd and 3rd requirements information, which is going to be described in detail in view of Fig. 3 below.
  • the 1st requirements information can also be called data contract.
  • the 2nd requirements information can also be called actor contract.
  • the 3rd requirements information can also be called topology contract.
  • Fig. 2 shows an example configuration of the computing system 100 according to an embodiment of the present invention.
  • the computing system 100 as shown in Fig. 2 includes all features of the computing system 100 of Fig. 1 and can in particular be operated using multiple OSes.
  • the OSes can in particular be of a type as described in view of Fig. 1.
  • the OS 106 is comprised by the multiple OSes in the computing system 100 in Fig. 2.
  • the multiple OSes can in particular be binary incompatible with each other, and/or with the first processing unit 101 and/or the second processing unit 102, i.e. operated according to a different ISA.
  • the computing system 100 in Fig. 2 at least requires the OS 106 to be operated.
  • the computing system 100 can also be implemented in environments in which multiple OSes are running on multi-core, multi-processor, or distributed systems.
  • Each of the multiple OSes can run on a different processing unit.
  • each OS can run on a different processing unit at least partly, i.e. in a distributed fashion.
  • the OS 106 is however required to operate the computing system 100 and to control each of the multiple OSes in the computing system 100.
  • the multiple OSes in the computing system 100 can be controlled by the OS 106, e.g. to optimize a configuration of the computing system 100, to enable unified memory access, e.g. to migrate an application at least partly to a different OS.
  • Fig. 3 shows a computing system 300 according to an embodiment of the present invention in more detail.
  • the computing system 300 includes all features and functionality of the computing system 100 as described in view of Fig. 1 and Fig. 2 above. Thus, identical features are labelled with identical reference signs.
  • the 1st, 2nd and 3rd requirements information (i.e. the data contract, the actor contract, and the topology contract) and the concept of memory contracts are described in more detail in the following.
  • Memory contracts can also be regarded as an OS abstraction with an OS interface for a programmer.
  • a memory contract can be metadata for an address space region (being data, code, file, swap, and/or a combination of these).
  • the data contract can be associated to executable code of the application and the actor contract can be associated to memory areas accessed by the application.
  • the topology contract can describe processing-unit-to-memory-segment characteristics.
  • Memory contracts can be used by each OS running on at least two processing units to enforce at least the following memory area properties: format (in the sense of ABI), consistency guarantees, cacheability guarantees, persistency, and user privileges (memory protection). More examples are given below.
  • the requirements information 108 optionally can further comprise the 1st requirements information 301, which also can be called data contract.
  • the 1st requirements information 301 can relate to properties of executable binary code of at least a part of the application 107.
  • the operating system 106 is further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 1st requirements information 301, to allocate the first memory segment 104 to at least a part of the application 107.
  • the 1st requirements information expresses an ABI and/or a format that a compiler used when compiling a specific (initialized or not-initialized) data part of the application 107.
  • ABI and format can be defined at compile time, thus also heap and stack can be characterized by a data contract.
  • Persistency is another characteristic expressed by a data contract. It can also be extracted at compile time due to modern compilers' support for persistent memory.
  • a data contract defines the owner(s) of a memory segment and a security policy to grant capabilities to a memory segment and/or processes/processing units.
  • the 1st requirements information 301 can be regarded as executable binary code and comprises information - such as assumptions and rules - used when compiling the binary code, more specifically an ABI used to compile at least a part of the application 107, and/or a format used to compile at least a part of the application 107, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application 107, and/or a security policy.
  • the format in particular specifies alignments and/or a data structure field order, and/or an ABI, and/or alignments, and/or padding, and/or structure field organization, and/or persistency, and/or cacheability.
  • Persistency is a property of a memory segment to retain data after a power supply is removed.
  • a memory is non-persistent or volatile when data stored by the memory is lost after the power supply is removed.
  • a memory is persistent if data stored in it is not lost after the power supply is removed.
  • Examples of non-persistent memory are SDRAM and SRAM.
  • Examples of persistent memory are NVDIMM and Flash.
  • the persistency characteristic includes information regarding the persistency of the shared memory 103 and/or the memory segments 104, 105.
  • the ownership of a memory segment 104, 105 includes information regarding what application 107 or user may access the memory segment 104, 105.
  • the requirements information 108 optionally can further comprise the 2nd requirements information 302, which also can be called actor contract.
  • the second requirements information 302 can relate to an executable binary code of at least one predefined code segment of the application 107.
  • the OS 106 can be further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 2nd requirements information 302, to allocate the first memory segment 104 to at least a part of the application 107.
  • the second requirements information 302 (i.e. the actor contract) can express an ABI that a compiler used when compiling a specific code segment of the application 107. Moreover, the actor contract explicates for each code segment which memory model that code segment was compiled for, such as consistency guarantees and cacheability requirements. Finally, an actor contract stores a set of capabilities for each memory area it can access (i.e. a security policy).
  • the 2nd requirements information 302 can be regarded as executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application 107, and/or information regarding a memory model a predefined code segment of the application 107 is compiled for, and/or a security policy for each memory segment the application 107 can access.
  • Consistency is a memory property.
  • the type/level of consistency defines the way in which modification on the memory by the first processing unit 101 is propagated to the second processing unit 102.
  • a cache is a mechanism in memory hardware that provides consistency.
  • Cacheability is the capability of memory hardware to provide a sort of consistency via the cache.
  • the security policy in particular can define what memory segment a single application 107 can read, write, or execute.
  • the security policy further can be defined as a set of properties that are concerned with a capability of code in live execution on a processing unit to access a predefined memory segment.
  • the above mentioned memory model can define a way in which binary code can access a memory segment in the shared memory. That is because a same memory segment can have different cacheability, access latency, protection mechanisms, and persistency properties regarding different processing units.
  • the work of the operating system can be divided (in different amounts) between the operating system 106, a runtime, and a hypervisor.
  • the requirements information 108 optionally can further comprise the 3rd requirements information 303, which also can be called topology contract.
  • the 3rd requirements information can relate to a connection between the shared memory 103 and at least one of the first processing unit 101 and the second processing unit 102.
  • the operating system 106 can be further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 3rd requirements information 303, to allocate the first memory segment 104 to at least a part of the application 107.
  • the OS 106 can be built by multiple kernels, each running on a different processing unit, which share information about a topology of the computing system and data contracts of each shared memory segment.
  • 3rd requirements information (i.e. the topology contract) is required, which describes a connection between a physical memory segment and each specific processing unit.
  • a topology contract also can describe cache coherency guarantees that exist between a memory segment and a processing unit.
  • other information can be associated to the topology contract, e.g. memory access latency, information whether the memory is persistent, information regarding the existence and the type of hardware protection mechanisms between memory links.
  • the 3rd requirements information 303 in particular can be created by the operating system 106, more specifically, based on a hardware topology/geometry of the computing system 300 (i.e. the way that hardware components are connected in the computing system 300).
  • the 3rd requirements information 303 can comprise information regarding a cache coherency guarantee between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or a memory access latency between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory 103.
  • a cache coherency guarantee can be regarded as a set of different processing units that requires consistent access to a memory segment when a cache coherency mechanism is interposed between the processing units and the memory segments.
  • the cache coherency mechanism can provide different types of consistency, such as sequential consistency or total store order (TSO).
  • Memory access latency can be regarded as the time that a memory access instructed by a processing unit requires to be executed. A single access can take a different amount of time depending on the physical distance of the memory segment from the processing unit and on whether the specific data is cached or not.
  • Hardware protection mechanisms can be memory paging and/or memory segmentation (segmented memory).
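The three contract types described above can be pictured as plain metadata records. The following Python sketch is only an illustration under stated assumptions: the class names, field names, enum ordering and the `link_satisfies` helper are hypothetical and are not taken from the patent. They merely mirror properties named in the text (ABI, persistency, ownership, memory model, capabilities, coherency, latency).

```python
from dataclasses import dataclass
from enum import Enum


class Consistency(Enum):
    # Consistency types named in the text. The numeric order is an
    # assumption used only to compare "strength" in this sketch.
    NONE = 0
    TSO = 1          # total store order
    SEQUENTIAL = 2   # sequential consistency


@dataclass(frozen=True)
class DataContract:          # 1st requirements information (hypothetical layout)
    abi: str
    persistent: bool
    owners: frozenset


@dataclass(frozen=True)
class ActorContract:         # 2nd requirements information (hypothetical layout)
    abi: str
    min_consistency: Consistency
    needs_cacheable: bool
    capabilities: frozenset = frozenset({"r"})   # e.g. {"r", "w", "x"}


@dataclass(frozen=True)
class TopologyContract:      # 3rd requirements information (hypothetical layout)
    unit: str
    segment: str
    consistency: Consistency
    cacheable: bool
    latency_ns: int
    persistent: bool


def link_satisfies(actor, data, topo):
    """True if the unit-to-segment link `topo` meets the actor and data contracts."""
    return (
        actor.abi == data.abi
        and topo.consistency.value >= actor.min_consistency.value
        and (topo.cacheable or not actor.needs_cacheable)
        and (topo.persistent or not data.persistent)
    )


# Usage: the same segment seen from two processing units.
data = DataContract(abi="sysv-x86_64", persistent=False, owners=frozenset({"app107"}))
actor = ActorContract(abi="sysv-x86_64", min_consistency=Consistency.TSO, needs_cacheable=True)
cpu_link = TopologyContract("unit101", "seg104", Consistency.SEQUENTIAL, True, 80, False)
ndp_link = TopologyContract("unit102", "seg104", Consistency.NONE, False, 20, False)
print(link_satisfies(actor, data, cpu_link))  # True
print(link_satisfies(actor, data, ndp_link))  # False
```

Keeping the topology record per unit-and-segment pair reflects the statement that the same memory segment can have different properties with regard to different processing units.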
  • the above described types of memory contracts remain transparent to a user of the computing system 300.
  • the memory contracts are handled by a compiler, at runtime, or by an OS, without requiring programmer intervention.
  • the memory contracts are generated in a first instance by a compiler and a linker and consist of descriptions of conventions that have been used to generate code and data (sub) sections.
  • the compiler can augment such descriptions of the memory model, generated internally or by syntax expressions, like C++11 atomics.
  • the compiler and linker can embody additional information in the generated binary code.
  • the compiler and linker can slice a usually monolithic .text segment into multiple sub-segments that can be assigned to different actor contracts.
  • further pragmas can be added to programming languages to mark memory segments that will be shared, and hence are accessible with variable base pointers from different devices.
  • the requirements information 108 (i.e. the memory contracts) is used by the OS 106 to provide unified access to the shared memory segments 104, 105 in particular by enforcing multiple properties:
  • the OS 106 enforces all memory contracts by checking that all code sections in the application 107, e.g. running on the first computing unit 101 and/or on other computing units, attached to a same memory segment 104, 105 respect the data, actor and topology contract for each connection to the memory segment. If a code section in the application 107 doesn't have a valid data, actor or topology contract (i.e.
  • the OS 106 can take several actions. Although these actions are described in view of Fig. 3, they also can be applied to the computing system 100 as described in view of Fig. 1 or 2. That is, the actions can also be performed based exclusively on the requirements information 108 without the presence of the 1st requirements information 301, the second requirements information 302 and the third requirements information 303 (i.e. the actions can be performed based exclusively on the memory contracts, i.e. without the presence of the data contract, the actor contract or the topology contract).
  • performing said actions can include that the operating system 106 is further configured to, if at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a part of the application 107 does not comply with a requirement in the requirements information 108, adjust a configuration of at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a part of the application 107, based on the requirements information 108, to allocate at least one of the first memory segment 104 and the second memory segment 105 to at least a part of the application 107.
  • Adjusting the configuration of at least one of the above mentioned entities can in particular include that, if the application 107 requires to allocate a memory segment that doesn't guarantee to meet requirements in the requirements information 108 (e.g. the 1st, 2nd or 3rd requirements information, or compilation and link properties of the code (sub-)section that is currently executing), the following actions can be taken by the OS:
  • Cancel execution of the application 107. It is also possible to disable read/write (RW) operations to the desired memory segment and raise a fault.
  • the operating system 106 can be further configured to, if a predefined part of executable binary code in the application 107 does not comply with a requirement in the requirements information 108, exchange the predefined part of executable binary code in the application 107 with precompiled executable binary code that complies with the requirements information, and to allocate the first memory segment 104 to at least a part of the application 107, based on the requirements information 108 and the precompiled executable binary code.
  • the OS 106 can switch between different binary code versions (which were compiled based on semantically equivalent source code, or the same source code) in an application 107, to comply with different consistency and ABI actor contracts (the different binary code versions can be generated at compile time or just-in-time, i.e. during run-time of the application, also in user space).
  • the operating system 106 can be further configured to, if the first processing unit 101 does not comply with a requirement in the requirements information 108, migrate at least a part of the application 107 from being operated by means of the first processing unit 101 to be operated by means of the second processing unit 102, and to control the second processing unit 102 to allocate the first memory segment
  • the operating system 106 can be further configured to, if the first memory segment 104 does not comply with a requirement in the requirements information 108, control the at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, to allocate the second memory segment
  • the OS 106 may decide to either provide the data consistency via a distributed shared memory, or move the memory segment of the code block to another memory segment or processing unit which provides a valid topology contract. Additionally or alternatively, the operating system 106 can be further configured to, if the first processing unit 101, the first memory segment 104 and a predefined part of executable binary code in the application 107 do not comply with a requirement in the requirements information 108, allocate the first memory segment 104 to at least a part of the application 107 by means of software memory emulation, based on the requirements information 108.
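The actions listed above (exchanging the code version, migrating to another processing unit, allocating another memory segment, software memory emulation, or cancelling execution) can be read as a fallback chain. The Python sketch below is one possible reading, not the patent's method: the text does not prescribe an order, so the order used here is an assumption, and the names `Action` and `choose_action` as well as the guarantee strings are hypothetical.

```python
from enum import Enum, auto


class Action(Enum):
    ALLOCATE = auto()           # requirements already met
    SWAP_CODE_VERSION = auto()  # exchange with precompiled compliant code
    MIGRATE = auto()            # run on another processing unit
    OTHER_SEGMENT = auto()      # allocate a different memory segment
    EMULATE = auto()            # software memory emulation
    CANCEL = auto()             # cancel execution / raise a fault


def choose_action(required, links, unit, segment, code_versions=(), allow_emulation=True):
    """Pick a remediation for an allocation request.

    required      -- set of guarantees the running code needs, e.g. {"tso"}
    links         -- dict mapping (unit, segment) -> set of guarantees offered
    code_versions -- requirement sets of alternative precompiled code versions
    """
    offered = links.get((unit, segment), set())
    if required <= offered:
        return Action.ALLOCATE, unit, segment
    # 1. Another binary version of the same code may need less.
    for alt in code_versions:
        if set(alt) <= offered:
            return Action.SWAP_CODE_VERSION, unit, segment
    # 2. Another processing unit may reach the same segment with enough guarantees.
    for (u, s), guarantees in links.items():
        if s == segment and u != unit and required <= guarantees:
            return Action.MIGRATE, u, segment
    # 3. Another segment may be good enough from the current unit.
    for (u, s), guarantees in links.items():
        if u == unit and s != segment and required <= guarantees:
            return Action.OTHER_SEGMENT, unit, s
    # 4. Fall back to emulation, or give up.
    if allow_emulation:
        return Action.EMULATE, unit, segment
    return Action.CANCEL, unit, segment


# Usage: unit101 cannot offer TSO on seg104, but unit102 can.
links = {
    ("unit101", "seg104"): {"cacheable"},
    ("unit102", "seg104"): {"cacheable", "tso"},
    ("unit101", "seg105"): {"cacheable", "tso"},
}
print(choose_action({"tso"}, links, "unit101", "seg104"))
```

A real OS would weigh cost (e.g. migration overhead against emulation overhead) rather than follow a fixed order; the fixed order only keeps the sketch short.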
  • a runtime can finalize pseudo code (e.g., OpenCL) to a specific ABI. Then, a new API exposes a way to manipulate contracts by the programmer at runtime, for user defined behaviors and fine-tuning.
  • the requirements information 108 that is used by the computing system 100 comprises executable binary code that includes information regarding a type and/or a state of a memory segment required by at least a part of the application 107.
  • This executable binary code can also be called enhanced executable binaries, enhanced executable binary format or enhanced executable binary code.
  • Fig. 4 shows a schematic view 400 of an executable and linking (ELF) format 401 and of a portable executable (PE) / common object file format (COFF) format 402, which are examples of executable binaries.
  • the present invention can be applied to both formats. In both cases there are header sections 403, code sections 404, data sections 405, and debug/symbol sections 406. Further, the present invention generally can be applied to any possible file format and is not restricted to the given examples, which represent the most common file formats.
  • a used compiler e.g. GCC, LLVM/clang, MSVC
  • a traditional compilation process remains unchanged, but a backend and a linker involved in compiling are modified.
  • the modified backend generates code for multiple versions of the .text (in code section 404) which support different memory models, ISAs, and ABIs.
  • the number of versions is not limited. All different versions can be included in the enhanced executable binaries and should be interchangeable at a same address space's address range.
  • the modified linker can put together all such different code versions in a same executable binary, while marking each subsection and providing backward compatibility with an original format 407.
  • the modified linker further can create a new executable binary program section 408, which can be called "contract". This section includes all compiler assumptions used during compilation which cannot be extracted from the debug sections.
  • a conventional OS binary loader can be modified in order to load additional executable binary sections that are added in the enhanced executable binaries. These sections augment common OS data structures that describe an address space.
  • the OS binary loader can detect sections that include the additional metadata emitted by the compiler at binary loading/execution time (e.g. upon an execve() system call in a Linux kernel).
  • the OS binary loader accordingly sets up an address space of a loading process. During this process, the OS 106 may check that the memory allocated to the application 107 in the computing system 100 respects the requirements expressed by the metadata included in the executable binary.
  • the address space OS abstraction further is enhanced to include additional alternative code (sub-)sections and metadata information.
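The loading step described above can be sketched in a strongly simplified form. In the Python sketch below, the executable is modelled as a mapping from section names to bytes rather than as a real ELF or PE/COFF image, and the contract payload is assumed to be JSON. The section name `.contract`, the JSON encoding and both function names are assumptions made for illustration only; the patent does not specify an on-disk encoding.

```python
import json

CONTRACT_SECTION = ".contract"   # hypothetical section name


def load_contracts(sections):
    """Return per-section contracts found in the image, or {} for legacy binaries."""
    raw = sections.get(CONTRACT_SECTION)
    if raw is None:
        return {}          # backward compatible: no metadata, nothing to enforce
    return json.loads(raw.decode("utf-8"))


def violations(contracts, backing):
    """List (section, property) pairs whose requirement the backing memory misses.

    contracts -- {section name: {property: required value}}
    backing   -- {section name: {property: value offered by the allocated memory}}
    """
    missing = []
    for section, required in contracts.items():
        offered = backing.get(section, {})
        for prop, value in required.items():
            if offered.get(prop) != value:
                missing.append((section, prop))
    return missing


# Usage: the .data section asks for persistent memory but gets volatile memory.
image = {
    ".text": b"\x90\x90",
    ".data": b"\x00" * 8,
    ".contract": json.dumps({".data": {"persistent": True, "cacheable": True}}).encode("utf-8"),
}
backing = {".data": {"persistent": False, "cacheable": True}}
print(violations(load_contracts(image), backing))  # [('.data', 'persistent')]
```

An image without the extra section yields no contracts and therefore no violations, which corresponds to the backward compatibility with the original format mentioned above.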
  • Fig. 5 shows a schematic view 500 of a process descriptor 501 that can be used by an OS.
  • Fig. 5 shows a schematic view of the process descriptor 501 in a conventional OS, like Linux, BSD, Windows, or Apple OSX.
  • the address space of an application is described by a linked data structure of virtual memory area descriptors 502.
  • Each descriptor is associated with a logical part of a program address space, such as a .text section, a .data section, or a HEAP section.
  • the process descriptor 501 also can include a binary format descriptor 503.
  • with each virtual memory area descriptor 502, one or more "memory contracts" are associated, such as a data descriptor 504 or an actor descriptor 505.
  • the data descriptor 504 (or data contract) is associated with a non-code area of the program and describes compilation options and conventions used during compile-time.
  • the actor descriptor 505 (or actor contract) is associated with a code area of a program. It describes for each subsection of the code the minimum memory model required for a consistent memory access. Multiple actor contracts can be associated to multiple versions of the same code facing the address space.
  • an OS can map each code or non-code section while guaranteeing the requirements explicated in the contracts.
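The process descriptor of Fig. 5 can be sketched as a linked list of virtual memory area descriptors, each carrying its contracts. The Python sketch below is illustrative only: the class and field names are hypothetical, and the contracts are represented as plain dictionaries for brevity.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class VmaDescriptor:
    name: str                         # e.g. ".text", ".data", "HEAP"
    start: int
    end: int                          # exclusive end address
    is_code: bool
    contracts: list = field(default_factory=list)   # data or actor descriptors
    next: Optional["VmaDescriptor"] = None


@dataclass
class ProcessDescriptor:
    binary_format: str
    vmas: Optional[VmaDescriptor] = None

    def add_vma(self, vma):
        """Append a descriptor to the end of the linked list."""
        if self.vmas is None:
            self.vmas = vma
            return
        node = self.vmas
        while node.next is not None:
            node = node.next
        node.next = vma

    def find(self, address):
        """Return the descriptor covering `address`, or None."""
        node = self.vmas
        while node is not None:
            if node.start <= address < node.end:
                return node
            node = node.next
        return None


# Usage: a code area with two actor contracts (two code versions) and a data area.
proc = ProcessDescriptor(binary_format="ELF")
proc.add_vma(VmaDescriptor(".text", 0x1000, 0x2000, True,
                           [{"kind": "actor", "min_consistency": "tso"},
                            {"kind": "actor", "min_consistency": "sequential"}]))
proc.add_vma(VmaDescriptor(".data", 0x2000, 0x3000, False,
                           [{"kind": "data", "abi": "sysv"}]))
hit = proc.find(0x2010)
print(hit.name, [c["kind"] for c in hit.contracts])  # .data ['data']
```

Attaching a list of contracts to each area allows several actor contracts for several versions of the same code, as described for the actor descriptor 505.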
  • Fig. 6 shows a flowchart 600 of an operating manner for unifying memory access of an OS kernel.
  • the flowchart 600 is in particular depicting how an OS kernel may use additional metadata (i.e. the requirements information) according to the invention during process runtime, in order to allow for access to heterogeneous memory regions.
  • Fig. 7 shows a schematic overview of a method 700 according to an embodiment of the invention.
  • the method 700 corresponds to the system 100 of Fig. 1, and is accordingly for operating a computing system 100 for unified memory access.
  • the method 700 comprises a step of controlling 701, by an operating system 106, at least one of a first processing unit 101 and a second processing unit 102, and a shared memory 103, based on requirements information comprised in the operating system 106 and/or an application 107, to allocate a first memory segment 104 to at least a part of the application 107, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application 107.
  • Fig. 8 shows a computing system 800 according to the prior art.
  • the teaching of the patent can in particular be applied to the computer architecture as shown.
  • multiple other processing units such as NDPs, accelerators or RDMAs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention provides a computing system (100) for unified memory access, comprising a first processing unit (101) and a second processing unit (102), a shared memory (103) including a first memory segment (104) and a second memory segment (105), an operating system (106), operated at least partly by the first processing unit (101), and an application (107), operated at least partly by the operating system (106). The first processing unit (101) and the second processing unit (102) are connected to the shared memory (103). The operating system (106) is configured to control at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on requirements information comprised in the operating system (106) and/or the application (107), to allocate the first memory segment (104) to at least a part of the application (107), wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application (107).

Description

COMPUTING SYSTEM FOR UNIFIED MEMORY ACCESS
TECHNICAL FIELD
The present invention relates to a computing system and corresponding method for unified memory access. In particular, the system and method of the present invention affect the way an operating system (OS) allocates a shared memory in a multi-processor system, based on requirements information. The requirements information preferably includes executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of an application.
BACKGROUND
Emerging computer architectures are characterized by an increasingly heterogeneous memory and processor subsystem, mainly due to novel memory technologies, low-latency load/store access interconnects, and the resurgence of near data processing (NDP) that introduces processing units alongside the memory hierarchy.
Addressable memory is composed of on-chip memory, off-chip memory (e.g. usual DIMM modules), or remote-machine memories. Moreover, with the advent of storage class memory (SCM), addressable memory can be volatile or persistent. New interconnects, including CCIX, Gen-Z, OpenCAPI, or serial memory buses, are under development in order to boost intra-machine low-latency memory-mapped communication. The unifying feature behind these techniques is the provision of a shared memory interconnect across all components within a system, either on a single-node basis or at rack scale. Some also aim to provide hardware-based cache coherence. Lastly, due to technology innovations, NDP is regaining traction. NDP is the co-location of processing and memory, in the form of processing in (main) memory (PIM), or in-storage processing (ISC).
The aforementioned technologies enable different types of processing units to access a same memory at a same time. Fig. 8 shows a scenario in which a main memory is accessed by near data processors, CPUs that are attached to a coherent interconnect, accelerators interconnected to CPUs via a peripheral bus, and remote processing units (e.g. near data processors, CPU, or GPU) connected via an RDMA-enabled interface. Accelerator-CPU-NIC setups, in which an accelerator (FPGA, GPU, or Xeon-Phi), a CPU, and a NIC share a common memory area, are a sub-scenario of the aforementioned technology. The number of processing units accessing the same memory area steadily increases with NDP and rack-scale computing. Moreover, all memory is becoming load/store accessible inter- and intra-machine.
In the prior art, a shift towards multiple kernel OSes, or multiple OSes/runtimes, accessing a same memory area, can be observed. Additionally, conventional OS architectures do not consider memory heterogeneity, apart from (not-)cacheable areas and non-uniform memory access (NUMA).
In conventional computing systems, applications are compiled statically for one computer platform, therefore complying with a specific, homogeneous, memory model. However, the same memory area will be accessed from processors potentially with different instruction set architectures (ISAs), consistency models, and offering different synchronization mechanisms.
In order to fully support emerging architectures in which multiple processing units share access to a same memory area, a different construction of system software that enables format-compatible sharing of data among different (OS-capable) ISA processors, coherent memory access (based on the expectations of the application programmer), protection among processors, transparency for the programmer, and exploitation of efficient communication is required.
These prior art solutions raise questions regarding how shared memory among different computing units of a system can be efficiently and effectively managed by system software.
Prior art solutions deal with the issue of heterogeneity of memory and processing units in a rather static manner. A programmer needs to explicitly address the consistency and incompatibility issues presented due to the heterogeneity beforehand, by preparing software for a particular target architecture, and by potentially adopting a specialized programming model. Using heterogeneous accelerators further involves explicit programming, for example in some domain-specific accelerator language. Further, potential advantages brought by heterogeneity are not fully exploitable at run-time, as they need to be explicitly addressed at programming and application development time, which reduces the opportunities for further optimizations. The disadvantages of the prior art solutions are: forcing application developers to adopt specific programming models and requiring software modifications; targeting a specific heterogeneous environment, and excluding the possibility of running in potentially several available accelerator environments that may be co-hosted; and not permitting optimization opportunities that could be identified during application run-time.
SUMMARY
In view of the above-mentioned problems and disadvantages, the present invention aims to improve the conventional systems and methods. The present invention thereby has the object to provide a system which overcomes the drawbacks of heterogeneity in emerging computer architectures. By providing a combination of compile-time and run-time mechanisms at operating system and application level, a programmer is freed from having to design an application to target a specific heterogeneous platform, and thus memory access in a heterogeneous environment can be unified. This also enables backward compatibility of legacy applications, since they do not need to be redesigned in order to make use of the benefits of a heterogeneous platform. An operating system is further enabled to make run-time decisions and actively exploit the benefits of heterogeneous devices (for example, by scheduling and transparently migrating processes when and where this would lead to better performance and efficiency), without requiring explicit involvement at application development time.
In particular the present invention proposes a solution that uses "memory contracts" (which can also be referred to as "requirements information") as a system software solution which is created at compile-time by a compiler or linker and is implemented at run-time, managed by an operating system. According to the memory contracts, checks can be performed at run-time, e.g. about consistency, protection, or coherence guarantees that the code that acts on a memory area has to comply with.
The memory contracts can include an enhanced executable binary format, extended to maintain metadata sections that include memory consistency, ISA and application binary interface (ABI) requirements. Thus, a conventional OS binary loader can be enriched in order to recognize the metadata sections and load them at run-time. Further, an OS is enabled to dynamically select matching contracts at run-time for the variety of possible processing elements present in a heterogeneous computing architecture, and in addition to migrate tasks (processes, threads) to any processing unit transparently to the user, so as to achieve uniform access of memory. The present invention thereby solves the following problems existing in prior art: Application writers are relieved of having to adopt a specific programming model in order to use heterogeneous resources. Applications are enabled to take advantage of several available heterogeneous resources, instead of targeting a specific resource. An operating system is enabled to transparently make dynamic decisions at runtime, in order to better exploit the available heterogeneous resources.
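By way of illustration only, the following minimal sketch shows how a metadata section carrying memory consistency, ISA and ABI requirements might be emitted at compile-time and parsed by an enriched loader at run-time. The record layout, the numeric codes and the names RECORD, pack_contract and load_contracts are assumptions made for this example; the present description does not prescribe a concrete encoding.

```python
import struct

# Hypothetical fixed-size record for one memory contract entry:
# 16-byte section name, then one byte each for consistency, ISA and ABI.
# Layout and numeric codes are illustrative assumptions only.
RECORD = struct.Struct("<16sBBB")

CONSISTENCY = {0: "sequential", 1: "tso", 2: "relaxed"}
ISA = {0: "x86_64", 1: "aarch64", 2: "riscv64"}
ABI = {0: "sysv", 1: "aapcs64"}


def pack_contract(section, consistency, isa, abi):
    """Serialize one contract entry, as a compiler or linker might emit it."""
    return RECORD.pack(section.encode("ascii"), consistency, isa, abi)


def load_contracts(blob):
    """Parse a metadata section into dictionaries, as a loader might do."""
    contracts = []
    for offset in range(0, len(blob), RECORD.size):
        name, c, i, a = RECORD.unpack_from(blob, offset)
        contracts.append({
            "section": name.rstrip(b"\0").decode("ascii"),
            "consistency": CONSISTENCY[c],
            "isa": ISA[i],
            "abi": ABI[a],
        })
    return contracts


# Two code subsections compiled for different targets.
blob = pack_contract(".text.a", 1, 0, 0) + pack_contract(".text.b", 2, 1, 1)
contracts = load_contracts(blob)
```

In this sketch each code subsection carries its own record, so that an OS could select, per subsection, a processing element whose properties match the recorded requirements.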
The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims.
A first aspect of the present invention provides a computing system for unified memory access, comprising a first processing unit and a second processing unit, a shared memory including a first memory segment and a second memory segment, an operating system, operated at least partly by the first processing unit, and an application, operated at least partly by the operating system, wherein the first processing unit and the second processing unit are connected to the shared memory, wherein the operating system is configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on requirements information comprised in the operating system and/or the application, to allocate the first memory segment to at least a part of the application, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application.
As memory access can be performed according to the requirements information comprised in the operating system and/or the application, a programmer is freed from having to design an application to target a specific heterogeneous platform. Further, the operating system is enabled to make run-time decisions regarding memory access and actively exploit the benefits of heterogeneous devices. The OS is enabled to effectively and efficiently control heterogeneous processing units and/or heterogeneous memory segments, when running an application.
In a first implementation form of the system according to the first aspect, the first processing unit and the second processing unit are of different processing unit architecture. This ensures that memory access can in particular be unified in heterogeneous computing systems, more specifically in those in which a first and a second computing unit is of different architecture.
In a second implementation form of the system according to the first aspect, the requirements information further comprises 1st requirements information relating to properties of executable binary code of at least a part of the application, and the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 1st requirements information, to allocate the first memory segment to at least a part of the application. This ensures that memory access can in particular be unified by considering properties of executable binary code of at least a part of the application, when operating the computing system.
In a third implementation form of the system according to the first aspect, the 1st requirements information is executable binary code and comprises information regarding an application binary interface, ABI, used to compile at least a part of the application, and/or a format used to compile at least a part of the application, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application, and/or a security policy.
This ensures that specific and detailed information and parameters in the 1st requirements information can be considered to unify memory access, when operating the computing system.
In a fourth implementation form of the system according to the first aspect, the requirements information comprises 2nd requirements information relating to an executable binary code of at least one predefined code segment of the application, and the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 2nd requirements information, to allocate the first memory segment to at least a part of the application.
This ensures that memory access can in particular be unified by considering executable binary code of at least one predefined code segment of the application, when operating the computing system.
In a fifth implementation form of the system according to the first aspect, the 2nd requirements information is executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application, and/or information regarding a memory model a predefined code segment of the application is compiled for, and/or a security policy for each memory segment the application can access.
This ensures that specific and detailed information and parameters in the 2nd requirements information can be considered to unify memory access, when operating the computing system.
In a sixth implementation form of the system according to the first aspect, the requirements information comprises 3rd requirements information relating to a connection between the shared memory and at least one of the first processing unit and the second processing unit, and the operating system is further configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on the 3rd requirements information, to allocate the first memory segment to at least a part of the application. This ensures that memory access can in particular be unified by considering information regarding a connection between the shared memory and at least one of the first processing unit and the second processing unit, when operating the computing system.
In a seventh implementation form of the system according to the first aspect, the 3rd requirements information is created by the operating system and comprises information regarding a cache coherency guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory.
This ensures that specific and detailed information and parameters in the 3rd requirements information can be considered to unify memory access, when operating the computing system.
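Purely as an illustrative sketch, the 3rd requirements information can be pictured as a table keyed by pairs of processing unit and memory segment, holding a cache coherency guarantee, a memory access latency and the type of hardware protection mechanism. The names Link, TOPOLOGY and segments_for, as well as all values shown, are assumptions of this example and are not taken from the present description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Properties of one processing-unit-to-memory-segment connection."""
    coherent: bool      # cache coherency guarantee on this link
    latency_ns: int     # memory access latency (made-up figures)
    protection: str     # hardware protection mechanism, "none" if absent


# Hypothetical table an OS might build when probing the platform.
TOPOLOGY = {
    ("unit1", "seg1"): Link(True, 90, "mmu"),
    ("unit1", "seg2"): Link(False, 300, "none"),
    ("unit2", "seg1"): Link(True, 150, "iommu"),
    ("unit2", "seg2"): Link(True, 60, "iommu"),
}


def segments_for(topology, unit, need_coherent):
    """Return the segments usable from `unit`, lowest latency first."""
    candidates = [
        (link.latency_ns, segment)
        for (u, segment), link in topology.items()
        if u == unit and (link.coherent or not need_coherent)
    ]
    return [segment for _, segment in sorted(candidates)]
```

Under these assumed values, segments_for(TOPOLOGY, "unit1", True) yields only "seg1", because the link from "unit1" to "seg2" offers no coherency guarantee.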
In an eighth implementation form of the system according to the first aspect, the operating system is further configured to, if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application does not comply with a requirement in the requirements information, adjust a configuration of at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application, based on the requirements information, to allocate at least one of the first memory segment and the second memory segment to at least a part of the application.
This ensures that a configuration of at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application can be adjusted based on the requirements information, if it is detected that a requirement in the requirements information is not complied with while performing memory access.
In a ninth implementation form of the system according to the first aspect, the operating system is further configured to, if the first processing unit does not comply with a requirement in the requirements information, migrate at least a part of the application from being operated by means of the first processing unit to be operated by means of the second processing unit, and to control the second processing unit to allocate the first memory segment to at least a part of the application, based on the requirements information.
This ensures that unified memory access can be obtained by migrating at least a part of the application, if it is detected that a requirement in the requirements information is not complied with while performing memory access.
In a tenth implementation form of the system according to the first aspect, the operating system is further configured to, if a predefined part of executable binary code in the application does not comply with a requirement in the requirements information, exchange the predefined part of executable binary code in the application with precompiled executable binary code that complies with the requirements, and to allocate the first memory segment to at least a part of the application, based on the requirements information and the precompiled executable binary code.
This ensures that unified memory access can be obtained by exchanging a predefined part of executable binary code in the application with precompiled executable binary code, if it is detected that a requirement in the requirements information is not complied with while performing memory access.
In an eleventh implementation form of the system according to the first aspect, the operating system is further configured to, if the first memory segment does not comply with a requirement in the requirements information, control the at least one of the first processing unit and the second processing unit, and the shared memory, to allocate the second memory segment to at least a part of the application, based on the requirements information.
This ensures that unified memory access can be obtained by controlling at least one of the first processing unit and the second processing unit, and the shared memory, to allocate a second memory segment to at least a part of the application, if it is detected that a requirement in the requirements information is not complied with while performing memory access.
In a twelfth implementation form of the system according to the first aspect, the operating system is further configured to, if the first processing unit, the first memory segment and a predefined part of executable binary code in the application do not comply with a requirement in the requirements information, allocate the first memory segment to at least a part of the application by means of software memory emulation, based on the requirements information.
This ensures that unified memory access can be obtained by allocating the first memory segment to at least a part of the application by means of software memory emulation, if it is detected that a requirement in the requirements information is not complied with while performing memory access.
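The reactions of the ninth to twelfth implementation forms can be summarized in a short sketch. The function name choose_remedies, the action labels and in particular the order in which the individual reactions are listed are assumptions made for illustration; the implementation forms above do not define a priority among them.

```python
def choose_remedies(unit_ok, segment_ok, code_ok):
    """Map detected contract violations to the reactions described above.

    Each argument states whether the first processing unit, the first
    memory segment, or the predefined part of executable binary code
    complies with the requirements information.
    """
    # Twelfth implementation form: nothing complies, fall back to emulation.
    if not (unit_ok or segment_ok or code_ok):
        return ["software_memory_emulation"]

    remedies = []
    if not unit_ok:
        # Ninth implementation form: migrate to the second processing unit.
        remedies.append("migrate_to_second_unit")
    if not code_ok:
        # Tenth implementation form: swap in precompiled compliant code.
        remedies.append("exchange_precompiled_code")
    if not segment_ok:
        # Eleventh implementation form: use the second memory segment.
        remedies.append("allocate_second_segment")
    # An empty list means no violation: the first segment can be allocated.
    return remedies
```

For example, a non-compliant processing unit combined with non-compliant code yields both a migration and a code exchange in this sketch, whereas non-compliance of all three elements yields software memory emulation only.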
In a thirteenth implementation form of the system according to the first aspect, the at least two memory segments are of different memory segment architecture.
This ensures that memory access can in particular be unified in heterogeneous computing systems, more specifically in those in which at least two memory segments are of different architecture.
A second aspect of the present invention provides a method for operating a computing system for unified memory access that comprises a first processing unit and a second processing unit, a shared memory including a first memory segment and a second memory segment, an operating system, operated at least partly by the first processing unit, and an application, operated at least partly by the operating system, wherein the first processing unit and the second processing unit are connected to the shared memory, the method comprising the steps of controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on requirements information comprised in the operating system and/or the application, to allocate the first memory segment to at least a part of the application, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application.
In a first implementation form of the method according to the second aspect, the first processing unit and the second processing unit are of different processing unit architecture.
In a second implementation form of the method according to the second aspect, the requirements information further comprises 1st requirements information relating to properties of executable binary code of at least a part of the application, and the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 1st requirements information, to allocate the first memory segment to at least a part of the application.
In a third implementation form of the method according to the second aspect, the 1st requirements information is executable binary code and comprises information regarding an application binary interface, ABI, used to compile at least a part of the application, and/or a format used to compile at least a part of the application, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application, and/or a security policy.
In a fourth implementation form of the method according to the second aspect, the requirements information comprises 2nd requirements information relating to an executable binary code of at least one predefined code segment of the application, and the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 2nd requirements information, to allocate the first memory segment to at least a part of the application.
In a fifth implementation form of the method according to the second aspect, the 2nd requirements information is executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application, and/or information regarding a memory model a predefined code segment of the application is compiled for, and/or a security policy for each memory segment the application can access.
In a sixth implementation form of the method according to the second aspect, the requirements information comprises 3rd requirements information relating to a connection between the shared memory and at least one of the first processing unit and the second processing unit, and the method further includes controlling, by the operating system, at least one of the first processing unit and the second processing unit, and the shared memory, based on the 3rd requirements information, to allocate the first memory segment to at least a part of the application.
In a seventh implementation form of the method according to the second aspect, the 3rd requirements information is created by the operating system and comprises information regarding a cache coherency guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory.
In an eighth implementation form of the method according to the second aspect, the method further includes, if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application does not comply with a requirement in the requirements information, adjusting, by the operating system, a configuration of at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a part of the application, based on the requirements information, to allocate at least one of the first memory segment and the second memory segment to at least a part of the application.
In a ninth implementation form of the method according to the second aspect, the method further includes, if the first processing unit does not comply with a requirement in the requirements information, migrating, by the operating system, at least a part of the application from being operated by means of the first processing unit to be operated by means of the second processing unit, and controlling, by the operating system, the second processing unit to allocate the first memory segment to at least a part of the application, based on the requirements information.
In a tenth implementation form of the method according to the second aspect, the method further includes, if a predefined part of executable binary code in the application does not comply with a requirement in the requirements information, exchanging, by the operating system, the predefined part of executable binary code in the application with precompiled executable binary code that complies with the requirements information, and allocating the first memory segment to at least a part of the application, based on the requirements information and the precompiled executable binary code.
In an eleventh implementation form of the method according to the second aspect, the method further includes, if the first memory segment does not comply with a requirement in the requirements information, controlling, by the operating system, the at least one of the first processing unit and the second processing unit, and the shared memory, to allocate the second memory segment to at least a part of the application, based on the requirements information.
In a twelfth implementation form of the method according to the second aspect, the method further includes, if the first processing unit, the first memory segment and a predefined part of executable binary code in the application do not comply with a requirement in the requirements information, allocating, by the operating system, the first memory segment to at least a part of the application by means of software memory emulation, based on the requirements information.
In a thirteenth implementation form of the method according to the second aspect, the at least two memory segments are of different memory segment architecture.
The method of the second aspect and its implementation forms achieve the same advantages as the system of the first aspect and its respective implementation forms.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
Fig. 1 shows a computing system according to an embodiment of the present invention
Fig. 2 shows a computing system according to an embodiment of the present invention
Fig. 3 shows a computing system according to an embodiment of the present invention in more detail
Fig. 4 shows a schematic view of an ELF format and of a PE/COFF format
Fig. 5 shows a schematic view of an OS process descriptor according to the present invention
Fig. 6 shows a flowchart of an operating manner for unifying memory access of an OS kernel
Fig. 7 shows a schematic overview of a method according to an embodiment of the invention
Fig. 8 shows a computing system according to the prior art.
DETAILED DESCRIPTION OF EMBODIMENTS
Fig. 1 shows a computing system 100 according to an embodiment of the present invention. The computing system 100 allows for unified memory access and comprises a first processing unit 101 and a second processing unit 102, as well as a shared memory 103 including a first memory segment 104 and a second memory segment 105.
Each of the processing units 101, 102 can, for example, be one of a CPU, a CPU core, a GPU, a GPU core, a near data processor, a CPU that is attached to a coherent interconnect, an accelerator interconnected to a CPU via a peripheral bus, a remote processing unit (e.g. near data processor, CPU, or GPU) connected via an RDMA-enabled interface, or a kernel, e.g. a kernel of an OS. The computing system 100 can include an arbitrary number of processing units, as long as it includes at least the first processing unit 101 and the second processing unit 102, e.g. according to the above definition.
The first processing unit 101 and the second processing unit 102 optionally can be of different processing unit architecture. This can include that the first processing unit 101 is a first entity, selected from the above list, and the second processing unit 102 is a different entity, selected from the above list. This also can include that the first processing unit 101 and the second processing unit 102 are binary incompatible, e.g. because they operate according to different ISAs.
The shared memory 103 includes the first memory segment 104 and the second memory segment 105. Each memory segment 104, 105 can either be classic main memory, such as a random access memory (RAM), or storage class memory. Each memory segment 104, 105 can be volatile or non-volatile, as well as coherent or non-coherent. More specifically, each of the memory segments 104, 105 can be implemented on-chip, off-chip (e.g. by usual DIMM modules), by a conventional coherency interconnect, or by inter-machine memories. More specifically, each link to a memory segment 104, 105 can be implemented using cache-coherent or non-coherent interconnects, or with new technologies, e.g. CCIX, Gen-Z, OpenCAPI, or a serial memory bus. In particular, when forming the shared memory 103 (which - among other memory segments that can be present - consists of the first memory segment 104 and the second memory segment 105), the first memory segment 104 and the second memory segment 105 can optionally be of different memory segment architecture, e.g. each chip providing one or more memory segments can be built according to a different memory technology and/or design, and/or can comprise multiple sections of different memory technology and/or design. This can include that the first memory segment 104 is a first entity selected from the above list and that the second memory segment 105 is a second, different entity selected from the above list. This also can include that the first memory segment 104 and the second memory segment 105 are binary incompatible, e.g. because they operate according to different ISAs.
More specifically, the shared memory 103 can be configured to enable simultaneous access of multiple operating systems, and/or multiple processing units, and/or multiple applications, to the shared memory 103, preferably to a same memory segment 104, 105 in the shared memory, at a same time.
The computing system 100 further comprises an OS 106. The OS 106 can be a conventional single- or multi-tasking and/or single- or multi-user OS, such as Linux, BSD, Windows, or Apple OSX. The OS can also be a distributed, a templated, an embedded, a real-time, or a library OS. The OS 106 can also exclusively be a single kernel. The OS 106 can also comprise multiple kernels. The computing system 100 can be operated using a single OS 106, but also using multiple OSes 106, as long as there is at least one OS 106. The OS 106 can in particular be binary incompatible with the first processing unit 101 and/or the second processing unit 102, i.e. operated according to a different ISA. In case there are multiple OSes 106 comprised by the computing system 100 that operate on multiple processing units, the OSes 106 can be binary incompatible with each other. Fig. 1 in particular shows an OS 106 that operates on multiple processing units such as near data processors, CPUs, accelerators, and remote units. These multiple processing units also comprise the first processing unit 101 and the second processing unit 102. The OS 106 is operated at least partly by the first processing unit 101. The second processing unit 102 can be controlled by the OS 106, to optimize a configuration of the computing system 100, to enable unified memory access.
A further configuration of the computing system 100, which can also be included in this embodiment, is described in view of Fig. 2 below. In that configuration, the computing system 100 comprises multiple OSes 106.
The computing system 100 further comprises an application 107. The application 107 is at least partly operated by the OS 106. This includes that the application can, for example, be a distributed application which is operated by means of multiple OSes, one of those being the OS 106. More specifically, the application 107 at least comprises one of the application parts "Code A", "Code B", "Code C" or "Code D", as shown in Fig. 1 or Fig. 2, thereby pointing out that the application 107 can be operated by one OS on multiple processing units, and even by multiple kernels, OSes or runtimes on multiple processing units. The application 107 acquires memory access, e.g. by trying to allocate a memory segment 104, 105 in the shared memory. Acquiring memory access also can involve operation of the first and/or the second processing unit 101, 102, the shared memory 103, one of its memory segments 104, 105 or the OS 106, to enable unified memory access. To access the shared memory 103, the first processing unit 101 and the second processing unit 102 are connected to the shared memory 103, e.g. by means of a bus, more specifically, via a bus that supports the load/store interface. The bus may offer different types of consistency for each different interconnect segment.
In order to allow for unified memory access, the OS 106 is configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on requirements information 108 comprised in the operating system 106 and/or the application 107, to allocate the first memory segment 104 to at least a part of the application 107. The requirements information 108 further can be maintained by means of software, e.g. by means of the OS 106 or the application 107. That is, the OS considers the requirements information 108 and performs predefined actions, based on the requirements information 108, to control at least one of the first processing unit 101, the second processing unit 102, or the shared memory 103 to access the shared memory 103, more specifically to allocate the first memory segment 104 to at least a part of the application 107, that needs memory to be allocated. The requirements information 108, among others, comprises executable binary code that comprises information regarding a type and/or a state of a memory segment 104, 105 required by at least a part of the application 107.
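As a minimal sketch of the allocation step described above, the requirements information can be reduced to a required type and state of a memory segment, against which the OS checks the available segments. The classes Segment and Requirement and the functions satisfies and allocate are hypothetical names introduced for this example only, and the chosen attributes (kind, volatile, coherent) are a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    name: str
    kind: str        # type, e.g. "ram" or "storage_class"
    volatile: bool   # state attributes of the segment
    coherent: bool


@dataclass(frozen=True)
class Requirement:
    """Type/state required by a part of the application; None = don't care."""
    kind: Optional[str] = None
    volatile: Optional[bool] = None
    coherent: Optional[bool] = None


def satisfies(segment, requirement):
    """Check every constrained attribute of the requirement."""
    for field in ("kind", "volatile", "coherent"):
        wanted = getattr(requirement, field)
        if wanted is not None and getattr(segment, field) != wanted:
            return False
    return True


def allocate(segments, requirement):
    """Return the first segment that meets the requirement, else None."""
    for segment in segments:
        if satisfies(segment, requirement):
            return segment
    return None


first = Segment("first", "ram", volatile=True, coherent=True)
second = Segment("second", "storage_class", volatile=False, coherent=True)
chosen = allocate([first, second], Requirement(volatile=False))
```

In this example a part of the application that requires non-volatile memory is given the second segment, while an unconstrained requirement would be served by the first segment.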
The OS 106 can load the executable binary code that also provides supplementary information for each program section of the application 107. The requirements information 108 can also be called a memory contract. The memory contract can define for each code (.text) subsection of the application 107 a minimal requirement that is required for correct memory access. The requirements information 108, more specifically the executable binary code, can be created by a compiler that automatically generates the executable binary code transparently to a program developer, who thus can use additional language pragmas or a new OS API to modify the memory contracts at compile-time or run-time, respectively. The requirements information 108 can also include 1st, 2nd and 3rd requirements information, which is going to be described in detail in view of Fig. 3 below. The 1st requirements information can also be called data contract, the 2nd requirements information can also be called actor contract, and the 3rd requirements information can also be called topology contract.
Fig. 2 shows an example configuration of the computing system 100 according to an embodiment of the present invention. The computing system 100 as shown in Fig. 2 includes all features of the computing system 100 of Fig. 1 and can in particular be operated using multiple OSes. The OSes can in particular be of a type as described in view of Fig. 1. The OS 106 is comprised by the multiple OSes in the computing system 100 in Fig. 2. The multiple OSes can in particular be binary incompatible with each other, and/or with the first processing unit 101 and/or the second processing unit 102, i.e. operated according to a different ISA.
The computing system 100 in Fig. 2 at least requires the OS 106 to be operated. However, due to the support for multiple OSes, the computing system 100 can also be implemented in environments in which multiple OSes are running on multi-core, multi-processor, or distributed systems. Each of the multiple OSes can run on a different processing unit. In particular, each OS can run on a different processing unit at least partly, i.e. in a distributed fashion. The OS 106 is however required to operate the computing system 100 and to control each of the multiple OSes in the computing system 100. The multiple OSes in the computing system 100 can be controlled by the OS 106, e.g. to optimize a configuration of the computing system 100, to enable unified memory access, e.g. to migrate an application at least partly to a different OS.
Fig. 3 shows a computing system 300 according to an embodiment of the present invention in more detail. The computing system 300 includes all features and functionality of the computing system 100 as described in view of Fig. 1 and Fig. 2 above. Thus, identical features are labelled with identical reference signs. In the description in view of Fig. 3, in particular the 1st, 2nd and 3rd requirements information (i.e. the data contract, the actor contract, and the topology contract), and the concept of memory contracts are going to be described in more detail.
Memory contracts can also be regarded as an OS abstraction with an OS interface for a programmer. A memory contract can be metadata for an address space region (being data, code, file, swap, and/or a combination of these). There are two types of memory contracts associated to the application 107: data contracts and actor contracts. The data contract can be associated to memory areas accessed by the application and the actor contract can be associated to executable code of the application. Additionally, the topology contract can describe processing-unit-to-memory-segment characteristics. Memory contracts can be used by each OS running on at least two processing units to enforce at least the following memory area properties: format (in the sense of ABI), consistency guarantees, cacheability guarantees, persistency, and user privileges (memory protection). More examples are given below. As illustrated in Fig. 3, the requirements information 108 optionally can further comprise the 1st requirements information 301, which also can be called data contract.
The 1st requirements information 301 can relate to properties of executable binary code of at least a part of the application 107. The operating system 106 is further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 1st requirements information 301, to allocate the first memory segment 104 to at least a part of the application 107.
The 1st requirements information (i.e. the data contract) expresses an ABI and/or a format that a compiler used when compiling a specific (initialized or not-initialized) data part of the application 107. Even if heap and stack are memory areas that are runtime-populated by an application 107, their ABI and format can be defined at compile time, thus also heap and stack can be characterized by a data contract. For each memory mapped file, its data contract is either inherited from a creating application, or set by a user. Persistency is another characteristic expressed by a data contract. It can also be extracted at compile time due to modern compilers' support for persistent memory. Finally, a data contract defines the owner(s) of a memory segment and a security policy to grant capabilities to a memory segment and/or processes/processing units. In other words, the 1st requirements information 301 can be regarded as executable binary code and comprises information - such as assumptions and rules - used when compiling the binary code, more specifically an ABI used to compile at least a part of the application 107, and/or a format used to compile at least a part of the application 107, and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application 107, and/or a security policy.
The format in particular specifies alignments, and/or a data structure field order, and/or an ABI, and/or padding, and/or structure field organization, and/or persistency, and/or cacheability. Persistency is a property of a memory segment to retain data after a power supply is removed. A memory is non-persistent or volatile when data stored by the memory is lost after the power supply is removed. A memory is persistent if data stored in it is not lost after the power supply is removed. An example of non-persistent memory is SDRAM or SRAM. An example of persistent memory is NVDIMM or Flash. The persistency characteristic includes information regarding the persistency of the shared memory 103 and/or the memory segments 104, 105.
The ownership of a memory segment 104, 105 includes information regarding what application 107 or user may access the memory segment 104, 105.
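The properties carried by a data contract can be pictured with a minimal sketch. It is illustrative only: the class `DataContract`, its field names, and the helper functions are hypothetical and are not defined by this disclosure; they merely mirror the ABI, format, persistency and ownership properties listed above.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DataContract:
    """Hypothetical metadata for a non-code memory area (illustrative names)."""
    abi: str                 # ABI label the compiler used, e.g. "sysv-x86_64"
    alignment: int           # required alignment in bytes
    field_order: str         # data structure field order convention
    persistent: bool         # True if the data must survive loss of power
    owners: FrozenSet[str]   # applications or users allowed to access the area


def may_access(contract: DataContract, principal: str) -> bool:
    """Ownership check: only listed owners may access the memory segment."""
    return principal in contract.owners


def segment_satisfies(contract: DataContract, segment_is_persistent: bool) -> bool:
    """A volatile segment cannot back a contract that requires persistency."""
    return segment_is_persistent or not contract.persistent


# Example values are assumptions chosen for illustration.
heap = DataContract("sysv-x86_64", 16, "declaration", False, frozenset({"app107"}))
log = DataContract("sysv-x86_64", 64, "declaration", True, frozenset({"app107"}))
```

In this sketch a heap area can be placed on volatile memory such as SDRAM, whereas the `log` area can only be backed by a persistent segment such as NVDIMM.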
As it is further illustrated in Fig. 3, the requirements information 108 optionally can further comprise the 2nd requirements information 302, which also can be called actor contract.
The second requirements information 302 can relate to an executable binary code of at least one predefined code segment of the application 107. The OS 106 can be further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 2nd requirements information 302, to allocate the first memory segment 104 to at least a part of the application 107.
The second requirements information 302 (i.e. the actor contract) can express an ABI that a compiler used when compiling a specific code segment of the application 107. Moreover, the actor contract explicates for each code segment which memory model that code segment was compiled for, such as consistency guarantees and cacheability requirements. Finally, an actor contract stores a set of capabilities for each memory area it can access (i.e. a security policy). In other words, the 2nd requirements information 302 can be regarded as executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application 107, and/or information regarding a memory model a predefined code segment of the application 107 is compiled for, and/or a security policy for each memory segment the application 107 can access.
Thereby, consistency guarantees and cacheability can be regarded as interrelated concepts.
Consistency is a memory property. When multiple processing units are operating on a same segment of memory, the type/level of consistency defines the way in which a modification of the memory by the first processing unit 101 is propagated to the second processing unit 102. For example, "strong consistency" requires that every change made by the first processing unit 101 immediately appears in the view that the second processing unit 102 has of the memory segment. A cache is a mechanism in memory hardware that provides consistency. Cacheability is the capability of memory hardware to provide a sort of consistency via the cache. The security policy in particular can define what memory segment a single application 107 can read, write, or execute. The security policy further can be defined as a set of properties that are concerned with a capability of code in live execution on a processing unit to access a predefined memory segment.
The above mentioned memory model can define a way in which binary code can access a memory segment in the shared memory. That is because a same memory segment can have different cacheability, access latency, protection mechanisms, and persistency properties regarding different processing units.
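As a hedged illustration of an actor contract, the following sketch models the memory model a code segment was compiled for and its per-segment capabilities. The names `Consistency`, `ActorContract`, `memory_model_ok` and `allowed` are hypothetical, and the linear ordering of consistency levels is a simplifying assumption made only for this example.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Dict, Set


class Consistency(IntEnum):
    """Assumed ordering from weakest to strongest guarantee."""
    RELAXED = 0
    TSO = 1
    SEQUENTIAL = 2


@dataclass(frozen=True)
class ActorContract:
    """Hypothetical metadata for one code segment (illustrative names)."""
    abi: str
    required: Consistency            # memory model the code was compiled for
    needs_cacheable: bool            # cacheability requirement
    capabilities: Dict[str, Set[str]] = field(default_factory=dict)


def memory_model_ok(actor: ActorContract, provided: Consistency, cacheable: bool) -> bool:
    """The memory must provide at least the guarantees the code assumes."""
    return provided >= actor.required and (cacheable or not actor.needs_cacheable)


def allowed(actor: ActorContract, segment: str, op: str) -> bool:
    """Security policy: is operation 'r', 'w' or 'x' granted on the segment?"""
    return op in actor.capabilities.get(segment, set())


# Example values are assumptions chosen for illustration.
actor = ActorContract("aapcs64", Consistency.TSO, True, {"seg104": {"r", "w"}})
```

Under these assumptions, code compiled for TSO can run against memory that offers sequential consistency, but not against memory that only offers relaxed ordering.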
In another implementation example, in the computing system 100, the work of the operating system can be divided (in different amounts) between the operating system 106, a runtime, and a hypervisor.
As it is further illustrated in Fig. 3, the requirements information 108 optionally can further comprise the 3rd requirements information 303, which also can be called topology contract. The 3rd requirements information can relate to a connection between the shared memory 103 and at least one of the first processing unit 101 and the second processing unit 102. The operating system 106 can be further configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on the 3rd requirements information 303, to allocate the first memory segment 104 to at least a part of the application 107.
In a specific implementation manner of the computing system 300, the OS 106 can be built by multiple kernels, each running on a different processing unit, which share information about a topology of the computing system and data contracts of each shared memory segment.
Due to the fact that multiple memory segments 104, 105 exist in the computing system 100, and that virtual memory areas can be relocated among such segments, 3rd requirements information (i.e. the topology contract) is required, which describes a connection between a physical memory segment and each specific processing unit. A topology contract also can describe cache coherency guarantees that exist between a memory segment and a processing unit. Additionally, other information can be associated to the topology contract, e.g. memory access latency, information whether the memory is persistent, information regarding the existence and the type of hardware protection mechanisms between memory links. The 3rd requirements information 303 in particular can be created by the operating system 106, more specifically, based on a hardware topology/geometry of the computing system 300 (i.e. the way that hardware components are connected in the computing system 300).
In other words, the 3rd requirements information 303 can comprise information regarding a cache coherency guarantee between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or a memory access latency between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or information regarding existence and a type of hardware protection mechanisms in the shared memory 103. A cache coherency guarantee can be regarded as a set of different processing units that requires consistent access to a memory segment when a cache coherency mechanism is interposed between the processing units and the memory segments. The cache coherency mechanism can provide different types of consistency, such as sequential consistency or total store order (TSO). The cache coherency mechanism does not have to be present at all, or it can merely snoop memory bus operations.
Memory access latency can be regarded as the time that a memory access instructed by a processing unit requires to be executed. In fact, a single access can take a different amount of time depending on the physical distance of the memory segment from the processing unit and on whether the specific data is cached or not.
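A topology contract can be pictured as a table keyed by (processing unit, memory segment) pairs. The sketch below is an assumption-laden illustration: the names `Link` and `best_segment`, the identifiers such as `pu101`, and all latency figures are invented for the example and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical description of one processing-unit-to-segment connection."""
    cache_coherent: bool
    latency_ns: int       # invented figures, for illustration only
    protection: str       # e.g. "paging", "segmentation" or "none"


Topology = Dict[Tuple[str, str], Link]

topology: Topology = {
    ("pu101", "seg104"): Link(True, 80, "paging"),
    ("pu101", "seg105"): Link(True, 300, "paging"),
    ("pu102", "seg104"): Link(False, 250, "none"),
    ("pu102", "seg105"): Link(True, 120, "paging"),
}


def best_segment(topo: Topology, unit: str, need_coherent: bool) -> Optional[str]:
    """Pick the lowest-latency segment reachable from 'unit' that meets the
    coherency requirement, or None if no connection qualifies."""
    candidates = [
        (link.latency_ns, seg)
        for (pu, seg), link in topo.items()
        if pu == unit and (link.cache_coherent or not need_coherent)
    ]
    return min(candidates)[1] if candidates else None
```

With these example values, a request from `pu102` that needs cache coherency is steered to `seg105`, since its connection to `seg104` is not coherent.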
Hardware protection mechanisms can be memory paging and/or memory segmentation (segmented memory).
In a further implementation example, the above described types of memory contracts remain transparent to a user of the computing system 300. The memory contracts are handled by a compiler, at runtime, or by an OS, without requiring programmer intervention. The memory contracts are generated in a first instance by a compiler and a linker and consist of descriptions of conventions that have been used to generate code and data (sub-)sections. Besides the ABI conventions used, the compiler can augment such descriptions with information on the memory model, derived internally or from syntax expressions such as C++11 atomics. The compiler and linker can embody additional information in the generated binary code. The compiler and linker can slice a usually monolithic .text segment into multiple sub-segments that can be assigned to different actor contracts. In a specific implementation example of a compiler and linker, further pragmas can be added to programming languages to mark memory segments that will be shared, and hence are accessible with variable base pointers from different devices.
The requirements information 108 (i.e. the memory contracts) is used by the OS 106 to provide unified access to the shared memory segments 104, 105, in particular by enforcing multiple properties: The OS 106 enforces all memory contracts by checking that all code sections in the application 107, e.g. running on the first processing unit 101 and/or on other processing units, attached to a same memory segment 104, 105 respect the data, actor and topology contract for each connection to the memory segment. If a code section in the application 107 doesn't have a valid data, actor or topology contract (i.e. if a memory access to the shared memory 103 does not comply with any one of the 1st, 2nd or 3rd requirements information 301, 302, 303), the OS 106 can take several actions. Although these actions are described in view of Fig. 3, they also can be applied to the computing system 100 as described in view of Fig. 1 or 2. That is, the actions can also be performed based exclusively on the requirements information 108 without the presence of the 1st requirements information 301, the 2nd requirements information 302 and the 3rd requirements information 303 (i.e. the actions can be performed based exclusively on the memory contracts, i.e. without the presence of the data contract, the actor contract or the topology contract).
In other words, performing said actions can include that the operating system 106 is further configured to, if at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a part of the application 107 does not comply with a requirement in the requirements information 108, adjust a configuration of at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a part of the application 107, based on the requirements information 108, to allocate at least one of the first memory segment 104 and the second memory segment 105 to at least a part of the application 107.
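The contract check that precedes any such adjustment can be sketched as follows. This is a simplified illustration, not the disclosed implementation: the types `Attachment` and `Segment`, the function `violations`, and the reduction of the contracts to an ABI label and a coherency flag are all assumptions made for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Attachment:
    """Hypothetical record of one code section attached to a memory segment."""
    section: str          # code (sub-)section name
    unit: str             # processing unit the section runs on
    actor_abi: str        # ABI from the actor contract
    needs_coherent: bool  # consistency requirement from the actor contract


@dataclass(frozen=True)
class Segment:
    name: str
    data_abi: str                    # ABI from the data contract
    coherent_units: FrozenSet[str]   # units with coherent access (topology contract)


def violations(segment: Segment, attachments: List[Attachment]) -> List[str]:
    """Return one message per contract violation; an empty list means that
    every attached code section respects the contracts of the segment."""
    found = []
    for a in attachments:
        if a.actor_abi != segment.data_abi:
            found.append(f"{a.section}: ABI mismatch")
        if a.needs_coherent and a.unit not in segment.coherent_units:
            found.append(f"{a.section}: no coherent access from {a.unit}")
    return found


# Example values are assumptions chosen for illustration.
seg104 = Segment("seg104", "sysv-x86_64", frozenset({"pu101"}))
attached = [
    Attachment(".text.a", "pu101", "sysv-x86_64", True),
    Attachment(".text.b", "pu102", "aapcs64", True),
]
```

A non-empty result would be the trigger for the OS to consider one of the actions described in this section.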
Adjusting the configuration of at least one of the above mentioned entities can in particular include that, if the application 107 requires to allocate a memory segment that doesn't guarantee to meet requirements in the requirements information 108 (e.g. the 1st, 2nd or 3rd requirements information, or compilation and link properties of the code (sub-)section that is currently executing), the following actions can be taken by the OS:
- Cancel execution of the application 107. It is also possible to disable RW operations to the desired memory segment and raise a fault.
- Migrate the application 107 to another processing unit which complies with the requirements information, e.g. the compilation and link properties. It is to be noted that this can be subject to availability of executable binary code that can run on the other processing unit. This can be implemented by generating the executable binary code required by the other processing unit at compile time.
- Exchange a code (sub-)section of the application 107 with another version that requires a weaker set of guarantees. It is to be noted that this is subject to availability of such executable binary code; if it is not already available, it can be generated at runtime.
- Migrate the memory segment 104 on which the application 107 intends to operate to another memory segment 105 which complies with the requirements in the requirements information 108 (e.g. from the point of view of the processing units that want to operate on the memory segments 104, 105).
- Comply with the requirements in the requirements information 108 by using software emulation (e.g. via a form of virtual distributed shared memory).
To implement the above actions, in other words, the operating system 106 can be further configured to, if a predefined part of executable binary code in the application 107 does not comply with a requirement in the requirements information 108, exchange the predefined part of executable binary code in the application 107 with precompiled executable binary code that complies with the requirements information, and to allocate the first memory segment 104 to at least a part of the application 107, based on the requirements information 108 and the precompiled executable binary code.
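One way to picture how an OS might choose among the listed actions is a simple decision function. The priority order used here is purely an assumption for illustration, since no particular order is prescribed above; the names `Action` and `choose_action` and the boolean inputs are likewise hypothetical.

```python
from enum import Enum, auto


class Action(Enum):
    EXCHANGE_CODE = auto()        # swap in a code version with weaker requirements
    MIGRATE_APPLICATION = auto()  # move the application to a compliant processing unit
    MIGRATE_MEMORY = auto()       # move the data to a compliant memory segment
    EMULATE = auto()              # software emulation, e.g. distributed shared memory
    CANCEL = auto()               # cancel execution or raise a fault


def choose_action(*, weaker_code_available: bool,
                  compliant_unit_with_binary: bool,
                  compliant_segment_available: bool,
                  emulation_supported: bool) -> Action:
    """Return the first applicable remedy under an assumed priority order:
    cheapest-looking remedies first, cancellation as the last resort."""
    if weaker_code_available:
        return Action.EXCHANGE_CODE
    if compliant_unit_with_binary:
        return Action.MIGRATE_APPLICATION
    if compliant_segment_available:
        return Action.MIGRATE_MEMORY
    if emulation_supported:
        return Action.EMULATE
    return Action.CANCEL
```

A real policy would likely weigh costs such as migration time or emulation overhead; this sketch only shows that each remedy is conditional on availability, as noted in the list above.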
Thus, when there is a data consistency- or ABI-mismatch, the OS 106 can switch between different binary code versions (which were compiled based on semantically equivalent source code, or the same source code) in an application 107, to comply with different consistency- and ABI actor contracts (the different binary code versions can be generated at compile time or just-in-time, i.e. during run-time of the application, also in user space).
Additionally or alternatively, the operating system 106 can be further configured to, if the first processing unit 101 does not comply with a requirement in the requirements information 108, migrate at least a part of the application 107 from being operated by means of the first processing unit 101 to be operated by means of the second processing unit 102, and to control the second processing unit 102 to allocate the first memory segment 104 to at least a part of the application 107, based on the requirements information 108.
Additionally or alternatively, the operating system 106 can be further configured to, if the first memory segment 104 does not comply with a requirement in the requirements information 108, control the at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, to allocate the second memory segment 105 to at least a part of the application 107, based on the requirements information 108.
Thus, during data consistency violations, the OS 106 may decide to either provide the data consistency via a distributed shared memory, or move the memory segment of the code block to another memory segment or processing unit which provides a valid topology contract.
Additionally or alternatively, the operating system 106 can be further configured to, if the first processing unit 101, the first memory segment 104 and a predefined part of executable binary code in the application 107 do not comply with a requirement in the requirements information 108, allocate the first memory segment 104 to at least a part of the application 107 by means of software memory emulation, based on the requirements information 108.
Further, for ABI inconsistencies, a runtime can finalize pseudo code (e.g., OpenCL) to a specific ABI. Then, a new API exposes a way to manipulate contracts by the programmer at runtime, for user defined behaviors and fine-tuning.
The requirements information 108 that is used by the computing system 100 comprises executable binary code that includes information regarding a type and/or a state of a memory segment required by at least a part of the application 107. This executable binary code can also be called enhanced executable binaries, enhanced executable binary format or enhanced executable binary code.
Fig. 4 shows a schematic view 400 of an Executable and Linkable Format (ELF) format 401 and of a Portable Executable (PE) / Common Object File Format (COFF) format 402, which are examples of executable binaries. The present invention can be applied to both formats. In both cases there are header sections 403, code sections 404, data sections 405, and debug/symbol sections 406. Further, the present invention generally can be applied to any possible file format and is not restricted to the given examples, which represent the most common file formats.
In order to provide suitable formats of executable binaries, such as the ELF format 401 and the PE/COFF format 402, the compiler used (e.g. GCC, LLVM/clang, MSVC) should support multiple memory models, ISAs, and ABIs. For generating the ELF format 401 or the PE/COFF format 402, a traditional compilation process remains unchanged, but a backend and a linker involved in compiling are modified. The modified backend generates code for multiple versions of the .text (in code section 404) which support different memory models, ISAs, and ABIs. The number of versions is not limited. All different versions can be included in the enhanced executable binaries and should be interchangeable at a same address range of the address space. The modified linker can put together all such different code versions in a same executable binary, while marking each subsection and providing backward compatibility with an original format 407. The modified linker further can create a new executable binary program section 408, which can be called "contract". This section includes all compiler assumptions used during compilation which cannot be extracted from the debug sections.
To put the present invention into practice, a conventional OS binary loader can be modified in order to load additional executable binary sections that are added in the enhanced executable binaries. These sections augment common OS data structures that describe an address space.
The OS binary loader can detect sections that include the additional metadata emitted by the compiler at binary loading/execution time (e.g. upon an execve() system call in a Linux kernel). The OS binary loader accordingly sets up an address space of a loading process. During this process, the OS 106 may check that the memory allocated to the application 107 in the computing system 100 respects the requirements expressed by the metadata included in the executable binary. The address space OS abstraction is further enhanced to include additional alternative code (sub-)sections and metadata information.
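The idea of several interchangeable .text versions inside one enhanced binary can be illustrated with a small selection routine. The sketch is hypothetical: the type `TextVersion`, the subsection naming scheme, the integer encoding of memory models, and the preference for the strongest usable model are assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TextVersion:
    """Hypothetical descriptor of one .text version in an enhanced binary."""
    subsection: str      # illustrative name, e.g. ".text.x86_64.tso"
    isa: str
    required_model: int  # assumed encoding: 0 relaxed, 1 TSO, 2 sequential


def select_version(versions: List[TextVersion], isa: str,
                   provided_model: int) -> Optional[TextVersion]:
    """Pick a version for the given ISA whose required memory model is not
    stronger than what the target provides. Among usable versions, prefer
    the one that assumes the strongest model (an assumed tie-break rule)."""
    usable = [v for v in versions
              if v.isa == isa and v.required_model <= provided_model]
    return max(usable, key=lambda v: v.required_model, default=None)


# Example contents of an enhanced binary; all values are illustrative.
binary = [
    TextVersion(".text.x86_64.tso", "x86_64", 1),
    TextVersion(".text.aarch64.relaxed", "aarch64", 0),
    TextVersion(".text.aarch64.sc", "aarch64", 2),
]
```

If no version matches, the function returns `None`, which corresponds to the case where suitable binary code would have to be generated or another remedy chosen.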
Fig. 5 shows a schematic view 500 of a process descriptor 501 that can be used by an OS, in particular by a conventional OS like Linux, BSD, Windows, or Apple OSX. In such OSes, the address space of an application is described by a linked data structure of virtual memory area descriptors 502. Each descriptor is associated with a logical part of a program address space, such as a .text section, a .data section, or a HEAP section. The process descriptor 501 also can include a binary format descriptor 503. According to the invention, one or more "memory contracts", such as a data descriptor 504 or an actor descriptor 505, are associated with each virtual memory area descriptor 502. The data descriptor 504 (or data contract) is associated with a non-code area of the program and describes compilation options and conventions used during compile-time. The actor descriptor 505 (or actor contract) is associated with a code area of a program. It describes for each subsection of the code the minimum memory model required for a consistent memory access. Multiple actor contracts can be associated to multiple versions of the same code facing the address space. At load time of the computing system 100, an OS can map each code or non-code section while guaranteeing the requirements explicated in the contracts.
Fig. 6 shows a flowchart 600 of an operating manner for unifying memory access of an OS kernel. The flowchart 600 is in particular depicting how an OS kernel may use additional metadata (i.e. the requirements information) according to the invention during process runtime, in order to allow for access to heterogeneous memory regions.
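The relation between the process descriptor of Fig. 5 and the memory contracts can be sketched as a small data structure. The class and field names below (`VMA`, `ProcessDescriptor`, `areas_missing_contracts`) are hypothetical, the contracts are reduced to plain dictionaries, and a Python list stands in for the linked structure of descriptors.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VMA:
    """Hypothetical virtual memory area descriptor with attached contracts."""
    name: str                                   # e.g. ".text", ".data", "HEAP"
    is_code: bool
    data_contract: Optional[dict] = None        # for non-code areas
    actor_contracts: List[dict] = field(default_factory=list)  # one per code version


@dataclass
class ProcessDescriptor:
    binary_format: str                          # e.g. "ELF"
    vmas: List[VMA] = field(default_factory=list)


def areas_missing_contracts(proc: ProcessDescriptor) -> List[str]:
    """List areas that lack the kind of contract their type calls for:
    code areas need at least one actor contract, other areas a data contract."""
    missing = []
    for vma in proc.vmas:
        if vma.is_code and not vma.actor_contracts:
            missing.append(vma.name)
        if not vma.is_code and vma.data_contract is None:
            missing.append(vma.name)
    return missing


# Example process; all contract contents are illustrative assumptions.
proc = ProcessDescriptor("ELF", [
    VMA(".text", True, actor_contracts=[{"model": "tso"}, {"model": "relaxed"}]),
    VMA(".data", False, data_contract={"abi": "sysv-x86_64"}),
    VMA("HEAP", False),
])
```

In this example the `.text` area carries two actor contracts, matching the case of multiple versions of the same code, while the `HEAP` area would be reported as lacking a data contract.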
Fig. 7 shows a schematic overview of a method 700 according to an embodiment of the invention. The method 700 corresponds to the system 100 of Fig. 1, and is accordingly for operating a computing system 100 for unified memory access.
The method 700 comprises a step of controlling 701, by an operating system 106, at least one of a first processing unit 101 and a second processing unit 102, and a shared memory 103, based on requirements information comprised in the operating system 106 and/or an application 107, to allocate a first memory segment 104 to at least a part of the application 107, wherein the requirements information comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application 107.
Fig. 8 shows a computing system 800 according to the prior art. The teaching of the patent can in particular be applied to the computer architecture as shown. Other than exclusively CPUs, multiple other processing units (such as NDPs, accelerators or RDMAs) can simultaneously access a common memory area, as shown in Fig. 8. The teaching of the present invention can be used to unify this kind of heterogeneous memory access.
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word "comprising" does not exclude other elements or steps and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A computing system (100, 300) for unified memory access, comprising:
- a first processing unit (101) and a second processing unit (102),
- a shared memory (103) including a first memory segment (104) and a second memory segment (105),
- an operating system (106), operated at least partly by the first processing unit (101), and
- an application (107), operated at least partly by the operating system (106),
wherein the first processing unit (101) and the second processing unit (102) are connected to the shared memory (103),
wherein the operating system (106) is configured to control at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on requirements information (108) comprised in the operating system (106) and/or the application (107), to allocate the first memory segment (104) to at least a part of the application (107),
wherein the requirements information (108) comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application (107).
2. The computing system (100, 300) according to claim 1, wherein the first processing unit (101) and the second processing unit (102) are of different processing unit architecture.
3. The computing system (100, 300) according to claim 1 or 2, wherein the requirements information (108) further comprises 1st requirements information (301) relating to properties of executable binary code of at least a part of the application (107), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on the 1st requirements information (301), to allocate the first memory segment (104) to at least a part of the application (107).
4. The computing system (100, 300) according to any one of the preceding claims, wherein the 1st requirements information (301) is executable binary code and comprises information regarding an application binary interface, ABI, used to compile at least a part of the application (107), and/or a format used to compile at least a part of the application (107), and/or a persistency characteristic, and/or an ownership of a memory segment required by at least a part of the application (107), and/or a security policy.
5. The computing system (100, 300) according to any one of the preceding claims, wherein the requirements information (108) comprises 2nd requirements information (302) relating to an executable binary code of at least one predefined code segment of the application (107), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on the 2nd requirements information (302), to allocate the first memory segment (104) to at least a part of the application (107).
6. The computing system (100, 300) according to any one of the preceding claims, wherein the 2nd requirements information (302) is executable binary code and comprises information regarding an ABI used to compile a predefined code segment of the application (107), and/or information regarding a memory model a predefined code segment of the application (107) is compiled for, and/or a security policy for each memory segment the application (107) can access.
7. The computing system (100, 300) according to any one of the preceding claims, wherein the requirements information (108) comprises 3rd requirements information (303) relating to a connection between the shared memory (103) and at least one of the first processing unit (101) and the second processing unit (102), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on the 3rd requirements information (303), to allocate the first memory segment (104) to at least a part of the application (107).
8. The computing system (100, 300) according to any one of the preceding claims, wherein the 3rd requirements information (303) is created by the operating system (106) and comprises information regarding a cache coherency guarantee between at least one of the first memory segment (104) and the second memory segment (105) and at least one of the first processing unit (101) and the second processing unit (102), and/or a memory access latency between at least one of the first memory segment (104) and the second memory segment (105) and at least one of the first processing unit (101) and the second processing unit (102), and/or information regarding existence and a type of hardware protection mechanisms in the shared memory (103).
9. The computing system (100, 300) according to any one of the preceding claims, wherein the operating system (106) is further configured to, if at least one of the first processing unit (101) and the second processing unit (102), and/or at least one of the first memory segment (104) and the second memory segment (105), and/or at least a part of the application (107) does not comply with a requirement in the requirements information (108), adjust a configuration of at least one of the first processing unit (101) and the second processing unit (102), and/or at least one of the first memory segment (104) and the second memory segment (105), and/or at least a part of the application (107), based on the requirements information (108), to allocate at least one of the first memory segment (104) and the second memory segment (105) to at least a part of the application (107).
10. The computing system (100, 300) according to any one of the preceding claims, wherein the operating system (106) is further configured to, if the first processing unit (101) does not comply with a requirement in the requirements information (108), migrate at least a part of the application (107) from being operated by means of the first processing unit (101) to be operated by means of the second processing unit (102), and to control the second processing unit (102) to allocate the first memory segment (104) to at least a part of the application (107), based on the requirements information (108).
11. The computing system (100, 300) according to any one of the preceding claims, wherein the operating system (106) is further configured to, if a predefined part of executable binary code in the application (107) does not comply with a requirement in the requirements information (108), exchange the predefined part of executable binary code in the application (107) with precompiled executable binary code that complies with the requirements information, and to allocate the first memory segment (104) to at least a part of the application (107), based on the requirements information (108) and the precompiled executable binary code.
12. The computing system (100, 300) according to any one of the preceding claims, wherein the operating system (106) is further configured to, if the first memory segment (104) does not comply with a requirement in the requirements information (108), control the at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), to allocate the second memory segment (105) to at least a part of the application (107), based on the requirements information (108).
13. The computing system (100, 300) according to any one of the preceding claims, wherein the operating system (106) is further configured to, if the first processing unit (101), the first memory segment (104) and a predefined part of executable binary code in the application (107) do not comply with a requirement in the requirements information (108), allocate the first memory segment (104) to at least a part of the application (107) by means of software memory emulation, based on the requirements information (108).
14. The computing system (100, 300) according to any one of the preceding claims, wherein the at least two memory segments (104, 105) are of different memory segment architecture.
15. A method (700) for operating a computing system (100) for unified memory access that comprises:
- a first processing unit (101) and a second processing unit (102),
- a shared memory (103) including a first memory segment (104) and a second memory segment (105),
- an operating system (106), operated at least partly by the first processing unit (101), and
- an application (107), operated at least partly by the operating system (106),
wherein the first processing unit (101) and the second processing unit (102) are connected to the shared memory (103),
the method (700) comprising the steps of:
- controlling (701), by the operating system (106), at least one of the first processing unit (101) and the second processing unit (102), and the shared memory (103), based on requirements information (108) comprised in the operating system (106) and/or the application (107), to allocate the first memory segment (104) to at least a part of the application (107),
wherein the requirements information (108) comprises executable binary code that comprises information regarding a type and/or a state of a memory segment required by at least a part of the application (107).
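
As an illustration only, the allocation step (701) of claim 15, together with the fallbacks of claims 12 and 13, might be sketched as follows. The sketch is not taken from the patent: the class names, the string values used for segment type and state, and the allocate function are hypothetical, and the requirements information (108) is modelled as a plain record rather than as executable binary code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Requirements:
    """Hypothetical stand-in for the requirements information (108)."""
    segment_type: str   # assumed example values: "coherent", "non-coherent"
    segment_state: str  # assumed example values: "shared", "exclusive"


@dataclass
class MemorySegment:
    """Hypothetical model of a memory segment (104, 105) in the shared memory (103)."""
    name: str
    segment_type: str
    segment_state: str
    owner: Optional[str] = None  # part of the application the segment is allocated to

    def complies_with(self, req: Requirements) -> bool:
        # A segment complies when both its type and its state match the requirement.
        return (self.segment_type == req.segment_type
                and self.segment_state == req.segment_state)


def allocate(segments: List[MemorySegment], req: Requirements,
             app_part: str) -> Tuple[str, str]:
    """Allocate a segment to app_part and return (segment name, mode).

    Segments are tried in order, so the first segment is preferred and the
    second is used only if the first does not comply (compare claim 12).
    If no segment complies, the first segment is allocated anyway and marked
    as "emulated", standing in for software memory emulation (compare claim 13).
    """
    for seg in segments:
        if seg.complies_with(req):
            seg.owner = app_part
            return seg.name, "direct"
    fallback = segments[0]
    fallback.owner = app_part
    return fallback.name, "emulated"


if __name__ == "__main__":
    shared_memory = [
        MemorySegment("segment_104", "non-coherent", "shared"),
        MemorySegment("segment_105", "coherent", "shared"),
    ]
    print(allocate(shared_memory, Requirements("coherent", "shared"), "app_107_part"))
```

In this sketch the operating system's decision is reduced to a single ordered scan. The migration of claim 10 and the binary code exchange of claim 11 are deliberately left out, since they depend on details of the processing units and the application that the claims do not spell out.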
Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17787381.7A EP3695316A1 (en) 2017-10-17 2017-10-17 Computing system for unified memory access
CN202111316767.1A CN114153751A (en) 2017-10-17 2017-10-17 Computer system for unified memory access
CN201780096058.2A CN111247512B (en) 2017-10-17 2017-10-17 Computer system for unified memory access
PCT/EP2017/076477 WO2019076442A1 (en) 2017-10-17 2017-10-17 Computing system for unified memory access

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/076477 WO2019076442A1 (en) 2017-10-17 2017-10-17 Computing system for unified memory access

Publications (1)

Publication Number Publication Date
WO2019076442A1 true WO2019076442A1 (en) 2019-04-25

Family

ID=60153293

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/076477 WO2019076442A1 (en) 2017-10-17 2017-10-17 Computing system for unified memory access

Country Status (3)

Country Link
EP (1) EP3695316A1 (en)
CN (2) CN111247512B (en)
WO (1) WO2019076442A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11526433B2 (en) 2020-03-12 2022-12-13 International Business Machines Corporation Data structure allocation into storage class memory during compilation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099799B (en) * 2020-09-21 2022-01-14 飞腾信息技术有限公司 NUMA-aware multi-copy optimization method and system for SMP system read-only code segments
CN112463714B (en) * 2020-11-30 2022-12-16 成都海光集成电路设计有限公司 Remote direct memory access method, heterogeneous computing system and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246466B (en) * 2007-11-29 2012-06-20 华为技术有限公司 Management method and device for sharing internal memory in multi-core system
CN106796536A (en) * 2016-12-27 2017-05-31 深圳前海达闼云端智能科技有限公司 Memory pool access method, device and electronic equipment for multiple operating system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ANTONIO BARBALACE ET AL: "Breaking the Boundaries in Heterogeneous-ISA Datacenters", ARCHITECTURAL SUPPORT FOR PROGRAMMING LANGUAGES AND OPERATING SYSTEMS, ACM, 2 PENN PLAZA, SUITE 701 NEW YORK NY 10121-0701 USA, 4 April 2017 (2017-04-04), pages 645 - 659, XP058326932, ISBN: 978-1-4503-4465-4, DOI: 10.1145/3037697.3037738 *
ANTONIO BARBALACE ET AL: "Popcorn: a replicated-kernel OS based on Linux", THE 2014 OTTAWA LINUX SYMPOSIUM (OLS '14), 1 July 2014 (2014-07-01), New York, New York, USA, pages 123 - 137, XP055236751, ISBN: 978-1-4503-3238-5, DOI: 10.1145/2741948.2741962 *
BARBALACE ET AL: "It's Time to Think About an Operating System for Near Data Processing Architectures", PROCEEDINGS OF THE 16TH WORKSHOP ON HOT TOPICS IN OPERATING SYSTEMS , HOTOS '17, 1 May 2017 (2017-05-01), New York, New York, USA, pages 56 - 61, XP055446659, ISBN: 978-1-4503-5068-6, DOI: 10.1145/3102980.3102990 *
OLIVER P. ET AL: "OS Support for Thread Migration and Distribution in the Fully Heterogeneous Datacenter", PROCEEDINGS OF THE 16TH WORKSHOP ON HOT TOPICS IN OPERATING SYSTEMS , HOTOS '17, 1 January 2017 (2017-01-01), New York, New York, USA, pages 174 - 179, XP055447588, ISBN: 978-1-4503-5068-6, DOI: 10.1145/3102980.3103009 *

Also Published As

Publication number Publication date
CN114153751A (en) 2022-03-08
CN111247512A (en) 2020-06-05
EP3695316A1 (en) 2020-08-19
CN111247512B (en) 2021-11-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17787381

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017787381

Country of ref document: EP

Effective date: 20200512