CN111247512B

CN111247512B - Computer system for unified memory access

Info

Publication number: CN111247512B
Application number: CN201780096058.2A
Authority: CN
Inventors: 安东尼奥·巴巴拉斯; 安东尼·利奥普洛斯
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2017-10-17
Filing date: 2017-10-17
Publication date: 2021-11-09
Anticipated expiration: 2037-10-17
Also published as: CN111247512A; CN114153751A; WO2019076442A1; EP3695316A1

Abstract

The present invention provides a computing system 100 for unified memory access, comprising: a first processing unit 101 and a second processing unit 102; a shared memory 103 including a first memory segment 104 and a second memory segment 105; an operating system 106 operated at least in part by the first processing unit 101; and an application 107 operated at least in part by the operating system 106. The first processing unit 101 and the second processing unit 102 are connected to the shared memory 103. The operating system 106 is configured to control at least one of the first processing unit 101 and the second processing unit 102, and the shared memory 103, based on requirement information included in the operating system 106 and/or the application 107, so as to allocate the first memory segment 104 to at least a part of the application 107. The requirement information includes executable binary code, and the executable binary code includes information on a type and/or a state of a memory segment required by at least a part of the application 107.

Description

Computer system for unified memory access

Technical Field

The present invention relates to a computing system and a corresponding method for unified memory access. In particular, the system and method of the present invention affect the way an operating system (OS) allocates shared memory in a multiprocessor system based on requirement information. Preferably, the requirement information includes executable binary code, and the executable binary code includes information on the type and/or state of the memory segment required by at least a part of the application.

Background

Emerging computer architectures are characterized by increasingly heterogeneous memory and processor subsystems, mainly due to novel memory technologies, low latency load/store access interconnects, and the resurgence of Near Data Processing (NDP), which introduces processing units along the memory hierarchy.

Addressable memory includes on-chip memory, off-chip memory (e.g., a conventional DIMM module), and remote machine memory. Furthermore, with the advent of Storage Class Memory (SCM), addressable memory may be volatile or persistent. To facilitate low-latency, memory-mapped communication within machines, new interconnects are being developed, including CCIX, Gen-Z, OpenCAPI, and serial memory buses. A unifying feature behind these technologies is the provision of a shared memory interconnect between all components within the system, at the level of a single node or of a chassis. Some of them are also intended to provide hardware-based cache coherency. Finally, NDP has regained attention due to technological innovations. NDP is the co-location of processing and memory, in the form of processing in (main) memory (PIM) or in-storage computing (ISC).

The above-described technologies enable different types of processing units to access the same memory simultaneously. FIG. 8 illustrates a scenario in which main memory is accessed by a near data processor, a CPU connected to a coherent interconnect, an accelerator interconnected with the CPU via a peripheral bus, and a remote processing unit (e.g., a near data processor, CPU, or GPU) connected via an RDMA-enabled interface. The accelerator-CPU-NIC setup is a sub-scenario of the above, in which the accelerator (FPGA, GPU or Xeon Phi), the CPU and the NIC share a common memory area. As NDP and rack-level computing become more widespread, the number of processing units accessing the same memory region steadily increases. In addition, all memory, both within a machine and across machines, becomes accessible through load/store operations.

It can be seen that the state of the art is moving toward multi-kernel OSs or multiple OSs/runtime managers accessing the same memory region. In addition, conventional OS architectures do not take memory heterogeneity into account, except for cacheable versus non-cacheable regions and non-uniform memory access (NUMA).

In conventional computing systems, applications are statically compiled for a computer platform and thus conform to one particular, homogeneous memory model. However, the same memory region may be accessed by processors that have different Instruction Set Architectures (ISAs) and coherency models and that provide different synchronization mechanisms.

To fully support emerging architectures in which multiple processing units simultaneously access the same memory region, a different system software architecture is needed, one that enables format-compatible data sharing between processors with different ISAs (with OS support), consistent memory access (in line with the application programmer's expectations), protection between processors, transparency to programmers, and efficient communication.

The problem left open by the prior art is how system software can efficiently and effectively manage the memory shared between the different computing units in the system.

The prior art solutions deal relatively statically with the problem of heterogeneous memory and processing units. Programmers need to prepare software for a particular target architecture in advance, possibly employing a specialized programming model to explicitly address consistency and incompatibility issues arising from heterogeneity. The use of heterogeneous accelerators also involves explicit programming, e.g., through a domain-specific accelerator language. Furthermore, the potential advantages of heterogeneity are not fully exploited at runtime, as these advantages require explicit addressing at the time of programming and application development, reducing the chance of further optimization.

The prior art solutions have the following disadvantages: they force application developers to adopt a specific programming model and to modify their software; they target specific heterogeneous environments, which excludes running in several potentially available accelerator environments that may coexist; and they preclude optimizations that can only be determined at application runtime.

Disclosure of Invention

In view of the above problems and disadvantages, the present invention is directed to improving conventional systems and methods. It is therefore an object of the present invention to provide a system that overcomes the disadvantages of heterogeneity in emerging computer architectures. By providing a combination of compile-time and runtime mechanisms at the operating system and application level, programmers do not have to design application programs for a particular heterogeneous platform. Thus, memory accesses in a heterogeneous environment may be unified. This also provides backward compatibility for legacy applications, since they do not have to be redesigned in order to take advantage of the heterogeneous platform. The operating system is also enabled to make runtime decisions that exploit the benefits of heterogeneous devices (e.g., by scheduling and transparently migrating processes when and where they will have better performance and efficiency) without requiring explicit involvement during application development.

In particular, the present invention proposes "memory contracts" (which may also be referred to as requirement information) as a system software solution. A memory contract is created by a compiler or linker at compile time, enforced at runtime, and managed by the operating system. Based on the memory contract, checks can be made at runtime, for example to verify the coherence, protection or consistency guarantees that code acting on a memory area must comply with.

The memory contract may include an enhanced executable binary format, extended to hold metadata sections. A metadata section includes memory consistency, ISA, and Application Binary Interface (ABI) requirements. Thus, a conventional OS binary loader can be extended to identify metadata sections and load them at runtime. Further, the OS is enabled to dynamically select, at runtime, matching contracts for the various processing elements present in the heterogeneous computing architecture, and additionally to migrate tasks (processes, threads), transparently to the user, to any processing unit for unified access to memory.
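By way of a non-limiting illustration, the selection of matching contracts for the available processing elements may be sketched as follows. The names MemoryContract, ProcessingElement, matches and select_contracts are hypothetical and do not appear in the description; the sketch further assumes, as a simplification, that a contract matches a processing element only if ISA, ABI and consistency model are all identical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class MemoryContract:
    """Hypothetical metadata section carried by one variant of a binary."""
    isa: str                # e.g. "x86_64"
    abi: str                # e.g. "sysv"
    consistency_model: str  # e.g. "tso" or "relaxed"


@dataclass(frozen=True)
class ProcessingElement:
    """Hypothetical description of a processing unit known to the OS."""
    name: str
    isa: str
    abi: str
    consistency_model: str


def matches(contract: MemoryContract, pe: ProcessingElement) -> bool:
    # Simplifying assumption: all three properties must match exactly.
    return (contract.isa, contract.abi, contract.consistency_model) == (
        pe.isa, pe.abi, pe.consistency_model)


def select_contracts(contracts: Iterable[MemoryContract],
                     elements: Iterable[ProcessingElement]
                     ) -> Dict[str, MemoryContract]:
    """Map each processing element name to the first contract it can honour."""
    contracts = list(contracts)
    selected: Dict[str, MemoryContract] = {}
    for pe in elements:
        for contract in contracts:
            if matches(contract, pe):
                selected[pe.name] = contract
                break
    return selected
```

A processing element for which no contract matches is simply absent from the returned mapping; a task could then only be placed on, or migrated to, the elements that are present in it.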

Therefore, the invention solves the following problems of the prior art: an application writer does not have to employ a particular programming model in order to use heterogeneous resources; an application is enabled to utilize several available heterogeneous resources, rather than being targeted at a particular resource; and the operating system is enabled to make dynamic decisions transparently at runtime to better utilize the available heterogeneous resources.

The object of the invention is achieved by the solution presented in the appended independent claims. Advantageous embodiments of the invention are further defined in the dependent claims.

A first aspect of the present invention provides a computing system for unified memory access, comprising: a first processing unit and a second processing unit; a shared memory comprising a first memory segment and a second memory segment; an operating system operated at least in part by the first processing unit; and an application operated at least in part by the operating system; wherein the first processing unit and the second processing unit are connected to the shared memory; wherein the operating system is configured to control at least one of the first processing unit and the second processing unit, and the shared memory, based on requirement information included in the operating system and/or the application, so as to allocate the first memory segment to at least a part of the application, wherein the requirement information includes executable binary code, and wherein the executable binary code includes information on a type and/or a state of a memory segment required by at least a part of the application.

Since memory access can be performed according to the requirement information included in the operating system and/or the application, a programmer does not need to design an application for a specific heterogeneous platform. In addition, the operating system is enabled to make runtime decisions about memory access, so that the advantages of heterogeneous devices are actively exploited. The OS is enabled to efficiently and effectively control heterogeneous processing units and/or heterogeneous memory segments when running an application.

In a first implementation form of the system according to the first aspect, the first processing unit and the second processing unit have different processing unit architectures.

This ensures that memory access can be unified, in particular in heterogeneous computing systems, more particularly in systems where the first computing unit and the second computing unit have different architectures.

In a second implementation form of the system according to the first aspect, the requirement information further includes first requirement information, where the first requirement information is related to an attribute of the executable binary code of at least a part of the application, and the operating system is further configured to: controlling at least one of the first processing unit and the second processing unit and the shared memory based on the first requirement information, thereby allocating the first memory segment to at least a portion of the application.

This ensures that attributes of the executable binary code of at least a portion of the application may be specifically considered in operating the computing system to unify memory access.

In a third implementation form of the system according to the first aspect, the first requirement information is executable binary code comprising information on: an application binary interface (ABI) used for compiling at least a part of the application, and/or a format used for compiling at least a part of the application, and/or a persistence characteristic, and/or an attribution of a memory segment required by at least a part of the application, and/or a security policy.

This ensures that specific and detailed information and parameters in the first requirement information can be taken into account when operating the computing system to unify memory access.

In a fourth implementation form of the system according to the first aspect, the requirement information comprises second requirement information, wherein the second requirement information relates to the executable binary code of at least one predefined code segment of the application, and the operating system is further configured to: controlling at least one of the first processing unit and the second processing unit and the shared memory based on the second requirement information, thereby allocating the first memory segment to at least a portion of the application.

This ensures that the executable binary code of at least one predefined code segment of the application can be specifically considered in order to unify memory accesses when operating the computing system.

In a fifth implementation form of the system according to the first aspect, the second requirement information is an executable binary code, comprising information of: ABI for compiling a predefined code segment of the application, and/or a memory model for compiling a predefined code segment of the application, and/or a security policy for each memory segment accessible to the application.

This ensures that specific and detailed information and parameters in the second requirement information can be taken into account when operating the computing system to unify memory access.
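By way of a non-limiting illustration, the relationship between the first requirement information (relating to the binary as a whole) and the second requirement information (relating to a predefined code segment) may be sketched as follows. All class, field and function names are hypothetical. The rule that a value given for a code segment takes precedence over the value given for the whole binary is an assumption made for this sketch and is not stated in the description.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BinaryRequirements:
    """First requirement information: properties of the whole binary."""
    abi: str
    binary_format: str      # e.g. "ELF" or "PE/COFF"
    persistent: bool        # persistence characteristic
    security_policy: str    # default policy, e.g. "read-only"


@dataclass
class CodeSegmentRequirements:
    """Second requirement information: properties of one code segment."""
    abi: Optional[str] = None
    memory_model: Optional[str] = None
    # Security policy per accessible memory segment, keyed by segment name.
    segment_policies: Dict[str, str] = field(default_factory=dict)


def effective_abi(binary: BinaryRequirements,
                  code: CodeSegmentRequirements) -> str:
    # Assumption: a code-segment ABI, if present, overrides the binary ABI.
    return code.abi if code.abi is not None else binary.abi


def policy_for(binary: BinaryRequirements,
               code: CodeSegmentRequirements,
               memory_segment: str) -> str:
    # Fall back to the binary-wide policy when the code segment is silent.
    return code.segment_policies.get(memory_segment, binary.security_policy)
```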

In a sixth implementation form of the system according to the first aspect, the requirement information includes third requirement information, where the third requirement information is related to a connection between the shared memory and at least one of the first processing unit and the second processing unit, and the operating system is further configured to: controlling at least one of the first processing unit and the second processing unit and the shared memory based on the third requirement information, thereby allocating the first memory segment to at least a portion of the application.

This ensures that information of the connection between the shared memory and at least one of the first processing unit and the second processing unit may be taken into particular account when operating the computing system to unify memory access.

In a seventh implementation form of the system according to the first aspect, the third requirement information is created by the operating system, comprising information on: a cache coherence guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a presence and a type of a hardware protection mechanism in the shared memory.

This ensures that specific and detailed information and parameters in the third requirement information can be taken into account when operating the computing system to unify memory access.
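By way of a non-limiting illustration, the third requirement information may be pictured as a table, built by the operating system, with one entry per connection between a processing unit and a memory segment. The names LinkInfo and eligible_segments, the field names, and the unit of nanoseconds are assumptions made for this sketch only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class LinkInfo:
    """Hypothetical record for one (processing unit, memory segment) link."""
    cache_coherent: bool           # cache coherence guarantee on this link
    latency_ns: int                # memory access latency (assumed unit)
    hw_protection: Optional[str]   # e.g. "mmu", "iommu"; None if absent


def eligible_segments(links: Dict[Tuple[str, str], LinkInfo],
                      unit: str,
                      need_coherent: bool,
                      max_latency_ns: int) -> List[str]:
    """Return the segments usable from `unit`, lowest latency first."""
    candidates = []
    for (link_unit, segment), info in links.items():
        if link_unit != unit:
            continue
        if need_coherent and not info.cache_coherent:
            continue
        if info.latency_ns > max_latency_ns:
            continue
        candidates.append((info.latency_ns, segment))
    return [segment for _, segment in sorted(candidates)]
```

An empty result indicates that no segment reachable from the given unit satisfies the stated coherence and latency requirements, in which case one of the adjustments described in the later implementation forms would be needed.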

In an eighth implementation form of the system according to the first aspect, the operating system is further configured to: if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a portion of the application does not comply with the requirements in the requirements information, adjusting, based on the requirements information, a configuration of: at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a portion of the application, such that at least one of the first memory segment and the second memory segment is allocated to at least a portion of the application.

This ensures that, if it is detected during a memory access that a requirement in the requirement information is not met, the configuration of at least one of the first processing unit and the second processing unit, and/or of at least one of the first memory segment and the second memory segment, and/or of at least a portion of the application can be adjusted so that memory access remains unified.

In a ninth implementation form of the system according to the first aspect, the operating system is further configured to: migrating at least a portion of the application from operating through the first processing unit to operating through the second processing unit if the first processing unit does not comply with the requirements in the requirement information; and controlling the second processing unit to allocate the first memory segment to at least a portion of the application based on the requirement information.

This ensures that if a non-compliance with the requirements in the requirements information is detected while performing the memory access, at least a portion of the application can be migrated to unify the memory access.

In a tenth implementation form of the system according to the first aspect, the operating system is further configured to: if the executable binary code of a predefined part of the application does not comply with the requirements in the requirement information, exchanging the executable binary code of the predefined part of the application with precompiled executable binary code that complies with the requirements; and allocating the first memory segment to at least a portion of the application based on the requirement information and the precompiled executable binary code.

This ensures that if a non-compliance with requirements in the requirements information is detected while performing a memory access, a predefined portion of executable binary code in the application can be exchanged with pre-compiled executable binary code to unify memory accesses.

In an eleventh implementation form of the system according to the first aspect, the operating system is further configured to: if the first memory segment does not meet the requirement in the requirement information, controlling the shared memory and the at least one of the first processing unit and the second processing unit based on the requirement information, thereby allocating the second memory segment to at least a portion of the application.

This ensures that if a non-compliance with the requirements in the requirement information is detected while performing the memory access, at least one of the first processing unit and the second processing unit and the shared memory may be controlled such that the second memory segment is allocated to at least a part of the application in order to unify the memory access.

In a twelfth implementation form of the system according to the first aspect, the operating system is further configured to: if the first processing unit, the first memory segment and the executable binary code of the predefined part of the application do not comply with the requirements in the requirement information, allocating the first memory segment to at least a part of the application by way of software memory emulation, based on the requirement information.

This ensures that if a non-compliance with the requirements in the requirement information is detected during a memory access, the first memory segment can be allocated to at least a portion of the application by way of software memory emulation in order to unify memory access.
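By way of a non-limiting illustration, the adjustments of the ninth to twelfth implementation forms may be combined into a single decision as sketched below. The names Action and choose_action are hypothetical. The order in which the individual checks are made when more than one, but not all, of the three elements fail to comply is an assumption made for this sketch; the description does not prescribe such an order.

```python
from enum import Enum


class Action(Enum):
    ALLOCATE = "allocate the first memory segment as requested"
    MIGRATE = "migrate the application part to the second processing unit"
    SWAP_BINARY = "exchange the code with a precompiled compliant binary"
    USE_SECOND_SEGMENT = "allocate the second memory segment instead"
    EMULATE = "allocate by way of software memory emulation"


def choose_action(unit_ok: bool, binary_ok: bool, segment_ok: bool) -> Action:
    """Pick an adjustment from the compliance of unit, binary and segment."""
    if unit_ok and binary_ok and segment_ok:
        return Action.ALLOCATE
    if not (unit_ok or binary_ok or segment_ok):
        # Twelfth form: nothing complies, fall back to software emulation.
        return Action.EMULATE
    if not unit_ok:
        return Action.MIGRATE             # ninth form
    if not binary_ok:
        return Action.SWAP_BINARY         # tenth form
    return Action.USE_SECOND_SEGMENT      # eleventh form
```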

In a thirteenth implementation form of the system according to the first aspect, the at least two memory segments have different memory segment architectures.

This ensures that memory access can be unified, particularly in heterogeneous computing systems, and more particularly in systems where at least two memory segments are of different architectures.

A second aspect of the present invention provides a method for operating a computing system for unified memory access. The computing system includes: a first processing unit and a second processing unit; a shared memory comprising a first memory segment and a second memory segment; an operating system operated at least in part by the first processing unit; and an application operated at least in part by the operating system; wherein the first processing unit and the second processing unit are connected to the shared memory. The method comprises the step of: the operating system controlling at least one of the first processing unit and the second processing unit, and the shared memory, based on requirement information included in the operating system and/or the application, so as to allocate the first memory segment to at least a part of the application, wherein the requirement information includes executable binary code, and wherein the executable binary code includes information on the type and/or state of the memory segment required by at least a part of the application.

In a first implementation form of the method according to the second aspect, the first processing unit and the second processing unit have different processing unit architectures.

In a second implementation form of the method according to the second aspect, the requirement information further includes first requirement information, where the first requirement information is related to an attribute of an executable binary of at least a part of the application, and the method further includes: based on the first requirement information, the operating system controls at least one of the first processing unit and the second processing unit and the shared memory to allocate the first memory segment to at least a portion of the application.

In a third implementation form of the method according to the second aspect, the first requirement information is executable binary code comprising information on: an application binary interface (ABI) used for compiling at least a part of the application, and/or a format used for compiling at least a part of the application, and/or a persistence characteristic, and/or an attribution of a memory segment required by at least a part of the application, and/or a security policy.

In a fourth implementation form of the method according to the second aspect, the requirement information comprises second requirement information, wherein the second requirement information is related to the executable binary code of at least one predefined code segment of the application, the method further comprising: based on the second requirement information, the operating system controls at least one of the first processing unit and the second processing unit and the shared memory to allocate the first memory segment to at least a portion of the application.

In a fifth implementation form of the method according to the second aspect, the second requirement information is an executable binary code, comprising information of: ABI for compiling a predefined code segment of the application, and/or a memory model for compiling a predefined code segment of the application, and/or a security policy for each memory segment accessible to the application.

In a sixth implementation form of the method according to the second aspect, the requirement information includes third requirement information, where the third requirement information is related to a connection between the shared memory and at least one of the first processing unit and the second processing unit, and the method further includes: based on the third requirement information, the operating system controls at least one of the first processing unit and the second processing unit and the shared memory to allocate the first memory segment to at least a portion of the application.

In a seventh implementation of the method according to the second aspect, the third requirement information is created by the operating system, comprising information on: a cache coherence guarantee between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a memory access latency between at least one of the first memory segment and the second memory segment and at least one of the first processing unit and the second processing unit, and/or a presence and a type of a hardware protection mechanism in the shared memory.

In an eighth implementation form of the method according to the second aspect, the method further comprises: if at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a portion of the application does not comply with requirements in the requirement information, based on the requirement information, the operating system adjusts a configuration of: at least one of the first processing unit and the second processing unit, and/or at least one of the first memory segment and the second memory segment, and/or at least a portion of the application, such that at least one of the first memory segment and the second memory segment is allocated to at least a portion of the application.

In a ninth implementation form of the method according to the second aspect, the method further comprises: if the first processing unit does not meet the requirements in the requirement information, the operating system migrates at least a portion of the application from operating through the first processing unit to operating through the second processing unit; and based on the requirement information, the operating system controls the second processing unit so as to allocate the first memory segment to at least a part of the application.

In a tenth implementation form of the method according to the second aspect, the method further comprises: if the executable binary code of a predefined part of the application does not comply with the requirements in the requirement information, the operating system exchanges the executable binary code of the predefined part of the application with precompiled executable binary code that complies with the requirement information; and the operating system allocates the first memory segment to at least a portion of the application based on the requirement information and the precompiled executable binary code.

In an eleventh implementation form of the method according to the second aspect, the method further comprises: if the first memory segment does not meet the requirement in the requirement information, the operating system controls the at least one of the first processing unit and the second processing unit and the shared memory based on the requirement information, thereby allocating the second memory segment to at least a portion of the application.

In a twelfth implementation form of the method according to the second aspect, the method further comprises: if the first processing unit, the first memory segment and the predefined portion of executable binary code in the application do not meet the requirement in the requirement information, the operating system allocates the first memory segment to at least a portion of the application in a software memory emulation manner based on the requirement information.

In a thirteenth implementation form of the method according to the second aspect, the at least two memory segments have different memory segment architectures.

The method according to the second aspect and its implementations achieves the same advantages as the system according to the first aspect and its corresponding implementations.

It should be noted that all devices, elements, units and means described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by various entities described in the present application and the functions described as being performed by the various entities are intended to mean that the respective entities are adapted or used to perform the respective steps and functions. Even if, in the following description of specific embodiments, a specific function or step to be performed by an external entity is not reflected in the description of a specific detailed element of that entity performing that specific step or function, it should be clear to the skilled person that these methods and functions may be implemented in corresponding software or hardware elements, or any kind of combination thereof.

Drawings

The above aspects and implementations of the invention are explained in the following description of specific embodiments in conjunction with the attached drawings, in which

FIG. 1 illustrates a computing system according to an embodiment of the invention;

FIG. 2 illustrates a computing system according to an embodiment of the invention;

FIG. 3 illustrates a more detailed computing system according to an embodiment of the invention;

FIG. 4 shows a schematic diagram of ELF format and PE/COFF format;

FIG. 5 is a diagram illustrating an OS process descriptor according to the present invention;

FIG. 6 illustrates a flow diagram of a manner of operation for unifying memory access by OS kernels;

FIG. 7 shows a schematic overview of a method according to an embodiment of the invention;

FIG. 8 shows a computing system according to the prior art.

Detailed Description

FIG. 1 illustrates a computing system 100 according to an embodiment of the invention. Computing system 100 allows unified memory access, including first processing unit 101, second processing unit 102, and shared memory 103 including first memory segment 104 and second memory segment 105.

Thus, both processing units 101 and 102 may for example be one of the following: a CPU, CPU core, GPU core, near data processor, CPU connected to a coherent interconnect, accelerator interconnected with CPU via a peripheral bus, remote processing unit (e.g., near data processor, CPU or GPU) connected via an RDMA enabled interface, or a core (e.g., a kernel of an OS). The computing system 100 may comprise any number of processing units as long as it comprises at least a first processing unit 101 and a second processing unit 102, for example according to the above definitions.

Optionally, the first processing unit 101 and the second processing unit 102 may have different processing unit architectures. This may include: the first processing unit 101 is a first entity selected from the above list and the second processing unit 102 is a different entity selected from that list. This may also include: the first processing unit 101 and the second processing unit 102 are binary incompatible, e.g., because they operate according to different ISAs.

The shared memory 103 includes a first memory segment 104 and a second memory segment 105. Each memory segment 104, 105 may be a classic main memory, such as Random Access Memory (RAM) or storage class memory. Each memory segment 104, 105 may be volatile or non-volatile, and coherent or non-coherent. More specifically, both memory segments 104, 105 may be implemented on-chip, off-chip (e.g., via conventional DIMM modules), via conventional coherent interconnects, or via inter-machine memory. More specifically, each link of memory segments 104, 105 may be implemented using cache coherent or non-coherent interconnects or new technologies such as CCIX, Gen-Z, OpenCAPI, a serial memory bus, and the like. In particular, in forming the shared memory 103 (including the first memory segment 104 and the second memory segment 105 among other memory segments that may be present), the first memory segment 104 and the second memory segment 105 may optionally have different memory segment architectures, e.g., each chip providing one or more memory segments may be constructed according to different memory technologies and/or designs, and/or may include multiple segments having different memory technologies and/or designs. This may include: the first memory segment 104 is a first entity selected from the list and the second memory segment 105 is a second, different entity selected from the list. This may also include: the first memory segment 104 and the second memory segment 105 are binary incompatible, for example, because they operate according to different ISAs.

More specifically, the shared memory 103 may be used to enable multiple operating systems and/or multiple processing units and/or multiple applications to access the shared memory 103 simultaneously, preferably to access the same memory segment 104, 105 in the shared memory simultaneously.

The computing system 100 also includes an OS 106. OS 106 may be a conventional single or multi-tasking and/or single or multi-user OS, such as Linux, BSD, Windows, or Apple OSX. The OS may also be a distributed, templated, embedded, real-time, or library OS. The OS 106 may include only a single kernel. The OS 106 may also include multiple kernels. The computing system 100 may operate using a single OS 106, but may also operate using multiple OSs 106 as long as there is at least one OS 106. In particular, the OS 106 may be binary incompatible with the first processing unit 101 and/or the second processing unit 102, i.e. operating according to different ISAs. When the computing system 100 includes multiple OSs 106 operating on multiple processing units, the OSs 106 may be binary incompatible with each other.

FIG. 1 particularly shows OS 106 operating on multiple processing units, such as near data processors, CPUs, accelerators, and remote units. These multiple processing units also include a first processing unit 101 and a second processing unit 102. The OS 106 needs to operate on the first processing unit 101 at least in part. The second processing unit 102 may be controlled by the OS 106 to optimize the configuration of the computing system 100 to achieve uniform memory access.

Another configuration of the computing system 100 may also be included in this embodiment, as will be described below with respect to FIG. 2. In the configuration that will be described with respect to FIG. 2, the computing system 100 includes multiple OSs 106.

Computing system 100 also includes application 107. The application 107 is operated at least in part by the OS 106. This includes: the application may be, for example, a distributed application operated by multiple OSs, one of which is OS 106. More specifically, the application 107 includes at least one of the application portions "code A", "code B", "code C", or "code D" shown in fig. 1 or fig. 2, thereby indicating that the application 107 may be operated by one OS on multiple processing units, and even by multiple kernels, OSs, or execution managers. The application 107 obtains memory access, for example, by attempting to allocate memory segments 104, 105 in the shared memory. Obtaining memory access may also involve operating the first processing unit 101 and/or the second processing unit 102, the shared memory 103, one of its memory segments 104, 105, or the OS 106 to achieve unified memory access.

To access the shared memory 103, the first processing unit 101 and the second processing unit 102 are connected to the shared memory 103, for example, through a bus, and more particularly, via a bus supporting a load/store interface. The bus may provide different types of coherency for each different interconnect segment.

To allow unified memory access, the OS 106 is configured to control at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 based on requirement information 108 (also referred to as demand information 108) comprised in the operating system 106 and/or the application 107, such that the first memory segment 104 is allocated to at least a part of the application 107. The requirement information 108 may also be maintained by software, for example, by the OS 106 or the application 107.

That is, the OS considers the requirement information 108 and performs a predefined action based on the requirement information 108 to control at least one of the first processing unit 101, the second processing unit 102 or the shared memory 103 to access the shared memory 103, and more specifically, to allocate the first memory segment 104 to at least a portion of the application 107 requiring memory to be allocated.

In addition, the requirement information 108 also includes executable binary code that includes information of the type and/or state of the memory segments 104, 105 required by at least a portion of the application 107.

OS 106 may load the executable binary code, which also provides supplemental information for each program section of application 107. The demand information 108 may also be referred to as a memory contract. The memory contract may define the minimum requirements needed for correct memory access for each code (.text) subsection of the application 107. The requirement information 108, and more particularly, the executable binary code, may be created by a compiler that generates the executable binary code automatically and transparently to the program developer. The program developer may nevertheless modify the memory contract, using additional language compilation directives at compile time or new OS APIs at runtime.
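By way of illustration only, the idea of a memory contract as a set of minimum requirements that the OS checks before allocating a memory segment may be sketched as follows. The class names, the two attributes chosen, and the first-fit search are assumptions made for this sketch; the description does not prescribe any particular data layout or search order.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class MemorySegment:
    """Hypothetical description of one segment of the shared memory."""
    name: str
    persistent: bool
    coherent: bool


@dataclass(frozen=True)
class MemoryContract:
    """Hypothetical minimum requirements of one code subsection."""
    needs_persistence: bool = False
    needs_coherence: bool = False


def satisfies(segment: MemorySegment, contract: MemoryContract) -> bool:
    """A segment satisfies a contract if it offers at least what is required."""
    if contract.needs_persistence and not segment.persistent:
        return False
    if contract.needs_coherence and not segment.coherent:
        return False
    return True


def allocate(segments: Sequence[MemorySegment],
             contract: MemoryContract) -> Optional[MemorySegment]:
    """Return the first segment that meets the contract, or None if none does."""
    for segment in segments:
        if satisfies(segment, contract):
            return segment
    return None


# Illustrative stand-ins for memory segments 104 and 105.
segment_104 = MemorySegment("segment_104", persistent=False, coherent=True)
segment_105 = MemorySegment("segment_105", persistent=True, coherent=False)
shared_memory = [segment_104, segment_105]
```

In this sketch a returned value of None corresponds to the case in which no segment complies, which is where the corrective actions described later in this description would apply.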

The demand information 108 may also include first demand information, second demand information, and third demand information, which will be described in detail below with respect to FIG. 3. The first requirement information may also be referred to as a data contract, the second requirement information may also be referred to as a participant contract, and the third requirement information may also be referred to as a topology contract.

FIG. 2 illustrates an exemplary configuration of computing system 100, according to an embodiment of the invention. The computing system 100 shown in FIG. 2 includes all of the features of the computing system 100 of FIG. 1, and in particular, may operate using multiple OSs. The OS may be of the type described in particular with respect to fig. 1. The OS 106 is composed of a plurality of OSs in the computing system 100 in fig. 2. The plurality of OSs may in particular be binary incompatible with each other and/or with the first processing unit 101 and/or the second processing unit 102, i.e. operating according to different ISAs.

The computing system 100 in fig. 2 requires at least an operating OS 106. However, because multiple OSs are supported, computing system 100 may also be implemented in environments where multiple OSs run on a multi-core, multi-processor, or distributed system. Each OS may run on a respective processing unit. In particular, each OS may run at least partially on a different processing unit, i.e. in a distributed manner. However, the OS 106 needs to operate the computing system 100 and control each OS in the computing system 100.

Multiple OSs in computing system 100 may be controlled by OS 106, for example, to optimize the configuration of computing system 100 to enable unified memory access, e.g., to at least partially migrate applications to different OSs.

FIG. 3 illustrates a more detailed computing system 300, according to an embodiment of the invention. Computing system 300 includes all of the features and functionality of computing system 100 as described above with respect to fig. 1 and 2. Accordingly, like features are labeled with like reference numerals. In the description according to fig. 3, in particular the concepts of the first requirement information, the second requirement information and the third requirement information (i.e. the data contract, the participant contract, the topology contract) and the memory contract will be described in more detail.

The memory contract may also be an OS abstraction that provides an OS interface for programmers. The memory contract may be metadata (data, code, files, exchanges, and/or combinations thereof) for an address space region. There are two memory contracts associated with the application 107: a data contract and a participant contract. The data contract may be associated with executable code of an application and the participant contract may be associated with a memory region accessed by the application. In addition, a topology contract may describe the characteristics of the connection from a processing unit to a memory segment. Each OS running on at least two processing units may use a memory contract to enforce at least the following memory region attributes: format (in the ABI sense), consistency guarantee, cacheability guarantee, persistence, and user rights (memory protection). Further examples are provided below.

As shown in fig. 3, the requirement information 108 may optionally further include first requirement information 301, which may also be referred to as a data contract.

The first requirement information 301 may relate to properties of at least a portion of the executable binary code of the application 107. The operating system 106 is also operable to: based on the first demand information 301, at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 are controlled such that the first memory segment 104 is allocated to at least a portion of the application 107.

The first requirement information (i.e., the data contract) represents the ABI and/or format used by the compiler when compiling a particular (initialized or uninitialized) data portion of the application 107. Even if the heap and stack are memory areas populated by the application 107 runtime, their ABIs and formats can still be defined at compile time, so the heap and stack can also be characterized by a data contract. For each memory mapped file, its data contract is inherited by the creating application or set by the user. Persistence is another property that data contracts represent. Since modern compilers support persistent memory, persistence may also be fetched at compile time. Finally, the data contract defines one or more attribution and security policies for the memory segment to grant capabilities to the memory segment and/or the process/processing unit.

In other words, the first requirement information 301 may be viewed as executable binary code, including information (e.g., assumptions and rules) used in compiling the binary code, and more particularly, information including: ABI for compiling at least a part of the application 107, and/or a format for compiling at least a part of the application 107, and/or persistence characteristics, and/or attribution of memory segments required by at least a part of the application 107, and/or security policies.

In particular, the format specifies alignment and/or data structure field order, and/or ABI, and/or alignment, and/or padding, and/or structure field organization, and/or persistence, and/or cacheability.

Persistence is an attribute of a memory segment that describes whether the segment retains data after power is removed. A memory is non-persistent, or volatile, if the data it stores is lost after power is removed. A memory is persistent if the data it stores is not lost after power is removed. Examples of non-persistent memory are SDRAM and SRAM. Examples of persistent memory are NVDIMM and Flash. The persistence characteristics include information of the persistence of the shared memory 103 and/or memory segments 104, 105.

The attribution of the memory segments 104, 105 includes information of which applications 107 or users can access the memory segments 104, 105.
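For illustration, the fields of a data contract listed above (ABI, format, persistence, attribution) may be grouped into a single record. The following is a minimal sketch under stated assumptions: the field names, the string labels, and the two helper functions are hypothetical and are not defined by this description.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class DataContract:
    """Hypothetical record of the compile-time assumptions for a data region."""
    abi: str                  # illustrative label, e.g. "abi_a"
    endianness: str           # "little" or "big"
    alignment: int            # alignment in bytes used for the data layout
    persistent: bool          # whether the data must survive removal of power
    owners: FrozenSet[str] = frozenset()  # applications or users allowed access


def layout_compatible(a: DataContract, b: DataContract) -> bool:
    """Two regions can be read by the same code only if ABI and format agree."""
    return (a.abi, a.endianness, a.alignment) == (b.abi, b.endianness, b.alignment)


def may_access(contract: DataContract, principal: str) -> bool:
    """Attribution check: is this application or user listed as an owner?"""
    return principal in contract.owners
```

A check such as `layout_compatible` is one way an OS could detect the ABI or format mismatches that the description later resolves by switching code versions.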

As further shown in fig. 3, the demand information 108 may optionally also include second demand information 302, which may also be referred to as a participant contract.

The second requirement information 302 may relate to executable binary code of at least one predefined code segment of the application 107. The OS 106 may also be used to: based on the second demand information 302, at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 are controlled such that the first memory segment 104 is allocated to at least a portion of the application 107.

The second requirement information 302 (i.e., participant contract) may represent the ABI used by the compiler in compiling a particular code segment of the application 107. In addition, the participant contract specifies for each code segment for which memory model the code segment was compiled, such as coherency guarantees and cacheability requirements. Finally, the participant contracts store a set of functions (e.g., security policies) for each memory area that it has access to.

In other words, the second requirement information 302 may be viewed as executable binary code, including information for: ABI for compiling a predefined code segment of the application 107, and/or a memory model for compiling a predefined code segment of the application 107, and/or a security policy for each memory segment that the application 107 can access.

Thus, coherency guarantees and cacheability can be viewed as related concepts.

Coherency is a memory attribute. When multiple processing units operate on the same memory segment, the type/level of coherency defines the manner in which memory modifications made by the first processing unit 101 are propagated to the second processing unit 102. For example, "strong consistency" requires that every change made by the first processing unit 101 appears immediately in the view of the memory segment of the second processing unit 102. Cache is a mechanism in memory hardware to provide coherency. Cacheability is the ability of memory hardware to provide some coherency via a cache.

In particular, the security policy may define a memory segment that a single application 107 may read, write, or execute. The security policy may also be defined as a set of attributes that relate to the ability of code to be dynamically executed on the processing unit to access predefined memory segments.

The memory model may define the manner in which binary code may access a memory segment in shared memory. This is because the same memory segment may have different caching capabilities, access latencies, protection mechanisms, and persistence attributes for different processing units.
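As a non-limiting sketch, a participant contract may be modeled as a record holding the ABI, the memory model the code segment was compiled for, and a per-region security policy expressed as read/write/execute permissions. The names `Access`, `ParticipantContract`, and `permits`, as well as the region labels, are assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from enum import Flag, auto
from typing import Dict


class Access(Flag):
    """Permissions a code segment may hold on a memory region."""
    NONE = 0
    READ = auto()
    WRITE = auto()
    EXECUTE = auto()


@dataclass
class ParticipantContract:
    """Hypothetical record of what one code segment was compiled to expect."""
    abi: str
    memory_model: str        # illustrative label, e.g. "sequential" or "tso"
    needs_cacheable: bool
    policy: Dict[str, Access] = field(default_factory=dict)

    def permits(self, region: str, requested: Access) -> bool:
        """True only if every requested permission is granted for the region."""
        granted = self.policy.get(region, Access.NONE)
        return (granted & requested) == requested
```

Regions that do not appear in the policy are treated as inaccessible in this sketch, which is a conservative default chosen for the example rather than a requirement of the description.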

In another example, in computing system 100, the work of the operating system may be distributed (in different quantities) among operating system 106, the run manager, and the virtual machine monitor.

As further shown in fig. 3, the requirement information 108 may optionally further include third requirement information 303, which may also be referred to as a topology contract.

The third requirement information may relate to a connection between the shared memory 103 and at least one of the first processing unit 101 and the second processing unit 102. The operating system 106 is also operable to: based on the third demand information 303, at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 are controlled such that the first memory segment 104 is allocated to at least a portion of the application 107.

In one particular implementation of the computing system 300, the OS 106 may be built from multiple cores, each running on a different processing unit. The kernels share information about the topology of the computing system and the data contract for each shared memory segment.

Since there are multiple memory segments 104 and 105 in computing system 100, and virtual memory regions can be relocated between these segments, third requirement information (i.e., a topology contract) is needed. The third requirement information describes a connection between the physical memory segment and each particular processing unit. The topology contract may also describe cache coherency guarantees that exist between the memory segments and the processing units. In addition, other information may be associated with the topology contract, such as memory access latency, information on whether memory is persistent, and information on the existence and type of hardware protection mechanisms between memory links. In particular, the third requirement information 303 may be created by the operating system 106, and more particularly, based on the hardware topology/geometry of the computing system 300 (i.e., the manner in which the hardware components are connected in the computing system 300).

In other words, the third demand information 303 may include information on: a cache coherence guarantee between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or a memory access latency between at least one of the first memory segment 104 and the second memory segment 105 and at least one of the first processing unit 101 and the second processing unit 102, and/or a presence and type of hardware protection mechanisms in the shared memory 103.

Cache coherency guarantees may be viewed as a set of distinct processing units that require coherent access to a memory segment when a cache coherency mechanism is inserted between the processing unit and the memory segment. The cache coherency mechanism may provide different types of coherency, such as sequential consistency or Total Store Order (TSO). The cache coherency mechanism may also be absent altogether, or may simply snoop memory bus operations.

The memory access latency can be viewed as the time required to perform the memory access indicated by the processing unit. In fact, a single access may take a different amount of time based on the physical distance of the memory segment from the processing unit and whether or not particular data is cached.

The hardware protection mechanism may be memory paging and/or memory segmentation (segmented memory).
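Purely as an illustration, a topology contract may be represented as a table keyed by (processing unit, memory segment) pairs, where each entry records the coherency guarantee, the access latency, and the hardware protection mechanism of that link. The sketch below picks, for a given processing unit, the lowest-latency segment whose link offers at least the required coherency. The names, the latency figures used in any example data, and the strength ordering of the coherency labels are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Link:
    """Hypothetical properties of one processing-unit-to-segment connection."""
    coherence: str    # "none", "tso", or "sequential" (illustrative labels)
    latency_ns: int   # illustrative access latency
    protection: str   # e.g. "paging", "segmentation", or "none"


# Assumed ordering for this sketch: a stronger guarantee satisfies a weaker one.
COHERENCE_STRENGTH = {"none": 0, "tso": 1, "sequential": 2}

Topology = Dict[Tuple[str, str], Link]


def best_segment(topology: Topology, unit: str,
                 required_coherence: str) -> Optional[str]:
    """Return the lowest-latency segment reachable from `unit` whose link
    provides at least `required_coherence`, or None if there is none."""
    needed = COHERENCE_STRENGTH[required_coherence]
    candidates = [
        (link.latency_ns, segment)
        for (link_unit, segment), link in topology.items()
        if link_unit == unit and COHERENCE_STRENGTH[link.coherence] >= needed
    ]
    return min(candidates)[1] if candidates else None
```

Because the same segment can appear with different properties for different processing units, the table captures the point made above that a single access may cost a different amount of time depending on which unit issues it.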

In another example implementation, memory contracts of the type described above remain transparent to a user of computing system 300. The memory contract is processed by the compiler, or at runtime by the OS, without programmer intervention. The memory contract is generated in the first instance by the compiler and linker, and includes the contract description for the generated code and data (sub)sections. In addition to ABI conventions, the compiler can also augment the description with the memory model used. These descriptions are generated internally or derived from language constructs such as C++11 atomics. The compiler and linker may embody additional information in the generated binary code. The compiler and linker may divide a text segment, which is usually treated as a whole, into sub-segments that may be assigned to different participant contracts. In a specific implementation example of a compiler and linker, compilation directives may also be added to the programming language to mark memory segments to be shared so that they can be accessed by variable base pointers from different devices.

The requirement information 108 (i.e., the memory contract) is used by the OS 106 to uniformly access the shared memory segments 104, 105, particularly by enforcing a plurality of attributes: the OS 106 enforces all memory contracts by checking that all code sections in the application 107 (e.g., running on the first processing unit 101 and/or on other processing units) connected to the same memory segment 104, 105 conform to the data contract, participant contract and topology contract of each connection to the memory segment. If the code section in the application 107 does not have a valid data contract, participant contract, or topology contract (i.e., if the memory access to the shared memory 103 does not comply with any of the first requirement information 301, the second requirement information 302, or the third requirement information 303), the OS 106 may perform several actions. Although these actions are described with respect to fig. 3, they may also be applied to computing system 100 as described with respect to fig. 1 or 2. That is, these actions may also be performed based only on the demand information 108 without the first demand information 301, the second demand information 302, and the third demand information 303 (i.e., the actions may be performed based only on the memory contract, without a data contract, participant contract, or topology contract).

That is, performing the action may include: the operating system 106 is also operable to: if at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a portion of the application 107 does not comply with the requirements in the requirements information 108, then based on the requirements information 108, adjusting a configuration of: at least one of the first processing unit 101 and the second processing unit 102, and/or at least one of the first memory segment 104 and the second memory segment 105, and/or at least a portion of the application 107, such that at least one of the first memory segment 104 and the second memory segment 105 is allocated to at least a portion of the application 107.

Adjusting the configuration of at least one of the above entities may in particular comprise: if the application 107 needs to allocate a memory segment that is not guaranteed to satisfy the requirement information 108 (e.g., the first requirement information, the second requirement information, or the third requirement information, or the compilation and linking properties of currently executing code (sub) sections), the OS may perform the following actions:

Canceling execution of the application 107. Read/write (RW) operations on the desired memory segment may also be disabled, causing a fault.

Migrating the application 107 to another processing unit that conforms to the requirements information (e.g., compilation and linking properties). It should be noted that this may depend on the availability of executable binary code that may be run on other processing units. This may be achieved by generating executable binary code required by other processing units at compile time.

Swapping a code (sub-) section of the application 107 with another version that requires a set of weaker guarantees. It should be noted that this depends on the availability of the executable binary, which if not already available, may be generated at runtime.

Migrating the memory segment 104 to be operated on by the application 107 to another memory segment 105, the other memory segment 105 complying with the requirements in the requirement information 108 (e.g. from the point of view of the processing unit that wants to operate on the memory segments 104, 105).

Satisfying the requirements in the requirement information 108 by using software emulation (e.g., via a form of virtual distributed shared memory).
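The actions listed above may be sketched, by way of example only, as a selection among five alternatives. The description does not fix an order of preference; the order used below (swap code, migrate the application, migrate the memory, emulate, and cancel only as a last resort) is an assumption of this sketch, as are all names.

```python
from enum import Enum


class Action(Enum):
    SWAP_CODE_VERSION = "swap the code (sub)section for a version needing weaker guarantees"
    MIGRATE_APPLICATION = "migrate the application to a compliant processing unit"
    MIGRATE_MEMORY = "migrate the data to a compliant memory segment"
    EMULATE = "satisfy the requirements by software emulation"
    CANCEL = "cancel execution of the application"


def choose_action(alternative_code: bool, compliant_unit: bool,
                  compliant_segment: bool, emulation: bool) -> Action:
    """Pick a corrective action for a contract violation.

    Each flag states whether the corresponding remedy is available.
    The preference order is an assumption made for illustration.
    """
    if alternative_code:
        return Action.SWAP_CODE_VERSION
    if compliant_unit:
        return Action.MIGRATE_APPLICATION
    if compliant_segment:
        return Action.MIGRATE_MEMORY
    if emulation:
        return Action.EMULATE
    return Action.CANCEL
```

A real policy could weigh the cost of each remedy (for example, the latency of emulation against the cost of copying a segment) instead of using a fixed order.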

To achieve the above-described actions, in other words, the operating system 106 may also be used to: exchanging the predefined portion of executable binary code in the application 107 with pre-compiled executable binary code that conforms to the requirement information if the predefined portion of executable binary code in the application 107 does not conform to the requirement in the requirement information 108; and allocates the first memory segment 104 to at least a portion of the application 107 based on the demand information 108 and the precompiled executable binary.

Thus, when there is a data consistency mismatch or ABI mismatch, the OS 106 can switch between different binary code versions of the application 107 (compiled from the same source code or from semantically equivalent source code) to conform to different consistency contracts and ABI participant contracts. The different binary code versions can be generated at compile time, generated just in time (i.e., during application run time), or generated in user space.

Additionally or alternatively, the operating system 106 may also be used to: migrate at least a portion of the application 107 from operating through the first processing unit 101 to operating through the second processing unit 102 if the first processing unit 101 does not comply with the requirements in the requirement information 108; and control the second processing unit 102 to allocate the first memory segment 104 to at least a portion of the application 107 based on the demand information 108.

Additionally or alternatively, the operating system 106 may also be used to: if the first memory segment 104 does not meet the requirement in the requirement information 108, at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 are controlled based on the requirement information 108, such that the second memory segment 105 is allocated to at least a portion of the application 107.

Thus, when data coherency is violated, the OS 106 may decide to provide data coherency via distributed shared memory, or to move a memory segment of a code block to another memory segment or processing unit that provides a valid topology contract.

Additionally or alternatively, the operating system 106 may also be used to: if the first processing unit 101, the first memory segment 104, and the predefined portion of the executable binary code in the application 107 do not meet the requirement in the requirement information 108, allocate the first memory segment 104 to at least a portion of the application 107 by means of software memory emulation based on the requirement information 108.

Further, for ABI inconsistencies, the run manager may finalize the pseudo code (e.g., OpenCL) for a particular ABI. The new API then provides a way for programmers to manipulate the contracts at runtime, enabling user-defined behavior and fine-tuning.

The demand information 108 used by the computing system 100 includes executable binary code that includes information of the type and/or state of memory segments required by at least a portion of the application 107. Such executable binary code may also be referred to as an enhanced executable binary file, an enhanced executable binary format, or enhanced executable binary code.

Fig. 4 shows a schematic diagram 400 of an Executable and Linking Format (ELF) 401 and a Portable Executable (PE)/Common Object File Format (COFF) format 402 as examples of executable binary files. The present invention is applicable to both formats. In both cases, there is a header section 403, a code section 404, a data section 405, and a debug/symbol section 406. Furthermore, the invention is generally applicable to any possible file format and is not limited to the examples given. These examples represent the most common file formats.

To provide suitable formats for executable binaries, such as ELF format 401 and PE/COFF format 402, the compiler used (e.g., GCC, LLVM/Clang, MSVC) should support a variety of memory models, ISAs, and ABIs. To generate either ELF format 401 or PE/COFF format 402, the conventional compilation process remains unchanged; only the back end and the linker involved in the compilation are modified. The modified back end generates multiple versions of the text, i.e., of the generated code (in code section 404), which support different memory models, ISAs, and ABIs. The number of versions is not limited. All of the different versions may be included in the enhanced executable binary and should be interchangeable within the address range of the same address space. The modified linker may place all of these different code versions in the same executable binary file, while marking each section and providing backward compatibility with the original format 407. The modified linker may also create a new executable binary segment 408, which may be referred to as a "contract". This section includes all compiler assumptions used during compilation that cannot be extracted from the debug section.
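The selection of one code version out of several stored in the same enhanced executable binary may be illustrated as follows. This is a simplified in-memory model, not a parser for ELF or PE/COFF; the section names, the labels for ISA, ABI, and memory model, and the exact-match rule are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CodeVersion:
    """Hypothetical entry of the "contract" section for one text version."""
    section: str       # hypothetical section name holding this version
    isa: str
    abi: str
    memory_model: str


@dataclass
class EnhancedBinary:
    """Simplified model of a binary holding several versions of the same code."""
    versions: List[CodeVersion]

    def select(self, isa: str, abi: str, memory_model: str) -> Optional[CodeVersion]:
        """Return the version compiled for exactly this target, if present."""
        for version in self.versions:
            if (version.isa, version.abi, version.memory_model) == (isa, abi, memory_model):
                return version
        return None
```

When `select` returns None in this model, a matching version would have to be produced at runtime or another of the corrective actions described above would have to be taken.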

To implement the present invention, a conventional OS binary loader may be modified to load the additional executable binary segments that are added to the enhanced executable binary file. These sections augment the usual OS data structures describing the address space.

The OS binary loader may detect the additional metadata emitted by the compiler at binary load/execution time (e.g., when executing the execve() system call in the Linux kernel). Accordingly, the OS binary loader sets up the address space of the loading process. In this process, the OS 106 may check whether the memory allocated to the application 107 in the computing system 100 meets the requirements represented by the metadata included in the executable binary. The address space OS abstraction is further enhanced to include additional optional code (sub)sections and metadata information.

Fig. 5 shows a schematic diagram 500 of a process descriptor 501 that can be used by an OS, as found in a conventional OS (e.g., Linux, BSD, Windows, or Apple OSX). In these OSs, the address space of an application is described by a linked data structure of virtual memory region descriptors 502. Each descriptor is associated with a logical portion of the program address space, such as a text section, a data section, or a heap section. The process descriptor 501 may also include a binary format descriptor 503. In accordance with the present invention, one or more "memory contracts", such as a data descriptor 504 or a participant descriptor 505, are associated with each virtual memory region descriptor 502. Data descriptors 504 (or data contracts) are associated with non-code areas of the program and describe the compilation options and conventions used during compilation. A participant descriptor 505 (or participant contract) is associated with a code region of a program and describes, for each subsection of code, the minimum memory model required for consistent memory access. Multiple participant contracts may be associated with multiple versions of the same code for the address space. When the application is loaded in the computing system 100, the OS may map each code or non-code section while guaranteeing the requirements set forth in the contracts.
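The relationship between the process descriptor, its virtual memory region descriptors, and the attached contracts may be sketched as below. The sketch only checks the association rule stated above (participant contracts on code regions, data contracts on non-code regions); the class names, the string tags "data" and "participant", and the helper function are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VmRegion:
    """Hypothetical virtual memory region descriptor with attached contracts."""
    name: str
    is_code: bool
    contract_kinds: List[str] = field(default_factory=list)  # "data" or "participant"


@dataclass
class ProcessDescriptor:
    """Hypothetical process descriptor holding the region descriptors."""
    binary_format: str
    regions: List[VmRegion] = field(default_factory=list)


def regions_with_invalid_contracts(proc: ProcessDescriptor) -> List[str]:
    """List regions that have no contract or a contract of the wrong kind."""
    invalid = []
    for region in proc.regions:
        expected = "participant" if region.is_code else "data"
        kinds = region.contract_kinds
        if not kinds or any(kind != expected for kind in kinds):
            invalid.append(region.name)
    return invalid
```

A code region may carry several participant contracts in this model, which mirrors the case of multiple versions of the same code being present in the address space.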

FIG. 6 illustrates a flow chart 600 of a manner of operation for unifying memory access by OS kernels. Flow diagram 600 describes, among other things, how an OS kernel may use additional metadata (i.e., requirement information) according to the present invention to access heterogeneous memory regions during process runtime.

Fig. 7 shows a schematic overview of a method 700 according to an embodiment of the invention. The method 700 corresponds to the system 100 of fig. 1 and, accordingly, operates on the computing system 100 for unified memory access.

The method 700 includes the steps of: based on requirement information comprised in the operating system 106 and/or the application 107, the operating system 106 controls (701) at least one of the first processing unit 101 and the second processing unit 102 and the shared memory 103 to allocate the first memory segment 104 to at least a part of the application 107, wherein the requirement information comprises executable binary code, wherein the executable binary code comprises information of the type and/or state of the memory segment required by at least a part of the application 107.

Fig. 8 illustrates a computing system 800 in accordance with the prior art. In particular, the teachings of this patent may be applied to the computer architecture shown in this figure. In addition to dedicated CPUs, multiple other processing units (e.g., NDP, accelerator, or RDMA) may simultaneously access a common memory region, as shown in fig. 8. The teachings of the present invention can be used to unify such heterogeneous memory accesses.

The invention has been described in connection with various embodiments by way of example and implementation. However, other variations can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the independent claims. In the claims as well as in the specification, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several entities recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims

1. A computing system (100, 300) for unified memory access, comprising:

- a first processing unit (101) and a second processing unit (102);

- a shared memory (103) comprising a first memory segment (104) and a second memory segment (105);

- an operating system (106) operated at least partially by the first processing unit (101);

- an application (107) operated at least in part by the operating system (106); wherein

- the first processing unit (101) and the second processing unit (102) are connected to the shared memory (103), wherein

the operating system (106) is configured to control at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on requirement information (108) comprised in the operating system (106) and/or the application (107), so as to allocate the first memory segment (104) to at least a part of the application (107), wherein

the requirement information (108) comprises executable binary code, wherein the executable binary code comprises information of the type and/or state of a memory segment required by at least a part of the application (107).

2. The computing system (100, 300) of claim 1, wherein the first processing unit (101) and the second processing unit (102) have different processing unit architectures.

3. The computing system (100, 300) of claim 1, wherein the requirement information (108) further comprises first requirement information (301), wherein the first requirement information (301) relates to an attribute of at least a portion of the executable binary code of the application (107), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on the first requirement information (301) such that the first memory segment (104) is allocated to at least a part of the application (107).

4. The computing system (100, 300) of claim 3, wherein the first requirement information (301) is executable binary code comprising at least one of: an application binary interface (ABI) for compiling at least a part of the application (107), a format for compiling at least a part of the application (107), a persistence characteristic, attribution of memory segments required by at least a part of the application (107), and a security policy.

5. The computing system (100, 300) of claim 1, wherein the requirement information (108) comprises second requirement information (302), wherein the second requirement information (302) relates to executable binary code of at least one predefined code segment of the application (107), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on the second requirement information (302) such that the first memory segment (104) is allocated to at least a part of the application (107).

6. The computing system (100, 300) of claim 5, wherein the second requirement information (302) is executable binary code comprising at least one of: an ABI for compiling a predefined code segment of the application (107), a memory model for compiling a predefined code segment of the application (107), and a security policy for each memory segment accessible by the application (107).

7. The computing system (100, 300) of claim 1, wherein the requirement information (108) comprises third requirement information (303), wherein the third requirement information (303) relates to a connection between the shared memory (103) and at least one of the first processing unit (101) and the second processing unit (102), and wherein the operating system (106) is further configured to control at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on the third requirement information (303) such that the first memory segment (104) is allocated to at least a part of the application (107).

8. The computing system (100, 300) of claim 7, wherein the third requirement information (303) is created by the operating system (106) and includes at least one of: a cache coherence guarantee between at least one of the first memory segment (104) and the second memory segment (105) and at least one of the first processing unit (101) and the second processing unit (102), a memory access latency between at least one of the first memory segment (104) and the second memory segment (105) and at least one of the first processing unit (101) and the second processing unit (102), and a presence and a type of hardware protection mechanism in the shared memory (103).

9. The computing system (100, 300) of any of claims 1-8, wherein the operating system (106) is further configured to: if at least one of the first processing unit (101) and the second processing unit (102), and/or at least one of the first memory segment (104) and the second memory segment (105), and/or at least a part of the application (107) does not comply with the requirements in the requirement information (108), adjust, based on the requirement information (108), a configuration of: at least one of the first processing unit (101) and the second processing unit (102), and/or at least one of the first memory segment (104) and the second memory segment (105), and/or at least a part of the application (107), such that at least one of the first memory segment (104) and the second memory segment (105) is allocated to at least a part of the application (107).

10. The computing system (100, 300) of any of claims 1-8, wherein the operating system (106) is further configured to: migrate at least a portion of the application (107) from operating through the first processing unit (101) to operating through the second processing unit (102) if the first processing unit (101) does not comply with the requirements in the requirement information (108); and control the second processing unit (102) to allocate the first memory segment (104) to at least a part of the application (107) based on the requirement information (108).

11. The computing system (100, 300) of any of claims 1-8, wherein the operating system (106) is further configured to: exchange a predefined portion of executable binary code in the application (107) for pre-compiled executable binary code that conforms to the requirement information (108) if the predefined portion of executable binary code in the application (107) does not conform to the requirements in the requirement information (108); and allocate the first memory segment (104) to at least a portion of the application (107) based on the requirement information (108) and the pre-compiled executable binary code.

12. The computing system (100, 300) of any of claims 1-8, wherein the operating system (106) is further configured to: if the first memory segment (104) does not comply with the requirements in the requirement information (108), control at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on the requirement information (108) such that the second memory segment (105) is allocated to at least a part of the application (107).

13. The computing system (100, 300) of any of claims 1-8, wherein the operating system (106) is further configured to: if the first processing unit (101), the first memory segment (104) and a predefined portion of executable binary code in the application (107) do not comply with the requirements in the requirement information (108), allocate the first memory segment (104) to at least a portion of the application (107) by means of software memory emulation based on the requirement information (108).

14. The computing system (100, 300) of any of claims 1-8, wherein the first memory segment (104) and the second memory segment (105) have different memory segment architectures.

15. A method (700) for operating a computing system (100) for unified memory access, the computing system (100) comprising:

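- a first processing unit (101) and a second processing unit (102);

- a shared memory (103) comprising a first memory segment (104) and a second memory segment (105);

- an operating system (106) operated at least partially by the first processing unit (101);

- an application (107) operated at least in part by the operating system (106); wherein

- the first processing unit (101) and the second processing unit (102) are connected to the shared memory (103), wherein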

the method (700) comprises the step of:

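- the operating system (106) controlling (701) at least one of the first processing unit (101) and the second processing unit (102) and the shared memory (103) based on requirement information (108) comprised in the operating system (106) and/or the application (107) such that the first memory segment (104) is allocated to at least a part of the application (107), wherein the requirement information (108) comprises executable binary code, wherein the executable binary code comprises information of the type and/or state of a memory segment required by at least a part of the application (107).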

16. A computer-readable storage medium comprising instructions that, when executed on a computer, cause the computer to perform the method of claim 15.