CN117521167A

CN117521167A - High-performance heterogeneous secure memory

Info

Publication number: CN117521167A
Application number: CN202311527819.9A
Authority: CN
Inventors: 华志超; 樊树霖; 夏虞斌; 陈海波
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2023-11-15
Filing date: 2023-11-15
Publication date: 2024-02-06

Abstract

The invention provides a high-performance heterogeneous secure memory comprising an HSMEM transmission engine, a multi-mode protection engine, and an interface. The HSMEM transmission engine establishes an HSMEM transmission channel that realizes high-speed transmission between the CPU and the GPU; the engine is located on-chip and controls DMA requests for the secure memory. Each memory protection scheme supported by the multi-mode protection engine is called a mode, and a memory block may be protected in different modes. The multi-mode protection engine comprises a mode selection module, a data encryption module, an integrity tree module and an integrity tree maintenance module, wherein the mode selection module selects a mode for each memory block. Through the interface, a developer uses preset instructions to explicitly change the mode of a memory block. The invention realizes high-performance data transmission between heterogeneous secure memories and shortens CPU-GPU transmission time, so that the overall performance of the application is higher.

Description

High-performance heterogeneous secure memory

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a high-performance heterogeneous secure memory.

Background

In recent years, artificial intelligence technology has come into wide use in many fields, such as autonomous driving, face recognition, and medical imaging. Massive training data and high-value artificial intelligence models have made data privacy a focus of attention. Because of their extensive resource requirements, modern artificial intelligence applications are typically deployed in high-performance data centers; training ChatGPT, for example, used more than 10K NVIDIA A100 GPUs. The data privacy of artificial intelligence applications is threatened both by the complexity and potential vulnerabilities of the software stack (e.g., more than 360 million lines of code and more than 3K vulnerabilities in the Linux kernel) and by potentially malicious employees. On the one hand, a platform administrator of the data center may behave maliciously; on the other hand, existing platforms usually have a very deep software stack, which brings a large number of software vulnerabilities and a large attack surface, and once the cloud platform software is compromised, the security of a machine learning application can no longer be guaranteed. Even if training is performed on machines inside the enterprise, the training process may be corrupted by malicious insiders or attackers. Therefore, how to ensure the security of machine learning applications and protect them from a compromised operating system has become a problem of wide concern in academia and industry.

Trusted Execution Environments (TEEs) are widely used to protect applications from untrusted system software; a TEE ensures that no component outside it can access or modify the data and control flow of the components inside it. Many methods of building TEEs are available today, and almost all CPU manufacturers, including Intel, AMD and ARM, have introduced hardware TEE extensions. Modern artificial intelligence applications, however, rely heavily on heterogeneous accelerators such as GPUs and NPUs to accelerate computation. These accelerators are not directly protected by the CPU-side TEE, so the protection that a CPU-side TEE offers to artificial intelligence applications is very limited. To protect the data of artificial intelligence applications, a trusted execution environment therefore needs to be built on the accelerator, so that a malicious operating system can neither access the accelerator's data nor tamper with its computation. Since the GPU is a common accelerator for artificial intelligence applications, building a GPU TEE is an effective way of protecting them.

While High Bandwidth Memory (HBM) is considered secure and not subject to physical attack, it is expensive, limited in capacity, and difficult to replace in the event of a failure. Accelerators using GDDR memory will therefore continue to exist. Even in August 2023, long after HBM-equipped GPUs such as the H100 had come to market, NVIDIA announced the L40S GPU, a data center GPU product that still uses GDDR memory. GDDR is only likely to be retired once the cost and capacity of HBM become acceptable. In addition, most existing accelerator models, including various GPUs, still use GDDR memory, and HBM does not solve the RowHammer and RowPress problems; many academic studies therefore build secure memory for accelerators that use traditional DDR or GDDR.

The accelerator, including the GPU, differs from the CPU in terms of secure memory. They may differ in two ways: 1) The integrity tree structure may be different. 2) The encryption algorithm may be different. The GPU has multiple memory controllers, and in recently designed GPU secure memory, each memory partition has its own secure memory controller. By adopting a plurality of memory partitions, the parallel access capability of the memory can be improved. Each memory controller uses the local address within the partition and a counter for AES encryption. In addition, some multi-chip module GPUs include multiple GPUs and memory chips that are connected by inter-chip connections that have a lower bandwidth than the intra-chip connections. If a centralized engine is used for all memory partitions, performance may be undesirable due to the presence of on-chip or inter-chip interconnections between the engine and the memory controller. Thus, recent GPU secure memory architectures employ a protection engine internal to each controller, and each memory partition maintains its own integrity tree. In addition, some GPU secure memories use different encryption algorithms, such as AES-XTS. Furthermore, there is a difference in granularity of memory protection for the CPU and GPU. The GPU cache lines are typically 128 bytes, while the CPU cache lines are 64 bytes.

A Trusted Execution Environment (TEE) is used to protect user data and programs and logically isolates the protected region from the outside. However, logical isolation alone is not sufficient to defend against physical attacks, so hardware vendors introduce a memory protection engine in the memory controller to protect data from physical attacks. The strongest protection mechanisms in current products, such as Intel SGX, ensure confidentiality, integrity and freshness, with an architecture similar to the BMT (Bonsai Merkle Tree). A counter-based integrity tree ensures that the correct counter is used. A memory block is encrypted using a counter value, a key stored on the chip, and the memory address. In addition, to ensure block integrity, a Message Authentication Code (MAC) is calculated and stored. The root of the integrity tree is stored on the chip and cannot be tampered with. When a block is loaded into the chip, the integrity tree is used to obtain the correct counter, and the MAC is fetched to verify the integrity of the block. When a dirty block is evicted from the cache and written back to memory, its counter value is incremented. In addition, there are different memory addressing modes, as shown in fig. 1. Assume that there are k memory partitions and the interleaving granularity is g memory blocks. Even for such a randomly interleaved memory, it still ensures that consecutive k×g memory blocks come from different memory partitions. In the basic method of memory protection, the integrity of a counter is protected by an integrity tree, and the counter is then used to encrypt a memory block, as shown in fig. 3.

The NVIDIA H100 GPU contains many security functions that limit access to GPU content by untrusted software: it has an on-chip secure processor within the GPU, supports multiple types and levels of encryption, and provides hardware-protected memory areas. The confidential computing function of the NVIDIA H100 GPU must be paired with a confidential virtual machine; one confidential virtual machine protected by a CPU-side TEE can be associated with several H100 GPUs that have enabled the secure computing mode, or with several MIG compute instances. The GPU and the confidential virtual machine on the CPU have a mechanism for establishing a secure channel, data transmitted between the GPU and the CPU is encrypted and decrypted by hardware, and the related hardware ensures that an untrusted virtual machine monitor cannot directly access a GPU in secure computing mode. The NVIDIA H100 GPU allows users to verify that they are communicating with an NVIDIA GPU that has confidential computing enabled, and to verify the security status of the GPU. However, the transmission rate of this technique is low: according to its technical report, the CPU-GPU transmission rate is only 4 GB/s. Moreover, only HBM memory is protected, so the large number of GPUs that use conventional memory cannot be protected.

MMT (Efficient Distributed Secure Memory with Migratable Merkle Tree), published at HPCA 2023, extends the scope of the memory protection engine from one node to multiple nodes in order to achieve (nearly) linear secure data transfer speed between distributed enclaves. The memory protection engine already protects the confidentiality, integrity and freshness of physical memory. By reusing this hardware protection mechanism, an enclave can transfer data directly to another enclave over various connections (e.g., PCIe, RDMA) without additional cryptographic operations. The work proposes a Migratable Merkle Tree (MMT) that can transmit data and metadata to remote nodes without software involvement (e.g., re-encryption). One key insight of MMT is that (single-node) hardware memory protection already provides confidentiality, integrity and freshness guarantees for untrusted DRAM, and these guarantees can be reused for protection over untrusted networks. To this end, the work first expands a single integrity tree into an integrity forest across multiple nodes and removes the restriction that encryption metadata is confined to the CPU. Furthermore, the work designs a new protocol, MMT closure delegation, for protecting data transmitted over an untrusted network. It can securely transfer the root, nodes and data of an integrity subtree to a remote node while preventing leakage of confidential information and replay attacks. Finally, the work implements a tiny trusted module to manage the enclaves and trusted hardware modules. It hides the details of the hardware implementation and is responsible for the connection between local and remote enclaves. However, this technology only supports transmission between nodes with the same secure memory architecture; in heterogeneous scenarios the secure memory architectures of the CPU and the GPU differ, so the scheme cannot be used for transmission between them.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a high-performance heterogeneous secure memory.

The high-performance heterogeneous secure memory provided by the invention comprises the following components: an HSMEM transport engine, a multi-mode protection engine, and an interface;

the HSMEM transmission engine establishes an HSMEM transmission channel; the engine is located on-chip and controls DMA requests for the secure memory, selects the metadata related to the memory to be transmitted, and transmits the data without additional data encryption or decryption; only a preset number of cache blocks are encrypted for each data transmission; after the data is transmitted to the CPU or the GPU, the multi-mode protection engine, which is located in the memory controller, directly uses the transmitted data under its original protection scheme;

each memory protection scheme supported by the multi-mode protection engine is called a mode, one memory block is protected through different modes, a mode selection module is arranged in the multi-mode protection engine and used for selecting a mode for the memory block, a data encryption module is arranged for encrypting and decrypting data, and an integrity tree module is arranged for maintaining an integrity tree of the different modes;

Through the interface, the developer uses the preset instruction to explicitly change the mode of the memory block.

Preferably, at initialization, the HSMEM transmission engine generates keys with the peer devices, builds a conventional secure channel, prepares metadata that needs to be transmitted with the data when it is about to be transmitted, and, in addition, the secure channel is used to exchange some basic system configuration during initialization;

the HSMEM transmission engine calculates a target address of metadata corresponding to the data, then the HSMEM transmission engine controls the DMA engine to copy the metadata to a target position, and the GPU uses the own DMA engine to conduct data transmission.

Preferably, at the time of initialization, the HSMEM transmission engine includes an initialization module, specifically:

when the system is started or the GPU is inserted, the CPU and the GPU mutually verify and establish a traditional secure channel, and negotiate a plurality of shared keys for memory encryption through the secure channel;

The secure channel also has the ability to ensure freshness of the transmitted data, by using an incrementing initialization vector (IV) for AES-GCM encryption.

Preferably, the HSMEM transmission engine includes a meta information selection module, specifically:

when data is transmitted through the HSMEM transmission channel, the security metadata is transmitted together with the data; for the integrity tree BMT, the counters and the message authentication code (MAC) values are transmitted together; each data transmission covers at least an entire subtree, so the minimum transmission granularity is the memory size covered by a single integrity tree node; the selected subtree must satisfy the condition that all counters in the subtree root node relate to the data to be transmitted; for a subtree root node, the corresponding counter in its parent node is called the subtree root counter; the CPU and the GPU are assumed to use split counters in the integrity tree; the integrity tree node fan-out of the GPU is 128 and its cache line size is 128 bytes, while the fan-out of the CPU is 64 and its cache line size is 64 bytes; the granularity of CPU-GPU transmission differs from that of GPU-CPU transmission, and the covered size is calculated by multiplying the fan-out by the cache line size; for CPU-GPU transmission the granularity is 4 KB, which means that only the data covered by leaf nodes of the integrity tree is transmitted;

The HSMEM transmission engine uses the address of the data to be transmitted to determine the integrity tree node that should transmit with the data, finds one or more subtrees covering the range of the transmitted data, and if some counters in the root node do not belong to the data to be transmitted, continues to find subtrees to the lower level of the integrity tree.

Preferably, the mode selection module includes: on the CPU or GPU, for each memory block, the memory controller knows the current mode of that block, while the block with the secure metadata is transferred from the CPU to the GPU, it is still protected by the original protection scheme;

the mode selection module is a hardware module in the memory controller for recording such information, using one bit to indicate the protection scheme of the memory block, the cache line size of the GPU is 128 bytes, and the cache line size of the CPU is 64 bytes, on the GPU the bitmap size is 1/2^10 of the total memory size, on the CPU the bitmap size is 1/2^9 of the total memory size, for 1GB of GPU memory, the GPU needs 1MB of space for the bitmap, and for 1GB of system memory, the CPU needs 2MB;

two levels of checking are used to reduce the overhead of accessing the bitmaps; on the GPU, each bit of the first-level bitmap represents 128 bytes and each bit of the second-level bitmap represents 128 KB; the first-level bitmap is stored directly in an on-chip buffer inside the memory controller, while the second-level bitmap is stored in secure memory and its integrity must be ensured, so an area that is always in local mode is reserved to store the bitmaps: on the GPU this region is always in GPU mode, and on the CPU it is always in CPU mode; a dedicated cache, called the bitmap cache, is provided for the second-level bitmap; its cache line size is also 128 bytes, so each cache line covers 128 KB of memory; when a cache line is replaced from the bitmap cache, its value is checked: if no bit in the replaced cache line is set, all 128 KB are in GPU mode and the corresponding bit in the first-level bitmap is not set, whereas if any bit in the cache line is set, the corresponding bit in the first-level bitmap is set as well.
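By way of non-limiting illustration, the bitmap arithmetic and the replacement-time check described above can be reproduced with the following Python sketch. The function names are hypothetical, "1 GB" is taken as 2^30 bytes, and the sketch assumes one mode bit per cache-line-sized block, as the 1/2^10 and 1/2^9 ratios imply; it is not the hardware implementation.

```python
GIB = 1 << 30  # assumption: the "1 GB" in the text is read as 2**30 bytes


def bitmap_bytes(total_memory_bytes: int, cache_line_bytes: int) -> int:
    """Size of a mode bitmap holding one bit per cache-line-sized block."""
    blocks = total_memory_bytes // cache_line_bytes
    return blocks // 8  # eight mode bits are packed into each byte


def line_coverage_bytes(bitmap_line_bytes: int, block_bytes: int) -> int:
    """Memory described by one bitmap cache line (one bit per block)."""
    return bitmap_line_bytes * 8 * block_bytes


def summary_bit(replaced_line: bytes) -> int:
    """Coarse bit recorded when a bitmap cache line is replaced.

    0: no bit in the line is set, so the whole covered range is in the
       local mode. 1: at least one block is in the other mode.
    """
    return int(any(replaced_line))


print(bitmap_bytes(GIB, 128))         # GPU bitmap bytes per GiB of memory
print(bitmap_bytes(GIB, 64))          # CPU bitmap bytes per GiB of memory
print(line_coverage_bytes(128, 128))  # bytes covered by one 128-byte bitmap line
```

Under these assumptions the GPU bitmap occupies 1 MB per GB and the CPU bitmap 2 MB per GB, and a 128-byte bitmap cache line describes 128 KB of GPU memory, which agrees with the figures stated above.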

Preferably, the data encryption module includes: the CPU and the GPU transmit data directly to each other without re-encryption; encrypting or decrypting a memory block requires an address and a counter; encryption is regarded as unsafe if two blocks are encrypted with the same combination of key, address and counter, so each time a dirty memory block is evicted from the cache and written to memory, the counter of the memory block is incremented; if the CPU and the GPU use a shared encryption key, a unique address must also be maintained for each memory block on the CPU or the GPU, and therefore the address used for memory encryption differs from the local physical address on the CPU or the GPU;

mapping all memories, including system memory and accelerator memory, to a single memory space called encryption address space EAS for encryption, addresses in EAS being used only for encryption and decryption, the addressing of normal memory access remaining unchanged, EAS being implemented by providing each device with a starting address, the receiver requiring additional information about the address of the ciphertext used for decrypting the transmission during data transmission;

the data encryption module firstly obtains a mode from the mode selection module, then obtains a correct counter from the integrity tree module, finds out a correct EAS address, and then decrypts the data block;

The CPU uses half of the MAC to validate its own 64 byte cache line.
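The role of the encryption address space (EAS) may be illustrated by the following Python sketch, offered only as an example under stated assumptions. The device base addresses, the function names and the key are hypothetical, and a SHA-256 based keystream is used purely as a stand-in for the hardware AES counter-mode pad, since the Python standard library has no AES.

```python
import hashlib


def eas_address(device_base: int, local_physical_address: int) -> int:
    """Map a device-local physical address into the shared EAS."""
    return device_base + local_physical_address


def keystream(key: bytes, eas_addr: int, counter: int, length: int) -> bytes:
    """Stand-in pad derived from (key, EAS address, counter). Not AES."""
    out = bytearray()
    index = 0
    while len(out) < length:
        seed = (key + eas_addr.to_bytes(8, "little")
                + counter.to_bytes(8, "little") + index.to_bytes(4, "little"))
        out += hashlib.sha256(seed).digest()
        index += 1
    return bytes(out[:length])


def xor_block(data: bytes, key: bytes, eas_addr: int, counter: int) -> bytes:
    """Encrypt or decrypt a block; the operation is its own inverse."""
    pad = keystream(key, eas_addr, counter, len(data))
    return bytes(a ^ b for a, b in zip(data, pad))


CPU_BASE = 0        # hypothetical EAS starting address of the CPU
GPU_BASE = 1 << 40  # hypothetical EAS starting address of the GPU
shared_key = b"k" * 16
plaintext = bytes(range(64))

# The CPU encrypts a block at local address 0x1000 with counter 7.
cpu_eas = eas_address(CPU_BASE, 0x1000)
ciphertext = xor_block(plaintext, shared_key, cpu_eas, 7)

# The GPU decrypts the transferred block using the CPU's EAS address and counter.
recovered = xor_block(ciphertext, shared_key, cpu_eas, 7)
print(recovered == plaintext)
```

Because the two devices occupy disjoint EAS ranges, the same local physical address on the CPU and on the GPU yields different encryption inputs even under a shared key, and the receiver can decrypt a transferred block once it knows the sender's EAS address and counter.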

Preferably, the integrity tree module of the GPU mode comprises: the different memory controllers each use their own integrity tree and verify their data without interacting with other memory partitions; to access transferred data directly, the GPU supports the protection schemes of both the CPU mode and the GPU mode; data in CPU mode is encrypted using addresses and counters from the CPU, so when such a memory block is fetched, the memory controller obtains the correct counter value and address to decrypt the block; in CPU mode, one leaf node of the integrity tree may cover blocks from different memory partitions, so one module is provided to verify the counters and pass the correct counter values to all memory controllers;

a CPU mode engine is used to verify the counters in the CPU mode integrity tree; the CPU mode engine is an on-chip module connected to the memory controllers; the metadata of transferred data protected in CPU mode is stored in GPU memory, and the CPU mode engine sends requests directly to the memory controllers to read memory, so that the metadata is fetched and returned to the CPU mode engine; when a memory controller needs a real counter value, it sends a request to the CPU mode engine, which reads the integrity tree nodes stored in GPU memory, verifies the counter, and returns the value to the memory controller.

Preferably, the integrity tree module of the CPU mode includes: the engine on the CPU side supports two different modes; in addition to the root node of the CPU mode counters, the memory controller on the CPU reserves space for storing the root nodes of GPU mode subtrees; the two different types of integrity tree are supported directly within a single memory controller of the CPU; the EAS address of a subtree is stored in the cache together with its counter node; when the mode selection module determines that a memory block is in GPU mode, the multi-mode protection engine first verifies the GPU mode counter and the EAS address of the transferred block, and then decrypts and verifies the block; to support the GPU mode, the CPU maintains several GPU mode integrity trees;

all the integrity trees have the same shape, and when the integrity trees are initialized, the information provided by the GPU indicates a memory addressing mode, so that the CPU knows the memory partition address arrangement mode and correctly uses the addressing scheme of the GPU to search the integrity trees corresponding to the memory blocks.

Preferably, the data transmission process of the HSMEM transmission channel is:

step 1: the software sends an instruction to the GPU through a secure channel mechanism of the GPU TEE;

step 2: after reading the instruction, the GPU command processor sends a signal to the HSMEM transmission engine on the CPU side; on the CPU side this is implemented as a special interrupt request that does not actually raise an external interrupt but instead delivers the signal and its information to the CPU-side HSMEM transmission engine;

Step 3: the CPU HSMEM transmission engine receives the signal, reads the root node on the chip and verifies the root node, and simultaneously sends a request to the memory controller to prohibit the system from writing in the address;

step 4: after selecting the root node, the HSMEM transmission engine sends a request to the memory controller to prohibit writing to the region, if some tree nodes in the cache are dirty, they should be flushed back to memory, then it reads the root node of the subtree to be transmitted and transmits the root node over the secure channel;

step 5: the HSMEM transmission engine ensures that the GPU DMA engine is in secure mode, calculates the metadata addresses, and performs the secure memory transfer; the command gives the physical addresses of the range to be transferred, and this special memory transfer bypasses the IOMMU check; the address of the metadata area is provided to the HSMEM transmission engine at system initialization, so the engine directly calculates the metadata addresses and controls the DMA engine to fetch the correct metadata; the data, the MACs and the integrity tree nodes are transferred as they are, so no re-encryption is needed in this step;

step 6: after the metadata is read from the system memory, the GPU sends another special signal to the CPU, the signal does not trigger software interrupt, the HSMEM transmission engine at the CPU end processes the signal, and the root node and the EAS address of the subtree are sent to the GPU through the secure channel;

Step 7: the HSMEM transmission engine on the CPU side sends a message to the memory controller to lift the write restriction on the memory block; within one transmission, for CPU-GPU or GPU-CPU data transmission, HSMEM requires all transmitted data to be in the same mode, whereas for GPU-GPU data transmission the MMT transmission method is adopted and the data may be protected in different modes.
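The division of traffic in steps 3 to 7 may be sketched, purely as an illustration, by the following Python model. The class, field and function names are hypothetical, and the model abstracts away signalling, verification and cache flushing; it only shows that the region is write-locked during the transfer, that data and metadata are copied as stored (no re-encryption), and that the subtree root and EAS address travel over the secure channel.

```python
from dataclasses import dataclass, field


@dataclass
class CpuSide:
    dram: dict          # region -> ciphertext exactly as stored in system memory
    metadata: dict      # region -> MACs and integrity tree nodes, also in memory
    subtree_root: dict  # region -> subtree root counter held on-chip
    eas_base: int       # hypothetical EAS starting address of the CPU
    write_locked: set = field(default_factory=set)

    def write(self, region: str, ciphertext: bytes) -> None:
        """Ordinary write path; refused while the region is being transferred."""
        if region in self.write_locked:
            raise PermissionError(f"{region} is write-locked for transfer")
        self.dram[region] = ciphertext


def begin_transfer(cpu: CpuSide, region: str):
    """Steps 3 to 6: lock the region, then split traffic between two paths."""
    cpu.write_locked.add(region)          # steps 3-4: forbid writes to the region
    dma_payload = {                       # step 5: copied verbatim by DMA
        "data": cpu.dram[region],
        "metadata": cpu.metadata[region],
    }
    secure_channel_payload = {            # steps 4 and 6: sent over the secure channel
        "subtree_root": cpu.subtree_root[region],
        "eas_base": cpu.eas_base,
    }
    return dma_payload, secure_channel_payload


def end_transfer(cpu: CpuSide, region: str) -> None:
    """Step 7: lift the write restriction on the region."""
    cpu.write_locked.discard(region)


cpu = CpuSide({"r0": b"ciphertext"}, {"r0": b"macs+nodes"}, {"r0": 5}, eas_base=0)
dma, channel = begin_transfer(cpu, "r0")
print(dma["data"] == cpu.dram["r0"])  # the bulk data is not re-encrypted
end_transfer(cpu, "r0")
```

In this sketch only the small secure-channel payload would need conventional encryption, which is consistent with the statement that each transmission encrypts only a small number of cache blocks.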

Preferably, the memory access flow of the multi-mode data protection engine is:

step 11: the mode of the current accessed block is obtained through a mode selection module, if the current accessed block is a CPU mode, the following steps are carried out, and if the current accessed block is a GPU mode, an access mode of a common GPU safe memory is adopted;

step 12: when such a block is fetched from DRAM, it is verified against the CPU mode integrity tree and decrypted, using the integrity tree module and the memory encryption module;

step 13: the decrypted and verified data are put into a cache for access;

step 14: after the access is finished, if the block has not been modified, it is simply evicted from the cache; if the block has been modified, the GPU mode counter is read, the block is re-encrypted, the GPU mode metadata is updated, and the block is then written back; when the GPU needs to encrypt the block and write it to memory, it does not use the EAS address provided by the CPU but encrypts the block with its own EAS address;

Step 15: if the GPU mode is switched to, the mode selection module is notified, the mode selection module updates corresponding information, and the current memory block is marked to be changed back to the GPU mode.
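The eviction decision of steps 14 and 15 on the GPU may be illustrated by the following Python sketch. The names are hypothetical, and the sketch assumes that the counter is incremented on every dirty write-back, as described for counter-based memory encryption elsewhere in this document; actual encryption and MAC handling are omitted.

```python
from dataclasses import dataclass

CPU_MODE, GPU_MODE = "cpu", "gpu"


@dataclass
class BlockState:
    mode: str      # protection mode recorded by the mode selection module
    eas_addr: int  # EAS address used as encryption input for this block
    counter: int   # counter used as encryption input for this block


def evict(block: BlockState, modified: bool,
          gpu_eas_addr: int, gpu_counter: int) -> BlockState:
    """Return the block's protection state after it leaves the GPU cache."""
    if not modified:
        # Clean block: dropped from the cache, the copy in memory is unchanged.
        return block
    if block.mode == GPU_MODE:
        # Ordinary GPU secure memory write-back: bump the counter.
        return BlockState(GPU_MODE, block.eas_addr, block.counter + 1)
    # Dirty CPU-mode block: re-encrypt with the GPU's own EAS address and
    # GPU-mode counter, and mark the block as switched to GPU mode.
    return BlockState(GPU_MODE, gpu_eas_addr, gpu_counter + 1)


transferred = BlockState(CPU_MODE, eas_addr=0x1000, counter=3)
print(evict(transferred, modified=False, gpu_eas_addr=0x9000, gpu_counter=10))
print(evict(transferred, modified=True, gpu_eas_addr=0x9000, gpu_counter=10))
```

A block that is only read therefore stays in CPU mode without any conversion cost, while a modified block is converted to GPU mode at write-back time.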

Compared with the prior art, the invention has the following beneficial effects:

(1) High-performance data transmission (without re-encryption) is realized between heterogeneous secure memories, and the time for transmission between a CPU and a GPU is less, so that the overall performance of the application is higher;

(2) Supporting multiple secure memory protection modes on a single secure memory device, so that data in the past (still using the original secure memory protection mode) can be directly accessed by the device without immediately converting the modes;

(3) On a device with a secure memory, a memory block can be efficiently switched between different secure memory protection modes, and can be transparently converted into a secure memory protection mode with higher performance in the running process according to the application requirements, so that the optimal performance is realized on the current device.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is an Intel SGX style memory integrity tree;

FIG. 2 is a memory encryption process;

FIG. 3 is a GPU memory partition;

FIG. 4 is a system architecture;

FIG. 5 is metadata of a selected integrity tree;

FIG. 6 is a multi-mode protection engine.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications could be made by those skilled in the art without departing from the inventive concept. These are all within the scope of the present invention.

Example 1

System architecture or scenario for application of the present invention

HSMEM targets heterogeneous platforms for confidential computing. The protected application runs in TEEs on both the CPU side and the GPU side, and both TEEs use secure memory to protect data. The goal of HSMEM is to build DRAM security for the GPU TEE. For TEE management, HSMEM may use a mechanism like Graviton [OSDI'18]. Only the on-chip components of the CPU and GPU are trusted. Therefore, all data leaving the chip should be protected, and its confidentiality, integrity and freshness should be guaranteed. The present invention considers GPUs or other accelerators that use GDDR memory. An attacker may tap the memory bus of the CPU or GPU, or the PCIe bus or other interconnect between the CPU and the GPU. An attacker may also physically issue commands to read, modify or replay the data in memory.

The core hardware modules of the HSMEM comprise an HSMEM transmission engine and a multi-mode protection engine. The HSMEM transmission engine establishes an HSMEM transmission channel, and high-speed transmission between the CPU and the GPU is realized. It is located inside the chip and controls special DMA requests for secure memory. It selects metadata associated with the memory to be transferred and can transfer the data without additional data encryption and decryption. For each data transmission, only a few cache blocks need to be encrypted, thus providing speeds approaching those of plaintext transmissions. After the data is transferred to the CPU or GPU, the multi-mode protection engine is located in the memory controller and can directly use the transferred data of the original protection scheme. The engine may support a variety of protection schemes. In the present invention, each memory protection scheme supported by the multi-mode protection engine is referred to as a "mode". A memory block may be protected in different modes. Inside the multi-mode protection engine, there is a mode selection module for selecting modes for the memory blocks. There is also a data encryption module for data encryption and decryption and an integrity tree module for maintaining the integrity tree in different modes. Furthermore, HSMEM is not limited to protection schemes similar to BMT, but can be extended to other memory protection schemes with looser security requirements. HSMEM also provides an interface that allows a developer to explicitly alter the mode of a memory block using special instructions.

The overall system architecture implementing the core modules of the present invention is shown in FIG. 4.

The HSMEM transmission engine builds an HSMEM transmission channel for secure Direct Memory Access (DMA). The HSMEM transmission channel enables high-speed data transmission between different devices equipped with secure memory. During data transmission no re-encryption is needed; only the data and the security metadata are transmitted. At initialization, the HSMEM transmission engine generates keys with the peer device to construct a conventional secure channel. This secure channel is part of the HSMEM transmission channel. When data is to be transmitted, the HSMEM transmission engine prepares the metadata that needs to be transmitted with the data. In addition, the secure channel is used to exchange some basic system configuration during initialization. The HSMEM transmission engine calculates the destination address of the metadata corresponding to the data and then controls the DMA engine to copy the metadata to the destination location. The GPU uses its own DMA engine for data transfer. The transmission engine comprises an initialization module and a meta information selection module.

Initialization module: when the system is started or the GPU is plugged in, the CPU and the GPU should mutually authenticate and establish a conventional secure channel. Through the secure channel, the CPU and GPU may negotiate multiple shared keys for memory encryption. In addition, the CPU and GPU provide memory layout information to each other: the integrity tree format, the encryption algorithm type, and the address of the metadata area. The GPU should also provide some additional information to the CPU, including the number of memory controllers and the memory addressing scheme. After this is done, the CPU and GPU initialize their secure memory. The secure channel should also be able to ensure freshness of the transmitted data. This may be achieved by using an incrementing Initialization Vector (IV) for AES-GCM encryption. It should be noted that maintaining such a secure channel does not place a great burden on the HSMEM transmission engine; the PCIe security enhancement functions, which already maintain such a secure channel using AES-GCM, can even be reused.
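How an incrementing IV provides freshness may be illustrated by the following Python sketch. It is an example only: the class names are hypothetical, and HMAC-SHA256 from the standard library stands in for the AES-GCM authentication tag, so the sketch shows authentication and replay rejection but not confidentiality.

```python
import hashlib
import hmac


def _tag(key: bytes, iv: int, payload: bytes) -> bytes:
    """Authentication tag over a 96-bit IV and the payload (stand-in for GCM)."""
    return hmac.new(key, iv.to_bytes(12, "big") + payload, hashlib.sha256).digest()


class Sender:
    def __init__(self, key: bytes) -> None:
        self.key = key
        self.iv = 0

    def seal(self, payload: bytes):
        self.iv += 1  # the IV increases with every message and is never reused
        return self.iv, payload, _tag(self.key, self.iv, payload)


class Receiver:
    def __init__(self, key: bytes) -> None:
        self.key = key
        self.last_iv = 0

    def open(self, iv: int, payload: bytes, tag: bytes) -> bytes:
        if not hmac.compare_digest(tag, _tag(self.key, iv, payload)):
            raise ValueError("authentication failed")
        if iv <= self.last_iv:
            raise ValueError("stale or replayed message")
        self.last_iv = iv
        return payload


key = b"s" * 32  # hypothetical shared key negotiated at initialization
sender, receiver = Sender(key), Receiver(key)
message = sender.seal(b"subtree root counter")
print(receiver.open(*message))
try:
    receiver.open(*message)  # an attacker replays the captured message
except ValueError as err:
    print("rejected:", err)
```

The receiver accepts a message only if its IV is larger than every IV seen so far, so a previously captured root node cannot be replayed over the channel.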

Meta information selection module: some of the security metadata is transmitted with the data when the data is transmitted over the HSMEM transmission channel. For a BMT, the counters are transmitted along with the MAC values. HSMEM requires that each data transmission cover at least one entire subtree, so the smallest transmission granularity is the memory size covered by a single integrity tree node. A selected subtree must satisfy the condition that all counters in its root node relate to data to be transmitted. In the present invention, the counter in the parent node that corresponds to a subtree root node is referred to as the subtree root counter. Here, it is assumed that the CPU and GPU use split counters in the integrity tree. The fan-out of a GPU integrity tree node is 128 and the GPU cache line size is 128 bytes, while the fan-out of a CPU integrity tree node is 64 and the CPU cache line size is 64 bytes. The granularity of CPU-GPU transmission therefore differs from that of GPU-CPU transmission. The memory size covered by one leaf node is the fan-out multiplied by the cache line size. For CPU-GPU transfers, the granularity is 4KB (64 × 64 bytes), which means that data is transferred in units covered by leaf nodes of the integrity tree. To select the metadata, the HSMEM transmission engine first uses the address of the data to be transmitted to determine the integrity tree nodes that should be transmitted with the data. It finds one or more subtrees covering the transmission data range, ensuring that all data covered by the root node of each selected subtree is data to be transmitted. If some of the counters in a candidate root node do not belong to the data to be transmitted, the engine continues to look for subtrees at the next lower level of the integrity tree, as shown in FIG. 5.
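The subtree selection described above may be sketched, purely for illustration, as a greedy search that starts at the highest tree level and descends whenever a candidate subtree would cover data outside the transfer range. The function name `select_subtrees`, the number of tree levels, and the representation of a subtree as a `(level, start, size)` tuple are assumptions of this sketch; the fan-out of 64 and the 64-byte cache line are the CPU values given above.

```python
def select_subtrees(start, length, fanout=64, line_size=64, levels=4):
    """Return (level, start_address, covered_bytes) for each subtree to send.

    Level 0 is a leaf node covering fanout * line_size bytes (4KB on the CPU);
    a node at level n covers fanout times more than a node at level n - 1.
    """
    leaf_cover = fanout * line_size
    if length <= 0 or start % leaf_cover or length % leaf_cover:
        raise ValueError("range must be a positive multiple of the leaf coverage")

    end = start + length
    selected = []
    addr = start
    while addr < end:
        level = levels - 1
        # Descend until the subtree rooted here is aligned at addr and lies
        # entirely inside the range, i.e. every counter in its root node
        # belongs to data that is being transmitted.
        while level > 0:
            cover = leaf_cover * fanout ** level
            if addr % cover == 0 and addr + cover <= end:
                break
            level -= 1
        cover = leaf_cover * fanout ** level
        selected.append((level, addr, cover))
        addr += cover
    return selected
```

For example, under these assumptions a 256KB range aligned to 256KB is covered by a single level-1 subtree, whereas the same amount of data starting 4KB later has to be sent as 64 separate leaf nodes.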

GPU-CPU data transfer uses a different granularity. Due to the address interleaving of the GPU memory, the cache lines of a given memory controller may not be adjacent in the address space. However, to simplify metadata management of the transferred data on the CPU, HSMEM ensures that at least one subtree of each memory controller's integrity tree is transferred and that all memory controllers transfer the same amount of data to the CPU. Thus, the GPU mode integrity trees maintained on the CPU are homogeneous, and at least NR_MemCtrl root nodes must be transmitted. The transfer granularity also depends on the address interleaving granularity. NR_MemInter is the granularity of memory address interleaving, meaning that adjacent NR_MemInter bytes come from the same memory partition. The copy granularity is max(NR_MemInter, 16KB) × NR_MemCtrl, i.e., all memory controllers transmit the same amount of data to the CPU. Assuming that the interleaving granularity is 2 memory blocks (256 bytes) and there are 16 memory controllers, the copy granularity is 256KB.
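The two transfer granularities can be checked with a short calculation. The sketch below assumes that the per-controller amount is the larger of the interleaving granularity and the 16KB covered by one GPU leaf node (128 × 128 bytes); this reading is an assumption, chosen because it reproduces the 256KB figure of the worked example with 16 memory controllers and 256-byte interleaving. The function names are illustrative only.

```python
def cpu_to_gpu_granularity(fanout=64, line_size=64):
    """Bytes covered by one CPU integrity tree leaf node."""
    return fanout * line_size


def gpu_to_cpu_granularity(nr_mem_inter, nr_mem_ctrl, fanout=128, line_size=128):
    """Bytes copied per GPU-to-CPU transfer unit.

    Every memory controller contributes the same amount, and that amount is
    at least the coverage of one GPU leaf node (assumed reading, see above).
    """
    leaf_cover = fanout * line_size  # 16KB on the GPU
    return max(nr_mem_inter, leaf_cover) * nr_mem_ctrl
```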

The multi-mode protection engine supports the use of multiple memory protection schemes in a device equipped with secure memory. The multi-mode protection engines of both the CPU and the GPU support two memory protection schemes. When the CPU or GPU receives data through the HSMEM transmission channel, the data is initially still protected by its original protection scheme. The original mode is used when reading the data, and the block is switched to the local mode when the data is written. The following first describes how to read CPU mode memory on the GPU and GPU mode memory on the CPU, and then how to switch between the two memory modes. The architecture of the multi-mode protection engine is shown in FIG. 6. The CPU mode integrity tree and MACs on the GPU are referred to as shadow metadata. Similarly, the GPU mode security metadata on the CPU is also referred to as shadow metadata.

Mode selection module: on the CPU or GPU, the memory controller must know the current mode of each memory block. When a block with security metadata is transferred from the CPU to the GPU, it is still protected by its original protection scheme, and vice versa. The mode selection module is a hardware module in the memory controller that records this information. It uses one bit to indicate the protection scheme (i.e., mode) of each memory block. The cache line size of the GPU is 128 bytes, while the cache line size of the CPU is 64 bytes. On the GPU, the bitmap size is therefore 1/2^10 of the total memory size; on the CPU, it is 1/2^9 of the total memory size. For 1GB of GPU memory, the GPU requires 1MB of space for the bitmap, while for 1GB of system memory the CPU requires 2MB. Current CPUs and GPUs have memory capacities ranging from tens to hundreds of GB, so it is not practical to store the entire bitmap on-chip, yet every memory access must be checked against the bitmap. HSMEM uses a two-level check to reduce the overhead of accessing the bitmap. On the GPU, each bit of the first-level bitmap represents 128KB, while each bit of the second-level bitmap represents 128 bytes. The first-level bitmap is stored directly in an on-chip buffer inside the memory controller. The second-level bitmap is stored in secure memory, and its integrity must be ensured. HSMEM reserves a region that is always in local mode to store this bitmap.

On the GPU, this region must always be in GPU mode, while on the CPU it must always be in CPU mode. A dedicated cache, referred to as the bitmap cache, is provided for the second-level bitmap, with a cache line size of 128 bytes (1024 bits). Thus, the bits stored in one cache line together represent 128KB of memory. When a cache line is evicted from the bitmap cache, its value is checked. If no bit in the evicted cache line is set, meaning that the entire 128KB is in GPU mode, the corresponding bit in the first-level bitmap is not set. If any bit is set, both the first-level bit and the corresponding bit in the second-level bitmap are set. Each memory controller on the GPU has a mode selection module to manage its memory partition. The mode selection module on the CPU is similar, except that each bit in the second-level bitmap represents 64 bytes and each bit in the first-level bitmap represents 64KB.
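The two-level check can be illustrated with a small software model of the GPU side. The sketch assumes that the on-chip level holds one bit per 128KB region and that the in-memory level holds one bit per 128-byte block, so that a clear coarse bit answers the lookup without touching memory. For brevity the coarse bit is updated immediately rather than when a bitmap cache line is evicted, and the fine level is kept as a sparse dictionary; the class name `ModeBitmap` and its methods are hypothetical.

```python
LINE = 128            # bytes tracked by one fine-grained bit (GPU cache line)
REGION = 128 * 1024   # bytes tracked by one coarse bit (one bitmap cache line)


class ModeBitmap:
    """Toy model: a set bit means the block is in the non-local (CPU) mode."""

    def __init__(self, mem_size: int):
        self.coarse = [False] * (mem_size // REGION)  # models the on-chip buffer
        self.fine = {}  # region index -> set of line indices (models secure memory)

    @staticmethod
    def _split(addr: int):
        return addr // REGION, (addr % REGION) // LINE

    def set_foreign(self, addr: int) -> None:
        region, line = self._split(addr)
        self.fine.setdefault(region, set()).add(line)
        self.coarse[region] = True

    def set_local(self, addr: int) -> None:
        region, line = self._split(addr)
        lines = self.fine.get(region)
        if lines is None:
            return
        lines.discard(line)
        if not lines:  # the whole region is back in local mode
            del self.fine[region]
            self.coarse[region] = False

    def is_foreign(self, addr: int) -> bool:
        region, line = self._split(addr)
        if not self.coarse[region]:
            return False  # fast path: no access to the in-memory bitmap
        return line in self.fine[region]
```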

Data encryption module: HSMEM allows the CPU and GPU to transfer data directly to each other without re-encryption. Encrypting or decrypting a memory block requires an address and a counter, as shown in FIG. 2. Encryption is considered unsafe if two blocks are encrypted using the same encryption key, address, and counter, because they would use the same One-Time Pad (OTP). Thus, each time a dirty memory block is evicted from the cache and written to memory, the counter for that memory block is incremented. Similar to MMT (see the related art), if the CPU and GPU use shared encryption keys, HSMEM also needs to maintain a unique address for each memory block on the CPU or GPU. Memory encryption therefore uses addresses that are different from the local physical addresses on the CPU or GPU. HSMEM maps all memory in the system, including system memory and accelerator memory, to a single address space used for encryption, called the Encrypted Address Space (EAS). The addresses in the EAS are used only for encryption and decryption; the addressing for normal memory access remains unchanged. The EAS can be easily implemented by assigning a start address to each device. During data transmission, the receiving side needs additional address information to decrypt the transmitted ciphertext. The data encryption module first obtains the mode from the mode selection module, then obtains the correct counter from the integrity tree module and finds the correct EAS address. The data block can then be decrypted. The GPU cache line size is 128 bytes, twice the CPU cache line size, so in the GPU mode integrity tree on the CPU, two adjacent CPU memory blocks use the same counter. When such a memory block is fetched on the CPU, the counter in the GPU mode integrity tree is verified and used for decryption on the CPU. The EAS addresses are also stored with the nodes of the integrity tree.

Although the two CPU cache lines share the same GPU MAC, the CPU can read just one of them. This is because HSMEM changes the way the GPU MAC is calculated: it is generated by concatenating two MACs, each computed over one 64-byte half of the GPU cache line. Thus, the CPU can use half of the MAC to verify its own 64-byte cache line.
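The concatenated MAC can be illustrated as follows. The sketch uses truncated HMAC-SHA256 only as a stand-in, since the MAC algorithm and tag length are not specified here; the 4-byte half-tag, the inclusion of the EAS address and counter in the MAC input, and the function names are all assumptions of this example.

```python
import hashlib
import hmac
import struct

HALF = 64      # CPU cache line size in bytes
TAG_HALF = 4   # bytes per half-MAC (assumed length)


def half_mac(key: bytes, eas_addr: int, counter: int, half: bytes) -> bytes:
    """MAC over one 64-byte half, bound to its EAS address and counter."""
    msg = struct.pack(">QQ", eas_addr, counter) + half
    return hmac.new(key, msg, hashlib.sha256).digest()[:TAG_HALF]


def gpu_line_mac(key: bytes, eas_addr: int, counter: int, line: bytes) -> bytes:
    """GPU MAC of a 128-byte line: the two half-MACs concatenated."""
    if len(line) != 2 * HALF:
        raise ValueError("a GPU cache line is 128 bytes")
    return (half_mac(key, eas_addr, counter, line[:HALF])
            + half_mac(key, eas_addr + HALF, counter, line[HALF:]))


def cpu_verify_half(key, eas_addr, counter, cpu_line, gpu_mac, index) -> bool:
    """Verify one 64-byte CPU line (index 0 or 1) with its half of the MAC."""
    expected = half_mac(key, eas_addr + index * HALF, counter, cpu_line)
    tag = gpu_mac[index * TAG_HALF:(index + 1) * TAG_HALF]
    return hmac.compare_digest(expected, tag)
```

Both halves share one counter, as in the GPU mode integrity tree on the CPU, but each half can be verified without reading the other 64 bytes.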

Integrity tree module (GPU mode): GPUs with confidential computing capabilities natively support the GPU mode protection scheme. For the GPU mode integrity tree, each memory controller may use its own integrity tree and verify its data without interacting with other memory partitions. To access transferred data directly, the GPU must support both the CPU mode and the GPU mode protection schemes. CPU mode data is encrypted using the address and counter from the CPU. When such a memory block is fetched, the memory controller must obtain the correct counter value and address to decrypt the block. However, in CPU mode, one leaf node of the integrity tree covers blocks from different memory partitions. Thus, there must be one module that verifies the counters and passes the correct counter values to all memory controllers. HSMEM uses a dedicated unit, the CPU mode protection engine (CPU mode engine), to verify the counters in the CPU mode integrity tree. The CPU mode engine is an on-chip module connected to the memory controllers. The transmitted CPU mode protection metadata is stored in GPU memory. The CPU mode engine reads memory by sending a request directly to a memory controller; the metadata is fetched directly and returned to the CPU mode engine. When a memory controller needs the actual counter value, it sends a request to the CPU mode engine. The CPU mode engine reads the integrity tree nodes stored in GPU memory, verifies the counter, and returns the value to the memory controller.

As previously described, several subtrees are transmitted each time data is transmitted to the GPU. The root counter of each subtree is protected by the secure channel. After the transfer of a subtree completes, its root counter is stored on-chip in the accelerator. The multi-mode protection engine logically uses the CPU mode integrity tree to protect all secure memory. The physical address of a memory block determines which CPU mode counter the block is associated with. When no data has been transferred, there are no active nodes in the CPU mode integrity tree. When data is transferred from the CPU, the corresponding nodes in the CPU mode integrity tree become active, depending on the physical address of the transferred data. The transmitted subtrees are placed directly in the corresponding locations. However, since a subtree has no valid parent node, its root counter is stored on-chip. Each CPU mode root counter requires at most 8 bytes. The multi-mode protection engine only needs to store the counter value from the parent counter. At the same time, the block address on the CPU or in the EAS is also transferred with the root counter of the subtree and stored on-chip. Thus, only one counter and one 64-bit EAS address need to be stored for each subtree. When a counter is needed, the tree nodes on the path toward the root counter are verified, continuing until the first node found in the on-chip cache. In the CPU mode counter cache on the accelerator, each node is stored with the ID of its on-chip root node. When a node in the counter cache is verified, its EAS address is derived from the EAS address of the root node. The accelerator is not a conventional processor core but fixed logic for the CPU mode protection scheme, so more on-chip storage may be used.

When a GPU memory controller fetches a memory block, it may request the EAS address and counter value from the accelerator. The memory controller can then verify and decrypt the fetched memory block. However, if there are too many subtrees, the on-chip buffer cannot store the EAS addresses and counters of all of them. HSMEM provides two approaches for this case. In the first approach, the unit sends an interrupt to the GPU command processor, which then launches a special GPU kernel to convert a range of memory to the local GPU mode. In the second approach, nodes may be evicted and saved in a block of secure memory; if a node is not found in the cache, the engine fetches it from the protected memory.
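The counter verification walk described above may be sketched as follows. This is a simplified model under stated assumptions: each node's MAC is computed over its own counters and the counter held for it in its parent, the subtree root is checked against the root counter stored on-chip, and truncated HMAC-SHA256 stands in for the unspecified MAC. The `Node` class and the functions `seal` and `read_counter` are hypothetical names.

```python
import hashlib
import hmac
import struct


def node_mac(key: bytes, counters, parent_counter: int) -> bytes:
    msg = struct.pack(">Q", parent_counter)
    msg += b"".join(struct.pack(">Q", c) for c in counters)
    return hmac.new(key, msg, hashlib.sha256).digest()[:8]


class Node:
    def __init__(self, counters, parent=None, slot=0):
        self.counters = list(counters)
        self.parent = parent  # None for the root of a transferred subtree
        self.slot = slot      # index of this node's counter inside the parent
        self.mac = b""


def _parent_counter(node, root_counter):
    # The subtree root has no valid parent node, so the on-chip value is used.
    if node.parent is None:
        return root_counter
    return node.parent.counters[node.slot]


def seal(key, node, root_counter):
    """Compute and store the MAC of a node (done by the sending side)."""
    node.mac = node_mac(key, node.counters, _parent_counter(node, root_counter))


def read_counter(key, leaf, index, root_counter, trusted):
    """Verify the path from leaf upward until a trusted (cached) node is hit."""
    path = []
    node = leaf
    while node is not None and node not in trusted:
        expected = node_mac(key, node.counters, _parent_counter(node, root_counter))
        if not hmac.compare_digest(node.mac, expected):
            raise ValueError("integrity tree verification failed")
        path.append(node)
        node = node.parent
    trusted.update(path)  # cache nodes only after the whole path has verified
    return leaf.counters[index]
```

A second read of the same leaf stops at once because the leaf is already in the trusted set, which models the walk ending at the first node found in the on-chip cache.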

Integrity tree module (CPU mode): the engine on the CPU side must also support two different modes. In addition to the root node of the CPU mode counters, the memory controller on the CPU reserves space to store the root nodes of the GPU mode subtrees. These two different types of integrity trees may be supported directly in a single memory controller of the CPU. The EAS address of a subtree is also stored with the counter node in the cache. When the mode selection module determines that a memory block is in GPU mode, the multi-mode protection engine first verifies the GPU mode counter and the EAS address of the transferred block. The block can then be decrypted and verified. To support GPU mode, the CPU maintains several GPU mode integrity trees. Since the memory address interleaving is fixed, the CPU does not need to store the EAS addresses for all GPU mode integrity trees. All GPU mode integrity trees have the same shape, which is ensured by the transmission granularity chosen in this scheme. At initialization, the information provided by the GPU describes its memory addressing scheme. Therefore, the CPU knows the address interleaving across memory partitions and can correctly use the GPU's addressing scheme to find the integrity tree corresponding to a memory block.

Data transmission process of the HSMEM transmission channel: in the interaction between the CPU and the GPU, a GPU kernel space driver (a module in the operating system kernel) or a user space driver (the CUDA driver) typically sends commands over a GPU channel, in addition to reading or writing MMIO registers. With a GPU TEE mechanism such as Graviton, the integrity and confidentiality of the commands are already protected. When the driver pushes a command through the GPU channel to request a data transfer by the HSMEM transmission engine, the address range of the memory to be transferred must also be provided. The address translation hierarchy of the GPU needs to be considered. The GPU has its own page tables, and the address used by a GPU kernel is a GPU Virtual Address (VA). The GPU MMU translates this address to a GPU Physical Address (PA). Whether a page resides in GPU memory or in system memory is marked in the GPU page table. Current GPUs have two types of GPU channels with different privileges: privileged and non-privileged channels. The kernel space driver may use a privileged channel, and commands in a privileged channel may use physical addresses directly. User space drivers use a non-privileged channel, in which pushed commands can only use GPU VAs that are eventually translated to GPU PAs. However, in modern computer systems, an IOMMU is typically used to guard any access by a device to system memory, so the GPU cannot directly observe IOPAs or system physical addresses. We note that modern GPUs use Address Translation Services (ATS). For some supercomputers that support a CPU-GPU NVLink interconnect (e.g., the Summit supercomputer), ATS enables the GPU to access system memory using CPU virtual addresses. With ATS enabled, the device (here, the GPU) sends a request to the address translation agent on the CPU side to obtain the IOPA corresponding to a given address. The agent translates the address and returns the IOPA to the GPU. The GPU caches address translations in an Address Translation Cache (ATC). If an entry is found in the ATC, the IOPA may be used directly, bypassing the IOMMU and relieving pressure on the IOTLB in the IOMMU. HSMEM transmissions may use ATS directly to determine the real address of the data to be transmitted. A secure transmission request requires a contiguous transmission range, so only the address of the first byte is needed.

Step 1: the software sends a command to the GPU through the secure channel mechanism of the GPU TEE;

Step 2: after reading the command, the GPU command processor sends a signal to the HSMEM transmission engine on the CPU side. This can be implemented as a special interrupt request that does not actually raise an external interrupt but instead passes signals and information to the HSMEM transmission engine on the CPU side;

Step 4: after the root node is selected, the HSMEM transmission engine sends a request to the memory controller to prohibit writes to the region. If any tree nodes in the cache are dirty, they are flushed back to memory. The engine then reads the root node of the subtree to be transmitted and transmits it over the secure channel;

Step 5: the HSMEM transmission engine ensures that the GPU DMA engine is in secure mode, calculates the metadata addresses, and performs the secure memory transfer. The command gives the physical addresses of the range to be transmitted, and this special memory transfer can bypass the IOMMU check. Because the address of the metadata area is given to the HSMEM transmission engine at system initialization, the engine can directly calculate the metadata addresses and control the DMA engine to fetch the correct metadata. The data, the MACs, and the integrity tree nodes are transmitted, so no re-encryption is needed;

Step 6: after the metadata has been read from system memory, the GPU sends another special signal to the CPU. This signal does not trigger a software interrupt; instead, the HSMEM transmission engine on the CPU side handles it and sends the root node and the EAS address of the subtree to the GPU through the secure channel;

Step 7: the HSMEM transmission engine on the CPU side sends a message to the memory controller to lift the write restriction on the memory region being transferred.

For CPU-GPU or GPU-CPU data transmission over the transmission channel, HSMEM requires all transmitted data to be in the same mode. For example, if the GPU needs to transmit data to the CPU, all of the data must be protected by GPU mode or all by CPU mode. For GPU-GPU data transmission, the MMT transmission method is adopted and the data can be protected by different modes, meaning that some data can be protected by CPU mode and some by GPU mode.

Memory access flow of the multi-mode data protection engine: taking the GPU side as an example, the GPU integrity tree remains unchanged while a memory block is in CPU mode. The counter nodes remain in memory in plaintext, as in the original integrity tree structure. The CPU-side implementation is similar.

Step 1: the mode of the block being accessed is obtained from the mode selection module. If it is CPU mode, the following steps are carried out; if it is GPU mode, the ordinary GPU secure memory access path is used;

Step 2: when such a block is fetched from DRAM, it is verified against the CPU mode integrity tree and decrypted, using the integrity tree module and the data encryption module;

Step 3: the decrypted and verified data is placed in the cache for access;

Step 4: after the access completes, if the block has not been modified, it can be evicted from the cache directly. If the block has been modified, the GPU mode counter is read, the block is re-encrypted, the GPU mode metadata such as the integrity tree is updated, and the block is written back. When the GPU encrypts the block and writes it to memory, it uses the GPU's own EAS address instead of the EAS address provided by the CPU;

Step 5: if the block is switched to GPU mode, the mode selection module is notified; it updates the corresponding information and marks the memory block as changed back to GPU mode.
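The access flow above can be modelled in a few lines, as a rough sketch only. The model assumes counter-mode style encryption in which a one-time pad is derived from the key, the EAS address, and the counter; SHAKE-256 is used merely as a stand-in for the real cipher, integrity verification is omitted, and the write-back is assumed to re-encrypt the block under the GPU's local EAS address with an incremented GPU mode counter. All names (`Block`, `read_block`, `write_back`) are hypothetical.

```python
import hashlib

LINE = 128  # GPU cache line size in bytes


def one_time_pad(key: bytes, eas_addr: int, counter: int) -> bytes:
    # Stand-in keystream; a real design would use a block cipher in counter mode.
    seed = key + eas_addr.to_bytes(8, "big") + counter.to_bytes(8, "big")
    return hashlib.shake_256(seed).digest(LINE)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


class Block:
    """Per-block state seen by the GPU memory controller in this toy model."""

    def __init__(self, ciphertext, mode, eas_addr, counter):
        self.ciphertext = ciphertext
        self.mode = mode          # "CPU" (transferred) or "GPU" (local)
        self.eas_addr = eas_addr  # EAS address used when it was encrypted
        self.counter = counter


def read_block(key: bytes, blk: Block) -> bytes:
    """Steps 1 to 3: decrypt with the address and counter of the block's mode."""
    return xor(blk.ciphertext, one_time_pad(key, blk.eas_addr, blk.counter))


def write_back(key, blk, plaintext, local_eas_addr, gpu_counters):
    """Steps 4 and 5: re-encrypt a modified block under GPU mode."""
    gpu_counters[local_eas_addr] = gpu_counters.get(local_eas_addr, 0) + 1
    blk.counter = gpu_counters[local_eas_addr]
    blk.eas_addr = local_eas_addr
    blk.ciphertext = xor(plaintext, one_time_pad(key, blk.eas_addr, blk.counter))
    blk.mode = "GPU"  # the mode selection module would clear the block's bit here
```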

Example 2

Example 2 is a preferred example of example 1.

The invention can be implemented on a GPU and CPU using GDDR memory, can support a memory protection mode with an integrity tree, and can defend against physical replay attacks. Where a TEE mechanism already exists between the CPU and the GPU, this scheme can be built on top of the original TEE mechanism to achieve efficient data transmission. The invention also provides a special instruction: a user can write a special GPU kernel that uses this instruction to actively convert memory without bringing the data into the GPU cache. The instruction sends a request directly to the memory controller, the memory controller converts the memory mode directly, and the converted result is not placed in the cache.

In addition, a cache dedicated to storing integrity tree nodes and data MACs is added to each memory controller to accelerate access to memory protection metadata. In an implementation, multiple integrity trees can be used within the same memory controller to protect the memory, which reduces the tree depth, similar to MMT.

This embodiment achieves high-performance data transmission between the CPU and the GPU while protecting the confidentiality, integrity, and freshness of the secure memory, and ensures data security even when the secure memory of the CPU and that of the GPU are heterogeneous.

Example 3

Example 3 is a preferred example of example 1.

HSMEM may support different protection schemes. Examples 1 and 2 mainly use the strongest protection scheme, which ensures confidentiality, integrity, and freshness at the same time to defend against physical attacks. This may be achieved by a Bonsai Merkle Tree (BMT) based design, as in Intel SGX. However, some hardware platforms do not provide all of these security guarantees. For example, under a physical attack, AMD SEV does not guarantee integrity and freshness, while Intel TDX does not consider physical replay attacks. Thus, the integrity tree and MAC may be omitted in such an architecture.

HSMEM can be extended to other protection schemes by customizing the multi-mode protection engine. In such an architecture, no integrity tree is required. Because the address is still used as a tweak for encryption, the address still needs to be transmitted with the data and recorded by the receiver. For transferred blocks that are not encrypted using a local EAS address, the multi-mode protection engine needs a mechanism to find the correct EAS address to decrypt the block. HSMEM uses an EAS translation engine to obtain the EAS address for decryption. For large memory transfers, it is sufficient to record only the starting address of the original EAS range. Thus, the EAS translation engine maintains range registers that record the start address, end address, and offset of the data when it is transferred. This information is stored in the range buffer. When the mode selection module finds that a memory block is not in the local mode, the memory controller checks whether its address falls within any range in the range buffer. However, the size of the range buffer is limited, so the EAS translation engine also maintains a multi-level translation table for EAS addresses. The table translates local addresses to original EAS addresses at a granularity of 4KB. It is organized in a manner similar to a page table and may also support a mechanism similar to large pages. The EAS translation engine periodically scans the bitmap and, if all memory in a range has switched to local mode, deletes the corresponding entry from the range buffer. If the range buffer does not have enough space to store a new range, an entry is evicted and the translation table is updated. The engine checks whether the new range is larger than an existing range in the buffer, and the old range is evicted only if the new range is larger.

Since data is typically transferred in large blocks, large pages can be used even when the range buffer is full and the translation table is used, thereby reducing overhead. The translation table may be given a TLB-like buffer and may be combined with the range buffer to limit the translation overhead.
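A minimal software model of the EAS translation engine is sketched below. It assumes page-aligned ranges with an exclusive end address, stores the offset between the local address and the original EAS address, evicts the smallest buffered range when a larger one arrives, and uses a flat dictionary keyed by 4KB page number in place of the multi-level translation table. The class and method names are hypothetical, and the periodic bitmap scan is not modelled.

```python
PAGE = 4096  # translation table granularity in bytes


class EasTranslator:
    def __init__(self, capacity: int):
        self.capacity = capacity  # number of range registers
        self.ranges = []          # (start, end, offset): EAS = local + offset
        self.table = {}           # local page number -> offset

    def _spill(self, rng) -> None:
        """Move a range into the translation table, one entry per 4KB page."""
        start, end, offset = rng
        for page in range(start // PAGE, (end + PAGE - 1) // PAGE):
            self.table[page] = offset

    def record_transfer(self, start: int, end: int, offset: int) -> None:
        new = (start, end, offset)
        if len(self.ranges) < self.capacity:
            self.ranges.append(new)
            return
        smallest = min(self.ranges, key=lambda r: r[1] - r[0])
        if end - start > smallest[1] - smallest[0]:
            # The new range is larger, so the old one is evicted to the table.
            self.ranges.remove(smallest)
            self._spill(smallest)
            self.ranges.append(new)
        else:
            self._spill(new)

    def to_eas(self, local_addr: int) -> int:
        for start, end, offset in self.ranges:  # range buffer lookup first
            if start <= local_addr < end:
                return local_addr + offset
        offset = self.table.get(local_addr // PAGE)
        if offset is None:
            raise KeyError("address was not received through a transfer")
        return local_addr + offset
```

Keeping the larger range in the buffer means that one register covers more blocks, which is the reason given above for comparing range sizes before evicting.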

The invention can also be used for other accelerators that do not use HBM memory.

Those skilled in the art will appreciate that the systems, apparatus, and their respective modules provided herein may be implemented entirely by logic programming of method steps such that the systems, apparatus, and their respective modules are implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the systems, apparatus, and their respective modules being implemented as pure computer readable program code. Therefore, the system, the apparatus, and the respective modules thereof provided by the present invention may be regarded as one hardware component, and the modules included therein for implementing various programs may also be regarded as structures within the hardware component; modules for implementing various functions may also be regarded as being either software programs for implementing the methods or structures within hardware components.

The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes or modifications may be made by those skilled in the art within the scope of the appended claims without affecting the spirit of the invention. The embodiments of the present application and features in the embodiments may be combined with each other arbitrarily without conflict.

Claims

1. A high performance heterogeneous secure memory, comprising: an HSMEM transport engine, a multi-mode protection engine, and an interface;

the HSMEM transmission engine establishes an HSMEM transmission channel which is positioned in a chip and controls DMA requests for safe memory, selects metadata related to the memory to be transmitted, and transmits the data without additional data encryption and decryption, encrypts a preset number of cache blocks for each data transmission, and the multi-mode protection engine is positioned in a memory controller after the data is transmitted to the CPU or the GPU, directly uses the transmission data of an original protection scheme;

through the interface, the developer uses the preset instruction to change the mode of the memory block.

2. The high performance heterogeneous secure memory of claim 1, wherein upon initialization, the HSMEM transmission engine generates keys with peer devices to construct a conventional secure channel, and when data is to be transmitted, the HSMEM transmission engine prepares metadata that needs to be transmitted with the data, and further wherein the secure channel is used to exchange some basic system configuration during initialization;

3. The high-performance heterogeneous secure memory according to claim 1, wherein, at the time of initialization, the HSMEM transmission engine comprises an initialization module, specifically:

4. The high-performance heterogeneous secure memory according to claim 1, wherein, at the time of initialization, the HSMEM transmission engine comprises a meta information selection module, specifically:

5. The high performance heterogeneous secure memory of claim 1, wherein the mode selection module comprises: on the CPU or GPU, for each memory block, the memory controller knows the current mode of that block, while the block with the secure metadata is transferred from the CPU to the GPU, it is still protected by the original protection scheme;

6. The high-performance heterogeneous secure memory of claim 1, wherein the data encryption module comprises: the CPU or GPU directly transmits data to each other without re-encrypting, an encryption or decryption memory block is arranged, the encryption needs an address and a counter, if two blocks are encrypted by using the same encryption key, the encryption is regarded as unsafe, therefore, each time a dirty memory block is ejected from a cache and is written into a memory, the counter of the memory block is increased, if the CPU and the GPU use a shared encryption key, a unique address is also needed to be maintained for each memory block on the CPU or the GPU, and therefore, the address used by the memory encryption is different from a local physical address on the CPU or the GPU;

the CPU uses half of the MAC to validate its own 64 byte cache line.

7. The high performance heterogeneous secure memory of claim 1, wherein the integrity tree module of the GPU mode comprises: the different memory controllers use their own integrity tree and verify their data without interacting with other memory partitions, for direct access to the transferred data, the GPU supports a CPU mode and a protection scheme for the GPU mode, the data of the CPU mode is encrypted using addresses and counters from the CPU, when such memory blocks are acquired, the memory controllers acquire the correct counter values and addresses to decrypt the block, for support of the CPU mode, one leaf node of the integrity tree involves a block from a different memory partition, therefore one module is set to verify the counter and communicate the correct counter values to all memory controllers;

8. The high performance heterogeneous secure memory of claim 1, wherein the CPU mode integrity tree module comprises: the method comprises the steps that an engine at a CPU end supports two different modes, besides a root node of a CPU mode counter, a memory controller on the CPU also leaves a space for storing a root node of a subtree in a GPU mode, the two different types of integrity trees are directly supported in a single memory controller of the CPU, an EAS address of the subtree is also stored together with a counting node in a cache, when the mode selection engine determines that one memory block is in the GPU mode, the multimode protection engine firstly verifies the counter in the GPU mode and the EAS address of a transmission block, then decrypts and verifies the block, and in order to support the GPU mode, the CPU maintains the integrity trees of several GPU modes;

9. The high performance heterogeneous secure memory of claim 1, wherein the HSMEM transmission channel has a data transmission process of:

10. The high performance heterogeneous secure memory of claim 1, wherein the memory access flow of the multi-mode data protection engine is:

step 13: the decrypted and verified data are put into a cache for access;