CN116680229A - Operation method of distributed shared memory protocol - Google Patents


Info

Publication number
CN116680229A
Authority
CN
China
Prior art keywords
page
thread
rdma
shared
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310656062.7A
Other languages
Chinese (zh)
Inventor
李子星
赵涛
聂少龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Linji Zhiyun Technology Suzhou Co ltd
Original Assignee
Linji Zhiyun Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Linji Zhiyun Technology Suzhou Co ltd filed Critical Linji Zhiyun Technology Suzhou Co ltd
Priority to CN202310656062.7A priority Critical patent/CN116680229A/en
Publication of CN116680229A publication Critical patent/CN116680229A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306 Intercommunication techniques
    • G06F15/17331 Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/54 Interprogram communication
    • G06F9/544 Buffers; Shared memory; Pipes
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application discloses an operation method of a distributed shared memory protocol, comprising the following steps: (a) triggering a page fault: a thread accesses a shared page and triggers a page fault; (b) accessing the remote page: the thread that triggered the page fault accesses the memory of the remote node using RDMA one-sided operations and reads the directory information to learn the state of the shared page; (c) reading the directory information: when the thread accesses the shared page, it first reads the page's directory information to learn its state; the directory records the state of the shared page; (d) updating the directory information: when the thread modifies the shared page, it updates the directory information using RDMA one-sided operations to inform other nodes that the state of the shared page has changed; (e) updating the local page table: the thread that triggered the page fault updates the local page table using RDMA one-sided operations and completes the read or write to shared memory. Conflicting accesses by multiple nodes to the same page are thereby effectively coordinated.

Description

Operation method of distributed shared memory protocol
Technical Field
The application belongs to the technical field of processor caching, and in particular relates to an operation method of a distributed shared memory protocol.
Background
Current multi-machine memory consistency techniques mainly comprise two approaches: cache coherence protocols and distributed transactions. A cache coherence protocol (Cache Coherence Protocol) applies to systems in which each processor has its own local cache, and ensures that the data in the caches of the individual processors remain coherent. Under such a protocol, if one processor modifies a memory region, the corresponding data in the caches of the other processors is updated or invalidated so that the cached data stays consistent. In a distributed system, distributed transactions (Distributed Transactions) can guarantee consistency between memory regions on different machines. Specifically, a distributed transaction combines multiple operations (e.g., reads and modifications) into an atomic sequence; if any operation in the sequence fails, the entire sequence is rolled back to preserve data consistency.
Although multi-machine memory consistency techniques are well established, several problems can still affect the performance and reliability of a system. The following are some of the common problems: (1) Performance overhead: ensuring consistency across machines generally requires network communication, locking mechanisms, memory synchronization, and similar operations, all of which impose overhead. (2) Scalability: as the system grows, current techniques struggle to preserve scalability while guaranteeing consistency. With a centralized consistency scheme, the central node becomes a bottleneck and limits how far the system can scale; with a decentralized approach, more complex algorithms and a higher technical threshold are required to achieve consistency. (3) Fault tolerance: nodes in a distributed system may fail, or the network may be partitioned, which can lead to data inconsistency. Fault tolerance must therefore be considered at design time so that the system continues to function correctly and maintain data consistency even when some nodes are in trouble. (4) Balancing consistency and availability: guaranteeing data consistency is essential, but sometimes some consistency must be sacrificed to keep the system available. How to balance consistency and availability in such cases remains an important open issue for existing methods.
Disclosure of Invention
In view of the above drawbacks, the present application provides an operation method of a distributed shared memory protocol.
To achieve the above objective, the present application provides a method for operating a distributed shared memory protocol, comprising the following steps:
(a) Triggering a page fault: when a shared page is not in local memory, a thread accessing the shared page triggers a page fault;
(b) Accessing the remote page: the thread that triggered the page fault accesses the memory of the remote node using RDMA one-sided operations and reads the directory information to learn the state of the shared page;
(c) Reading the directory information: when the thread accesses the shared page, it first reads the page's directory information to learn the state of the shared page; the directory records the state of the shared page;
(d) Updating the directory information: when the thread modifies the shared page, it updates the directory information using RDMA one-sided operations to inform other nodes that the state of the shared page has changed;
(e) Updating the local page table: the thread that triggered the page fault updates the local page table using RDMA one-sided operations and completes the read or write to shared memory.
Preferably, in step (a), a lock mechanism based on RDMA one-sided primitives is adopted to ensure that only one node globally can execute protocol operations on a given page; when handling the page fault, the thread that triggered it acquires the lock on the directory entry corresponding to the target page. The one-sided primitives support two atomic operations: compare-and-swap and fetch-and-add.
Further, step (a) comprises the following steps:
(a1) The lock variable is checked with an RDMA CAS instruction; when the value of the variable is 0, it is atomically changed to 1 to indicate that the lock is taken. After locking succeeds, the thread that triggered the page fault uses one-sided primitives to obtain the page's information stored in the directory, including the current owner of the page and the other nodes holding read-only copies of it;
(a2) The thread that triggered the page fault then retrieves the latest data from the page's owner and invalidates the copies on the other nodes to complete the write to shared memory.
Preferably, in step (b), high-speed data transmission between the local node and the remote node is achieved using a unidirectional stream replication technique.
Further, in step (b), when the thread that triggered the page fault encounters a read page fault, the other read-only copies need not be invalidated; it suffices for the page's owner to lose its write permission. Once the faulting thread has successfully fetched the required data and written it into the local page, it grants the corresponding access permission to the local page and uses an RDMA WRITE primitive to set the lock flag in the directory to 0, indicating that the page's lock has been released.
Further, in step (b), the directory is evenly distributed across the nodes by a hashing algorithm so as to reduce the number of RDMA requests received by each node's network card. Because RDMA's atomic one-sided primitives are not compatible with the CPU's atomic operations, protocol operations are completed through RDMA requests even when the thread that triggered the page fault is located on the same node as the directory data structure being accessed.
Preferably, in step (c), when the thread accesses the shared page, the shared page data is first obtained over the RDMA network and mapped into the local virtual address space; low-overhead local instructions are then executed to flush the local TLB, allowing the thread to continue accessing the shared page.
According to the operation method of the distributed shared memory protocol of the present application, through the cooperation of triggering a page fault, accessing the remote page, reading directory information, updating directory information, and updating the local page table, an efficient shared memory model can be realized and conflicting accesses by multiple nodes to the same page can be effectively coordinated. At the same time, the original communication protocol is extended with RDMA one-sided primitives, improving the execution efficiency of the protocol. The method can be applied in fields such as distributed systems and cloud computing, and has broad application prospects.
Drawings
FIG. 1 is a diagram of a distributed shared memory protocol architecture of the present application;
FIG. 2 is a flow chart of processing a write page fault in the method of operation of the distributed shared memory protocol of the present application.
Detailed Description
In order that the present application may be better understood, a more particular description is given below with reference to specific embodiments illustrated in the appended drawings. All other embodiments obtained by those skilled in the art through equivalent changes and modifications of the embodiments of the present application shall fall within the scope of the present application.
The protocol architecture shown in Fig. 1 extends the sequential consistency protocol used in existing distributed computing and combines RDMA one-sided and two-sided primitives to achieve low-latency protocol operation. The protocol decouples the distributed shared memory protocol logic from the application logic, optimizes protocol operations with network-card hardware primitives, and reduces software-level protocol overhead with RDMA one-sided operations. The architecture allows any node to access the memory of other nodes, including page tables and directories, over an InfiniBand network. To address the problem that one-sided RDMA operations cannot flush the TLB of a remote node, the application uses RDMA two-sided primitives to complete TLB flushing, and exploits Unsignaled RDMA to reduce communication overhead.
The application provides a directory-based shared memory model that coordinates conflicting accesses to a shared page among multiple nodes. In the protocol flow of the application, most steps need only be executed by the thread that triggered the page fault, without involving the CPU of the remote node in protocol processing. This effectively reduces the participation of remote nodes and thereby improves the execution efficiency of the protocol.
The operation method of the distributed shared memory protocol of the present application comprises the following steps:
(a) Triggering a page fault: when a shared page is not in local memory, a thread accessing the shared page triggers a page fault;
(b) Accessing the remote page: the thread that triggered the page fault accesses the memory of the remote node using RDMA one-sided operations and reads the directory information to learn the state of the shared page;
(c) Reading the directory information: when the thread accesses the shared page, it first reads the page's directory information to learn the state of the shared page; the directory records the state of the shared page, such as whether it is locked or being modified;
(d) Updating the directory information: when the thread modifies the shared page, it updates the directory information using RDMA one-sided operations to inform other nodes that the state of the shared page has changed; for example, if a thread is modifying a shared page, that page's state in the directory is set to "being modified";
(e) Updating the local page table: the thread that triggered the page fault updates the local page table using RDMA one-sided operations and completes the read or write to shared memory.
Through the above steps, the application realizes an efficient shared memory model and effectively coordinates conflicting accesses by multiple nodes to the same page. At the same time, the application extends the original communication protocol with RDMA one-sided primitives, improving the execution efficiency of the protocol. The method can be applied in fields such as distributed systems and cloud computing, and has broad application prospects.
In step (a), the handling of write page faults is optimized; the specific flow is shown in Fig. 2. The application designs a lock mechanism based on RDMA (Remote Direct Memory Access) one-sided primitives to ensure that only one node globally can execute protocol operations on a given page. When handling a write page fault, the thread that triggered it must acquire the lock on the directory entry corresponding to the target page.
To realize this lock mechanism, the application adopts RDMA's atomic CAS one-sided primitive, which supports the two atomic operations compare-and-swap and fetch-and-add and can perform atomic reads and writes on an 8-byte memory region. Similar to CPU-based lock implementations, the application builds its lock functionality on RDMA atomic primitives: (a1) the lock variable is checked with the CAS instruction, and if its value is 0 it is atomically changed to 1, indicating that the lock is taken; (a2) after locking succeeds, the thread that triggered the page fault uses one-sided primitives to obtain the page's information stored in the directory, including the page's current owner and the other nodes holding read-only copies. The faulting thread then fetches the latest data from the page's owner and invalidates the copies on the other nodes to complete the write to shared memory. This approach avoids CPU intervention and reduces inter-node communication, greatly improving the efficiency of protocol execution. Moreover, because RDMA one-sided primitives are used, the method enjoys low latency and high throughput and is well suited to shared memory applications in fields such as high-performance computing.
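As a concrete illustration, the following minimal C sketch shows how such a CAS-based lock acquisition might be written with libibverbs. It is a sketch under assumptions, not the patented implementation: the handles qp and cq, the registered buffer local_buf with key lkey, and the remote_lock_addr/rkey pair naming the directory lock word are hypothetical names introduced for the example, and error handling is omitted.

    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    /* Spin until the 8-byte lock word at remote_lock_addr is changed 0 -> 1. */
    static void rdma_lock_acquire(struct ibv_qp *qp, struct ibv_cq *cq,
                                  uint64_t *local_buf, uint32_t lkey,
                                  uint64_t remote_lock_addr, uint32_t rkey)
    {
        for (;;) {
            struct ibv_sge sge = {
                .addr   = (uintptr_t)local_buf,  /* CAS returns the old value here */
                .length = sizeof(uint64_t),
                .lkey   = lkey,
            };
            struct ibv_send_wr wr, *bad;
            memset(&wr, 0, sizeof(wr));
            wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
            wr.sg_list               = &sge;
            wr.num_sge               = 1;
            wr.send_flags            = IBV_SEND_SIGNALED;
            wr.wr.atomic.remote_addr = remote_lock_addr;
            wr.wr.atomic.compare_add = 0;        /* expected: unlocked */
            wr.wr.atomic.swap        = 1;        /* new value: locked  */
            wr.wr.atomic.rkey        = rkey;
            ibv_post_send(qp, &wr, &bad);

            struct ibv_wc wc;
            while (ibv_poll_cq(cq, 1, &wc) == 0)
                ;                                /* busy-wait for completion */
            if (wc.status == IBV_WC_SUCCESS && *local_buf == 0)
                return;                          /* CAS saw 0: lock acquired */
            /* another node holds the lock: retry the CAS */
        }
    }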
In step (b), high-speed data transmission between the local node and the remote node is achieved using a unidirectional stream replication technique. When the thread that triggered the page fault encounters a read page fault, it need not invalidate the other read-only copies; it only needs to make the page's owner lose its write permission. Once the faulting thread has successfully fetched the required data and written it into the local page, it grants the corresponding access permission to that page and uses an RDMA WRITE primitive to set the lock flag in the directory to 0, indicating that the page's lock has been released. More specifically, the application reads data directly from the remote node's memory through RDMA one-sided primitives, avoiding the high latency of traditional socket-based message passing. Unidirectional stream replication provides high-speed data transmission between local and remote nodes, so the faulting thread can quickly obtain the data it needs without waiting for a remote CPU to run protocol code and a network stack. Because a read page fault requires no invalidation of other read-only copies, only the revocation of the owner's write permission, the impact on other read-only copies is avoided and the time and resources spent handling the fault are reduced. Finally, after the faulting thread has obtained the data and written it to the local page, it grants the page the appropriate access permission and uses the RDMA WRITE primitive to clear the lock flag in the directory to 0, releasing the page's lock. This guarantees that other threads can access the page and avoids problems such as deadlock.
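The one-sided READ of the page and the WRITE that clears the lock flag could then be issued through a small helper like the sketch below (same headers and hypothetical handles as the previous sketch; PAGE-sized buffers and the dir_lock_addr name are likewise assumptions).

    /* Post one one-sided work request and wait for its completion. */
    static void post_one_sided(struct ibv_qp *qp, struct ibv_cq *cq,
                               enum ibv_wr_opcode op,   /* IBV_WR_RDMA_READ or
                                                           IBV_WR_RDMA_WRITE   */
                               void *laddr, uint32_t len, uint32_t lkey,
                               uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t)laddr, .length = len, .lkey = lkey };
        struct ibv_send_wr wr, *bad;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = op;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;
        ibv_post_send(qp, &wr, &bad);
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                  /* wait for the CQE */
    }

    /* Usage sketch: pull the owner's copy, then release the directory lock.
     *   post_one_sided(owner_qp, cq, IBV_WR_RDMA_READ, local_frame, 4096,
     *                  frame_lkey, owner_page_addr, owner_rkey);
     *   unlock_word = 0;
     *   post_one_sided(dir_qp, cq, IBV_WR_RDMA_WRITE, &unlock_word, 8,
     *                  word_lkey, dir_lock_addr, dir_rkey);                 */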
Invalidation is an important operation widely used in distributed systems to preserve the single-writer property. When a write page fault occurs, the protocol must revoke the read permission of the read-only copies; when a read page fault occurs, the protocol must revoke the write permission of the page's owner. Invalidation is typically implemented with two-sided primitives: the thread that triggered the page fault sends a SEND request containing the target address and the read/write information to the target node. Upon receiving the message, the protocol thread on the target node modifies the page table and updates the TLB, then replies with an ACK message to notify the originating thread that the operation is complete. By using two-sided primitives, the protocol synchronizes effectively at the network level and thereby preserves data consistency. Compared with traditional network communication, RDMA markedly reduces message latency and improves the execution efficiency of the protocol. Invalidation thus plays a vital role in the distributed system: it preserves the protocol's single-writer property, improves protocol efficiency, and reduces message latency, while RDMA significantly improves system performance and provides faster, more reliable data transmission.
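A possible wire format for this two-sided invalidation exchange is sketched below; the struct layout and field names are assumptions for illustration and do not appear in the patent text.

    /* Hypothetical invalidation message carried in the SEND request. */
    struct inval_msg {
        uint64_t page_addr;   /* virtual address of the target shared page       */
        uint8_t  drop_write;  /* 1: revoke owner's write permission (read fault) */
        uint8_t  drop_read;   /* 1: invalidate a read-only copy (write fault)    */
        uint16_t from_node;   /* node id to which the ACK is returned            */
        uint32_t pad;
    };
    /* The requester posts this with opcode IBV_WR_SEND; the target node's
     * protocol thread receives it from its receive queue, downgrades the page
     * (updating the page table and flushing the TLB), and answers with an ACK. */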
In this embodiment, in step (b), the directory is evenly distributed across the nodes by a hashing algorithm so as to reduce the number of RDMA requests received by each node's network card. Because RDMA's atomic one-sided primitives are not compatible with the CPU's atomic operations, protocol operations are completed through RDMA requests even when the thread that triggered the page fault is located on the same node as the directory data structure being accessed.
The present application eliminates the need for a directory-manager process to mediate directory access. When an application thread that has triggered a page fault executes the protocol, it accesses the directory's data structures directly with RDMA one-sided primitives. RDMA's atomic one-sided primitives support atomic reads and writes of a directory entry's lock variable (Lock), which is used to enforce mutual exclusion of protocol operations on the same page. Besides modifying the lock variable, the thread also needs the other information about the shared page, such as the page owner and the holders of read-only copies; this information is recorded in the Owner and Copyset fields, both of which can be read and written with RDMA one-sided primitives. The contents of Owner and Copyset are not directly tied to the lock, and these data structures do not fit in 8 bytes, so all of them cannot be accessed with a single RDMA request; the application therefore uses at least two independent RDMA requests. First, the lock is acquired by a CAS operation, retried until it succeeds. Once locking succeeds, the shared page's information can be read. Unlocking and updating Owner and Copyset can be completed with a WRITE request, because a directory entry is 64 bytes and fits within one memory block (Block), to which DMA reads and writes are atomic. In the present application, directory access is done entirely with RDMA one-sided primitives. If the whole directory data structure were stored on a single node, that node's network card would become a bottleneck in processing requests and would increase protocol latency; the directory is therefore spread evenly across the nodes by a simple hashing algorithm to reduce the number of RDMA requests received by each node's network card. Note that because RDMA's atomic one-sided primitives are not compatible with the CPU's atomic operations, even when the thread that triggered the page fault is on the same node as the directory data structure to be accessed, the protocol operation must still be completed through RDMA requests and cannot go directly through the CPU.
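The 64-byte directory entry might be laid out as in the following sketch; the exact field layout and the modulo hash are assumptions consistent with the text (an 8-byte Lock word targeted by RDMA CAS, Owner and Copyset fields, and a simple hash spreading entries across nodes).

    struct dir_entry {                     /* 64 bytes: one atomic DMA block   */
        uint64_t lock;                     /* 8-byte word used by RDMA CAS     */
        uint32_t owner;                    /* node currently owning the page   */
        uint32_t pad;
        uint64_t copyset;                  /* bitmap of read-only copy holders */
        uint8_t  reserved[40];             /* pad the entry to 64 bytes        */
    };

    /* Home node of a directory entry: spread entries evenly by page number. */
    static inline int dir_home_node(uint64_t page_no, int num_nodes)
    {
        return (int)(page_no % num_nodes); /* the "simple hashing algorithm"  */
    }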
In this embodiment, in step (c), when the thread accesses the shared page, the shared page data is first obtained over the RDMA network and mapped into the local virtual address space; low-overhead local instructions are then executed to flush the local TLB, allowing the thread to continue accessing the shared page.
The application also provides a brand-new communication interface that differs from traditional network communication. This new communication approach uses RDMA technology and exposes lower-level hardware abstractions to the application in order to support more aggressive communication optimizations. The design weighs the constraints of existing systems against the capabilities of RDMA hardware to select the best interface and the most reasonable usage. Traditional network communication relies on sockets, which require data to be forwarded back and forth between different pieces of hardware and are therefore inefficient. With RDMA, by contrast, data can be written directly from the host's cache into remote memory, which is faster and more energy efficient. The application thus offers a new line of thought for software design and optimization: whereas traditional network communication is based on a client-server model, the application adopts a peer-to-peer model, which better exploits RDMA's hardware advantages and achieves faster, more efficient data transmission.
The application designs a novel distributed shared memory system that uses RDMA one-sided primitives to access the memory of other nodes directly, improving data-transfer efficiency. Using RDMA one-sided primitives has a limitation, however: the application's virtual-to-physical address mapping cannot be changed. The application therefore lays out a contiguous virtual address region and requires the threads on every node to use the same virtual address partitioning. The advantages of this design are that every thread can quickly locate the distributed shared memory region and that it simplifies management of RDMA memory regions. While such a design improves system performance, it also requires the virtual addresses to correspond to fixed physical memory that the operating system may not swap out, because for general-purpose applications a distributed shared memory system must assume that any thread may access any address within the shared memory at any time.
If a virtual address were unmapped or its physical memory swapped out, accessing the memory would fail. This assumption wastes physical memory, because the system must reserve on every machine a copy space equal in size to the shared memory. To address this, the application adopts the scheme above: a contiguous virtual address region with identical virtual address partitioning on every node. The scheme improves system performance while requiring the virtual addresses to be backed by fixed physical memory that the operating system cannot swap out.
In a distributed shared memory system, RDMA one-sided primitives can improve network transmission efficiency, but they are not the best choice for invalidation operations. To keep each core's view of a page's access permissions consistent in a multi-core system, the TLB must be flushed across processors through the inter-processor interrupt (IPI) mechanism. In distributed shared memory, after a protocol thread receives an invalidation request it must use IPIs to tell all cores to flush their TLBs so that the latest access permissions take effect; this cross-processor flushing is critical to keeping every core's page permissions consistent. To handle bursts of access requests, the protocol thread needs to drain the network buffers during idle periods of the operating system so that it can respond immediately when the next request arrives. Furthermore, because IPIs and TLB flushes must be performed by a CPU, even though an RDMA one-sided primitive can bypass the remote CPU and modify its page table directly, the faulting thread must still send an additional RDMA SEND message to tell the remote CPU to perform the IPI and the TLB flush. Since this one-sided approach needs at least two RDMA requests, it can become a performance bottleneck when bursts of access requests must be processed.
By contrast, an RDMA two-sided primitive can convey the permission-modification information to the remote node in a single message; the remote CPU then performs the page-table modification and the TLB flush, reducing the number of network messages, while the CPU cost of modifying a page table is negligible. The TLB state in the multi-core system is maintained automatically by hardware, ensuring the TLB stays consistent with the page table. Moreover, because the two-sided primitive delivers the permission-modification information directly to the remote node, it reduces network latency and improves data-transfer efficiency.
RDMA requests are sent through Queue Pairs (QPs); a QP is conceptually similar to a TCP connection in a traditional network. In typical RDMA communication, a client sends RDMA requests through a QP using one-sided or two-sided operations, while the server fetches and processes incoming requests through the QP's Completion Queue (CQ). In the present application, any application thread may trigger a page fault, execute the protocol code, and send protocol-related requests to all other nodes. If multiple application threads on one node shared a single QP for requests to a remote node, the QP would require concurrency control to keep communication correct, and lock-based concurrency control typically reduces the scalability of the protocol. The application therefore assigns each application thread its own QPs for communicating with remote nodes: with N machine nodes in the cluster and M application threads per node, each application thread establishes N-1 QP connections to the other nodes, for M*(N-1) QPs globally. This design ensures that every application thread can communicate with remote nodes over dedicated QPs without competing with other threads for QP resources, avoiding race conditions and error-detection problems in QP operation as well as the negative effects of lock-based concurrency control. Combined with the hardware-maintained TLB state described above, this improves system performance.
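The per-thread QP layout can be pictured as a simple two-dimensional table, as in the sketch below; N_NODES, M_THREADS, and the helper create_qp_to() (a hypothetical wrapper around ibv_create_qp() and the usual INIT/RTR/RTS transitions) are assumptions introduced for the example.

    #define N_NODES   4                      /* assumption: cluster size     */
    #define M_THREADS 8                      /* assumption: threads per node */

    extern struct ibv_qp *create_qp_to(int node);   /* hypothetical helper */

    /* qps[t][n]: the dedicated QP that thread t uses to reach node n. */
    static struct ibv_qp *qps[M_THREADS][N_NODES];

    static void setup_thread_qps(int self_node, int t)
    {
        for (int n = 0; n < N_NODES; n++)
            if (n != self_node)
                qps[t][n] = create_qp_to(n); /* no sharing, so no QP locks */
    }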
In the application, protocol execution is initiated by the application thread that triggered the page fault, while the protocol thread is mainly responsible for receiving invalidation messages and revoking the access permissions of the corresponding pages. This design concentrates protocol processing on each node and makes it easy to coordinate global prefetch policies and batched TLB flushes. However, because accessing the directory, fetching pages, and updating local permissions are all done independently by the faulting thread, the protocol thread cannot directly control TLB flushing. If threads were forced to synchronize and negotiate batched flushes before flushing their local TLBs, the synchronization itself would add overhead, and the lack of locality between threads would further increase TLB-flush latency; the usual TLB optimization strategy therefore does not apply here. The application instead proposes a new method that reduces the cost of protocol consistency by delaying local TLB flushes. The method suits systems that share memory among multiple processor cores and is especially applicable to RDMA-based high-performance computing systems. In such a system, when a thread needs to access a page, it uses RDMA one-sided primitives to fetch the page data locally, but it does not immediately tell the other cores to update that page's access permissions in their TLBs. Instead, the thread executes a few inexpensive local instructions to flush its own TLB, allowing itself to continue accessing the page. This mechanism applies only to local permission promotion, i.e., the thread upgrading its own read or write access to the page. Other cores that later access the page may still take a page fault, but they need only execute the same local instructions to obtain the same access permissions, without running the entire protocol again.
Like a batched-flush mechanism, the method of the present application reduces, to some extent, the overhead caused by protocol consistency. Unlike batched flushing, however, it applies only when local pages must be accessed, whereas batched flushing targets access to many pages. It is realized by adding a delayed-TLB-flush mechanism to each core. Specifically, when a thread needs to access a page, it first fetches the page data over the RDMA network and maps the page into the local virtual address space; it then executes a few inexpensive local instructions to flush the local TLB. These may be simple reads and writes or dedicated TLB-flush instructions; because their overhead is small, they complete quickly and the thread can continue accessing the page. Note that the method applies only to local permission promotion: if other cores need the page, they must still perform the corresponding protocol steps to obtain access. Nevertheless, by cutting the cost of protocol consistency, the method markedly improves the performance and scalability of the system. The underlying idea is to reduce the consistency overhead of a distributed shared memory system: multiple cores access shared data, and the protocol must guarantee correctness when several cores touch the same piece of data; since running the protocol is expensive, the application proposes this new way to reduce its cost.
In the present system, raising access permissions generally means reading an invalid page or writing a read-only or invalid page. The main scenario for this access pattern is that after one thread updates data, another thread uses the data in a computation and then updates its local data for yet another thread to read. Because the data blocks are usually independent, no two threads modify the same piece of data at the same time, which means a permission upgrade need not be synchronized to other cores immediately. The application therefore omits the IPI-driven TLB flush during a page permission upgrade and shifts that cost onto the read page faults later triggered by other cores. Specifically, when a thread needs to access a page it fetches the page data locally with RDMA one-sided primitives but does not immediately tell other cores to update the page's permissions in their TLBs; instead it executes a few inexpensive local instructions to flush its own TLB and continues accessing the page. The mechanism applies only to local permission promotion; other cores that access the page may still take a page fault, but they need only execute the same instructions to obtain the same permissions, without rerunning the entire protocol. The mechanism resembles batched flushing but differs in scope: batched flushing notifies other cores via IPI when the page table changes so that they see the latest page table when accessing shared data, whereas the present mechanism defers the TLB-flush cost to the page faults other cores take when they access the shared data, avoiding useless TLB flushes, cutting protocol overhead, and improving system performance while preserving data consistency. Note that the propagation of permission upgrades can be completed asynchronously: other cores see the latest page-table modifications simply by flushing their local TLBs when they take a page fault, without executing the entire protocol.
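A minimal sketch of this delayed permission upgrade on Linux is given below, assuming a hypothetical rdma_read_page() wrapper around an IBV_WR_RDMA_READ of one page; the essential point it illustrates is that only the faulting core's mapping is upgraded and no cross-node notification is sent, so other copies discover the change through their own later faults.

    #include <sys/mman.h>
    #include <stdint.h>
    #define DSM_PAGE_SIZE 4096                 /* assumption */

    extern void rdma_read_page(void *dst, uint64_t owner_addr, uint32_t rkey);

    /* Read-fault path: fetch the page one-sidedly, then upgrade only the
     * local mapping; no invalidation message is sent at this point. */
    static void handle_read_fault(void *page, uint64_t owner_addr, uint32_t rkey)
    {
        mprotect(page, DSM_PAGE_SIZE, PROT_READ | PROT_WRITE); /* staging access  */
        rdma_read_page(page, owner_addr, rkey);                /* pull latest data */
        mprotect(page, DSM_PAGE_SIZE, PROT_READ);              /* local read-only:
                                                                  the kernel updates
                                                                  this mapping and
                                                                  flushes the TLB */
    }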
This optimization aims to overlap requests and thereby raise the processing speed of the distributed shared memory system. The overlap includes both intra-node and inter-node request overlap. When handling a write page fault there are dependencies among protocol operations: for example, modifying a node's page permissions must be protected by the directory-entry lock, and reading data back from the page owner can only happen after an ACK is received. Because of these dependencies, a conventional serial implementation introduces extra latency and limits performance. The application therefore uses intra-node parallelism to issue the CAS and the READ request simultaneously, deciding from the CAS result whether to keep the READ contents: if the CAS fails, the lock must be reacquired and the READ result is discarded. This saves one round trip and speeds up processing. The application further optimizes performance by overlapping requests between nodes: when one node needs another node's data, it sends the RDMA request first and overlaps it with local protocol operations; when the target node receives the request it starts processing immediately and returns the data to the requester while also performing its own local protocol operations, such as page-permission modifications. This overlap effectively reduces request latency and improves the throughput of the system. In summary, by issuing CAS and READ requests simultaneously and overlapping inter-node request processing, the proposed optimization reduces waiting time and system overhead and thereby raises throughput.
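The intra-node overlap of the lock CAS and the directory READ can be expressed by chaining two work requests into a single doorbell, as in the sketch below (same headers and hypothetical handles as the earlier sketches; the 56 bytes read are assumed to be the 64-byte entry body after the lock word). If the CAS result shows the lock was busy, the READ data is simply discarded and the operation retried.

    /* Returns 0 if the lock was won and entry_buf holds a valid entry body;
     * returns -1 if the lock was busy (discard entry_buf and retry). */
    static int lock_and_read_overlapped(struct ibv_qp *qp, struct ibv_cq *cq,
                                        uint64_t *cas_old, uint32_t cas_lkey,
                                        void *entry_buf, uint32_t entry_lkey,
                                        uint64_t lock_raddr, uint64_t entry_raddr,
                                        uint32_t rkey)
    {
        struct ibv_sge cas_sge = { .addr = (uintptr_t)cas_old,   .length = 8,
                                   .lkey = cas_lkey };
        struct ibv_sge rd_sge  = { .addr = (uintptr_t)entry_buf, .length = 56,
                                   .lkey = entry_lkey };
        struct ibv_send_wr cas_wr, rd_wr, *bad;
        memset(&cas_wr, 0, sizeof(cas_wr));
        memset(&rd_wr,  0, sizeof(rd_wr));

        cas_wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
        cas_wr.sg_list               = &cas_sge;
        cas_wr.num_sge               = 1;
        cas_wr.wr.atomic.remote_addr = lock_raddr;
        cas_wr.wr.atomic.compare_add = 0;
        cas_wr.wr.atomic.swap        = 1;
        cas_wr.wr.atomic.rkey        = rkey;
        cas_wr.next                  = &rd_wr;         /* chained: one doorbell */

        rd_wr.opcode              = IBV_WR_RDMA_READ;
        rd_wr.sg_list             = &rd_sge;
        rd_wr.num_sge             = 1;
        rd_wr.send_flags          = IBV_SEND_SIGNALED; /* only the last WR signals */
        rd_wr.wr.rdma.remote_addr = entry_raddr;
        rd_wr.wr.rdma.rkey        = rkey;

        ibv_post_send(qp, &cas_wr, &bad);
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                          /* wait for the READ's CQE */
        return (*cas_old == 0) ? 0 : -1;
    }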
The application provides a technique based on RDMA Unsignaled requests that overlaps the invalidation messages sent to each node to improve protocol performance. In a distributed shared memory system, sending invalidation messages is a necessary operation: it synchronizes cached data with memory to preserve consistency. In existing protocols the sending of invalidation messages is serialized, i.e., the next message may be sent only after the previous one completes, which increases protocol latency and hurts performance. The application therefore parallelizes the sends to hide protocol latency, using Unsignaled requests to overlap the invalidation messages: when n invalidation requests are sent in succession, the first n-1 carry the Unsignaled flag, meaning they generate no CQE (Completion Queue Entry) on completion, while the last is Signaled and does generate a CQE. In this way earlier requests overlap with later ones, reducing the delay between requests. The application additionally parallelizes the sending of the ACK messages that answer the invalidations: under the guarantees of RDMA RC an ACK message will not be lost, so the ACKs can be sent in parallel without affecting protocol correctness, letting invalidation transmission and ACK transmission proceed simultaneously and raising the protocol's parallelism and performance. Experimental results show that with smaller request payloads the number of preceding RDMA requests has no effect on the completion time of the last request, so the parallelization scheme effectively improves system performance; the same technique can also be applied to other message types for more efficient protocol operation.
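The Unsignaled batching might look like the sketch below. It assumes the QPs were created with sq_sig_all = 0 so that signaling is chosen per work request; note that in the multi-QP case the single CQE mainly bounds local send-queue usage, while the protocol-level ACKs described above remain the real synchronization point.

    /* Send n invalidation messages; only the last SEND generates a CQE. */
    static void send_invalidations(struct ibv_qp **dst_qps, struct ibv_cq *cq,
                                   struct ibv_sge *msgs, int n)
    {
        for (int i = 0; i < n; i++) {
            struct ibv_send_wr wr, *bad;
            memset(&wr, 0, sizeof(wr));
            wr.opcode     = IBV_WR_SEND;
            wr.sg_list    = &msgs[i];
            wr.num_sge    = 1;
            wr.send_flags = (i == n - 1) ? IBV_SEND_SIGNALED : 0; /* Unsignaled */
            ibv_post_send(dst_qps[i], &wr, &bad);
        }
        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                  /* one CQE closes the batch */
    }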
Parallel programs typically employ inter-thread synchronization mechanisms to coordinate access to shared variables. Common mechanisms include the locks, barriers, and semaphores provided by the pthread library. For example, the mutex synchronization primitive guarantees mutual exclusion of shared-variable access between threads: only one thread at a time may enter a given critical section to operate on the variable. A mutex is typically implemented with a flag variable whose value is checked to see whether any thread is inside the critical section; if the lock is free, setting the flag indicates that the lock was acquired. To keep this correct, and in particular to prevent "check-then-use" inconsistencies on the lock variable, existing lock implementations rely on processor-provided atomic read-modify-write (RMW) instructions to check and set the flag as one uninterruptible sequence, so that no other write can intervene. This is simple and efficient on a single machine, but it causes serious performance problems on a distributed shared memory system, as the simplified spin-lock sketch below illustrates. First, atomic instructions on the lock variable, normally carried out by the hardware's cache coherence, are now carried out by the distributed shared memory protocol, so their latency is thousands of times that of a single machine. Second, a lock variable is usually 8 bytes, while the page-granularity access control of distributed shared memory causes severe false sharing and a flood of unnecessary page faults, further reducing the efficiency of the synchronization mechanism; aligning data structures can mitigate false sharing to some extent, but at the cost of wasting more memory. Finally, under heavy contention the overhead of atomic instructions in distributed shared memory grows further and produces a severe "ping-pong effect". In the sketch below, line 3 checks whether the lock is held by another thread; if the page holding the lock has been modified by another thread, this read may trigger a page fault. Line 6 then uses the Compare-And-Swap (CAS) atomic instruction to confirm that the lock is still free and mark it as locked. The combination of these two operations makes the acquiring thread ping-pong: after the protocol fetches the latest page data, the thread must return from exception handling to normal code, a process that involves two context switches and a series of scheduling steps, during which other nodes have the opportunity to take a page fault and invalidate the page again with invalidation messages; the thread then faults again as soon as it returns from exception handling. Both the Read-versus-CAS conflict and CAS-versus-CAS conflicts cause this ping-pong of page permissions, making it hard for program logic to advance.
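The spin-lock pseudocode referred to above is not reproduced in the source text; the following is a plausible reconstruction consistent with the discussion, where CAS() stands for an abstract atomic compare-and-swap primitive and the numbers in the comments match the line references (line 3 is the plain read, line 6 is the CAS).

    /* 1 */  void spin_lock(volatile long *lock)
    /* 2 */  {
    /* 3 */      while (*lock != 0)      /* read: faults if the lock's page  */
    /* 4 */          ;                   /* was invalidated by another node  */
    /* 5 */
    /* 6 */      if (!CAS(lock, 0, 1))   /* atomic compare-and-swap: 0 -> 1  */
    /* 7 */          spin_lock(lock);    /* lost the race: start over        */
    /* 8 */  }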
The application therefore provides a synchronization mechanism designed on RDMA and optimized for distributed shared memory systems, reducing the protocol's impact on synchronization through two design points. First, RDMA is used to implement a dedicated set of thread synchronization interfaces that do not depend on the distributed shared memory mechanism: because the latency of a single RDMA request is far smaller than that of a protocol execution, RDMA-based thread synchronization can be completed entirely in user mode, avoiding context-switch and scheduling costs, eliminating the page-permission ping-pong effect, and improving synchronization efficiency. Second, the application designs a hierarchical synchronization scheme that turns most synchronization operations into intra-node synchronization and reduces cross-node communication.
Specifically, the present application re-implements the pthread synchronization interface. When an application calls a function such as pthread_mutex_init to initialize a lock variable, the application allocates a corresponding magi_lock structure for it. Note that the magi_lock structure is not allocated in the distributed shared memory; the program cannot read or modify its internal state directly and must use it through the re-implemented standard interface, such as pthread_mutex_lock. All magi_lock data structures are evenly distributed across all nodes by a hashing algorithm, and each magi_lock has a corresponding local data structure, magi_lock_local, on every node for synchronizing threads within the same node. The application illustrates an experimental case in which 16 cores in total on 4 nodes synchronize through magi_lock_local (per node) and the global magi_lock. When a thread tries to acquire a lock, it first checks the magi_lock_local of its own node. If no thread on that node holds the lock yet, the thread is promoted to Leader and uses the RDMA one-sided atomic instruction (RDMA CAS) with a conventional lock algorithm to acquire the lock from the global magi_lock. If, during this process, other threads on the same node try to acquire the same lock, they observe that the node has already chosen a Leader to acquire the global lock on its behalf and wait on magi_lock_local; after the Leader obtains the lock, it hands it over preferentially to local threads, avoiding cross-node RDMA communication and improving locality during local thread synchronization. In the illustration, nodes 0 and 1 have chosen P1 and P6 as their Leaders, responsible for modifying, through RDMA atomic operations, the lock's global state located on node 2. Node 2 also has a Leader and has obtained the global magi_lock; P8, P9, and P11 within node 2 then take turns holding the local lock magi_lock_local. When P11 unlocks and finds that no other core inside node 2 is trying to acquire the lock, it modifies the global state with an RDMA operation to release it; at that point the Leaders of nodes 0 and 1 have the opportunity to acquire the global lock and pass it on to their local cores. With this lock mechanism, the lock crosses nodes only 3 times in total while all other transfers are local, significantly reducing lock-acquisition time.
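A condensed sketch of the two-level magi_lock scheme follows. The internal fields and the helpers rdma_cas() and rdma_write_u64() (one-sided atomics and writes on the lock's home node) are assumptions; waiter accounting is simplified, and a late-arriving waiter simply becomes the next Leader via the CAS retry path.

    #include <pthread.h>
    #include <stdint.h>

    extern uint64_t rdma_cas(uint64_t raddr, uint32_t rkey,
                             uint64_t expect, uint64_t swap);   /* hypothetical */
    extern void rdma_write_u64(uint64_t raddr, uint32_t rkey,
                               uint64_t val);                   /* hypothetical */

    struct magi_lock_local {
        pthread_mutex_t node_mutex;   /* serializes threads inside the node   */
        int             have_global;  /* node currently holds the global lock */
        _Atomic int     waiters;      /* local threads queued for the lock    */
    };

    void magi_lock_acquire(struct magi_lock_local *l,
                           uint64_t global_raddr, uint32_t rkey)
    {
        l->waiters++;
        pthread_mutex_lock(&l->node_mutex);        /* local hand-off first    */
        l->waiters--;
        if (!l->have_global) {                     /* become the node's Leader */
            while (rdma_cas(global_raddr, rkey, 0, 1) != 0)
                ;                                  /* spin on the global word */
            l->have_global = 1;
        }
        /* critical section runs while node_mutex (and the global) are held */
    }

    void magi_lock_release(struct magi_lock_local *l,
                           uint64_t global_raddr, uint32_t rkey)
    {
        if (l->waiters == 0) {                     /* nobody local wants it   */
            rdma_write_u64(global_raddr, rkey, 0); /* release the global lock */
            l->have_global = 0;
        }                                          /* else keep it: local pass */
        pthread_mutex_unlock(&l->node_mutex);
    }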
The foregoing is merely a preferred embodiment of the present application and is not intended to limit its scope; other and further embodiments may be devised without departing from the basic scope of the application, which is determined by the claims that follow.

Claims (7)

1. An operation method of a distributed shared memory protocol, characterized by comprising the following steps:
(a) Triggering a page fault: when a shared page is not in local memory, a thread accessing the shared page triggers a page fault;
(b) Accessing the remote page: the thread that triggered the page fault accesses the memory of the remote node using RDMA one-sided operations and reads the directory information to learn the state of the shared page;
(c) Reading the directory information: when the thread accesses the shared page, it first reads the page's directory information to learn the state of the shared page; the directory records the state of the shared page;
(d) Updating the directory information: when the thread modifies the shared page, it updates the directory information using RDMA one-sided operations to inform other nodes that the state of the shared page has changed;
(e) Updating the local page table: the thread that triggered the page fault updates the local page table using RDMA one-sided operations and completes the read or write to shared memory.
2. The method of claim 1, wherein in step (a), a lock mechanism based on RDMA one-sided primitives is adopted to ensure that only one node globally can execute protocol operations on a given page; when handling the page fault, the thread that triggered it acquires the lock on the directory entry corresponding to the target page; and the one-sided primitives support two atomic operations: compare-and-swap and fetch-and-add.
3. The method of claim 2, wherein step (a) comprises the following steps:
(a1) checking a lock variable with an RDMA CAS instruction and, when the value of the variable is 0, atomically changing it to 1 to indicate locking; after locking succeeds, the thread that triggered the page fault uses one-sided primitives to obtain the page's information stored in the directory, including the page's current owner and the other nodes holding read-only copies of the page;
(a2) the thread that triggered the page fault then retrieves the latest data from the page's owner and invalidates the copies on the other nodes to complete the write to shared memory.
4. The method of claim 1, wherein in step (b), high-speed data transmission between the local node and the remote node is achieved using a unidirectional stream replication technique.
5. The method of claim 4, wherein in step (b), when the thread that triggered the page fault encounters a read page fault, the other read-only copies need not be invalidated, it sufficing for the page's owner to lose its write permission; and when the thread that triggered the page fault has successfully fetched the required data and written it into the local page, it grants the corresponding access permission to the local page and uses an RDMA WRITE primitive to set the lock flag in the directory to 0, indicating that the page's lock is released.
6. The method of claim 5, wherein in step (b), the directory is evenly distributed across the nodes by a hashing algorithm so as to reduce the number of RDMA requests received by each node's network card; and because RDMA's atomic one-sided primitives are not compatible with the CPU's atomic operations, protocol operations are completed through RDMA requests even when the thread that triggered the page fault is located on the same node as the directory data structure being accessed.
7. The method of claim 1, wherein in step (c), when the thread accesses the shared page, the shared page data is first obtained over the RDMA network and mapped into the local virtual address space; low-overhead local instructions are then executed to flush the local TLB, allowing the thread to continue accessing the shared page.
CN202310656062.7A 2023-06-05 2023-06-05 Operation method of distributed shared memory protocol Pending CN116680229A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310656062.7A CN116680229A (en) 2023-06-05 2023-06-05 Operation method of distributed shared memory protocol

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310656062.7A CN116680229A (en) 2023-06-05 2023-06-05 Operation method of distributed shared memory protocol

Publications (1)

Publication Number Publication Date
CN116680229A 2023-09-01

Family

ID=87781799

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310656062.7A Pending CN116680229A (en) 2023-06-05 2023-06-05 Operation method of distributed shared memory protocol

Country Status (1)

Country Link
CN (1) CN116680229A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination