WO2008057833A2 - System and method for remote direct memory access without page locking by the operating system - Google Patents


Info

Publication number
WO2008057833A2
Authority
WO
WIPO (PCT)
Prior art keywords
dma
node
cache
command
dma engine
Prior art date
Application number
PCT/US2007/082869
Other languages
English (en)
Other versions
WO2008057833A3 (fr)
Inventor
David Gingold
Philip J. Mucci
Lawrence C. Stewart
Judson S. Leonard
Matthew H. Reilly
Original Assignee
Sicortex, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US11/594,447 (published as US20080109604A1)
Priority claimed from US11/594,427 (published as US20080109569A1)
Priority claimed from US11/594,443 (published as US20080109573A1)
Priority claimed from US11/594,446 (published as US7533197B2)
Application filed by Sicortex, Inc. filed Critical Sicortex, Inc.
Publication of WO2008057833A2
Publication of WO2008057833A3


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 - Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 - Handling requests for interconnection or transfer
    • G06F13/20 - Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 - Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0815 - Cache consistency protocols
    • G06F12/0831 - Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means
    • G06F12/0835 - Cache consistency protocols using a bus scheme, e.g. with bus monitoring or watching means for main memory peripheral accesses (e.g. I/O or DMA)
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10 - Address translation
    • G06F12/1081 - Address translation for peripheral access to main memory, e.g. direct memory access [DMA]

Definitions

  • the invention relates to remote direct memory access (RDMA) systems and, more specifically, to RDMA systems that support synchronization of distributed processes in a large scale multiprocessor system.
  • Distributed processing involves multiple tasks on one or more computers interacting in some coordinated way to act as an "application".
  • the distributed application may subdivide a problem into pieces or tasks, and it may dedicate specific computers to execute the specific pieces or tasks.
  • the tasks will need to synchronize their activities on occasion so that they may operate as a coordinated whole.
  • such synchronization is often accomplished using a message passing standard, e.g., the Message Passing Interface (MPI) standard.
  • RDMA techniques have been proposed in which one computer may directly transfer data from its memory into the memory system of another computer. These RDMA techniques off-load much of the processing from the operating system software to the RDMA network interface hardware (NICs). See Infiniband Architecture Specification, Vol. 1, copyright Oct 24, 2000 by the Infiniband Trade Association. Processes running on a computer node may post commands to a command queue in memory, and the RDMA engine will retrieve and execute commands from the queue.
  • the invention provides systems and methods for remote direct memory access without page locking by the operating system.
  • the invention also relates to a RDMA system for sending DMA commands from a source node to a target node. These commands are locally executed at the target node.
  • the invention also relates to systems and methods for remote direct memory access to processor caches for remote direct memory access (RDMA) reads and writes.
  • the invention also relates to a remote DMA system, and methods for supporting synchronization of distributed processes in a multiprocessor system using collective operations.
  • a multi-node computer system has a plurality of interconnected processing nodes.
  • DMA engines are used in a way to avoid page locking.
  • a DMA operation is performed between a first virtual address space and a second virtual address space via a DMA engine.
  • it is determined whether the DMA operation refers to a virtual address that is present in physical memory. If the DMA operation refers to a virtual address that is not in physical memory, the DMA operation is caused to fail and the node maps the referenced virtual address to a physical address.
  • the DMA sender or receiver is caused to retry the DMA operation.
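  • As a minimal sketch of this fail-and-retry scheme, the C fragment below models a receive-side transfer under those assumptions; bd_entry_t, bd_lookup, os_map_page, and dma_hw_transfer are hypothetical stand-ins, not interfaces defined by the patent.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical buffer-descriptor entry; a zero length marks an
 * invalid (non-resident) mapping, one encoding mentioned later in
 * the text for some embodiments. */
typedef struct {
    uint64_t phys_addr;
    size_t   length;                /* 0 => virtual page not mapped */
} bd_entry_t;

/* Illustrative stand-ins for OS and DMA-engine services. */
extern bd_entry_t *bd_lookup(unsigned bd_index);
extern void os_map_page(uint64_t virt_addr, bd_entry_t *bd);
extern bool dma_hw_transfer(uint64_t src_pa, uint64_t dst_pa, size_t len);

/* Receive-side transfer without page locking: if the destination page
 * is not resident, the DMA operation fails, the node maps the
 * referenced virtual address, and the operation is retried. */
bool dma_receive_no_pinning(unsigned dst_bd, uint64_t dst_va,
                            uint64_t src_pa, size_t len)
{
    for (;;) {
        bd_entry_t *dst = bd_lookup(dst_bd);
        if (dst->length != 0)       /* BD valid: page is resident */
            return dma_hw_transfer(src_pa, dst->phys_addr, len);
        os_map_page(dst_va, dst);   /* fault path; then retry */
    }
}
```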
  • Another aspect of the invention is a multi-node computer system having a plurality of interconnected processing nodes.
  • the computer system issues a direct memory access (DMA) command from a first node to be executed by a DMA engine at a second node.
  • the DMA engine is capable of performing DMA data transfers and of executing pre-defined DMA commands. Commands are transferred and executed by forming, at a first node, a packet having a payload containing the DMA command.
  • the packets are sent to the second node via the interconnection network, where the second node receives the packet and validates that the packet complies with a predefined trust relationship. If the packet complies with the predefined trust relationship, the command is removed from the packet payload, and enqueued on the command queue of the DMA engine at the second node.
  • the command is then processed by the DMA engine at the second node.
  • the packet can include a process identifier, and validation can be done by comparing the process identifier in the packet to a set of process identifiers accessible by the DMA engine at the second node.
  • the process identifier can be stored in other parts of the packet besides the payload, such as the packet header or trailer
  • Another aspect of the invention is a computer node within a multi-node computer system having a plurality of interconnected processing nodes.
  • the computer node has at least one processor associated with at least one processor cache for holding cache entries for the at least one processor.
  • the computer node also has a main memory defined by a physical address range, a processor cache control structure for dynamically associating cache entries with physical addresses in the main memory, and a remote direct memory access (DMA) system for transferring data between the computer node and other computer nodes.
  • a cache interface for the remote DMA engine includes logic to consult the processor cache control structure, to determine whether the processor cache has a cache entry associated with a physical address of a DMA transfer, and if so, reading from that cache entry or writing to that cache entry to service the DMA transfer.
  • absent such a cache entry, the DMA engine reads from or writes to main memory.
  • the processor caches are arranged in a hierarchy of caches, and the cache interface interacts with a subset of the hierarchy of caches.
  • if cache entries are in the shared state, they are invalidated before being written to.
  • Another aspect of the invention is a multi-node computer system having a plurality of interconnected processing nodes.
  • This system uses DMA engines to perform collective operations synchronizing processes executing on a set of nodes.
  • the process involves identifying a DMA engine on one of the nodes of the set of nodes to serve as a master node.
  • Each process in the set of processes causes the DMA engine on the node on which the process executes, to transmit a collective operation command to the master node when the process reaches a synchronization point in its execution.
  • the DMA engine on the master node receives and executes the collective operations from the processes and, in response to receiving a pre-established number of the collective operations, conditionally executes the set of associated commands.
  • the set of associated commands include commands to inform processes of the synchronization event.
  • the collective operation includes a counting command, and if the count equals a pre-established count, the DMA engine executes associated commands stored in a processor memory accessible by the DMA engine of the master node.
  • FIG. 1 is an exemplary Kautz topology
  • FIG. 2 is an exemplary simple Kautz topology
  • FIG. 3 shows a hierarchical view of the system
  • FIG. 4 is a diagram of the communication between nodes
  • FIG. 5 shows an overview of the node and the DMA engine
  • FIG. 6 is a detailed block diagram of the DMA engine
  • FIG. 7 is a flow diagram of the remote execution of DMA commands
  • FIG. 8 is a block diagram of the role of the queue manager and various queues
  • FIG. 9 is a block diagram of the DMA engine's cache interface
  • FIG. 10 is a flow diagram of a block write
  • FIGS. 11A-B depict a typical logic flow for RDMA receive operations when page pinning is utilized
  • FIGS. 12A-B depict a typical logic flow for RDMA send operations when page pinning is utilized
  • FIGS. 13A-D depict a logic flow for RDMA receive operations in preferred embodiments in which page pinning is not utilized
  • FIGS. 14A-C depict a logic flow for RDMA send operations in preferred embodiments in which page pinning is not utilized
  • Figure 15 depicts the operating system logic for mapping or re-validating a buffer descriptor in certain embodiments
  • Figure 16A depicts the logic for maintaining buffer descriptors when unmapping a virtual address of certain embodiments
  • Figure 16B depicts the logic for reclaiming physical pages of certain embodiments.
  • Preferred embodiments of the invention provide an RDMA engine that facilitates distributed processing in large scale computing systems and the like.
  • the RDMA engine includes queues for processing DMA data requests for sending data to and from other computing nodes, allowing data to be read from or written to user memory space.
  • the engine also includes command queues, which can receive and process commands from the operating system or applications on the local node or from other computer nodes.
  • the command queues can receive and process (with hardware support) special commands to facilitate collective operations, including barrier and reduction operations, and special commands to support the conditional execution of a set of commands associated with the special command.
  • RDMA engines that interact with processor cache to service RDMA reads and writes.
  • the cache may be read to provide data for a RDMA operation.
  • the cache may be written to service a RDMA operation.
  • Kautz interconnection topologies are unidirectional, directed graphs (digraphs).
  • Kautz digraphs are characterized by a degree k and a diameter n.
  • the degree of the digraph is the maximum number of arcs (or links or edges) input to or output from any node.
  • the diameter is the maximum number of arcs that must be traversed from any node to any other node in the topology.
  • the order O of a graph is the number of nodes it contains.
  • the order of a Kautz digraph is (k + 1)k^(n-1); for example, degree 3 and diameter 2 gives 4 x 3 = 12 nodes, and degree 3 and diameter 3 gives 4 x 9 = 36 nodes.
  • the diameter of a Kautz digraph increases logarithmically with the order of the graph.
  • Figure 1A depicts a very simple Kautz topology for descriptive convenience.
  • the system is order 12 and diameter 2.
  • Figure 1B shows a system that is degree three, diameter three, order 36.
  • One quickly sees that the complexity of the system grows rapidly. It would be counter-productive to depict and describe preferred systems such as those having hundreds of nodes or more.
  • any (x, y) pair satisfying y = (-x*k - j) mod O, where j = 1, ..., k (equation 1), specifies a direct egress link from node x.
  • node 1 has egress links to the set of nodes 30, 31 and 32. Iterating through this procedure for all nodes in the system will yield the interconnections, links, arcs or edges needed to satisfy the Kautz topology. (As stated above, communication between two arbitrarily selected nodes may require multiple hops through the topology but the number of hops is bounded by the diameter of the topology.)
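  • The egress-link rule can be checked with a few lines of C; the program below is purely illustrative and reproduces the example just given (node 1 of the order-36, degree-3 graph links to nodes 32, 31, and 30).

```c
#include <stdio.h>

int main(void)
{
    const int k = 3, order = 36;  /* degree-3, diameter-3 Kautz graph */
    const int x = 1;              /* source node of the example above */

    /* Equation 1: y = (-x*k - j) mod O for j = 1..k gives the k
     * egress links of node x; the double-mod keeps y non-negative. */
    for (int j = 1; j <= k; j++) {
        int y = ((-x * k - j) % order + order) % order;
        printf("node %d -> node %d\n", x, y);  /* prints 32, 31, 30 */
    }
    return 0;
}
```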
  • Each node on the system may communicate with any other node on the system by appropriately routing messages onto the communication fabric via an egress link.
  • node to node transfers may be multi-lane mesochronous data transfers using 8B/10B codes.
  • any data message on the fabric includes routing information in the header of the message (among other information).
  • the routing information specifies the entire route of the message.
  • the routing information is a bit string of 2-bit routing codes, each routing code specifying whether a message should be received locally (i.e., this is the target node of the message) or identifying one of three egress links.
  • each node has tables programmed with the routing information.
  • node x accesses the table and receives a bit string for the routing information.
  • this bit string is used to control various switches along the message's route to node z, in effect specifying which link to utilize at each node during the route.
  • Another node j may have a different bit string when it needs to communicate with node z, because it will employ a different route to node z and the message may utilize different links at the various nodes in its route to node z.
  • the routing information is not literally an "address" (i.e., it doesn't uniquely identify node z) but instead is a set of codes to control switches for the message's route.
  • the incorporated patent applications describe preferred Kautz topologies and tilings in more detail.
  • the routes are determined a priori based on the interconnectivity of the Kautz topology as expressed in equation 1. That is, the Kautz topology is defined, and the various egress links for each node are assigned a code (i.e., each link being one of three egress links). Thus, the exact routes for a message from node x to node y are known in advance, and the egress link selections may be determined in advance as well. These link selections are programmed as the routing information.
  • Figure 3 is a conceptual drawing to illustrate a distributed application. It shows an application 302 distributed across three nodes 316, 318, and 320 (each depicted by a communication stack).
  • the application 302 is made up of multiple processes 306, 308, 322, 324, and 312. Some of these processes, for example, processes 306 and 308, run on a single node; other processes, e.g., 312, share a node, e.g., 320, with other processes, e.g., 314.
  • the DMA engine interfaces with processes 306 and 308 (user level software) directly or through kernel level software 326.
  • Figure 4 depicts an exemplary information flow for a RDMA transfer of a message from a sending node 316 to a receiving node 320.
  • This kind of RDMA transfer may be a result of message passing between processes executing on nodes 316 and 320, as suggested in figure 3.
  • node 316 is not directly connected to node 320, and thus the message has to be delivered through other node(s) (i.e., node 318) in the interconnection topology.
  • Each node 316, 318, and 320 has a main memory, respectively 408, 426, and 424.
  • a process 306 of application 302 running on node 316 may want to send a message to process 312 of the same application running on a remote node 320. This would mean moving data from memory 408 of node 316 to memory 424 of node 320.
  • processor 406 sends a command to its local DMA engine 404.
  • the DMA engine 404 interprets the command and requests the required data from the memory system 408.
  • the DMA engine 404 builds packets 426-432 to contain the message.
  • the packets 426-432 are then transferred to the link logic 402, for transmission on the fabric links 434.
  • the packets 426-432 are routed to the destination node 320 through other nodes, such as node 318, if necessary.
  • the link logic at node 318 will analyze the packets and realize that the packets are not intended for local consumption, and instead that they should be forwarded along on its fabric links 412 connected to node 320.
  • FIG. 5 depicts the architecture of a single node according to certain embodiments of the invention.
  • a large scale multiprocessor system may incorporate many thousands of such nodes interconnected in a predefined topology.
  • Node 500 has six processors 502, 504, 506, 508, 510, and 512.
  • Each processor has a Level 1 cache (grouped as 544) and Level 2 cache (grouped as 542).
  • the node also has main memory 550, cache switch 526, cache coherence and memory controllers 528 and 530, DMA engine 540, link logic 538, and input and output links 536.
  • the input and output links are 8 bits wide (8 lanes) with a serializer and deserializer at each end. Each link also has a 1 bit wide control link for conveying control information from a receiver to a transmitter. Data on the links is encoded using an 8B/10B code.
  • FIG. 6 shows the architecture of the DMA engine 540 for certain embodiments of the invention.
  • the DMA engine has input 602 and output 604 data buses to the switch logic (see figures 3 and 4). There are three input buses and three output buses, allowing the DMA to support concurrent transfers on all ports of a Kautz topology of degree 3.
  • the DMA engine also has three corresponding receive ports 606, 620, and 622 and three corresponding transmit ports 608, 624, and 626, corresponding to each of the three input 602 and output buses 604.
  • the DMA engine also has a copy port 610 for local DMA transfers, a microengine 616 for controlling operation of the DMA engine, an ALU 614, and a scratchpad memory 612 used by the DMA engine.
  • the DMA engine has a cache interface 618 for interfacing with the cache switch 526 (see figure 5).
  • the microengine 616 is a multi-threaded programmable controller that manages the transmit and receive ports.
  • Cache interface 618 provides an interface for transfers to and from both L2 cache 542 and main memory (528 and 530) on behalf of the microengine.
  • the DMA engine can be implemented completely in hardware, or completely within software that runs on a dedicated processor, or a processor also running application processes.
  • Scratchpad memory DMem 612 is used to hold operands for use by the microengine, as well as a register file that holds control and status information for each process and transmit context.
  • the process context includes a process ID, a set of counters (more below), and a command quota. It also includes pointers to event queues, heap storage, command queues for the DMA engine, a route descriptor table, and a buffer descriptor table (BDT).
  • the scratchpad memory 612 can be read and written by the microengine 616, and it is also accessible to processors 544 via I/O reads and writes.
  • the RX and TX ports are controlled by the microengine 616, but the ports include logic to perform the corresponding data copying to and from the links and node memory (via cache interface 618).
  • Each of the transmit 608 and receive ports 606 contains packet buffers, state machines, and address sequencers so that they can transfer data to and from the link logic 538, using buses 602 and 604, without needing the microengine for the data transfer.
  • the copy port 610 is used to send packets from one process to another within the same node.
  • the copy port is designed to act like a transmit or receive port, so that library software can treat local (within the node) and remote packet transfers in a similar way.
  • the copy port can also be used to perform traditional memory-to-memory copies between cooperating processes.
  • When receiving packets from the fabric links, the DMA engine 540 stores the packets within a buffer in the receive port, e.g., 606, before they are moved to main memory or otherwise handled. For example, if a packet enters the DMA engine on RX Port 0 with the final destination being that node, then the packet is stored in "RX Port 0" until the DMA engine processes the packet. Each RX port can hold up to four such packets at a time, before it signals backpressure to the fabric switch not to send any more data. The DMA engine is notified of arriving packets by a signal from the receive port in which the packet was buffered.
  • This signal wakes up a corresponding thread in the DMA microengine 616, so that the microengine can examine the packet and take appropriate action.
  • the microengine will decide to copy the packet to main memory at a particular address, and start a block transfer.
  • the cache interface 618 and receive port logic implement the block transfer without any further interaction with the microengine.
  • the packet buffer is then empty to be used by another packet.
  • Transmission of packets from the DMA engine to the link logic 538 is done in a similar manner.
  • Data is transferred from main memory to the DMA engine, where it is packetized within a transmit port. For example, this could be TX 608, if the packet was destined for transmission on the fabric link corresponding to port 0.
  • the microengine signals the transmit port, which then sends the packet out to the link logic 538 and recycles the packet buffer.
  • Figure 8 depicts the interface to a DMA engine 540 for certain embodiments of the invention.
  • the interface includes, among other things, command queues, event queues and relevant microengine threads for handling and managing queues and ports.
  • User-level processes communicate with DMA Engine 540 by placing commands in a region of main memory 550 dedicated to holding command queues 802.
  • Each command queue 803 is described by a set of three values accessible to the kernel.
  • the memory region used for a queue is described by a buffer descriptor.
  • the read pointer is the physical address of the next item to be removed from the queue (the head of the queue).
  • the write pointer is the physical address at which the next item should be inserted in the queue (tail).
  • the read and write pointers are incremented by 128 bytes until they reach the end of the region, then wrap to the beginning.
  • Various microcoded functions within the DMA engine, such as the queue manager, can manage the pointers.
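  • The pointer discipline just described amounts to a ring buffer of 128-byte entries. A C sketch follows; the struct layout and function names are illustrative, and the region length is assumed to be a multiple of the entry size.

```c
#include <stdint.h>
#include <string.h>

#define ENTRY_BYTES 128u          /* each queue entry is 128 bytes */

/* Kernel-visible description of one command queue: the region (from
 * a buffer descriptor) plus read and write pointers, kept here as
 * offsets for clarity. */
typedef struct {
    uint8_t *base;                /* start of the queue's memory region */
    size_t   region_len;          /* multiple of ENTRY_BYTES            */
    size_t   rd;                  /* next entry to remove (head)        */
    size_t   wr;                  /* next insertion point (tail)        */
} cmd_queue_t;

/* Advance a pointer by one entry, wrapping to the beginning when it
 * reaches the end of the region, as described above. */
static size_t advance(const cmd_queue_t *q, size_t off)
{
    off += ENTRY_BYTES;
    return (off == q->region_len) ? 0 : off;
}

/* One slot stays unused so that full and empty are distinguishable. */
int queue_push(cmd_queue_t *q, const void *cmd)
{
    if (advance(q, q->wr) == q->rd) return -1;   /* queue full  */
    memcpy(q->base + q->wr, cmd, ENTRY_BYTES);
    q->wr = advance(q, q->wr);
    return 0;
}

int queue_pop(cmd_queue_t *q, void *cmd_out)
{
    if (q->rd == q->wr) return -1;               /* queue empty */
    memcpy(cmd_out, q->base + q->rd, ENTRY_BYTES);
    q->rd = advance(q, q->rd);
    return 0;
}
```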
  • the port queues 810 are queues where commands can be placed to be processed by a transmit context 812 or transmit thread 814 of a TX port 608. They are port specific, not process specific.
  • the event queue 804 is a user accessible region of memory that is used by the DMA engine to notify user-level processes about the completion of DMA commands or about errors.
  • Event queues may also be used for relatively short messages between nodes.
  • the engine 616 includes a thread called the queue manager (not shown).
  • the queue manager monitors each of the process queues 803 (one for each process), and copies commands placed there by processes to port queues 810 and 806 for processing.
  • the queue manager also handles placing events on process event queues 804.
  • the queue manager reads entries from the command queue region 802, checks the entry for errors, and then copies the entry to a port command queue 806 or 810 for execution. (The queue manager can either immediately process the command, or copy it to a port command queue for later processing.) Completion of a transfer is signaled by storing onto the event queue, and optionally by executing a string of additional commands.
  • Each process has a quota of the maximum number of commands it may have on a port queue. This quota is stored within the scratchpad memory 612. Any command in excess of the quota is left on a process's individual command queue 803, and processing of commands on that command queue is suspended until earlier commands have been completed.
  • Transmit contexts 812 may be used to facilitate larger DMA transfers.
  • a transmit context 812 is stored within the scratchpad memory 612 and is used to describe an outgoing transfer. It includes the sequence of packets, the memory buffer from which the packets are to be read, and the destination (a route, and a receive context ID).
  • the DMA engine 540 may manage 8 contexts: one background and one foreground context for each of the three output links, and a pair for interprocess messages on the local node.
  • Transmit contexts are maintained in each node. This facilitates the transmission and interpretation of packets.
  • transmit context information may be loaded from the scratchpad memory 612 to a TX or RX port by a transmit thread under the control of engine 616.
  • Route descriptors are used to describe routes through the topology to route messages from one node to another node. Route descriptors are stored in a route descriptor table, and are accessible through handles. A table of route descriptors is stored in main memory, although the DMA engine 540 can cache the most commonly used ones in scratchpad memory 612. Each process has a register within scratchpad memory 612 representing the starting physical address and length of the route descriptor table (RDT) for that process.
  • Each RDT entry contains routing directions, a virtual channel number, a processID on the destination node, and a hardware process index, which identifies the location within the scratchpad memory 612 where the process control/status information is stored for the destination process.
  • the Route Descriptor also contains a 2-bit field identifying the output port associated with a path, so that a command can be stored on the appropriate transmit port queue.
  • routing directions are described by a string of routing instructions, one per switch, indicating the output port to use on that switch. After selecting the output, each switch shifts the routing direction right two bits, discarding one instruction and exposing the next for use at the next switch.
  • the routing code will be a value indicating that the node is the destination node.
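  • The per-switch handling of the routing string might look like the C sketch below. The 2-bit code assignment (0 for local delivery, 1-3 for the three egress links) is an assumption for illustration; the text only states that one code value means "deliver locally" and the others name egress links.

```c
#include <stdint.h>
#include <stdio.h>

#define ROUTE_LOCAL 0u   /* assumed encoding of "this is the target" */

/* At each switch: consume the low 2 bits, then shift right two bits,
 * discarding one instruction and exposing the next. */
unsigned route_step(uint64_t *routing)
{
    unsigned code = (unsigned)(*routing & 0x3u);
    *routing >>= 2;
    return code;
}

int main(void)
{
    /* Hypothetical 3-hop route: egress link 2, then link 1, then
     * local delivery at the destination node. */
    uint64_t r = (1u << 2) | 2u;
    unsigned code;
    while ((code = route_step(&r)) != ROUTE_LOCAL)
        printf("forward on egress link %u\n", code);
    printf("deliver locally\n");
    return 0;
}
```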
  • the DMA engine is capable of executing various commands. Examples of these commands are
  • Every command has a command header.
  • the header includes the length of the payload, the type of command, a route handle, and in do_cmd commands, a do_cmd counter selector and a do_cmd counter reset value.
  • the send_event command instructs the DMA engine to create and send an enq_direct packet whose payload will be stored on the event queue of the destination process.
  • the destination process can be at a remote node.
  • a command from engine 404 of figure 4 can be stored on the event queue for DMA engine 420. This enables one form of communication between remote processes. The details of the packets are described below.
  • the send_cmd command instructs the DMA engine to create an enq_response packet, with a payload to be processed as a command at the destination node.
  • the send_cmd command contains a nested command as its payload.
  • the nested command will be interpreted at the remote node as if it had been issued by the receiving process at the remote node (i.e., as if it had been issued locally).
  • the nested command should not be a send_cmd or supervise command.
  • the DMA engine will place the payload of the send_cmd command on a port command queue of the receiving DMA engine for execution, just as if it were a local DMA command. If the receiving process does not have enough quota, then the command will be deferred and placed on the process's event queue instead.
  • the do_cmd instructs a DMA engine to conditionally execute a string of commands found in the heap.
  • the heap is a region of memory within the main memory, which is user-writable and contiguous in both virtual and physical memory address spaces. Objects on the heap are referred to by handles.
  • the fields of the do_cmd command are the countld field (register id), the countTotal (the count reset value) field, the execHandle (heap handle for the first command) field, and the execCount (number of bytes in the command string) field.
  • the do_cmd countID field identifies one of 16 counter registers within the DMA engine. If the register value is 0 when the do_cmd is executed, the value of the register is replaced by the countTotal field, and commands specified by the execHandle are enqueued for execution by the DMA engine. The do_cmd cannot be used to enqueue another do_cmd for execution. A do_cmd is executed by selecting the counter identified by the countID field, comparing the value against zero, and decrementing the counter if it is not equal to zero. Once the value reaches zero, the DMA engine uses the execHandle and execCount fields to identify and execute a string of commands found on the heap.
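  • The do_cmd semantics just described reduce to a short routine; in this C sketch the counter registers and enqueue_heap_commands are hypothetical models of the engine's internal state and port-queue machinery.

```c
#include <stdint.h>

/* Fields of a do_cmd, as described above. */
typedef struct {
    unsigned countId;     /* selects one of the 16 counter registers */
    uint32_t countTotal;  /* reset value loaded when the count fires */
    uint32_t execHandle;  /* heap handle of the first command        */
    uint32_t execCount;   /* number of bytes in the command string   */
} do_cmd_t;

static uint32_t counters[16];     /* DMA-engine counter registers */

/* Illustrative stand-in: enqueue execCount bytes of commands found on
 * the heap at execHandle onto a port command queue for execution. */
extern void enqueue_heap_commands(uint32_t execHandle, uint32_t execCount);

/* Compare the selected counter against zero and decrement it if
 * non-zero; once it is zero, reload it from countTotal and enqueue
 * the associated command string from the heap. */
void do_cmd_execute(const do_cmd_t *cmd)
{
    uint32_t *ctr = &counters[cmd->countId & 15u];
    if (*ctr != 0) {
        (*ctr)--;                 /* not yet: count one more arrival */
        return;
    }
    *ctr = cmd->countTotal;       /* reset for the next round */
    enqueue_heap_commands(cmd->execHandle, cmd->execCount);
}
```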
  • the put_bf_bf command instructs the DMA engine to create and send a sequence of DMA packets to a remote node using a transmit context.
  • the packet payload is located at a location referred to by a buffer handle, which identifies a buffer descriptor in the BDT, and an offset, which indicates the starting address within the region described by the buffer descriptor.
  • the put_bf_bf command waits on the background port queues 810 for the availability of a transmit context. Offset fields within the command specify the starting byte address of the destination and source buffers with respect to buffer descriptors.
  • the DMA engine creates packets using the data referred to by the source buffer handle and offset, and sends out packets addressed to the destination buffer handle and offset.
  • the put_bf_bf command can also be used to allow a node to request data from the DMA engine of a remote node.
  • the put_bf_bf command and the send_cmd can be used together to operate as a "get" command.
  • a node uses the send_cmd to send a put_bf_bf command to a remote node.
  • the target of where the DMA packets are sent by the put_bf_bf command is the node that sent the put_bf_bf command. This results in a "get" command. Further details of packets and embedding commands within a send_cmd are described below.
  • the put_im_hp command instructs the DMA engine to send a packet to the remote node. The payload comes from the command itself, and it is written to the heap of the remote node.
  • the supervise command provides control mechanisms for the management of the DMA engine.
  • Packets are used to send messages from one node to another node. Packets are made up of an 8 byte packet header, an optional 8 byte control word, a packet body of 8 to 128 bytes, and an 8 byte packet trailer. The first 8 bytes of every data packet, called the header word, include a routing string, a virtual channel number, a buffer index for the next node, and a link sequence number for error recovery, as well as a non-data start of packet (SOP) flag. The second 8 bytes, called the control word, is optional (depending on the type of packet) and is interpreted by the receiving DMA engine to control where and how the payload is stored. The last 8 bytes, the trailer, include the packet type, a 20-bit identification code for the target process at the destination node, a CRC checksum, and a non-data end of packet (EOP) flag, used to mark the end of the packet.
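  • The stated layout can be summarized as a C struct. Only the word sizes and field lists come from the text; how fields pack within each 8-byte word is not specified there, and the SOP/EOP flags are non-data signals on the link rather than ordinary payload bytes.

```c
#include <stdint.h>

typedef struct {
    uint64_t header;    /* routing string, virtual channel number,
                           buffer index for the next node, and link
                           sequence number for error recovery */
    uint64_t control;   /* optional: buffer handle and offset telling
                           the receiving DMA engine where and how to
                           store the payload */
    uint8_t  body[128]; /* packet body of 8 to 128 bytes */
    uint64_t trailer;   /* packet type, 20-bit target-process ID at
                           the destination node, and CRC checksum */
} dma_packet_t;
```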
  • An enq_direct packet is used to send short messages of one or a few packets. The payload of such a message is deposited on the event queue of another process. This type of packet has only an 8 byte header (no control word) and an 8 byte trailer.
  • An enq_response packet is created by a node to contain a command to be executed by a remote node. The remote node places the payload of the packet, which is a command, onto a port command queue for execution by the DMA engine.
  • DMA packets are used to carry high volume traffic between cooperating nodes that have set up transmit and receive contexts.
  • DMA packets have the same headers and trailers as other packets, but also have an 8 byte control word containing a buffer handle and offset, which tell the receiving DMA engine where to store the data.
  • a DMA_end packet is sent by a node to signal the end of a successful transmission. It has enough information for the receiving DMA engine to store an event on the event queue of the receiving process and, if requested by the sender, to execute a string of additional commands found in the receiver's heap.
  • Certain embodiments of the invention allow one node to issue a command to be executed by another node's RDMA engine. These embodiments establish a "trust system" among processes and nodes. Only trusted processes will be able to use RDMA.
  • the trust model is that an application, which may consist of user processes on many nodes, trusts all its own processes and the operating system, but does not trust other applications. Similarly, the operating system trusts the OS on other nodes, but does not trust any application.
  • Trust relationships are established by the operating system (OS).
  • the operating system establishes route descriptor tables in memory.
  • a process needs the RDTs to access the routing information that allows it to send commands that will be accepted and trusted at a remote node.
  • Each process has a register within scratchpad memory 612, representing the starting physical address and length of the route descriptor table for that process. This allows the process to access the route descriptor table.
  • when a process creates a command header for a command, it places the route handle of the destination node and process in the header.
  • the DMA engine uses this handle to access the RDT to obtain (among other things) a processID and hardware process index of the destination process. This information is placed into the packet trailer.
  • a remote DMA engine uses the hardware process index to retrieve the corresponding control/status information from scratchpad memory 612. As described above, this contains a processID of the destination process.
  • the DMA engine compares the processID stored in the local DMA engine with the processID in the packet trailer. If the values do not match, the incoming packet is sent to the event queue of process 0 for exception handling.
  • FIG. 7 depicts the logic flow for sending a command to a DMA engine at a remote node for execution of the command by that DMA engine.
  • the process begins with step 702, where a nested command is created.
  • a nested command is one or more commands to be captured as a payload of a send_cmd.
  • the nested command is one command which is sent as the payload of a send_cmd.
  • the process constructs the nested command following the structure for a command header, and the structure of the desired command as described above.
  • a send_cmd is created, following the format for a send command and the command header format.
  • the nested command is used as the payload of the send_cmd.
  • the send_cmd (with the nested command payload) is posted to a command queue for the DMA engine. Eventually, the queue manager of the DMA engine copies the command to a port queue 806 or 810 for processing.
  • the DMA engine interprets the send_cmd.
  • the DMA engine looks up routing information based on a route handle in the command header which points to a routing table entry.
  • the DMA engine builds an enq_response packet.
  • the payload of that packet is loaded with the payload of the send_cmd (i.e., the nested command).
  • the DMA engine also builds the necessary packet header and trailer based on the routing table entry. Specifically, the trailer contains the proper processID and hardware process index to be trusted by the remote DMA engine.
  • the DMA engine copies the enq_response packet to the port queue of the link to be used for transmission.
  • the TX port then retrieves the packet and hands it off to the link logic 538 and switching fabric 552.
  • the link logic will handle actual transmission of the packet on the switching fabric. (The microengine can determine the correct port queue by looking at the routing information in the header of the enq_response packet.)
  • the packet will be sent through the interconnect topology until it reaches the destination node.
  • the packet arrives at the destination link logic on the corresponding receive port, where it is forwarded to the corresponding RX port buffer within the remote node's DMA engine.
  • the RX port notifies the DMA microengine, as it does with any other packet it receives.
  • the DMA engine determines that the packet type is an enq_response packet.
  • the packet is validated. This process, as described above, compares the processID of the destination process to the processID stored in the packet trailer of the enq_response packet. If the processIDs match, the packet is trusted, and the payload of the packet is stored to a command queue of the receiving process for execution. This command is processed in essentially the same way as if the command had been enqueued by the local process having the same processID. If there is not a match, then an event is added to process 0's event queue so that the sender can be notified of the error.
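  • A compact C sketch of this validation step follows; the scratchpad lookup and the two queue operations are hypothetical names standing in for the hardware paths described above.

```c
#include <stdint.h>

/* Per-process control/status block kept in scratchpad memory,
 * indexed by the hardware process index carried in the trailer. */
typedef struct {
    uint32_t process_id;        /* processID of the local process */
} proc_ctx_t;

extern proc_ctx_t scratchpad_procs[];
extern void enqueue_command(uint32_t hw_proc_index, const void *payload);
extern void post_error_event_to_process0(const void *pkt);

/* Trust check for an incoming enq_response packet: the processID in
 * the trailer must match the processID stored locally for that
 * hardware process index; otherwise the packet goes to process 0's
 * event queue for exception handling. */
void handle_enq_response(uint32_t hw_proc_index, uint32_t trailer_pid,
                         const void *payload, const void *pkt)
{
    if (scratchpad_procs[hw_proc_index].process_id == trailer_pid)
        enqueue_command(hw_proc_index, payload);  /* trusted: execute */
    else
        post_error_event_to_process0(pkt);        /* mismatch: report */
}
```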
  • the command is eventually selected and executed by the DMA engine at the remote node. This execution is done in the context of the receiving node's RDT and BDT.
  • Command queue quotas for each process are maintained within the DMA engine. If the event queue is full, the packet is discarded. It is up to the user-level processes to ensure that command or event queues do not become too full.
  • Preferred embodiments of the invention utilize the remote command execution feature discussed above in a specific way to support collective operations, such as barrier and reduction operations.
  • Barrier operations are used to synchronize the activity of processes in a distributed application.
  • Collective operations and barrier operations are known in the art, e.g., MPI, but are conventionally implemented in operating system and MPI software executed by the processor.
  • One well known method is using hierarchical trees for synchronization.
  • barrier operations may be implemented by using the do_cmd described above, which provides for the conditional execution of a set of other instructions or commands.
  • one node in the set of nodes associated with a distributed application is selected to act as a master node.
  • the specific form of selection is application dependent, and there may be multiple masters in certain arrangements, e.g., hierarchical arrangements.
  • a list of commands is then created to be associated with the do command and to be conditionally executed as described below.
  • the commands may be stored on the heap storage of the master node.
  • a counter register to be used by the synchronization process is initialized by use of an earlier do_cmd that has a countTotal field set to one less than the number of processes that will be involved in the barrier operation. This is because each do_cmd tests if the counter value is equal to zero before it decrements the counter. Therefore, if 3 processes are involved, the counter is initialized to 2: the first do_cmd will reduce the counter value to 1, the second will reduce it to 0, and the third do_cmd will find that the value is zero. Each process of the distributed application will include a call to a library routine to issue the do_cmd to the master node, at an application-dependent synchronization point of execution.
  • That node/application will send a do_cmd to the master node in the manner described above for sending DMA commands to another node for execution.
  • the do_cmd will cause the relevant counter to be selected and decremented.
  • the last process to reach the barrier operation will send the final do_cmd.
  • when the DMA engine executes this do_cmd, the counter value will be equal to zero, and this will cause the DMA engine to execute the DMA commands on the heap associated with the do_cmd (i.e., those pointed to by the execHandle of the do_cmd).
  • the DMA commands on the heap are enqueued to the appropriate port command queue by the do_cmd for execution when the barrier operation is reached. It is envisioned that among other purposes the commands on the heap will include commands to notify other relevant processes about the synchronization status.
  • the commands may include send event commands to notify parent tasks in a process hierarchy of a distributed application, thereby informing the parent tasks that children tasks have performed their work and reached a synchronization point in their execution.
  • the send_event commands would cause an enq_direct or enq_response packet to be sent to each relevant process at each relevant node.
  • the payload of the packet would be stored on the event queue of the process, and would signal that synchronization has occurred.
  • synchronization similar to multicast may be done in the following manner.
  • a list of commands is created and associated with the do_cmd.
  • This list could include a list of send_cmd commands.
  • Each of these send_cmds, as described above, has a nested command, which in this case would be a do_cmd (with an associated counter etc.). Therefore when the list of associated commands are executed by the DMA engine, they will cause a do_cmd to be sent to other nodes.
  • do_cmd commands will be enqueued for execution at the remote node.
  • the multicast use of do_cmd will be performed with the counter equal to zero.
  • Multicast occurs when some or all of these do_cmds being enqueued for execution at a remote node, point to more send_cmd commands on the heap. This causes the DMA engine to send out yet more do_cmd to other remote nodes.
  • the result is an "avalanche process" that notifies every process within an application that synchronization has been completed. Because the avalanche occurs in parallel on many nodes, it completes much faster than could be accomplished by the master node alone. Commands can be placed on the heap of a remote node using the put_im_hp command described earlier. This command can be used to set up the notification process.
  • the first node can execute four send_cmds and a send_event (for the local process) upon execution of the final do_cmd (5 nodes notified now).
  • Each send_cmd has a payload of a do_cmd. Therefore 4 remote nodes receive and execute a do_cmd that causes them to each send out four more do_cmds, as well as a send_event to the local process.
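  • To see why the avalanche finishes quickly, the runnable C fragment below counts notified nodes under the stated pattern (each newly notified node forwards four more do_cmds); the 1000-node target is just an example figure.

```c
#include <stdio.h>

int main(void)
{
    unsigned long notified = 1;        /* the master node itself      */
    unsigned long frontier = 1;        /* nodes forwarding this round */
    const unsigned long target = 1000; /* example system size         */
    int rounds = 0;

    while (notified < target) {
        frontier *= 4;                 /* each sends four do_cmds */
        notified += frontier;
        rounds++;
    }
    /* Prints: 5 rounds notify 1365 nodes */
    printf("%d rounds notify %lu nodes\n", rounds, notified);
    return 0;
}
```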
  • Preferred embodiments of the invention may use a cache system like that described in the related and incorporated patent application entitled "System and Method of Multi-Core Cache Coherency," U.S. serial number 11/335,421.
  • This cache is a write back cache. Instructions or data may reside in a particular cache block for a processor, e.g., 502 of figure 5, and not in any other cache or main memory 550.
  • when a processor, e.g., 502, issues a memory request, the request goes to its corresponding cache subsystem, e.g., in group 542.
  • the cache subsystem checks if the request hits into the processor-side cache.
  • in conjunction with determining whether the corresponding cache 542 can service the request, the memory transaction is forwarded via the memory bus or cache switch 526 to the memory subsystem 550 corresponding to the memory address of the request.
  • the request also carries instructions from the processor cache 542 to the memory controllers 528 or 530, indicating which "way" of the processor cache is to be replaced.
  • if the request hits, it is serviced by that cache subsystem, for example by supplying to the processor 502 the data in a corresponding entry of the cache data memory.
  • the memory transaction sent to the memory subsystem 550 is aborted or never initiated in this case.
  • if the request misses, the memory subsystem 550 will continue with its processing and eventually supply the data to the processor.
  • the DMA engine 540 of certain embodiments includes a cache interface 618 to access the processors' cache memories 542.
  • the DMA engine, when servicing a RDMA read or write request, can read or write to the proper part of the cache memory using cache interface 618 to access cache switch 526, which is able to interface with L2 caches 542. Through these interfaces the DMA engine is able to read or write any cache block in virtually the same way as a processor.
  • the cache interface has an interface 902 for starting tasks, and read and write queues 920.
  • the cache interface also has data bus 918 and command bus 916 for interfacing with cache switch 526, and MemIn interface 908 and MemOut interface 910 for connecting to memory buffers.
  • the cache interface also has outstanding read table 912 and outstanding write table 914, and per thread counters 904 and per port counters 906.
  • Each microengine thread can start memory transfers or "tasks" via the TaskStart interface 902 to the cache interface.
  • the TaskStart interface 902 is used for interfacing with the DMA engine/micro engine 616.
  • the TaskStart interface determines the memory address and length of a transfer by copying the MemAddr and MemLen register values from the requesting microengine thread.
  • Tasks are placed in queues where they wait for their turn to use the CmdAddr 916 or data 918 buses.
  • the CmdAddr 916 and data buses 918 connect the DMA engine's cache interface to the cache switch 526.
  • the cache switch is connected to the cache memory 542 and the cache coherence and memory controllers 528 and 530.
  • the memory transfers move data between main memory and the TX, RX, and copy port buffers in the DMA engine by driving the MemIn 908 and MemOut 910 interfaces.
  • the MemIn 908 interface controls moving data from main memory or the caches into the DMA engine
  • the MemOut 910 interface controls moving data from the DMA buffers out to main memory or the caches.
  • the cache interface 618 maintains queues for outstanding read 912 and write 914 requests.
  • the cache interface also maintains per-thread 904 and per-port 906 counters to keep track of how many requests are waiting in queues or outstanding read/write tables. In this way, the cache interface can notify entities when the requests are finished.
  • the cache interface can handle different types of requests; two of these request types are the block read (BRD) and block write (BWT).
  • a block read request received by the DMA microengine is placed in a ReadWriteQ 920.
  • the request cannot leave the queue until an entry is available in the outstanding read table (ORT).
  • the ORT entry contains details of the block read request so that the cache interface knows how to handle the data when it arrives.
  • for a block write, the microengine drives the TaskStart interface, and the request is placed in ReadWriteQ.
  • the request cannot leave ReadWriteQ until an outstanding write table (OWT) entry is available.
  • the cache interface arbitrates for the CmdAddr bus in the appropriate direction and drives a BWT command onto the bus to write the data to main memory.
  • the OWT entry is written with the details of this block write request, so that the cache interface is ready for a "go" (BWTGO) command to write it to memory or a cache when the BWTGO arrives.
  • the cache interface performs five basic types of memory operations to and from the cache memory: read cache line from memory, write cache line to memory, respond to I/O write from core, respond to SPCL commands from the core, and respond to I/O reads from core.
  • the DMA engine arbitrates for and writes to the data bus for one cycle to request data from cache or main memory. The response from the cache switch may come back many cycles later, so the details of that request are stored in the OutstandingReadTable (ORT).
  • the OutstandingReadTable entry tells where the data should be sent within the DMA engine.
  • the ORT entry is freed so that it can be reused. Up to 4 outstanding reads at a time are supported.
  • the DMA engine arbitrates for and writes the CmdAddr 916, then when a signal to write the cache data comes back, it reads data from the selected internal memory, then arbitrates for and writes the data bus.
  • the cache interface 618 can be used by the DMA engine to directly read and write remote data from processor caches 542 without having to invalidate L2 cache blocks. This avoids requiring processor 502 to encounter a L2 cache miss the first time it wishes to read data supplied by the DMA engine.
  • the process starts with a block read command (BRD) being sent to the cache coherence controller (memory controller or COH) 528 or 530 from the cache interface 618 of the DMA engine 540.
  • the cache tags are then checked to see whether or not the data is resident in processor cache.
  • if the data is not resident, the tags will indicate a cache miss. In this case, the request is handled by the memory controller, and after a certain delay, the data is returned to the DMA engine from the main memory (not processor cache). The data is then written to a transmit port by cache interface 618. The data is now stored in a transmit buffer and is ready to be transferred to the link logic and subsequently to another node. If there is an outstanding read or write, then a dependency is set up with the memory controller, so that the outstanding read or write can first complete.
  • FIG. 10 depicts the logic flow when the DMA engine is supplying data to be written into a physical address in memory. In this situation, an RX port writes the incoming DMA data to main memory or, if the addressed block is already in the cache, to the cache. As described above, the DMA engine can write data to main memory once it has received a command and context specifying where data ought to be stored in main memory, e.g., via buffer descriptor tables and the like.
  • the logic starts at step 1002, in which the DMA engine sends a command, through cache interface 618, to the COH controller asking it to check its cache tags, and providing it the data and physical address for the write.
  • the COH can then pass on the information to the memory controller or L2 cache segment as necessary.
  • the COH checks the cache tags to determine if there is a cache hit.
  • the cache coherence controller checks for outstanding read or write operations.
  • the L2 cache operations may involve multiple bus cycles; therefore, logic is provided within the COH to ensure coherency and ordering for outstanding (in-flight) transactions.
  • the DMA requests conform to this logic similarly to the manner in which processors do. Assume for now that there are no outstanding operations.
  • If there is no cache hit at step 1004, the method proceeds to step 1016, and the incoming data is sent from the DMA engine to the COH.
  • the COH passes the request to the memory controller, which writes the data to main memory.
  • At step 1018, if during the check of outstanding write operations there is a hit, then, using the logic within the COH for ordering in-flight operations, the current write of data to memory is only done after the outstanding write completes. Similarly, if during the check of the outstanding reads a hit is found, then the write waits until the data for the outstanding read has been returned from the main memory. The process then continues similarly to writing to a cached block as shown in figure 10.
  • At step 1006, a block write probe command is issued from the COH to the processor with the cached data, telling it the address of the block write command.
  • the COH has a control structure that allows the COH to determine which processors have a cache block corresponding to the physical memory address of the data being written by the DMA engine.
  • the probe request causes the processor to invalidate the appropriate L1 cache blocks.
  • the processor invalidates the L1 cache blocks that correspond to the L2 cache blocks being written to. Alternatively, if there is no longer a cache hit since step 1004, i.e., the block has been evicted, the processor responds to the probe command by telling the DMA engine it should write to the COH (and effectively the main memory). At step 1010, the DMA engine sends the data to be written to the processor's L2 segment. At step 1012, the processor's L2 segment receives and writes the data to its L2 cache. Finally, at step 1014, the processor informs the COH controller that the write to L2 cache is complete.
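  • As a summary of figure 10's main paths, the C skeleton below models the COH controller's decision; every function is a hypothetical stand-in for a hardware step, keyed to the step numbers above, and the exception cases discussed next are omitted.

```c
#include <stdbool.h>
#include <stdint.h>

extern bool tag_lookup(uint64_t pa, int *owner_cpu);          /* 1004 */
extern bool has_outstanding_op(uint64_t pa);
extern void wait_for_outstanding(uint64_t pa);
extern void send_block_write_probe(int cpu, uint64_t pa);     /* 1006 */
extern void write_l2_segment(int cpu, const void *data); /* 1010-1014 */
extern void write_main_memory(uint64_t pa, const void *data); /* 1016 */

void coh_block_write(uint64_t pa, const void *data)
{
    if (has_outstanding_op(pa))   /* ordering for in-flight accesses */
        wait_for_outstanding(pa);

    int owner;
    if (tag_lookup(pa, &owner)) { /* hit: write the owning L2 segment */
        send_block_write_probe(owner, pa);  /* invalidates L1 copies */
        write_l2_segment(owner, data);
    } else {                      /* miss: write main memory instead */
        write_main_memory(pa, data);
    }
}
```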
  • Additional steps need to be taken when writing to a cached block as shown in figure 10, when there is an outstanding write from another processor.
  • the processor first writes the outstanding write to the COH.
  • the COH then writes the data to the main memory, allowing the write to be completed in the same manner as shown in figure 10.
  • Additional steps also need to be taken if there is an outstanding write to the same address from any source. In this case, the new incoming write is made dependent upon the outstanding write, and the outstanding write is handled in the same manner as any other write. Once that write is complete, the new incoming write is handled. Additional steps also need to be taken in the above situation if there is an outstanding read. All the above situations have assumed that the data being written to is in the exclusive state.
  • data in the caches can also be in a shared state, meaning that data within one cache is shared among multiple processors.
  • an invalidation probe is sent out to all processors matching the tag for the block. This requests that all processors having the cache block invalidate their copy. Shared data blocks cannot be dirty, so there is no need to write any changes back to main memory. The data can then be written to main memory safely. The other processors that were sharing the data will reload the data from main memory.
  • the DMA engine allows user-level code to use it directly, without requiring system calls or interrupts in the critical path. To accomplish this task for RDMA operations (which copy data in and out of application virtual memory), preferred embodiments of the invention rely on the virtual-to-physical memory associations to (likely) stay intact during the application's lifetime. Preferred embodiments provide logic to recover from the uncommon case where this is not so.
  • the application software invokes the OS to associate a buffer descriptor index (BDI) with an application virtual address (VA).
  • the OS writes a corresponding entry of the BD table with the physical address (PA) that corresponds to the VA (this association being known to the OS). This validates the BD.
  • only the OS is permitted to write the BDT. In this way applications can only command the DMA to access memory permitted by the OS via its programming of the BDT.
  • a DMA command includes a BD index to specify the location for the source of the data, and a BD index to specify the location for the destination of the data.
  • the source DMA engine will translate the source BD index
  • the destination DMA engine (which may be the same) will translate the destination BD index.
  • Each does this by reading the BD value in its respective BDT, which may reveal the PA or may reveal that the BD is invalid (e.g., in some embodiments this may be indicated by a zero length for that BD entry).
  • Invalid BDs create a fault condition (which is described further below); a minimal sketch of this BD arrangement follows.
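The sketch below assumes (as one embodiment above suggests) that a zero-length entry marks an invalid BD. The table size, names, and signatures are assumptions made for the sketch, not the actual interface.

```c
#include <stdint.h>

/* One BD: the PA the OS resolved for an application VA, plus a
 * length in bytes; len == 0 marks the BD invalid. */
typedef struct {
    uint64_t pa;
    uint64_t len;
} bd_entry_t;

#define BD_TABLE_SIZE 1024
static bd_entry_t bd_table[BD_TABLE_SIZE]; /* written by the OS only */

/* OS side: validate a BD by filling in its entry. */
void os_bd_write(unsigned bdi, uint64_t pa, uint64_t len)
{
    bd_table[bdi].pa  = pa;
    bd_table[bdi].len = len;
}

/* DMA-engine side: translate a BD index taken from a command.
 * Returns 0, a fault condition, if the BD is invalid. */
int dma_translate_bd(unsigned bdi, uint64_t *pa_out)
{
    if (bdi >= BD_TABLE_SIZE || bd_table[bdi].len == 0)
        return 0;
    *pa_out = bd_table[bdi].pa;
    return 1;
}
```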
  • FIGS. 11A-B depict a typical logic flow for RDMA receive operations when page pinning is utilized.
  • the logic begins in 1102 and proceeds to 1104 where the software (e.g., a library routine) receives the source physical address from the sender.
  • the destination virtual address for the RDMA receive operation is looked up to see if it is present in the OS table of pinned pages.
  • in 1108, a test is made to see if it is present in the table. If so, the logic proceeds to step 1110, where a reference to the destination virtual address is added to the table for this RDMA operation.
  • the source and destination physical addresses are provided to the DMA engine, and in 1114 the logic waits for the DMA operation to complete. Once completed, the sender is notified of the completion status in 1116 by the DMA engine (e.g., via the event queue). The reference to the destination virtual address is then deleted from the table in 1118, and the operation completes.
  • [0120] If, however, the test of 1108 determines that the destination virtual address is not in the table, then the logic proceeds to 1120. In 1120 a test is made to determine whether or not there is an unused slot available in the table of pinned pages. If so, in 1122 the operating system is invoked to lock the virtual page. The physical address for the destination virtual address is resolved in step 1124 (i.e., the OS provides the corresponding physical page for the destination virtual address), and the destination virtual address is then added to the table of pinned pages in 1126.
  • If the test of 1120 determines that there is no unused slot available in the table of pinned pages, then in 1128 a test is made to determine whether or not there is an unreferenced slot available. If so, the OS is invoked to unlock the unreferenced virtual addresses in 1130, and the virtual address is removed from the table of pinned pages in 1132. The logic returns to 1122, now with an available table slot.
  • If test 1128 determines that there are no unreferenced slots available, the logic fails in 1150. (A sketch of this flow appears below.)
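The receive flow of FIGS. 11A-B may be summarized in C roughly as below. This is a sketch of the flowchart only: every helper (pin_table_lookup, os_lock_page, dma_start, and so on) is a hypothetical name, and the figures' step numbers appear as comments.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { void *va; uint64_t pa; unsigned refs; } pin_slot_t;

pin_slot_t *pin_table_lookup(void *va);            /* 1106/1108 */
pin_slot_t *pin_table_find_unused(void);           /* 1120 */
pin_slot_t *pin_table_find_unreferenced(void);     /* 1128 */
pin_slot_t *pin_table_add(void *va, uint64_t pa);  /* 1126 */
void pin_table_remove(pin_slot_t *s);              /* 1132 */
void os_lock_page(void *va);                       /* 1122 */
void os_unlock_page(void *va);                     /* 1130 */
uint64_t os_resolve_pa(void *va);                  /* 1124 */
void dma_start(uint64_t src, uint64_t dst, size_t len);
void dma_wait_complete(void);                      /* 1114 */
void notify_sender_complete(void);                 /* 1116 */

int rdma_receive_pinned(uint64_t src_pa, void *dst_va, size_t len)
{
    pin_slot_t *s = pin_table_lookup(dst_va);
    if (!s) {
        if (!pin_table_find_unused()) {
            pin_slot_t *u = pin_table_find_unreferenced();
            if (!u)
                return -1;                         /* 1150: fail */
            os_unlock_page(u->va);                 /* 1130 */
            pin_table_remove(u);                   /* 1132 */
        }
        os_lock_page(dst_va);                      /* 1122 */
        s = pin_table_add(dst_va, os_resolve_pa(dst_va));
    }
    s->refs++;                                     /* 1110 */
    dma_start(src_pa, s->pa, len);                 /* provide PAs to engine */
    dma_wait_complete();                           /* 1114 */
    notify_sender_complete();                      /* 1116 */
    s->refs--;                                     /* 1118 */
    return 0;
}
```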
  • FIGS. 12A-B depict a typical logic flow for RDMA send operations when page pinning is utilized.
  • the logic begins in 1202 and proceeds to 1206, where the source virtual address for the RDMA send operation is looked up to see if it is present in the table of pinned pages (the source virtual address being provided by the caller).
  • a test is made to see if it is present in the table of pinned pages. If so, the logic proceeds to step 1210, where a reference to the source virtual address is added to the table of pinned pages for this RDMA operation.
  • the source physical address is provided to the receiver (as part of a request to the software on the receiving node), and in 1214 the logic waits for the receiver-initiated DMA operation to complete.
  • the reference to the source virtual address is then deleted from the table of pinned pages in 1218, and the operation completes.
  • if the test determines that the source virtual address is not in the table of pinned pages, the logic proceeds to 1220.
  • a test is made to determine whether or not there is an unused slot available in the table. If so, in 1222 the operating system is invoked to lock the virtual page corresponding to the source virtual address. The physical address for the source virtual address is resolved in step 1224, and the source virtual address is then added to the table of pinned pages in 1226.
  • If the test of 1220 determines that there is no unused slot available in the table of pinned pages, then in 1228 a test is made to determine whether or not there is an unreferenced slot available. If so, the unreferenced virtual addresses are unlocked in 1230 (i.e., by the OS), and the virtual address is removed from the table of pinned pages in 1232. The logic returns to 1222, now with an available table slot. (A sketch of the pinned-page table follows.)
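For concreteness, one possible backing store for the table of pinned pages used by both flows is sketched here; the slot layout and the reference-count convention are assumptions made for the sketch, not details taken from the described embodiments.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct { void *va; uint64_t pa; unsigned refs; } pin_slot_t;

#define PIN_SLOTS 256
static pin_slot_t pin_table[PIN_SLOTS]; /* va == NULL marks an unused slot */

pin_slot_t *pin_table_lookup(void *va)
{
    for (size_t i = 0; i < PIN_SLOTS; i++)
        if (va != NULL && pin_table[i].va == va)
            return &pin_table[i];
    return NULL;
}

/* Tests 1120/1220: is there an unused slot? */
pin_slot_t *pin_table_find_unused(void)
{
    for (size_t i = 0; i < PIN_SLOTS; i++)
        if (pin_table[i].va == NULL)
            return &pin_table[i];
    return NULL;
}

/* Tests 1128/1228: a slot with refs == 0 may be unpinned and
 * reused; if every slot is referenced, the operation fails. */
pin_slot_t *pin_table_find_unreferenced(void)
{
    for (size_t i = 0; i < PIN_SLOTS; i++)
        if (pin_table[i].va != NULL && pin_table[i].refs == 0)
            return &pin_table[i];
    return NULL;
}
```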
  • FIGS. 13A-D depict a logic flow for preferred embodiments in which RDMA receive operations avoid page pinning. As a general matter, all operations are performed by unprivileged software (e.g., application or library software) without invoking the OS kernel unless otherwise indicated.
  • the logic begins in 1302 in which the application is ready to start an RDMA receive operation.
  • the logic proceeds to 1304 where the source BD index is received from the sender (i.e., the entity that will transfer data to the receiving node).
  • the destination virtual address is looked up to see if it is present in a shadow BD table.
  • the shadow BD table is a data structure maintained by unprivileged application software (not the operating system, as is the case for the normal BD table). Its entries correspond to the entries of the normal BD table and contain additional information, including the corresponding virtual address and reference counts; they are used so that the application software can track its BD mappings.
  • in some embodiments, shadow BDs are not needed.
  • in 1308, a test is made to see if it is present in the shadow BD table. If so, the logic proceeds to step 1310, where a reference to the destination virtual address is added to the shadow BD table for this RDMA operation.
  • in 1312, a test is made to determine whether or not the buffer descriptor for the destination is valid.
  • If not, in 1313 the OS is invoked to revalidate the buffer descriptor (more below; see figure 15). (In some embodiments 1312 and 1313 may be omitted.)
  • the RDMA transaction is initiated by providing the DMA engine with the BD indices for both the source and destination.
  • the logic waits for the DMA operation to complete. Once completed, a test is made in 1316 to determine whether or not the DMA engine indicates a BD fault from the sender end (i.e., the other node that will be transferring the data to the receiving DMA engine/node). For example, if the BD is invalid at the sender, a BD fault is indicated by an event sent to the receiver.
  • in 1317, a test is made to determine whether or not a BD fault was indicated at the receiver end when processing the operation. If no such fault was indicated, the logic proceeds to 1340, where the sender is notified of the successful completion status. The reference to the destination virtual address is then deleted from the shadow BD table in 1342, and the operation completes.
  • If the test of 1316 indicates a BD fault from the sender end, the logic proceeds to 1318, where a fault message is sent to the sender (i.e., the entity transferring the data) containing the BD indices of the source and destination. In 1319, the sender revalidates the relevant BD, and in 1320 the sender re-starts the RDMA receive operation. The logic then returns to 1315 and proceeds as described above.
  • [0129] If the test of 1317 determines that there is a BD fault from the receive end, the logic returns to 1313 and proceeds as described above.
  • [0130] If the test of 1308 determines that the destination virtual address is not in the shadow BD table, then the logic proceeds to 1320.
  • In 1320 a test is made to determine whether or not there is an unused BD slot available. If so, in 1322 the operating system is invoked to map the virtual address of the BD at the available slot. The destination virtual address is then added to the shadow BD table in 1326, and the logic proceeds to 1310, described above.
  • [0131] If the test of 1320 determines that there is no unused slot available in the BD table, then in 1328 a test is made to determine whether or not there is an unreferenced slot available in the shadow BD table. If so, the virtual address is removed from the shadow BD table in 1332. The logic returns to 1322, now with an available BD slot.
  • [0132] If test 1328 determines that there are no unreferenced slots available, the logic fails in 1350. (A sketch of this pin-free receive flow follows.)
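The pin-free receive flow of FIGS. 13A-D, including its fault-and-retry paths, might be sketched as follows. The helper names, the event encoding, and the shadow-BD layout are all hypothetical; the figures' step numbers appear as comments.

```c
#include <stddef.h>

typedef struct { unsigned bdi; void *va; unsigned refs; } shadow_bd_t;
typedef enum { DMA_OK, DMA_FAULT_SENDER_BD, DMA_FAULT_RECEIVER_BD } dma_event_t;

shadow_bd_t *shadow_bd_lookup_or_map(void *va);    /* 1306-1326 */
int  bd_is_valid(unsigned bdi);                    /* 1312 */
void os_bd_revalidate(unsigned bdi, void *va);     /* 1313, figure 15 */
void dma_start_bd(unsigned src_bdi, unsigned dst_bdi, size_t len); /* 1314 */
dma_event_t dma_wait_event(void);                  /* 1315 */
void send_fault_message(unsigned src_bdi, unsigned dst_bdi); /* 1318 */
void notify_sender_complete(void);                 /* 1340 */

int rdma_receive_nopin(unsigned src_bdi, void *dst_va, size_t len)
{
    shadow_bd_t *d = shadow_bd_lookup_or_map(dst_va);
    if (!d)
        return -1;                                 /* 1350: fail */
    d->refs++;                                     /* 1310 */
    if (!bd_is_valid(d->bdi))                      /* 1312 */
        os_bd_revalidate(d->bdi, dst_va);          /* 1313 */

    dma_start_bd(src_bdi, d->bdi, len);            /* 1314 */
    for (;;) {
        dma_event_t ev = dma_wait_event();         /* 1315 */
        if (ev == DMA_FAULT_SENDER_BD) {           /* 1316 */
            /* 1318-1320: report the fault; the sender revalidates
             * its BD and restarts the transfer, so just wait again. */
            send_fault_message(src_bdi, d->bdi);
        } else if (ev == DMA_FAULT_RECEIVER_BD) {  /* 1317 */
            os_bd_revalidate(d->bdi, dst_va);      /* back to 1313 */
            dma_start_bd(src_bdi, d->bdi, len);    /* and 1314 */
        } else {
            break;                                 /* no fault */
        }
    }
    notify_sender_complete();                      /* 1340 */
    d->refs--;                                     /* 1342 */
    return 0;
}
```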
  • FIGS. 14A-D depict a logic flow for preferred embodiments in which RDMA send operations avoid page pinning.
  • the logic begins in 1402 in which the application is ready to start an RDMA send operation.
  • the logic proceeds to 1406, in which the source virtual address is looked up to see if it is present in the shadow BD table.
  • in 1408, a test is made to see if it is present in the shadow table. If so, the logic proceeds to step 1410, where a reference to the source virtual address is added to the shadow BD table for this RDMA operation.
  • a test is made to determine whether or not the buffer descriptor for the source is valid. If not, in 1413 the OS is invoked to revalidate the source buffer descriptor.
  • the source BD table index is provided to the receiver.
  • the logic waits for the DMA operation to complete. Once completed, a test is made in 1416 to determine whether or not a fault is indicated by the receiver end. If there is no such fault indicated, in 1417 the reference to the source virtual address is then deleted from the shadow BD table, and the operation completes.
  • If the test of 1416 determines that there is a fault from the receive end, the logic proceeds to 1418, where the OS is invoked to revalidate the source BD. The RDMA is then restarted, providing the source and destination BD indices to the DMA engine, and the logic returns to 1415 and proceeds as described above.
  • If the test of 1408 determines that the source virtual address is not in the shadow BD table, then the logic proceeds to 1420.
  • In 1420 a test is made to determine whether or not there is an unused BD slot available. If so, in 1422 the operating system is invoked to map the virtual address of the BD at the available slot. The source virtual address is then added to the shadow BD table in 1426.
  • If the test of 1420 determines that there is no unused slot available in the BD table, then in 1428 a test is made to determine whether or not there is an unreferenced slot available. If so, the virtual address is removed from the shadow BD table in 1432. The logic returns to 1422, now with an available BD slot.
  • If test 1428 determines that there are no unreferenced slots available, the logic fails in 1450.
  • Figure 15 depicts the operating system logic for mapping or re-validating a BD, for example as called for in 1313 etc. (For example, the operating system will be invoked by the application in response to a BD fault, or if the application determines that the BD is invalid prior to initiating a DMA operation, or if the application is newly mapping a BD prior to initiating a DMA operation.)
  • the logic begins in 1502 where the application requests to map or re-validate a BD, specifying the BD index and relevant virtual address.
  • a test is made to determine whether or not the virtual address is valid. If not, the request is denied in 1512. If the VA is valid, another test is performed in 1506 to determine whether or not the BD index is valid. If not, the request is denied in 1514. If so, the logic proceeds to determine whether a physical page is currently mapped for the VA.
  • if no physical page is mapped, the logic proceeds to 1550, where an invalid BD value is written; this is done because the VA is valid but there is no corresponding page.
  • when recovering from a BD fault (at the sender or receiver end), the application will perform a memory operation that references the relevant virtual address. This will cause the OS to create a mapping between the VA and a PA. For example, this may occur prior to step 1418 and other steps like it. (A sketch of the map/re-validate logic follows.)
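Figure 15's map/re-validate path might look like the following sketch; the helper names (va_is_valid, va_to_pa, bd_write_invalid) and the return conventions are assumptions made for the sketch.

```c
#include <errno.h>
#include <stdint.h>

int  va_is_valid(const void *va);             /* first validity test */
int  bdi_is_valid(unsigned bdi);              /* test of 1506 */
int  va_to_pa(const void *va, uint64_t *pa);  /* 0 if no page is mapped */
void bd_write(unsigned bdi, uint64_t pa);     /* validate the BD */
void bd_write_invalid(unsigned bdi);          /* 1550 */

int os_bd_map_or_revalidate(unsigned bdi, const void *va)
{
    if (!va_is_valid(va))
        return -EINVAL;                       /* denied in 1512 */
    if (!bdi_is_valid(bdi))
        return -EINVAL;                       /* denied in 1514 */

    uint64_t pa;
    if (!va_to_pa(va, &pa)) {
        /* 1550: the VA is valid but no physical page is mapped, so
         * an invalid BD is written; a later DMA use of this BD then
         * raises a recoverable fault, and the application's touch
         * of the VA (described above) causes the OS to map a page. */
        bd_write_invalid(bdi);
        return 0;
    }
    bd_write(bdi, pa);                        /* the BD is now valid */
    return 0;
}
```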
  • Figure 16A depicts the logic for maintaining buffer descriptors when unmapping a virtual address (e.g., when swapping out pages, or an application indicating it no longer needs the pages).
  • the logic begins in 1602 and proceeds to 1604, in which each BD that is mapped to the physical address no longer needed is invalidated.
  • the logic ends in 1699.
  • Figure 16B depicts the logic for reclaiming a physical page. The logic begins in 1622 and proceeds to 1624, where a test is made to determine whether or not the physical page is in use by the DMA engine. If it is in use, the request is denied in 1640. If it is not in use, the request is allowed in 1630.
  • memory management software will consider whether a physical page is being used by a DMA engine before using it for another purpose. The OS may in fact use such a page, but part of its selection logic will consider whether the page is in use by DMA. (A sketch of these checks follows.)
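The checks of Figures 16A-B could be sketched as below; bd_maps_pa and dma_page_in_use are hypothetical queries over the BD table and the DMA engine's state.

```c
#include <stdint.h>

#define BD_TABLE_SIZE 1024                   /* as in the BD-table sketch */

int  bd_maps_pa(unsigned bdi, uint64_t pa);  /* hypothetical query */
void bd_write_invalid(unsigned bdi);
int  dma_page_in_use(uint64_t pa);           /* test of 1624 */

/* Figure 16A: when a VA is unmapped, invalidate every BD that refers
 * to the departing physical page (1604), so later DMA commands fault
 * cleanly instead of touching reclaimed memory. */
void os_unmap_invalidate_bds(uint64_t pa)
{
    for (unsigned bdi = 0; bdi < BD_TABLE_SIZE; bdi++)
        if (bd_maps_pa(bdi, pa))
            bd_write_invalid(bdi);
}

/* Figure 16B: a physical page may be reclaimed only when no DMA
 * engine is using it (deny in 1640, allow in 1630). */
int os_try_reclaim_page(uint64_t pa)
{
    return dma_page_in_use(pa) ? 0 : 1;
}
```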
  • preferred embodiments do not guarantee page presence. Instead, preferred embodiments allow the RDMA operation to fail if the OS has unmapped physical pages, and allow the software to recover from that failure and restart the RDMA operation. This provides an efficient RDMA operation that does not incur the cost of invoking the OS in typical transactions. It also increases the flexibility of paging software by not requiring pinned pages for DMA operations.


Abstract

The present invention relates to systems and methods for remote direct memory access without page locking by the operating system. A multi-node computer system comprises a plurality of interconnected processing nodes. Direct memory access engines are used in a manner that avoids page locking.
PCT/US2007/082869 2006-11-08 2007-10-29 System and method for remote direct memory access without page locking by the operating system WO2008057833A2 (fr)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
US11/594,447 2006-11-08
US11/594,447 US20080109604A1 (en) 2006-11-08 2006-11-08 Systems and methods for remote direct memory access to processor caches for RDMA reads and writes
US11/594,427 US20080109569A1 (en) 2006-11-08 2006-11-08 Remote DMA systems and methods for supporting synchronization of distributed processes in a multi-processor system using collective operations
US11/594,427 2006-11-08
US11/594,446 2006-11-08
US11/594,443 US20080109573A1 (en) 2006-11-08 2006-11-08 RDMA systems and methods for sending commands from a source node to a target node for local execution of commands at the target node
US11/594,446 US7533197B2 (en) 2006-11-08 2006-11-08 System and method for remote direct memory access without page locking by the operating system
US11/594,443 2006-11-08

Publications (2)

Publication Number Publication Date
WO2008057833A2 true WO2008057833A2 (fr) 2008-05-15
WO2008057833A3 WO2008057833A3 (fr) 2008-10-02

Family

ID=39365212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/082869 WO2008057833A2 (fr) 2007-10-29 System and method for remote direct memory access without page locking by the operating system

Country Status (1)

Country Link
WO (1) WO2008057833A2 (fr)



Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5887134A (en) * 1997-06-30 1999-03-23 Sun Microsystems System and method for preserving message order while employing both programmed I/O and DMA operations

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160378712A1 (en) * 2015-06-23 2016-12-29 International Business Machines Corporation Lock-free processing of stateless protocols over rdma
US9953006B2 (en) * 2015-06-23 2018-04-24 International Business Machines Corporation Lock-free processing of stateless protocols over RDMA
US10255230B2 (en) 2015-06-23 2019-04-09 International Business Machines Corporation Lock-free processing of stateless protocols over RDMA
WO2018049210A1 * 2016-09-08 2018-03-15 Microsoft Technology Licensing, Llc Multicast apparatuses and methods for distributing data to multiple receivers in high-performance computing and cloud-based networks
US10891253B2 (en) 2016-09-08 2021-01-12 Microsoft Technology Licensing, Llc Multicast apparatuses and methods for distributing data to multiple receivers in high-performance computing and cloud-based networks
CN110134618A * 2018-02-02 2019-08-16 富士通株式会社 Storage control device, storage control method, and recording medium
CN112673351A * 2018-07-04 2021-04-16 图核有限公司 Streaming engine
US11847074B2 (en) 2020-11-02 2023-12-19 Honeywell International Inc. Input/output device operational modes for a system with memory pools

Also Published As

Publication number Publication date
WO2008057833A3 (fr) 2008-10-02

Similar Documents

Publication Publication Date Title
US7533197B2 (en) System and method for remote direct memory access without page locking by the operating system
US20080109569A1 (en) Remote DMA systems and methods for supporting synchronization of distributed processes in a multi-processor system using collective operations
US20080109573A1 (en) RDMA systems and methods for sending commands from a source node to a target node for local execution of commands at the target node
US5864738A (en) Massively parallel processing system using two data paths: one connecting router circuit to the interconnect network and the other connecting router circuit to I/O controller
US7076597B2 (en) Broadcast invalidate scheme
US9137179B2 (en) Memory-mapped buffers for network interface controllers
EP0817042B1 (fr) Système multiprocesseur avec dispositif pour optimiser des opérations spin-lock
CA2414438C (fr) Systeme et procede de gestion de semaphores et d'operations atomiques dans un multiprocesseur
US5983326A (en) Multiprocessing system including an enhanced blocking mechanism for read-to-share-transactions in a NUMA mode
JP3871305B2 (ja) マルチプロセッサ・システムにおけるメモリ・アクセスの動的直列化
US5265235A (en) Consistency protocols for shared memory multiprocessors
US6209065B1 (en) Mechanism for optimizing generation of commit-signals in a distributed shared-memory system
US5924119A (en) Consistent packet switched memory bus for shared memory multiprocessors
US8255591B2 (en) Method and system for managing cache injection in a multiprocessor system
US5271020A (en) Bus stretching protocol for handling invalid data
US20080109604A1 (en) Systems and methods for remote direct memory access to processor caches for RDMA reads and writes
WO2004088462A2 (fr) Programmation et gestion de taches de micrologiciel a l'aide de materiel
EP2406723A1 (fr) Interface extensible pour connecter de multiples systèmes d'ordinateur qui effectue un appariement d'entête mpi en parallèle
US8086766B2 (en) Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
US11960945B2 (en) Message passing circuitry and method
US7739451B1 (en) Method and apparatus for stacked address, bus to memory data transfer
US7610451B2 (en) Data transfer mechanism using unidirectional pull bus and push bus
WO2008057833A2 (fr) 2008-05-15 System and method for remote direct memory access without page locking by the operating system
US10740256B2 (en) Re-ordering buffer for a digital multi-processor system with configurable, scalable, distributed job manager
US7003628B1 (en) Buffered transfer of data blocks between memory and processors independent of the order of allocation of locations in the buffer

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07854486

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07854486

Country of ref document: EP

Kind code of ref document: A2