WO2023143103A1 - Message processing method, gateway device and storage system - Google Patents

Message processing method, gateway device and storage system

Info

Publication number
WO2023143103A1
Authority
WO
WIPO (PCT)
Prior art keywords
rdma
nof
storage node
gateway device
message
Prior art date
Application number
PCT/CN2023/071947
Other languages
English (en)
French (fr)
Inventor
Xu Yan (徐晏)
Meng Wanhong (孟万红)
Chen Haiyan (陈海燕)
Du Kai (杜凯)
Huang Jun (黄俊)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023143103A1


Classifications

    • H: Electricity; H04: Electric communication technique; H04L: Transmission of digital information, e.g. telegraphic communication
    • H04L 49/901: Buffering arrangements using storage descriptor, e.g. read or write pointers
    • H04L 5/0053: Allocation of signaling, i.e. of overhead other than pilot signals
    • H04L 69/06: Notations for structuring of protocol data, e.g. abstract syntax notation one [ASN.1]
    • H04L 69/22: Parsing or analysis of headers

Definitions

  • the present application relates to the technical field of storage, and in particular to a message processing method, a gateway device and a storage system.
  • non-volatile memory express (NVM Express, abbreviated NVMe) is a bus transport protocol specification based on the device logical interface. It defines the software and hardware standards for accessing solid state disks (SSD) over the peripheral component interconnect express (PCIe) bus, and instructions that follow the format specified in the NVMe standard are called NVMe instructions.
  • NVMe over fabric (NOF) is an NVMe-based protocol carried on the network side. NOF supports transmitting NVMe commands through various transport layer protocols, thereby extending the application scenarios of NVMe to storage area networks (SAN) and allowing hosts to access storage devices through the network.
  • the basic interaction process of the storage solution based on the NOF protocol includes: the client sends a first NOF request message, and the first NOF request message carries an NVMe command.
  • the server receives the first NOF request message.
  • the server parses the first NOF request message to obtain the NVMe command carried in the first NOF request message.
  • the server executes an operation corresponding to the NVMe instruction on the NVMe storage medium in the server.
  • the NVMe storage medium is usually a hard disk.
  • because the performance of a hard disk is usually inferior to that of dynamic random access memory (DRAM), and the instruction set for hard disk operations is more complex than that for memory operations, the performance of current NOF-based storage solutions is low.
  • Embodiments of the present application provide a message processing method, a gateway device, and a storage system, which can improve storage performance.
  • the technical solution is as follows.
  • in a first aspect, a message processing method is provided. The method includes: a gateway device receives a first NOF request message from a client, where the first NOF request message carries an NVMe instruction, and the NVMe instruction indicates a read/write operation on a first destination address; the gateway device obtains information of a first RDMA storage node based on the first destination address; and the gateway device sends a first RDMA request message to the first RDMA storage node, where the first RDMA request message carries an RDMA command corresponding to the NVMe command.
  • the three types of network elements, client, gateway device, and RDMA storage node, are named according to the function of the device or the role the device plays in the solution.
  • the client is the entity that initiates the NOF request message, or the NOF requester.
  • An RDMA storage node is an entity that performs read/write operations in response to RDMA request messages, also known as an RDMA server.
  • the gateway device is equivalent to the entrance for accessing the RDMA storage node; the NOF request message sent by the client must pass through this entrance before being transmitted to the RDMA storage node.
  • Gateway devices include but are not limited to servers, server agents, routers, switches, firewalls, and the like.
  • the RDMA instruction corresponds to the NVMe instruction, for example, the operation type indicated by the RDMA instruction is the same as the operation type indicated by the NVMe instruction, and the operation type includes a read operation and a write operation.
  • the first NOF request message carries an NVMe read command
  • the first RDMA request message carries an RDMA read command
  • the first NOF request message carries an NVMe write command
  • the first RDMA request message carries an RDMA write command.
  • the data to be processed indicated by the RDMA instruction is the same as the data to be processed indicated by the NVMe instruction.
  • for example, the data to be read indicated by the RDMA instruction is the same as the data to be read indicated by the NVMe instruction, or the data to be saved indicated by the RDMA instruction is the same as the data to be saved indicated by the NVMe instruction.
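  • as an illustration of the operation-type correspondence above, the following minimal Python sketch (message fields and RDMA operation names are invented for illustration; only the opcode pairing follows the text) maps an NVMe opcode to the matching RDMA operation:

```python
# Minimal sketch: pair the NVMe operation type with the matching RDMA
# operation type. Opcode values 01h (write) and 02h (read) are quoted from
# the NVMe layer description later in this text; the RDMA operation names
# are illustrative constants, not a real verbs API.
NVME_WRITE, NVME_READ = 0x01, 0x02

def rdma_op_for(nvme_opcode: int) -> str:
    if nvme_opcode == NVME_READ:
        return "RDMA_READ"    # NVMe read command -> RDMA read command
    if nvme_opcode == NVME_WRITE:
        return "RDMA_WRITE"   # NVMe write command -> RDMA write command
    raise ValueError(f"unsupported NVMe I/O opcode: {nvme_opcode:#04x}")

assert rdma_op_for(0x02) == "RDMA_READ"
```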
  • the first destination address represents a location in the storage space provided by the NVMe storage medium.
  • the first destination address is a logical address (or called a virtual address).
  • the first destination address includes a start logical block address (start LBA) and a block number (also called number of logical blocks).
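  • as an illustrative aside, a first destination address of this form can be converted to a byte offset and length as sketched below; the 512-byte logical block size is an assumption for the example only, not taken from the text:

```python
# Illustrative conversion of the first destination address (start LBA plus
# block number) into a byte offset and length; the 512-byte logical block
# size is an assumption for this example only.
BLOCK_SIZE = 512

def lba_to_byte_range(start_lba: int, num_blocks: int) -> tuple[int, int]:
    return start_lba * BLOCK_SIZE, num_blocks * BLOCK_SIZE

offset, length = lba_to_byte_range(start_lba=2048, num_blocks=64)
assert (offset, length) == (1_048_576, 32_768)   # 1 MiB in, 32 KiB long
```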
  • the above information of the first RDMA storage node includes at least one of the following: a second destination address, network location information of the first RDMA storage node, identifiers of one or more queue pairs (QP) in the first RDMA storage node, and a remote key (R_Key).
  • the second destination address points to the memory space in the first RDMA storage node.
  • the memory space is a section of space in the memory, and the location of the memory space in the memory is indicated by the second destination address.
  • the form of the second destination address includes multiple implementation manners.
  • the second destination address includes a start address and a length.
  • for example, if the value of the start address is 0x1FFFF and the value of the length is 32 KB, the second destination address points to the 32 KB memory space of the first RDMA storage node starting from address 0x1FFFF.
  • the second destination address includes a start address and an end address, the value of the start address is 0x1FFFF, and the value of the end address is 0x2FFFF, and the second destination address points to the memory of the first RDMA storage node from address 0x1FFFF to The space at address 0x2FFFF.
  • the second destination address points to the memory space in the memory of the first RDMA storage node that stores the data to be read.
  • the second destination address indicates the memory space, in the memory of the first RDMA storage node, into which the data is to be written.
  • the second destination address is a logical address (or called a virtual address).
  • the start address in the second destination address is specifically a virtual address (virtual address, VA), and the length in the second destination address is specifically a length of direct memory access (DMA length).
  • the network location information of the first RDMA storage node is used to identify the first RDMA storage node in the network.
  • the network location information is used to guide the network device between the gateway device and the first RDMA storage node to perform routing and forwarding.
  • the network location information of the first RDMA storage node includes at least one of a MAC address, an IP address, a multi-protocol label switching (multi-protocol label switching, MPLS) label, or a segment identifier (segment ID, SID).
  • a QP includes a send queue (send queue, SQ) and a receive queue (receive queue, RQ), and QP is used to manage various types of messages.
  • R_Key indicates the right to access the memory of the first RDMA storage node.
  • R_Key is also called memory key.
  • the R_Key indicates the right to access a specific memory space on the first RDMA storage node.
  • the specific memory space is, for example, a memory space storing data to be read, or a pre-registered memory space.
  • the R_Key indicates the right to access the memory of the first RDMA storage node and the right to access the memory of the second RDMA storage node.
  • DMA length indicates the length of the RDMA operation.
  • the value of DMA length is 16KB, which indicates that the RDMA operation is performed on the memory space with a length of 16KB.
  • RDMA operations include write operations and read operations.
  • a write operation is, for example, writing data into memory.
  • a read operation is, for example, reading data from memory.
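  • the node information enumerated above can be pictured as a small record. The sketch below assumes the start-address-plus-length form of the second destination address; the field names are illustrative, not from this application:

```python
from dataclasses import dataclass

@dataclass
class RdmaNodeInfo:
    start_va: int     # second destination address: virtual start address (VA)
    dma_length: int   # DMA length: size of the memory space / RDMA operation
    ip_address: str   # network location information used for routing and forwarding
    qp_id: int        # identifier of a queue pair (QP) on the storage node
    r_key: int        # remote key (R_Key): right to access the node's memory

# e.g. the 32 KB space starting at address 0x1FFFF mentioned above
node = RdmaNodeInfo(0x1FFFF, 32 * 1024, "192.0.2.10", 17, 0x1234)
```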
  • the gateway device converts an access request to an NVMe node into an access request to an RDMA node by executing the method in the first aspect above. Since the storage medium of the NVMe node is a hard disk while the storage medium of the RDMA node is memory, and memory provides faster read and write speeds than a hard disk, this method improves storage performance. Of course, if the NVMe instruction indicates a read operation, for the above method to successfully access the corresponding data, the first RDMA storage node should store the data to be read indicated by the NVMe instruction. In addition, since the instruction set for memory operations is simpler than that for hard disk operations, this method reduces the complexity of executing read and write instructions on storage nodes.
  • the client can use the storage service provided by the RDMA storage node by initiating access according to the original NOF process, without perceiving changes in the storage node and without having to support RDMA itself. This maintains compatibility with existing NOF storage solutions, which facilitates rapid service rollout.
  • the first RDMA request message further includes information about the first RDMA storage node.
  • the gateway device acquiring the information of the first RDMA storage node based on the first destination address includes: the gateway device queries and obtains the information of the first RDMA storage node from the first correspondence based on the first destination address.
  • the first correspondence refers to the correspondence between the first destination address and the information of the first RDMA storage node.
  • the first correspondence indicates the correspondence between the first destination address and the information of the first RDMA storage node.
  • the first correspondence indicates the correspondence between the first destination address and the information of the first RDMA storage node includes multiple implementation manners.
  • the first correspondence includes information about the first destination address and the first RDMA storage node.
  • the first corresponding relationship is equivalent to a table, the index of the table is the first destination address, and the value of the table is the information of the first RDMA storage node.
  • the first correspondence does not include the information of the first RDMA storage node itself, but includes other information associated with it, such as metadata of the information of the first RDMA storage node, or the file name or uniform resource locator (URL) of a file in which the information of the first RDMA storage node is stored.
  • the addressing task of the storage node is offloaded (addressing refers to the process of finding the target storage node according to the target NVMe address), thereby reducing the CPU pressure and network IO pressure of the NOF storage node.
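  • a minimal sketch of the first correspondence in its table form (index: first destination address; value: node information); the granularity and persistence of the table are implementation choices not specified here:

```python
# First correspondence as a plain lookup table: index is the first
# destination address (here reduced to the start LBA), value is the node
# information (e.g. the RdmaNodeInfo record sketched earlier).
first_correspondence: dict = {}

def add_mapping(start_lba: int, node_info) -> None:
    first_correspondence[start_lba] = node_info

def node_for(start_lba: int):
    """Lookup performed by the gateway for each first NOF request."""
    return first_correspondence[start_lba]
```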
  • the above method further includes: the gateway device receives an RDMA response message from the first RDMA storage node, where the RDMA response message is a response message to the first RDMA request message; the gateway device generates a first NOF response message based on the RDMA response message; and the gateway device sends the first NOF response message to the client.
  • the first NOF response message is a response message to the first NOF request message.
  • the first NOF response message indicates to respond to the NVMe command in the first NOF request message.
  • the first NOF response message includes the data requested by the first NOF request message.
  • the first NOF response message also includes a completion queue element (CQE), and the CQE is used to indicate that the NVMe read operation has been completed.
  • the first NOF response message is a NOF write response message.
  • the first NOF response message includes a CQE, and the CQE is used to indicate that the NVMe write operation has been completed, or that the data has been saved successfully.
  • the gateway device acts as a NOF protocol stack proxy, replying the response message to the client on behalf of the RDMA storage node.
  • since the response message perceived by the client is still a NOF message, the client does not need to be aware of the protocol message conversion logic, which reduces the difficulty of maintaining the client.
  • the gateway device generating the first NOF response message based on the RDMA response message includes: the gateway device obtains RDMA status information based on the RDMA response message, where the RDMA status information indicates the correspondence between the RDMA response message and the first RDMA request message; the gateway device queries NOF status information from a second correspondence according to the RDMA status information, where the NOF status information indicates the correspondence between the first NOF response message and the first NOF request message; and the gateway device generates the first NOF response message based on the NOF status information.
  • the second correspondence refers to the correspondence between RDMA status information and NOF status information.
  • the second correspondence includes a correspondence between RDMA status information and NOF status information.
  • the first NOF response message includes NOF status information.
  • the first NOF request message includes NOF status information.
  • the NOF status information is the packet sequence number in the first NOF request message.
  • the NOF status information is the packet sequence number in the first NOF response message.
  • the NOF state information is a value converted by a set rule for the packet sequence number in the first NOF request message.
  • in this way, the NOF message returned by the gateway device to the client carries accurate status information, thereby maintaining the continuity of the NOF-based session between the client and the gateway device and improving the communication success rate, with low complexity.
  • the method further includes: the gateway device obtains the NOF status information based on the first NOF request message; and the gateway device establishes the second correspondence, where the second correspondence is the correspondence between the NOF status information and the RDMA status information.
  • the gateway device associates the NOF state with the RDMA state during the process of interacting with the client and the RDMA node, thereby providing accurate state information for the process of replying the NOF message.
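  • under the assumption that the RDMA status information is the PSN of the RDMA-side exchange and the NOF status information is a handful of client-side fields (PSN, SQHD, command ID, per the lists further below), the second correspondence can be sketched as follows; all names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class NofStatus:
    psn: int          # packet sequence number of the client-facing NOF session
    sqhd: int         # send queue head pointer to report back to the client
    command_id: int   # identifier of the NVMe command being answered

# second correspondence: RDMA-side PSN -> NOF status information
second_correspondence: dict = {}

def record_state(rdma_psn: int, nof_status: NofStatus) -> None:
    """Established when the gateway forwards the first RDMA request message."""
    second_correspondence[rdma_psn] = nof_status

def nof_status_for(rdma_psn: int) -> NofStatus:
    """Queried when the RDMA response arrives, to build the first NOF response."""
    return second_correspondence.pop(rdma_psn)
```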
  • the first RDMA request message includes NOF status information
  • the RDMA response message includes NOF status information
  • the gateway device generating the first NOF response message based on the RDMA response message includes: the gateway device obtains the NOF status information from the RDMA response message; and the gateway device generates the first NOF response message based on the NOF status information.
  • the gateway device can obtain the NOF status information without maintaining additional table entries locally, thus saving the storage space of the gateway device and reducing the resource overhead caused by the gateway device looking up and writing tables.
  • the first RDMA request message includes a first NOF message header;
  • the RDMA response message includes a second NOF message header generated by the first RDMA storage node based on the first NOF message header;
  • the first NOF response message includes the second NOF message header.
  • the first NOF packet header refers to a packet header of the NOF packet.
  • the first NOF packet header is the packet header of the first NOF packet corresponding to the RDMA request packet.
  • the NOF packet header includes the packet header corresponding to the fabric and the NVMe layer information.
  • the so-called “fabric” refers to the network between the host and the storage medium. Typical forms of fabrics are, for example, Ethernet, Fiber Channel, InfiniBand (IB), and the like.
  • the specific format of the packet header corresponding to the fabric is related to the implementation of the fabric.
  • the packet header corresponding to the fabric may include the packet header corresponding to the multi-layer protocol.
  • the fabric is implemented through RoCEv2.
  • the packet headers corresponding to the fabric include the MAC header (corresponding to the link layer protocol), the IP header (corresponding to the network layer protocol), the UDP header (corresponding to the transport layer protocol) and the IB header (corresponding to the transport layer protocol).
  • the packet header corresponding to the fabric is a packet header corresponding to a protocol.
  • the fabric is implemented through InfiniBand, and the packet header corresponding to the fabric is the IB header.
  • the gateway device can obtain the NOF status information without maintaining additional entries locally, thus saving the storage space of the gateway device and reducing the resource overhead caused by the gateway device looking up and writing tables.
  • the work of generating the NOF message header is transferred to the RDMA storage node for execution, thereby reducing the processing pressure of the gateway device.
  • the above RDMA state information includes a packet sequence number (packet sequence number, PSN).
  • the above NOF status information includes at least one of a PSN, a send queue head pointer (SQ head pointer, SQHD), a command identifier (command ID), a destination queue pair (DQP), a virtual address, an R_Key, or a direct memory access length (DMA length).
  • the PSN is used to support detection and retransmission of lost packets.
  • SQHD is used to indicate the current head of the submission queue (SQ).
  • the SQHD is used to indicate to the host the consumed entries in the SQ (that is, the read/write commands that have been fetched from the SQ).
  • the command ID is the identifier of the command to which the completion or error relates.
  • R_Key is used to describe the permission of the remote device to access the local memory, such as the access permission of the client to the memory of the RDMA storage node.
  • R_Key is also called memory key.
  • R_Key is usually used with VA.
  • R_Key is also used to help hardware identify the page table that translates virtual addresses into physical addresses.
  • DMA length indicates the length of the RDMA operation.
  • the above method further includes: the gateway device obtains information of the second RDMA storage node based on the above first destination address; and when the above NVMe instruction indicates a write operation, the gateway device sends a second RDMA request message to the second RDMA storage node, where the second RDMA request message carries the RDMA command corresponding to the above NVMe command.
  • the second RDMA request message further includes information about the second RDMA storage node.
  • both the above first RDMA request message and the above second RDMA request message are multicast messages; or, both the above first RDMA request message and the above second RDMA request message are unicast messages.
  • before the gateway device sends the first RDMA request message to the first RDMA storage node, the gateway device also obtains information of the second RDMA storage node based on the above first destination address; when the above NVMe instruction indicates a read operation, the gateway device selects the first RDMA storage node from the first RDMA storage node and the second RDMA storage node based on a load sharing algorithm.
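  • the dual-node behavior can be sketched as below; the hash-based read selection is merely one possible load sharing algorithm, chosen for illustration:

```python
import zlib

NVME_WRITE = 0x01   # NVMe write opcode, as quoted later in this text

def target_nodes(nvme_opcode: int, replicas: list, start_lba: int) -> list:
    if nvme_opcode == NVME_WRITE:
        return list(replicas)   # write: one RDMA request per replica node
    # read: pick a single replica via a simple load sharing policy
    index = zlib.crc32(start_lba.to_bytes(8, "big")) % len(replicas)
    return [replicas[index]]

assert target_nodes(NVME_WRITE, ["node1", "node2"], 0) == ["node1", "node2"]
```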
  • the foregoing method further includes: the gateway device receiving the foregoing first correspondence from a device other than the gateway device; or, the gateway device generating the foregoing first correspondence.
  • the gateway device generating the first correspondence includes: the gateway device assigning an NVMe logical address to the first RDMA storage node to obtain the first destination address.
  • the gateway device establishes a corresponding relationship between the first destination address and the information of the first RDMA storage node, thereby generating the first corresponding relationship.
  • how the gateway device acquires the information of the first RDMA storage node includes multiple implementation manners.
  • the first RDMA storage node actively reports the information of the node to the gateway device.
  • the first RDMA storage node generates and sends an RDMA registration message to the gateway device, and the gateway device receives the RDMA registration message from the first RDMA storage node, and obtains the information of the first RDMA storage node from the RDMA registration message.
  • the gateway device pulls the information of the first RDMA storage node from the first RDMA storage node.
  • the gateway device generates and sends a query request to the first RDMA storage node, where the query request is used to instruct to acquire information of the first RDMA storage node.
  • the first RDMA storage node receives the query request, generates and sends a query response to the gateway device, and the query response includes information of the first RDMA storage node.
  • the gateway device receives the query response, and obtains the information of the first RDMA storage node from the query response.
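  • both acquisition paths can be sketched as follows; the message dictionaries and the send_query transport callback are invented for illustration, and only the push/pull structure follows the text:

```python
first_correspondence: dict = {}   # start LBA -> information of the node

def on_rdma_registration(msg: dict) -> None:
    """Push path: the first RDMA storage node actively reports its info."""
    first_correspondence[msg["start_lba"]] = msg["node_info"]

def pull_node_info(send_query, node_address: str) -> dict:
    """Pull path: the gateway sends a query request and parses the response."""
    response = send_query(node_address, {"type": "GET_NODE_INFO"})
    return response["node_info"]
```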
  • after the gateway device receives the first NOF request message from the client, the gateway device also obtains information of a NOF storage node based on the first destination address; the gateway device generates a second NOF request message based on the first NOF request message; and the gateway device sends the second NOF request message to the NOF storage node.
  • after the gateway device sends the second NOF request message to the NOF storage node, the gateway device also receives a second NOF response message from the NOF storage node; the gateway device generates a third NOF response message based on the second NOF response message; and the gateway device sends the third NOF response message to the client.
  • the original NOF interaction process is supported, so as to maintain compatibility with the original NOF storage solution.
  • the above-mentioned first RDMA storage node is a storage server, a memory or a storage array.
  • the memory is a dynamic random access memory (DRAM), a storage class memory (SCM), or a dual in-line memory module (DIMM).
  • in a second aspect, a gateway device is provided, and the gateway device has a function of implementing the first aspect or any optional manner of the first aspect.
  • the gateway device includes at least one unit, and each unit of the gateway device is configured to implement the method provided in the first aspect or any optional manner of the first aspect.
  • the units in the gateway device are implemented by software, and the units in the gateway device are program modules. In some other embodiments, the units in the gateway device are implemented by hardware or firmware.
  • in a third aspect, a gateway device is provided, including a processor and a network interface, where the processor is coupled to a memory, the network interface is used to receive or send messages, and at least one computer program instruction is stored in the memory; the at least one computer program instruction is loaded and executed by the processor, so that the gateway device implements the method provided in the first aspect or any optional manner of the first aspect.
  • the processor of the gateway device is a processing circuit.
  • the processor of the gateway device is a programmable logic circuit, for example, the processor is a field-programmable gate array (field-programmable gate array, FPGA), or a programmable device such as a coprocessor.
  • the memory of the gateway device is a storage medium.
  • the storage medium of the gateway device includes but is not limited to a memory or a hard disk; the memory is, for example, a DRAM, an SCM, or a DIMM.
  • the hard disk is, for example, a solid state disk (solid state disk, SSD) or a mechanical hard disk (hard disk drive, HDD).
  • for specific details of the gateway device provided in the third aspect, refer to the first aspect or any optional manner of the first aspect; details are not repeated here.
  • in a fourth aspect, a gateway device is provided, including a main control board and an interface board, and may further include a switching network board.
  • the gateway device is configured to execute the method in the first aspect or any possible implementation manner of the first aspect.
  • in a fifth aspect, a computer-readable storage medium is provided, where at least one instruction is stored in the storage medium; when the instruction is run on a computer, the computer is caused to execute the method provided in the first aspect or any optional manner of the first aspect.
  • in a sixth aspect, a computer program product is provided. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and run by a computer, the computer is caused to execute the method provided in the first aspect or any optional manner of the first aspect.
  • in a seventh aspect, a chip is provided, including a programmable logic circuit and/or program instructions; when running, the chip is used to implement the method provided in the first aspect or any optional manner of the first aspect.
  • the chip is a network card.
  • in an eighth aspect, a storage system is provided, including the gateway device described in the second aspect, the third aspect, or the fourth aspect and one or more RDMA storage nodes, where the one or more RDMA storage nodes include the first RDMA storage node.
  • the gateway device is used to receive the first NOF request message from the client, obtain the information of the first RDMA storage node based on the first destination address, and send the first RDMA request message to the first RDMA storage node; the first RDMA storage node is used to receive the first RDMA request message from the gateway device and perform the read/write operation on the second destination address according to the above RDMA command.
  • while supporting the original NOF process, the above storage system also introduces support for RDMA storage nodes, thereby making full use of the advantages of RDMA memory storage and greatly improving the overall performance of the system. At the same time, the client perceives no change when using the NOF storage service, thereby ensuring availability.
  • the first RDMA storage node is configured to send the information of the first RDMA storage node to the gateway device; the gateway device is configured to receive the information of the first RDMA storage node sent by the first RDMA storage node information, establishing a first corresponding relationship based on the information of the first RDMA storage node.
  • the first RDMA storage node is configured to generate an RDMA response message based on the first RDMA request message, and send the above RDMA response message to the gateway device;
  • the gateway device is configured to receive an RDMA response message; generate a first NOF response message based on the RDMA response message; and send the first NOF response message to the client.
  • the storage system further includes one or more NOF storage nodes.
  • through the above method, a mixed networking mode of NOF hard disk storage plus RDMA memory storage is supported, which improves networking flexibility and supports more business scenarios.
  • a message processing method is provided, including: a first RDMA storage node receives a first RDMA request message from a gateway device, where the first RDMA request message includes an RDMA command and a first NOF message header, and the RDMA instruction indicates a read/write operation on a second destination address; the first RDMA storage node performs the read/write operation on the second destination address according to the above RDMA instruction; and the first RDMA storage node sends an RDMA response message to the gateway device, where the RDMA response message is a response message to the first RDMA request message, and the RDMA response message includes a second NOF message header corresponding to the first NOF message header.
  • the correspondence between the second NOF header and the first NOF header means that the NOF status information carried by the second NOF header is the same as the NOF status information carried by the first NOF header.
  • the RDMA storage node undertakes part of the work of generating the NOF header and returns the NOF header to the gateway device along with the RDMA response message, thereby reducing the processing pressure required for the gateway device to restore the NOF header; at the same time, the gateway device does not need to cache the NOF header from the NOF request message, thereby saving internal storage space of the gateway device.
  • the generating the second NOF header based on the first NOF header includes: filling missing content in the first NOF header to obtain the second NOF header.
  • the above generation of the second NOF message header based on the first NOF message header includes: modifying the invariant cyclic redundancy check (ICRC) in the first NOF message header to obtain the second NOF message header.
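  • a toy rendering of this header hand-back on the storage node side; the field names and the CRC routine are stand-ins (a real ICRC is computed per the InfiniBand specification), and only the fill-then-refresh structure follows the text:

```python
import zlib

def build_second_nof_header(first_header: dict, sqhd: int) -> dict:
    """Fill missing content in the echoed first NOF header and refresh the
    check value, yielding the second NOF header."""
    second = dict(first_header)   # start from the header echoed in the request
    second["sqhd"] = sqhd         # fill missing content (e.g. the SQ head pointer)
    body = repr(sorted(second.items())).encode()
    second["icrc"] = zlib.crc32(body)   # stand-in for the real ICRC update
    return second

hdr = build_second_nof_header({"psn": 101, "command_id": 7}, sqhd=3)
assert hdr["sqhd"] == 3 and "icrc" in hdr
```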
  • FIG. 1 is a schematic structural diagram of an NVMe SSD provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a host communicating with an NVMe controller provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a queue pair mechanism provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the architecture of a NOF storage system provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the parsing path of a storage node through the NOF protocol stack provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of the architecture of an RDMA system provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a queue pair mechanism provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a gateway device processing a message provided by an embodiment of the present application.
  • FIG. 11 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the architecture of a storage system after a gateway device is deployed, provided by an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a scenario in which a gateway device acts as a storage node, provided by an embodiment of the present application.
  • FIG. 14 is a schematic diagram of a logical functional architecture of a gateway device provided by an embodiment of the present application.
  • FIG. 15 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a logical functional architecture of a gateway device provided by an embodiment of the present application.
  • FIG. 17 is a functional schematic diagram of an address translation table provided by an embodiment of the present application.
  • FIG. 18 is a schematic diagram of the establishment process and lookup process of a NOF context table provided by an embodiment of the present application.
  • FIG. 19 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 20 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 21 is a schematic diagram of a logical functional architecture of a gateway device provided by an embodiment of the present application.
  • FIG. 22 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 23 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 24 is a schematic diagram of a logical functional architecture of a gateway device provided by an embodiment of the present application.
  • FIG. 25 is a flowchart of a message processing method provided by an embodiment of the present application.
  • FIG. 26 is a schematic structural diagram of a message processing device 700 provided by an embodiment of the present application.
  • FIG. 27 is a schematic structural diagram of a gateway device 800 provided by an embodiment of the present application.
  • FIG. 28 is a schematic structural diagram of a gateway device 900 provided by an embodiment of the present application.
  • a read/write operation means a read operation or a write operation.
  • NVMe is a bus transport protocol specification based on the device logical interface (equivalent to the application layer in a communication protocol). It is used for accessing non-volatile storage media attached via a peripheral component interconnect express (PCIe) bus (such as a solid-state drive using flash memory), although in theory the PCIe bus protocol is not strictly required.
  • NVMe is a protocol and a set of software and hardware standards that allow solid state disks (SSD) to use the PCIe bus; PCIe is the actual physical connection channel.
  • NVM is the acronym for non-volatile memory; flash memory, as commonly used in SSDs, is a typical form of NVM.
  • the NVMe specification mainly provides a low-latency, internally concurrent native interface for flash-based storage devices, and also provides native storage concurrency support for modern central processing units (CPUs), computer platforms, and related applications, so that host hardware and software can take full advantage of the parallel storage capabilities of solid-state storage devices.
  • NVMe reduces input/output (I/O) operation latency, increases the number of concurrent operations, and provides larger-capacity operation queues.
  • the interface between the host and an NVMe SSD (a typical NVMe storage medium) is based on a series of paired submission and completion queues. These queues are created by the driver and shared between the driver (running on the host) and the NVMe SSD. The queues themselves can be located either in host shared memory or in memory provided by the NVMe SSD. Once the submission queue and completion queue are configured, they are used for communication between the driver and the NVMe SSD.
  • NVMe SSD includes NVMe controller and flash memory array.
  • the NVMe controller is responsible for communicating with the host, and the flash array is responsible for storing the data.
  • FIG. 2 shows the flow of communication between the host and the NVMe controller.
  • Step 1: the host places a new command on the submission queue.
  • Step 2: the driver notifies the NVMe controller that a new instruction is pending by writing the new tail pointer to the doorbell register.
  • Step 3: the NVMe controller fetches the instruction from the submission queue.
  • Step 4: the NVMe controller processes the instruction.
  • Step 5: after completing the command, the NVMe controller places an entry in the associated completion queue.
  • Step 6: the NVMe controller generates an interrupt.
  • Step 7: after the driver finishes processing the entry, it writes the updated completion queue head pointer to the doorbell register of the NVMe controller.
  • the NVMe specification allows up to 64K individual queues, each of which can have up to 64K entries. In practical applications, the number of queues can be determined based on system configuration and expected load. For example, for a quad-core processor system, each core can set up a queue pair, which is useful for implementing a lock-free mechanism. However, NVMe also allows the driver to create multiple submission queues per core and establish different priorities between these queues. While submission queues are typically serviced in a round-robin fashion, NVMe optionally supports a weighted round-robin scheme that allows some queues to be serviced more frequently than others.
  • FIG. 3 is a schematic diagram of a queue pair mechanism. As shown in FIG. 3, there is a one-to-one correspondence between the submission queue and the completion queue.
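  • the doorbell flow of steps 1 to 7 can be modeled with a toy queue pair; this is not an NVMe driver, and interrupts and doorbell registers are simplified into method calls:

```python
from collections import deque

class NvmeQueuePair:
    """Toy submission/completion queue pair in one-to-one correspondence."""
    def __init__(self) -> None:
        self.sq: deque = deque()
        self.cq: deque = deque()

    def host_submit(self, command: dict) -> None:
        self.sq.append(command)   # step 1: place the new command on the SQ
        self._ring_doorbell()     # step 2: notify the controller

    def _ring_doorbell(self) -> None:
        while self.sq:            # steps 3-5: fetch, process, post completion
            cmd = self.sq.popleft()
            self.cq.append({"command_id": cmd["command_id"], "status": 0})

qp = NvmeQueuePair()
qp.host_submit({"command_id": 7, "opcode": 0x02})
assert qp.cq.popleft()["command_id"] == 7   # steps 6-7: host consumes the CQE
```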
  • the NVMe command refers to a command defined by the NVMe protocol. Instructions in the NVMe protocol are divided into Admin (administration) instructions and I/O instructions (I/O instructions are also called NVM instructions). Admin commands are used to manage and control the NVMe storage medium. I/O instructions are used to transfer data. Optionally, an NVMe command occupies 64 bytes in the packet. I/O commands in the NVMe protocol include NVMe read commands and NVMe write commands.
  • NVMe read commands are used to read data from the NVMe storage medium. For example, if the content of the opcode field in the NVMe layer of a NOF packet is 02h, the NVMe command carried in the NOF message is an NVMe read command.
  • NVMe write commands are used to write data to NVMe storage media.
  • for example, if the content of the opcode field in the NVMe layer in the NOF message is 01h, it means that the NVMe command carried in the NOF message is an NVMe write command.
  • NOF is a high-speed storage protocol based on the NVMe specification. NOF is used to access NVMe storage media across networks. The NOF protocol adds fabric-related instructions on the basis of NVMe. The NOF protocol makes the application scenarios of NVMe not limited to a device, but can be extended to cross-network communication.
  • fabric refers to the network between the host and the storage medium. Typical forms of fabrics are, for example, Ethernet, Fiber Channel, InfiniBand (IB), and the like.
  • the way to implement fabric by using RDMA is NVMe over RDMA technology. For details, please refer to the introduction of NVMe over RDMA in (8) below.
  • Fig. 4 is a schematic diagram of the architecture of a NOF storage system.
  • the fabric in the NOF technology is implemented using RoCEv2, that is, NVMe is carried on top of RoCEv2.
  • the network card encapsulates the NVMe command into a RoCE message, and sends the RoCE message to the NOF storage node where the NVMe SSD is located via Ethernet.
  • in this way, hosts are able to access NVMe SSDs across Ethernet.
  • FIG. 5 shows the parsing path of the NOF protocol stack by the NOF storage node.
  • FIG. 5 uses an example in which the NOF message is the RoCE message for illustration.
  • the network card of the NOF storage node is equipped with processing modules corresponding to the various protocol stacks. After the NOF storage node receives the RoCE message, the network card of the NOF storage node processes the RoCE message sequentially through each processing module: media access control (MAC) protocol stack parsing, internet protocol (IP) protocol stack parsing, user datagram protocol (UDP) protocol stack parsing, IB protocol stack parsing, and NVMe protocol stack parsing, to obtain the NVMe command carried in the RoCE message.
  • the network card sends the NVMe instruction to the NVMe controller in the SSD through the PCIe bus, and the NVMe controller executes the NVMe instruction to perform data read and write operations on the flash memory array.
  • FIG. 5 illustrates that the network card is responsible for the parsing and processing of various protocol stacks as an example.
  • the parsing and processing tasks of the protocol stacks are optionally performed by the CPU or other components of the storage node.
  • the above description focuses on the processing flow of NOF packets or NVMe commands inside the storage node.
  • the following describes how to interact between devices based on the NOF protocol.
  • the interaction process between the client A and the NOF storage node B based on the NOF protocol includes the following steps (1) to (8).
  • NOF is implemented through the RoCEv2 protocol.
  • NOF is carried by NVMe on the RoCEv2 protocol.
  • the following process takes a scenario without sharding as an example.
  • the method of updating the PSN in the following messages, namely adding one to the PSN, can be replaced by adding other values to the PSN.
  • Step (1) Client A establishes a connection with NOF storage node B.
  • specifically, the queue pairs (QP) at the two ends, on client A and on NOF storage node B, establish a logical session.
  • Client A initializes the packet sequence number (packet sequence number, PSN) from A to B, and obtains the initial PSN-AB.
  • the NOF storage node B initializes the PSN in the direction from B to A, and obtains PSN-BA.
  • PSN-AB is the PSN in the direction from client A to NOF storage node B
  • PSN-BA is the PSN in the direction from NOF storage node B to client A.
  • Step (2) Client A sends an RDMA send only message to NOF storage node B.
  • here, the RDMA send only packet is a read request.
  • PSN-AB1 in the RDMA send only message is the current PSN number in the direction from A to B. If there has been no interaction after initialization, PSN-AB1 in the RDMA send only message is the initial PSN-AB; if there has been interaction after initialization, PSN-AB1 is the current value of PSN-AB.
  • the NVMe layer in the RDMA send only message contains a scatter gather list (SGL) specifying a memory address of client A; the start logical block address (start LBA) and the block number (also called number of logical blocks) specify the destination storage address on NOF storage node B, and the command identifier (command ID) specifies the sequence number of the NVMe operation.
  • Step (3) NOF storage node B generates an RDMA acknowledgment (ACK) message with PSN-AB1 as the PSN, and sends the RDMA ACK message to client A.
  • Step (4) NOF storage node B uses PSN-BA1 as PSN to generate an RDMA read response (RDMA read response) message, and NOF storage node B sends an RDMA read response message to client A.
  • the content of the RDMA extended transport header (RDMA extended transport header, RETH) in the RDMA read response message is the SGL information in the NVMe layer.
  • the content of the payload in the RDMA read response message is the specific data value in the NVMe hard disk.
  • Step (5) NOF storage node B uses PSN-BA1+1 as the PSN to generate an RDMA send only invalidate message.
  • NOF storage node B sends an RDMA send only invalidate message to client A.
  • the content of RETH in the RDMA send only invalidate message is the remote key (remote key) of SGL in the NVMe layer.
  • the NVMe layer information contains the requested command ID.
  • the send queue head pointer (SQ head pointer, SQHD) is the head pointer position of the current operation in the send queue.
  • Step (6) Client A sends an RDMA send only message to NOF storage node B.
  • the RDMA send only message is used to request to write data into a certain memory space of NOF storage node B. If the RDMA send only message immediately follows the last read request, the PSN-AB1 in the RDMA send only message is the current PSN from A to B; otherwise, the PSN in the RDMA send only message is the current PSN-AB1+1.
  • the format of the NVMe layer information in the RDMA send only message indicating the write operation is the same as that of the NVMe layer information in the RDMA send only message indicating the read operation.
  • the payload (payload) part of the RDMA send only message is the specific data value that needs to be written into the memory.
  • Step (7) NOF storage node B generates an RDMA ACK message with PSN-AB1+1 as PSN. NOF storage node B sends an RDMA ACK message to client A.
  • Step (8) NOF storage node B generates an RDMA send message with PSN-BA1+2 as PSN. NOF storage node B sends an RDMA send message to client A.
  • the format of the NVMe layer information in the RDMA send message indicating the write operation is the same as that of the NVMe layer information in the RDMA send message indicating the read operation.
  • the SQHD in the RDMA send message indicating the write operation is the SQHD returned by NOF storage node B for the previous read operation, plus one.
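  • the per-direction sequence numbering used throughout steps (2) to (8) can be modeled as two independent counters; increment-by-one is assumed, although, as noted above, other increments are possible:

```python
class DirectionPsn:
    """One PSN counter per direction of the session between A and B."""
    def __init__(self, initial: int) -> None:
        self.current = initial

    def take(self, delta: int = 1) -> int:
        psn, self.current = self.current, self.current + delta
        return psn

psn_ab = DirectionPsn(100)      # client A -> NOF storage node B
psn_ba = DirectionPsn(500)      # NOF storage node B -> client A
read_req_psn = psn_ab.take()    # PSN-AB1 carried in the step (2) send only
read_rsp_psn = psn_ba.take()    # PSN-BA1 carried in the step (4) read response
send_inv_psn = psn_ba.take()    # PSN-BA1+1 carried in step (5)
assert (read_req_psn, read_rsp_psn, send_inv_psn) == (100, 500, 501)
```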
  • RoCE is a network protocol that can carry the RDMA protocol and the NOF protocol. RoCE allows the use of RDMA in Ethernet.
  • RoCE v1 is an Ethernet link layer protocol, so RoCE v1 supports RDMA for data transmission between any two hosts in the same Ethernet broadcast domain.
  • RoCE v2 is a network layer protocol.
  • RoCE v2 packets include UDP headers and IP headers, so RoCE v2 packets can be forwarded by IP routing, thus supporting data transmission between any two hosts in the IP network using RDMA.
  • the gateway device interacts with the client and the storage node respectively based on the RoCE protocol.
  • NVMe over RDMA is a technology that uses RDMA to transmit NVMe instructions or the execution results of NVMe instructions. From the perspective of the protocol stack, in NVMe over RDMA, NVMe is carried on top of RDMA. In the NVMe over RDMA solution, RDMA is equivalent to the carrier of the NVMe protocol or the transmission channel of the NVMe protocol. By analogy, the role of RDMA in NVMe over RDMA is similar to that of the PCIe bus in a computer: just as the PCIe bus transfers data between the CPU and the local hard disk, RDMA transfers NVMe commands between the host and the remote storage medium.
  • Some embodiments of the present application are completely different from the inventive concepts of the NVMe over RDMA scheme. Some embodiments of the present application utilize instructions in RDMA to perform read and write operations on the memory, thereby taking advantage of performance advantages such as faster data read and write speeds in the memory to improve storage performance and reduce the complexity of instruction sets that storage nodes need to process.
  • the NVMe over RDMA solution uses RDMA as the NVMe transmission channel to reduce the delay in transmitting NVMe instructions across the network. From the perspective of message content, in some embodiments of the present application, the content of the message sent by the gateway device to the storage node is an RDMA instruction, and the semantics of the instruction is how to operate the memory.
  • in the NVMe over RDMA solution, the contents of the message sent to the storage node are NVMe instructions, and the semantics of the instructions are how to operate the hard disk.
  • some embodiments of the present application support the use of storage media such as memory to provide data read and write services for clients, while in the NVMe over RDMA solution, storage media such as hard disks are used to provide data read and write services for clients .
  • RDMA is a technology that bypasses the operating system kernel of the remote device to access the memory of the remote device. Because the RDMA technology usually does not need to go through the operating system, it not only saves a lot of CPU resources, but also improves the throughput and reduces the network communication delay. RDMA is especially suitable for large-scale parallel computer clusters.
  • RDMA storage nodes refer to storage nodes that provide data read and write services through RDMA. There are many product forms of RDMA storage nodes. For example, RDMA storage nodes are storage servers, desktop computers, and the like.
  • RDMA has several major characteristics: (1) the local device transmits data to and from the remote device through the network; (2) in most cases, the operating system kernel does not participate, and the data transmission task is offloaded to the smart network card; (3) since data transmission between user-space virtual memory and the smart network card does not involve the operating system kernel, additional data movement and copying are avoided.
  • Infiniband is a network specially designed for RDMA, which guarantees reliable transmission from the hardware.
  • RoCE and iWARP are both Ethernet-based RDMA technologies.
  • in an RDMA unilateral operation, only the CPU of the local device participates in the work, while the CPU of the remote device does not. In other words, during an RDMA unilateral operation, the CPU of the remote device is bypassed (CPU bypass).
  • RDMA unilateral operations are typically used to transfer data. Generally speaking, the RDMA operation refers to the unilateral RDMA operation.
  • after obtaining the requested data, the remote network card encapsulates it into a message and returns it to the local end.
  • RDMA unilateral operations include RDMA read operations and RDMA write operations.
  • the RDMA write operation refers to the operation of writing data to the memory of the server (that is, the RDMA storage node).
  • the basic principle of performing an RDMA write operation is that the client pushes data from the local cache to the server's memory based on the server's memory address and access rights to the server's memory. Among them, the client's access right to the server's memory is called a remote key (R_Key) in the RDMA protocol.
  • R_Key remote key
  • the basic workflow for performing an RDMA write operation is as follows: 1) after an application 101 in the client 100 generates the RDMA write request message, the application 101 puts the RDMA write request message in the buffer 102.
  • the processor 142 of the local network card 140 reads the request message into the buffer 141 of the network card 140 itself, and the operating system 103 is bypassed in the process.
  • the RDMA write request message includes the logical address of the memory space of the RDMA storage node 200 , the remote key and the data to be saved by the application 101 .
  • the remote key is used to indicate that the network card 140 has access authority to the memory of the RDMA storage node 200 .
  • the processor 142 of the network card 140 sends the RDMA write request message to the network card 240 through the network.
  • the processor 242 of the network card 240 checks the remote key in the RDMA write request message. If the key is correct, the processor 242 writes the data carried in the RDMA write request message from the buffer 241 of the network card 240 to the buffer 202, thereby saving the data into the memory of the RDMA storage node 200.
  • the RDMA read operation refers to the operation of reading data in the memory of the server (that is, the RDMA storage node).
  • the basic principle of performing an RDMA read operation is that the client’s network card obtains data from the server’s memory based on the server’s memory address and the server’s memory access authority (remote key), and pulls the data into the client’s local cache.
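  • the one-sided read/write semantics just described can be modeled with a toy server memory guarded by an R_Key check, mirroring how the remote network card validates the key before touching memory; no real RDMA verbs library is used:

```python
class RemoteMemory:
    """Toy server-side memory gated by an R_Key check."""
    def __init__(self, size: int, r_key: int) -> None:
        self._mem = bytearray(size)
        self._r_key = r_key

    def rdma_write(self, va: int, r_key: int, data: bytes) -> None:
        assert r_key == self._r_key, "access denied: bad R_Key"
        self._mem[va:va + len(data)] = data    # push data into server memory

    def rdma_read(self, va: int, r_key: int, length: int) -> bytes:
        assert r_key == self._r_key, "access denied: bad R_Key"
        return bytes(self._mem[va:va + length])  # pull data to the local cache

node = RemoteMemory(size=4096, r_key=0x1234)
node.rdma_write(va=0x100, r_key=0x1234, data=b"hello")
assert node.rdma_read(va=0x100, r_key=0x1234, length=5) == b"hello"
```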
  • the RDMA bilateral operation includes an RDMA send (RDMA-send) operation and an RDMA receive (RDMA-receive) operation.
  • RDMA bilateral operation supports CPU bypass during data transmission, but the CPUs of both the local device and the remote device are required to participate in the work. In other words, the CPUs on the two sides are not completely bypassed during the execution of an RDMA bilateral operation.
  • RDMA bilateral operations are typically used to transmit control packets. Specifically, if the local device wants to transmit data to the memory of the remote device by performing the RDMA-send operation, the remote device needs to call the RDMA-receive operation first.
  • if the remote device does not call the RDMA-receive operation, the RDMA-send operation called by the local device will fail.
  • the working mode of bilateral operation is similar to traditional socket (socket) programming.
  • the overall performance of bilateral operations is slightly lower than that of RDMA unilateral operations.
  • RDMA-send and RDMA-receive operations are usually used to transmit connection control packets.
  • in RDMA, a logical connection is established between the applications of the communicating parties for communication; this logical connection is referred to as an RDMA connection hereinafter.
  • an RDMA connection is equivalent to a channel for transmitting messages, and the endpoints at the two ends of each RDMA connection are queue pairs.
  • a QP includes a sending queue (send queue, SQ) and a receiving queue (receive queue, RQ), which manage various types of messages.
  • for example, network card 140 includes SQ 302 and network card 240 includes RQ 403; SQ 302 and RQ 403 form a QP, and SQ 302 and RQ 403 are equivalent to the two endpoints of an RDMA connection.
  • QP will be mapped to the virtual address space of the application, so that the application can directly access the network card through it.
• RDMA also provides a completion queue (CQ), which is used to notify the application that the work requests on the work queue (WQ) have been processed.
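• As a rough illustration of these queue structures (a hypothetical sketch, not tied to any real RDMA library), a QP can be modeled as an SQ/RQ pair, with completions reported to the application through a CQ:

```python
from collections import deque

class QueuePair:
    def __init__(self):
        self.sq = deque()  # send queue: work requests posted by the application
        self.rq = deque()  # receive queue: buffers posted for incoming messages

class CompletionQueue:
    def __init__(self):
        self.cqe = deque()  # completion queue entries

    def notify(self, wr_id):
        self.cqe.append(("done", wr_id))

qp, cq = QueuePair(), CompletionQueue()
qp.sq.append("write WR #1")   # application posts a work request to the SQ
qp.sq.popleft()               # the network card consumes it, bypassing the OS
cq.notify("write WR #1")      # ...and reports completion through the CQ
print(cq.cqe.popleft())       # ('done', 'write WR #1')
```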
  • the following is an example of how devices interact based on the RDMA protocol.
  • the complete interaction process between client A and RDMA storage node B includes the following steps (1) to (6).
• the RDMA protocol on which the following process is based is specifically the RoCEv2 protocol.
  • the following process takes a scenario without sharding as an example.
• the method of updating the PSN in the messages, namely adding one to the PSN, can be replaced by adding other values to the PSN.
  • Step (1) Client A establishes a connection with RDMA storage node B.
• The QPs at the two ends, on client A and on RDMA storage node B, establish a logical session.
  • Client A initializes the PSN from A to B, and obtains the initial PSN-AB.
  • RDMA storage node B initializes the PSN in the direction from B to A to obtain PSN-BA.
  • PSN-AB is the PSN in the direction from client A to RDMA storage node B
  • PSN-BA is the PSN in the direction from RDMA storage node B to client A.
  • Step (2) RDMA storage node B sends an RDMA send only message to client A.
  • the RDMA send only message is used to report the address of the memory space of the RDMA storage node B.
• the PSN-BA1 carried in the RDMA send only message is the current PSN number in the B-A direction. If there has been no interaction after initialization, the PSN-BA1 carried in the RDMA send only message is the initial PSN-BA. If there has been interaction after initialization, the PSN-BA1 carried in the RDMA send only message is the current value of PSN-BA.
  • the address of the memory space is carried in RETH in the message.
  • the address of the memory space includes VA, remote key and direct memory access (direct memory access, DMA) length.
  • Step (3) Client A sends an RDMA read request message to RDMA storage node B.
• the RDMA read request message is used to request the data in the memory space of the RDMA storage node B.
• PSN-AB1 in the RDMA read request message is the current PSN number in the A-B direction. If there has been no interaction after initialization, PSN-AB1 is the initial PSN-AB. If there has been interaction after initialization, PSN-AB1 is the current value of PSN-AB.
  • the address of the memory space in the RDMA read request message is the address of the memory space reported by RDMA storage node B to client A before.
  • Step (4) The RDMA storage node B uses PSN-AB1 as the PSN to generate an RDMA read response (RDMA read response) message, and sends the RDMA read response message to the client A.
• the payload part of the RDMA read response message is the data read from the memory.
  • Step (5) Client A sends RDMA write only (RDMA write only) message to RDMA storage node B.
• the RDMA write only message is used to write data into the memory of RDMA storage node B. If the RDMA write only message follows the above RDMA read request message, the PSN in the RDMA write only message is the current PSN-AB1+1; otherwise, the PSN in the RDMA write only message is the current PSN number in the A-B direction.
• the address of the memory space in the RDMA write only message is the address of a certain memory space previously reported by B to A.
  • the payload part in the RDMA write only message is the specific data value that needs to be written into the memory.
  • Step (6) RDMA storage node B generates an RDMA ACK message with PSN-AB1+1 as PSN, and sends the RDMA ACK message to client A.
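• The PSN bookkeeping in steps (1) to (6) can be summarized with the following hedged sketch (names are hypothetical, and the "+1" update rule is just the example rule used above; other increments are possible):

```python
class PsnCounter:
    """One PSN counter per direction, initialized as in step (1)."""
    def __init__(self, initial: int):
        self.current = initial

    def use(self) -> int:
        # Each new request consumes the current PSN, then advances it by 1.
        psn, self.current = self.current, self.current + 1
        return psn

psn_ab = PsnCounter(initial=100)   # PSN-AB, direction A -> B
read_req_psn = psn_ab.use()        # step (3): PSN-AB1 = 100
read_resp_psn = read_req_psn       # step (4): response echoes PSN-AB1
write_req_psn = psn_ab.use()       # step (5): PSN-AB1 + 1 = 101
ack_psn = write_req_psn            # step (6): ACK carries PSN-AB1 + 1
print(read_req_psn, read_resp_psn, write_req_psn, ack_psn)  # 100 100 101 101
```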
• For the technical details of how the gateway device interacts with the RDMA storage node and for examples of the RDMA state information, refer to the above process.
• the first RDMA storage node in the embodiment corresponding to FIG. 9 is optionally the RDMA storage node B introduced above, and the gateway device plays the role of the client A introduced above.
• the first RDMA request message in S404 in the embodiment corresponding to FIG. 9 is optionally the RDMA read request message in the above step (3), and the RDMA response message in S408 in the embodiment corresponding to FIG. 9 is optionally the RDMA read response message in the above step (4); the RDMA status information in the RDMA response message is optionally the PSN-AB1 in the above step (4).
• Alternatively, the first RDMA request message in S404 in the embodiment corresponding to FIG. 9 is optionally the RDMA write only message in the above step (5), and the RDMA response message in S408 in the embodiment corresponding to FIG. 9 is optionally the RDMA ACK message in the above step (6).
  • the RDMA state information in the RDMA response message is optionally PSN-AB1+1 in the above step (6).
  • step (1) and step (2) in the above process are optionally provided as preparatory steps of the embodiment corresponding to FIG. 9 , providing a sufficient basis for the implementation of the embodiment in FIG. 9 .
• the gateway device and the first RDMA storage node establish the RDMA connection in advance through step (1), and the address of the memory space in the first RDMA storage node (the second destination address) is given to the gateway device in advance through step (2).
  • So-called "state information” is a term in the field of computer network communication.
• the status information is used to indicate the relationship between the different messages that the communicating parties exchange successively in a session.
  • each message exchanged by the communication parties in a session is not an isolated individual, but is related to the previously exchanged messages.
  • each message in a session carries certain information, and the value of this information remains unchanged during the session, or the value of this information changes according to the set rules during the session.
  • the information that remains unchanged in the session or whose value changes according to the set rules is the state information.
• Packets usually carry state information for reliability or security considerations.
• For example, the receiving end judges whether packet loss occurs according to the state information in the message and triggers retransmission when packet loss occurs, or the receiving end judges whether the sender is trustworthy according to whether the state information in the message is correct and discards the packet when the sender is untrustworthy.
  • the sequence number (sequence number) carried in the TCP message is a kind of state information.
• the RDMA status information indicates the relationship between different RDMA packets in a session based on the RDMA protocol and the logical order of the RDMA packets. For example, after the communicating parties establish a connection based on the RDMA protocol, the responder sends multiple RDMA response messages to the requester successively.
• the multiple RDMA response messages contain different RDMA status information, and the RDMA status information indicates the sequence of the multiple RDMA response messages.
  • the RDMA status information specifically indicates the correspondence between the RDMA response message and the RDMA request message.
  • the communication parties exchange multiple RDMA request messages and multiple RDMA response messages.
  • Each RDMA request packet or RDMA response packet includes RDMA status information.
  • the RDMA state information in an RDMA response message indicates which RDMA request message the RDMA response message corresponds to.
  • the RDMA state information is PSN.
  • the RDMA state information is the information carried in the RDMA message.
  • the RDMA state information is the information carried in the RDMA packet header in the RDMA packet.
  • the RDMA state information is the information carried in the IB header or the iWARP packet header.
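• The following minimal sketch (hypothetical and with a deliberately simplified policy) shows how a receiver can use a PSN-like piece of state information for the reliability check described above:

```python
def check_packet(expected_psn: int, received_psn: int) -> str:
    if received_psn == expected_psn:
        return "accept"     # in order: process and advance the expected PSN
    if received_psn > expected_psn:
        return "lost"       # a gap: earlier packet(s) were lost, request retransmission
    return "duplicate"      # already seen: drop

assert check_packet(5, 5) == "accept"
assert check_packet(5, 7) == "lost"
assert check_packet(5, 4) == "duplicate"
```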
  • the NOF status information indicates the relationship between different NOF messages in a session based on the NOF protocol and the logical order of the NOF messages.
  • the NOF state information specifically indicates the corresponding relationship between the NOF response message and the NOF request message.
  • the communicating parties exchange multiple NOF request messages and multiple NOF response messages in a session based on the NOF protocol.
  • Each NOF request message or NOF response message includes NOF status information.
  • the NOF status information in an NOF response message indicates which NOF request message the NOF response message corresponds to.
  • the NOF state information includes at least one of the following information: PSN, SQHD, command ID, DQP, virtual address, remote key (remote key) or direct memory access length.
  • the NOF status information is the information carried in the NOF message.
  • the NOF state information is the information carried in the NOF packet header in the NOF packet.
  • the NOF packet header refers to the packet header in the NOF packet.
  • the NOF packet header includes the packet header corresponding to the fabric and the NVMe layer information.
  • the specific format of the packet header corresponding to the fabric is related to the implementation of the fabric.
  • the packet header corresponding to the fabric may include the packet header corresponding to the multi-layer protocol.
• For example, if the fabric is implemented through RoCEv2, the packet header corresponding to the fabric includes a MAC header (corresponding to the link layer protocol), an IP header (corresponding to the network layer protocol), a UDP header (corresponding to the transport layer protocol) and an IB header (corresponding to the transport layer protocol).
• Alternatively, the packet header corresponding to the fabric is the packet header of a single protocol; for example, if the fabric is implemented through InfiniBand, the packet header corresponding to the fabric is the IB header.
  • RETH is a transport layer header in the RDMA protocol.
  • RETH contains some additional fields for RDMA operations.
  • RETH includes a virtual address (virtual address, VA) field, a remote key (remote key, R_Key) field, and a direct memory access length (DMA length) field.
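• As an illustration, the three RETH fields can be packed and parsed as fixed-width integers (a sketch assuming the InfiniBand RETH layout of a 64-bit VA, a 32-bit R_Key and a 32-bit DMA length, big-endian on the wire):

```python
import struct

RETH = struct.Struct(">QII")  # VA (8 bytes), R_Key (4 bytes), DMA length (4 bytes)

def pack_reth(va: int, r_key: int, dma_length: int) -> bytes:
    return RETH.pack(va, r_key, dma_length)

def parse_reth(raw: bytes) -> dict:
    va, r_key, dma_length = RETH.unpack(raw[:RETH.size])
    return {"va": va, "r_key": r_key, "dma_length": dma_length}

raw = pack_reth(va=0x7F00_0000_1000, r_key=0x1234, dma_length=4096)
print(parse_reth(raw))  # {'va': 139637976731648, 'r_key': 4660, 'dma_length': 4096}
```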
  • PSN is a value carried in the packet transmission header.
  • the PSN is used to support detection and retransmission of lost packets.
  • SQHD is used to indicate the current head of the submission queue (SQ).
  • the SQHD is used to indicate to the host the consumed entries in the SQ (that is, the read/write commands that have been added to the SQ).
  • command ID is the identifier of the command associated with the error. If the error is not related to the specified command, then the command ID field is optionally set to FFFFh.
  • VA represents the starting address of the buffer.
  • the length of VA is, for example, 64 bits.
  • R_Key is used to describe the permission of the remote device to access the local memory, such as the access permission of the client to the memory of the RDMA storage node.
  • R_Key is also called memory key.
  • R_Key is usually used with VA.
  • R_Key is also used to help hardware identify the page table that translates virtual addresses into physical addresses.
  • DMA length indicates the length of the RDMA operation.
  • DMA length is a field name in RDMA-related standards.
  • DMA length can also be called RDMA length.
• the host refers to the main body part of the computer.
  • a host usually includes a CPU, memory, and interfaces.
  • the connection relationship between the host and the SSD includes multiple implementation manners.
  • the SSD is arranged inside the host, and the SSD is a component inside the host.
  • the SSD is disposed outside the host and connected to the host.
  • Storage nodes refer to entities that support data storage functions.
  • a storage node is an independent storage device.
  • a storage node is a device integrated with multiple storage devices, or a cluster or distributed system including multiple storage devices.
  • an RDMA storage node is a storage server that supports RDMA, and the storage server utilizes local memory to provide services for reading and writing data based on RDMA.
  • an RDMA storage node includes multiple storage servers supporting RDMA, and memory in each storage server forms a memory pool supporting RDMA. The storage node uses the memory belonging to one or more storage servers in the memory pool to provide services for reading and writing data based on RDMA.
• Memory refers to internal storage that exchanges data directly with the processor. Memory can usually be read and written at any time, is fast, and acts as temporary data storage for the operating system or other running programs.
  • the memory is, for example, a random access memory, or a read only memory (ROM).
  • the random access memory is, for example, dynamic random access memory (DRAM), or storage class memory (SCM).
  • DRAM is a semiconductor memory that, like most random access memory (RAM), is a volatile memory device.
• SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
• Storage class memory provides faster read and write speeds than hard disks, but is slower to access than DRAM and also cheaper than DRAM.
  • the DRAM and the SCM are only exemplary illustrations in this embodiment, and the memory may optionally include other random access memories, such as static random access memory (static random access memory, SRAM) and the like.
  • the ROM is, for example, a programmable read only memory (PROM), an erasable programmable read only memory (EPROM), and the like.
• the memory is a dual in-line memory module (DIMM for short), that is, a module composed of dynamic random access memory (DRAM), or an SSD.
• the memory is optionally configured to have a power-failure protection function.
• the power-failure protection function means that when the system is powered off and then powered on again, the data stored in the memory will not be lost.
• Memory with a power-failure protection function is called non-volatile memory.
  • LB is also called block, and LB refers to the smallest storage unit defined by NVMe.
  • LB is a storage space with a size of 2KB or 4KB.
  • a LUN is a number used to identify a logical unit, which is a device addressed by SCSI.
  • the storage system partitions the physical hard disk into various parts with logical addresses, and then allows the host to access.
  • Such a partition is called a LUN.
  • Commonly referred to as LUN also refers to the logical disk created on the SAN storage.
  • Some embodiments of the present application relate to the mutual conversion process of NOF and RDMA protocol packets.
• Hereinafter, the form "NOF-RDMA" is used to denote the process of converting a NOF message into an RDMA message.
• Hereinafter, the form "RDMA-NOF" is used to denote the process of converting an RDMA message into a NOF message.
  • FIG. 8 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • the scenario shown in FIG. 8 includes a client 31 , a gateway device 33 and an RDMA storage node 35 .
  • Each device in FIG. 8 will be described with an example below.
  • the deployment location of the client 31 includes various situations.
  • the client 31 is deployed on a user network, or in a local area network, for example, the client 31 is deployed on an intranet of an enterprise.
  • the client 31 is deployed on the Internet, or in the cloud, or in a cloud network such as a public cloud, an industry cloud, or a private cloud.
  • the client 31 is deployed in the backbone network (for example, the client is a router with data storage requirements), and this embodiment does not limit the deployment location of the client.
  • the client 31 may be a terminal, a server, a router, a switch, and the like.
  • Terminals include but are not limited to personal computers, mobile phones, servers, laptops, IP phones, cameras, tablets, wearable devices, and the like.
• the client 31 plays the role of the initiator of the NOF request message. Taking the process of reading data as an example, when the client 31 needs to obtain pre-saved data, the client 31 generates and sends a NOF read request message, thereby triggering the method embodiment shown in FIG. 9 below. Taking the process of writing data as an example, when the client 31 needs to save data, the client 31 generates and sends a NOF write request message, thereby triggering the method embodiment shown in FIG. 9 below.
  • the client 31 also plays the role of the destination party of the NOF response message. Taking the process of reading data as an example, after the client 31 receives the NOF read response message, the client 31 obtains the read data from the NOF read response message, and performs business processing according to the data. Taking the process of writing data as an example, after the client 31 receives the NOF write response message, the client 31 obtains the NOF response information from the NOF write response message, and confirms that the data has been saved successfully according to the NOF response message.
  • the gateway device 33 is an entity deployed between the client 31 and the RDMA storage node 35 .
  • the gateway device 33 is configured to forward messages exchanged between the client 31 and the RDMA storage node 35 .
• the gateway device 33 acts as both a NOF proxy and an RDMA proxy. From the perspective of the client 31, the gateway device 33 is equivalent to a NOF server, and the gateway device 33 interacts with the client 31 instead of the NOF server. As shown in FIG. 8, the gateway device 33 establishes a NOF connection with the client 31 based on the NOF protocol, and the gateway device 33 can receive the NOF request message sent by the client 31 through the NOF connection. From the perspective of the RDMA storage node 35, the gateway device 33 is equivalent to an RDMA client, and the gateway device 33 interacts with the RDMA storage node 35 instead of the client.
• As shown in FIG. 8, the gateway device 33 establishes an RDMA connection with the RDMA storage node 35 based on the RDMA protocol, and the gateway device 33 can send the RDMA request message to the RDMA storage node 35 through the RDMA connection.
• For how the gateway device 33 implements the proxy function, please refer to the various method embodiments below.
  • the gateway device 33 is a network device.
  • the gateway device 33 is a router, a switch, a firewall, and the like.
  • the gateway device 33 is a server, for example, the gateway device 33 is a storage server.
  • the gateway device 33 is implemented by a programmable device such as a field-programmable gate array (field-programmable gate array, FPGA), or a coprocessor.
  • the gateway device 33 is a dedicated chip.
  • the gateway device 33 is a general-purpose computer device, and the computer device implements the functions of the gateway device 33 by running a program in a memory through a processor.
  • the gateway device 33 provides message forwarding and proxy services for multiple clients.
  • the network also includes a client 32 .
  • the gateway device processes the NOF request message sent by the client 32 in a similar manner.
• the scenario of deploying one gateway device shown in FIG. 8 is only an example, and the number of gateway devices deployed in the system may be more or less. For example, there may be only one gateway device, or there may be tens or hundreds of gateway devices, or more; this embodiment does not limit the number of gateway devices deployed in the system.
• a load balancer is deployed in front of the gateway devices, and the load balancer is used to distribute the request messages from each client to the gateway devices, so that the gateway devices work in a load-balanced manner, thereby relieving the processing pressure on any single gateway device.
  • the RDMA storage node 35 is configured to provide a service of reading and writing data through RDMA.
  • the RDMA storage node 35 is also called the RDMA server.
• the RDMA storage node 35 has memory.
  • the network interface of the RDMA storage node 35 is connected to the network interface of the gateway device 33 .
  • the RDMA storage node 35 stores data of the client 31 .
  • multiple RDMA storage nodes are deployed in the system.
  • an RDMA storage node 36 is optionally deployed in the system, and the RDMA storage node 36 has features similar to the RDMA storage node 35 .
  • FIG. 9 is a flow chart of a message processing method provided by an embodiment of the present application.
  • the method shown in FIG. 9 relates to the case where the storage system includes multiple RDMA storage nodes.
  • “first RDMA storage node” and “second RDMA storage node” are used to distinguish and describe different RDMA storage nodes.
  • the client in the embodiment shown in FIG. 9 is the host in FIG. 1 .
  • the client in the embodiment shown in FIG. 9 is the host in FIG. 2 .
  • the client in the embodiment shown in FIG. 9 is the host in FIG. 3 .
• the gateway device in the embodiment shown in FIG. 9 acts as the RDMA protocol stack agent of the client 100 in FIG. 6, and the gateway device replaces the client 100 to interact with the RDMA storage node 200 in FIG. 6; the gateway device includes the network card 140 in FIG. 6, and the network card 140 executes the steps that the gateway device is in charge of in the embodiment shown in FIG. 9.
• the gateway device in the embodiment shown in FIG. 9 includes the network card 140 in FIG. 7, and the first RDMA storage node in the embodiment shown in FIG. 9 is provided with the network card 240 in FIG. 7.
  • the gateway device establishes an RDMA connection with the first RDMA storage node through the network card 140 and performs interaction.
  • the gateway device implements S404 in the embodiment shown in FIG. 9 by adding the first RDMA request message to SQ302 in FIG. 7 .
  • the first RDMA storage node implements S405 in the embodiment shown in FIG. 9 through RQ403 in FIG. 7 .
  • the network deployment scenario on which the method shown in FIG. 9 is based is shown in FIG. 8 above.
  • the first RDMA storage node in the method shown in FIG. 9 is the RDMA storage node 35 in FIG. 8
• the client in the method shown in FIG. 9 is the client 31 in FIG. 8, and the gateway device in the method shown in FIG. 9 is the gateway device 33 in FIG. 8.
  • the method shown in FIG. 9 includes the following steps S401 to S406.
  • the client sends a first NOF request packet.
  • the first NOF request packet carries an NVMe command.
  • the NVMe command indicates to perform a read/write operation on the first destination address.
• For NVMe instructions, please refer to the term explanation section (3) above.
  • the NVMe command carried in the first NOF request message is specifically an I/O command.
  • the first NOF request message is a NOF read request message
  • the NVMe instruction carried by the first NOF request message is an NVMe read instruction
• the NVMe instruction carried by the first NOF request message indicates to perform a read operation on the first destination address.
• For NVMe read instructions, please refer to the term explanation section (4) above.
  • the first NOF request message is a NOF write request message
• the NVMe command carried in the first NOF request message is an NVMe write command
  • the NVMe command carried in the first NOF request message indicates to perform a write operation on the first destination address.
• For NVMe write instructions, please refer to the term explanation section (5) above.
  • the first destination address indicates the location of the storage space provided by the NVMe storage medium.
  • the first destination address indicates the location of the data to be read in the NVMe storage medium.
  • the first destination address indicates the location of the data to be saved in the NVMe storage medium.
  • the first destination address is a logical address (or called a virtual address).
  • the data form of the first destination address includes multiple possible implementations.
  • the form of the first destination address meets the specification of the NVMe protocol.
  • the first destination address is an NVMe address.
  • the first destination address includes a start logical block address (start LBA) and a block number.
  • the first destination address includes a LUN ID, a start address, and a data length.
  • the memory space on the first RDMA storage node is not directly exposed to the client, but is virtualized into a logical unit (logical unit, LU) for the client to use.
  • the storage resources perceived by the client are individual LUNs, not blocks of memory on the RDMA storage node.
  • the gateway device communicates with the client based on LUN semantics. For the concept of LU and LUN, please refer to the terminology explanation part (31) above.
  • the step of mapping the memory space as a LUN is optionally performed by the gateway device, or by the control plane device.
  • the first RDMA storage node provides RDMA memory space for the LUN at the granularity of a page, in other words, allocates the RDMA memory space at a page or an integer multiple of a page.
  • the size of one page is, for example, 4KB or 8KB.
  • the first destination address and the second destination address are the same address.
  • a network element such as a gateway device or a control plane device exposes the memory space on the first RDMA storage node to the client, so that the client can perceive the memory on the RDMA storage node.
  • the gateway device communicates with the client based on memory semantics.
  • the first NOF request message includes the first destination address.
  • the first NOF request message has a start LBA field and a block number field. The contents of the start LBA field and the block number field are used to indicate the first destination address.
  • the first NOF request message includes NOF status information.
  • the gateway device receives the first NOF request message from the client.
  • the gateway device and the client pre-establish a NOF connection.
  • the gateway device receives the first NOF request message through the NOF connection with the client.
  • a NOF connection refers to a logical connection established based on the NOF protocol.
  • the transmission mode of the first NOF request message includes multiple situations, which are described below with examples of the first to second situations.
• Case 1: After the first NOF request message is sent from the client, it is forwarded to the gateway device through one or more forwarding devices.
  • Case 1 supports the scenario where one or more forwarding devices are deployed between the client and the gateway device. After the client sends the first NOF request message, the forwarding device receives the first NOF request message, and forwards the first NOF request message to the gateway device.
  • the forwarding device through which the first NOF request message passes includes but not limited to a layer-2 forwarding device (such as a switch), a layer-3 forwarding device (such as a router, a switch), and the like.
  • Forwarding devices include but are not limited to wired network devices or wireless network devices.
• Case 2: The client sends the first NOF request message directly to the gateway device. Case 2 supports the scenario where the client is physically directly connected to the gateway device, and the gateway device is the next-hop node of the client.
  • the gateway device acquires information of the first RDMA storage node based on the first destination address.
  • the gateway device obtains the first destination address from the first NOF request message.
  • the gateway device acquires the information of the destination storage node based on the first destination address, and obtains the information of the first RDMA storage node.
• the process for the gateway device to obtain the first destination address includes: the gateway device obtains the start LBA from the start LBA field in the first NOF request message, obtains the number of blocks from the block quantity field in the first NOF request message, and obtains the block size based on the properties of the NOF connection. The gateway device then obtains the first destination address based on the start LBA, the number of blocks and the block size, as sketched below.
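• A minimal sketch of this derivation (the function name is hypothetical; the address is expressed here as a byte offset plus a length):

```python
def first_destination_address(start_lba: int, block_count: int, block_size: int):
    offset = start_lba * block_size    # byte offset of the first addressed block
    length = block_count * block_size  # total bytes covered by the command
    return offset, length

# e.g. start LBA 1024, 8 blocks of 4 KB -> (4194304, 32768)
print(first_destination_address(start_lba=1024, block_count=8, block_size=4096))
```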
  • the first RDMA storage node is the storage node where the data to be read is located, that is, the storage node where the data requested by the first NOF request message is stored.
  • the first RDMA storage node is a storage node for data to be saved, that is, the storage node for which the data requested by the first NOF request message is to be written.
  • the specific content of the information of the first RDMA storage node includes various situations.
• For example, the information of the first RDMA storage node is the device identifier of the first RDMA storage node; for another example, it is the network address of the first RDMA storage node; for another example, it is the memory address of the first RDMA storage node; for another example, it is any information that can identify the RDMA connection to the first RDMA storage node; for another example, it is the session ID of the session between the gateway device and the first RDMA storage node.
  • the information of the first RDMA storage node is the public key of the first RDMA storage node. In another example, the information of the first RDMA storage node is permission information (such as R_Key) for accessing the memory of the first RDMA storage node.
• the information of the first RDMA storage node includes at least one of the following information: the second destination address, the network location information of the first RDMA storage node, the identifiers of one or more QPs in the first RDMA storage node, and R_Key, where R_Key indicates the right to access the memory of the first RDMA storage node.
  • the second destination address points to the memory space in the first RDMA storage node.
  • the second destination address indicates the location of the data to be read in the memory space of the first RDMA storage node.
  • the second destination address indicates the location of the data to be saved in the memory space of the first RDMA storage node.
  • the second destination address is a logical address (or called a virtual address).
  • the data form of the second destination address includes multiple possible implementation manners.
  • the form of the second destination address meets the requirements of the RDMA protocol.
  • the second destination address is an RDMA address.
  • the second destination address includes VA and DMA length.
  • the second destination address is other data capable of indicating a location in the memory, such as a memory space ID, a start address and a length of the memory space.
  • the network location information of the first RDMA storage node is used to identify the first RDMA storage node in the network.
  • an intermediate network device optionally exists between the gateway device and the first RDMA storage node.
  • the network location information is used to guide intermediate network devices to perform routing and forwarding. Specifically, after the gateway device sends the first RDMA request message, the first RDMA request message first arrives at the intermediate network device.
  • the intermediate network device obtains the network location information of the first RDMA storage node according to the first RDMA request message.
  • the intermediate network device searches for a local routing and forwarding entry according to the network location information, and routes and forwards the first RDMA request message, so that the first RDMA request message is transmitted to the first RDMA storage node.
  • the network location information includes at least one of a MAC address, an IP address, a multi-protocol label switching (multi-protocol label switching, MPLS) label, or a segment identifier (segment ID, SID).
  • the network location information is the MAC address of the first RDMA storage node, and the MAC address is used to identify the first RDMA storage node in the layer-2 network.
  • an IP network exists between the gateway device and the first RDMA storage node, and the network location information is the IP address of the first RDMA storage node, and the IP address is used to identify the first RDMA storage node in the IP network.
• For another example, there is an MPLS network between the gateway device and the first RDMA storage node, and the network location information is an MPLS label of the first RDMA storage node, where the MPLS label is used to identify the first RDMA storage node in the MPLS network.
  • the QP identifier is used to indicate a QP in the first RDMA storage node.
  • One QP is equivalent to one logical channel between the gateway device and the first RDMA storage node.
  • the first RDMA storage node includes multiple QPs.
  • the first correspondence includes an identifier of each QP in the multiple QPs of the first RDMA storage node.
  • the gateway device sends a first RDMA request message to the first RDMA storage node.
  • the gateway device generates a first RDMA request message based on the information of the first RDMA storage node and the RDMA command corresponding to the NVMe command.
  • the gateway device sends the generated first RDMA request message to the first RDMA storage node.
  • the first RDMA request message is a request message in the RDMA protocol.
  • the first RDMA request message is an RDMA unilateral operation message.
  • the above-mentioned first NOF request message is an NOF read request message
  • the first RDMA request message is an RDMA read request message (RDMA read request).
  • the first NOF request message is an NOF write request message
  • the first NOF request message includes data to be saved.
  • the first RDMA request message is an RDMA write request message (RDMA write request).
  • the first RDMA request packet includes data to be saved in the first NOF request packet.
  • the first RDMA request message carries an RDMA command corresponding to the NVMe command and information about the first RDMA storage node.
  • the RDMA instruction indicates to perform a read/write operation on the second destination address in an RDMA manner.
  • the NVMe command carried in the first NOF request message is an NVMe read command
  • the RDMA command carried in the first RDMA request message indicates to perform an RDMA read operation on the second destination address.
• For the concept of the RDMA read operation, please refer to the introduction in the term explanation part (12) above.
  • the RDMA command carried in the first RDMA request message indicates to perform an RDMA write operation on the second destination address.
• For the concept of the RDMA write operation, please refer to the introduction in the terminology explanation part (11) above.
  • the first RDMA request packet includes the second destination address.
  • the first RDMA request message includes RETH
  • the RETH in the first RDMA request message includes a VA field and a DMA length field.
  • the second destination address is carried in the VA field and the DMA length field.
  • the NVMe command carried in the first NOF request message has different semantics from the RDMA command carried in the first RDMA request message.
  • the semantics of NVMe instructions is to operate on NVMe media (hard disk).
  • the semantics of RDMA instructions are to operate on memory.
  • the gateway device supports the function of converting NVMe commands into RDMA commands.
  • the gateway device converts the NVMe command carried in the first NOF request message into a corresponding RDMA command, thereby generating the first RDMA request message.
  • the gateway device stores the corresponding relationship between the NVMe command and the RDMA command.
  • the gateway device obtains the NVMe command carried in the first NOF request message, and the gateway device queries the correspondence between the NVMe command and the RDMA command according to the NVMe command, and obtains the RDMA command corresponding to the NVMe command .
  • the gateway device encapsulates the RDMA instruction into an RDMA request message, thereby generating a first RDMA request message.
  • the gateway device converts the NVMe command into the RDMA command by modifying all or part of the parameters in the NVMe command based on the difference between the NVMe command and the RDMA command.
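• The conversion logic can be sketched as a lookup in the stored correspondence followed by re-encapsulation (a hypothetical sketch; real opcode values and message formats are defined by the NVMe and RDMA specifications):

```python
# Assumed correspondence between NVMe commands and RDMA operations:
# hard-disk semantics are mapped onto memory semantics.
NVME_TO_RDMA = {
    "nvme_read":  "rdma_read",
    "nvme_write": "rdma_write",
}

def convert(nvme_command: str, second_destination_address: tuple, payload=None):
    rdma_op = NVME_TO_RDMA[nvme_command]  # query the stored correspondence
    request = {"op": rdma_op, "addr": second_destination_address}
    if payload is not None:               # the write path carries the data to save
        request["payload"] = payload
    return request

print(convert("nvme_read", (0x1000, 4096)))
# {'op': 'rdma_read', 'addr': (4096, 4096)}
```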
  • the first RDMA request packet includes RDMA status information.
  • the gateway device pre-establishes an RDMA connection with the first RDMA storage node.
  • the gateway device sends the first RDMA request message to the first RDMA storage node through the RDMA connection with the first RDMA storage node.
  • An RDMA connection refers to a logical connection established based on the RDMA protocol.
  • the first RDMA storage node receives the first RDMA request message.
  • the first RDMA storage node executes an RDMA instruction to perform a read/write operation on the memory.
  • the first RDMA storage node obtains the second destination address and the RDMA instruction from the first RDMA request message.
  • the first RDMA storage node finds the memory space corresponding to the second destination address from the local memory.
  • the first RDMA storage node executes RDMA instructions to perform read/write operations on the memory space.
  • the first RDMA storage node executes an RDMA read command, executes an RDMA read operation on the memory space corresponding to the second destination address, and acquires data stored in the memory space corresponding to the second destination address.
  • the first RDMA storage node obtains the data to be saved from the first RDMA request message.
• Based on the RDMA write instruction, the first RDMA storage node performs an RDMA write operation on the memory space corresponding to the second destination address, and saves the data in the memory space corresponding to the second destination address.
• the gateway device converts access directed to a NOF or NVMe storage node (the first NOF request message from the client) into access to the RDMA storage node (the first RDMA request message), thereby improving storage performance.
  • the storage media provided by RDMA nodes is memory, and the performance of memory is better than that of NVMe hard disks.
  • the gateway device converts NOF request messages into RDMA request messages and NVMe commands into RDMA commands, which is equivalent to converting hard disk operations into memory operations, so as to take advantage of the performance advantages of memory storage and improve performance.
  • the instruction set for memory operations is simpler than that for hard disk operations, which reduces the complexity of executing read and write instructions on storage nodes and further improves performance.
• Since the gateway device determines the information of the RDMA storage node (the information of the first remote direct memory access RDMA storage node) based on the destination logical address of NVMe (the first destination address), addressing offloading is supported and the CPU pressure of the storage node is reduced.
  • Addressing in this embodiment refers to a process of finding a destination storage node according to the destination NVMe address.
  • the so-called "offload” usually refers to the transfer of tasks originally responsible by the CPU to specific hardware for execution.
• In the related art, addressing is usually performed by the CPU of the NOF storage node. Specifically, the CPU of the NOF storage node needs to judge whether the destination storage node is the local node according to the destination NVMe address. If it is not the local node, the NOF storage node needs to reconstruct the request message and then forward the constructed request message to the final destination storage node. The processes of addressing, reconstructing the request message and forwarding the message occupy a large amount of the storage node CPU's processing resources, and the process of forwarding the message also brings network IO pressure to the storage node.
• In this embodiment, the addressing task (such as the step of determining the first RDMA storage node according to the first destination address) is performed by the gateway device, which is equivalent to offloading the addressing task of the NOF storage node, thereby reducing the CPU pressure of the NOF storage node and saving the network IO pressure that forwarding packets would occupy on storage nodes.
• In addition, since the gateway device is deployed in the system and the logical connection (RDMA connection) is established between the gateway device and the RDMA storage node, the gateway device takes over the back-end expansion function originally performed by the storage node, thus optimizing the message forwarding path and reducing the message forwarding delay.
• In the related art, the forwarding path of the NOF request message is logically client → network device → NOF front-end storage node → NOF back-end storage node. It can be seen that the forwarding path needs to pass through at least two hops of intermediate nodes, the network device and the NOF front-end storage node; the packet forwarding path is long and the delay is large.
  • the NOF front-end storage node is used to forward the message to the NOF back-end storage node when the destination storage address is not in the node.
• In this embodiment, the forwarding path of the request message is logically client → gateway device → RDMA storage node, without forwarding through a NOF front-end storage node, thus shortening the message forwarding path and reducing the message forwarding delay.
• Since the gateway device executes the processing flow based on the NOF request message initiated by the client, the client is not required to initiate RDMA messages; no modification of the client is needed, which reduces the difficulty of service provisioning.
• the client can use the storage service provided by the RDMA storage node by initiating access according to the original NOF process, without having to perceive the change of the storage node and without being required to support RDMA, so this solution is compatible with the original NOF storage solution, which facilitates quick provisioning of services.
• In addition, since the gateway device is deployed in the system and the logical connection (RDMA connection) is established between the gateway device and the RDMA storage node, the difficulty of expanding the storage system is reduced and the scalability of the storage system is improved.
• In the related art, when a new storage node is added to the NOF storage system, the client is usually required to establish a connection with the new storage node, and the client can only use the storage capacity provided by the new storage node after connecting to it.
• This places high requirements on the client and makes expansion difficult.
• In this embodiment, since the work of establishing the RDMA connection with the RDMA storage node is performed by the gateway device, when a new RDMA storage node is added to the storage system, the gateway device establishes a connection with the newly added RDMA storage node and interacts with it, and then provides the storage capacity of the newly added RDMA storage node to the client. From the perspective of the client, the client is not required to perceive the newly added RDMA storage node, nor to establish a connection with it; the client can use the storage capacity of the newly added RDMA storage node through the connection previously established with the gateway device.
• This obviously reduces the requirements on the client, which also reduces the difficulty of expansion, meets the need for flexible expansion of the storage system, and improves scalability.
  • the method shown in FIG. 9 further includes the following S407 to S412 on the basis of including the above S401 to S406.
  • the above S401 to S406 are the interaction process in the direction of NOF-RDMA.
  • the following S407 to S412 are the interaction process in the direction of RDMA-NOF.
  • the first RDMA storage node generates an RDMA response message.
  • the first RDMA storage node sends an RDMA response message.
  • the RDMA response message is a response message to the first RDMA request message.
  • the RDMA response message indicates a response to the RDMA command in the first RDMA request message.
• the RDMA response message is an RDMA read response message.
  • Executing the RDMA command includes a process of executing an RDMA read operation, and the RDMA response message includes data read from the memory space of the first RDMA storage node.
  • the read data is carried in the payload field of the RDMA read response message.
  • the RDMA response message is an RDMA ACK message.
  • the RDMA response message includes RDMA status information.
  • the RDMA status information indicates the correspondence between the RDMA response message and the first RDMA request message.
  • the RDMA status information in the RDMA response message has the same value as the RDMA status information in the first RDMA request message.
• Alternatively, the value of the RDMA status information in the RDMA response message is different from the value of the RDMA status information in the first RDMA request message, and the two values meet a set rule (for example, the difference is 1).
  • the gateway device receives the RDMA response message from the first RDMA storage node.
  • the gateway device generates a first NOF response message based on the RDMA response message.
  • the first NOF response message is a response message to the first NOF request message.
  • the first NOF response message indicates to respond to the NVMe command in the first NOF request message.
  • the first NOF response message includes the data requested by the first NOF request message.
  • the process of generating the first NOF response message includes: the gateway device obtains the data saved in the memory space of the first RDMA storage node from the RDMA response message.
  • the gateway device generates the first NOF response message based on the data stored in the memory space of the first RDMA storage node.
  • the first NOF response message further includes a CQE, and the CQE is used to indicate that the NVMe read operation has been completed.
  • the first NOF response message is a NOF write response message.
  • the first NOF response message includes a CQE, and the CQE is used to indicate that the NVMe write operation has been completed, or that the data has been saved successfully.
  • the first NOF response packet includes NOF status information.
  • the NOF status information indicates the correspondence between the first NOF response message and the first NOF request message.
  • the NOF state information in the first NOF response message and the NOF state information in the first NOF request message have the same value.
  • the first NOF request message and the first NOF response message contain the same virtual address, the same remote key and the same direct memory access length.
• Alternatively, the value of the NOF state information in the first NOF request message is different from the value of the NOF state information in the first NOF response message, and the two values satisfy a set rule (for example, the difference is 1).
  • the difference between the PSN in the first NOF request message and the PSN in the first NOF response message is equal to the set value.
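• As a hedged sketch of this RDMA-NOF direction (field names are hypothetical, and the set value is assumed to be 1 here), the gateway can derive the first NOF response's status information from the saved first NOF request and fill in the result of the RDMA response:

```python
SET_VALUE = 1  # assumed rule: response PSN = request PSN + set value

def build_nof_response(nof_request: dict, rdma_response: dict) -> dict:
    return {
        "psn": nof_request["psn"] + SET_VALUE,    # NOF status information
        "command_id": nof_request["command_id"],  # ties response to request
        "data": rdma_response.get("data"),        # read path: data from memory
        "cqe": "command completed",               # completion queue entry
    }

req = {"psn": 7, "command_id": 0x42}
resp = build_nof_response(req, {"data": b"\x00" * 16})
print(resp["psn"], hex(resp["command_id"]))  # 8 0x42
```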
  • the gateway device sends a first NOF response packet to the client.
  • the client receives the first NOF response message.
• How the gateway device obtains the information of the RDMA storage node (such as the information of the first RDMA storage node) according to the destination NVMe address (such as the first destination address) includes multiple implementation methods; some possible implementation methods are illustrated below with examples.
  • the gateway device obtains the information of the RDMA storage node by querying the corresponding relationship, and this implementation will be introduced below.
• After the gateway device receives the first NOF request message, the gateway device obtains the first destination address from the first NOF request message, and based on the first destination address, obtains the information of the first RDMA storage node from the first correspondence, so as to determine that the destination storage node corresponding to the first destination address is the first RDMA storage node. Afterwards, the gateway device generates the first RDMA request message based on the information of the first RDMA storage node. The first RDMA request message includes the information of the first RDMA storage node.
  • the first correspondence refers to the correspondence between the first destination address and the information of the first RDMA storage node.
  • the first corresponding relationship includes the first destination address and information of the first RDMA storage node.
  • the first correspondence is the content of an entry in a table.
  • the first correspondence is a combination of contents of two fields in the same entry, one of the two fields represents the first destination address, and the other field represents information of the first RDMA storage node.
  • the first correspondence is specifically the content of an entry in the address translation table.
• For the address translation table, please refer to the introduction in Example 1 below; the address translation table will not be described in detail here. A sketch of one entry follows.
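• One entry of such a table can be modeled as follows (a hypothetical layout; entry fields and node names are invented): the gateway maps an NVMe logical address range to the information of the RDMA storage node serving it.

```python
from dataclasses import dataclass

@dataclass
class TranslationEntry:
    start: int        # first NVMe logical byte covered by this entry
    length: int       # size of the covered range
    node_info: dict   # info of the RDMA storage node (address, QP, R_Key, ...)

TABLE = [
    TranslationEntry(0x0000_0000, 0x1000_0000,
                     {"node": "rdma-node-1", "qp": 7, "r_key": 0xBEEF}),
    TranslationEntry(0x1000_0000, 0x1000_0000,
                     {"node": "rdma-node-2", "qp": 3, "r_key": 0xCAFE}),
]

def lookup(first_destination_address: int) -> dict:
    for entry in TABLE:
        if entry.start <= first_destination_address < entry.start + entry.length:
            return entry.node_info
    raise KeyError("no RDMA storage node for this NVMe logical address")

print(lookup(0x1800_0000)["node"])  # rdma-node-2
```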
  • How the gateway device obtains the foregoing first correspondence includes multiple implementation manners. Two possible implementation manners are illustrated below with examples, see the following implementation manners A to B.
• Implementation manner A: The gateway device generates the first correspondence.
  • Implementation mode A belongs to a scheme in which a gateway device is responsible for address arrangement. Specifically, the gateway device allocates an NVMe logical address to the first RDMA storage node to obtain the first destination address. The gateway device establishes a corresponding relationship between the first destination address and the information of the first RDMA storage node, thereby generating the first corresponding relationship.
  • how the gateway device acquires the information of the first RDMA storage node includes multiple implementation manners.
  • the first RDMA storage node actively reports the information of the node to the gateway device.
  • the first RDMA storage node sends the information of the first RDMA storage node to the gateway device.
  • the gateway device receives the information of the first RDMA storage node sent by the first RDMA storage node, so as to obtain the information of the first RDMA storage node.
  • the timing for the first RDMA storage node to report information includes various situations.
• For example, when establishing an RDMA connection with the gateway device, the first RDMA storage node sends the information of the first RDMA storage node to the gateway device.
• For another example, when the information of this node is updated, the first RDMA storage node sends the information of the first RDMA storage node to the gateway device.
  • the information of the first RDMA storage node may be updated.
  • the first RDMA storage node sends the updated information of the node to the gateway device whenever it finds that the information of the node is updated.
  • the first RDMA storage node sends the information of the first RDMA storage node to the gateway device when it is powered on, started or restarted. In another possible implementation, the first RDMA storage node sends the information of the first RDMA storage node to the gateway device when receiving the instruction.
  • the first RDMA storage node generates and sends an RDMA packet to the gateway device, and the RDMA packet carries information of the first RDMA storage node.
  • the first RDMA storage node generates and sends an RDMA registration message to the gateway device, where the RDMA registration message carries information of the first RDMA storage node.
  • the RDMA registration message is used to register the memory space of the first RDMA storage node as a space for RDMA operations.
  • the RDMA registration message is a message for bilateral operation in RDMA, for example, the RDMA registration message is a send message or a receive message.
  • the first RDMA storage node reports the information of the node to the gateway device by using an inter-device communication protocol other than RDMA.
• For example, the first RDMA storage node reports the information of this node to the gateway device by means of a private protocol message, a communication interface between the storage node and the control plane, a routing protocol message, or other methods.
  • the gateway device pulls the information of the first RDMA storage node from the first RDMA storage node.
  • the gateway device generates and sends a query request to the first RDMA storage node, where the query request is used to instruct to acquire information of the first RDMA storage node.
  • the first RDMA storage node receives the query request, generates and sends a query response to the gateway device, and the query response includes information of the first RDMA storage node.
  • the gateway device receives the query response, and obtains the information of the first RDMA storage node from the query response.
  • the protocol types corresponding to the above query request and query response include multiple implementations.
  • the above query request and query response are network configuration (network configuration, NETCONF) messages or simple network management protocol (simple network management protocol, SNMP) messages.
  • the gateway device acquires the information of the first RDMA storage node from the network element of the control plane or the management plane.
  • the gateway device generates and sends a query request to the network element of the control plane or the management plane, where the query request is used to instruct to acquire the information of the first RDMA storage node.
  • the control plane or management plane network element receives the query request, generates and sends a query response to the gateway device, and the query response includes information of the first RDMA storage node.
  • the gateway device receives the query response, and obtains the information of the first RDMA storage node from the query response.
• The control plane or management plane network element can be implemented in many manners.
  • a storage node is elected from various storage nodes in the storage system, and the elected storage node serves as a network element of a control plane or a management plane.
  • the control plane or management plane network element is optionally a NOF storage node, or an RDMA storage node.
  • an independent network element is deployed as a control plane or management plane network element.
  • the gateway device acquires the information of the first RDMA storage node through static configuration.
  • the network administrator configures the information of the first RDMA storage node to the gateway device through a command line, a web interface or other methods.
  • the gateway device obtains the information of the first RDMA storage node based on the configuration operation of the network administrator.
  • how the gateway device allocates the NVMe logical address for the first RDMA storage node includes multiple implementation manners. Generally speaking, the gateway device assigns an NVMe logical address to the first RDMA storage node under the constraint that the NVMe logical addresses corresponding to different storage nodes in the storage system are not repeated.
  • the gateway device not only acquires information about the first RDMA storage node, but also acquires information about other RDMA storage nodes.
  • the gateway device creates a storage resource pool based on the information of each RDMA storage node.
  • the storage space of the storage resource pool comes from the memory space of each RDMA storage node.
  • the gateway device uniformly addresses each memory space in the storage resource pool, so that each memory space has a unique global address.
  • the so-called global address means that the memory space indicated by this address is unique in the storage resource pool, and the physical memory spaces corresponding to different global addresses are not repeated.
  • the global address of the memory space of the first RDMA storage node is the NVMe logical address allocated to the first RDMA storage node.
  • the hard disk space of the NOF storage nodes is also included in the storage resource pool, which is equivalent to pooling the memory space provided by each RDMA storage node and the hard disk space of each NOF storage node for unified management.
  • the gateway device not only obtains information of RDMA storage nodes, but also obtains information of each NOF storage node.
  • the gateway device creates a storage resource pool based on the information of each RDMA storage node and the information of each NOF storage node. Please refer to the description of Example 3 below for more details on implementing address arrangement by the gateway device.
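To make the unified addressing described above concrete, the following is a minimal sketch assuming a gateway-resident pool object and illustrative node descriptors (none of these names come from the patent): each reported memory or hard disk space is assigned a non-repeating NVMe logical address range by monotonic allocation.

```python
# Minimal sketch of a storage resource pool with unified (global) addressing.
# StoragePool, register() and the node dictionaries are illustrative
# assumptions, not structures defined by the patent.

class StoragePool:
    def __init__(self):
        self.next_addr = 0      # next free NVMe logical address
        self.regions = []       # (start, end, node_info), non-overlapping

    def register(self, node_info, size):
        # monotonic allocation guarantees that the NVMe logical addresses
        # corresponding to different storage nodes never repeat
        start, end = self.next_addr, self.next_addr + size
        self.next_addr = end
        self.regions.append((start, end, node_info))
        return start            # base of this node's global address range

pool = StoragePool()
# an RDMA node reporting an 8 KB * 100 memory space, then a NOF node's disk:
pool.register({"type": "RDMA", "ip": "10.0.0.1"}, 8 * 1024 * 100)
pool.register({"type": "NOF", "ip": "10.0.0.9"}, 512 * 1024 * 1024)
```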
  • Implementation Manner B: the gateway device receives the first correspondence from a device other than the gateway device.
  • Implementation Manner B belongs to a scheme in which a device other than the gateway device is responsible for address arrangement.
  • the network element of the control plane or the management plane allocates an NVMe logical address for the first RDMA storage node to obtain the first destination address.
  • the network element of the control plane or the management plane establishes the corresponding relationship between the first destination address and the information of the first RDMA storage node, thereby generating the above-mentioned first correspondence.
  • the network element of the control plane or the management plane sends the first correspondence to the gateway device.
  • the gateway device receives the first correspondence sent by the network element of the control plane or the management plane.
  • for how the network element of the control plane or the management plane obtains the information of the first RDMA storage node and how it allocates the NVMe logical address when generating the above-mentioned first correspondence, refer to the description in Implementation A, with the execution subject of the steps described in Implementation A changed from the gateway device to the network element of the control plane or the management plane.
  • the network element of the control plane or the management plane cooperates with the gateway device to generate the foregoing first correspondence.
  • the gateway device is responsible for reporting the information of the first RDMA storage node to the network element of the control plane or the management plane, and the network element of the control plane or the management plane generates the above-mentioned first corresponding relationship according to the information reported by the gateway device.
  • the gateway device determines the destination storage node by querying the corresponding relationship, which reduces the implementation complexity and can quickly determine the destination storage node in the process of forwarding the message.
  • because the processing logic of querying the corresponding relationship is relatively simple and modular, it can easily be offloaded to dedicated hardware for execution, so that resources of the main control processor do not need to be consumed.
  • the above-mentioned first correspondence and the forwarding entry are stored in the memory on the interface board (also called the service board), and the action of querying the first correspondence is executed by the processor on the interface board, so there is no need to upload the content of the NOF request message to the main control processor, which saves the computing power of the main control processor and improves forwarding efficiency.
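As a hedged illustration of the "query the first correspondence" step, the sketch below resolves a first destination address to the owning storage node with a range lookup; the entry fields (ip, qp, rkey) are assumptions. Because the probe is stateless and simple, it is the kind of logic that can run on an interface-board processor.

```python
# Sketch of looking up the first correspondence: destination NVMe logical
# address -> information of the storage node whose global range covers it.

import bisect

class FirstCorrespondence:
    def __init__(self, regions):
        # regions: list of (start, end, node_info), sorted by start
        self.starts = [start for start, _, _ in regions]
        self.regions = regions

    def lookup(self, dest_addr):
        i = bisect.bisect_right(self.starts, dest_addr) - 1
        if i >= 0:
            start, end, node = self.regions[i]
            if start <= dest_addr < end:
                return node   # e.g. {"ip": ..., "qp": ..., "rkey": ...}
        return None           # address not covered by the pool
```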
  • the gateway device determines the RDMA storage node by querying the corresponding relationship.
  • the gateway device adopts other implementation manners to determine the RDMA storage node, and some other implementation manners are described as examples below.
  • the gateway device determines the destination storage node based on a quality of service (quality of service, QoS) policy. For example, if the service level agreement (service-level agreement, SLA) requirement of the client is high, the gateway device determines the RDMA storage node as the destination storage node. If the SLA requirement of the client is low, the gateway device determines the NOF storage node as the destination storage node.
  • the gateway device determines the destination storage node based on a capacity balancing policy. Specifically, after the gateway device receives the first NOF request message, the gateway device selects the storage node with the largest free capacity as the destination storage node according to the current free capacity of each storage node in the storage system, so as to ensure that data is evenly written to each storage node, as sketched below.
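A minimal sketch of the two policies just described, assuming hypothetical node objects with kind and free_capacity() helpers (not named in the patent):

```python
# Illustrative destination-node selection: an SLA/QoS preference for RDMA
# nodes, then capacity balancing among whatever candidates remain.

def pick_destination(nodes, client_sla="low"):
    if client_sla == "high":                    # QoS / SLA policy
        rdma = [n for n in nodes if n.kind == "RDMA"]
        if rdma:
            nodes = rdma                        # prefer memory-backed nodes
    # capacity balancing: the node with the largest free capacity wins,
    # so data is written evenly across storage nodes over time
    return max(nodes, key=lambda n: n.free_capacity())
```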
  • a device other than the gateway device executes the step of querying the corresponding relationship, and then notifies the gateway device of the destination storage node obtained by the query.
  • the client specifies the destination storage node.
  • the first NOF request message sent by the client includes the identifier of the first RDMA storage node.
  • the various manners of determining the destination storage node listed above are all optional manners, and this embodiment does not limit how the gateway device determines the destination storage node after receiving the first NOF request message.
  • how the gateway device generates the NOF response message to reply to the client includes multiple implementation modes, and some implementation modes that may be adopted when generating the NOF response message are illustrated below with examples.
  • after receiving the RDMA response message returned by the RDMA storage node, the gateway device obtains the NOF status information by some means, and generates the NOF response message according to the NOF status information.
  • the gateway device obtains the NOF status information by querying the corresponding relationship.
  • the gateway device obtains the RDMA status information based on the RDMA response message; the gateway device queries the second correspondence according to the RDMA status information, and obtains the NOF status information from the second correspondence; the gateway device generates the first NOF response message based on the NOF status information.
  • the second correspondence refers to the correspondence between RDMA status information and NOF status information.
  • the second correspondence includes a correspondence between RDMA status information and NOF status information.
  • for the RDMA status information, please refer to the introduction in part (17) of the term explanations above.
  • for the NOF status information, please refer to the introduction in part (18) of the term explanations above.
  • the second correspondence is the content of an entry in a table.
  • the second correspondence is a combination of contents of two fields in the same entry, one of the two fields represents RDMA state information, and the other field represents NOF state information.
  • the second correspondence is specifically the content of an entry in the NOF context table.
  • for the NOF context table, please refer to the introduction of the following Example 1; the NOF context table will not be described in detail here.
  • the gateway device establishes the second correspondence during the process of converting the NOF request packet into the RDMA request packet. For example, in conjunction with the method shown in Figure 9, after the gateway device receives the first NOF request message, the gateway device obtains the NOF status information based on the first NOF request message, and obtains the RDMA status information to be carried in the RDMA request message. The gateway device establishes the corresponding relationship between the NOF state information and the RDMA state information.
  • when the gateway device executes the method shown in Figure 9, the gateway device obtains the NOF PSN carried in the first NOF request message, obtains the RDMA PSN to be carried in the RDMA request message (that is, the first RDMA request message) to be sent this time, and establishes the corresponding relationship between the NOF PSN and the RDMA PSN.
  • the basic principle for the gateway device to obtain the RDMA PSN is as follows: when the gateway device establishes a session with the RDMA storage node based on the RDMA protocol, the gateway device initializes the RDMA PSN to obtain an RDMA PSN. Afterwards, whenever the gateway device wants to send an RDMA request message to the RDMA storage node, the gateway device first updates the RDMA PSN carried in the last RDMA request message according to the set rules, carries the updated RDMA PSN in the RDMA request message to be sent this time, and then sends the RDMA request message.
  • the specific way for the gateway device to update the RDMA PSN during the interaction process is determined according to the processing logic of the RDMA protocol stack. For example, in the case of no fragmentation, updating the RDMA PSN is adding one to the RDMA PSN, and in the case of fragmentation, updating the RDMA PSN is adding the number of fragments to the RDMA PSN.
  • the method is not limited.
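The PSN bookkeeping above can be sketched as follows; the class and method names are assumptions, and the update rule (+1 when unfragmented, +N for N fragments) follows the rules stated above:

```python
# Sketch of RDMA PSN management plus the NOF PSN <-> RDMA PSN pairing
# (the "second correspondence").

class RdmaSession:
    def __init__(self, initial_psn):
        self.psn = initial_psn        # initialized when the session is built
        self.psn_map = {}             # NOF PSN -> RDMA PSN

    def next_request_psn(self, nof_psn, fragments=1):
        rdma_psn = self.psn
        self.psn += fragments         # +1 normally, +fragment count otherwise
        self.psn_map[nof_psn] = rdma_psn
        return rdma_psn

    def nof_psn_for(self, rdma_psn):
        # reverse lookup when the RDMA response arrives
        for nof, rdma in self.psn_map.items():
            if rdma == rdma_psn:
                return nof
        return None
```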
  • the second correspondence described above is the correspondence between the RDMA PSN and the NOF PSN.
  • the RDMA PSN in the second correspondence is optionally replaced with other RDMA state information, and the NOF PSN in the second correspondence is optionally replaced with other NOF state information.
  • the specific content of the RDMA state information and the NOF state information in the second correspondence is not limited.
  • the establishment of the correspondence between the NOF state information and the RDMA state information described above is optional.
  • the correspondence between the NOF state information and other information is established, and the gateway device obtains the NOF state information by searching for the correspondence between the NOF state information and other information.
  • the gateway device establishes a correspondence between the NOF state information and the information of the first RDMA node (such as the device identifier).
  • the gateway device maintains a session table. During a session between the gateway device and the client based on the NOF protocol, whenever the gateway device and the client exchange a message, the gateway device saves the current NOF state information in the session table, and the gateway device uses the most recently saved NOF state information to determine the NOF state information to be used when sending the current NOF response message.
  • the above implementation mode I describes a solution in which the gateway device obtains the NOF status information by querying the corresponding relationship.
  • the technical effect of this method is analyzed below, see the following two points.
  • Implementation Mode II: the gateway device first carries the NOF status information in the RDMA request message and sends it to the RDMA storage node, and then the gateway device obtains the NOF status information from the RDMA response message returned by the RDMA storage node.
  • the gateway device obtains the NOF status information based on the first NOF request message.
  • the gateway device adds the NOF state information to the first RDMA request message to obtain the first RDMA request message including the NOF state information.
  • the gateway device sends the first RDMA request packet including the NOF state information to the first RDMA storage node.
  • the RDMA storage node obtains the NOF state information from the first RDMA request message.
  • the RDMA storage node adds the NOF state information to the RDMA response message to obtain the RDMA response message including the NOF state information.
  • the RDMA storage node sends an RDMA response message including NOF status information.
  • the gateway device obtains NOF status information based on the RDMA response message; the gateway device generates a first NOF response message based on the NOF status information.
  • the carrying positions of the NOF state information in the first RDMA request message and the RDMA response message include various situations.
  • the NOF state information is located between the RDMA header and the payload.
  • the NOF status information is located in the RDMA header.
  • a new type of message header or a new type of TLV is extended in the RDMA protocol, and the new type of message header or TLV is used to carry NOF status information.
  • some reserved fields in the RDMA protocol are used to carry the NOF state information, and this embodiment does not limit how to carry the NOF state information in the RDMA message.
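A hedged byte-level sketch of Implementation Mode II, assuming (purely for illustration) that a 4-byte NOF PSN is carried between the RDMA header and the payload; neither the layout nor the field width is defined by the RDMA or NOF specifications:

```python
# Piggy-backing NOF state information on the RDMA request and recovering it
# from the echoed RDMA response, so the gateway keeps no local state table.

import struct

def add_nof_state(rdma_header: bytes, payload: bytes, nof_psn: int) -> bytes:
    # NOF state wedged between the RDMA header and the payload
    return rdma_header + struct.pack("!I", nof_psn) + payload

def extract_nof_state(rdma_response: bytes, header_len: int):
    (nof_psn,) = struct.unpack_from("!I", rdma_response, header_len)
    payload = rdma_response[header_len + 4:]
    return nof_psn, payload   # the gateway builds the NOF response from these
```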
  • the gateway device can obtain the NOF status information without maintaining additional table entries locally, thus saving the storage space of the gateway device and reducing the resource overhead caused by the gateway device looking up and writing tables.
  • generating the NOF response message by acquiring the NOF status information is an optional implementation manner of the embodiment of the present application.
  • optionally, the main processing work of generating the NOF packet header is executed on the RDMA storage node; see the following Implementation Mode III.
  • Implementation Mode III: the RDMA storage node processes and obtains part or all of the content of the NOF message header, and the gateway device multiplexes the processing result of the RDMA storage node to generate a NOF response message.
  • the gateway device pre-generates the NOF message header, fills in the contents of some fields in the NOF message header, and sends the RDMA request message containing the NOF message header to the RDMA storage node.
  • after the RDMA storage node receives the RDMA request message, it further processes the NOF message header, such as filling in the content of the blank fields in the NOF message header, or modifying the content of the fields already filled by the gateway device. Then, the RDMA storage node carries the processed NOF message header in the RDMA response message, and returns the RDMA response message including the NOF message header.
  • after receiving the RDMA response message returned by the RDMA storage node, the gateway device converts it into a NOF response packet.
  • there are various implementation manners regarding which fields of the NOF packet header the gateway device pre-fills.
  • the gateway device uses the NOF state information to fill the field in the NOF packet header for carrying the NOF state information.
  • the gateway device also fills in one or more items of MAC header content, IP header content, or UDP header content in the NOF packet header.
  • the field types that the gateway device pre-fills in the NOF message header can be set according to the business scenario. This embodiment does not limit which fields the gateway device specifically pre-fills in the NOF message header.
  • the gateway device generates the first RDMA request message including the first NOF message header, and sends the first RDMA request message including the first NOF message header to the first RDMA storage node .
  • the first RDMA storage node obtains the first NOF message header from the first RDMA request message, generates a second NOF message header based on the first NOF message header, and generates and sends an RDMA response message including the second NOF message header.
  • after the gateway device receives the RDMA response message, the gateway device generates the first NOF response message based on the second NOF message header in the RDMA response message.
  • the NOF packet header is encapsulated in the inner layer of the RDMA packet header.
  • the specific process for the gateway device to generate the first NOF response message includes: the gateway device strips the outer RDMA message header in the RDMA response message, and uses the remaining part of the obtained RDMA response message as the NOF response message.
  • the implementation mode III is illustrated below by taking NOF as RoCE as an example.
  • the gateway device pre-generates a RoCE header (that is, the first NOF packet header), and the RoCE header includes a MAC header, an IP header, a UDP header and an IB header.
  • the gateway device fills the contents of the MAC header, IP header, UDP header and IB header with the information to be responded to the client.
  • the gateway device encapsulates the RDMA header and the filled RoCE header to obtain the above-mentioned first RDMA request message.
  • the RoCE header is encapsulated in the inner layer of the RDMA header.
  • the first RDMA storage node generates an RDMA response message according to the RoCE header in the RDMA request message.
  • the RoCE header (the second NOF header) in the RDMA response message is encapsulated in the inner layer of the RDMA header.
  • the gateway device strips the outer RDMA header of the RDMA response message, uses the remaining part of the RDMA response message as the first NOF response message, and returns the first NOF response message to the client.
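The encapsulate-then-strip structure of Implementation Mode III reduces to a few lines; this is a sketch under the assumption that the headers are already serialized byte strings:

```python
# Implementation Mode III with NOF = RoCE: the pre-filled RoCE header rides
# inside the RDMA header on the way out; stripping the outer RDMA header on
# the way back leaves a ready-made NOF response.

def build_rdma_request(rdma_header: bytes, prefilled_roce_header: bytes,
                       payload: bytes) -> bytes:
    # outer RDMA header | inner RoCE (MAC/IP/UDP/IB) header | payload
    return rdma_header + prefilled_roce_header + payload

def rdma_response_to_nof(rdma_response: bytes, rdma_header_len: int) -> bytes:
    # the storage node has already completed the inner RoCE header, so the
    # remainder after the outer RDMA header is the first NOF response message
    return rdma_response[rdma_header_len:]
```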
  • the gateway device can obtain the NOF status information without maintaining additional entries locally, thus saving the storage space of the gateway device and reducing the resource overhead caused by the gateway device looking up and writing tables.
  • the processing pressure of the gateway device is reduced by transferring the work of generating the NOF message header to the RDMA storage node for execution.
  • the step in which the gateway device pre-generates the NOF packet header in the implementation mode III is optional.
  • the RDMA storage node is responsible for encapsulating the NOF header into the RDMA message.
  • the gateway device supports writing the same piece of data to each of the multiple RDMA storage nodes, thereby realizing the function of data backup.
  • the above-mentioned first NOF request message is a NOF write request message.
  • the first NOF request message carries an NVMe write command.
  • the NVMe write command instructs a write operation to be performed at the first destination address.
  • after receiving the first NOF request message, the gateway device acquires the information of the first RDMA storage node and the information of the second RDMA storage node based on the first destination address.
  • the gateway device not only generates the first RDMA request message based on the first NOF request message, but also generates the second RDMA request message based on the first RDMA request message.
  • the gateway device not only sends the first RDMA request message to the first RDMA storage node, but also sends the second RDMA request message to the second RDMA storage node.
  • the second RDMA request packet has similar features to the first RDMA request packet.
  • the second RDMA request message also includes the data to be saved carried in the first NOF request message.
  • the second RDMA request message includes an RDMA write command corresponding to the NVMe write command.
  • the second RDMA request message also includes information about the second RDMA storage node.
  • the second RDMA request message includes the third destination address, network location information of the second RDMA storage node, and identifiers of one or more QPs in the second RDMA storage node.
  • the third destination address is the address of the memory space in the second RDMA storage node.
  • the processing action of the second RDMA storage node for the second RDMA request message is similar to the processing action of the first RDMA storage node. Specifically, the second RDMA storage node executes the RDMA instruction in the second RDMA request message, finds the location corresponding to the third destination address in the memory, and saves the data in the second RDMA request message to the location corresponding to the third destination address in the memory.
  • the gateway device obtains the information of the second RDMA storage node.
  • the above-mentioned first corresponding relationship not only includes the first destination address and the information of the first RDMA storage node, but also includes the information of the second RDMA storage node, so after the gateway device searches the first corresponding relationship, the information of the second RDMA storage node can be obtained.
  • the above-described situation of obtaining the information of two RDMA storage nodes according to a destination NVMe address, and thus writing a copy of data to two RDMA storage nodes is an example.
  • the number of nodes is not limited.
  • the number of RDMA storage nodes determined according to one destination NVMe address is optionally equal to the number of copies.
  • the number of RDMA storage nodes determined according to one destination NVMe address is optionally equal to the sum of the number of data blocks and check blocks in a stripe.
  • how the gateway device sends an RDMA write request message to multiple RDMA storage nodes includes multiple implementation methods. The following describes two sending methods as examples.
  • Sending method 1: the gateway device multicasts the RDMA write request message to multiple RDMA storage nodes.
  • the multicast manners that may be adopted by the gateway device include many implementation manners.
  • multicast methods include but are not limited to bit indexed explicit replication (BIER), BIER based on Internet protocol version 6 (BIERv6), the internet group management protocol (IGMP), protocol independent multicast (PIM), the multicast source discovery protocol (MSDP), the multiprotocol border gateway protocol (MBGP), and so on.
  • the above-mentioned first RDMA request message and the second RDMA request message are both multicast messages.
  • the first RDMA request packet and the second RDMA request packet include a multicast packet header encapsulated in an outer layer of the RDMA packet header.
  • the multicast packet header includes, for example, the identifier of the multicast group joined by the first RDMA storage node and the second RDMA storage node, and for example includes the device identifier of the first RDMA storage node or the second RDMA storage node in the multicast domain.
  • the multicast packet header includes but not limited to BIER header, BIERv6 header, IGMP header, PIM header and so on.
  • Sending method 2: the gateway device sends the RDMA write request message to each RDMA storage node in a unicast manner.
  • the above-mentioned first RDMA request message and the second RDMA request message are both unicast messages.
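A sketch of sending method 2, assuming hypothetical helpers nodes_for(), rebind() and send_unicast() (none are named by the patent): the gateway fans the same write out to every replica node.

```python
# Unicast replication of an RDMA write: one request per RDMA storage node,
# each carrying that node's own destination address, network location and QP.

def replicate_write(correspondence, dest_addr, write_msg, send_unicast):
    for node in correspondence.nodes_for(dest_addr):  # all replica nodes
        send_unicast(node, write_msg.rebind(node))    # per-node copy
```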
  • the gateway device supports sending the read request to one RDMA storage node among multiple candidate RDMA storage nodes, which implements the load sharing feature and allows multiple RDMA nodes to share the processing pressure caused by reading data.
  • the gateway device obtains the information of the first RDMA storage node and the information of the second RDMA storage node based on the first destination address. In this case, the gateway device selects an RDMA storage node from the first RDMA storage node and the second RDMA storage node based on a load sharing algorithm.
  • if the gateway device selects the first RDMA storage node, the gateway device sends the first RDMA request message to the first RDMA storage node.
  • if the gateway device selects the second RDMA storage node, the steps that the first RDMA storage node is responsible for in the method shown in FIG. 9 are instead performed by the second RDMA storage node.
  • the load sharing algorithm adopted by the gateway device includes multiple specific implementation methods.
  • the load sharing algorithm is a consistent hash algorithm.
  • the load sharing algorithm is to select the storage node with the lowest data access frequency from multiple RDMA storage nodes corresponding to the destination NVMe address. This embodiment does not limit the type of load sharing algorithm adopted by the gateway device.
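As one hedged example of a consistent-hashing choice, the sketch below uses rendezvous (highest-random-weight) hashing, which stays stable as candidate nodes join or leave; node_id is an assumed attribute, not a field from the patent.

```python
# Load sharing for reads: pick one RDMA node among the candidates mapped to
# the destination address, deterministically per address.

import hashlib

def pick_read_node(candidates, dest_addr: int):
    def weight(node):
        key = f"{node.node_id}:{dest_addr}".encode()
        return hashlib.md5(key).digest()
    return max(candidates, key=weight)   # highest hash weight wins
```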
  • the gateway device also supports interaction with NOF storage nodes based on the NOF protocol.
  • the process of interaction between the gateway device and the NOF storage node is described below with an example.
  • for example, part of the data requested by the client is not stored in the RDMA storage node but is stored in the NOF storage node; in that case, the gateway device uses the method corresponding to Figure 11 to obtain the data that is stored on the NOF storage node and requested by the client.
  • for another example, the current storage capacity of the RDMA storage nodes in the system is insufficient and may not meet the data storage requirements of the client.
  • the gateway device uses the storage space of the NOF storage node to store the data of the client through the method corresponding to FIG. 11 .
  • FIG. 11 is a flow chart of a message processing method provided by an embodiment of the present application. The method shown in FIG. 11 includes the following steps S501 to S512.
  • the client sends a first NOF request packet.
  • the gateway device receives the first NOF request message from the client.
  • the gateway device acquires the information of the NOF storage node based on the first destination address.
  • the gateway device obtains the first destination address from the first NOF request packet.
  • the gateway device obtains the information of the destination storage node based on the first destination address, that is, obtains the information of the NOF storage node.
  • the implementation manner of how the gateway device obtains the information of the NOF storage node is the same as the implementation manner of obtaining the information of the first RDMA storage node in the embodiment shown in FIG. 9 .
  • in this case, the above-mentioned first corresponding relationship is replaced from the corresponding relationship between the first destination address and the information of the first RDMA storage node with the corresponding relationship between the first destination address and the information of the NOF storage node, so the gateway device can obtain the information of the NOF storage node after searching the corresponding relationship.
  • the gateway device modifies the first NOF request message to obtain the second NOF request message.
  • the above-mentioned first NOF request message includes the first NOF state information.
  • the gateway device modifies the first NOF state information into the second NOF state information to obtain the second NOF request message including the second NOF state information.
  • the first NOF state information is state information of interaction between the client and the gateway device based on the NOF protocol.
  • the second NOF state information is the state information of interaction between the gateway device and the NOF storage node based on the NOF protocol.
  • the gateway device sends a second NOF request message to the NOF storage node.
  • the second NOF request message includes the NVMe command, the first destination address and the information of the NOF storage node.
  • the NOF storage node receives the second NOF request message.
  • the NOF storage node executes the NVMe instruction to perform a read/write operation on the hard disk.
  • the NOF storage node generates a second NOF response message.
  • the second NOF response message is a response message to the second NOF request message.
  • the NOF storage node sends a second NOF response message.
  • the gateway device receives the second NOF response message from the NOF storage node.
  • the gateway device generates a third NOF response message based on the second NOF response message.
  • the gateway device modifies the second NOF response message to obtain the third NOF response message.
  • the above-mentioned second NOF response message includes the third NOF state information.
  • the gateway device modifies the third NOF state information into the fourth NOF state information to obtain the third NOF response message including the fourth NOF state information.
  • the third NOF state information is the state information of interaction between the gateway device and the NOF storage node based on the NOF protocol.
  • the fourth NOF state information is state information of interaction between the client and the gateway device based on the NOF protocol.
  • the gateway device sends a third NOF response message to the client.
  • the third NOF response message is a response message to the first NOF request message.
  • the client receives the third NOF response packet.
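The FIG. 11 proxy behaviour amounts to rewriting the NOF state information in each direction; below is a minimal sketch, assuming session objects with next_psn() and qp attributes (illustrative names only):

```python
# NOF proxy: the gateway bridges two NOF sessions and swaps the state
# information when relaying requests and responses.

class NofProxy:
    def __init__(self, client_session, storage_session):
        self.client = client_session    # client <-> gateway NOF state
        self.storage = storage_session  # gateway <-> NOF-node NOF state

    def relay_request(self, nof_request):
        # first NOF state -> second NOF state (toward the NOF storage node)
        nof_request.psn = self.storage.next_psn()
        nof_request.qp = self.storage.qp
        return nof_request              # the second NOF request message

    def relay_response(self, nof_response):
        # third NOF state -> fourth NOF state (back toward the client)
        nof_response.psn = self.client.next_psn()
        nof_response.qp = self.client.qp
        return nof_response             # the third NOF response message
```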
  • the gateway device supports the original NOF interaction process, so as to maintain compatibility with the original NOF storage solution, without the need to replace a large number of existing network devices.
  • At least part of the content of the embodiment corresponding to FIG. 9 and the embodiment corresponding to FIG. 11 may be combined with each other.
  • the gateway device selectively executes one of the embodiment corresponding to FIG. 9 and the embodiment corresponding to FIG. 11 by making a judgment.
  • the corresponding relationship on the gateway device includes a node type identifier, where the node type identifier is used to identify whether the storage node is an RDMA storage node or a NOF storage node. After the gateway device receives the NOF request, it judges whether the node type identifier corresponding to the destination NVMe address in the corresponding relationship represents an RDMA storage node or a NOF storage node; if the node type identifier represents an RDMA storage node, the embodiment corresponding to FIG. 9 is entered, and if it represents a NOF storage node, the embodiment corresponding to FIG. 11 is entered.
  • if the gateway device obtains both the information of the RDMA storage node and the information of the NOF storage node according to the destination address, the gateway device selects one of the RDMA storage node and the NOF storage node as the responder of the NOF request message according to the set policy (such as load sharing, capacity sharing, QoS policy, etc.). If the gateway device selects the RDMA storage node, the embodiment corresponding to FIG. 9 is entered. If the gateway device selects the NOF storage node, the embodiment corresponding to FIG. 11 is entered.
  • another possible implementation manner of combining the two embodiments is that both the embodiment corresponding to FIG. 9 and the embodiment corresponding to FIG. 11 are executed. That is, after the gateway device receives the NOF request message from the client, it not only interacts with the RDMA storage node, but also interacts with the NOF storage node. As an example, one type of node in the RDMA storage node and the NOF storage node acts as a master node, and the other type of node acts as a standby node.
  • after receiving the NOF request message from the client, the gateway device sends the RDMA request to the RDMA storage node and the NOF request to the NOF storage node, thereby saving the data in the memory of the RDMA storage node and on the hard disk of the NOF storage node respectively.
  • SAN refers to an architecture that connects storage media with computers (such as servers) through a network.
  • SAN supports the expansion of storage media that can only be carried by a single server to multiple servers through the network, greatly improving storage capacity and scalability.
  • a SAN implemented based on fiber channel is called an FC-SAN (fiber channel-storage area network).
  • a SAN implemented based on IP is called an IP-SAN (IP-storage area network).
  • using the NOF protocol based on NVMe instructions to build an IP-SAN storage network works better, so the following examples are described as improvements on the basis of NOF.
  • the basic principles of why using the NOF protocol to build an IP-SAN storage network works better are as follows.
  • the NVMe subsystem (NOF storage node) is directly connected to the host through the PCIe bus, and no host bus adapter (HBA) card is required, reducing system overhead.
  • the NVMe subsystem cuts down the IO scheduling layer, uses a separate command layer, and has a shorter IO path, which guarantees low latency.
  • NVMe can support up to 64K command queues, and each command queue supports up to 64K commands.
  • NVMe is more performant and efficient.
  • NOF inherits the advantages of NVMe, so using the NOF protocol to build an IP-SAN storage network has a better effect.
  • the solutions of the following examples can also be applied to storage protocol instructions based on storage protocols other than NVMe and to storage systems based on such instructions.
  • in that case, the parts of the following examples involving the storage protocol are modified accordingly, and the specific implementation is similar to the following examples.
  • the following example implements a gateway device that can replace traditional network forwarding devices.
  • the gateway device supports the following four functions on the basis of realizing Layer 2 and Layer 3 forwarding.
  • because the gateway device provided in this example supports the RDMA protocol stack, it can establish a connection with the RDMA storage node and interact based on the RDMA protocol.
  • the gateway device provided in this example can act on behalf of the NOF storage node to interact with the client because it implements the NOF protocol stack.
  • by deploying the gateway device provided in this embodiment, the expansion of RDMA storage nodes in the NOF storage network is supported.
  • the address conversion table is saved on the gateway device. After the destination NVMe address of the NOF operation is resolved, the destination NVMe address can be converted into an RDMA address through the address conversion table. Moreover, the gateway device converts the traditional NOF operation on the target NVMe hard disk into an operation directed to the memory of the RDMA node by converting the NVMe instruction into an RDMA instruction.
  • the following examples can improve the performance and expansion flexibility of traditional NOF storage solutions.
  • the solution in the following example is compatible with the original solution, and the client does not need to make improvements, nor does it need to perceive changes in storage nodes.
  • the client can not only use the storage service provided by the NOF storage node, but also use the storage service provided by the RDMA storage node with better performance; for the storage node, the gateway device offloads the address management task of the storage node and takes over the back-end extension function of the storage server.
  • the gateway device can process the NOF request and direct it to the destination storage node according to the destination NVMe address in the NOF request, without requiring the storage node to perform back-end NOF expansion, thereby reducing the CPU pressure and network I/O pressure of the storage node.
  • the following example implements a gateway device.
  • the RDMA storage node does not need to establish a logical connection with the client, but establishes a logical connection with the gateway device.
  • the gateway device is equivalent to the general entrance of storage resources.
  • the gateway device manages the storage space of the NOF storage node and the RDMA storage node at the same time.
  • the gateway device can map the destination address in the NOF request of the client to the address of the memory space of the RDMA node, so that the original full-path NOF storage service supports both the NOF storage service and the RDMA storage service with better performance.
  • FIG. 12 is a schematic structural diagram of a storage system after a gateway device is deployed according to an embodiment of the present application.
  • Figure 12 is described by taking the memory accessed based on RDMA as a DRAM cache as an example.
  • Figure 12 distinguishes and marks NOF-related features and RDMA-related features with different line types.
  • the storage system shown in FIG. 12 includes a client, a gateway device, a NOF storage node, an RDMA storage node A, an RDMA storage node B, and an RDMA storage node C.
  • NOF storage nodes contain NVMe storage media.
  • Each of RDMA storage node A, RDMA storage node B, and RDMA storage node C includes a DRAM cache.
  • the gateway device is deployed between the client and each storage node.
  • the gateway device and the client establish a NOF connection based on the NOF protocol.
  • the gateway device establishes an RDMA connection with the RDMA storage node A, the RDMA storage node B, and the RDMA storage node C based on the RDMA protocol.
  • the gateway device establishes a NOF connection with the NOF storage node based on the NOF protocol.
  • the gateway device judges whether the storage node corresponding to the destination address in the NVMe instruction in the NOF request message in the locally saved correspondence is a NOF storage node or an RDMA storage node. If the storage node corresponding to the destination address is an RDMA storage node, the gateway device converts the NOF request message into an RDMA request message containing an RDMA instruction, and sends the RDMA request message to the RDMA storage node corresponding to the destination address, so that the RDMA storage node RDMA is used to read and write to the DRAM cache.
  • for example, the gateway device sends the RDMA request message to RDMA storage node A. If the storage node corresponding to the destination address is a NOF storage node, the gateway device does not need to perform the step of protocol message conversion, and sends a NOF request message to the NOF storage node, so that the NOF storage node performs read and write operations on the NVMe storage medium.
  • the memory medium in the RDMA storage node is a DRAM cache.
  • the RDMA storage node uses other types of memory media, such as SCM, SRAM, DIMM, or memory-type hard disk, etc. This embodiment does not limit the type of memory media in the RDMA storage node.
  • FIG. 12 is a simplified schematic diagram. Other hardware components such as a processor are omitted in FIG. 12 , and the hardware structure of the device will be described in detail in other embodiments.
  • when the storage system needs capacity expansion, the original capacity expansion solution can optionally be used, or RDMA storage nodes can be added.
  • the newly added RDMA storage node establishes a connection with the gateway device.
  • the address mapping relationship of the newly added RDMA storage node is added to the address mapping table of the gateway device. Performance is better because an RDMA-accessed memory space is used to provide the expanded storage capacity.
  • this expansion method combines the advantages of NOF horizontal expansion and vertical expansion.
  • the gateway device acts as a storage node to provide storage services for the client.
  • the NOF request message is terminated at the gateway device, the gateway device operates its own cache to perform data read and write operations, and the gateway device constructs a NOF response message to interact with the client.
  • Fig. 13 is a schematic diagram of a scenario where a gateway device acts as a storage node. As shown in Figure 13, the gateway device locally executes the NVMe command in the NOF request message, and can perform data read and write operations on the cache of the gateway device itself without forwarding the request message to the storage node.
  • Some embodiments provided in this application implement a gateway device.
  • the gateway device can support Layer 2 and Layer 3 forwarding of traditional Ethernet, and realizes the following functions on this basis.
  • the gateway device can process the RDMA protocol stack to realize the connection and interaction between the gateway device and the RDMA storage node.
  • the gateway device can process the NOF protocol stack, parse out the information of the NOF protocol stack, maintain the status information of the NOF protocol stack, and realize the proxy function of replying the NOF message to the client.
  • the gateway device realizes NOF-to-RDMA message conversion and RDMA-to-NOF message conversion according to the current interaction information and the previously recorded state information of the NOF message and the RDMA message. Specifically, the NOF request message is converted into an RDMA request message, and the RDMA response message is converted into a NOF response message.
  • this embodiment provides an address translation table in the NOF-RDMA direction.
  • the address translation table is deployed on the gateway device.
  • the address translation table implements the mapping from the NVMe destination logical address to the RDMA destination logical address in the NOF.
  • the gateway device parses out the destination address in the NVMe instruction in the NOF message, finds the memory address and other information of the corresponding RDMA node by searching the address translation table, and constructs an RDMA packet based on the lookup result.
  • the logical combination of the above functions in the gateway device is shown in FIG. 14.
  • Both server 1 and server 2 in FIG. 14 are examples of RDMA storage nodes. Both Server 1 and Server 2 are configured with RDMA network cards. RDMA storage node server 1 has registered a memory space with a length of 8K*100 for RDMA read and write operations, and the logical storage address corresponding to the memory space of RDMA storage node server 1 is LUN0. RDMA storage node server 2 has registered a memory space with a length of 8K*100 for RDMA read and write operations, and the logical storage address corresponding to the memory space of RDMA storage node server 2 is LUN1.
  • the disk array in FIG. 14 is an example of a NOF storage node.
  • the gateway device shown in FIG. 14 includes a NOF snooping module, an RDMA adapter and multiple ports.
  • the NOF monitoring module is used to identify NOF messages. After the NOF monitoring module receives a NOF message, if it recognizes that the destination storage node of the NOF message is the NOF disk array, it forwards the NOF message to the NOF disk array; if it recognizes that the destination storage node of the NOF message is an RDMA storage node, it sends the NOF message to the RDMA adapter.
  • the RDMA adapter converts the NOF message into an RDMA message, and sends the RDMA message to the RDMA node.
  • the RDMA adapter also processes the RDMA response message sent by the RDMA node, converts the RDMA response message into a NOF response message, and sends the NOF response message to the client.
  • the RDMA memory space provided by server 1 includes 100 pages with a size of 8KB.
  • the RDMA memory space provided by server 2 includes 100 pages with a size of 8KB.
  • the gateway device virtualizes the RDMA memory space provided by server 1 as LUN0, and virtualizes the RDMA memory space provided by server 2 as LUN1.
  • LUN0 and LUN1 are presented to clients as usable storage spaces.
  • the RDMA adapter in the gateway device parses the destination NVMe address in the NOF request message. If the LUN ID in the destination NVMe address is LUN0, then the RDMA adapter converts the NOF request message into an RDMA request message to be sent to server 1. If the LUN ID in the destination NVMe address is LUN1, then the RDMA adapter converts the NOF request message into an RDMA request message to be sent to server 2.
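The LUN-keyed routing in the FIG. 14 example can be sketched directly from the numbers above (8 KB * 100 pages per server); the table and function shapes are assumptions for illustration:

```python
# RDMA adapter dispatch by LUN ID: LUN0 -> server 1's registered memory,
# LUN1 -> server 2's, translating the LBA into an offset in that memory.

LUN_TABLE = {
    0: {"server": "server1", "pages": 100, "page_size": 8 * 1024},
    1: {"server": "server2", "pages": 100, "page_size": 8 * 1024},
}

def route_nof_request(lun_id: int, start_lba: int, block_size: int = 8 * 1024):
    entry = LUN_TABLE[lun_id]
    offset = start_lba * block_size          # position in registered memory
    limit = entry["pages"] * entry["page_size"]
    if offset >= limit:
        raise ValueError("destination address outside the registered region")
    return entry["server"], offset
```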
  • the NOF monitoring module corresponds to part of the logic of the message parsing module, the address conversion table, and the NOF proxy sending module in the following examples.
  • the RDMA adapter is a module for logical conversion of NOF-RDMA.
  • the RDMA adapter corresponds to the modules used for logical conversion between NOF and RDMA in the following examples, such as the NOF context table in Example 1 and the message additional information processing in Example 2.
  • FIG. 14 selects a dedicated gateway device to implement the gateway device in this embodiment, so as to improve message forwarding performance.
  • this embodiment is not limited to using a dedicated gateway device to realize the above functions.
  • optionally, a server, a traditional network device, an FPGA device, or the like is used as the gateway device to realize the above functions.
  • Pre-connection mainly refers to the process of establishing a connection between nodes.
  • the configuration process mainly refers to the process in which the storage node reports the address of the memory space and the information of the node.
  • the unilateral operation of RDMA is used in the actual data access process.
  • the gateway device performs special processing on the read operation or write operation of the unilateral operation to improve performance. Bilateral operations are optionally not treated specially by the gateway device.
  • for bilateral operations, the gateway device can parse them normally according to the specification.
  • the RDMA storage node reports the address of the memory space through a bilateral operation.
  • after the gateway device parses and obtains the address of the memory space, optionally, the gateway device notifies the NOF storage node of the address of the memory space.
  • the NOF storage node arranges the address of the memory space of each RDMA storage node and the address of the hard disk space of each NVMe storage node in a unified manner to obtain the NVMe logical address, and then configures the NVMe logical address to the gateway device.
  • the gateway device performs unified address arrangement, the gateway device does not need to report the address of the memory space to the NOF storage node, and the gateway device directly controls the addresses of all memory spaces and the addresses of all hard disk spaces.
  • Figure 15 is a flow diagram of one embodiment of the present application.
  • Figure 15 mainly shows the process of the gateway device implementing the NOF protocol proxy and the protocol message conversion process of NOF-RDMA.
  • the flow shown in FIG. 15 includes the following S61 to S63.
  • S61 specifically includes S611 to S614.
  • the client establishes an NOF connection with the NOF storage node.
  • the gateway device establishes an RDMA connection with the RDMA storage node.
  • the RDMA storage node reports the information of the node and the address of the memory space to the NOF storage node.
  • the NOF storage node receives the node information and the address of the memory space sent by the RDMA storage node.
  • the NOF storage node performs unified address arrangement and sends the address translation table to the gateway device.
  • that the NOF storage node performs the address arrangement is an optional implementation manner; in other embodiments, the gateway device performs the address arrangement.
  • the above flow is an initialization process. If an RDMA storage node needs to be added during the operation of the storage system, the newly added RDMA storage node can be added to the entire storage system by repeatedly executing S612, S613 and S614.
  • the client sends a NOF request message.
  • the NOF request message is an NOF read request message or an NOF write request message.
  • the gateway device receives the NOF request message from the client.
  • the gateway device parses the NOF request message to obtain the destination storage address in the NVMe command in the message.
  • the gateway device looks up the address conversion table in the gateway device according to the destination storage address to obtain the information of the destination storage node. If the destination storage address is located in the NOF storage node, the following S622 to S623 are entered.
  • the gateway device performs simple proxy processing on the NOF request message, and sends the processed NOF request message to the NOF storage node.
  • the NOF storage node receives the NOF request message.
  • the NOF storage node sends a corresponding NOF response message to the NOF request message.
  • the gateway device receives the NOF response message.
  • the gateway device performs simple proxy processing on the NOF response message, and sends the processed NOF response message to the client.
  • the NOF response message is the NOF read response message.
  • the NOF response message is a NOF write response message.
  • the gateway device searches the address conversion table in the gateway device and obtains that the destination storage address is located in the RDMA storage node, the gateway device performs the following S631 to S633.
  • the gateway device encapsulates the RDMA unilateral operation request message according to the NOF-RDMA conversion logic and the information of the destination RDMA node.
  • the gateway device sends the RDMA unilateral operation request message to the RDMA storage node.
  • if the NOF request message sent by the client is a NOF read request message, the RDMA request message sent by the gateway device is an RDMA read request message.
  • if the NOF request message sent by the client is a NOF write request message, the RDMA request message sent by the gateway device is an RDMA write request message.
  • the RDMA storage node receives an RDMA unilateral operation request message from the gateway device.
  • the RDMA storage node executes the RDMA command based on the RDMA unilateral operation request message, and generates and sends the RDMA unilateral operation response message.
  • after the gateway device obtains the RDMA unilateral operation response message from the RDMA storage node, the gateway device converts the RDMA unilateral operation response message into a NOF response message according to the RDMA-NOF conversion logic.
  • if the RDMA unilateral operation response message corresponds to a read operation, the NOF response message converted by the gateway device is a NOF read response message.
  • if the RDMA unilateral operation response message corresponds to a write operation, the NOF response message converted by the gateway device is a NOF write response message.
  • the gateway device sends a NOF response message to the client.
  • Fig. 16 is a schematic diagram of the internal logic function architecture of the gateway device.
  • Example 1 is an implementation within the gateway device.
  • the gateway device realizes the protocol message conversion function of NOF-RDMA.
  • the gateway device parses out the NVMe command carried in the NOF request message.
  • the gateway device determines the destination storage node according to the destination storage address in the NVMe command and the address translation table.
  • the destination storage node has the following two situations.
  • the destination storage node is a NOF storage node.
  • the gateway device maintains the original NOF interaction process by performing a simple NOF protocol proxy operation.
  • the destination storage node is an RDMA storage node.
  • the gateway device converts the NVMe command into the RDMA command. While the instruction is converted, the gateway device saves the state information of the NOF (the embodiment of the present application refers to the state information of the NOF as the NOF context) into the NOF context table. Then, the gateway device encapsulates the corresponding RDMA requests. The gateway device sends the RDMA request to the corresponding RDMA storage node.
  • after the RDMA storage node returns the RDMA response message, the gateway device implements the RDMA-NOF conversion, and the gateway device restores the NOF state information according to the content in the NOF context table.
  • the gateway device uses the NOF state information to encapsulate the NOF response message, and sends the NOF response message to the client.
  • the modules in the gateway device mainly include a message parsing module, an address conversion table, a NOF proxy sending module, a NOF and RDMA message conversion module, a NOF context table, and an RDMA proxy sending module.
  • there are modules with the same name in FIG. 16, such as message parsing module-1 and message parsing module-2. Modules with the same name have the same or similar processing logic. In order to make the whole process clearer, modules with the same name are placed at different positions in the process with suffix numbers, and no special distinction is made when introducing these modules below.
  • the message parsing module is used to parse the message, and extract the protocol type and the content of the message from the NOF message and the RDMA message.
  • the functions of the message analysis module specifically include the following (1) to (5).
  • the message parsing module parses the transport layer information in the message.
  • the message parsing module judges whether the message is a NOF message or an RDMA message according to the port number in the transport layer information in the message. If the message is a NOF message or an RDMA message, the message parsing module sends the message to the subsequent corresponding protocol stack (ie, the NOF protocol stack or the RDMA protocol stack), so that the subsequent protocol stack continues to analyze the message. If the message is not a NOF message or an RDMA message, the message parsing module does not perform special processing, and just forwards the message directly according to the original forwarding logic.
  • both the NOF packet and the RDMA packet include a UDP header.
  • the destination port number in the UDP header is 4791.
  • the upper layer of the UDP layer in the protocol stack is the IB layer.
  • according to the operation code (OpCode) specified in the IB layer and the operation code of the upper layer of the IB layer, it can be determined whether the message is an RDMA message or a NOF message.
  • the NOF message and the RDMA message enter the gateway device through different ingress ports, and the gateway device determines whether the message is an RDMA message or an NOF message according to the ingress port of the message and the port number in the message.
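A hedged sketch of this classification step; the UDP port value 4791 is the RoCEv2 destination port given above, while is_nvme_capsule() stands in for the IB OpCode / upper-layer opcode inspection and is an assumption, not a defined API:

```python
# First-stage packet triage in the message parsing module.

ROCEV2_UDP_PORT = 4791

def classify(udp_dst_port: int, ib_payload: bytes) -> str:
    if udp_dst_port != ROCEV2_UDP_PORT:
        return "forward"             # neither NOF nor RDMA: normal forwarding
    if is_nvme_capsule(ib_payload):  # NOF carries NVMe command capsules
        return "nof_stack"
    return "rdma_stack"

def is_nvme_capsule(ib_payload: bytes) -> bool:
    # placeholder: a real implementation inspects the IB BTH OpCode and the
    # opcode of the layer above IB, as described above
    return False
```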
  • the message parsing module parses the NOF message, and the parsed NOF message includes a request message from the client to the storage node and a response message from the storage node to the host.
  • the message parsing module parses out the fabric information and NVMe instructions in the NOF message.
  • the fabric information is RoCEv2 information; for example, the fabric information includes MAC layer information, IP layer information, UDP layer information, and IB layer information.
  • the message parsing module parses the RDMA message, and the parsed RDMA message is mainly a response message from the storage node to the client.
  • the message parsing module parses out the relevant information of the RDMA field in the RDMA message.
  • the message parsing module extracts the information carried in the fields after the protocol parsing in (2) and (3), and caches the information for subsequent modules to use.
  • after the message parsing module finishes parsing the message, it outputs the NOF message or the RDMA message to the subsequent corresponding processing module. Packets other than NOF packets and RDMA packets are not specially processed and are forwarded according to normal logic.
  • the address translation table is used to indicate the correspondence between the destination NVMe address and the information of the destination storage node; it records the actual node information corresponding to the destination storage address in the NVMe instruction of the NOF protocol.
  • the address translation table is described in detail below; see (1) to (5) for details.
  • the destination NVMe logical address in the address translation table is the index, and the destination storage node information is the value.
  • the destination NVMe logical address includes the content of the start LBA field, the content of the block number field, and the block size contained in the attribute of the connection itself.
  • the information of the destination storage node includes the network location information of the storage node (such as layer 2 and layer 3 information) and a DQP (the DQP identifies a logical connection of the RDMA storage node or the NOF storage node).
  • the layer 2 and layer 3 information is used to determine the physical channel, that is, to find a specific device (a storage node).
  • the layer 2 information is, for example, a MAC address, and the layer 3 information is, for example, an IP address.
  • the address translation table also includes the RETH information corresponding to the RDMA storage node (that is, a segment of memory address registered by the RDMA storage node).
  • the gateway device queries the address translation table according to the destination NVMe address in the NVMe command, and obtains the destination storage node information corresponding to the destination NVMe address in the address translation table.
  • the gateway device can determine whether the destination storage node is a NOF node or an RDMA node according to the information of the destination storage node, so as to enter into subsequent different processing logics.
  • the gateway device can also determine the logical connection with the destination storage node and the logical address of the storage space in the destination storage node according to the destination storage node information.
  • the logical address of the hard disk space in the NOF node does not need to be mapped, and the logical address of the memory space of the RDMA node is mapped to RETH in the address translation table.
  • each entry in the address translation table further includes a flag bit, which is used to identify whether the destination storage node is a NOF node or an RDMA node.
  • the gateway device determines whether the destination storage node is a NOF node or an RDMA node according to the value of the flag bit corresponding to the destination NVMe address.
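  • a minimal sketch of this table lookup follows (Python; the entry layout mirrors the fields listed above, the names are illustrative, and the range arithmetic follows the examples given for Table 2 below):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TranslationEntry:
    start_lba: int      # index: start of the destination NVMe logical address range
    block_size: int     # block size from the attributes of the connection itself
    block_count: int    # content of the block number field
    is_rdma: bool       # flag bit: RDMA node vs. NOF node
    node_info: dict     # layer 2/3 info, DQP, and RETH for RDMA nodes

    def covers(self, address: int) -> bool:
        # The Table 2 examples compute the range as start + block_size * block_count.
        end = self.start_lba + self.block_size * self.block_count
        return self.start_lba <= address < end

def lookup(table: List[TranslationEntry], dest_address: int) -> Optional[TranslationEntry]:
    """Query the address translation table with the destination NVMe address."""
    for entry in table:
        if entry.covers(dest_address):
            return entry
    return None         # not in the table: fall back to normal forwarding
```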
  • the address translation table supports multi-channel RDMA
  • RDMA connections between two nodes are optionally differentiated by different QPs.
  • Each RDMA channel manages its own resources.
  • the address translation table stores the destination storage address and the QP mapping information of each RDMA storage node, thereby supporting RDMA multiple access.
  • RDMA multiple access refers to supporting access to an RDMA node through multiple logical channels.
  • An RDMA node has multiple QP pairs, and each QP pair is a logical channel.
  • Different QPs of the same RDMA storage node in the address translation table correspond to different entries, so different QPs on the same RDMA storage node can be distinguished through the address translation table, thereby supporting access to RDMA nodes through different logical channels. Since multiple channels have higher performance and availability than a single channel, multi-channel RDMA is supported through the address translation table, which can improve performance and availability.
  • the address translation table supports load sharing and hot backup
  • the address translation table can map a certain destination logical address to multiple RDMA storage nodes.
  • after the gateway device finds multiple RDMA storage nodes according to the address translation table, the gateway device sends an RDMA write request to each found RDMA node, so that the data is synchronously written into multiple RDMA storage nodes.
  • the gateway device uses a multicast mechanism to send the RDMA write request, that is, it multicasts the RDMA write request to the multiple RDMA storage nodes.
  • when the gateway device finds multiple RDMA storage nodes according to the address translation table, the gateway device applies a consistent hash algorithm or another load sharing algorithm to select one RDMA storage node from the found RDMA storage nodes for the read, and the gateway device sends the RDMA read request to the selected RDMA storage node, thereby improving system performance and stability.
  • which load sharing algorithm is applied is determined based on the service and the device capabilities.
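  • a sketch of the read selection and write fan-out described above (Python; the hash-based choice is one possible load sharing algorithm, and send_rdma is a hypothetical stub for the RDMA proxy sending path):

```python
import hashlib

def send_rdma(node, request) -> None:
    """Hypothetical stub for the RDMA proxy sending path."""

def select_read_node(nodes: list, dest_address: int):
    """Pick one replica for a read; hashing the address keeps the choice stable."""
    digest = hashlib.sha256(str(dest_address).encode()).digest()
    return nodes[int.from_bytes(digest[:4], "big") % len(nodes)]

def fan_out_write(nodes: list, rdma_write_request) -> None:
    """Send the RDMA write request to every replica (or multicast it),
    so the data is written synchronously into multiple RDMA storage nodes."""
    for node in nodes:
        send_rdma(node, rdma_write_request)
```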
  • the gateway device determines whether the destination storage node is a NOF storage node or an RDMA storage node according to the information of the destination storage node obtained by querying the address translation table.
  • the gateway device obtains the network location information and logical connection information of the NOF node, and then processes it through the NOF proxy module.
  • the gateway device obtains the network location information, logical connection information, and destination memory address of the RDMA node, and then processes it through the NOF-RDMA message conversion module.
  • the address translation table is shown in Table 2 below.
  • the destination NVMe address is the index or key used when looking up the table, and the destination storage node information is the query result obtained from the table lookup, that is, the value corresponding to the key.
  • QP plus a number is used as a simplified identifier of a QP, and RETH plus a number is used to stand for the specific content of a RETH, that is, the address of a section of memory space in the server.
  • each destination NVMe address shown in Table 2 includes three attributes: Start LBA, block size and block number.
  • for the first entry, the information of the destination storage node is the information of RDMA server 1, for example including the IP address of RDMA server 1, QP1 and RETH1.
  • when the Start LBA of the destination NVMe address is 0x4000, the block size is 512 and the block number is 32, the logical address range represented by the destination NVMe address is 0x4000 to 0x7FFF, and the destination storage node information queried according to the destination NVMe address is the information of RDMA server 1, for example including the IP address of RDMA server 1, QP1 and RETH2.
  • when the Start LBA in the destination NVMe address is 0x8000, the block size is 512 and the block number is 64, the logical address range represented by the destination NVMe address is 0x8000 to 0xFFFF, and the destination storage node information queried according to the destination NVMe address is the information of RDMA server 1, for example including the IP address of RDMA server 1, QP2 and RETH3.
  • RDMA server 1 includes two queue pairs corresponding to QP1 and QP2.
  • the queue pair identified by QP1 corresponds to the memory space identified by RETH1 in RDMA server 1, and the queue pair identified by QP2 corresponds to the memory space identified by RETH2 in RDMA server 1.
  • when the Start LBA in the destination NVMe address is 0x10000, the block size is 512 and the block number is 128, the logical address range represented by the destination NVMe address is 0x10000 to 0x1FFFF.
  • the information of the destination storage node is the information of the RDMA server 2 and the information of the RDMA server 3, for example, including the MAC address of the RDMA server 2, QP10 and RETH4, and the MPLS label of the RDMA server 3, QP20 and RETH5.
  • RDMA server 2 and RDMA server 3 have a load sharing relationship.
  • when the Start LBA in the destination NVMe address is 0x20000, the block size is 512 and the block number is 128, the logical address range represented is 0x20000 to 0x2FFFF, and the destination storage node information queried according to the destination NVMe address is the information of NOF server 1; in this case a NOF storage service is provided.
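  • using the entry layout sketched earlier, the Table 2 contents described above could be expressed as follows (the address ranges are taken from the examples; the node field names are illustrative):

```python
# The Table 2 entries described above, using the TranslationEntry layout.
TABLE_2 = [
    # 0x4000-0x7FFF -> RDMA server 1 via QP1 / RETH2
    TranslationEntry(0x4000, 512, 32, True,
                     {"ip": "ip-of-rdma-server-1", "qp": "QP1", "reth": "RETH2"}),
    # 0x8000-0xFFFF -> RDMA server 1 via QP2 / RETH3
    TranslationEntry(0x8000, 512, 64, True,
                     {"ip": "ip-of-rdma-server-1", "qp": "QP2", "reth": "RETH3"}),
    # 0x10000-0x1FFFF -> RDMA servers 2 and 3 (load sharing / hot backup)
    TranslationEntry(0x10000, 512, 128, True,
                     {"replicas": [
                         {"mac": "mac-of-rdma-server-2", "qp": "QP10", "reth": "RETH4"},
                         {"mpls": "label-of-rdma-server-3", "qp": "QP20", "reth": "RETH5"},
                     ]}),
    # 0x20000-0x2FFFF -> NOF server 1 (plain NOF forwarding, no RETH mapping)
    TranslationEntry(0x20000, 512, 128, False, {"node": "NOF server 1"}),
]
```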
  • the address translation table will be illustrated below with reference to FIG. 17 .
  • FIG. 17 shows that there are three logical address segments with a length of 64K in the address translation table.
  • the first logical address segment with a length of 64K in the address translation table is address 0x0000 to address 0xFFFF.
  • the destination storage node corresponding to the first address segment is RDMA server 1.
  • the first address segment in the address translation table corresponds to the identifiers of two QPs in RDMA server 1 (QP1 and QP2 in FIG. 17), and the two logical channels QP1 and QP2 correspond to two memory address segments of RDMA server 1 respectively.
  • the second logical address segment with a length of 64K in the address translation table is address 0xFFFF to address 0x1FFFF.
  • the destination storage nodes corresponding to the second logical address segment are the two RDMA nodes RDMA server 2 and RDMA server 3 .
  • RDMA server 2 and RDMA server 3 store the same data, indicating that a logical address can implement active/standby and load sharing.
  • the third logical address segment with a length of 64K in the address translation table is address 0x1FFFF to address 0x2FFFF.
  • the destination storage node corresponding to the third logical address segment in the address translation table is the NOF server 1, indicating that it is compatible with the NOF node of the original NOF network.
  • the NOF proxy sending module is used to take over the original NOF message forwarding process, and modify or construct the NOF message according to the NOF connection status and NOF proxy logic.
  • the functions of the NOF proxy sending module specifically include the following (1) to (3).
  • the NOF protocol stack proxy is similar to the NOF protocol stack of the message parsing module.
  • the NOF protocol stack of the above message parsing module is mainly responsible for parsing messages, while the NOF protocol stack of the NOF proxy sending module is mainly responsible for NOF message proxy processing.
  • the functions of the NOF message proxy include maintaining the connection state of the NOF protocol and modifying or constructing NOF messages.
  • the NOF proxy sending module needs to modify the NOF message or construct the NOF message according to the connection status of the NOF.
  • the gateway device modifies the NOF request message, and sends the modified NOF request message to the NOF storage node.
  • the gateway device modifies the NOF response message, and sends the modified NOF response message to the client.
  • after receiving the RDMA response message from the RDMA storage node, the gateway device constructs a NOF response message and sends the NOF response message to the client.
  • the output result of the NOF proxy sending module is the NOF response message sent to the client or the NOF request message sent to the NOF storage node.
  • the NOF and RDMA message conversion module is used for mutual conversion of protocol messages between NOF and RDMA.
  • the NOF and RDMA message conversion module is divided into two sub-modules, the NOF-RDMA conversion module and the RDMA-NOF conversion module.
  • the NOF-RDMA conversion module is used to realize the protocol packet conversion from NOF to RDMA. Specifically, after the destination RDMA storage node is determined from the NOF request message based on the address conversion table, the NOF request message enters the NOF-RDMA conversion module, and the NVMe command in the NOF protocol has been parsed out at this time. The NOF-RDMA conversion module processes the NOF request message of the client to obtain the RDMA request message.
  • the NOF-RDMA conversion module obtains the RDMA state information according to parameters such as the address and QP of the destination RDMA storage node, and the RDMA state information obtained here is used subsequently. The parameters such as the address and QP of the destination RDMA storage node are acquired, for example, from the address translation table.
  • the RDMA state information is obtained, for example, from the RDMA proxy sending module.
  • the NOF-RDMA conversion submodule converts NVMe instructions into RDMA instructions: an NVMe read operation is converted into an RDMA read operation, and an NVMe write operation is converted into an RDMA write operation.
  • the NOF-RDMA conversion module pre-fills the fixed RDMA protocol fields in the RDMA request message according to the RDMA protocol standard. Subsequent modules after the NOF-RDMA conversion module complete the RDMA protocol information that the RDMA request message needs to carry, and send the RDMA request message containing the complete RDMA protocol information to the RDMA proxy sending module.
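  • a minimal sketch of this conversion step (Python; the NVMe opcodes are the standard NVM command set read/write opcodes, the BTH opcodes are the RC "RDMA READ Request" / "RDMA WRITE Only" opcodes, and the returned field names are illustrative):

```python
# Sketch of NOF -> RDMA instruction conversion.
NVME_WRITE, NVME_READ = 0x01, 0x02           # NVMe I/O command opcodes
RDMA_WRITE_ONLY, RDMA_READ_REQ = 0x0A, 0x0C  # RC BTH opcodes

def convert_nvme_to_rdma(nvme_opcode: int, entry) -> dict:
    """Map an NVMe read/write onto the matching RDMA read/write."""
    opcode = RDMA_READ_REQ if nvme_opcode == NVME_READ else RDMA_WRITE_ONLY
    return {
        "opcode": opcode,
        "dest_qp": entry.node_info["qp"],   # logical channel from the table
        "reth": entry.node_info["reth"],    # registered memory address segment
        # fixed RDMA protocol fields are pre-filled here; the remaining
        # state (e.g. PSN) is completed by the RDMA proxy sending module
    }
```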
  • the RDMA-NOF conversion module is used to realize the protocol message conversion from RDMA to NOF. Specifically, the RDMA response message responded by the RDMA storage node enters the RDMA-NOF conversion module after being processed by the message parsing module, and the information in the RDMA protocol carried in the message has been parsed out at this time. The RDMA-NOF conversion module converts the information in the RDMA protocol into the information in the NOF protocol.
  • when the RDMA-NOF conversion module receives an RDMA read response message, it parses the data and the PSN from the RDMA read response message, and converts the RDMA read response message into a NOF read response message according to the PSN and the data, or constructs a NOF read response message.
  • when the RDMA-NOF conversion module receives an RDMA write response message, it parses the PSN from the RDMA write response message, and converts the RDMA write response message into a NOF write response message or constructs a NOF write response message according to the PSN.
  • the RDMA-NOF conversion module pre-fills the fixed fields of the NOF protocol in the NOF response message according to the NOF protocol standard.
  • the subsequent modules after the RDMA-NOF conversion module complete the NOF protocol information that the NOF response message needs to carry, and send the NOF response message containing the complete NOF protocol information to the NOF proxy sending module.
  • the processing logic of the NOF-RDMA conversion module is different from that of the RDMA-NOF conversion module.
  • at this point the NOF state information has not yet been obtained, and the module after the RDMA-NOF conversion module, that is, the NOF context table, is needed to obtain the NOF state information, so the RDMA-NOF conversion module can only pre-fill the information of the NVMe part of the NOF protocol.
  • the output result of the NOF and RDMA message conversion module is a message filled with some fixed fields in the target protocol and fields with currently known information.
  • for the NOF-RDMA conversion module, the target protocol is the RDMA protocol; for the RDMA-NOF conversion module, the target protocol is the NOF protocol.
  • the index in the NOF context table is the RDMA PSN value.
  • the RDMA PSN in the NOF context table is generated by the gateway device during the process of generating the RDMA message.
  • the RDMA PSN comes, for example, from the RDMA proxy sending module.
  • the RDMA PSN in the NOF context table is obtained by the gateway device from the RDMA PSN field of the RDMA message.
  • the content of the state information in the NOF context table contains all missing NOF state information needed to respond to the client.
  • the state information is optionally parsed directly from the message during NOF-RDMA conversion, or calculated by the gateway device.
  • NOF status information includes PSN, DQP, and RETH at the RoCE layer, and SQHD and command ID at the NVMe layer.
  • the NOF status information that needs to be obtained includes but is not limited to the above situations, and the specific parameters may vary according to actual usage scenarios. Among them, PSN, SQHD and Command ID are calculated by the gateway device, and the specific calculation method is to make additive correction based on the current value.
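  • as a small illustration of the additive correction (Python; assuming the RoCE PSN, which is a 24-bit field, advances by the number of packets in the response):

```python
def next_nof_psn(current_psn: int, packets_in_response: int) -> int:
    """Additive correction: advance the NOF-side PSN from its current value
    (the PSN field in the BTH is 24 bits wide, so it wraps modulo 2**24)."""
    return (current_psn + packets_in_response) % (1 << 24)
```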
  • the NOF context table is responsible for maintaining the correspondence between the state in the NOF connection and the state in the RDMA connection.
  • because the gateway device converts the NOF message into an RDMA message and interacts with the RDMA storage node based on the RDMA message, there is no NOF state information in the interaction on the RDMA side.
  • the process of converting the NOF message into an RDMA message is similar to a CPU process switch: the new process being switched to (analogous to the RDMA interaction between the gateway and the RDMA storage node in this embodiment) does not have the information of the current process (analogous to the NOF interaction between the gateway and the client in this embodiment), so the CPU saves the current process information (analogous to the NOF state information in this embodiment) into a context table.
  • by designing the NOF context table, when NOF switches to RDMA the gateway device saves the current NOF state information into the NOF context table, and after completing the RDMA interaction the gateway device restores the NOF state information through the NOF context table.
  • the gateway device saves the NOF state information into the NOF context table, and the RDMA proxy sending module continues the subsequent processing.
  • the gateway device searches the NOF context table to obtain the NOF state information, and outputs the NOF state information to the NOF proxy sending module, so as to provide the parameters required to complete the process of sending the NOF message.
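  • a minimal sketch of this save/restore path (Python; the RDMA PSN is the index and the NOF state is the value, as described above; the field names on the request object are illustrative):

```python
# NOF context table sketch: RDMA PSN -> NOF state needed to answer the client.
nof_context_table: dict = {}

def save_nof_state(rdma_psn: int, nof_request) -> None:
    """On NOF -> RDMA conversion, park the NOF state under the RDMA PSN."""
    nof_context_table[rdma_psn] = {
        "psn": nof_request.psn,              # RoCE-layer state
        "dqp": nof_request.dqp,
        "reth": nof_request.reth,
        "sqhd": nof_request.sqhd,            # NVMe-layer state
        "command_id": nof_request.command_id,
    }

def restore_nof_state(rdma_psn: int) -> dict:
    """On the RDMA response, recover the state for the NOF response message."""
    return nof_context_table.pop(rdma_psn)
```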
  • Fig. 18 shows the establishment process and search process of the NOF context table.
  • the RDMA state information is described by taking the RDMA PSN as an example.
  • the current RDMA PSN is obtained from the RDMA proxy sending module and used as the index in the NOF context table, and the NOF state is obtained from the NOF request message and used as the value corresponding to that index in the NOF context table, thereby establishing the NOF context table.
  • the RDMA proxy sending module is similar to the NOF proxy sending module.
  • the main difference between the RDMA proxy sending module and the NOF proxy sending module is that the RDMA proxy sending module acts as a proxy for the RDMA protocol, and the RDMA proxy sending module is used in the sending path only when interacting with the RDMA storage node.
  • the functions of the RDMA proxy sending module specifically include the following (1) to (3).
  • the gateway device implements the RDMA protocol stack.
  • the gateway device establishes a connection with the RDMA storage node as a client.
  • the RDMA proxy sending module mainly uses the client part of the RDMA protocol stack to send packets.
  • the RDMA proxy sending module constructs an RDMA request message.
  • the output result of the RDMA proxy sending module is the RDMA request message sent to the RDMA storage node.
  • FIG. 19 and FIG. 20 are complete flowcharts of the method executed by the gateway device in Example 1.
  • Fig. 19 shows a complete flow chart of the method executed by the gateway device in the direction of client->storage node.
  • Fig. 20 shows a complete flow chart of the method executed by the gateway device in the storage node->client direction.
  • the method flow performed by the gateway device in the client->storage node direction includes the following S71 to S710 (a consolidated sketch in code follows the list).
  • S71: the gateway device receives the message.
  • S72: the gateway device parses the received message.
  • S73: the gateway device judges whether the received packet is a NOF packet. If the received message is a NOF message, the gateway device executes S74. If the received packet is not a NOF packet, the gateway device executes S710.
  • S74: the gateway device searches the address translation table for the information of the destination storage node.
  • S75: the gateway device judges whether the destination storage node is an RDMA storage node. If the destination storage node is an RDMA storage node, the gateway device executes S76. If the destination storage node is not an RDMA storage node, the gateway device executes S79.
  • S76: the gateway device performs NOF-RDMA instruction conversion.
  • S77: the gateway device saves the NOF state in the NOF context table.
  • S78: the gateway device implements the RDMA proxy function and sends the RDMA message.
  • S79: the gateway device implements the NOF proxy function and sends the NOF message.
  • S710: the gateway device forwards the message according to the original message forwarding process.
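  • the S71 to S710 flow can be condensed into the following sketch (Python; parse, forward_unchanged, the two proxy send functions and rdma_req_psn are hypothetical helpers, while lookup, convert_nvme_to_rdma and save_nof_state reuse the earlier sketches):

```python
# Consolidated sketch of S71-S710 (client -> storage node direction).
def handle_client_to_storage(pkt, address_table):
    info = parse(pkt)                                         # S72
    if not info.is_nof:                                       # S73
        return forward_unchanged(pkt)                         # S710
    entry = lookup(address_table, info.dest_address)          # S74
    if entry is None or not entry.is_rdma:                    # S75
        return nof_proxy_send(pkt, entry)                     # S79
    rdma_req = convert_nvme_to_rdma(info.nvme_opcode, entry)  # S76
    save_nof_state(rdma_req_psn(rdma_req), info)              # S77
    return rdma_proxy_send(rdma_req, entry)                   # S78
```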
  • the method flow performed by the gateway device in the storage node->client direction includes the following S81 to S88.
  • S81: the gateway device receives the message.
  • S82: the gateway device parses the received message.
  • S83: the gateway device judges whether the received packet is a NOF packet or an RDMA packet. If the received message is a NOF message or an RDMA message, the gateway device executes S84. If the received message is neither a NOF message nor an RDMA message, the gateway device executes S88.
  • S84: the gateway device determines whether the received packet is an RDMA packet. If the received packet is an RDMA packet, the gateway device executes S85. If the received message is not an RDMA message (that is, the received message is a NOF message), the gateway device executes S87.
  • S85: the gateway device converts the information in the RDMA message into information in the NOF protocol.
  • S86: the gateway device finds the NOF state information in the NOF context table according to the RDMA state information in the RDMA message.
  • S87: the gateway device sends the NOF message.
  • S88: the gateway device forwards the message according to the original message forwarding process.
  • the above example 1 provides a new gateway device, and the gateway device is located at the gateway position of the storage node.
  • the gateway device supports the NOF protocol stack and the RDMA protocol stack.
  • the gateway device has the capability of NOF-RDMA protocol stack conversion.
  • the gateway device can steer traffic to the destination node according to the destination logical storage address.
  • the effects achieved in Example 1 include but are not limited to the following (1) to (3).
  • the RDMA storage medium is memory, and the performance of memory is better than that of existing NVMe hard disks.
  • the gateway device provided in Example 1 enables the NOF storage network to support RDMA, thereby giving full play to the advantages of memory storage and improving performance.
  • the gateway device provided in Example 1 can offload part of the business processing tasks of the server (that is, the gateway device performs part of the business processing tasks instead of the server), thereby reducing the CPU pressure on the server and improving overall performance.
  • summarizing the solution of Example 1, it can be seen that Example 1 changes the existing NOF storage network structure, in which the original storage backend could only be expanded with NOF storage nodes.
  • the gateway device of this embodiment can support the NOF storage network to expand the RDMA storage node.
  • Example 1 changes the status quo that all storage media in the existing NOF storage network are hard disks, and can support the conversion of NVMe hard disk operation semantics to RDMA memory operation semantics, realizing the collaboration of hard disk storage services and memory storage services.
  • the gateway device can complete the steering of the destination storage logical address, reducing the CPU pressure on the existing storage nodes.
  • Example 1 can be provided as a non-invasive expansion support solution.
  • non-invasive means that Example 1 does not change the existing business deployment, so as to avoid affecting the system on which the business is already running.
  • Example 1 can be used as an enhanced mode to optimize service performance.
  • Example 2 is an alternative to the NOF context table in Example 1.
  • Example 2 uses a piggyback or piggyback-like mode to transmit RDMA packets.
  • piggyback means that the local end carries specified information in a packet, and after the packet carrying the specified information is sent to the peer end, the peer end returns the specified information to the local end.
  • the specified information in Example 2 is the NOF status information or the NOF packet header content.
  • when the destination storage node is an RDMA storage node, the gateway device does not save the NOF state information in a NOF context table; instead, the gateway device pre-fills the existing NOF state information into a response header, and the response header containing the NOF state information is then encapsulated into the RDMA request message.
  • the NOF status information is used as an additional header information in the RDMA request message.
  • RDMA storage nodes need to be aware of this protocol change.
  • the RDMA storage node does not process this additional header information; or, the RDMA storage node processes the additional header information as required, for example, the RDMA storage node calculates the ICRC.
  • the RDMA storage node carries additional header information in the RDMA response message.
  • the RDMA storage node sends an RDMA response message containing the additional header information, so that the additional header information is returned to the gateway device.
  • the gateway device restores the state information that needs to be carried in the NOF response message according to this additional field.
  • the gateway device constructs a NOF response message, and sends the NOF response message to the client.
  • FIG. 21 shows a logical functional architecture diagram of Example 2.
  • the gateway device in Example 2 also includes a message parsing module, an address translation table, a NOF-RDMA conversion module, an RDMA proxy sending module and a NOF proxy sending module.
  • the message parsing module, the NOF-RDMA conversion module and the address translation table in Example 2 are similar to those in Example 1 and will not be described in detail.
  • the processing of additional header information is newly added in the RDMA proxy sending module and the NOF proxy sending module in Example 2.
  • the following introduces the new business logic of the two modules, the RDMA proxy sending module and the NOF proxy sending module, in Example 2.
  • the RDMA proxy sending module in Example 2 retains the original functions of the RDMA proxy sending module in Example 1.
  • the RDMA proxy sending module adds a step of adding additional header information to the RDMA message when constructing the RDMA message.
  • Example 2 includes two specific implementation modes. The following uses the RoCEv2 protocol as the fabric layer of the NOF as an example to describe the two implementation modes respectively.
  • Implementation mode (1): the gateway device carries the NOF state information in the RDMA message.
  • the additional header information carried in the additional field is similar to the value in the NOF context table in Example 1 (that is, the NOF state information).
  • the additional header information in the RDMA packet is equivalent to the value in an entry in the NOF context table in Example 1. It can also be understood in this way that the NOF state information does not need to be saved locally on the gateway device as the value of the entry in the NOF context table, but flows with the message.
  • RDMA storage nodes do not do anything with this additional header information.
  • the RDMA storage node only receives the RDMA message carrying this additional field, extracts the additional header information from the additional field, and encapsulates the additional header information into the RDMA response message after the routine business logic processing of RDMA is completed.
  • after receiving the RDMA response message, the gateway device reads the additional header information according to the standard, and uses the additional header information to construct the NOF response message.
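  • a minimal sketch of implementation mode (1) (Python; the messages are modeled as dictionaries and the extra_header field name is illustrative):

```python
def build_rdma_request_with_piggyback(rdma_req: dict, nof_state: dict) -> dict:
    """Gateway side: the NOF state rides in the additional header field,
    playing the role of one NOF context table value that flows with the packet."""
    rdma_req["extra_header"] = nof_state
    return rdma_req

def storage_node_respond(rdma_req: dict, rdma_resp: dict) -> dict:
    """Storage node side: do nothing with the field except echo it back."""
    rdma_resp["extra_header"] = rdma_req["extra_header"]
    return rdma_resp

def recover_nof_state(rdma_resp: dict) -> dict:
    """Gateway side: read the echoed state; no context table lookup is needed."""
    return rdma_resp["extra_header"]
```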
  • Implementation mode (2): the gateway device pre-generates the NOF packet header, and uses the NOF packet header as the additional header information.
  • the gateway device pre-constructs the NOF packet header.
  • the gateway device fills all existing NOF information to be returned to the client into the NOF header.
  • the gateway device uses the NOF header as additional header information, and sends the RDMA request message including the NOF header to the RDMA storage node.
  • after the RDMA storage node receives the RDMA request message, the RDMA storage node continues to process and modify the NOF packet header in the RDMA request message; for example, the RDMA storage node completes the missing content of the NOF packet header, calculates the ICRC of the packet, and so on.
  • the RDMA storage node continues to use the processed NOF packet header as the additional header information and encapsulates the NOF packet header into the RDMA response message, so that the NOF packet header serves as the inner header in the RDMA response message.
  • after the gateway device receives the RDMA response message, the gateway device strips the outer packet header in the RDMA response message, and uses the part of the RDMA response message starting from the inner header (the NOF packet header) as the NOF response message.
  • the NOF proxy sending module in Example 2 and the RDMA proxy sending module in Example 2 are used together.
  • in Example 1, the NOF state information is obtained from the NOF context table.
  • in Example 2, the NOF proxy sending module obtains the NOF state information from the additional header information in the message; the subsequent processing is similar to Example 1.
  • in Example 2, the NOF proxy sending module strips the outer header in the RDMA response message and forwards the message after the outer header has been stripped.
  • the NOF proxy sending module in Example 2 modifies the layer 2 part or the layer 3 part of the message according to the network conditions.
  • in Example 2, since the additional header information is carried in the packet, the additional header information occupies extra space in the packet.
  • the extra space required in the message is the same as the length of each entry in the NOF context table of Example 1; for example, in the RoCEv2 scenario, the extra space occupied in the message is about 20 B to 30 B.
  • FIG. 22 and FIG. 23 are complete flowcharts of the method executed by the gateway device in Example 2.
  • Fig. 22 shows a complete flow chart of the method executed by the gateway device in the direction of client->storage node.
  • S77 in the process shown in FIG. 19 in Example 1 is replaced by S77', in which the gateway device constructs the additional fields in the message. Refer to FIG. 19 for the other steps in the process shown in FIG. 22.
  • Fig. 23 shows a complete flow chart of the method executed by the gateway device in the storage node->client direction.
  • S86 in the process shown in FIG. 20 in Example 1 is replaced by S86', in which the gateway device processes the additional fields of the message. Refer to FIG. 20 for the other steps in the process shown in FIG. 23.
  • the technical effect of Example 2 is the same as that of Example 1. Comparing Example 1 and Example 2, the gateway device provided by Example 2 does not need to deploy the NOF context table, thus reducing the consumption of internal storage space on the gateway device. Moreover, Example 2 removes the table lookup and table writing steps. However, Example 2 needs to modify the RDMA protocol so that the RDMA protocol supports identifying and processing the additional fields.
  • Example 3 is a supplement to Example 1 and Example 2.
  • Example 3 mainly complements the control plane process.
  • FIG. 24 is a logical functional architecture diagram of Example 3.
  • the gateway device in Example 3 includes a message parsing module, an address translation table and an address arrangement module.
  • the message parsing module and address translation table in Example 3 are similar to those in Example 1, and will not be described in detail.
  • Example 3 mainly explains how to send the address of the storage node to the gateway device.
  • Example 3 involves the bilateral operation of RDMA and the information exchange message of the NOF control channel.
  • the address arrangement module is used to process the message that registers the storage address space in the RDMA bilateral operation and the information exchange message of the NOF control channel.
  • the address arrangement module performs unified arrangement and management of the memory addresses reported by the RDMA storage nodes through bilateral operation messages and the NVMe storage address segments in the information exchange messages of the NOF control channel, and then generates a unified virtual address, which is later written into the address translation table.
  • the functions of the address arrangement module specifically include the following (1) to (3).
  • the RDMA node registers the address of the memory space of the RDMA storage node through the send operation or the receive operation of the bilateral operation and reports the address of the memory space to the user; subsequently the user can directly operate this memory space through the address reported by the RDMA node.
  • the NOF storage node notifies the user, through the control channel, of the available hard disk address segment of the storage node; subsequently the user can directly operate this hard disk address through the address reported by the NOF storage node.
  • the address arrangement module is used to parse out the memory address reported by the RDMA node from the message sent by the RDMA node, and parse out the hard disk address reported by the NVMe node from the message sent by the NOF node.
  • the address arrangement module uniformly arranges the addresses reported by each storage node into a global virtual address.
  • the addresses arranged by the address arrangement module constitute the content of the address translation table.
  • the address arranged by the address arrangement module is specifically the index used to search the information of the destination storage node in the address translation table, that is, the NVMe logical address.
  • the information of the destination storage node stored in the address translation table is the address reported by each storage node.
  • the address arrangement module outputs the arranged address table entries to the address translation table.
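  • a minimal sketch of the arrangement step (Python; it reuses the TranslationEntry layout from the earlier sketch and assumes each storage node reports a (is_rdma, block_size, block_count, node_info) tuple):

```python
def arrange(reports: list) -> list:
    """Stitch the addresses reported by each storage node into one global
    virtual NVMe address space, yielding the address translation table entries."""
    table, cursor = [], 0
    for is_rdma, block_size, block_count, node_info in reports:
        table.append(TranslationEntry(cursor, block_size, block_count,
                                      is_rdma, node_info))
        cursor += block_size * block_count   # next segment of the virtual space
    return table
```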
  • This embodiment takes the NOF storage node reporting the hard disk address to the gateway device through the control channel in the NOF protocol as an example for illustration.
  • a message dedicated to reporting the address is provided.
  • the NOF storage node sends a dedicated message to report the hard disk address.
  • the address translation table in Example 3 is the same as that in Example 1.
  • Example 1 describes the table lookup process of the address translation table, while Example 3 describes the table writing process of the address translation table.
  • FIG. 25 shows a complete flowchart of the method executed by the gateway device in Example 3. As shown in FIG. 25 , the process of the method executed by the gateway device includes the following S91 to S98.
  • S91: the gateway device receives the message.
  • S92: the gateway device parses the received message.
  • S93: the gateway device determines whether the received message is an RDMA bilateral operation message. If the received message is an RDMA bilateral operation message, the gateway device executes S94; if not, the gateway device executes S95.
  • S94: the gateway device parses the address information of the RDMA registration.
  • S95: the gateway device judges whether the received message is an address reporting message from the NOF control channel. If the received message is an address reporting message from the NOF control channel, the gateway device executes S96. If the received message is not an address reporting message from the NOF control channel, the gateway device executes S98.
  • S96: the gateway device performs address arrangement according to the address carried in the message, or parses out the address carried in the message.
  • S97: the gateway device configures the address translation table.
  • S98: the gateway device executes the process of Example 1 or Example 2.
  • the technical effects of Example 3 are described below.
  • Example 3 supplements the details of Example 1 and Example 2, and completes the control plane process.
  • the gateway device provided in this embodiment parses out the memory addresses reported by the RDMA storage nodes when registering memory and the hard disk addresses reported by the NOF storage nodes through the control channel, uniformly arranges the addresses reported by each storage node, and finally generates the address translation table entries.
  • alternatively, the gateway device or each storage node reports the memory addresses of the RDMA storage nodes and the hard disk addresses of the NOF storage nodes to a server that provides unified address arrangement management and control software.
  • the server performs the address arrangement and delivers the content of the address translation table to the gateway device.
  • the embodiment of the present application implements a gateway device, which is optionally deployed in a traditional NOF storage network, and implements the following (1) to (4).
  • the gateway device provided by the embodiment of the present application can process the RDMA protocol stack, and realize the connection and interaction between the gateway device and the RDMA storage node.
  • the gateway device provided by the embodiment of the present application can process the NOF protocol stack, parse out the information of the NOF protocol stack, and maintain the state information of the NOF protocol stack.
  • the gateway device can reply to the client with the NOF message in place of the NOF server, realizing the function of acting as the NOF server.
  • the protocol logic mutual conversion mechanism of NOF-RDMA is embodied in the conversion of NOF request message into RDMA request message, and the conversion of RDMA response message into NOF response message.
  • the NOF-RDMA address translation table is deployed on the gateway device.
  • the address translation table implements the mapping from the NVMe destination logical address to the RDMA destination logical address in the NOF.
  • the storage solution provided by the embodiment of the present application can optionally be combined with a memory-type hard disk to play a greater role.
  • the functions required to be implemented by the RDMA storage node are basically network protocol analysis, bus data relocation, and memory medium operation, which do not require strong CPU capabilities.
  • the smart network card-PCIE bus-memory pass-through device is being studied.
  • the smart network card-PCIE bus-memory pass-through device is lighter than the server.
  • This embodiment optionally uses this device as a storage node to implement a mass storage solution for the NOF storage network.
  • the first RDMA storage node in the embodiment of FIG. 9 is a smart network card-PCIE bus-memory pass-through device.
  • S408 is executed by the iNIC in the first RDMA storage node.
  • the smart network card in the first RDMA storage node performs data transmission with the memory through the PCIE bus, thereby performing the read/write operations. In this way, the processing work of the CPU is offloaded to the smart network card, reducing the computation burden of the CPU and improving the operating efficiency of the embodiment of FIG. 9.
  • NOF uses a network other than RoCE as the fabric carrying NVMe.
  • This embodiment is optionally applied to a scenario where NVMe is carried on other fabrics, such as a scenario where NVMe over TCP is applied.
  • NVMe is directly carried on TCP instead of UDP and IB.
  • the first NOF request message in S401 and the first NOF response message in S411 in the embodiment of FIG. 9 are TCP messages, and the NOF state information includes TCP sequence numbers.
  • the gateway device supports interaction with the client based on TCP to meet more business scenarios.
  • Figure 26 is a schematic structural diagram of a message processing device 700 provided in an embodiment of the present application, and the device 700 shown in Figure 26 is set on a gateway device.
  • the device 700 includes a receiving unit 701 , a processing unit 702 and a sending unit 703 .
  • the device 700 shown in FIG. 26 is set in the gateway device 33 in FIG. 8 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 9 .
  • the receiving unit 701 is configured to support the gateway device in FIG. 9 to execute S402 and S409.
  • the processing unit 702 is configured to support the gateway device in FIG. 9 to execute S403 and S410.
  • the sending unit 703 is configured to support the gateway device in FIG. 9 to execute S404 and S411.
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 11.
  • the receiving unit 701 is configured to support the gateway device in FIG. 11 to execute S502 and S509.
  • the processing unit 702 is configured to support the gateway device in FIG. 11 to execute S503 and S510.
  • the sending unit 703 is configured to support the gateway device in FIG. 11 to execute S504 and S511.
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 12 .
  • the receiving unit 701 and the sending unit 703 are realized through the network port in the gateway device in FIG. 12 .
  • the receiving unit 701 is configured to support the gateway device in FIG. 12 to receive the NOF request message from the client in FIG. 12 .
  • the sending unit 703 is configured to support the gateway device in FIG. 12 to send the RDMA request message to the RDMA storage node A in FIG. 12 or send the NOF request message to the NOF storage node in FIG. 12 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 13 .
  • the apparatus 700 shown in FIG. 26 also includes a storage unit, and the storage unit is implemented by Cache (caching) in the gateway device in FIG. 13 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 14 .
  • the processing unit 702 includes the RDMA adapter and the NOF monitoring module in FIG. 14
  • the receiving unit 701 and the sending unit 703 include various ports in FIG. 14 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 15 .
  • the processing unit 702 is configured to support the gateway device in FIG. 15 to perform S612, looking up the address translation table, NOF simple proxy processing, NOF-RDMA message conversion and RDMA-NOF message conversion; the receiving unit 701 is configured to support the gateway device in FIG. 15 to receive the address translation table issued in S614, the NOF read/write request in S621, the NOF read/write response in S623 and the RDMA read/write response in S632; and the sending unit 703 is configured to support the gateway device in FIG. 15 to execute S622, S631 and S633.
  • the apparatus 700 shown in FIG. 26 is set in the gateway device in FIG. 16 .
  • the processing unit 702 includes a NOF-RDMA conversion module, an RDMA-NOF conversion module, and a packet parsing module in FIG. 16
  • the sending unit 703 includes a NOF proxy sending module and an RDMA proxy sending module in FIG. 16 .
  • the apparatus 700 shown in FIG. 26 also includes a storage unit for saving the NOF context table in FIG. 16 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 17 .
  • the device 700 shown in FIG. 26 also includes a storage unit, which is used to save the address translation table in FIG. 17 .
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 18 .
  • the sending unit 703 includes a NOF proxy sending module and an RDMA proxy sending module in FIG. 18 .
  • the processing unit 702 is configured to execute the steps of storing the NOF state and finding the NOF state in FIG. 18 .
  • the device 700 shown in FIG. 26 also includes a storage unit, which is used to store the address translation table in FIG. 18 .
  • the apparatus 700 is configured to support the gateway device to execute the method flow shown in FIG. 19 .
  • the receiving unit 701 is used to support the gateway device to execute S71 in FIG. 19;
  • the processing unit 702 is used to support the gateway device to execute S72, S73, S74, S75, S76 and S77 in FIG. 19;
  • the sending unit 703 is used to support the gateway device to execute S78, S79 and S710 in FIG. 19.
  • the apparatus 700 is configured to support the gateway device to execute the method flow shown in FIG. 20 .
  • the receiving unit 701 is used to support the gateway device to execute S81 in Figure 20;
  • the processing unit 702 is used to support the gateway device to execute S82, S83, S84, S85 and S86 in Figure 20;
  • the sending unit 703 is used to support the gateway device to execute S87 and S88 in FIG. 20.
  • the device 700 shown in FIG. 26 is set in the gateway device in FIG. 21 .
  • the processing unit 702 includes a message parsing module, a NOF-RDMA conversion module and an RDMA-NOF conversion module in FIG. 21 .
  • the sending unit 703 includes an RDMA proxy sending module and a NOF proxy sending module in FIG. 21 .
  • the device 700 shown in FIG. 26 also includes a storage unit, which is used to store the address translation table in FIG. 21 .
  • the apparatus 700 is configured to support the gateway device to execute the method flow shown in FIG. 22 .
  • the receiving unit 701 is used to support the gateway device to execute S71 in Figure 22;
  • the processing unit 702 is used to support the gateway device to execute S72, S73, S74, S75, S76 and S77' in Figure 22;
  • the sending unit 703 is used to support the gateway device to execute S78, S79 and S710 in FIG. 22.
  • the apparatus 700 is configured to support the gateway device to execute the method flow shown in FIG. 23 .
  • the receiving unit 701 is used to support the gateway device to execute S81 in FIG. 23;
  • the processing unit 702 is used to support the gateway device to execute S82, S83, S84, S85 and S86' in FIG. 23;
  • the sending unit 703 is used to support the gateway device to execute S87 and S88 in FIG. 23.
  • the apparatus 700 is configured to support the gateway device to execute the method flow shown in FIG. 25 .
  • the receiving unit 701 is used to support the gateway device to execute S91 in FIG. 25; the processing unit 702 is used to support the gateway device to execute S92, S93, S94, S95, S96 and S97 in FIG. 25.
  • the device embodiment described in Figure 26 is only schematic.
  • the division of the above units is only a logical function division.
  • in actual implementation there may be other division methods; for example, multiple units or components can be combined or integrated into another system, or some features may be ignored or not implemented.
  • Each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may physically exist separately, or two or more units may be integrated into one unit.
  • Each unit in the device 700 is fully or partially implemented by software, hardware, firmware or any combination thereof.
  • the above processing unit 702 is implemented by a software functional unit generated by at least one processor 801 in FIG. 27 after reading the program code 810 stored in the memory 802 .
  • the above-mentioned processing unit 702 is realized by a software function unit generated by the network processor 932 or the central processing unit 911 or the central processing unit 931 in FIG. 28 after reading the program code stored in the memory 912 or the memory 934 .
  • alternatively, the processing unit 702 is implemented by a part of the processing resources (for example, one core or two cores of a multi-core processor) of at least one processor 801 in FIG. 27, or of the network processor 932, the central processing unit 911 or the central processing unit 931 in FIG. 28, or by a programmable device such as a field-programmable gate array (FPGA) or a coprocessor.
  • the receiving unit 701 and the sending unit 703 are realized by the network interface 803 in FIG. 27 or the interface board 930 in FIG. 28 .
  • FIG. 27 is a schematic structural diagram of a gateway device 800 provided in an embodiment of the present application.
  • the gateway device 800 includes at least one processor 801 , a memory 802 and at least one network interface 803 .
  • the gateway device 800 shown in FIG. 27 is the gateway device 33 in FIG. 8 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 9 .
  • the network interface 803 is used to support the gateway device in FIG. 9 to execute S402, S404, S409 and S411.
  • the processor 801 is configured to support the gateway device in FIG. 9 to execute S403 and S410.
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 11 .
  • the network interface 803 is used to support the gateway device in FIG. 11 to execute S502, S504, S509 and S511.
  • the processor 801 is configured to support the gateway device in FIG. 11 to execute S503 and S510.
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 12 .
  • the network interface 803 is a network port in the gateway device in FIG. 12 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 13 .
  • the memory 802 includes the Cache (cache) in the gateway device in FIG. 13 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 14 .
  • the processor 801 includes the RDMA adapter and the NOF monitoring module in FIG. 14
  • the network interface 803 includes various ports in FIG. 14 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 15 .
  • the processor 801 is configured to support the gateway device in FIG. 15 to execute S612, looking up the address translation table, NOF simple proxy processing, NOF-RDMA message conversion and RDMA-NOF message conversion; the network interface 803 is configured to support the gateway device in FIG. 15 to receive the address translation table issued in S614, the NOF read/write request in S621, the NOF read/write response in S623 and the RDMA read/write response in S632, and to execute S622, S631 and S633.
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 16 .
  • the processor 801 includes a NOF-RDMA conversion module, an RDMA-NOF conversion module, and a packet parsing module in FIG. 16
  • the network interface 803 includes a NOF proxy sending module and an RDMA proxy sending module in FIG. 16
  • the memory 802 is used to save the NOF context table in FIG. 16 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 17 .
  • the memory 802 is used to save the address translation table in FIG. 17 .
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 18 .
  • the network interface 803 includes the NOF proxy sending module and the RDMA proxy sending module in FIG. 18.
  • the processor 801 is configured to execute the steps of storing the NOF state and finding the NOF state in FIG. 18 .
  • the memory 802 is used to save the address translation table in FIG. 18 .
  • the gateway device 800 is configured to execute the method flow shown in FIG. 19 .
  • the network interface 803 is used to execute S71, S78, S79 and S710 in FIG. 19; the processor 801 is used to execute S72, S73, S74, S75, S76 and S77 in FIG. 19.
  • the gateway device 800 is configured to execute the method flow shown in FIG. 20 .
  • the network interface 803 is used to execute S81, S87 and S88 in FIG. 20; the processor 801 is used to execute S82, S83, S84, S85 and S86 in FIG. 20.
  • the gateway device 800 shown in FIG. 27 is the gateway device in FIG. 21 .
  • the processor 801 includes a message parsing module, a NOF-RDMA conversion module and an RDMA-NOF conversion module in FIG. 21 .
  • the network interface 803 includes the RDMA proxy sending module and the NOF proxy sending module in FIG. 21.
  • the memory 802 is used for saving the address translation table in FIG. 21.
  • the gateway device 800 is configured to execute the method flow shown in FIG. 22 .
  • the network interface 803 is used to execute S71, S78, S79 and S710 in FIG. 22; the processor 801 is used to execute S72, S73, S74, S75, S76 and S77' in FIG. 22.
  • the gateway device 800 is configured to execute the method flow shown in FIG. 23 .
  • the network interface 803 is used to execute S81, S87 and S88 in FIG. 23; the processor 801 is used to execute S82, S83, S84, S85 and S86' in FIG. 23.
  • the gateway device 800 is configured to execute the method flow shown in FIG. 25.
  • the network interface 803 is used to execute S91 in FIG. 25; the processor 801 is used to execute S92, S93, S94, S95, S96 and S97 in FIG. 25.
  • the processor 801 is, for example, a general-purpose central processing unit (central processing unit, CPU), a network processor (network processor, NP), a graphics processing unit (graphics processing unit, GPU), a neural network processor (neural-network processing unit, NPU), a data processing unit (data processing unit, DPU), a microprocessor, or one or more integrated circuits for implementing the solution of this application.
  • the processor 801 includes an application-specific integrated circuit (application-specific integrated circuit, ASIC), a programmable logic device (programmable logic device, PLD) or a combination thereof.
  • the PLD is, for example, a complex programmable logic device (complex programmable logic device, CPLD), a field-programmable gate array (field-programmable gate array, FPGA), a general array logic (generic array logic, GAL) or any combination thereof.
  • the processor 801 includes one or more CPUs, such as CPU0 and CPU1 shown in FIG. 27 .
  • the memory 802 is, for example, a read-only memory (read-only memory, ROM) or another type of static storage device that can store static information and instructions, a random access memory (random access memory, RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
  • the memory 802 exists independently and is connected to the processor 801 through an internal connection 804 . Or, optionally, the memory 802 and the processor 801 are integrated together.
  • the network interface 803 uses any transceiver-like device to communicate with other devices or a communication network.
  • the network interface 803 includes, for example, at least one of a wired network interface or a wireless network interface.
  • the wired network interface is, for example, an Ethernet interface.
  • the Ethernet interface is, for example, an optical interface, an electrical interface or a combination thereof.
  • the wireless network interface is, for example, a wireless local area network (wireless local area networks, WLAN) interface, a cellular network interface or a combination thereof.
  • the processor 801 and the network interface 803 cooperate with each other to complete the processes of sending packets and receiving packets involved in the foregoing embodiments.
  • the foregoing process of sending the first RDMA request packet includes: the processor 801 instructs the network interface 803 to send the first RDMA request packet.
  • the processor 801 generates and sends an instruction to the network interface 803, and the network interface 803 sends the first RDMA request packet based on the instruction of the processor 801.
  • the above-mentioned process of receiving the first NOF request packet includes: the network interface 803 receives the first NOF request packet, performs partial processing (such as decapsulation) on it, and sends it to the processor 801, so that the processor 801 obtains the information required in the above embodiment that is carried in the first NOF request packet (for example, the first destination address).
  • the gateway device 800 optionally includes multiple processors, such as the processor 801 and the processor 805 shown in FIG. 27 .
  • processors are, for example, a single-core processor (single-CPU), or a multi-core processor (multi-CPU).
  • a processor herein alternatively refers to one or more devices, circuits, and/or processing cores for processing data such as computer program instructions. In a possible implementation, multiple cores or multiple processors respectively execute part of the steps in the foregoing method embodiments.
  • the gateway device 800 also includes an internal connection 804 .
  • the processor 801 , memory 802 and at least one network interface 803 are connected by an internal connection 804 .
  • the internal connection 804 includes pathways that carry information between the aforementioned components.
  • internal connection 804 is a single board or a bus.
  • the internal connection 804 is divided into address bus, data bus, control bus and so on.
  • the gateway device 800 further includes an input and output interface 806 .
  • the processor 801 implements the methods in the foregoing embodiments by reading the program code 810 stored in the memory 802, or, the processor 801 implements the methods in the foregoing embodiments through internally stored program codes.
  • the processor 801 implements the method in the foregoing embodiment by reading the program code 810 stored in the memory 802
  • the memory 802 stores the program code for implementing the method provided in the embodiment of the present application.
  • For more details of the processor 801 implementing the above functions, refer to the descriptions in the foregoing method embodiments; details are not repeated here.
  • FIG. 28 is a schematic structural diagram of a gateway device 900 provided in an embodiment of the present application.
  • the gateway device 900 includes: a main control board 910 and an interface board 930 .
  • the gateway device 900 shown in FIG. 28 is the gateway device 33 in FIG. 8 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 9 .
  • the interface board 930 is used to support the gateway device in FIG. 9 to execute S402, S404, S409, S411, S403 and S410.
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 11 .
  • the interface board 930 is used to support the gateway device in FIG. 11 to execute S502, S504, S509, S511, S503 and S510.
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 12 .
  • the interface board 930 includes network ports in the gateway device in FIG. 12 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 13 .
  • the forwarding entry storage 934 includes the cache in the gateway device in FIG. 13.
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 14 .
  • the interface board 930 includes the RDMA adapter, the NOF monitoring module and various ports shown in FIG. 14 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 15 .
  • the interface board 930 is used to support the gateway device in FIG. 15 to execute S612, looking up the address translation table, NOF simple proxy processing, NOF-RDMA packet conversion, RDMA-NOF packet conversion, the address translation table delivered in S614, the address translation table issued in S621, the NOF read/write request in S622, the NOF read/write response in S623, the RDMA read/write response in S632, and S631 and S633.
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 16 .
  • the interface board 930 includes a NOF-RDMA conversion module, an RDMA-NOF conversion module, a message analysis module, a NOF proxy sending module and an RDMA proxy sending module in FIG. 16 .
  • the forwarding entry storage 934 is used to save the NOF context table in FIG. 16 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 17 .
  • the forwarding entry storage 934 is used to store the address translation table in FIG. 17 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 18 .
  • the interface board 930 includes the NOF proxy packet-sending module and the RDMA proxy packet-sending module in FIG. 18.
  • the storage 912 or the forwarding entry storage 934 is used to store the address translation table in FIG. 18 .
  • the gateway device 900 is configured to execute the method flow shown in FIG. 19 .
  • the interface board 930 is used to execute S71, S78, S79, S710, S72, S73, S74, S75, S76 and S77 in FIG. 19 .
  • the gateway device 900 is configured to execute the method flow shown in FIG. 20.
  • the interface board 930 is used to execute S81, S87, S88, S82, S83, S84, S85 and S86 in FIG. 20 .
  • the gateway device 900 shown in FIG. 28 is the gateway device in FIG. 21 .
  • the interface board 930 includes a message parsing module, a NOF-RDMA conversion module, an RDMA-NOF conversion module, an RDMA proxy sending module and a NOF proxy sending module in FIG. 21 .
  • the forwarding entry storage 934 is used for saving the address translation table in FIG. 21 .
  • the gateway device 900 is configured to execute the method flow shown in FIG. 22 .
  • the interface board 930 is used to execute S71, S78, S79, S710, S72, S73, S74, S75, S76 and S77' in FIG. 22 .
  • the gateway device 900 is configured to execute the method flow shown in FIG. 23 .
  • the interface board 930 is used to execute S81, S87, S88, S82, S83, S84, S85 and S86' in FIG. 23 .
  • the gateway device 900 is configured to execute the method flow shown in FIG. 25 .
  • the interface board 930 is used to execute S91, S92, S93, and S95 in FIG. 25; the main control board 910 is used to execute S94, S96, and S97 in FIG. 25.
  • the main control board 910 is also called a main processing unit (MPU) or a route processor card, and is responsible for device management, device maintenance, and protocol processing functions.
  • the main control board 910 includes: a CPU 911 and a memory 912 .
  • the interface board 930 is also called a line processing unit (LPU), a line card, or a service board.
  • the interface board 930 is used to provide various service interfaces and implement forwarding of data packets.
  • the service interface includes but is not limited to an Ethernet interface, a packet over SONET/SDH (POS) interface, and the like; the Ethernet interface is, for example, a flexible Ethernet service interface (flexible ethernet clients, FlexE clients).
  • the interface board 930 includes: a central processing unit 931 , a network processor 932 , a forwarding entry storage 934 and a physical interface card (physical interface card, PIC) 933 .
  • the CPU 931 on the interface board 930 is used to control and manage the interface board 930 and communicate with the CPU 911 on the main control board 910 .
  • the network processor 932 is configured to implement message forwarding processing.
  • the form of the network processor 932 is, for example, a forwarding chip.
  • the network processor 932 is configured to forward the received packet based on the forwarding table stored in the forwarding entry storage 934: if the destination address of the packet is the address of the gateway device 900, the packet is sent to the CPU (for example, the CPU 931) for processing; if the destination address of the packet is not the address of the gateway device 900, the next hop and the outbound interface corresponding to the destination address are looked up in the forwarding table, and the packet is forwarded to the outbound interface corresponding to the destination address.
  • the processing of an uplink packet includes inbound-interface processing and forwarding table lookup; the processing of a downlink packet includes forwarding table lookup and so on.
  • the physical interface card 933 is used to realize the interconnection function of the physical layer.
  • original traffic enters the interface board 930 through the physical interface card 933, and processed packets are sent out from the physical interface card 933.
  • the physical interface card 933 is also called a daughter card; it can be installed on the interface board 930 and is responsible for converting optical/electrical signals into packets, checking packet validity, and forwarding packets to the network processor 932 for processing.
  • the central processing unit can also perform the functions of the network processor 932, for example implementing software forwarding based on a general-purpose CPU, so that the network processor 932 is not required on the interface board 930.
  • the gateway device 900 includes multiple interface boards.
  • the gateway device 900 further includes an interface board 940 , and the interface board 940 includes: a central processing unit 941 , a network processor 942 , a forwarding entry storage 944 and a physical interface card 943 .
  • the gateway device 900 further includes a switching fabric board 920.
  • the switching fabric board 920 is also called, for example, a switch fabric unit (SFU).
  • the switching fabric board 920 is used to complete the data exchange between the interface boards.
  • the interface board 930 communicates with the interface board 940 through, for example, the switching fabric board 920 .
  • the main control board 910 is coupled to the interface board 930 .
  • the main control board 910, the interface board 930, the interface board 940, and the switching fabric board 920 are connected to the system backplane through the system bus to realize intercommunication, for example, through an inter-process communication (IPC) protocol.
  • the gateway device 900 includes a control plane and a forwarding plane.
  • the control plane includes a main control board 910 and a central processing unit 931.
  • the forwarding plane includes various components for performing forwarding, such as the forwarding entry storage 934, the physical interface card 933, and the network processor 932.
  • the control plane executes router functions such as generating forwarding tables, processing signaling and protocol packets, and configuring and maintaining device status.
  • the control plane sends the generated forwarding tables to the forwarding plane.
  • the network processor 932 looks up the forwarding table issued by the control plane to forward packets received by the physical interface card 933.
  • the forwarding table issued by the control plane is saved in, for example, the forwarding entry storage 934.
  • the control plane and the forwarding plane are, for example, completely separated and not on the same device.
  • the operations on the interface board 940 are the same as those on the interface board 930 , and will not be repeated for brevity.
  • There may be one or more main control boards; when there are multiple main control boards, they include, for example, an active main control board and a standby main control board.
  • There may be one or more interface boards; the stronger the data processing capability of the network device, the more interface boards it provides. There may also be one or more physical interface cards on an interface board.
  • There may be no switching fabric board, or there may be one or more switching fabric boards; when there are multiple, they can jointly implement load sharing and redundant backup. In the centralized forwarding architecture, the network device does not need a switching fabric board, and the interface board undertakes the processing of service data for the entire system.
  • In the distributed forwarding architecture, the network device can have at least one switching fabric board, through which data exchange between multiple interface boards is realized, providing large-capacity data exchange and processing capabilities. Therefore, the data access and processing capabilities of network devices with a distributed architecture exceed those of devices with a centralized architecture.
  • the form of the network device can also be that there is only one board, that is, there is no switching fabric board, and the functions of the interface board and the main control board are integrated on this board.
  • In this case, the central processing unit on the interface board and the central processing unit on the main control board can be combined into one central processing unit on this single board to perform the superimposed functions of the two.
  • The data exchange and processing capabilities of devices in this form are low (for example, network devices such as low-end switches or routers). Which architecture to use depends on the specific networking deployment scenario, and is not limited here.
  • "A refers to B" means that A is the same as B or that A is a simple variation of B.
  • The terms "first" and "second" in the description and claims of the embodiments of this application are used to distinguish different objects, not to describe a specific order of objects, and cannot be interpreted as indicating or implying relative importance.
  • For example, the first RDMA storage node and the second RDMA storage node are used to distinguish different RDMA storage nodes, not to describe a specific order of the RDMA storage nodes, nor can it be understood that the first RDMA storage node is more important than the second RDMA storage node.
  • Multiple RDMA storage nodes refer to two or more RDMA storage nodes.
  • a computer program product includes one or more computer instructions.
  • a computer can be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • Computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server, a data center, etc. integrated with one or more available media.
  • the available media may be magnetic media (e.g., floppy disk, hard disk, magnetic tape), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)).


Abstract

This application provides a message processing method, a gateway device, and a storage system, and belongs to the field of storage technologies. This application converts an access request targeting an NVMe node into an access request targeting an RDMA node. Because the storage medium of an NVMe node is a hard disk while the storage medium of an RDMA node is memory, and memory provides faster read/write speeds than a hard disk, storage performance is improved; at the same time, this helps a traditional NOF storage system extend an RDMA memory pool, improving the flexibility of storage system networking and capacity expansion.

Description

Message processing method, gateway device, and storage system
This application claims priority to Chinese Patent Application No. 202210114823.1, filed on January 30, 2022 and entitled "Message processing method, gateway device, and storage system", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of storage technologies, and in particular to a message processing method, a gateway device, and a storage system.
Background
Non-volatile memory express (NVM Express, NVMe) is a bus transport protocol specification based on a device logical interface. NVMe provides software and hardware standards for a host to access a solid state disk (SSD) over the peripheral component interconnect express (PCIe) bus; instructions that follow the format specified in the NVMe standard may be called NVMe instructions. NVMe over fabric (NOF) is an NVMe-based protocol carried on the network side. NOF supports transmitting NVMe instructions over various transport layer protocols, thereby extending NVMe application scenarios to storage area networks (SAN) and allowing hosts to access storage devices over a network.
At present, the basic interaction flow of a NOF-based storage solution includes: a client sends a first NOF request packet carrying an NVMe instruction; a server receives the first NOF request packet; the server parses the first NOF request packet to obtain the NVMe instruction it carries; and the server performs the operation corresponding to the NVMe instruction on the NVMe storage medium in the server. The NVMe storage medium is usually a hard disk.
Because a hard disk usually performs worse than memory such as dynamic random access memory (DRAM), and the instruction set for hard disk operations is more complex than that for memory operations, the performance of current NOF-based storage solutions is low.
Summary
Embodiments of this application provide a message processing method, a gateway device, and a storage system, which can improve storage performance. The technical solutions are as follows.
According to a first aspect, a message processing method is provided. The method includes: a gateway device receives a first NOF request packet from a client, where the first NOF request packet carries an NVMe instruction and the NVMe instruction indicates a read/write operation on a first destination address; the gateway device obtains information about a first RDMA storage node based on the first destination address; and the gateway device sends a first RDMA request packet to the first RDMA storage node, where the first RDMA request packet carries an RDMA instruction corresponding to the NVMe instruction.
The three types of network elements — client, gateway device, and RDMA storage node — are named by device function, that is, by the role each device plays in the solution. The client is the entity that initiates the NOF request packet, i.e., the NOF requester. The RDMA storage node is the entity that performs the read/write operation in response to the RDMA request packet, also called the RDMA server. The gateway device is effectively the entry point for accessing the RDMA storage node: NOF request packets from the client pass through this entry point before being transmitted to the RDMA storage node. Gateway devices include, without limitation, servers, server proxies, routers, switches, and firewalls.
The RDMA instruction corresponds to the NVMe instruction. For example, the operation type indicated by the RDMA instruction is the same as that indicated by the NVMe instruction, the operation type including a read operation and a write operation: if the first NOF request packet carries an NVMe read instruction, the first RDMA request packet carries an RDMA read instruction; if the first NOF request packet carries an NVMe write instruction, the first RDMA request packet carries an RDMA write instruction. As another example, the data to be processed indicated by the RDMA instruction is the same as that indicated by the NVMe instruction: the data to be read indicated by the RDMA instruction is the same as the data to be read indicated by the NVMe instruction, or the data to be saved indicated by the RDMA instruction is the same as the data to be saved indicated by the NVMe instruction.
The first destination address indicates a location in the storage space provided by the NVMe storage medium. Optionally, the first destination address is a logical address (also called a virtual address). Optionally, the first destination address includes a start logical block address (start LBA) and a block number (also called number of logical blocks).
In a possible implementation, the information about the first RDMA storage node includes at least one of the following: a second destination address, network location information of the first RDMA storage node, identifiers of one or more queue pairs (QP) in the first RDMA storage node, and a remote key (R_Key).
The second destination address points to a memory space in the first RDMA storage node, that is, a segment of space in memory whose location is indicated by the second destination address. The second destination address can take multiple forms. For example, the second destination address includes a start address and a length: a start address of 0x1FFFF and a length of 32KB point to the 32KB space starting at address 0x1FFFF in the memory of the first RDMA storage node. As another example, the second destination address includes a start address and an end address: a start address of 0x1FFFF and an end address of 0x2FFFF point to the space from address 0x1FFFF to address 0x2FFFF in the memory of the first RDMA storage node. When reading data, the second destination address points to the memory space in the first RDMA storage node that holds the data to be read. When writing data, the second destination address indicates the memory space in the first RDMA storage node into which data is to be written. Optionally, the second destination address is a logical address (also called a virtual address). Optionally, the start address in the second destination address is specifically a virtual address (VA), and the length in the second destination address is specifically a direct memory access length (DMA length).
The network location information of the first RDMA storage node identifies the first RDMA storage node in the network. For example, the network location information guides routing and forwarding by network devices between the gateway device and the first RDMA storage node. Exemplarily, the network location information of the first RDMA storage node includes at least one of a MAC address, an IP address, a multi-protocol label switching (MPLS) label, or a segment identifier (SID).
A QP includes a send queue (SQ) and a receive queue (RQ); QPs are used to manage various types of messages.
The R_Key indicates the permission to access the memory of the first RDMA storage node, and is also called a memory key. In one possible implementation, the R_Key indicates the permission to access a specific memory space on the first RDMA storage node, for example a memory space holding the data to be read, or a pre-registered memory space. In another possible implementation, in a scenario of writing data to a first storage node and a second storage node, the R_Key indicates the permission to access the memory of the first RDMA storage node and the memory of the second RDMA storage node.
DMA length indicates the length of an RDMA operation. For example, a DMA length of 16KB indicates performing an RDMA operation on a 16KB memory space. RDMA operations include write operations (for example, writing data into memory) and read operations (for example, reading data from memory).
By executing the method of the first aspect, the gateway device converts an access request to an NVMe node into an access request to an RDMA node. Because the storage medium of the NVMe node is a hard disk while that of the RDMA node is memory, and memory provides faster read/write speeds than a hard disk, the method improves storage performance. Of course, if the NVMe instruction indicates a read operation, then to successfully access the corresponding data using this method, the first RDMA storage node should store the data that the NVMe instruction indicates is to be read. In addition, because the instruction set for memory operations is simpler than that for hard disk operations, the method reduces the complexity of executing read/write instructions on the storage node.
Moreover, from the client's perspective, the client initiates access following the original NOF procedure and can use the storage service provided by the RDMA storage node without perceiving the change in storage nodes and without being required to support RDMA. The method is therefore compatible with the original NOF storage solution and facilitates rapid service provisioning.
Optionally, the first RDMA request packet further includes the information about the first RDMA storage node.
Optionally, the gateway device obtaining the information about the first RDMA storage node based on the first destination address includes: the gateway device looks up the information about the first RDMA storage node from a first correspondence based on the first destination address.
The first correspondence is the correspondence between the first destination address and the information about the first RDMA storage node; that is, the first correspondence indicates the correspondence between the first destination address and the information about the first RDMA storage node.
There are multiple ways in which the first correspondence can indicate this correspondence. Optionally, the first correspondence includes the first destination address and the information about the first RDMA storage node; for example, the first correspondence is equivalent to a table whose index is the first destination address and whose value is the information about the first RDMA storage node. Alternatively, the first correspondence does not include the information about the first RDMA storage node itself but includes other information associated with it, such as metadata of that information, the file name of a file holding that information, or a uniform resource locator (URL).
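As a schematic illustration of such a table-style first correspondence, the following Python sketch maps an NVMe logical address range to the information about an RDMA storage node; the table layout, field names, and values are illustrative assumptions rather than a prescribed format:

```python
# Minimal sketch of the first correspondence: destination NVMe logical
# address -> information about the RDMA storage node. All names and
# values are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class RdmaNodeInfo:
    node_ip: str      # network location information
    qp_id: int        # queue pair identifier on the storage node
    r_key: int        # remote key authorizing memory access
    base_va: int      # start of the memory space (second destination address)

# Indexed by (start_byte, end_byte) of the NVMe logical address space.
ADDRESS_TRANSLATION_TABLE = [
    ((0x0000_0000, 0x0FFF_FFFF), RdmaNodeInfo("192.0.2.11", 17, 0xABCD, 0x1FFFF)),
    ((0x1000_0000, 0x1FFF_FFFF), RdmaNodeInfo("192.0.2.12", 23, 0xBEEF, 0x2FFFF)),
]

def lookup_first_correspondence(first_dst_addr: int) -> RdmaNodeInfo | None:
    """Return the RDMA node info whose NVMe range covers the address."""
    for (lo, hi), info in ADDRESS_TRANSLATION_TABLE:
        if lo <= first_dst_addr <= hi:
            return info
    return None

print(lookup_first_correspondence(0x1000_4000))  # second node's info
```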
Through the above implementation, the storage node's addressing task is offloaded (addressing refers to the process of finding the destination storage node based on the destination NVMe address), thereby relieving the CPU pressure and network I/O pressure of NOF storage nodes.
Optionally, after the gateway device sends the first RDMA request packet to the first RDMA storage node, the method further includes: the gateway device receives an RDMA response packet from the first RDMA storage node, where the RDMA response packet is a response to the first RDMA request packet; the gateway device generates a first NOF response packet based on the RDMA response packet; and the gateway device sends the first NOF response packet to the client.
The first NOF response packet is a response to the first NOF request packet and represents a reply to the NVMe instruction in the first NOF request packet. When reading data, the first NOF response packet includes the data requested by the first NOF request packet; optionally, the first NOF response packet further includes a completion queue element (CQE) indicating that the NVMe read operation has completed. When writing data, the first NOF response packet is a NOF write response packet including a CQE that indicates the NVMe write operation has completed, that is, the data has been saved successfully.
With this implementation, the gateway device acts as a NOF protocol stack proxy and replies to the client on behalf of the RDMA storage node. On one hand, because the response packet perceived by the client is still a NOF packet, the client does not need to be aware of the protocol packet conversion logic, reducing the difficulty of maintaining the client. On the other hand, the RDMA storage node is not required to support the NOF protocol, reducing the protocol types the RDMA storage node must support.
Optionally, the gateway device generating the first NOF response packet based on the RDMA response packet includes: the gateway device obtains RDMA state information based on the RDMA response packet, where the RDMA state information indicates the correspondence between the RDMA response packet and the first RDMA request packet; the gateway device looks up NOF state information from a second correspondence based on the RDMA state information, where the NOF state information indicates the correspondence between the first NOF response packet and the first NOF request packet; and the gateway device generates the first NOF response packet based on the NOF state information.
The second correspondence is the correspondence between the RDMA state information and the NOF state information, and includes that correspondence.
Optionally, the first NOF response packet includes the NOF state information.
Optionally, the first NOF request packet includes the NOF state information.
How the NOF state information indicates the correspondence between the first NOF response packet and the first NOF request packet covers multiple cases. In one possible implementation, the NOF state information is the packet sequence number in the first NOF request packet. In another possible implementation, it is the packet sequence number in the first NOF response packet. In yet another possible implementation, it is a value obtained by converting the packet sequence number in the first NOF request packet according to a set rule.
In this way, on one hand, the NOF packets returned by the gateway device to the client carry accurate state information, which maintains the continuity of the NOF-based session between the client and the gateway device and improves the communication success rate. On the other hand, the original RDMA protocol does not need to be modified, so the complexity is low.
Optionally, before the gateway device looks up the NOF state information from the second correspondence based on the RDMA state information, the method further includes: the gateway device obtains the NOF state information based on the first NOF request packet, and the gateway device establishes the second correspondence, which is the correspondence between the NOF state information and the RDMA state information.
In this way, because the gateway device links the NOF state with the RDMA state during its interactions with the client and the RDMA node, it provides accurate state information for the process of replying with NOF packets.
Optionally, the first RDMA request packet includes the NOF state information and the RDMA response packet includes the NOF state information, and the gateway device generating the first NOF response packet based on the RDMA response packet includes: the gateway device obtains the NOF state information based on the RDMA response packet, and generates the first NOF response packet based on the NOF state information.
In this way, the gateway device can obtain the NOF state information without maintaining additional local entries, saving the gateway device's storage space and reducing the resource overhead of table lookups and writes on the gateway device.
Optionally, the first RDMA request packet includes a first NOF header, the RDMA response packet includes a second NOF header generated by the first RDMA storage node based on the first NOF header, and the first NOF response packet includes the second NOF header.
The first NOF header is the header of a NOF packet. For example, the first NOF header is the header of the first NOF packet corresponding to the RDMA request packet.
A NOF header includes the header corresponding to the fabric and NVMe-layer information. The "fabric" is the network between the host and the storage medium; typical fabric forms include Ethernet, Fibre Channel, and InfiniBand (IB). The specific format of the fabric header depends on how the fabric is implemented and may include headers of multiple protocol layers. For example, if the fabric is implemented over RoCEv2, the fabric header includes a MAC header (link layer), an IP header (network layer), a UDP header (transport layer), and an IB header (transport layer). Alternatively, the fabric header corresponds to a single protocol; for example, if the fabric is implemented over InfiniBand, the fabric header is an IB header.
With this implementation, on one hand, the gateway device can obtain the NOF state information without maintaining additional local entries, saving storage space and reducing lookup/write overhead. On the other hand, the work of generating the NOF header is moved to the RDMA storage node, relieving the gateway device's processing pressure.
Optionally, the RDMA state information includes a packet sequence number (PSN).
Optionally, the NOF state information includes at least one of: a PSN, a submission queue head pointer (SQHD), a command identifier (command ID), a destination queue pair (DQP), a virtual address, an R_Key, or a direct memory access length.
The PSN supports detection and retransmission of lost packets.
The SQHD indicates the current head of the submission queue (SQ) and informs the host of the entries in the SQ that have been consumed (that is, read/write instructions already added to the SQ).
The command ID is the identifier of the command associated with an error.
The R_Key describes a remote device's permission to access local memory, such as the client's permission to access the RDMA storage node's memory. The R_Key is also called a memory key and is usually used together with a VA. Optionally, the R_Key also helps hardware identify the page table that translates virtual addresses into physical addresses.
DMA length indicates the length of an RDMA operation.
Optionally, the method further includes: the gateway device obtains information about a second RDMA storage node based on the first destination address; and when the NVMe instruction indicates a write operation, the gateway device sends a second RDMA request packet to the second RDMA storage node, the second RDMA request packet carrying the RDMA instruction corresponding to the NVMe instruction.
Optionally, the second RDMA request packet further includes the information about the second RDMA storage node.
In this way, writing the same data to each of multiple RDMA storage nodes is supported, implementing a data backup function.
Optionally, the first RDMA request packet and the second RDMA request packet are both multicast packets; or, the first RDMA request packet and the second RDMA request packet are both unicast packets.
Optionally, before the gateway device sends the first RDMA request packet to the first RDMA storage node, the gateway device also obtains the information about the second RDMA storage node based on the first destination address; when the NVMe instruction indicates a read operation, the gateway device selects the first RDMA storage node from the first RDMA storage node and the second RDMA storage node based on a load-sharing algorithm.
In this way, sending a read request to one of multiple candidate RDMA storage nodes is supported, enabling load sharing and allowing multiple RDMA nodes to share the processing pressure of reading data.
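The load-sharing algorithm itself is not limited; as one hedged example, the following sketch picks a read replica deterministically by hashing the first destination address (the hashing policy and names are assumptions):

```python
# Hypothetical sketch of selecting one replica for a read request when
# several RDMA storage nodes hold the same data.
import zlib

def pick_read_replica(first_dst_addr: int, replicas: list[str]) -> str:
    """Deterministic hash-based load sharing across candidate nodes."""
    key = first_dst_addr.to_bytes(8, "big")
    return replicas[zlib.crc32(key) % len(replicas)]

replicas = ["rdma-node-1", "rdma-node-2"]
print(pick_read_replica(0x1000_4000, replicas))
```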
Optionally, the method further includes: the gateway device receives the first correspondence from a device other than the gateway device; or the gateway device generates the first correspondence.
Optionally, the gateway device generating the first correspondence includes: the gateway device allocates an NVMe logical address to the first RDMA storage node to obtain the first destination address, and establishes the correspondence between the first destination address and the information about the first RDMA storage node, thereby producing the first correspondence.
In the process of generating the first correspondence, the gateway device can obtain the information about the first RDMA storage node in multiple ways. In one possible implementation, the first RDMA storage node actively reports its own information to the gateway device: exemplarily, the first RDMA storage node generates and sends an RDMA registration packet to the gateway device, and the gateway device receives the RDMA registration packet from the first RDMA storage node and obtains the information about the first RDMA storage node from it. In another possible implementation, the gateway device pulls the information from the first RDMA storage node: for example, the gateway device generates and sends a query request indicating that the information about the first RDMA storage node is to be obtained; the first RDMA storage node receives the query request and generates and sends a query response including its information; the gateway device receives the query response and obtains the information about the first RDMA storage node from it.
Optionally, after the gateway device receives the first NOF request packet from the client, the gateway device also obtains information about a NOF storage node based on the first destination address, generates a second NOF request packet based on the first NOF request packet, and sends the second NOF request packet to the NOF storage node.
Optionally, after the gateway device sends the second NOF request packet to the NOF storage node, the gateway device also receives a second NOF response packet from the NOF storage node, generates a third NOF response packet based on the second NOF response packet, and sends the third NOF response packet to the client.
This implementation supports the original NOF interaction flow, thereby remaining compatible with the original NOF storage solution.
Optionally, the first RDMA storage node is a storage server, a memory, or a storage array.
Optionally, the memory is dynamic random access memory (DRAM), storage class memory (SCM), or a dual in-line memory module (DIMM).
According to a second aspect, a gateway device is provided, which has the function of implementing the first aspect or any optional manner of the first aspect. The gateway device includes at least one unit, and the units of the gateway device are used to implement the method provided by the first aspect or any optional manner of the first aspect. In some embodiments, the units in the gateway device are implemented in software, and the units in the gateway device are program modules. In other embodiments, the units in the gateway device are implemented in hardware or firmware. For specific details of the gateway device provided by the second aspect, refer to the first aspect or any optional manner of the first aspect; details are not repeated here.
According to a third aspect, a gateway device is provided that includes a processor and a network interface. The processor is coupled to a memory, the network interface is used to receive or send packets, the memory stores at least one computer program instruction, and the at least one computer program instruction is loaded and executed by the processor so that the gateway device implements the method provided by the first aspect or any optional manner of the first aspect.
Optionally, the processor of the gateway device is a processing circuit. For example, the processor of the gateway device is a programmable logic circuit, such as a field-programmable gate array (FPGA), or a programmable device such as a coprocessor.
Optionally, the memory of the gateway device is a storage medium, including without limitation memory or a hard disk. The memory is, for example, DRAM, SCM, or a DIMM. The hard disk is, for example, a solid state disk (SSD) or a hard disk drive (HDD).
For specific details of the gateway device provided by the third aspect, refer to the first aspect or any optional manner of the first aspect; details are not repeated here.
According to a fourth aspect, a gateway device is provided that includes a main control board and an interface board, and may further include a switching fabric board. The gateway device is used to perform the method in the first aspect or any possible implementation of the first aspect.
According to a fifth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction which, when run on a computer, causes the computer to perform the method provided by the first aspect or any optional manner of the first aspect.
According to a sixth aspect, a computer program product is provided. The computer program product includes one or more computer program instructions which, when loaded and run by a computer, cause the computer to perform the method provided by the first aspect or any optional manner of the first aspect.
According to a seventh aspect, a chip is provided. The chip includes a programmable logic circuit and/or program instructions and, when running, is used to implement the method provided by the first aspect or any optional manner of the first aspect. Exemplarily, the chip is a network card.
According to an eighth aspect, a storage system is provided that includes the gateway device of the second, third, or fourth aspect and one or more RDMA storage nodes, the one or more RDMA storage nodes including the first RDMA storage node.
The gateway device is used to receive the first NOF request packet from the client, obtain the information about the first RDMA storage node based on the first destination address, and send the first RDMA request packet to the first RDMA storage node. The first RDMA storage node is used to receive the first RDMA request packet from the gateway device and perform the read/write operation on the second destination address according to the RDMA instruction.
While supporting the original NOF flow, the storage system also introduces support for RDMA storage nodes, fully leveraging the advantages of RDMA memory storage and greatly improving overall system performance. At the same time, the client perceives no change when using the NOF storage service, ensuring availability.
In a possible implementation, the first RDMA storage node is used to send the information about the first RDMA storage node to the gateway device; the gateway device is used to receive that information and establish the first correspondence based on it.
In a possible implementation, the first RDMA storage node is used to generate an RDMA response packet based on the first RDMA request packet and send the RDMA response packet to the gateway device;
the gateway device is used to receive the RDMA response packet, generate the first NOF response packet based on the RDMA response packet, and send the first NOF response packet to the client.
In a possible implementation, the storage system further includes one or more NOF storage nodes. In this way, a hybrid networking mode of NOF hard-disk storage plus RDMA memory-medium storage is supported, improving networking flexibility and supporting more service scenarios.
According to a ninth aspect, a message processing method is provided. The method includes: a first RDMA storage node receives a first RDMA request packet from a gateway device, the first RDMA request packet including an RDMA instruction and a first NOF header, the RDMA instruction indicating a read/write operation on a second destination address; the first RDMA storage node performs the read/write operation on the second destination address according to the RDMA instruction; and the first RDMA storage node sends an RDMA response packet to the gateway device, the RDMA response packet being a response to the first RDMA request packet and including a second NOF header corresponding to the first NOF header.
The second NOF header corresponding to the first NOF header means that the NOF state information carried in the second NOF header is the same as the NOF state information carried in the first NOF header.
In the above method, because the RDMA storage node takes on part of the work of generating the NOF header and returns the NOF header to the gateway device along with the RDMA response packet, the processing pressure on the gateway device to restore the NOF header is relieved, and the gateway device does not need to cache the NOF header from the NOF request packet, saving internal storage space on the gateway device.
Optionally, generating the second NOF header based on the first NOF header includes: filling in the missing content in the first NOF header to obtain the second NOF header.
Optionally, generating the second NOF header based on the first NOF header includes: modifying the invariant cyclic redundancy check (ICRC) in the first NOF header to obtain the second NOF header.
Brief Description of Drawings
FIG. 1 is a schematic structural diagram of an NVMe SSD according to an embodiment of this application;
FIG. 2 is a schematic flowchart of communication between a host and an NVMe controller according to an embodiment of this application;
FIG. 3 is a schematic diagram of a queue pair mechanism according to an embodiment of this application;
FIG. 4 is a schematic architecture diagram of a NOF storage system according to an embodiment of this application;
FIG. 5 is a schematic diagram of a storage node's parsing path for the NOF protocol stack according to an embodiment of this application;
FIG. 6 is a schematic architecture diagram of an RDMA system according to an embodiment of this application;
FIG. 7 is a schematic diagram of a queue pair mechanism according to an embodiment of this application;
FIG. 8 is a schematic diagram of an application scenario according to an embodiment of this application;
FIG. 9 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 10 is a schematic diagram of a gateway device processing packets according to an embodiment of this application;
FIG. 11 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 12 is a schematic architecture diagram of a storage system after a gateway device is deployed according to an embodiment of this application;
FIG. 13 is a schematic diagram of a scenario in which a gateway device acts as a storage node according to an embodiment of this application;
FIG. 14 is a schematic diagram of the logical function architecture of a gateway device according to an embodiment of this application;
FIG. 15 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 16 is a schematic diagram of the logical function architecture of a gateway device according to an embodiment of this application;
FIG. 17 is a schematic diagram of the function of an address translation table according to an embodiment of this application;
FIG. 18 is a schematic diagram of the establishment process and lookup process of a NOF context table according to an embodiment of this application;
FIG. 19 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 20 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 21 is a schematic diagram of the logical function architecture of a gateway device according to an embodiment of this application;
FIG. 22 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 23 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 24 is a schematic diagram of the logical function architecture of a gateway device according to an embodiment of this application;
FIG. 25 is a flowchart of a message processing method according to an embodiment of this application;
FIG. 26 is a schematic structural diagram of a message processing apparatus 700 according to an embodiment of this application;
FIG. 27 is a schematic structural diagram of a gateway device 800 according to an embodiment of this application;
FIG. 28 is a schematic structural diagram of a gateway device 900 according to an embodiment of this application.
Description of Embodiments
To make the objectives, technical solutions, and advantages of this application clearer, the following further describes the implementations of this application in detail with reference to the accompanying drawings.
The character "/" in the embodiments of this application generally indicates an "or" relationship between the associated objects. For example, a read/write operation means a read operation or a write operation.
Some terms and concepts are explained below.
(1) NVMe
NVMe is a bus transport protocol specification based on a device logical interface (equivalent to the application layer in a communication protocol), used to access non-volatile memory media attached via the PCI express (PCIe) bus (for example, solid state drives using flash memory), although the PCIe bus protocol is not theoretically required. NVMe is a protocol — a set of software and hardware standards that allow a solid state disk (SSD) to use the PCIe bus — while PCIe is the actual physical connection channel. NVM is the acronym for non-volatile memory, the common flash form of an SSD. The specification mainly provides a low-latency, internally parallel native interface specification for flash-based storage devices, and provides native storage parallelism support for modern central processing units (CPU), computer platforms, and related applications, allowing host hardware and software to fully exploit the parallel storage capability of solid-state storage devices. Compared with the advanced host controller interface (AHCI) of the earlier hard disk drive (HDD) era (the protocol under serial advanced technology attachment, SATA), NVMe reduces input/output (I/O) operation latency, increases the number of operations in the same time, and provides larger-capacity operation queues.
(2) How NVMe works
In the NVMe specification, the interface between the host and the NVM SSD (the typical storage medium in NVMe) is based on a series of paired submission and completion queues. These queues are created by the driver and shared between the driver (running on the host) and the NVMe SSD. The queues themselves can reside either in host shared memory or in memory provided by the NVMe SSD. Once submission and completion queues are configured, they are used for communication between the driver and the NVMe SSD.
As shown in FIG. 1, an NVMe SSD includes an NVMe controller and a flash array. The NVMe controller is responsible for communicating with the host, and the flash array is responsible for storing data. FIG. 2 shows the flow of communication between the host and the NVMe controller. Referring to FIG. 2: in step 1, the host places a new command on the submission queue. In step 2, the driver notifies the NVMe controller that a new instruction is pending by writing the new tail pointer to the doorbell register. In step 3, the NVMe controller fetches the instruction from the submission queue. In step 4, the NVMe controller processes the instruction. In step 5, after completing the command, the NVMe controller places an entry in the associated completion queue. In step 6, an interrupt is generated. In step 7, after the driver finishes processing the entry, it sends the updated completion queue head pointer to the NVMe controller by writing it to the doorbell register.
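The seven-step doorbell flow above can be modeled in a few lines. The following Python sketch simulates the queues and doorbell registers in-process purely for illustration; in reality the queues reside in shared memory and the doorbells are device registers:

```python
# Toy model of the seven-step submission/completion flow of FIG. 2.
from collections import deque

submission_q: deque = deque()
completion_q: deque = deque()
doorbell = {"sq_tail": 0, "cq_head": 0}

def host_submit(cmd: str) -> None:
    submission_q.append(cmd)                 # step 1: place command on SQ
    doorbell["sq_tail"] = len(submission_q)  # step 2: ring SQ tail doorbell

def controller_run() -> None:
    while submission_q:
        cmd = submission_q.popleft()         # step 3: fetch the command
        result = f"done:{cmd}"               # step 4: process the command
        completion_q.append(result)          # step 5: post completion entry
        # step 6: an interrupt would fire here

def host_reap() -> None:
    while completion_q:
        print("completed:", completion_q.popleft())
        doorbell["cq_head"] += 1             # step 7: ring CQ head doorbell

host_submit("READ LBA 0x10")
controller_run()
host_reap()
```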
Management operations (such as creating and deleting queues on the device or updating firmware) have queues separate from ordinary I/O operations (such as reads and writes). This ensures that I/O operations are not affected by long-running management operations.
The NVMe specification allows up to 64K separate queues, each with up to 64K entries. In practice, the number of queues can be decided based on system configuration and expected load. For example, for a quad-core processor system, one queue pair can be set per core, which helps implement a lock-free mechanism. However, NVMe also allows the driver to create multiple submission queues per core and establish different priorities among them. Although submission queues are usually serviced in a round-robin manner, NVMe optionally supports a weighted round-robin scheme that allows some queues to be serviced more frequently than others. FIG. 3 is a schematic diagram of a queue pair mechanism; as shown in FIG. 3, there is a one-to-one correspondence between submission queues and completion queues.
(3) NVMe instructions
NVMe instructions are instructions defined by the NVMe protocol. Instructions in the NVMe protocol are divided into Admin (administrator) instructions and I/O instructions (I/O instructions are also called NVM instructions). Admin instructions are used to manage and control the NVMe storage medium; I/O instructions are used to transfer data. Optionally, an NVMe instruction occupies 64 bytes in a packet. The I/O instructions in the NVMe protocol include NVMe read instructions and NVMe write instructions.
(4) NVMe read instruction
An NVMe read instruction is used to read data from the NVMe storage medium. Exemplarily, if the content of the opcode field in the NVMe layer of a NOF packet is 02h, the NVMe instruction carried in the NOF packet is an NVMe read instruction.
(5) NVMe write instruction
An NVMe write instruction is used to write data to the NVMe storage medium. Exemplarily, if the content of the opcode field in the NVMe layer of a NOF packet is 01h, the NVMe instruction carried in the NOF packet is an NVMe write instruction.
(6) NOF
NOF is a high-speed storage protocol built on the NVMe specification and used to access NVMe storage media across a network. The NOF protocol adds fabric-related instructions on top of NVMe, extending NVMe application scenarios beyond a single device to cross-network communication.
The "fabric" is the network between the host and the storage medium; typical fabric forms include Ethernet, Fibre Channel, and InfiniBand (IB). At present, some technologies also attempt to implement the fabric using remote direct memory access (RDMA), for example using RDMA over converged Ethernet (RoCE). Implementing the fabric with RDMA is the NVMe over RDMA technique; for details, refer to the introduction to NVMe over RDMA at (8) below.
FIG. 4 is a schematic architecture diagram of a NOF storage system. In the scenario shown in FIG. 4, the fabric in the NOF technology is implemented with RoCEv2, that is, NVMe is carried over RoCEv2. As shown in FIG. 4, after an upper-layer application sends an NVMe instruction, the network card encapsulates the NVMe instruction into a RoCE packet and sends the RoCE packet over Ethernet to the NOF storage node where the NVMe SSD is located. The architecture shown in FIG. 4 supports the host accessing the NVMe SSD across Ethernet.
FIG. 5 shows the parsing path of the NOF protocol stack on a NOF storage node, using a RoCE packet as the example NOF packet. As shown in FIG. 5, the network card of the NOF storage node has processing modules corresponding to the various protocol stacks. After the NOF storage node receives a RoCE packet, its network card sequentially performs media access control (MAC) protocol stack parsing, internet protocol (IP) protocol stack parsing, user datagram protocol (UDP) protocol stack parsing, IB protocol stack parsing, and NVMe protocol stack parsing on the RoCE packet through the processing modules, obtaining the NVMe instruction carried in the RoCE packet. The network card sends the NVMe instruction over the PCIe bus to the NVMe controller in the SSD, and the NVMe controller executes the NVMe instruction to read or write data on the flash array.
FIG. 5 uses the network card being responsible for the parsing of the various protocol stacks as an example; the parsing tasks can optionally be performed by the storage node's CPU or other components.
The above focuses on the processing flow of NOF packets or NVMe instructions inside a storage node; the following introduces how devices interact with each other based on the NOF protocol.
For example, the NOF-based interaction flow between client A and NOF storage node B includes the following steps (1) to (8). In the following flow, NOF is implemented over the RoCEv2 protocol; in other words, NVMe is carried over RoCEv2. The flow is described for the non-fragmented case; in the fragmented case, the PSN update in a packet can change from adding one to the PSN to adding another value to the PSN.
Step (1): Client A establishes a connection with NOF storage node B.
The queue pairs (QP) at the two ends — client A and NOF storage node B — establish a logical session. Client A initializes the packet sequence number (PSN) in the A-to-B direction, obtaining the initial PSN-AB. NOF storage node B initializes the PSN in the B-to-A direction, obtaining PSN-BA. PSN-AB is the PSN in the direction from client A to NOF storage node B, and PSN-BA is the PSN in the direction from NOF storage node B to client A.
Step (2): Client A sends an RDMA send only packet to NOF storage node B. The RDMA send only packet is a read request.
In the RDMA send only packet, PSN-AB1 is the current PSN in the A-to-B direction. If no interaction has occurred since initialization, PSN-AB1 is the initial PSN-AB; if interaction has occurred since initialization, PSN-AB1 is the PSN-AB1 of the current state. The NVMe layer of the RDMA send only packet contains a scatter gather list (SGL) specifying client A's memory address, in which the start logical block address (start LBA) and the block number (number of logical blocks) specify the destination storage address on NOF storage node B, and the command identifier (command ID) specifies the sequence number of the NVMe operation.
Step (3): NOF storage node B generates an RDMA acknowledge (ACK) packet with PSN-AB1 as the PSN. NOF storage node B sends the RDMA ACK packet to client A.
Step (4): NOF storage node B generates an RDMA read response packet with PSN-BA1 as the PSN, and sends the RDMA read response packet to client A. The content of the RDMA extended transport header (RETH) in the RDMA read response packet is the SGL information in the NVMe layer. The payload of the RDMA read response packet is the specific data values from the NVMe hard disk.
Step (5): NOF storage node B generates an RDMA send only invalidate packet with PSN-BA1+1 as the PSN, and sends it to client A. The RETH content in the RDMA send only invalidate packet is the remote key of the SGL in the NVMe layer. The NVMe layer information contains the command ID of the request. The submission queue head pointer (SQHD) is the head pointer position of the current operation in the submission queue.
Step (6): Client A sends an RDMA send only packet to NOF storage node B. This RDMA send only packet is used to request writing data to a memory space on NOF storage node B. If the RDMA send only packet immediately follows the previous read request, its PSN-AB1 is the current PSN in the A-to-B direction; otherwise, its PSN-AB1 is the current state's PSN-AB1+1. The NVMe layer information in the RDMA send only packet indicating a write operation is the same as that in the RDMA send only packet indicating a read operation. The payload of the RDMA send only packet is the specific data values to be written to memory.
Step (7): NOF storage node B generates an RDMA ACK packet with PSN-AB1+1 as the PSN. NOF storage node B sends the RDMA ACK packet to client A.
Step (8): NOF storage node B generates an RDMA send packet with PSN-BA1+2 as the PSN. NOF storage node B sends the RDMA send packet to client A. The NVMe layer information in the RDMA send packet indicating a write operation is the same as that in the RDMA send packet indicating a read operation. The SQHD in the write-indicating RDMA send packet is the SQHD replied by NOF storage node B during the previous read operation plus 1.
(7) RoCE
RoCE is a network protocol capable of carrying the RDMA protocol and the NOF protocol. RoCE allows RDMA to be used over Ethernet. RoCE has two versions: RoCE v1 and RoCE v2. RoCE v1 is an Ethernet link layer protocol, so RoCE v1 supports RDMA data transmission between any two hosts in the same Ethernet broadcast domain. RoCE v2 is a network layer protocol. RoCE v2 packets include a UDP header and an IP header, so they can be routed by IP, supporting RDMA data transmission between any two hosts in an IP network. In some optional embodiments of this application, the gateway device interacts with the client and the storage node respectively based on the RoCE protocol.
(8) NVMe over RDMA (NoR)
NVMe over RDMA is a technique that uses RDMA to transmit NVMe instructions or the results of executing NVMe instructions. From a protocol stack perspective, NVMe is carried above RDMA in NVMe over RDMA. In the NVMe over RDMA solution, RDMA acts as the carrier of the NVMe protocol, or in other words the transmission channel of the NVMe protocol. By analogy, RDMA in NVMe over RDMA plays a role similar to the PCIe bus inside a computer: the PCIe bus transfers data between the CPU and a local hard disk, while RDMA in NVMe over RDMA transmits NVMe instructions between the host and a remote hard disk across the network.
Some embodiments of this application differ fundamentally in inventive concept from the NVMe over RDMA solution. Some embodiments of this application use instructions in RDMA to read and write memory, exploiting performance advantages such as memory's faster read/write speed to improve storage performance and to reduce the complexity of the instruction set the storage node must process. The NVMe over RDMA solution, by contrast, uses RDMA as a transmission channel for NVMe to reduce the latency of transmitting NVMe instructions across the network. From the perspective of packet content, in some embodiments of this application the content of the packet sent by the gateway device to the storage node is an RDMA instruction whose semantics are how to operate on memory, while in the NVMe over RDMA solution the content of the packet sent to the storage node is an NVMe instruction whose semantics are how to operate on a hard disk. From the perspective of storage media, some embodiments of this application support using memory as the storage medium to provide data read/write services to the client, whereas the NVMe over RDMA solution uses a hard disk as the storage medium to provide those services.
(9) RDMA
RDMA is a technique for accessing the memory of a remote device while bypassing the remote device's operating system kernel. Because RDMA usually does not involve the operating system, it not only saves substantial CPU resources but also improves throughput and reduces network communication latency. RDMA is particularly suited to large-scale parallel computer clusters.
An RDMA storage node is a storage node that provides data read/write services through RDMA. RDMA storage nodes come in many product forms; for example, an RDMA storage node is a storage server, a desktop computer, and so on.
RDMA has several major characteristics: (1) the local device transfers data with the remote device over the network; (2) in most cases the operating system kernel is not involved, and the data transfer task is offloaded to a smart network card; (3) because the operating system kernel is not involved when data is transferred between user-space virtual memory and the smart network card, extra data movement and copying are avoided.
At present, there are roughly three types of RDMA networks: InfiniBand, RoCE, and the internet wide area RDMA protocol (iWARP). InfiniBand is a network designed specifically for RDMA, guaranteeing reliable transmission in hardware, while RoCE and iWARP are Ethernet-based RDMA technologies.
(10) RDMA one-sided operations
RDMA one-sided operations allow the CPU on the local side to participate in the work while the CPU of the remote device does not. In other words, during an RDMA one-sided operation, the remote device's CPU is bypassed (CPU bypass). One-sided RDMA operations are usually used to transfer data; commonly mentioned RDMA operations all refer to one-sided RDMA operations. During an RDMA read or write operation, usually only the local side needs to know the source and destination addresses; the remote application need not perceive the communication. The read or write of data is completed by the remote network card, which then encapsulates a message and returns it to the local side. RDMA one-sided operations include RDMA read operations and RDMA write operations.
(11) RDMA write operation
An RDMA write operation writes data into the memory of the server side (that is, the RDMA storage node). The basic principle of an RDMA write operation is that the client pushes data from its local buffer into the server's memory based on the server memory address and the permission to access the server's memory. In the RDMA protocol, the client's permission to access the server's memory is called the remote key (R_Key).
For example, in the RDMA architecture shown in FIG. 6, the basic workflow of an RDMA write operation is as follows: 1) After an application 101 in the client 100 generates an RDMA write request packet, the application 101 places the RDMA write request packet in the buffer 102. The processor 142 of the local network card 140 reads the request packet into the network card 140's own buffer 141, bypassing the operating system 103 in the process. The RDMA write request packet contains the logical address of the memory space of the RDMA storage node 200, the remote key, and the data that the application 101 wants to save; the remote key indicates that the network card 140 has access permission to the memory of the RDMA storage node 200. 2) The processor 142 of the network card 140 sends the RDMA write request packet over the network to the network card 240. 3) The processor 242 of the network card 240 verifies the remote key in the RDMA write request packet; if the remote key is confirmed correct, the processor 242 writes the data carried in the RDMA write request packet from the buffer 241 of the network card 240 into the buffer 202, thereby saving the data into the memory of the RDMA storage node 200.
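As a hedged illustration of the server-side check-then-write behavior described in step 3), the following sketch simulates the remote-key verification and the copy into registered memory; the structures and values are assumptions, not the behavior of any specific NIC:

```python
# Schematic of the target-side handling of an RDMA write: the NIC checks
# the remote key, then copies the payload into the registered memory region.
registered_memory = {
    # r_key -> (backing buffer, base virtual address)
    0xABCD: (bytearray(64), 0x1FFFF),
}

def nic_handle_rdma_write(r_key: int, va: int, payload: bytes) -> str:
    region = registered_memory.get(r_key)
    if region is None:
        return "NAK: remote key check failed"        # key invalid, step 3
    buf, base_va = region
    offset = va - base_va
    if offset < 0 or offset + len(payload) > len(buf):
        return "NAK: access out of registered range"
    buf[offset:offset + len(payload)] = payload      # data lands in memory
    return "ACK"                                     # write completed

print(nic_handle_rdma_write(0xABCD, 0x1FFFF + 8, b"hello"))  # -> ACK
print(nic_handle_rdma_write(0xFFFF, 0x1FFFF, b"x"))          # -> NAK
```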
(12) RDMA read operation
An RDMA read operation reads data from the memory of the server side (that is, the RDMA storage node). The basic principle of an RDMA read operation is that the client's network card obtains data from the server's memory based on the server memory address and the access permission to the server's memory (remote key), and pulls the data into the client's local buffer.
(13) RDMA two-sided operations
RDMA two-sided operations include the RDMA send operation and the RDMA receive operation. Two-sided operations allow data to bypass the CPU during transmission but require the CPUs on both the local and remote sides to participate. In other words, during an RDMA two-sided operation, neither the local nor the remote device's CPU is completely bypassed. Two-sided operations are usually used to transmit control packets. Specifically, if the local device wants to transfer data into the remote device's memory via an RDMA send operation, the remote device must first invoke an RDMA receive operation; if the remote device does not invoke an RDMA receive operation, the local device's RDMA send operation will fail. The working mode of two-sided operations is similar to traditional socket programming, and the overall performance of two-sided operations is slightly lower than that of one-sided RDMA operations. RDMA send and receive operations are usually used to transmit connection-control packets.
(14) RDMA connection
In RDMA, a logical connection is established between the applications of the two communicating parties, referred to below as an RDMA connection. An RDMA connection is equivalent to a channel for transmitting messages; the endpoints at the two ends of each RDMA connection are two queue pairs.
(15) Queue pair (QP)
A QP includes a send queue (SQ) and a receive queue (RQ); these queues manage various types of messages, as shown in FIG. 7. The network card 140 includes the SQ 302, and the network card 240 includes the RQ 403; the SQ 302 and the RQ 403 form a QP and are equivalent to the two endpoints of an RDMA connection. The QP is mapped into the application's virtual address space, allowing the application to access the network card directly through it. Besides the two basic queues described by the QP, RDMA also provides a completion queue (CQ), which notifies the application that the messages on the work queue have been processed.
The following gives an example of how devices interact with each other based on the RDMA protocol.
For example, the complete interaction flow between client A and RDMA storage node B includes the following steps (1) to (6). The RDMA protocol on which the following flow is based is specifically the RoCEv2 protocol.
The following flow is described for the non-fragmented case; in the fragmented case, the PSN update in a packet can change from adding one to the PSN to adding another value to the PSN.
Step (1): Client A establishes a connection with RDMA storage node B.
The QPs at the two ends — client A and RDMA storage node B — establish a logical session. Client A initializes the PSN in the A-to-B direction, obtaining the initial PSN-AB. RDMA storage node B initializes the PSN in the B-to-A direction, obtaining PSN-BA. PSN-AB is the PSN in the direction from client A to RDMA storage node B, and PSN-BA is the PSN in the direction from RDMA storage node B to client A.
Step (2): RDMA storage node B sends an RDMA send only packet to client A.
The RDMA send only packet is used to report the address of the memory space of RDMA storage node B. The PSN-BA1 carried in the RDMA send only packet is the current PSN in the B-to-A direction: if no interaction has occurred since initialization, PSN-BA1 is the initial PSN-BA; if interaction has occurred since initialization, PSN-BA1 is the PSN-BA1 of the current state. The address of the memory space is carried in the RETH of the packet and includes the VA, the remote key, and the direct memory access (DMA) length.
Step (3): Client A sends an RDMA read request packet to RDMA storage node B. The RDMA read request packet is used to request the data in the memory space of RDMA storage node B. In the RDMA read request packet, PSN-AB1 is the current PSN in the A-to-B direction: if no interaction has occurred since initialization, PSN-AB1 is the initial PSN-AB; if interaction has occurred since initialization, PSN-AB1 is the PSN-AB1 of the current state. The memory space address in the RDMA read request packet is an address that RDMA storage node B previously reported to client A.
Step (4): RDMA storage node B generates an RDMA read response packet with PSN-AB1 as the PSN, and sends the RDMA read response packet to client A. The payload of the RDMA read response packet is the specific data values held in the memory that were read from memory.
Step (5): Client A sends an RDMA write only packet to RDMA storage node B. The RDMA write only packet is used to write data into the memory of RDMA storage node B. If the RDMA write only packet immediately follows the previous RDMA read request packet, its PSN-AB1 is the current PSN in the A-to-B direction; otherwise, its PSN-AB1 is the current state's PSN-AB1+1. The memory space address in the RDMA write only packet is the address of a memory space that B previously reported in the B-to-A direction. The payload of the RDMA write only packet is the specific data values to be written to memory.
Step (6): RDMA storage node B generates an RDMA ACK packet with PSN-AB1+1 as the PSN, and sends the RDMA ACK packet to client A.
In some embodiments, the technical details of how the gateway device interacts with the RDMA storage node, and examples of RDMA state information, can refer to the above flow. For example, viewed with the embodiment corresponding to FIG. 9, the gateway device in that embodiment optionally corresponds to client A introduced above (or to a proxy of client A), and the first RDMA storage node optionally corresponds to RDMA storage node B introduced above. When reading data, the first RDMA request packet in S404 of the FIG. 9 embodiment is optionally the RDMA read request packet in step (3) above, the RDMA response packet in S408 of the FIG. 9 embodiment is optionally the RDMA read response packet in step (4) above, and the RDMA state information in the RDMA response packet is optionally PSN-AB1 in step (4) above. As another example, when writing data, the first RDMA request packet in S404 of the FIG. 9 embodiment is optionally the RDMA write only packet in step (5) above, the RDMA response packet in S408 is optionally the RDMA ACK packet in step (6) above, and the RDMA state information in the RDMA response packet is optionally PSN-AB1+1 in step (6) above. In addition, steps (1) and (2) above can optionally serve as preparatory steps for the FIG. 9 embodiment, providing a sufficient implementation basis: step (1) supports the gateway device establishing an RDMA connection with the RDMA storage node in advance, and step (2) provides the address of the memory space in the first RDMA storage node (the second destination address) to the gateway device in advance.
(16) State information
"State information" is a term in the field of computer network communications. State information indicates the relationship between different packets exchanged in succession by two communicating parties in a session. Usually, each packet exchanged in a session is not an isolated individual but is related to previously exchanged packets. For example, every packet in a session carries some kind of information whose value remains unchanged during the session, or whose value changes during the session according to a set rule. Such information — unchanged during the session, or with a value changing by a set rule — is state information. Packets carry state information usually for reliability or security reasons. For instance, the receiver uses the state information in packets to determine whether packet loss has occurred and triggers retransmission when it has; or the receiver judges whether the sender is trustworthy based on whether the state information in a packet is correct, and drops the packet when the sender is not trustworthy. For example, in the TCP protocol, the sequence number carried in a TCP packet is a kind of state information.
(17) RDMA state information
RDMA state information indicates the relationship between different RDMA packets in an RDMA-based session and the logical order of the RDMA packets. For example, after the two parties establish a connection based on the RDMA protocol, the responder sends multiple RDMA response packets to the requester in succession; the packets contain different RDMA state information, and the RDMA state information indicates the order of the multiple RDMA response packets.
Optionally, the RDMA state information specifically indicates the correspondence between an RDMA response packet and an RDMA request packet. For example, multiple RDMA request packets and multiple RDMA response packets are exchanged in an RDMA-based session between the two parties. Each RDMA request packet or RDMA response packet includes RDMA state information, and the RDMA state information in an RDMA response packet indicates which RDMA request packet that response packet corresponds to.
Optionally, the RDMA state information is a PSN.
The RDMA state information is information carried in an RDMA packet — for example, information carried in the RDMA header of the RDMA packet, such as information carried in an IB header or an iWARP header.
(18) NOF state information
NOF state information indicates the relationship between different NOF packets in a NOF-based session and the logical order of the NOF packets. Optionally, the NOF state information specifically indicates the correspondence between a NOF response packet and a NOF request packet. For example, multiple NOF request packets and multiple NOF response packets are exchanged in a NOF-based session between the two parties. Each NOF request packet or NOF response packet includes NOF state information, and the NOF state information in a NOF response packet indicates which NOF request packet that response packet corresponds to.
Optionally, the NOF state information includes at least one of: a PSN, an SQHD, a command ID, a DQP, a virtual address, a remote key, or a direct memory access length.
The NOF state information is information carried in a NOF packet, for example information carried in the NOF header of the NOF packet.
(19) NOF header
A NOF header is the header of a NOF packet and includes the header corresponding to the fabric and NVMe layer information. The specific format of the fabric header depends on how the fabric is implemented and may include headers corresponding to multiple protocol layers. For example, with a fabric implemented over RoCEv2, the fabric header includes a MAC header (corresponding to the link layer protocol), an IP header (corresponding to the network layer protocol), a UDP header (corresponding to the transport layer protocol), and an IB header (corresponding to the transport layer protocol). Alternatively, the fabric header corresponds to a single protocol; for example, with a fabric implemented over InfiniBand, the fabric header is an IB header.
(20) RETH
The RETH is a transport-layer header in the RDMA protocol. The RETH contains some additional fields used for RDMA operations. Optionally, the RETH includes a virtual address (VA) field, a remote key (R_Key) field, and a direct memory access length (DMA length) field. The format of the RETH is optionally as shown in Table 1 below.
Table 1
Field | Length | Description
Virtual address (VA) | 64 bits | Start address of the buffer
Remote key (R_Key) | 32 bits | Key authorizing access to the remote memory
DMA length | 32 bits | Length of the RDMA operation
(21) Packet sequence number (PSN)
The PSN is a value carried in the packet transport header. The PSN supports detection and retransmission of lost packets.
(22) Submission queue head pointer (SQHD)
The SQHD indicates the current head of the submission queue (SQ). The SQHD informs the host of the entries in the SQ that have been consumed (that is, read/write instructions already added to the SQ).
(23) Command ID
The command ID is the identifier of the command associated with an error. If the error is not associated with a specific command, the command ID field is optionally set to FFFFh.
(24) Virtual address (VA)
The VA indicates the start address of a buffer. The length of the VA is, for example, 64 bits.
(25) Remote key (R_Key)
The R_Key describes a remote device's permission to access local memory, such as the client's permission to access the RDMA storage node's memory. The R_Key is also called a memory key and is usually used together with a VA. Optionally, the R_Key also helps hardware identify the page table that translates virtual addresses into physical addresses.
(26) Direct memory access length (DMA length)
DMA length indicates the length of an RDMA operation. DMA length is a field name in RDMA-related standards; DMA length may also be called the RDMA length.
(27) Host
A host is the main body of a computer. A host usually includes a CPU, memory, and interfaces. The connection between the host and the SSD has multiple implementations: optionally, the SSD is inside the host as one of the host's components, or the SSD is outside the host and connected to the host.
(28) Storage node
A storage node is an entity that supports a data storage function. In one possible implementation, a storage node is an independent storage device. In another possible implementation, a storage node is a device integrating multiple storage devices, or a cluster or distributed system including multiple storage devices. For example, for an RDMA storage node, in one possible implementation the node is a single RDMA-capable storage server that uses its local memory to provide RDMA-based data read/write services; in another possible implementation, an RDMA storage node contains multiple RDMA-capable storage servers whose memory forms an RDMA-capable memory pool, and the storage node uses memory in the pool belonging to one or more storage servers to provide RDMA-based data read/write services.
(29) Memory
Memory is internal storage that exchanges data directly with the processor. Memory can usually read and write data at any time and very quickly, and serves as temporary data storage for the operating system or other running programs. Memory is, for example, random access memory or read-only memory (ROM). For example, the random access memory is dynamic random access memory (DRAM) or storage class memory (SCM).
DRAM is a semiconductor memory and, like most random access memory (RAM), is a volatile memory device.
SCM is a composite storage technology combining characteristics of both traditional storage devices and memory. Storage class memory provides faster read/write speeds than a hard disk but is slower than DRAM in access speed and cheaper than DRAM in cost.
DRAM and SCM are only exemplary in this embodiment; the memory optionally includes other random access memory, such as static random access memory (SRAM). The read-only memory is, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM).
In other embodiments, the memory is a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or an SSD.
Optionally, the memory is configured to have a power-protection function, which means that data stored in the memory is not lost when the system powers off and powers on again. Memory with a power-protection function is called non-volatile memory.
(30) Logical block (LB)
An LB, also called a block, is the smallest storage unit defined by NVMe. For example, an LB is a storage space of 2KB or 4KB.
(31) Logical unit number (LUN)
In a SAN, a LUN is a number used to identify a logical unit, which is a device addressed through SCSI. In other words, the storage system partitions physical hard disks into parts with logical addresses and allows hosts to access them; such a partition is called a LUN. The commonly mentioned LUN also refers to a logical disk created on SAN storage.
Some embodiments of this application involve the mutual conversion flows of the two protocol packets, NOF and RDMA. For brevity, some embodiments of this application use the form "NOF-RDMA" as shorthand for the process of converting a NOF packet into an RDMA packet, and the form "RDMA-NOF" as shorthand for the process of converting an RDMA packet into a NOF packet.
The following describes example application scenarios of the embodiments of this application.
FIG. 8 is a schematic diagram of an application scenario according to an embodiment of this application. The scenario shown in FIG. 8 includes a client 31, a gateway device 33, and an RDMA storage node 35. The devices in FIG. 8 are described below with examples.
(1) Client 31
The deployment location of the client 31 includes multiple cases. For example, the client 31 is deployed in a user network, that is, in a local area network, such as an enterprise intranet. As another example, the client 31 is deployed on the Internet, that is, in the cloud, for example in cloud networks such as public cloud, industry cloud, or private cloud. As yet another example, the client 31 is deployed in a backbone network (for example, the client is a router with data storage needs). This embodiment does not limit the deployment location of the client.
The client 31 comes in multiple possible product forms. For example, the client 31 can be a terminal, a server, a router, a switch, and so on. Terminals include, without limitation, personal computers, mobile phones, servers, notebook computers, IP phones, cameras, tablets, and wearable devices.
The client 31 plays the role of the initiator of NOF request packets. Taking the data-read flow as an example, when the client 31 needs to obtain previously saved data, the client 31 generates and sends a NOF read request packet, triggering the method embodiment shown in FIG. 9 below. Taking the data-write flow as an example, when the client 31 needs to save data, the client 31 generates and sends a NOF write request packet, triggering the method embodiment shown in FIG. 9 below.
The client 31 also plays the role of the destination of NOF response packets. Taking the data-read flow as an example, after receiving a NOF read response packet, the client 31 obtains the read data from the NOF read response packet and performs service processing based on the data. Taking the data-write flow as an example, after receiving a NOF write response packet, the client 31 obtains the NOF reply information from the NOF write response packet and confirms based on the NOF reply information that the data has been saved successfully.
(2) Gateway device 33
The gateway device 33 is an entity deployed between the client 31 and the RDMA storage node 35. The gateway device 33 is used to forward the packets exchanged between the client 31 and the RDMA storage node 35.
In some embodiments, the gateway device 33 plays both the role of a NOF proxy and the role of an RDMA proxy. From the perspective of the client 31, the gateway device 33 is equivalent to a NOF server, and the gateway device 33 interacts with the client 31 in place of a NOF server. As shown in FIG. 8, the gateway device 33 establishes a NOF connection with the client 31 based on the NOF protocol, and the gateway device 33 can receive NOF request packets sent by the client 31 over that NOF connection. From the perspective of the RDMA storage node 35, the gateway device 33 is equivalent to an RDMA client, and the gateway device 33 interacts with the RDMA storage node 35 in place of the client. As shown in FIG. 8, the gateway device 33 establishes an RDMA connection with the RDMA storage node 35 based on the RDMA protocol, and the gateway device 33 can send RDMA request packets to the RDMA storage node 35 over that RDMA connection. For details of how the gateway device 33 implements the proxy function, refer to the method embodiments below.
The gateway device 33 comes in multiple possible product forms. In some embodiments, the gateway device 33 is a network device, for example a router, a switch, or a firewall. In other embodiments, the gateway device 33 is a server, for example a storage server. In other embodiments, the gateway device 33 is implemented with programmable devices such as a field-programmable gate array (FPGA) or a coprocessor; for example, the gateway device 33 is a dedicated chip. In yet other embodiments, the gateway device 33 is a general-purpose computer device whose processor runs a program in memory to implement the functions of the gateway device 33.
Optionally, the gateway device 33 provides packet forwarding and proxy services for multiple clients. As shown in FIG. 8, the network further includes a client 32. When the client 32 initiates a NOF request packet, the gateway device processes the NOF request packet sent by the client 32 in a similar manner.
The scenario in FIG. 8 of deploying one gateway device is only an example; the number of gateway devices deployed in the system is optionally larger or smaller — for example, only one gateway device, or dozens or hundreds of gateway devices, or a larger number. This embodiment does not limit the number of gateway devices deployed in the system. When multiple gateway devices are deployed, in one possible implementation, a load balancer is deployed in front of the gateway devices; the load balancer distributes request packets from the clients to the gateway devices so that the gateway devices work in a load-balanced manner, sharing the processing pressure of a single gateway device.
(3) RDMA storage node 35
The RDMA storage node 35 provides services for reading and writing data through RDMA. The RDMA storage node 35 is also called the RDMA server. The RDMA storage node 35 has memory. In one possible implementation, the network interface of the RDMA storage node 35 is connected to the network interface of the gateway device 33. In one possible implementation, the RDMA storage node 35 stores data of the client 31.
Optionally, multiple RDMA storage nodes are deployed in the system. As shown in FIG. 8, an RDMA storage node 36 is optionally also deployed in the system; the RDMA storage node 36 has characteristics similar to those of the RDMA storage node 35.
The following describes example method flows of the embodiments of this application.
FIG. 9 is a flowchart of a message processing method according to an embodiment of this application.
The method shown in FIG. 9 involves the case where the storage system contains multiple RDMA storage nodes. To distinguish different RDMA storage nodes, "first RDMA storage node" and "second RDMA storage node" are used to describe different RDMA storage nodes.
Optionally, viewed with FIG. 1, FIG. 2, or FIG. 3, the client in the embodiment shown in FIG. 9 is the host in FIG. 1, FIG. 2, or FIG. 3, respectively.
Optionally, viewed with FIG. 6, the gateway device in the embodiment shown in FIG. 9 acts as the RDMA protocol stack proxy of the client 100 in FIG. 6, interacting with the RDMA storage node 200 in FIG. 6 in place of the client 100. The gateway device includes the network card 140 in FIG. 6 and performs, through the network card 140, the steps for which the gateway device is responsible in the embodiment shown in FIG. 9.
Optionally, viewed with FIG. 7, the gateway device in the embodiment shown in FIG. 9 includes the network card 140 in FIG. 7, and the first RDMA storage node in the embodiment shown in FIG. 9 is provided with the network card 240 in FIG. 7. The gateway device establishes an RDMA connection with the first RDMA storage node through the network card 140 and interacts over it. For example, the gateway device implements S404 in the embodiment shown in FIG. 9 by adding the first RDMA request packet to the SQ 302 in FIG. 7, and the first RDMA storage node implements S405 in the embodiment shown in FIG. 9 through the RQ 403 in FIG. 7.
Optionally, viewed with FIG. 8, the network deployment scenario on which the method shown in FIG. 9 is based is as shown in FIG. 8 above. For example, viewed with FIG. 8, the first RDMA storage node in the method shown in FIG. 9 is the RDMA storage node 35 in FIG. 8, the client in the method shown in FIG. 9 is the client 31 in FIG. 8, and the gateway device in the method shown in FIG. 9 is the gateway device 33 in FIG. 8.
The method shown in FIG. 9 includes the following steps S401 to S406.
S401: The client sends a first NOF request packet.
The first NOF request packet carries an NVMe instruction.
The NVMe instruction indicates a read/write operation on a first destination address. For the concept of NVMe instructions, refer to (3) in the terminology section above. Optionally, the NVMe instruction carried in the first NOF request packet is specifically an I/O instruction.
When reading data, optionally, the first NOF request packet is a NOF read request packet, the NVMe instruction carried in the first NOF request packet is an NVMe read instruction, and the NVMe instruction indicates a read operation on the first destination address. For the concept of NVMe read instructions, refer to (4) in the terminology section above.
When writing data, optionally, the first NOF request packet is a NOF write request packet, the NVMe instruction carried in the first NOF request packet is an NVMe write instruction, and the NVMe instruction indicates a write operation on the first destination address. For the concept of NVMe write instructions, refer to (5) in the terminology section above.
Optionally, the first destination address indicates a location in the storage space provided by the NVMe storage medium. For example, when reading data, the first destination address indicates the location of the data to be read in the NVMe storage medium; when writing data, the first destination address indicates the location in the NVMe storage medium where data is to be saved. Optionally, the first destination address is a logical address (also called a virtual address).
The data form of the first destination address has multiple possible implementations. Optionally, the form of the first destination address complies with the NVMe protocol specification; in other words, the first destination address is an NVMe address. For example, the first destination address includes a start logical block address (start LBA) and a block number.
In one possible implementation, the first destination address includes a LUN ID, a start address, and a data length. Specifically, the memory space on the first RDMA storage node is not exposed directly to the client but is virtualized into logical units (LU) for the client to use. In other words, from the client's perspective, the storage resources perceived by the client are LUNs rather than pieces of memory on the RDMA storage node. The gateway device communicates with the client based on LUN semantics. For the concepts of LU and LUN, refer to (31) in the terminology section above. The step of mapping memory space to LUNs is optionally performed by the gateway device, or by a control plane device. Optionally, the first RDMA storage node provides RDMA memory space to the LUN at page granularity; in other words, RDMA memory space is allocated as one page or an integer multiple of pages. The size of a page is, for example, 4KB or 8KB.
In another possible implementation, the first destination address and the second destination address below (the address of the memory space) are the same address. Specifically, a network element such as the gateway device or a control plane device exposes the memory space on the first RDMA storage node to the client, allowing the client to perceive the memory on the RDMA storage node. The gateway device communicates with the client based on memory semantics.
Optionally, the first NOF request packet includes the first destination address. Exemplarily, the first NOF request packet has a start LBA field and a block number field, and the contents of the start LBA field and the block number field are used to indicate the first destination address.
Optionally, the first NOF request packet contains NOF state information.
S402: The gateway device receives the first NOF request packet from the client.
Optionally, the gateway device establishes a NOF connection with the client in advance, and the gateway device receives the first NOF request packet over the NOF connection with the client. A NOF connection is a logical connection established based on the NOF protocol.
The transmission of the first NOF request packet covers multiple cases, illustrated below as Case 1 and Case 2.
Case 1: After being sent from the client, the first NOF request packet is forwarded to the gateway device through one or more forwarding devices.
Case 1 supports scenarios where one or more forwarding devices are deployed between the client and the gateway device. After the client sends the first NOF request packet, a forwarding device receives the first NOF request packet and forwards it to the gateway device.
The forwarding devices traversed by the first NOF request packet include, without limitation, layer-2 forwarding devices (such as switches) and layer-3 forwarding devices (such as routers and switches). Forwarding devices include, without limitation, wired network devices or wireless network devices.
Case 2: After being sent from the client, the first NOF request packet reaches the gateway device directly.
Case 2 supports scenarios where the client and the gateway device are physically directly connected, with the gateway device being the client's next-hop node.
S403: The gateway device obtains the information about the first RDMA storage node based on the first destination address.
The gateway device obtains the first destination address from the first NOF request packet. The gateway device obtains the information about the destination storage node based on the first destination address, thereby obtaining the information about the first RDMA storage node.
In some embodiments, the process of the gateway device obtaining the first destination address includes: the gateway device obtains the start LBA from the start LBA field of the first NOF request packet, obtains the block number from the block number field of the first NOF request packet, and obtains the block size based on the attributes of the NOF connection. The gateway device obtains the first destination address based on the start LBA, the block number, and the block size.
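As a minimal sketch of this address computation (the 4KB default block size is an assumption; in practice it comes from the NOF connection attributes):

```python
# Sketch of deriving the first destination address (a byte range) from the
# start LBA and block number in the NOF request packet.
def first_destination_address(start_lba: int, block_number: int,
                              block_size: int = 4096) -> tuple[int, int]:
    """Return (start_byte, length_in_bytes) of the NVMe logical range."""
    return start_lba * block_size, block_number * block_size

start, length = first_destination_address(start_lba=0x100, block_number=8)
print(hex(start), length)  # 0x100000 32768
```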
When reading data, the first RDMA storage node is the storage node where the data to be read resides, that is, the storage node holding the data requested by the first NOF request packet. When writing data, the first RDMA storage node is the storage node where the data is to be saved, that is, the storage node into which the data that the first NOF request packet requests to save is to be written.
The specific content of the information about the first RDMA storage node covers multiple cases. For example, the information about the first RDMA storage node is the device identifier of the first RDMA storage node; or its network address; or its memory address; or any information capable of identifying the RDMA connection of the first RDMA storage node; or its port number; or the session ID of the session between the gateway device and the first RDMA storage node; or the public key of the first RDMA storage node; or the permission information for accessing the memory of the first RDMA storage node (such as an R_Key).
In one possible implementation, the information about the first RDMA storage node includes at least one of: the second destination address, the network location information of the first RDMA storage node, the identifiers of one or more QPs in the first RDMA storage node, and the R_Key, where the R_Key indicates the permission to access the memory of the first RDMA storage node.
The second destination address points to a memory space in the first RDMA storage node. When reading data, the second destination address indicates the location of the data to be read in the memory space of the first RDMA storage node; when writing data, it indicates the location in the memory space of the first RDMA storage node where data is to be saved. Optionally, the second destination address is a logical address (also called a virtual address). The data form of the second destination address has multiple possible implementations. Optionally, the form of the second destination address complies with the RDMA protocol; in other words, the second destination address is an RDMA address. For example, the second destination address includes a VA and a DMA length. Optionally, the second destination address is other data capable of indicating a location in memory, such as a memory space ID, or a start address and length of the memory space.
The network location information of the first RDMA storage node identifies the first RDMA storage node in the network. Optionally, intermediate network devices optionally exist between the gateway device and the first RDMA storage node, and the network location information guides the routing and forwarding performed by the intermediate network devices. Specifically, after the gateway device sends the first RDMA request packet, the packet first reaches an intermediate network device; the intermediate network device obtains the network location information of the first RDMA storage node from the first RDMA request packet, looks up its local routing/forwarding entries based on the network location information, and routes and forwards the first RDMA request packet so that it is transmitted to the first RDMA storage node.
In some embodiments, the network location information includes at least one of a MAC address, an IP address, a multi-protocol label switching (MPLS) label, or a segment identifier (SID).
For example, a layer-2 network exists between the gateway device and the first RDMA storage node; the network location information is the MAC address of the first RDMA storage node, which identifies the first RDMA storage node in the layer-2 network. As another example, an IP network exists between the gateway device and the first RDMA storage node; the network location information is the IP address of the first RDMA storage node, which identifies it in the IP network. As another example, an MPLS network exists between the gateway device and the first RDMA storage node; the network location information is the MPLS label of the first RDMA storage node, which identifies it in the MPLS network. As another example, a segment routing (SR) network exists between the gateway device and the first RDMA storage node; the network location information is the SID of the first RDMA storage node, which identifies it in the SR network.
The identifier of a QP indicates a QP in the first RDMA storage node. A QP is equivalent to one logical channel between the gateway device and the first RDMA storage node. Optionally, the first RDMA storage node contains multiple QPs, and the first correspondence includes the identifier of each of the multiple QPs of the first RDMA storage node.
S404: The gateway device sends the first RDMA request packet to the first RDMA storage node.
The gateway device generates the first RDMA request packet based on the information about the first RDMA storage node and the RDMA instruction corresponding to the NVMe instruction. The gateway device sends the generated first RDMA request packet to the first RDMA storage node.
The first RDMA request packet is a request packet in the RDMA protocol. Optionally, the first RDMA request packet is an RDMA one-sided operation packet. For example, when reading data, the first NOF request packet is a NOF read request packet and the first RDMA request packet is an RDMA read request. As another example, when writing data, the first NOF request packet is a NOF write request packet containing the data to be saved, and the first RDMA request packet is an RDMA write request containing the data to be saved from the first NOF request packet.
The first RDMA request packet carries the RDMA instruction corresponding to the NVMe instruction and the information about the first RDMA storage node.
The RDMA instruction indicates a read/write operation on the second destination address using RDMA. When the NVMe instruction carried in the first NOF request packet is an NVMe read instruction, the RDMA instruction carried in the first RDMA request packet indicates an RDMA read operation on the second destination address; for the concept of RDMA read operations, refer to (12) in the terminology section above. When the NVMe instruction carried in the first NOF request packet is an NVMe write instruction, the RDMA instruction carried in the first RDMA request packet indicates an RDMA write operation on the second destination address; for the concept of RDMA write operations, refer to (11) in the terminology section above.
Optionally, the first RDMA request packet includes the second destination address. Exemplarily, the first RDMA request packet includes a RETH containing a VA field and a DMA length field, and the second destination address is carried in the VA field and the DMA length field.
In some embodiments, the NVMe instruction carried in the first NOF request packet and the RDMA instruction carried in the first RDMA request packet have different semantics: the semantics of the NVMe instruction are operations on the NVMe medium (hard disk), while the semantics of the RDMA instruction are operations on memory.
Optionally, the gateway device supports the function of converting NVMe instructions into RDMA instructions. The gateway device converts the NVMe instruction carried in the first NOF request packet into the corresponding RDMA instruction, thereby generating the first RDMA request packet. In one possible implementation, the gateway device stores a correspondence between NVMe instructions and RDMA instructions: after receiving the first NOF request packet, the gateway device obtains the NVMe instruction carried in it, queries the correspondence between NVMe instructions and RDMA instructions based on the NVMe instruction to obtain the corresponding RDMA instruction, and encapsulates the RDMA instruction into an RDMA request packet, thereby generating the first RDMA request packet. In another possible implementation, based on the differences between NVMe instructions and RDMA instructions, the gateway device converts the NVMe instruction into the RDMA instruction by modifying all or some of the parameters in the NVMe instruction.
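As a hedged sketch of the stored-correspondence variant, the following maps the NVMe I/O opcodes cited earlier (01h write, 02h read) to RDMA operation types and assembles the RETH fields; the RDMA-side names and values are illustrative assumptions:

```python
# Sketch of the instruction conversion via a stored correspondence from
# NVMe I/O opcodes to RDMA operation types.
NVME_TO_RDMA = {
    0x02: "RDMA_READ_REQUEST",   # NVMe read instruction
    0x01: "RDMA_WRITE_REQUEST",  # NVMe write instruction
}

def convert(nvme_opcode: int, second_dst: tuple[int, int], r_key: int) -> dict:
    """Build the fields of the first RDMA request packet."""
    op = NVME_TO_RDMA.get(nvme_opcode)
    if op is None:
        raise ValueError(f"unsupported NVMe opcode: {nvme_opcode:#x}")
    va, dma_length = second_dst
    # The RETH carries the VA, R_Key and DMA length, as described above.
    return {"op": op, "reth": {"va": va, "r_key": r_key, "dma_length": dma_length}}

print(convert(0x02, second_dst=(0x1FFFF, 32 * 1024), r_key=0xABCD))
```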
可选地,第一RDMA请求报文包括RDMA状态信息。
可选地,网关设备与第一RDMA存储节点预先建立RDMA连接。网关设备通过与第一RDMA存储节点之间的RDMA连接,向第一RDMA存储节点发送第一RDMA请求报文。RDMA连接是指基于RDMA协议建立的逻辑连接。
S405、第一RDMA存储节点接收第一RDMA请求报文。
S406、第一RDMA存储节点执行RDMA指令以对内存执行读/写操作。
第一RDMA存储节点从第一RDMA请求报文获得第二目的地址和RDMA指令。第一RDMA存储节点从本端的内存中找到第二目的地址对应的内存空间。第一RDMA存储节点执行RDMA指令以对内存空间执行读/写操作。
在读数据的情况下,第一RDMA存储节点执行RDMA读指令,对第二目的地址对应的内存空间执行RDMA读操作,获取第二目的地址对应的内存空间中保存的数据。在写数据的情况下,第一RDMA存储节点从第一RDMA请求报文获得待保存的数据。第一RDMA存储节点基于RDMA写指令,对第二目的地址对应的内存空间执行RDMA写操作,将数据保存至第二目的地址对应的内存空间。
以上通过S401至S406,描述在请求报文的传输过程中,客户端、网关设备以及RDMA存储节点三侧的交互流程。下面对上述流程达到的技术效果进行分析,见下述五点描述。
第一,由于网关设备将对NOF或者说NVMe存储节点发起的访问(来自客户端的第一NOF请求报文),转换为对RDMA存储节点的访问(第一RDMA请求报文),从而提升了存储性能。
从存储介质的角度来看,RDMA节点提供的存储介质是内存,而内存的性能优于NVMe硬盘。网关设备通过将NOF请求报文转换为RDMA请求报文,将NVMe指令转换为RDMA指令,相当于将硬盘操作转换为内存操作,从而发挥内存存储的性能优势,提升性能。从指令集的角度来看,内存操作的指令集比硬盘操作的指令集更简单,因此降低了存储节点执行读写指令的复杂度,进一步提升性能。
第二,由于网关设备基于NVMe的目的逻辑地址(第一目的地址)确定RDMA存储节点的信息(第一远程直接内存访问RDMA存储节点的信息),支持寻址卸载,降低存储节点的CPU压力。
本实施例中寻址是指根据目的NVMe地址查找目的存储节点的过程。所谓“卸载”(offload)通常是指将原本由CPU负责的任务转移到特定硬件上执行。相关技术中,寻址通常由NOF存储节点的CPU执行。具体地,相关技术中NOF存储节点的CPU需要根据目的NVMe地址判断目的存储节点是不是本节点,如果不是本节点,NOF存储节点需要重新构造请求报文,再将构造的请求报文转发至最终的目的存储节点。寻址、重新构造请求报文以及转发报文的过程都会占用存储节点CPU大量的处理资源,转发报文的过程还会为存储节点带来网络IO压力。
本实施例中,寻址的任务(如根据第一目的地址确定第一RDMA存储节点的步骤)由网关设备执行,相当于卸载了NOF存储节点的寻址任务,从而减轻NOF存储节点的CPU压力,并节省转发报文对存储节点占用的网络IO压力。同时,由于在网络层(网关设备)就能够确定目的节点位置,而不需要到服务层(NOF存储节点)再重新定向,优化了网络流量的转发方式。
第三,由于在系统中部署了网关设备,由网关设备与RDMA存储节点建立逻辑连接 (RDMA连接),使得网关设备接管存储节点原有的后端扩展功能,因此优化了报文转发路径,减少报文转发时延。
相关技术的NOF存储网络中,客户端访问存储节点时,NOF请求报文的转发路径在逻辑上是客户端→网络设备→NOF前端存储节点→NOF后端存储节点。可见转发路径上至少需要经过网络设备和NOF前端存储节点这两跳中间节点,报文转发路径长,时延大。其中,NOF前端存储节点用于在目的存储地址不在本节点时向NOF后端存储节点转发报文。
本实施例中,客户端访问存储节点时,请求报文的转发路径在逻辑上是客户端→网关设备→RDMA存储节点,无需经过NOF前端存储节点的转发,因此缩短报文转发路径,减少报文转发时延。
第四,由于网关设备基于客户端发起的NOF请求报文执行处理流程,而无需要求客户端发起RDMA报文,从而不要求改动客户端,从而降低业务开通难度。
从客户端的角度来看,客户端按照原有的NOF流程发起访问,即可使用RDMA存储节点提供的存储服务,而不必感知存储节点的变化,也不必要求客户端支持RDMA,因此与原有的NOF存储方案兼容,便于快速开通业务。
第五,由于在系统中部署了网关设备,由网关设备与RDMA存储节点建立逻辑连接(RDMA连接),降低了存储系统进行扩容的难度,提升存储系统的可扩展性。
相关技术中,当NOF存储系统中新增了存储节点时,通常要求客户端与新增存储节点建立连接,客户端与新增存储节点连接后才能使用新增存储节点提供的存储容量,导致对客户端的要求高,扩容难度大。
本实施例中,由于与RDMA存储节点建立RDMA连接的工作由网关设备执行,因此当存储系统中新增RDMA存储节点时,由网关设备与新增的RDMA存储节点建立连接并进行交互,即可将新增RDMA存储节点的存储容量提供给客户端使用。从客户端的角度来看,不必要求客户端感知到新增的RDMA存储节点,也不必要求客户端与新增RDMA存储节点建立连接,客户端利用之前与网关设备建立的连接即可使用新增RDMA存储节点的存储容量,显然降低了对客户端的要求,也就降低了扩容的难度,满足存储系统灵活扩容的需求,提升可扩展性。
可选地,图9所示方法在包括以上S401至S406的基础上,还包括以下S407至S412。以上S401至S406为NOF-RDMA方向的交互流程。以下S407至S412为RDMA-NOF方向的交互流程。
S407、第一RDMA存储节点生成RDMA响应报文。
S408、第一RDMA存储节点发送RDMA响应报文。
RDMA响应报文是针对第一RDMA请求报文的响应报文。RDMA响应报文表示对第一RDMA请求报文中的RDMA指令进行回应。例如,在读数据的情况下,RDMA响应报文为RDMA read pespond报文。执行RDMA指令包括执行RDMA读操作的过程,RDMA响应报文包括从第一RDMA存储节点的内存空间中读取的数据。例如,读取的数据携带在RDMA读响应报文的载荷字段中。例如,在写数据的情况下,RDMA响应报文为RDMA ACK报文。
可选地,RDMA响应报文包括RDMA状态信息。RDMA状态信息指示RDMA响应报文与第一RDMA请求报文之间的对应关系。可选地,RDMA响应报文中RDMA状态信息和第一RDMA请求报文中RDMA状态信息的取值相同。或者,RDMA响应报文中RDMA状态 信息和第一RDMA请求报文中RDMA状态信息的取值不同,且RDMA响应报文中RDMA状态信息的取值与第一RDMA请求报文中RDMA状态信息的取值满足设定规则(如差值为1)。
S409、网关设备接收来自第一RDMA存储节点的RDMA响应报文。
S410、网关设备基于RDMA响应报文生成第一NOF响应报文。
第一NOF响应报文是针对第一NOF请求报文的响应报文。第一NOF响应报文表示对第一NOF请求报文中的NVMe指令进行回应。
在读数据的情况下,第一NOF响应报文包括第一NOF请求报文所请求获取的数据。第一NOF响应报文的生成过程包括:网关设备从RDMA响应报文中获得第一RDMA存储节点的内存空间保存的数据。网关设备基于第一RDMA存储节点的内存空间保存的数据,生成第一NOF响应报文。可选地,第一NOF响应报文还包括CQE,CQE用于表示已完成NVMe读操作。
在写数据的情况下,第一NOF响应报文为NOF写响应报文。第一NOF响应报文包括CQE,CQE用于表示已完成NVMe写操作,或者说数据已经保存成功。
可选地,第一NOF响应报文包含NOF状态信息。NOF状态信息指示第一NOF响应报文与第一NOF请求报文之间的对应关系。可选地,第一NOF响应报文中NOF状态信息和第一NOF请求报文中NOF状态信息的取值相同。例如,第一NOF请求报文和第一NOF响应报文包含相同的虚拟地址、相同的远端秘钥和相同的直接内存访问长度。或者,第一NOF请求报文中NOF状态信息和第一NOF响应报文中NOF状态信息的取值不同,且第一NOF请求报文中NOF状态信息的取值与第一NOF响应报文中NOF状态信息的取值满足设定规则(如差值为1)。例如,第一NOF请求报文中PSN和第一NOF响应报文中PSN之间的差值等于设定值。
S411、网关设备向客户端发送第一NOF响应报文。
S412、客户端接收第一NOF响应报文。
以上通过S407至S412,描述了响应报文的传输过程中,客户端、网关设备以及RDMA存储节点三侧的交互流程。通过上述S407至S412,由于网关设备实现了NOF协议栈代理,代替RDMA存储节点将响应报文回复给客户端,一方面,由于客户端感知到的响应报文仍是NOF报文,从而不必要求客户端感知协议报文转换的逻辑,降低维护客户端的难度。另一方面,也不必要求RDMA存储节点支持NOF协议,减少RDMA存储节点所需支持的协议种类,
上文结合图9所示实施例,描述了客户端、网关设备和RDMA存储节点三侧交互的整体流程。下面对图9所示实施例中一些步骤可能采用的具体实现方式进行介绍。
本申请实施例中,网关设备如何根据目的NVMe地址(如第一目的地址)获取RDMA存储节点的信息(如第一RDMA存储节点的信息)包括多种实现方式,下面对一些可能实现方式举例说明。
可选地,网关设备通过查询对应关系的方式从而获得RDMA存储节点的信息,下面对这种实现方式进行介绍。
由于本申请的一些实施例涉及不同信息之间的对应关系,为了区分不同对应关系,下文用“第一对应关系”、“第二对应关系”区分描述不同信息之间的对应关系。
以确定目的存储节点时使用的对应关系称为第一对应关系为例,可选地,在图9所示方 法中,网关设备接收到第一NOF请求报文后,网关设备从第一NOF请求报文获得第一目的地址,网关设备基于第一目的地址,从第一对应关系中查询得到第一RDMA存储节点的信息,从而确定第一目的地址对应的目的存储节点为第一RDMA存储节点。之后,网关设备基于第一RDMA存储节点的信息生成第一RDMA请求报文。第一RDMA请求报文包括第一RDMA存储节点的信息。
第一对应关系是指第一目的地址以及第一RDMA存储节点的信息之间的对应关系。第一对应关系包括第一目的地址以及第一RDMA存储节点的信息。
可选地,第一对应关系是一个表中一条表项的内容。例如,第一对应关系是同一条表项中两个字段的内容组合,这两个字段中一个字段表示第一目的地址,另一个字段表示第一RDMA存储节点的信息。在一种可能的实现中,第一对应关系具体是地址转换表中一条表项的内容。地址转换表请参考下述实例1的介绍,这里先不对地址转换表详细展开描述。
网关设备如何获得上述第一对应关系包括多种实现方式。下面结合两种可能的实现方式举例说明,见下述实现方式A至实现方式B。
实现方式A、网关设备生成第一对应关系。
实现方式A属于一种网关设备负责地址编排的方案。具体地,网关设备为第一RDMA存储节点分配NVMe逻辑地址,得到第一目的地址。网关设备建立第一目的地址与第一RDMA存储节点的信息之间的对应关系,从而产生上述第一对应关系。
在生成上述第一对应关系的过程中,网关设备如何获取第一RDMA存储节点的信息包括多种实现方式。下面结合4种可能的实现方式举例说明,见下述获取方式1至获取方式4。
获取方式1、第一RDMA存储节点主动向网关设备上报本节点的信息。
第一RDMA存储节点向网关设备发送第一RDMA存储节点的信息。网关设备接收第一RDMA存储节点发送的第一RDMA存储节点的信息,从而得到第一RDMA存储节点的信息。
第一RDMA存储节点上报信息的时机包括多种情况。在一种可能的实现中,第一RDMA存储节点在与网关设备建立RDMA连接时,向网关设备发送第一RDMA存储节点的信息。在另一种可能的实现中,第一RDMA存储节点在本节点的信息发生更新时,向网关设备发送第一RDMA存储节点的信息。比如,当发生网络位置移动、IP地址重分配、数据迁移、内存碎片整理等场景下,第一RDMA存储节点的信息可能发生更新。第一RDMA存储节点可选地每当发现本节点的信息发生更新时,向网关设备发送更新后的本节点的信息。在另一种可能的实现中,第一RDMA存储节点在上电、开机或重启时向网关设备发送第一RDMA存储节点的信息。在另一种可能的实现中,第一RDMA存储节点在接收到指令时向网关设备发送第一RDMA存储节点的信息。
第一RDMA存储节点上报信息的具体实现方式包括很多种。在一种可能的实现中,第一RDMA存储节点生成并向网关设备发送RDMA报文,RDMA报文携带第一RDMA存储节点的信息。作为一种具体示例,第一RDMA存储节点生成并向网关设备发送RDMA注册报文,RDMA注册报文携带第一RDMA存储节点的信息。RDMA注册报文用于将第一RDMA存储节点的内存空间注册为用于RDMA操作的空间。可选地,RDMA注册报文为RDMA中双边操作的报文,比如说RDMA注册报文是send报文或者receive报文。在另一种可能的实现中,第一RDMA存储节点利用RDMA之外的其他设备间通信协议向网关设备上报本节点的信息。例如,第一RDMA存储节点通过私有协议报文,或者存储节点与控制面之间的通信接口,或 者路由协议报文等方式,向网关设备上报本节点的信息。
获取方式2、网关设备从第一RDMA存储节点拉取第一RDMA存储节点的信息。
例如,网关设备生成并向第一RDMA存储节点发送查询请求,查询请求用于指示获取第一RDMA存储节点的信息。第一RDMA存储节点接收查询请求,生成并向网关设备发送查询响应,查询响应包括第一RDMA存储节点的信息。网关设备接收查询响应,从查询响应获得第一RDMA存储节点的信息。
上述查询请求和查询响应对应的协议类型包括多种实现方式。例如,上述查询请求和查询响应为网络配置(network configuration,NETCONF)报文或者简单网络管理协议(simple network management protocol,SNMP)报文等。
获取方式3、网关设备从控制面或者管理面网元获取第一RDMA存储节点的信息。
网关设备生成并向该控制面或者管理面网元发送查询请求,查询请求用于指示获取第一RDMA存储节点的信息。该控制面或者管理面网元接收查询请求,生成并向网关设备发送查询响应,查询响应包括第一RDMA存储节点的信息。网关设备接收查询响应,从查询响应获得第一RDMA存储节点的信息。
该控制面或者管理面网元包括很多种实现方式。例如,从存储系统中各个存储节点中选举出的一个存储节点,由选举出的存储节点充当控制面或者管理面网元。该控制面或者管理面网元可选地是一个NOF存储节点,或者是一个RDMA存储节点。又如,部署一个独立的网元作为控制面或者管理面网元。
获取方式4、网关设备通过静态配置的手段获取第一RDMA存储节点的信息。
具体地,网络管理员通过命令行、web界面或者其他方式,将第一RDMA存储节点的信息配置到网关设备。网关设备基于网络管理员的配置操作获得第一RDMA存储节点的信息。
在生成上述第一对应关系时,网关设备如何为第一RDMA存储节点分配NVMe逻辑地址包括多种实现方式。总体上来讲,网关设备以存储系统中不同存储节点对应的NVMe逻辑地址不重复为约束条件,为第一RDMA存储节点分配NVMe逻辑地址。
在一种可能的实现中,网关设备不仅获取第一RDMA存储节点的信息,还获取其他各个RDMA存储节点的信息。网关设备基于各个RDMA存储节点的信息创建存储资源池。存储资源池的存储空间来自于各个RDMA存储节点的内存空间。然后,网关设备对存储资源池中各个内存空间进行统一编址,使得每个内存空间都有一个唯一的全局地址。所谓全局地址是指这个地址指示的内存空间在存储资源池中是唯一的,不同全局地址对应的物理内存空间不重复。第一RDMA存储节点的内存空间的全局地址即为分配给第一RDMA存储节点的NVMe逻辑地址。可选地,将NOF存储节点的硬盘空间也纳入到存储资源池中,相当于将各个RDMA存储节点提供的内存空间以及各个NOF存储节点的硬盘空间进行池化,从而统一管理。例如,网关设备不仅获取RDMA存储节点的信息,还获取各个NOF存储节点的信息。网关设备基于各个RDMA存储节点的信息以及各个NOF存储节点的信息创建存储资源池。网关设备实现地址编排的更多细节请参考下文中实例3的描述。
实现方式B、网关设备从网关设备之外的其他设备接收第一对应关系。
实现方式B属于一种由网关设备之外的其他设备负责地址编排的方案。以控制面或者管理面网元负责地址编排为例,控制面或者管理面网元为第一RDMA存储节点分配NVMe逻辑地址,得到第一目的地址。控制面或者管理面网元建立第一目的地址与第一RDMA存储节 点的信息之间的对应关系,从而产生上述第一对应关系。控制面或者管理面网元向网关设备发送第一对应关系。网关设备接收控制面或者管理面网元发送的第一对应关系。
控制面或者管理面网元在生成上述第一对应关系时如何获得第一RDMA存储节点的信息以及如何分配NVMe逻辑地址可参考实现方式A中的描述,将实现方式A描述的步骤的执行主体从网关设备替换为控制面或者管理面网元即可。
可选地,控制面或者管理面网元与网关设备协作来生成上述第一对应关系。网关设备负责向控制面或者管理面网元上报第一RDMA存储节点的信息,控制面或者管理面网元根据网关设备上报的信息生成上述第一对应关系。
以上描述了一种网关设备通过查询对应关系的方式从而确定目的存储节点的方案。下面对这种方式的技术效果进行分析。
网关设备通过查询对应关系来确定目的存储节点,降低了实现复杂度,能够在转发报文的过程快速确定目的存储节点。尤其是,由于查询对应关系的方式的处理逻辑较为简单,比较模式化,很容易卸载到专用硬件上执行,从而不必耗费主控处理器的资源。作为一种可能的实现方式,上述第一对应关系和转发表项一起保存在接口板(也称业务板)上的存储器中,查询第一对应关系的动作由接口板上的处理器执行,从而无需将NOF请求报文的内容上送到主控处理器,节省主控处理器的算力并提高转发效率。
以上描述的网关设备通过查询对应关系的方式从而确定RDMA存储节点是本申请实施例的可选方式。在另一些实施例中,网关设备采用其他实现方式从而确定RDMA存储节点,下面对一些其他实现方式举例说明。
例如,在写数据的情况下,网关设备基于服务质量(quality of service,QoS)策略确定目的存储节点。例如,如果客户端的服务级别协议(service-level agreement,SLA)要求高,网关设备将RDMA存储节点确定为目的存储节点。如果客户端的SLA要求低,网关设备将NOF存储节点确定为目的存储节点。
又如,在写数据的情况下,网关设备基于容量均衡策略确定目的存储节点。具体地,网关设备接收到第一NOF请求报文后,网关设备根据存储系统中各存储节点当前的空闲容量,从各个存储节点中选择空闲容量最大的存储节点作为目的存储节点,以保证数据尽量均匀地写到各个存储节点上。
又如,由网关设备之外的其他设备执行查询对应关系的步骤,再将查询对应关系得知的目的存储节点通知给网关设备。
又如,由客户端指定目的存储节点。比如,客户端发送的第一NOF请求报文包含第一RDMA存储节点的标识。
以上列举的各种确定目的存储节点的方式均是可选方式,本实施例对网关设备接收到第一NOF请求报文后如何确定目的存储节点不做限定。
本申请实施例中,网关设备如何生成NOF响应报文来回复客户端包括多种实现方式,下面对生成NOF响应报文时可能采用的一些实现方式举例说明。
可选地,网关设备在接收到RDMA存储节点返回的RDMA响应报文后,通过一些手段获取NOF状态信息,根据NOF状态信息生成NOF响应报文。
下面对网关设备如何获取NOF状态信息的一些可能实现方式进行介绍,见下述实现方式I和实现方式II。
实现方式I、网关设备通过查询对应关系的方式获得NOF状态信息。
以获取NOF状态信息时使用的对应关系称为第二对应关系为例,可选地,在图9所示方法S410中,网关设备基于RDMA响应报文获得RDMA状态信息;网关设备根据RDMA状态信息,从第二对应关系中查询得到NOF状态信息;网关设备基于NOF状态信息生成第一NOF响应报文。
第二对应关系是指RDMA状态信息与NOF状态信息之间的对应关系。RDMA状态信息的概念可参考上文术语解释部分(17)的介绍,NOF状态信息的概念可参考上文术语解释部分(18)的介绍。
可选地,第二对应关系是一个表中一条表项的内容。例如,第二对应关系是同一条表项中两个字段的内容组合,这两个字段中一个字段表示RDMA状态信息,另一个字段表示NOF状态信息。在一种可能的实现中,第二对应关系具体是NOF上下文表中一条表项的内容。NOF上下文表请参考下述实例1的介绍,这里先不对NOF上下文表详细展开描述。
网关设备如何获得上述第二对应关系包括多种实现方式。可选地,网关设备在将NOF请求报文转换为RDMA请求报文的过程中建立第二对应关系。例如,结合图9所示方法来看,网关设备接收到第一NOF请求报文后,网关设备基于第一NOF请求报文获得NOF状态信息;网关设备基于与RDMA存储节点之间RDMA连接的当前状态获得RDMA状态信息。网关设备建立该NOF状态信息与RDMA状态信息之间的对应关系。
下面以RDMA状态信息为RDMA PSN、NOF状态信息为NOF PSN为例,说明如何建立第二对应关系的一种可能实现方式。
例如,网关设备在执行图9所示方法时,网关设备获得第一NOF请求报文携带的NOF PSN,并获取本次待发送的RDMA请求报文(即第一RDMA请求报文)中携带的RDMA PSN,建立NOF PSN与RDMA PSN之间的对应关系。
网关设备获取RDMA PSN的基本原理是,网关设备与RDMA存储节点基于RDMA协议建立会话时,网关设备对RDMA PSN进行初始化,得到一个RDMA PSN。之后,每当网关设备要向RDMA存储节点发送一次RDMA请求报文时,网关设备先按照设定规则,对上一次在RDMA请求报文中携带的RDMA PSN进行更新,再将更新后的RDMA PSN携带在本次需要发送的RDMA请求报文中,然后发送RDMA请求报文。
其中,网关设备在交互过程中更新RDMA PSN的具体方式根据RDMA协议栈的处理逻辑确定。例如,在没有分片的情况下,更新RDMA PSN为对RDMA PSN加一,在分片的情况下,更新RDMA PSN为对RDMA PSN加上分片的数量,本实施例对更新RDMA PSN的具体方式不做限定。
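下面用一段示意性的Python代码概括上述PSN更新规则以及NOF PSN与RDMA PSN对应关系的建立过程(按IB规范PSN为24位;函数名与数据结构均为说明用途的假设):

    psn_map = {}  # 第二对应关系:RDMA PSN -> NOF PSN

    def next_rdma_psn(cur_psn, num_fragments=1):
        # 不分片时加一;分片时加上分片的数量。PSN按24位回绕
        return (cur_psn + num_fragments) % (1 << 24)

    def on_nof_request(nof_psn, rdma_conn):
        # 先按设定规则更新本连接的RDMA PSN,再记录NOF PSN与RDMA PSN的对应关系
        rdma_conn["psn"] = next_rdma_psn(rdma_conn["psn"], rdma_conn.get("fragments", 1))
        psn_map[rdma_conn["psn"]] = nof_psn
        return rdma_conn["psn"]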
以上描述的第二对应关系为RDMA PSN与NOF PSN之间的对应关系。在另一些实施例中,第二对应关系中RDMA PSN替换为其他RDMA状态信息,第二对应关系中NOF PSN替换为其他NOF状态信息,本实施例对第二对应关系中RDMA状态信息以及NOF状态信息的具体内容不做限定。
以上描述的建立NOF状态信息与RDMA状态信息之间的对应关系是可选方式。在另一些实施例中,建立NOF状态信息与其他信息之间的对应关系,网关设备通过查找NOF状态信息与其他信息之间的对应关系来获得NOF状态信息。比如,网关设备建立NOF状态信息与第一RDMA节点的信息(如设备标识)之间的对应关系。又如,网关设备维护一个会话表,在网关设备与客户端基于NOF协议进行一次会话的过程中,每当网关设备与客户端交互一次报文,网关设备都将当前的NOF状态信息保存到会话表中,网关设备根据会话表中最近一次保存的NOF状态信息,确定当前发送NOF响应报文时所需使用的NOF状态信息。
以上实现方式I描述了一种网关设备通过查询对应关系的方式从而获得NOF状态信息的方案。下面对这种方式的技术效果进行分析,见下述两点。
第一,降低了实现复杂度,容易卸载到专用硬件上执行。这里的原理请参考上文中描述通过查询对应关系来确定目的存储节点的技术效果时的介绍。
第二,无需修改原有的RDMA协议,与原有的RDMA兼容性更好。
实现方式II、网关设备先将NOF状态信息携带在RDMA请求报文中,发送给RDMA存储节点,之后网关设备从RDMA存储节点返回的RDMA响应报文中获得NOF状态信息。
例如,结合图9所示方法来看,网关设备在接收到第一NOF请求报文之后,网关设备基于第一NOF请求报文获得NOF状态信息。网关设备将NOF状态信息添加至第一RDMA请求报文中,得到包括NOF状态信息的第一RDMA请求报文。网关设备向第一RDMA存储节点发送包括NOF状态信息的第一RDMA请求报文。RDMA存储节点接收到第一RDMA请求报文之后,RDMA存储节点从第一RDMA请求报文获得NOF状态信息。RDMA存储节点将NOF状态信息添加至RDMA响应报文中,得到包括NOF状态信息的RDMA响应报文。RDMA存储节点发送包括NOF状态信息的RDMA响应报文。网关设备接收到RDMA响应报文之后,网关设备基于RDMA响应报文获得NOF状态信息;网关设备基于NOF状态信息生成第一NOF响应报文。
第一RDMA请求报文和RDMA响应报文中NOF状态信息的携带位置包括多种情况。在一种可能的实现中,NOF状态信息位于RDMA头和载荷之间。在另一种可能的实现中,NOF状态信息位于RDMA头中。可选地,在RDMA协议中扩展一种新类型的报文头或者一种新类型的TLV,使用新类型的报文头或TLV携带NOF状态信息。或者,使用RDMA协议中的一些预留字段携带NOF状态信息,本实施例对如何在RDMA报文中携带NOF状态信息不作限定。
通过上述实现方式II的方法,网关设备无需本地维护额外的表项即可获得NOF状态信息,因此节省了网关设备的存储空间,也减少网关设备查表和写表带来的资源开销。
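作为实现方式II的一个最小示意,下面的Python片段把NOF状态信息作为附加头拼接在RDMA头与载荷之间,并在响应中原样解析回来(字段布局与字段宽度均为假设,实际携带位置如上文所述可有多种选择):

    import struct

    NOF_STATE_FMT = "!IHH"   # 假设的布局:NOF PSN(4B) + SQHD(2B) + command ID(2B)

    def attach_nof_state(rdma_payload, nof_psn, sqhd, command_id):
        # 把NOF状态信息作为附加头放在载荷之前(此处只演示附加头+载荷的部分)
        return struct.pack(NOF_STATE_FMT, nof_psn, sqhd, command_id) + rdma_payload

    def extract_nof_state(piggybacked):
        # RDMA存储节点把附加头原样带回,网关从响应中恢复NOF状态信息
        nof_psn, sqhd, command_id = struct.unpack_from(NOF_STATE_FMT, piggybacked)
        return nof_psn, sqhd, command_id, piggybacked[struct.calcsize(NOF_STATE_FMT):]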
以上介绍网关设备通过获取NOF状态信息来生成NOF响应报文是本申请实施例的一种可选实现方式。在另一种可能的实现方式中,生成NOF报文头的主要处理工作放在RDMA存储节点上执行,见下述实现方式III。
实现方式III、RDMA存储节点处理得到NOF报文头的部分或全部内容,网关设备复用RDMA存储节点的处理结果来生成NOF响应报文。
可选地,网关设备在向RDMA存储节点发送RDMA请求报文之前,预先生成NOF报文头,并填充NOF报文头中部分字段的内容,将包含NOF报文头的RDMA请求报文发送至RDMA存储节点。RDMA存储节点接收到RDMA请求报文后,RDMA存储节点对NOF报文头进一步处理,比如填充NOF报文头中空白字段的内容,或者对网关设备已经填充的字段的内容进行修改。然后,RDMA存储节点将处理后的NOF报文头携带在RDMA响应报文中,返回包含NOF报文头的RDMA响应报文。网关设备接收到RDMA存储节点返回的RDMA响应报文后,根据RDMA响应报文中的NOF报文头生成NOF响应报文。
网关设备预先填充NOF报文头的哪些字段包括多种实现方式。可选地,网关设备使用NOF状态信息,填充NOF报文头中用于携带NOF状态信息的字段。可选地,网关设备还填充NOF报文头中MAC头的内容、IP头的内容或者UDP头的内容中一项或多项。网关设备在NOF报文头中预先填充的字段类型可根据业务场景设置,本实施例对网关设备具体预先填充NOF报文头的哪些字段不作限定。
结合图9所示方法来看,例如,网关设备生成包含第一NOF报文头的第一RDMA请求报文,向第一RDMA存储节点发送包含第一NOF报文头的第一RDMA请求报文。第一RDMA存储节点接收第一RDMA请求报文后,第一RDMA存储节点从第一RDMA请求报文获得第一NOF报文头,第一RDMA存储节点基于第一NOF报文头生成第二NOF报文头,生成并发送包括第二NOF报文头的RDMA响应报文。网关设备接收到RDMA响应报文后,网关设备基于RDMA响应报文中的第二NOF报文头,生成第一NOF响应报文。
可选地,NOF报文头封装在RDMA报文头的内层。网关设备生成第一NOF响应报文的具体过程包括:网关设备剥离RDMA响应报文中外层的RDMA报文头,将得到的RDMA响应报文剩余的部分作为NOF响应报文。
下面以NOF为RoCE为例对实现方式III举例说明。
例如,请参考图10,网关设备预先生成RoCE头(即第一NOF报文头),该RoCE头包括MAC头、IP头、UDP头和IB头。网关设备使用待回应给客户端的信息,填充MAC头、IP头、UDP头和IB头的内容。网关设备对RDMA头以及填充后的RoCE头进行封装,得到上述第一RDMA请求报文。其中,RoCE头封装在RDMA头的内层。第一RDMA存储节点根据RDMA请求报文中的RoCE头,生成RDMA响应报文。RDMA响应报文中RoCE头(第二NOF报文头)封装于RDMA头内层。网关设备剥离RDMA响应报文外层的RDMA头,将RDMA响应报文的剩余部分(即内层的RoCE头及载荷)作为第一NOF响应报文,将第一NOF响应报文返回给客户端。
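下面用一段示意性的Python代码概括上述封装与剥离过程(各报文头均以字节串简化表示,仅为在假设条件下的示意,并非对报文格式的限定):

    def build_rdma_request(rdma_hdr, mac, ip, udp, ib, payload):
        # RoCE头(MAC+IP+UDP+IB,即第一NOF报文头)封装在RDMA头的内层
        return rdma_hdr + mac + ip + udp + ib + payload

    def rdma_response_to_nof(rdma_response, rdma_hdr_len):
        # 剥离外层RDMA头,剩余部分(从内层RoCE头起)即为第一NOF响应报文
        return rdma_response[rdma_hdr_len:]

    nof_rsp = rdma_response_to_nof(b"RDMAHDR" + b"ROCEHDR...payload", len(b"RDMAHDR"))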
通过上述实现方式III,一方面,网关设备无需本地维护额外的表项即可获得NOF状态信息,因此节省了网关设备的存储空间,也减少网关设备查表和写表带来的资源开销。另一方面,通过将生成NOF报文头的工作转移到RDMA存储节点上执行,从而减轻网关设备的处理压力。
实现方式III中网关设备预先生成NOF报文头的步骤是可选方式。在另一些实施例中,由RDMA存储节点负责向RDMA报文中封装NOF报文头。
可选地,在写数据的情况下,网关设备支持将同一份数据写入到多个RDMA存储节点中的每一个RDMA存储节点,从而实现数据备份的功能。下面对如何实现数据备份的一些可能实现方式举例说明。
以将一份数据写入到两个RDMA存储节点的情况为例,例如,在图9所示方法中,上述第一NOF请求报文为NOF写请求报文,第一NOF请求报文携带NVMe写指令,NVMe写指令指示第一目的地址执行写操作。网关设备接收到第一NOF请求报文后,基于第一目的地址获取第一RDMA存储节点的信息和第二RDMA存储节点的信息。在这种情况下,网关设备不仅基于第一NOF请求报文生成了上述第一RDMA请求报文,还基于第一RDMA请求报文生成第二RDMA请求报文。网关设备不仅向第一RDMA存储节点发送第一RDMA请求报文,还向第二RDMA存储节点发送第二RDMA请求报文。
第二RDMA请求报文和第一RDMA请求报文具有类似的特征。第二RDMA请求报文同样包括第一NOF请求报文携带的待保存的数据。第二RDMA请求报文包括NVMe写指令对应的RDMA写指令。此外,第二RDMA请求报文还包括第二RDMA存储节点的信息。例如,第二RDMA请求报文包括第三目的地址、第二RDMA存储节点的网络位置信息和第二RDMA存储节点中一个或多个QP的标识。第三目的地址为第二RDMA存储节点中内存空间的地址。
第一RDMA存储节点针对第一RDMA请求报文的处理动作请参考图9所示实施例。第二RDMA存储节点针对第二RDMA请求报文的处理动作与第一RDMA存储节点的处理动作类似。具体地,第二RDMA存储节点会执行第二RDMA请求报文中的RDMA指令,找到内存中第三目的地址对应的位置,将第二RDMA请求报文中的数据保存至内存中第三目的地址对应的位置。
网关设备如何获取第二RDMA存储节点的信息可参考上文中获取第一RDMA存储节点的信息的介绍。以通过查询对应关系的方式为例,例如,上述第一对应关系不仅包括第一目的地址以及第一RDMA存储节点的信息,还包括第二RDMA存储节点的信息,因此网关设备查找第一对应关系后能够得到第二RDMA存储节点的信息。
以上描述的根据一个目的NVMe地址获取到两个RDMA存储节点的信息,从而将一份数据写到两个RDMA存储节点的情况是举例说明,本实施例对根据一个目的NVMe地址能确定的RDMA存储节点的数量不做限定。例如,在采用多副本机制存储数据的场景下,根据一个目的NVMe地址确定的RDMA存储节点的数量可选地等于副本的数量。又如,在采用纠删码(erasure coding,EC)机制存储数据的场景下,根据一个目的NVMe地址确定的RDMA存储节点的数量可选地等于一个条带中数据块和校验块的数量之和。
在写数据的情况下,当根据一个目的NVMe地址得到多个RDMA存储节点的信息时,网关设备如何向多个RDMA存储节点发送RDMA写请求报文包括多种实现方式,下面结合两种发送方式举例说明。
发送方式一、网关设备向多个RDMA存储节点组播RDMA写请求报文。
网关设备可能采用的组播方式包括很多种实现方式。例如,组播方式包括而不限于基于比特位的显式复制(bit indexed explicit replication,BIER)、基于互联网协议第6版(internet protocol version 6,IPv6)的BIER(BIERv6)、组播组管理协议(internet group management protocol,IGMP)、协议无关组播(protocol independent multicast,PIM)、组播源发现协议、组播边界网关协议(multiprotocol border gateway protocol,MBGP)等等,本实施例对网关设备采用的组播方式不做限定。
在采用组播方式的情况下,上述第一RDMA请求报文和第二RDMA请求报文均为组播报文。例如,第一RDMA请求报文和第二RDMA请求报文包括封装于RDMA报文头外层的组播报文头。组播报文头例如包括第一RDMA存储节点和第二RDMA存储节点所加入的组播组的标识,又如包括第一RDMA存储节点或者第二RDMA存储节点在组播域中的设备标识。组播报文头包括而不限于BIER头、BIERv6头、IGMP头、PIM头等等。
发送方式二、网关设备采用单播的方式向每个RDMA存储节点发送RDMA写请求报文。
在采用单播方式的情况下,上述第一RDMA请求报文和第二RDMA请求报文均为单播报文。
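下面给出一个示意性的Python片段,概括上述两种发送方式(字段名为假设;组播报文头以"封装于RDMA报文头外层的组播组标识"简化表示):

    def replicate_write(rdma_write, nodes, mode="unicast", mcast_group=None):
        # 发送方式一:组播,一份报文携带组播报文头(如BIER头),由网络复制到各节点
        if mode == "multicast":
            return [dict(rdma_write, mcast_hdr={"group": mcast_group})]
        # 发送方式二:单播,为每个RDMA存储节点各构造一份RDMA写请求报文
        return [dict(rdma_write, dst=n["ip"], qp=n["qp"]) for n in nodes]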
可选地,在读数据的情况下,网关设备支持将读请求发送至多个候选RDMA存储节点其中的一个RDMA存储节点,从而支持负载分担特性,允许多个RDMA节点分担读取数据带来的处理压力。下面对如何实现负载分担的一些可能实现方式举例说明。
以两个RDMA存储节点分担读请求为例,例如,在图9所示方法中,上述第一NOF请求报文为NOF读请求报文,第一NOF请求报文携带NVMe读指令,NVMe读指令指示第一目的地址执行读操作。网关设备基于第一目的地址获取了第一RDMA存储节点的信息和第二RDMA存储节点的信息。在这种情况下,网关设备基于负载分担算法,从第一RDMA存储节点和第二RDMA存储节点中选择RDMA存储节点。在网关设备选择的RDMA存储节点为第一RDMA存储节点的情况下,网关设备向第一RDMA存储节点发送第一RDMA请求报文。在网关设备选择的RDMA存储节点为第二RDMA存储节点的情况下,图9所示方法描述的第一RDMA存储节点负责的步骤替换为由第二RDMA存储节点执行。
网关设备采用的负载分担算法包括多种具体实现方式。例如,负载分担算法为一致性哈希算法。又如,负载分担算法为从目的NVMe地址对应的多个RDMA存储节点选择数据的访问频次最低的存储节点,本实施例对网关设备采用的负载分担算法的类型不做限定。
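以一致性哈希为例,下面的Python片段给出一种从多个候选RDMA存储节点中选择读节点的最小示意(哈希函数的选取与字段名均为说明用途的假设):

    import hashlib

    def pick_read_node(dst_addr, candidates):
        # 一致性哈希:把节点与目的地址都映射到哈希环,选顺时针最近的节点
        h = lambda x: int(hashlib.md5(str(x).encode()).hexdigest(), 16)
        ring = sorted(((h(n["ip"]), n) for n in candidates), key=lambda t: t[0])
        key = h(dst_addr)
        for point, node in ring:
            if point >= key:
                return node
        return ring[0][1]   # 回绕到环上的第一个节点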
上述各个实施例侧重介绍网关设备如何基于RDMA协议与RDMA存储节点交互的流程。在一些实施例中,网关设备还支持基于NOF协议与NOF存储节点交互。下面对网关设备与NOF存储节点交互的流程举例说明。例如,客户端请求访问的部分数据没有存储在RDMA存储节点中,而是存储在NOF存储节点中,则网关设备通过图11对应的方法,从而获得NOF存储节点上存储的、客户端请求访问的数据。又如,系统中的RDMA存储节点当前的存储容量不足,可能无法满足客户端存储数据的需求,则网关设备通过图11对应的方法,从而利用NOF存储节点的存储空间保存客户端的数据。
图11是本申请实施例提供的一种报文处理方法的流程图。图11所示方法包括以下步骤S501至步骤S512。
S501、客户端发送第一NOF请求报文。
S502、网关设备接收来自客户端的第一NOF请求报文。
S503、网关设备基于第一目的地址获取NOF存储节点的信息。
网关设备从第一NOF请求报文获取第一目的地址。网关设备基于第一目的地址获取目的存储节点的信息,得到了NOF存储节点的信息。
网关设备如何得到NOF存储节点的信息的实现方式与图9所示实施例中得到第一RDMA存储节点的信息的实现方式同理。以通过查询对应关系的方式从而获取存储节点的信息为例,例如,将上述第一对应关系从第一目的地址与第一RDMA存储节点的信息之间的对应关系替换为第一目的地址与NOF存储节点的信息之间的对应关系,因此网关设备查找对应关系后能够得到NOF存储节点的信息。
在一些实施例中,网关设备对第一NOF请求报文进行修改,得到第二NOF请求报文。例如,上述第一NOF请求报文包括第一NOF状态信息,网关设备将第一NOF状态信息修改为第二NOF状态信息,以得到包括第二NOF状态信息的第二NOF请求报文。
第一NOF状态信息为客户端与网关设备基于NOF协议交互的状态信息。第二NOF状态信息为网关设备与NOF存储节点基于NOF协议交互的状态信息。
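下面用一段示意性的Python片段概括这种状态信息替换(字段名为便于说明而假设的):

    def proxy_rewrite(nof_request, gw_state):
        # 将客户端<->网关连接上的第一NOF状态信息,替换为
        # 网关<->NOF存储节点连接上的第二NOF状态信息,得到第二NOF请求报文
        forwarded = dict(nof_request)
        forwarded["psn"] = gw_state["psn"]
        forwarded["dqp"] = gw_state["dqp"]
        return forwarded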
S504、网关设备向NOF存储节点发送第二NOF请求报文。
第二NOF请求报文包括NVMe指令、第一目的地址以及NOF存储节点的信息。
S505、NOF存储节点接收第二NOF请求报文。
S506、NOF存储节点执行NVMe指令以对硬盘执行读/写操作。
S507、NOF存储节点生成第二NOF响应报文。
第二NOF响应报文为针对第二NOF请求报文的响应报文。
S508、NOF存储节点发送第二NOF响应报文。
S509、网关设备接收来自NOF存储节点的第二NOF响应报文。
S510、网关设备基于第二NOF响应报文生成第三NOF响应报文。
在一些实施例中,网关设备对第二NOF响应报文进行修改,得到第三NOF响应报文。例如,上述第二NOF响应报文包括第三NOF状态信息,网关设备将第三NOF状态信息修改为第四NOF状态信息,以得到包括第四NOF状态信息的第三NOF响应报文。
第三NOF状态信息为网关设备与NOF存储节点基于NOF协议交互的状态信息。第四NOF状态信息为客户端与网关设备基于NOF协议交互的状态信息。
S511、网关设备向客户端发送第三NOF响应报文。
第三NOF响应报文为针对第一NOF请求报文的响应报文。
S512、客户端接收第三NOF响应报文。
网关设备通过执行本实施例提供的方法,支持原有的NOF交互流程,从而与原有的NOF存储方案保持兼容,而无需对现网设备进行大量替换。
图9对应的实施例与图11对应的实施例中的至少部分内容可以相互结合。
例如,结合两个实施例的另一种可能实现方式为,网关设备通过判断,选择性执行图9对应的实施例与图11对应的实施例其中一者。在一种可能的实现中,网关设备上的对应关系包括节点类型标识,该节点类型标识用于标识存储节点是RDMA存储节点还是NOF存储节点。网关设备接收到NOF请求后,判断对应关系中目的NVMe地址对应的节点类型标识表示RDMA存储节点还是NOF存储节点,如果节点类型标识表示RDMA存储节点,则进入图9对应的实施例。如果节点类型标识表示NOF存储节点,则进入图11对应的实施例。在另一种可能的实现中,网关设备根据目的地址获得RDMA存储节点的信息和NOF存储节点的信息后,网关设备根据设定的策略(如负载分担、容量分担、QoS策略等),从RDMA存储节点和NOF存储节点中选择一种存储节点作为NOF请求报文的响应方。如果网关设备选择RDMA存储节点,则进入图9对应的实施例。如果网关设备选择NOF存储节点,则进入图11对应的实施例。
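以节点类型标识的实现为例,下面的Python片段示意网关设备如何据此选择进入图9流程或图11流程(表项结构与字段名均为假设):

    def lookup(mapping, addr):
        # 查询目的NVMe地址命中的表项,表项中带有节点类型标识
        for start, end, entry in mapping:
            if start <= addr <= end:
                return entry
        return None

    def dispatch(nof_request, mapping):
        entry = lookup(mapping, nof_request["dst_addr"])
        if entry is None:
            raise KeyError("目的NVMe地址未命中任何表项")
        if entry["node_type"] == "rdma":
            return "图9流程"      # NOF->RDMA报文转换后发往RDMA存储节点
        return "图11流程"         # NOF代理处理后发往NOF存储节点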
再例如,结合两个实施例的另一种可能实现方式为,图9对应的实施例与图11对应的实施例均执行。也即是,网关设备接收到客户端的NOF请求报文后,不仅与RDMA存储节点交互,也与NOF存储节点交互。作为示例,RDMA存储节点以及NOF存储节点中一类节点充当主节点,另一类节点充当备节点。网关设备接收到客户端的NOF请求报文后,向RDMA存储节点发送RDMA请求,向NOF存储节点发送NOF请求,从而将数据分别保存到RDMA存储节点的内存中以及NOF存储节点的硬盘中。
下面结合一些具体应用场景对技术方案举例说明。
下述实例应用于IP-SAN存储区域网络(storage area network,SAN)存储服务中。
SAN是指通过网络将存储介质与计算机(如服务器)相连的架构。SAN支持将原本只能承载在单服务器的存储介质通过网络扩展到多服务器,大大提高了存储容量和可扩展性。SAN分为光纤通道存储区域网络(fibre channel-storage area network,FC-SAN)和IP-SAN两种类型。FC-SAN和IP-SAN的主要区别是,FC-SAN中连接存储介质与服务器的网络为FC网络,换句话说,数据通过FC网络在存储介质与服务器之间传输。IP-SAN中连接存储介质与计算机的网络为IP网络,换句话说,数据通过IP网络在存储介质与服务器之间传输。
在IP-SAN的各种实现方式中,使用基于NVMe指令的NOF协议组建IP-SAN存储网络的效果较好,所以下述实例以在NOF的基础上进行改进为例进行描述。其中,使用NOF协议组建IP-SAN存储网络的效果较好的基本原理是,对于NVMe来说,在硬件形态上,NVMe子系统(NOF存储节点)直接通过PCIe总线和主机连接,路径中不再需要主机总线适配器(host bus adapter,HBA)卡,降低了系统开销。在主机侧,NVMe子系统减少了IO调度层,具有单独的命令层,IO路径更短,为低延迟提供了保障。并且NVMe可以支持多达64K个命令队列,每个命令队列支持多达64K个命令。综上,NVMe具有更高的性能和效率。而NOF作为对NVMe的扩展,NOF继承了NVMe的优势,因此使用NOF协议组建IP-SAN存储网络的效果较好。
在另一些实施例中,将下述实例的方案应用在基于NVMe之外其他类型的存储协议指令以及基于这种指令的存储系统中,在这种场景下,对下述实例中涉及存储协议的部分进行修改,具体的实现方式与下述实例类似。
下述实例实现了一种网关设备,该网关设备能够替换传统的网络转发设备。该网关设备在实现二三层转发的基础上,支持以下四种功能。
(1)实现RDMA协议栈。
本实例提供的网关设备由于支持RDMA协议栈,能够与RDMA存储节点建立连接并基于RDMA协议进行交互。
(2)实现NOF协议栈。
本实例提供的网关设备由于实现NOF协议栈,能够代理NOF存储节点与客户端进行交互。
(3)实现NOF-RDMA这两种存储协议的报文互相转换。
通过部署本实施例提供的网关设备,支持在NOF存储网络中扩展RDMA存储节点。
(4)实现NOF-RDMA这两种存储协议对应的逻辑地址相互转换。
网关设备上保存了地址转换表,当解析出NOF操作的目的NVMe地址后,能够通过地址转换表将目的NVMe地址转换为RDMA的地址。并且,网关设备通过将NVMe指令转换为RDMA指令,从而将传统的NOF对目的NVMe硬盘的操作转换为定向到RDMA节点内存的操作。
下述实例能够提升传统NOF存储方案的性能和扩容灵活性。
对比传统NOF方案,对于客户端来说,下述实例的方案兼容原有方案,客户端不需要进行改进,也不需要感知到存储节点的变化。客户端既能够使用NOF存储节点提供的存储服务,也能够使用性能更好的RDMA存储节点提供的存储服务;对于存储节点来说,网关设备卸载了存储节点的地址管理任务,接管了存储服务器的后端扩展功能。网关设备能够处理NOF请求,根据NOF请求中的目的NVMe地址定向到目的存储节点,不需要存储节点扩展后端NOF,从而减轻存储节点CPU压力和网络I/O压力。
下面对下述实例应用的系统架构举例说明。
下述实例实现一种网关设备。RDMA存储节点无需与客户端建立逻辑连接,而是与该网关设备建立逻辑连接。
网关设备相当于存储资源的总入口。网关设备同时管理NOF存储节点和RDMA存储节点的存储空间。并且,网关设备能够将客户端的NOF请求中的目的地址映射到RDMA节点的内存空间的地址,使原本全路径NOF存储服务同时支持NOF存储服务和性能更好的RDMA存储服务。
图12是本申请实施例提供的一种部署网关设备后存储系统的架构示意图。图12以基于RDMA访问的内存为DRAM缓存为例描述。图12用不同的线型区分标识NOF相关的特征和RDMA相关的特征。
图12所示的存储系统中包括客户端、网关设备、NOF存储节点、RDMA存储节点A、RDMA存储节点B和RDMA存储节点C。NOF存储节点中包含NVMe存储介质。RDMA存储节点A、RDMA存储节点B和RDMA存储节点C中的每个存储节点包含DRAM缓存。
如图12所示,网关设备部署在客户端与各个存储节点之间。网关设备与客户端基于NOF协议建立NOF连接。并且,网关设备与RDMA存储节点A、RDMA存储节点B和RDMA存储节点C基于RDMA协议建立RDMA连接。并且,网关设备与NOF存储节点基于NOF协议建立NOF连接。
如图12所示,当客户端通过NOF连接发送NOF请求报文后,NOF请求报文到达网关设备。网关设备接收到NOF请求报文后,判断本地保存的对应关系中,与NOF请求报文中NVMe指令中目的地址对应的存储节点是NOF存储节点还是RDMA存储节点。如果目的地址对应的存储节点是RDMA存储节点,则网关设备将NOF请求报文转换为包含RDMA指令的RDMA请求报文,向目的地址对应的RDMA存储节点发送RDMA请求报文,以使RDMA存储节点采用RDMA的方式对DRAM缓存进行读写操作。例如,如果目的地址对应的存储节点是图12中RDMA存储节点A,则网关设备向RDMA存储节点A发送RDMA请求报文。如果目的地址对应的存储节点是NOF存储节点,则网关设备无需执行协议报文转换的步骤,向NOF存储节点发送NOF请求报文,以使NOF存储节点对NVMe存储介质进行读写操作。
RDMA存储节点中的内存介质为DRAM缓存是一种可选方式。在另一些实施例中,RDMA存储节点采用其他类型的内存介质,例如SCM、SRAM、DIMM或者内存型硬盘等,本实施例对RDMA存储节点中的内存介质的类型不作限定。
图12是一种简化示意图,处理器等其他硬件元件在图12中省略未示出,设备硬件结构将在其他实施例具体描述。
通过本实施例提供的网关设备以及方法流程,当需要扩容时,可选地使用原有的扩容方案,或者增加RDMA存储节点。新增的RDMA存储节点与网关设备建立连接。网关设备的地址映射表中添加该新增的RDMA存储节点的地址对应关系。由于使用基于RDMA访问的内存空间提供扩展的存储容量,因此性能更好。并且,这种扩容方式兼具了NOF水平扩容和垂直扩容的优点。
可选地,如果网关设备自身的缓存(cache)存储空间足够,网关设备充当一个存储节点为客户端提供存储服务。在这种场景下,NOF请求报文在网关设备终结,网关设备操作自身cache从而执行数据读写操作,并且网关设备构造NOF响应报文,从而与客户端交互。图13是一种网关设备充当存储节点的场景示意图。如图13所示,网关设备本地执行NOF请求报文中的NVMe指令,对网关设备自身的cache进行数据读写操作,而无需将请求报文转发至存储节点。
本申请提供的一些实施例实现了一种网关设备。该网关设备能支持传统以太网的二三层转发,在此基础上实现了以下功能。
(1)处理NOF协议栈和RDMA协议栈。
网关设备能够处理RDMA协议栈,实现网关设备与RDMA存储节点的连接与交互。网关设备能够处理NOF协议栈,解析出NOF协议栈的信息,维护NOF协议栈的状态信息,实现将NOF报文回复到客户端的代理功能。
(2)NOF和RDMA的协议逻辑互相转换机制。
网关设备在实现NOF协议栈和RDMA的协议栈基础上,根据NOF报文和RDMA报文当前的交互信息和之前交互已知的状态信息,实现NOF-RDMA的报文转换和RDMA-NOF的报文转换,具体表现为NOF请求报文转换为RDMA请求报文,RDMA响应报文转换为NOF响应报文。
(3)NOF-RDMA方向的地址转换表。
本实施例提供了NOF-RDMA方向的地址转换表。地址转换表部署在网关设备上。地址转换表实现NOF中NVMe目的逻辑地址到RDMA目的逻辑地址的映射。在将NOF请求报文转换为RDMA请求报文的流程中,网关设备解析出NOF报文中NVMe指令中目的地址,通过查找该地址转换表找到对应的RDMA节点的内存地址以及其他信息,根据查表结果构造RDMA报文。
以上功能在网关设备中逻辑组合如图14所示。
图14中的服务器1和服务器2均是对RDMA存储节点的举例说明。服务器1和服务器2均配置有RDMA网卡。RDMA存储节点服务器1注册了长度为8K*100的内存空间用于RDMA读写操作,RDMA存储节点服务器1的内存空间对应的逻辑存储地址为LUN0。RDMA存储节点服务器2注册了长度为8K*100的内存空间用于RDMA读写操作,RDMA存储节点服务器2的内存空间对应的逻辑存储地址为LUN1。图14中磁盘阵列是对NOF存储节点的举例说明。
图14示出的网关设备包括NOF监听(snooping)模块、RDMA适配器(adapter)和多个端口。
NOF监听模块用于识别NOF报文。NOF监听模块接收到NOF报文之后,如果NOF监听模块识别出NOF报文的目的存储节点是NOF磁盘阵列,则NOF监听模块将NOF报文转发到NOF磁盘阵列;如果NOF监听模块识别出NOF报文的目的存储节点是RDMA存储节点,则将NOF报文发送给RDMA适配器。RDMA适配器将NOF报文转换为RDMA报文,将RDMA报文发送到RDMA节点。RDMA适配器另外还处理RDMA节点发送的RDMA响应报文,将RDMA响应报文转换为NOF响应报文,将NOF响应报文发送到客户端。
如图14所示,服务器1提供的RDMA内存空间包括100个大小为8KB的页。服务器2提供的RDMA内存空间包括100个大小为8KB的页。
网关设备将服务器1提供的RDMA内存空间虚拟化为LUN0,将服务器2提供的RDMA内存空间虚拟化为LUN1。LUN0和LUN1作为可使用的存储空间呈现给客户端。客户端发送NOF请求报文后,网关设备中RDMA适配器解析NOF请求报文中的目的NVMe地址。如果目的NVMe地址中LUN ID是LUN0,那么RDMA适配器将NOF请求报文转换为待发送给服务器1的RDMA请求报文。如果目的NVMe地址中LUN ID是LUN1,那么RDMA适配器将NOF请求报文转换为待发送给服务器2的RDMA请求报文。
NOF监听模块对应于下述实例中报文解析模块中部分逻辑、地址转换表和NOF代理发包模块。RDMA适配器是用于进行NOF-RDMA的逻辑转换的模块。RDMA适配器对应下述实例中用于NOF-RDMA逻辑互相转换的模块,比如实例1中的NOF上下文表和实例2中报文附加信息处理。
图14选择一种专用网关设备来实现本实施例中网关设备,以提升报文转发性能。本实施例并不限定使用专用网关设备实现以上功能。在另一些实施例中,使用服务器、传统网络设备、FPGA设备等作为网关设备实现上述功能。
下面通过一些实例对网关设备执行的方法流程进行说明。
下述实例中预连接的过程和配置的过程使用RDMA的双边操作。预连接主要是指建立节点之间连接的过程。配置的过程主要是指存储节点上报内存空间的地址以及本节点的信息的过程。下述实例中数据实际存取过程中使用RDMA的单边操作。
下述实例中,网关设备对单边操作的读操作或写操作进行特殊处理以提升性能。网关设备的双边操作可选地不做特殊处理。网关设备按照规范正常解析即可。在RDMA存储节点通过双边操作上报内存空间的地址时,网关设备解析得到内存空间的地址后,可选地,网关设备将内存空间的地址通知给NOF存储节点。NOF存储节点对各个RDMA存储节点的内存空间的地址以及各个NVMe存储节点的硬盘空间的地址进行统一地址编排,得到NVMe逻辑地址,再将NVMe逻辑地址配置到网关设备。或者,由网关设备进行统一地址编排,则网关设备无需向NOF存储节点上报内存空间地址,网关设备直接管控所有的内存空间的地址以及所有硬盘空间的地址。
图15是本申请的一个实施例的流程图。图15主要体现了网关设备实现NOF协议代理的流程和NOF-RDMA的协议报文转换流程。图15所示的流程包括以下S61至S63。
S61、预连接和配置阶段。
S61具体包括S611至S614。
S611、客户端和NOF存储节点建立NOF连接。
S612、网关设备和RDMA存储节点建立RDMA连接。
S613、RDMA存储节点向NOF存储节点上报本节点的信息和内存空间的地址。
S614、NOF存储节点接收RDMA存储节点发送的节点信息和内存空间的地址。NOF存储节点进行统一地址编排,将地址转换表下发到网关设备。
上述流程中由NOF存储节点进行地址编排是一种可选的实现方式,在另一些实施例中,由网关设备进行地址编排。
上述流程是初始化过程。如果在存储系统运行过程中需要增加RDMA存储节点,通过重复执行S612、S613和S614即可将新增的RDMA存储节点加入整个存储系统。
S62、NOF协议代理流程。
如图15中S621所示,客户端发送NOF请求报文。NOF请求报文为NOF读请求报文或者NOF写请求报文。网关设备从客户端接收NOF请求报文。网关设备解析NOF请求报文,得到报文中的NVMe指令中目的存储地址。网关设备根据目的存储地址查找网关设备中的地址转换表,得到目的存储节点的信息。如果目的存储地址位于NOF存储节点,则进入以下S622至S623。
S622、网关设备对NOF请求报文进行简单的代理处理,将处理后的NOF请求报文发送到NOF存储节点。
S623、NOF存储节点接收NOF请求报文。NOF存储节点针对NOF请求报文发送对应的NOF响应报文。
网关设备接收NOF响应报文。网关设备对NOF响应报文进行简单的代理处理,将处理后的NOF响应报文发送到客户端。
其中,在NOF请求报文为NOF读请求报文的情况下,NOF响应报文为NOF读响应报文。在NOF请求报文为NOF写请求报文的情况下,NOF响应报文为NOF写响应报文。
S63、NOF-RDMA报文转换的流程。
如果网关设备查找网关设备中的地址转换表,得到目的存储地址位于RDMA存储节点,则网关设备执行以下S631至S633。
S631、网关设备根据NOF-RDMA的转换逻辑以及目的RDMA节点的信息,封装RDMA的单边操作请求报文。网关设备将RDMA的单边操作请求报文发送到RDMA存储节点。
其中,在客户端发送的NOF请求报文为NOF读请求报文的情况下,网关设备发送的RDMA请求报文为RDMA读请求报文。在客户端发送的NOF请求报文为NOF写请求报文的情况下,网关设备发送的RDMA请求报文为RDMA写请求报文。
S632、RDMA存储节点从网关设备接收RDMA单边操作请求报文。RDMA存储节点基于RDMA单边操作请求报文执行RDMA指令,并生成和发送RDMA单边操作响应报文。网关设备得到RDMA存储节点的RDMA单边操作响应报文后,网关设备根据RDMA-NOF的转换逻辑,将RDMA单边操作响应报文转换为NOF响应报文。
其中,在RDMA存储节点发送的RDMA单边操作响应报文为RDMA读响应报文的情况下,网关设备转换的NOF响应报文为NOF读响应报文。在RDMA存储节点发送的RDMA单边操作响应报文为RDMA写响应报文的情况下,网关设备转换的NOF响应报文为NOF写响应报文。
S633、网关设备向客户端发送NOF响应报文。
实例1
图16是网关设备内部逻辑功能架构的示意图。实例1是网关设备内部的一种实现方式。网关设备实现NOF-RDMA的协议报文转化功能。当客户端的NOF请求报文到达网关设备,网关设备解析出NOF请求报文携带的NVMe指令。网关设备根据NVMe指令中的目的存储地址以及地址转换表,确定目的存储节点。目的存储节点存在以下两种情况。
情况(1)目的存储节点是NOF存储节点。
在目的存储节点是NOF存储节点的情况下,网关设备通过执行简单的NOF协议代理操作,保持原有的NOF交互流程。
情况(2)目的存储节点是RDMA存储节点。
在目的存储节点是RDMA存储节点的情况下,网关设备将NVMe指令转换为RDMA指令。在指令转换的同时,网关设备保存NOF的状态信息(本申请实施例称NOF的状态信息为NOF上下文)到NOF上下文表中。然后,网关设备根据转化后的RDMA指令封装对应的RDMA请求。网关设备将RDMA请求发送到对应的RDMA存储节点。
在RDMA存储节点回应RDMA响应报文后,网关设备实现RDMA-NOF的转换,并且网关设备根据NOF上下文表中的内容还原NOF状态信息。网关设备使用NOF状态信息封装NOF响应报文,将NOF响应报文发送至客户端。
如图16所示,网关设备中的模块主要包括报文解析模块、地址转换表、NOF代理发包模块、NOF与RDMA报文转换模块、NOF上下文表、RDMA代理发包模块。
图16中存在同名的模块,如报文解析模块-1、报文解析模块-2。同名模块的处理逻辑相同或相近。为了使全流程更简洁,同名的模块以后缀编号的方式分散在流程不同位置,下文中介绍这些模块时不做特殊区分。
报文解析模块
报文解析模块用于解析报文,从NOF报文和RDMA报文中提取出协议类型和报文的内容。报文解析模块的功能具体包括以下(1)至(5)。
(1)报文解析分类
报文解析模块解析报文中的传输层信息。报文解析模块根据报文中传输层信息中的端口号,判断报文是否是NOF报文或者RDMA报文。如果报文是NOF报文或者RDMA报文,则报文解析模块将报文发送至后续对应的协议栈(即NOF协议栈或者RDMA协议栈),以便后续协议栈继续解析报文。如果报文不是NOF报文且不是RDMA报文,则报文解析模块不做特殊处理,按照原有的转发逻辑直接转发报文即可。
按照协议规范,NOF报文和RDMA报文均包括UDP头。UDP头中的目的端口号为4791。协议栈中UDP层的上层是IB层。根据IB层中规定的操作码(Operation Code,OPcode)和IB层再上一层的操作码能够确定报文是RDMA报文还是NOF报文。可选地,NOF报文和RDMA报文通过不同的入端口进入至网关设备,网关设备根据报文的入端口和报文中的端口号判断报文是RDMA报文或者NOF报文。
(2)NOF协议栈解析
报文解析模块解析NOF报文,解析的NOF报文包含从客户端到存储节点方向的请求报文和从存储节点到主机方向的响应报文。报文解析模块解析出NOF报文中的fabric信息和NVMe指令。可选地,fabric信息是RoCEv2信息,例如fabric信息包括MAC层信息、IP层信息、UDP层信息和IB层信息。
(3)RDMA协议栈解析
报文解析模块解析RDMA报文,解析的RDMA报文主要是从存储节点到客户端方向的响应报文。
报文解析模块解析出RDMA报文中的RDMA字段相关信息。
(4)提取信息
报文解析模块提取出(2)(3)中协议解析后的字段携带的信息,缓存这些信息给后续模块使用。
(5)输出
报文解析模块对报文解析完成后,报文解析模块将NOF报文或RDMA报文输出至后续对应的处理模块。NOF报文和RDMA报文之外的其他报文不做特殊处理,按正常逻辑处理转发。
地址转换表
地址转换表用于指示目的NVMe地址与目的存储节点的信息之间的对应关系。地址转换表记录了NOF协议中NVMe指令中的目的存储地址对应的实际节点信息。下面对地址转换表具体说明,详见以下(1)至(5)。
(1)地址转换表的格式
地址转换表中目的NVMe逻辑地址为索引,目的存储节点信息为值。
在协议中目的NVMe逻辑地址包括start LBA字段的内容、block number字段的内容以及连接本身属性中包含的block size。
目的存储节点的信息包含存储节点的网络位置信息(如二三层信息等)、DQP(根据DQP确定此RDMA存储节点或NOF存储节点的一条逻辑连接)。其中,二三层信息用于确定物理通道,即找到一个具体的设备(即存储节点)。二层信息例如为MAC地址,三层信息例如为IP地址。
如果目的存储节点是RDMA存储节点,地址转换表还包括RDMA存储节点对应的RETH信息(即RDMA存储节点上报注册的某段内存地址)。
(2)地址转换表的功能
在NOF报文解析完成后,网关设备根据NVMe指令中的目的NVMe地址查询地址转换表,得到地址转换表中目的NVMe地址对应的目的存储节点信息。
网关设备根据目的存储节点信息能够确定目的存储节点是NOF节点还是RDMA节点,从而进入后续不同的处理逻辑。网关设备根据目的存储节点信息还能够确定和目的存储节点的逻辑连接以及目的存储节点中存储空间的逻辑地址。其中NOF节点中硬盘空间的逻辑地址不需要映射,RDMA节点的内存空间的逻辑地址在地址转换表中映射为RETH。
可选地,地址转换表中的每条表项还包括一个标志位,该标志位用于标识目的存储节点是NOF节点还是RDMA节点。网关设备根据目的NVMe地址对应的标志位的取值,确定目的存储节点是NOF节点还是RDMA节点。
(3)地址转换表支持多路RDMA
两节点的RDMA连接可选地通过不同的QP区分。每一路RDMA管理自己的资源。地址转换表保存了目的存储地址与每一个RDMA存储节点的QP映射信息,从而支持RDMA的多路访问。
RDMA多路访问是指支持通过多个逻辑通道访问一个RDMA节点。一个RDMA节点具有多个QP对,每个QP对是一个逻辑通道。地址转换表中同一个RDMA存储节点的不同QP对应于不同的表项,因此通过地址转换表能区分同一个RDMA存储节点上不同QP,从而支持通过不同的逻辑通道访问RDMA节点。由于多个通道具有比单个通道更高的性能和可用性,因此通过地址转换表支持多路RDMA,能够提升性能和可用性。
(4)地址转换表支持负载分担和热备份
地址转换表能够将某个目的逻辑地址映射到多个RDMA存储节点。
其中,在NOF请求为写请求的情况下,网关设备根据地址转换表查找到多个RDMA存储节点后,网关设备向查找到的每个RDMA节点发送一个RDMA写请求,以便将数据同步写入至多个RDMA存储节点。可选地,网关设备利用组播机制发送RDMA写请求,也即是将RDMA写请求组播到多个RDMA存储节点。
其中,在NOF请求为读请求的情况下,网关设备根据地址转换表查找到多个RDMA存储节点后,网关设备应用一致性哈希算法或者其他负载分担算法,从查找到的多个RDMA存储节点中选择一个RDMA存储节点,网关设备向选择的RDMA存储节点发送RDMA读请求,从而提升系统的性能和稳定性。具体应用的负载分担算法根据业务和设备能力确定。
(5)地址转换表的输出结果
网关设备根据地址转换表查询得到的目的存储节点的信息,确定目的存储节点是NOF存储节点还是RDMA存储节点。
若目的存储节点为NOF存储节点,网关设备获取NOF节点的网络位置信息、逻辑连接信息,后续通过NOF代理模块处理。
若目的存储节点为RDMA存储节点,网关设备获取RDMA节点的网络位置信息、逻辑连接信息、目的内存地址,后续通过NOF-RDMA报文转换模块处理。
例如,地址转换表如下表2所示。
表2
目的NVMe地址(Start LBA/block size/block number)    目的存储节点信息
0x0000/512/32      RDMA服务器1:IP地址、QP1、RETH1
0x4000/512/32      RDMA服务器1:IP地址、QP1、RETH2
0x8000/512/64      RDMA服务器1:IP地址、QP2、RETH3
0x10000/512/128    RDMA服务器2:MAC地址、QP10、RETH4;RDMA服务器3:MPLS标签、QP20、RETH5
0x20000/512/128    NOF服务器1的信息
在上述表2所示的地址转换表中,目的NVMe地址为查表时使用的索引(index)或者说键(key),目的存储节点信息为查表得到的查询结果或者说键对应的值(value)。表2中用“QP+数字”的形式来简化表示QP的标识。表2中用“RETH+数字”的形式来简化表示一个RETH的具体内容,即服务器中一段内存空间的地址。
上述表2示出的目的NVMe地址包括Start LBA、block size和block number这三个属性。
当目的NVMe地址中Start LBA为0x0000、block size为512、block number为32时,该目的NVMe地址代表的逻辑地址范围是0x0000---0x3FFF,根据该目的NVMe地址查询到的目的存储节点的信息为RDMA服务器1的信息,例如包括RDMA服务器1的IP地址、QP1和RETH1。
当目的NVMe地址中Start LBA为0x4000、block size为512、block number为32时,该目的NVMe地址代表的逻辑地址范围是0x4000---0x7FFF,根据该目的NVMe地址查询到的目的存储节点的信息为RDMA服务器1的信息,例如包括RDMA服务器1的IP地址、QP1和RETH2。
当目的NVMe地址中Start LBA为0x8000、block size为512、block number为64时,该目的NVMe地址代表的逻辑地址范围是0x8000---0xFFFF,根据该目的NVMe地址查询到的目的存储节点的信息为RDMA服务器1的信息,例如包括RDMA服务器1的IP地址、QP2和RETH3。
RDMA服务器1包含QP1和QP2对应的2个队列对,QP1标识的队列对对应于RDMA服务器1中RETH1和RETH2标识的内存空间,QP2标识的队列对对应于RDMA服务器1中RETH3标识的内存空间。
当目的NVMe地址中Start LBA为0x10000、block size为512、block number为128时,该目的NVMe地址代表的逻辑地址范围是0x10000---0x1FFFF,根据该目的NVMe地址查询到的目的存储节点的信息为RDMA服务器2的信息以及RDMA服务器3的信息,例如包括RDMA服务器2的MAC地址、QP10和RETH4、RDMA服务器3的MPLS标签、QP20和RETH5。RDMA服务器2和RDMA服务器3具有负载分担的关系。
当目的NVMe地址中Start LBA为0x20000、block size为512、block number为128时,该目的NVMe地址代表的逻辑地址范围是0x20000---0x2FFFF,根据该目的NVMe地址查询到的目的存储节点的信息为NOF服务器1的信息,此时提供NOF存储服务。
下面结合图17对地址转换表举例说明。
图17中地址转换表的内容如上表2所示。图17表示地址转换表中有三个长度为64K的逻辑地址段。地址转换表中第一个长度为64K的逻辑地址段为地址0x0000至地址0xFFFF。第一个地址段对应的目的存储节点为RDMA服务器1。地址转换表中第一个地址段对应于RDMA服务器1中两个QP(图17中QP1和QP2)的标识,QP1和QP2这两个逻辑通道分别对应于RDMA服务器1的两个内存地址段。地址转换表中第二个长度为64K的逻辑地址段为地址0x10000至地址0x1FFFF。第二个逻辑地址段对应的目的存储节点为RDMA服务器2和RDMA服务器3这两个RDMA节点。RDMA服务器2和RDMA服务器3存储相同的数据,表示一段逻辑地址可以实现主备和负载分担。地址转换表中第三个长度为64K的逻辑地址段为地址0x20000至地址0x2FFFF。地址转换表中第三个逻辑地址段对应的目的存储节点是NOF服务器1,表示可以兼容原先的NOF网络的NOF节点。
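结合表2,下面的Python片段示意地址转换表的查表逻辑:由NVMe指令中的Start LBA、block size和block number计算逻辑地址段,再匹配表项(数据结构为说明用途的假设):

    # 依据上文表2构造的地址转换表示意(逻辑地址段 -> 目的存储节点信息列表)
    ADDR_TABLE = [
        (0x00000, 0x03FFF, [{"node": "RDMA服务器1", "qp": "QP1", "reth": "RETH1"}]),
        (0x04000, 0x07FFF, [{"node": "RDMA服务器1", "qp": "QP1", "reth": "RETH2"}]),
        (0x08000, 0x0FFFF, [{"node": "RDMA服务器1", "qp": "QP2", "reth": "RETH3"}]),
        (0x10000, 0x1FFFF, [{"node": "RDMA服务器2", "qp": "QP10", "reth": "RETH4"},
                            {"node": "RDMA服务器3", "qp": "QP20", "reth": "RETH5"}]),
        (0x20000, 0x2FFFF, [{"node": "NOF服务器1"}]),
    ]

    def translate(start_lba, block_size, block_number):
        # 三个属性共同确定本次操作覆盖的逻辑地址段
        lo = start_lba
        hi = start_lba + block_size * block_number - 1
        return [nodes for s, e, nodes in ADDR_TABLE if s <= lo and hi <= e]

    print(translate(0x10000, 512, 128))   # 命中RDMA服务器2与RDMA服务器3(负载分担)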
NOF代理发包模块
NOF代理发包模块用于接管原有的NOF报文转发流程,根据NOF连接状态和NOF代理逻辑,修改或者构造NOF报文。NOF代理发包模块的功能具体包括以下(1)至(3)。
(1)NOF协议栈代理
NOF协议栈代理类似于报文解析模块的NOF协议栈。上述报文解析模块的NOF协议栈主要负责解析报文,而NOF代理发包模块中NOF协议栈主要负责NOF报文代理处理。NOF报文代理的功能包含维护NOF协议的连接状态、修改或者构造NOF报文。
(2)NOF报文的修改或者构造
由于NOF报文经过了网关设备的代理处理,NOF报文中网络层以上信息发生变化。网关设备不是将接收到的NOF报文原封不动地转发出去,所以NOF代理发包模块要根据NOF的连接状态修改NOF报文或者构造NOF报文。
网关设备在接收到客户端的NOF请求报文后,网关设备修改NOF请求报文,将修改后的NOF请求报文发送到NOF存储节点。网关设备在接收到NOF存储节点的NOF响应报文后,网关设备修改NOF响应报文,将修改后的NOF响应报文发送到客户端。
网关设备在接收到RDMA存储节点的RDMA响应报文后,构造NOF响应报文,将NOF响应报文发送到客户端。
(3)输出
NOF代理发包模块的输出结果为发送到客户端的NOF响应报文或者发送到NOF存储节点的NOF请求报文。
NOF与RDMA报文转换模块
NOF与RDMA报文转换模块用于进行NOF与RDMA之间协议报文互相转换。NOF与RDMA报文转换模块分为NOF-RDMA转换模块和RDMA-NOF转换模块这两个子模块。下面对NOF与RDMA报文转换模块具体说明,参见下述(1)至(3)。
(1)NOF-RDMA转换模块
NOF-RDMA转换模块用于实现从NOF到RDMA的协议报文转换。具体地,在基于地址转换表从NOF请求报文中确定目的RDMA存储节点之后,NOF请求报文进入NOF-RDMA转换模块,此时已经解析出NOF协议中NVMe指令。NOF-RDMA转换模块处理客户端的NOF请求报文,得到RDMA请求报文。
NOF-RDMA转换模块根据目的RDMA存储节点的地址、QP等参数,获取RDMA状态信息,后续使用本处获取的RDMA状态信息。其中,目的RDMA存储节点的地址、QP等参数例如根据地址转换表获取。RDMA状态信息例如根据RDMA代理发包模块获取。
NOF-RDMA的转换子模块将NVMe指令转换为RDMA指令。其中NVMe的读操作转化为RDMA的读操作,NVMe的写操作转化为RDMA的写操作。NOF-RDMA转换模块按照RDMA协议标准,对RDMA请求报文中RDMA协议固定字段进行预填充。NOF-RDMA转换模块后续的模块会补全RDMA请求报文需要携带的RDMA协议信息,将包含完整RDMA协议信息的RDMA请求报文发送到RDMA代理发包模块。
(2)RDMA-NOF转换模块
RDMA-NOF转换模块用于实现从RDMA到NOF的协议报文转换。具体地,RDMA存储节点回应的RDMA响应报文在经过报文解析模块的处理后进入RDMA-NOF转换模块,此时已经解析出报文中携带的RDMA协议中的信息。RDMA-NOF转换模块将RDMA协议中的信息转换为NOF协议中的信息。
RDMA-NOF转换模块当接收到RDMA读响应报文时,从RDMA读响应报文解析出数据以及PSN,RDMA-NOF转换模块根据PSN以及数据将RDMA读响应报文转换为NOF读响应报文,或者构造NOF读响应报文。
RDMA-NOF转换模块当接收到RDMA写响应报文时,从RDMA写响应报文解析出PSN,根据PSN将RDMA写响应报文转换为NOF写响应报文或者构造NOF写响应报文。
RDMA-NOF转换模块按照NOF协议标准,将NOF响应报文中NOF协议固定字段预填充。RDMA-NOF转换模块后续模块会补全NOF响应报文需要携带的NOF协议信息,将包含完整NOF协议信息的NOF响应报文发送至NOF代理发包模块。
NOF-RDMA转换模块和RDMA-NOF转换模块的处理逻辑有所区别。在RDMA-NOF转换模块处理报文时还没有获取NOF的状态信息,需要由RDMA-NOF转换模块的下一个模块,即NOF上下文表获取NOF的状态信息,所以RDMA-NOF转换模块只能预填充NOF协议中NVMe部分的信息。
(3)输出
NOF与RDMA报文转换模块的输出结果为填充了目标协议中部分固定字段和现在已知信息的字段的报文。对于NOF-RDMA转换过程而言,目标协议为RDMA协议。对于RDMA-NOF转换过程而言,目标协议为NOF协议。
NOF上下文表
下面通过(1)至(4)对NOF上下文表具体说明。
(1)NOF上下文表的格式
NOF上下文表中的索引为RDMA PSN值。在NOF-RDMA报文转换流程中,NOF上下文表中的RDMA PSN是网关设备在生成RDMA报文过程中产生的,该RDMA PSN例如来自于RDMA代理发包模块。
在RDMA-NOF报文转换流程中,NOF上下文表中的RDMA PSN是网关设备从RDMA报文的RDMA PSN字段解析得到的。
(2)NOF上下文表中状态信息的内容
NOF上下文表中状态信息的内容包含回应客户端所需要的所有缺失的NOF状态信息。这些状态信息在NOF-RDMA时可选地直接从报文中解析得到,或者由网关设备计算得到。以NOF为RoCE(RoCE协议是fabric的一种具体体现)为例,NOF状态信息包括RoCE层的PSN、DQP和RETH,以及NVMe层的SQHD和command ID等。需要获取的NOF状态信息包括但不限于上述情况,具体参数根据实际使用场景可能会有变化。其中,PSN、SQHD和Command ID由网关设备计算得到,具体计算方法是根据当前值作加法修正。
(3)NOF上下文表功能
NOF上下文表负责维护NOF连接中的状态与RDMA连接中的状态之间的对应关系。在网关设备将NOF报文转换为RDMA报文,基于RDMA报文与RDMA存储节点交互时,RDMA侧的交互没有NOF的状态信息。将NOF报文转换为RDMA报文的过程类似CPU切换进程,切换的新进程(类似于本实施例中网关与RDMA存储节点基于RDMA交互)没有当前进程(类似于本实施例中网关与客户端基于NOF交互)的信息,所以CPU将当前进程信息(类似于本实施例中NOF状态信息)保存到上下文表中。这里借用CPU处理的上下文这个概念来帮助理解NOF上下文表功能。通过设计NOF上下文表,当NOF转换到RDMA时,网关设备保存当前NOF状态信息到NOF上下文表。在完成RDMA交互后,网关设备再通过NOF上下文表恢复NOF状态信息。
(4)输出
在NOF-RDMA转换过程中,网关设备保存NOF状态信息到NOF上下文表中,后续由RDMA发包模块继续处理。在RDMA-NOF转换过程中,网关设备查找NOF上下文表获取NOF状态信息,将NOF状态信息输出到NOF代理发包模块,从而为发送NOF报文的过程提供所需参数。
图18示出了NOF上下文表的建立过程以及查找过程。图18中以RDMA状态信息为RDMA PSN为例描述。如图18所示,从NOF到RDMA的方向来看,在将NOF请求报文转换为RDMA请求报文的过程中,从RDMA代理发包模块获取当前RDMA的PSN,将当前RDMA的PSN作为NOF上下文表中的索引,并根据NOF请求报文获取NOF状态,将NOF状态作为NOF上下文表中与索引对应的值,从而建立NOF上下文表。从RDMA到NOF的方向来看,在将RDMA响应报文转换为NOF响应报文的过程中,从RDMA响应报文中获取PSN,以PSN为索引从NOF上下文表中查找NOF状态信息,将查找到的NOF状态信息提供给NOF代理发包模块。
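下面用一段示意性的Python代码概括图18所示的建表与查表过程(以RDMA PSN为索引;数据结构与字段名均为说明用途的假设):

    nof_context = {}   # NOF上下文表:RDMA PSN -> NOF状态信息

    def save_nof_state(rdma_psn, nof_state):
        # NOF->RDMA方向:以本次RDMA请求报文的PSN为索引保存NOF状态
        nof_context[rdma_psn] = nof_state

    def restore_nof_state(response_psn):
        # RDMA->NOF方向:用RDMA响应报文中的PSN查回NOF状态,交给NOF代理发包模块
        return nof_context.pop(response_psn, None)

    save_nof_state(100, {"nof_psn": 7, "sqhd": 3, "command_id": 0x12})
    state = restore_nof_state(100)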
RDMA代理发包模块
RDMA代理发包模块类似NOF代理发包模块。RDMA代理发包模块与NOF代理发包模块的主要区别是,RDMA代理发包模块代理的是RDMA协议。并且只在与RDMA存储节点交互时,在发包环节使用RDMA代理发包模块。RDMA代理发包模块的功能具体包括下述(1)至(3)。
(1)RDMA协议栈代理
网关设备实现RDMA的协议栈。网关设备作为客户端与RDMA存储节点建立连接。RDMA代理发包模块主要使用RDMA协议栈客户端发包的部分。
(2)RDMA报文构造
在经过NOF-RDMA指令转换并且将NOF状态信息保存到NOF上下文表之后,由RDMA代理发包模块构造RDMA的请求报文。
(3)输出
RDMA代理发包模块的输出结果为发送到RDMA存储节点的RDMA请求报文。
图19和图20是实例1中网关设备执行方法的完整流程图。图19示出了客户端->存储节点方向的网关设备执行方法的完整流程图。图20示出了存储节点->客户端方向的网关设备执行方法的完整流程图。
如图19所示,客户端->存储节点方向网关设备执行的方法流程包括以下S71至S710。
S71、网关设备接收报文。
S72、网关设备对接收到的报文进行解析。
S73、网关设备判断接收到的报文是否为NOF报文。若接收到的报文是NOF报文,则网关设备执行S74。若接收到的报文不是NOF报文,则网关设备执行S710。
S74、网关设备从地址转换表查找目的存储节点的信息。
S75、网关设备判断目的存储节点是否为RDMA存储节点。若目的存储节点是RDMA存储节点,则网关设备执行S76。若目的存储节点不是RDMA存储节点,则网关设备执行S79。
S76、网关设备执行NOF-RDMA指令转换。
S77、网关设备将NOF状态保存到NOF上下文表中。
S78、网关设备实现RDMA代理的功能,发送RDMA报文。
S79、网关设备实现NOF代理的功能,发送NOF报文。
S710、网关设备按照原始报文转发流程转发报文。
如图20所示,存储节点->客户端方向网关设备执行的方法流程包括以下S81至S88。
S81、网关设备接收报文。
S82、网关设备对接收到的报文进行解析。
S83、网关设备判断接收到的报文是否为NOF报文或RDMA报文。若接收到的报文是NOF报文或RDMA报文,则网关设备执行S84。若接收到的报文不是NOF报文且不是RDMA报文,则网关设备执行S88。
S84、网关设备判断接收到的报文是否为RDMA报文。若接收到的报文是RDMA报文,则网关设备执行S85。若接收到的报文不是RDMA报文(即接收到的报文是NOF报文),则网关设备执行S87。
S85、网关设备将RDMA报文中的信息转换为NOF协议中的信息。
S86、网关设备根据RDMA报文中的RDMA状态信息,从NOF上下文表中查找到NOF状态信息。
S87、网关设备发送NOF报文。
S88、网关设备按照原始报文转发流程转发报文。
上述实例1提供了一种新的网关设备,该网关设备位于存储节点的网关位置。该网关设备支持NOF协议栈和RDMA协议栈。该网关设备具有NOF-RDMA的协议栈转换的能力。并且该网关设备能够根据目的逻辑存储地址进行目的节点定向。
实例1达到的效果包括而不限于下述(1)至(3)。
(1)RDMA存储介质是内存,内存的性能优于现有的NVMe硬盘。实例1提供的网关设备使NOF存储网络能支持RDMA,从而发挥内存存储的优势,提升性能。
(2)原来存储方案中全部业务处理任务都由服务端(即存储节点)负责执行。实例1提供的网关设备能够卸载服务端的部分业务处理任务(即网关设备代替服务端执行部分业务处理任务),从而减轻服务端CPU压力,提升整体性能。
(3)如(2)所述,通过将服务端(即存储节点)的部分业务处理任务卸载到网关设备,能够缩短报文转发路径,从而提升整体性能。
总结实例1的方案可见,实例1改变现有的NOF存储网络结构,改变原本存储后端只能扩展NOF存储节点的情况,通过本实施例的网关设备能支持NOF存储网络扩展RDMA存储节点。
此外,实例1改变现有的NOF存储网络中存储介质全部是硬盘的现状,能支持NVMe的硬盘操作语义转换到RDMA的内存操作语义,实现硬盘存储服务和内存存储服务的协同。
此外,在存在多个存储节点的情况下,网关设备能完成目的存储逻辑地址的定向,减轻现在的存储节点的CPU压力。
此外,实例1可提供为一种非入侵的扩充支持方案。非入侵是指实例1不改变现有的业务部署,从而避免影响业务现有的正在运行的系统。实例1可作为一种加强的模式,优化业务性能。
实例2
实例2是对实例1中NOF上下文表的一种替代方案。
实例2与实例1的主要区别体现在实例2采用piggyback或者类piggyback的模式传输RDMA报文。piggyback是指本端在报文中携带指定信息,将携带指定信息的报文发送给对端后,对端将指定信息再返回给本端。实例2中指定信息为NOF状态信息或者NOF报文头的内容。
当目的存储节点是RDMA存储节点,网关设备不保存NOF状态信息到NOF上下文表中,而是将已有的NOF状态信息预填充到响应报文头中,再将包含NOF状态信息的响应报文头封装到RDMA请求报文中。NOF状态信息作为RDMA请求报文中一个附加头信息。
RDMA存储节点需要感知这种协议上的变化。RDMA存储节点对这段附加头信息不处理;或者,RDMA存储节点按照需求对附加头信息进行处理,例如RDMA存储节点计算ICRC。RDMA存储节点在生成RDMA响应报文的过程中,在RDMA响应报文中携带附加头信息。RDMA存储节点发送包含该附加头信息的RDMA响应报文,使得附加头信息返回至网关设备。网关设备根据这段附加字段恢复NOF响应报文所需携带的状态信息。网关设备构造NOF响应报文,将NOF响应报文发送到客户端。
这种方案由于无需保存NOF上下文表,从而节省了网关设备内部的存储空间,并且减少了查写表的过程。
图21示出了实例2的一种逻辑功能架构图。如图21所示,实例2中的网关设备同样包含报文解析模块、地址转换表、NOF-RDMA转换模块、RDMA代理发包模块和NOF代理发包模块。实例2中报文解析模块、NOF-RDMA转换模块以及地址转换表与实例1类似,不再详细描述。
实例2中RDMA代理发包模块和NOF代理发包模块中新增了附加头信息的处理。下面对RDMA代理发包模块和NOF代理发包模块这两个模块在实例2新增的业务逻辑进行介绍。
RDMA代理发包模块
实例2中RDMA代理发包模块保留实例1中RDMA代理发包模块原有的功能。实例2中RDMA代理发包模块在构造RDMA报文时新增了在RDMA报文中添加附加头信息的步骤。实例2包括两种具体实现方式,下面以NOF的fabric层使用RoCEv2协议为例,分别描述两种实现方式。
实现方式(1)网关设备在RDMA报文中携带NOF状态信息。
具体地,规定在RDMA报文中每个附加字段携带NOF状态信息中的哪个信息。附加字段中携带的附加头信息与实例1中NOF上下文表中的值(即NOF状态信息)类似。RDMA报文中的附加头信息相当于实例1中NOF上下文表中一条表项中的值。也可以这样理解,NOF状态信息无需作为NOF上下文表中表项的值保存在网关设备本地,而是随报文流动。
RDMA存储节点不对此附加头信息作任何处理。RDMA存储节点只是接收到携带这种附加字段的RDMA报文,从附加字段中提取附加头信息,在RDMA的常规业务逻辑处理完成后,再将附加头信息封装至RDMA响应报文。网关设备接收到RDMA响应报文后,按照标准读取附加头信息,使用附加头信息构造NOF响应报文。
实现方式(2)网关设备预生成NOF报文头,将NOF报文头作为附加头信息。
网关设备预先构造NOF报文头。网关设备将待返回给客户端的NOF已有信息全部填充到NOF报文头中。然后,网关设备将此NOF报文头作为附加头信息,向RDMA存储节点发送包含NOF报文头的RDMA请求报文。RDMA存储节点接收到RDMA请求报文后,RDMA存储节点对RDMA请求报文中的NOF报文头继续进行处理和修改。例如RDMA存储节点补齐NOF报文头缺失的内容,计算报文ICRC等。RDMA存储节点将处理过的NOF报文头继续作为附加头信息,将NOF报文头封装至RDMA响应报文中,使得NOF报文头作为RDMA响应报文中的内层头。
网关设备收到RDMA响应报文后,网关设备剥去RDMA响应报文中的外层报文头。网关设备使用RDMA响应报文从内层头(NOF报文头)开始的部分作为NOF响应报文。
实例2中NOF代理发包模块与实例2中RDMA代理发包模块两种机制配合使用。
对比实例1中从NOF上下文表中获取NOF状态信息,实例2中NOF代理发包模块从报文中附加头信息中获取到NOF状态信息,后续处理类似实例1。
实例2中NOF代理发包模块剥离RDMA响应报文中外层头,转发剥离外层头后的报文。可选地,实例2中NOF代理发包模块根据网络情况,对报文中二层部分或报文中三层部分进行修改。
实例2中由于报文中携带附加头信息,附加头信息会占用报文额外的空间。
实例2的实现方式(1)在报文中需要额外占用的空间与实例1的NOF上下文表的每一个表项长度一致,例如在RoCEv2场景下,在报文中额外占用的空间大约为20B-30B。
实例2的实现方式(2)中由于在报文中增加了完整的二层头和三层头,所以增加的二层头和三层头需要额外占用一些空间。根据二三层头的情况,增加的二层头和三层头大概在报文中占用40B-50B的空间。考虑到转发物理层的最大传输单元(Maximum Transmission Unit,MTU)与对应MTU下RDMA普通报文的长度之间存在限制关系,在报文中增加NOF报文头后,报文的整体长度仍满足MTU的限制,因此不会造成额外分片。
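下面的Python片段示意这一长度校验的思路(其中MTU取值与各层头长度均为假设的示例取值,并非对实际部署的限定):

    def fits_in_mtu(payload_len, nof_hdr_len=50, base_hdr_len=66, mtu=1500):
        # base_hdr_len为外层二三层+UDP+IB等头的假设长度,nof_hdr_len为附加的NOF报文头长度
        return base_hdr_len + nof_hdr_len + payload_len <= mtu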
图22和图23是实例2中网关设备执行方法的完整流程图。图22示出了客户端->存储节点方向的网关设备执行方法的完整流程图。图22所示的流程将实例1中图19的流程中S77替换为S77’、网关设备构造报文附加字段。图22所示的流程其他步骤可参考图19。
图23示出了存储节点->客户端方向的网关设备执行方法的完整流程图。图23所示的流程将实例1中图20的流程中S86替换为S86’、网关设备处理报文附加字段。图23所示的流程其他步骤可参考图20。
实例2的技术效果同实例1。对比实例1和实例2,实例2提供的网关设备不需要部署NOF上下文表,因此减少了网关设备内部存储空间的消耗。并且,实例2减少了查写表的过程。但是实例2需要修改RDMA协议,使RDMA协议支持附加字段的识别和处理。
实例3
实例3为实例1和实例2的补充。实例3主要补充控制面流程。图24是实例3的逻辑功能架构图。如图24所示,实例3中网关设备包含报文解析模块、地址转换表和地址编排模块。实例3中报文解析模块以及地址转换表与实例1类似,不再详细描述。实例3主要解释存储节点的地址如何下发到网关设备。实例3涉及RDMA的双边操作和NOF控制通道的信息交互报文。
地址编排模块
地址编排模块用于处理RDMA双边操作中RDMA注册存储地址空间的报文和NOF控制通道的信息交互报文。地址编排模块对RDMA存储节点通过双边操作报文上报的内存地址、NOF控制通道的信息交互报文中NVMe存储地址段进行统一编排管理后,生成统一的虚拟地址,生成的虚拟地址后续会写入到地址转换表中。地址编排模块的功能具体包括以下(1)至(3)。
(1)地址解析
RDMA协议中,RDMA节点通过执行双边操作的send操作或receive操作注册RDMA存储节点的内存空间的地址,将内存空间的地址上报给用户,后续用户能通过RDMA节点上报的地址直接操作这段内存空间的地址。NOF协议中通过控制通道,由NOF存储节点通知用户存储节点的可用硬盘地址段,后续用户能通过NOF存储节点上报的地址直接操作这段硬盘地址。地址编排模块用于从RDMA节点发送的报文中解析出RDMA节点上报的内存地址,从NOF节点发送的报文中解析出NVMe节点上报的硬盘地址。
(2)地址编排
地址编排模块将各存储节点上报的地址统一编排为全局的虚拟地址。地址编排模块编排得到的地址为地址转换表中的内容。地址编排模块编排得到的地址具体为地址转换表中用来查找目的存储节点的信息的索引,即NVMe逻辑地址。地址转换表中保存的目的存储节点的信息是各存储节点上报的地址。
(3)输出
地址编排模块将包含编排得到的地址表项输出到地址转换表。
本实施例以NOF存储节点通过NOF协议中控制通道向网关设备上报硬盘地址为例进行说明,在另一些实施例中,提供一种专用于上报地址的报文,NOF存储节点通过发送专用的报文上报硬盘地址。
实例3中地址转换表同实例1。实例1描述了地址转换表的查表流程,实例3描述地址转换表的写表流程。
图25示出了实例3中网关设备执行方法的完整流程图。如图25所示,网关设备执行的方法流程包括以下S91至S98。
S91、网关设备接收报文。
S92、网关设备对接收到的报文进行解析。
S93、网关设备判断接收到的报文是否为RDMA双边操作报文。
若接收到的报文是RDMA双边操作报文,则网关设备执行S94。若接收到的报文不是RDMA双边操作报文,则网关设备执行S95。
S94、网关设备解析RDMA注册的地址信息。
S95、网关设备判断接收到的报文是否为来自于NOF控制通道的地址上报报文。若接收到的报文是来自于NOF控制通道的地址上报报文,则网关设备执行S96。若接收到的报文不是来自于NOF控制通道的地址上报报文,则网关设备执行S98。
S96、网关设备根据报文中携带的地址进行地址编排,或者解析出报文中携带的地址。
S97、网关设备配置地址转换表。
S98、网关设备执行实例1或者实例2的流程。
下面对实例3的技术效果进行描述。
实例3对实例1和实例2的细节进行了补充,补全了控制面流程。本实施例提供的网关设备解析出RDMA存储节点在注册内存时上报的内存地址和NOF存储节点通过控制通道上报的硬盘地址,网关设备统一编排各存储节点上报的地址,最终生成地址转换表的表项。
在另一些实施例中,网关设备或者各个存储节点将RDMA存储节点的内存地址以及NOF存储节点的硬盘地址上报到提供统一地址编排管控软件的服务端。服务端进行地址编排,并将地址转换表的内容发送给网关设备。
总结上述各个实施例的方案来看,本申请实施例实现了一种网关设备,此网关设备可选地部署在传统NOF存储网络中,该网关设备实现了下述(1)至(4)。
(1)同时支持NOF的协议栈和RDMA的协议栈。
本申请实施例提供的网关设备能够处理RDMA协议栈,实现网关设备与RDMA存储节点的连接与交互。
本申请实施例提供的网关设备能够处理NOF协议栈,解析出NOF协议栈的信息,维护NOF协议栈的状态信息,网关设备能够代替NOF服务器向客户端回复NOF报文,实现代理NOF服务器的功能。
(2)NOF-RDMA的协议逻辑互相转换机制。NOF-RDMA的协议逻辑互相转换机制的具体表现为NOF请求报文转换为RDMA请求报文,RDMA响应报文转换为NOF响应报文。
(3)NOF-RDMA地址转换表。
本实施例将NOF-RDMA地址转换表部署在网关设备。地址转换表实现NOF中NVMe目的逻辑地址到RDMA目的逻辑地址的映射。
(4)将原有的NOF存储网络中单纯NOF硬盘介质存储方案,替换为NOF硬盘介质存储和RDMA内存介质存储的混合存储方式。
本申请实施例提供的存储方案可选地和内存型硬盘结合,发挥更大的作用。
现阶段RDMA存储节点大部分是服务器。本实施例中RDMA存储节点所需实现的功能基本上就是网络协议解析、总线数据搬迁和操作内存介质,并不需要很强的CPU能力。现阶段正在研究智能网卡-PCIE总线-内存的直通设备。智能网卡-PCIE总线-内存的直通设备比服务器更轻量,本实施例可选地使用这种设备作为存储节点,实现一种NOF存储网络的海量存储方案。例如,结合图9实施例来看,在一种可能的实现中,图9实施例中第一RDMA存储节点为智能网卡-PCIE总线-内存的直通设备,图9实施例中S405、S406、S407、S408由第一RDMA存储节点中智能网卡执行。在S406中,第一RDMA存储节点中智能网卡通过PCIE总线与内存进行数据传输,从而执行读/写操作,如此,将CPU的处理工作卸载到智能网卡上,减轻CPU的计算负担,提高图9实施例的运行效率。
可选地,NOF使用RoCE之外的其他网络作为承载NVMe的fabric。本实施例可选地应用在NVMe承载在其他fabric之上的场景下,比如应用在NVMe over TCP的场景。NVMe over TCP的场景即NVMe直接承载在TCP上,而不是承载在UDP和IB上。例如,结合图9实施例来看,在一种可能的实现中,图9实施例中S401中的第一NOF请求报文、S411中第一NOF响应报文为TCP报文,NOF状态信息包括TCP中的序列号。如此,网关设备支持基于TCP与客户端交互,满足更多的业务场景。
附图26是本申请实施例提供的一种报文处理装置700的结构示意图,附图26所示的装置700设于网关设备上。装置700包括接收单元701、处理单元702和发送单元703。
可选地,结合附图8所示的应用场景来看,附图26所示的装置700设于图8中的网关设备33。
可选地,结合附图9所示的方法流程来看,附图26所示的装置700设于图9中的网关设备。接收单元701用于支持图9中的网关设备执行S402和S409。处理单元702用于支持图9中的网关设备执行S403和S410。发送单元703用于支持图9中的网关设备执行S404和S411。
可选地,结合附图11所示的方法流程来看,附图26所示的装置700设于图11中的网关设备。接收单元701用于支持图11中的网关设备执行S502和S509。处理单元702用于支持图11中的网关设备执行S503和S510。发送单元703用于支持图11中的网关设备执行S504和S511。
可选地,结合附图12所示的应用场景来看,附图26所示的装置700设于图12中的网关设备。接收单元701和发送单元703通过图12中网关设备中的网口实现。接收单元701用于支持图12中网关设备从图12中客户端接收NOF请求报文。发送单元703用于支持图12中网关设备向图12中RDMA存储节点A发送RDMA请求报文或者向图12中NOF存储节点发送NOF请求报文。
可选地,结合附图13所示的应用场景来看,附图26所示的装置700设于图13中的网关设备。附图26所示的装置700还包括存储单元,存储单元通过图13中网关设备中Cache(缓存)实现。
可选地,结合附图14所示的应用场景来看,附图26所示的装置700设于附图14中的网关设备。处理单元702包括附图14中RDMA适配器和NOF监听模块,接收单元701和发送单元703包括附图14中各个端口。
可选地,结合附图15所示的方法流程来看,附图26所示的装置700设于图15中的网关设备。处理单元702用于支持图15中的网关设备执行S612、查找地址转换表、NOF简单代理处理、NOF-RDMA报文转换和RDMA-NOF报文转换,接收单元701用于支持图15中的网关设备接收S614中下发的地址转换表、S621中的NOF读/写请求、S623中NOF读/写响应、S632中RDMA读/写响应,发送单元703用于支持图15中的网关设备执行S622、S631和S633。
可选地,结合附图16所示架构来看,附图26所示的装置700设于图16中的网关设备。处理单元702包括附图16中NOF-RDMA转换模块、RDMA-NOF转换模块和报文解析模块,发送单元703包括附图16中NOF代理发包模块和RDMA代理发包模块。附图26所示的装置700还包括存储单元,存储单元用于保存附图16中NOF上下文表。
可选地,结合附图17所示架构来看,附图26所示的装置700设于图17中的网关设备。附图26所示的装置700还包括存储单元,存储单元用于保存附图17中地址转换表。
可选地,结合附图18所示架构来看,附图26所示的装置700设于图18中的网关设备。发送单元703包括附图18中NOF代理发包模块和RDMA代理发包模块。处理单元702用于执行附图18中存储NOF状态以及查找NOF状态的步骤。附图26所示的装置700还包括存储单元,存储单元用于保存附图18中地址转换表。
可选地,结合附图19所示的方法流程来看,装置700用于支持网关设备执行附图19所示的方法流程。接收单元701用于支持网关设备执行附图19中S71;处理单元702用于支持网关设备执行附图19中S72、S73、S74、S75、S76和S77;发送单元703用于支持网关设备执行附图19中S78、S79和S710。
可选地,结合附图20所示的方法流程来看,装置700用于支持网关设备执行附图20所示的方法流程。接收单元701用于支持网关设备执行附图20中S81;处理单元702用于支持网关设备执行附图20中S82、S83、S84、S85和S86;发送单元703用于支持网关设备执行附图20中S87和S88。
可选地,结合附图21所示架构来看,附图26所示的装置700设于图21中的网关设备。处理单元702包括附图21中报文解析模块、NOF-RDMA转换模块和RDMA-NOF转换模块。 发送单元703包括附图21中RDMA代理发包模块和NOF代理发包模块。附图26所示的装置700还包括存储单元,存储单元用于保存附图21中地址转换表。
可选地,结合附图22所示的方法流程来看,装置700用于支持网关设备执行附图22所示的方法流程。接收单元701用于支持网关设备执行附图22中S71;处理单元702用于支持网关设备执行附图22中S72、S73、S74、S75、S76和S77';发送单元703用于支持网关设备执行附图22中S78、S79和S710。
可选地,结合附图23所示的方法流程来看,装置700用于支持网关设备执行附图23所示的方法流程。接收单元701用于支持网关设备执行附图23中S81;处理单元702用于支持网关设备执行附图23中S82、S83、S84、S85和S86';发送单元703用于支持网关设备执行附图23中S87和S88。
可选地,结合附图25所示的方法流程来看,装置700用于支持网关设备执行附图25所示的方法流程。接收单元701用于支持网关设备执行附图25中S91;处理单元702用于支持网关设备执行附图25中S92、S93、S94、S95、S96和S97。
附图26所描述的装置实施例仅仅是示意性的,例如,上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
装置700中的各个单元全部或部分地通过软件、硬件、固件或者其任意组合来实现。
在采用软件实现的情况下,例如,上述处理单元702是由附图27中的至少一个处理器801读取存储器802中存储的程序代码810后,生成的软件功能单元来实现。又如,上述处理单元702是由附图28中的网络处理器932或中央处理器911或中央处理器931读取存储器912或存储器934中存储的程序代码后,生成的软件功能单元来实现。
在采用硬件实现的情况下,例如,附图26中上述各个单元由设备中的不同硬件分别实现,例如处理单元702由附图27中的至少一个处理器801或者附图28中的网络处理器932或中央处理器911或中央处理器931中的一部分处理资源(例如多核处理器中的一个核或两个核)实现,或者采用现场可编程门阵列(field-programmable gate array,FPGA)、或协处理器等可编程器件来完成。接收单元701和发送单元703由附图27中的网络接口803或附图28中接口板930实现。
附图27是本申请实施例提供的一种网关设备800的结构示意图。网关设备800包括至少一个处理器801、存储器802以及至少一个网络接口803。
可选地,结合附图8所示的应用场景来看,附图27所示的网关设备800为图8中的网关设备33。
可选地,结合附图9所示的方法流程来看,附图27所示的网关设备800为图9中的网关设备。网络接口803用于支持图9中的网关设备执行S402、S404、S409和S411。处理器801用于支持图9中的网关设备执行S403和S410。
可选地,结合附图11所示的方法流程来看,附图27所示的网关设备800为图11中的网关设备。网络接口803用于支持图11中的网关设备执行S502、S504、S509和S511。处理器801用于支持图11中的网关设备执行S503和S510。
可选地,结合附图12所示的应用场景来看,附图27所示的网关设备800为图12中的网关设备。网络接口803为图12中网关设备中的网口。
可选地,结合附图13所示的应用场景来看,附图27所示的网关设备800为图13中的网关设备。存储器802包括图13中网关设备中Cache(缓存)。
可选地,结合附图14所示的应用场景来看,附图27所示的网关设备800为附图14中的网关设备。处理器801包括附图14中RDMA适配器和NOF监听模块,网络接口803包括附图14中各个端口。
可选地,结合附图15所示的方法流程来看,附图27所示的网关设备800为图15中的网关设备。处理器801用于支持图15中的网关设备执行S612、查找地址转换表、NOF简单代理处理、NOF-RDMA报文转换和RDMA-NOF报文转换,网络接口803用于支持图15中的网关设备接收S614中下发的地址转换表、S621中的NOF读/写请求、S623中NOF读/写响应、S632中RDMA读/写响应、S622、S631和S633。
可选地,结合附图16所示架构来看,附图27所示的网关设备800为图16中的网关设备。处理器801包括附图16中NOF-RDMA转换模块、RDMA-NOF转换模块和报文解析模块,网络接口803包括附图16中NOF代理发包模块和RDMA代理发包模块。存储器802用于保存附图16中NOF上下文表。
可选地,结合附图17所示架构来看,附图27所示的网关设备800为图17中的网关设备。存储器802用于保存附图17中地址转换表。
可选地,结合附图18所示架构来看,附图27所示的网关设备800为图18中的网关设备。网络接口803包括附图18中NOF代理发包模块和RDMA代理发包模块。处理器801用于执行附图18中存储NOF状态以及查找NOF状态的步骤。存储器802用于保存附图18中地址转换表。
可选地,结合附图19所示的方法流程来看,网关设备800用于执行附图19所示的方法流程。网络接口803用于执行附图19中S71、S78、S79和S710;处理器801用于执行附图19中S72、S73、S74、S75、S76和S77。
可选地,结合附图20所示的方法流程来看,网关设备800用于执行附图20所示的方法流程。网络接口803用于执行附图20中S81、S87和S88;处理器801用于执行附图20中S82、S83、S84、S85和S86。
可选地,结合附图21所示架构来看,附图27所示的网关设备800为图21中的网关设备。处理器801包括附图21中报文解析模块、NOF-RDMA转换模块和RDMA-NOF转换模块。网络接口803包括附图21中RDMA代理发包模块和NOF代理发包模块。存储器802用于保存附图21中地址转换表。
可选地,结合附图22所示的方法流程来看,网关设备800用于执行附图22所示的方法流程。网络接口803用于执行附图22中S71、S78、S79和S710;处理器801用于执行附图22中S72、S73、S74、S75、S76和S77'。
可选地,结合附图23所示的方法流程来看,网关设备800用于执行附图23所示的方法流程。网络接口803用于执行附图23中S81、S87和S88;处理器801用于执行附图23中S82、S83、S84、S85和S86'。
可选地,结合附图25所示的方法流程来看,网关设备800用于执行附图25所示的方法流程。网络接口803用于执行附图25中S91;处理器801用于执行附图25中S92、S93、S94、S95、S96和S97。
处理器801例如是通用中央处理器(central processing unit,CPU)、网络处理器(network processer,NP)、图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing units,NPU)、数据处理单元(data processing unit,DPU)、微处理器或者一个或多个用于实现本申请方案的集成电路。例如,处理器801包括专用集成电路(application-specific integrated circuit,ASIC),可编程逻辑器件(programmable logic device,PLD)或其组合。PLD例如是复杂可编程逻辑器件(complex programmable logic device,CPLD)、现场可编程逻辑门阵列(field-programmable gate array,FPGA)、通用阵列逻辑(generic array logic,GAL)或其任意组合。
在一些实施例中,处理器801包括一个或多个CPU,如附图27中所示的CPU0和CPU1。
存储器802例如是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其它类型的静态存储设备,又如是随机存取存储器(random access memory,RAM)或者可存储信息和指令的其它类型的动态存储设备,又如是电可擦可编程只读存储器(electrically erasable programmable read-only Memory,EEPROM)、只读光盘(compact disc read-only memory,CD-ROM)或其它光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其它磁存储设备,或者是能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其它介质,但不限于此。可选地,存储器802独立存在,并通过内部连接804与处理器801相连接。或者,可选地存储器802和处理器801集成在一起。
网络接口803使用任何收发器一类的装置,用于与其它设备或通信网络通信。网络接口803例如包括有线网络接口或者无线网络接口中的至少一项。其中,有线网络接口例如为以太网接口。以太网接口例如是光接口,电接口或其组合。无线网络接口例如为无线局域网(wireless local area networks,WLAN)接口,蜂窝网络网络接口或其组合等。
在一些实施例中,处理器801和网络接口803相互配合以完成上述实施例涉及的发送报文以及接收报文的过程。
例如,上述发送第一RDMA请求报文的过程包括:处理器801指示网络接口803发送第一RDMA请求报文。在一种可能的实现中,处理器801生成并向网络接口803发送指令,网络接口803基于处理器801的指令,发送第一RDMA请求报文。
例如,上述接收第一NOF请求报文的过程包括:网络接口803接收第一NOF请求报文,对第一NOF请求报文进行部分加工处理(如解封装)后,发送到处理器801,使得处理器801获得第一NOF请求报文携带的上述实施例时所需的信息(如第一目的地址)。
在一些实施例中,网关设备800可选地包括多个处理器,如附图27中所示的处理器801和处理器805。这些处理器中的每一个例如是一个单核处理器(single-CPU),又如是一个多核处理器(multi-CPU)。这里的处理器可选地指一个或多个设备、电路、和/或用于处理数据(如计算机程序指令)的处理核。在一种可能的实现中,由多个核或者多个处理器分别执行上述方法实施例其中的部分步骤。
在一些实施例中,网关设备800还包括内部连接804。处理器801、存储器802以及至少一个网络接口803通过内部连接804连接。内部连接804包括通路,在上述组件之间传送信息。可选地,内部连接804是单板或总线。可选地,内部连接804分为地址总线、数据总线、控制总线等。在一些实施例中,网关设备800还包括输入输出接口806。
可选地,处理器801通过读取存储器802中保存的程序代码810实现上述实施例中的方法,或者,处理器801通过内部存储的程序代码实现上述实施例中的方法。在处理器801通过读取存储器802中保存的程序代码810实现上述实施例中的方法的情况下,存储器802中保存实现本申请实施例提供的方法的程序代码。
处理器801实现上述功能的更多细节请参考前面各个方法实施例中的描述,在这里不再重复。
参见附图28,附图28是本申请实施例提供的一种网关设备900的结构示意图。网关设备900包括:主控板910和接口板930。
可选地,结合附图8所示的应用场景来看,附图28所示的网关设备900为图8中的网关设备33。
可选地,结合附图9所示的方法流程来看,附图28所示的网关设备900为图9中的网关设备。接口板930用于支持图9中的网关设备执行S402、S404、S409、S411、S403和S410。
可选地,结合附图11所示的方法流程来看,附图28所示的网关设备900为图11中的网关设备。接口板930用于支持图11中的网关设备执行S502、S504、S509、S511、S503和S510。
可选地,结合附图12所示的应用场景来看,附图28所示的网关设备900为图12中的网关设备。接口板930包括图12中网关设备中的网口。
可选地,结合附图13所示的应用场景来看,附图28所示的网关设备900为图13中的网关设备。转发表项存储器934包括图13中网关设备中Cache(缓存)。
可选地,结合附图14所示的应用场景来看,附图28所示的网关设备900为附图14中的网关设备。接口板930包括附图14中RDMA适配器、NOF监听模块和各个端口。
可选地,结合附图15所示的方法流程来看,附图28所示的网关设备900为图15中的网关设备。接口板930用于支持图15中的网关设备执行S612、查找地址转换表、NOF简单代理处理、NOF-RDMA报文转换、RDMA-NOF报文转换、S614中下发的地址转换表、S621中的NOF读/写请求、S623中NOF读/写响应、S632中RDMA读/写响应、S622、S631和S633。
可选地,结合附图16所示架构来看,附图28所示的网关设备900为图16中的网关设备。接口板930包括附图16中NOF-RDMA转换模块、RDMA-NOF转换模块、报文解析模块、NOF代理发包模块和RDMA代理发包模块。转发表项存储器934用于保存附图16中NOF上下文表。
可选地,结合附图17所示架构来看,附图28所示的网关设备900为图17中的网关设备。转发表项存储器934用于保存附图17中地址转换表。
可选地,结合附图18所示架构来看,附图28所示的网关设备900为图18中的网关设备。接口板930包括附图18中NOF代理发包模块和RDMA代理发包模块。存储器912或转发表项存储器934用于保存附图18中地址转换表。
可选地,结合附图19所示的方法流程来看,网关设备900用于执行附图19所示的方法流程。接口板930用于执行附图19中S71、S78、S79、S710、S72、S73、S74、S75、S76和S77。
可选地,结合附图20所示的方法流程来看,网关设备900用于执行附图20所示的方法流程。接口板930用于执行附图20中S81、S87、S88、S82、S83、S84、S85和S86。
可选地,结合附图21所示架构来看,附图28所示的网关设备900为图21中的网关设备。接口板930包括附图21中报文解析模块、NOF-RDMA转换模块、RDMA-NOF转换模块、RDMA代理发包模块和NOF代理发包模块。转发表项存储器934用于保存附图21中地址转换表。
可选地,结合附图22所示的方法流程来看,网关设备900用于执行附图22所示的方法流程。接口板930用于执行附图22中S71、S78、S79、S710、S72、S73、S74、S75、S76和S77'。
可选地,结合附图23所示的方法流程来看,网关设备900用于执行附图23所示的方法流程。接口板930用于执行附图23中S81、S87、S88、S82、S83、S84、S85和S86'。
可选地,结合附图25所示的方法流程来看,网关设备900用于执行附图25所示的方法流程。接口板930用于执行附图25中S91、S92、S93、S95;主控板910用于执行附图25中S94、S96和S97。
主控板910也称为主处理单元(main processing unit,MPU)或路由处理卡(route processor card),主控板910用于对网关设备900中各个组件的控制和管理,包括路由计算、设备管理、设备维护、协议处理功能。主控板910包括:中央处理器911和存储器912。
接口板930也称为线路接口单元卡(line processing unit,LPU)、线卡(line card)或业务板。接口板930用于提供各种业务接口并实现数据包的转发。业务接口包括而不限于以太网接口、POS(packet over SONET/SDH)接口等,以太网接口例如是灵活以太网业务接口(flexible ethernet clients,FlexE clients)。接口板930包括:中央处理器931、网络处理器932、转发表项存储器934和物理接口卡(physical interface card,PIC)933。
接口板930上的中央处理器931用于对接口板930进行控制管理并与主控板910上的中央处理器911进行通信。
网络处理器932用于实现报文的转发处理。网络处理器932的形态例如是转发芯片。具体而言,网络处理器932用于基于转发表项存储器934保存的转发表转发接收到的报文,如果报文的目的地址为网关设备900的地址,则将该报文上送至CPU(如中央处理器911)处理;如果报文的目的地址不是网关设备900的地址,则根据该目的地址从转发表中查找到该目的地址对应的下一跳和出接口,将该报文转发到该目的地址对应的出接口。其中,上行报文的处理包括:报文入接口的处理,转发表查找;下行报文的处理:转发表查找等等。
物理接口卡933用于实现物理层的对接功能,原始的流量由此进入接口板930,以及处理后的报文从该物理接口卡933发出。物理接口卡933也称为子卡,可安装在接口板930上,负责将光电信号转换为报文并对报文进行合法性检查后转发给网络处理器932处理。在一些实施例中,中央处理器也可执行网络处理器932的功能,比如基于通用CPU实现软件转发,从而物理接口卡933中不需要网络处理器932。
可选地,网关设备900包括多个接口板,例如网关设备900还包括接口板940,接口板940包括:中央处理器941、网络处理器942、转发表项存储器944和物理接口卡943。
可选地,网关设备900还包括交换网板920。交换网板920例如也称为交换网板单元(switch fabric unit,SFU)。在网络设备有多个接口板930的情况下,交换网板920用于完成各接口板之间的数据交换。例如,接口板930和接口板940之间例如通过交换网板920通信。
主控板910和接口板930耦合。例如,主控板910、接口板930和接口板940,以及交换网板920之间通过系统总线与系统背板相连实现互通。在一种可能的实现方式中,主控板910和接口板930之间建立进程间通信协议(inter-process communication,IPC)通道,主控板910和接口板930之间通过IPC通道进行通信。
在逻辑上,网关设备900包括控制面和转发面,控制面包括主控板910和中央处理器931,转发面包括执行转发的各个组件,比如转发表项存储器934、物理接口卡933和网络处理器932。控制面执行路由协议、生成转发表、处理信令和协议报文、配置与维护设备的状态等功能,控制面将生成的转发表下发给转发面,在转发面,网络处理器932基于控制面下发的转发表对物理接口卡933收到的报文查表转发。控制面下发的转发表例如保存在转发表项存储器934中。在有些实施例中,控制面和转发面例如完全分离,不在同一设备上。
接口板940上的操作与接口板930的操作一致,为了简洁,不再赘述。
主控板可能有一块或多块,有多块的时候例如包括主用主控板和备用主控板。接口板可能有一块或多块,网络设备的数据处理能力越强,提供的接口板越多。接口板上的物理接口卡也可以有一块或多块。交换网板可能没有,也可能有一块或多块,有多块的时候可以共同实现负荷分担冗余备份。在集中式转发架构下,网络设备可以不需要交换网板,接口板承担整个系统的业务数据的处理功能。在分布式转发架构下,网络设备可以有至少一块交换网板,通过交换网板实现多块接口板之间的数据交换,提供大容量的数据交换和处理能力。所以,分布式架构的网络设备的数据接入和处理能力要大于集中式架构的设备。可选地,网络设备的形态也可以是只有一块板卡,即没有交换网板,接口板和主控板的功能集成在该一块板卡上,此时接口板上的中央处理器和主控板上的中央处理器在该一块板卡上可以合并为一个中央处理器,执行两者叠加后的功能,这种形态设备的数据交换和处理能力较低(例如,低端交换机或路由器等网络设备)。具体采用哪种架构,取决于具体的组网部署场景,此处不做任何限定。
本说明书中的各个实施例均采用递进的方式描述,各个实施例之间相同相似的部分可互相参考,每个实施例重点说明的都是与其他实施例的不同之处。
A参考B,指的是A与B相同或者A为B的简单变形。
本申请实施例的说明书和权利要求书中的术语“第一”和“第二”等是用于区别不同的对象,而不是用于描述对象的特定顺序,也不能理解为指示或暗示相对重要性。例如,第一RDMA存储节点和第二RDMA存储节点用于区别不同的RDMA存储节点,而不是用于描述RDMA存储节点的特定顺序,也不能理解为第一RDMA存储节点比第二RDMA存储节点更重要。
本申请实施例,除非另有说明,“至少一个”的含义是指一个或多个,“多个”的含义是指两个或两个以上。例如,多个RDMA存储节点是指两个或两个以上的RDMA存储节点。
上述实施例可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机程序指令时,全部或部分地产生按照本申请实施例描述的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘Solid State Disk(SSD))等。
以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的范围。

Claims (37)

  1. 一种报文处理方法,其特征在于,所述方法包括:
    网关设备接收来自客户端的第一承载在网络端的非易失性高速传输总线NOF请求报文,所述第一NOF请求报文携带非易失性高速传输总线NVMe指令,所述NVMe指令指示对第一目的地址执行读/写操作;
    所述网关设备基于所述第一目的地址获取第一远程直接内存访问RDMA存储节点的信息;
    所述网关设备向所述第一RDMA存储节点发送第一RDMA请求报文,所述第一RDMA请求报文携带所述NVMe指令对应的RDMA指令。
  2. 根据权利要求1所述的方法,其特征在于,所述网关设备基于所述第一目的地址获取第一远程直接内存访问RDMA存储节点的信息,包括:
    所述网关设备基于所述第一目的地址,从第一对应关系中查询得到所述第一RDMA存储节点的信息,所述第一对应关系指示所述第一目的地址以及所述第一RDMA存储节点的信息之间的对应关系。
  3. 根据权利要求1或2所述的方法,其特征在于,所述第一RDMA存储节点的信息包括以下信息中的至少一个:第二目的地址、所述第一RDMA存储节点的网络位置信息、所述第一RDMA存储节点中一个或多个队列对QP的标识和远端秘钥R_Key,所述第二目的地址指向所述第一RDMA存储节点中的内存空间,所述R_Key指示访问所述第一RDMA存储节点的内存的权限。
  4. 根据权利要求3所述的方法,其特征在于,所述网络位置信息包括介质访问控制层MAC地址、网际协议IP地址、多协议标签交换MPLS标签或者段标识SID中至少一项。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述网关设备向所述第一RDMA存储节点发送所述第一RDMA请求报文之后,所述方法还包括:
    网关设备接收来自所述第一RDMA存储节点的RDMA响应报文,所述RDMA响应报文是针对所述第一RDMA请求报文的响应报文;
    所述网关设备基于所述RDMA响应报文生成第一NOF响应报文,第一NOF响应报文是针对所述第一NOF请求报文的响应报文;
    所述网关设备向所述客户端发送所述第一NOF响应报文。
  6. 根据权利要求5所述的方法,其特征在于,所述网关设备基于所述RDMA响应报文生成第一NOF响应报文,包括:
    所述网关设备基于所述RDMA响应报文获得RDMA状态信息,所述RDMA状态信息指示所述RDMA响应报文与所述第一RDMA请求报文之间的对应关系;
    所述网关设备根据所述RDMA状态信息,从第二对应关系中查询得到NOF状态信息,所述第二对应关系包括所述RDMA状态信息与所述NOF状态信息之间的对应关系,所述NOF状态信息指示所述第一NOF响应报文与所述第一NOF请求报文之间的对应关系;
    所述网关设备基于所述NOF状态信息生成所述第一NOF响应报文。
  7. 根据权利要求6所述的方法,其特征在于,所述网关设备根据所述RDMA状态信息,从第二对应关系中查询得到NOF状态信息之前,所述方法还包括:
    所述网关设备基于所述第一NOF请求报文获得所述NOF状态信息;
    所述网关设备建立所述第二对应关系,所述第二对应关系为所述NOF状态信息与所述RDMA状态信息之间的对应关系。
  8. 根据权利要求5所述的方法,其特征在于,所述网关设备基于所述RDMA响应报文生成第一NOF响应报文,包括:
    所述网关设备基于所述RDMA响应报文中的NOF状态信息生成所述第一NOF响应报文。
  9. 根据权利要求5所述的方法,其特征在于,所述第一RDMA请求报文包括第一NOF报文头,所述RDMA响应报文包括所述第一RDMA存储节点基于所述第一NOF报文头生成的第二NOF报文头,所述第一NOF响应报文包括所述第二NOF报文头。
  10. 根据权利要求6至9中任一项所述的方法,其特征在于,所述RDMA状态信息包括包序列号PSN。
  11. 根据权利要求6至10中任一项所述的方法,其特征在于,所述NOF状态信息包括包序列号PSN、发送队列头指针SQHD、命令标志符command ID、目的队列对DQP、虚拟地址、远端秘钥R_Key或直接内存访问DMA的长度中至少一项。
  12. 根据权利要求2至4中任一项所述的方法,其特征在于,所述第一对应关系还包括第二RDMA存储节点的信息,所述方法还包括:
    在所述NVMe指令指示写操作的情况下,所述网关设备向所述第二RDMA存储节点发送第二RDMA请求报文,所述第二RDMA请求报文携带所述NVMe指令对应的RDMA指令。
  13. 根据权利要求12所述的方法,其特征在于,所述第一RDMA请求报文和所述第二RDMA请求报文均为组播报文;或者,
    所述第一RDMA请求报文和所述第二RDMA请求报文均为单播报文。
  14. 根据权利要求2至4中任一项所述的方法,其特征在于,所述第一对应关系还包括第二RDMA存储节点的信息,所述方法还包括:
    在所述NVMe指令指示读操作的情况下,所述网关设备基于负载分担算法,从所述第一RDMA存储节点和所述第二RDMA存储节点中选择所述第一RDMA存储节点。
  15. 根据权利要求2至4中任一项所述的方法,其特征在于,所述第一对应关系是所述网关设备从所述网关设备之外的其他设备接收的,或者,所述第一对应关系是所述网关设备生成的。
  16. 根据权利要求1至15中任一项所述的方法,其特征在于,所述第一RDMA请求报文还包括所述第一RDMA存储节点的信息。
  17. 根据权利要求12或13所述的方法,其特征在于,所述第二RDMA请求报文还包括所述第二RDMA存储节点的信息。
  18. 根据权利要求1-17中任一项所述的方法,其特征在于,所述第一RDMA存储节点为存储服务器、内存或者存储阵列。
  19. 一种报文处理装置,其特征在于,所述装置设于网关设备,所述装置包括:
    接收单元,用于接收来自客户端的第一承载在网络端的非易失性高速传输总线NOF请求报文,所述第一NOF请求报文携带非易失性高速传输总线NVMe指令,所述NVMe指令指示对第一目的地址执行读/写操作;
    处理单元,用于基于所述第一目的地址获取第一远程直接内存访问RDMA存储节点的信息;
    发送单元,用于向所述第一RDMA存储节点发送第一RDMA请求报文,所述第一RDMA请求报文携带所述NVMe指令对应的RDMA指令。
  20. 根据权利要求19所述的装置,其特征在于,所述处理单元,用于基于所述第一目的地址,从第一对应关系中查询得到所述第一RDMA存储节点的信息,所述第一对应关系指示所述第一目的地址以及所述第一RDMA存储节点的信息之间的对应关系。
  21. 根据权利要求19或20所述的装置,其特征在于,所述第一RDMA存储节点的信息包括以下信息中的至少一个:第二目的地址、所述第一RDMA存储节点的网络位置信息、所述第一RDMA存储节点中一个或多个队列对QP的标识和远端秘钥R_Key,所述第二目的地址指向所述第一RDMA存储节点中的内存空间,所述R_Key指示访问所述第一RDMA存储节点的内存的权限。
  22. 根据权利要求21所述的装置,其特征在于,所述网络位置信息包括介质访问控制层MAC地址、网际协议IP地址、多协议标签交换MPLS标签或者段标识SID中至少一项。
  23. 根据权利要求19至22中任一项所述的装置,其特征在于,
    所述接收单元,还用于接收来自所述第一RDMA存储节点的RDMA响应报文,所述RDMA响应报文包括所述第一RDMA存储节点执行所述RDMA指令得到的结果;
    所述处理单元,还用于基于所述RDMA响应报文生成第一NOF响应报文;
    所述发送单元,还用于向所述客户端发送所述第一NOF响应报文。
  24. 根据权利要求23所述的装置,其特征在于,所述处理单元,用于基于所述RDMA响应报文获得RDMA状态信息,所述RDMA状态信息指示所述RDMA响应报文与所述第一RDMA请求报文之间的对应关系;根据所述RDMA状态信息,从第二对应关系中查询得到NOF状态信息,所述第二对应关系包括所述RDMA状态信息与所述NOF状态信息之间的对应关系,所述NOF状态信息指示所述第一NOF响应报文与所述第一NOF请求报文之间的对应关系;基于所述NOF状态信息生成所述第一NOF响应报文。
  25. 根据权利要求24所述的装置,其特征在于,所述处理单元,用于基于所述第一NOF请求报文获得所述NOF状态信息;建立所述第二对应关系,所述第二对应关系为所述NOF状态信息与所述RDMA状态信息之间的对应关系。
  26. 根据权利要求23所述的装置,其特征在于,所述处理单元,用于基于所述RDMA响应报文中的NOF状态信息生成所述第一NOF响应报文。
  27. 根据权利要求23所述的装置,其特征在于,所述第一RDMA请求报文包括第一NOF报文头,所述RDMA响应报文包括所述第一RDMA存储节点基于所述第一NOF报文头生成的第二NOF报文头,所述第一NOF响应报文包括所述第二NOF报文头。
  28. 根据权利要求20至22中任一项所述的装置,其特征在于,所述第一对应关系还包括第二RDMA存储节点的信息;
    所述发送单元,还用于在所述NVMe指令指示写操作的情况下,向所述第二RDMA存储节点发送第二RDMA请求报文,所述第二RDMA请求报文携带所述NVMe指令对应的RDMA指令。
  29. 根据权利要求20至22中任一项所述的装置,其特征在于,所述第一对应关系还包括第二RDMA存储节点的信息,所述处理单元,还用于在所述NVMe指令指示读操作的情况下,基于负载分担算法,从所述第一RDMA存储节点和所述第二RDMA存储节点中选择所述第一RDMA存储节点。
  30. 根据权利要求20至22中任一项所述的装置,其特征在于,所述第一对应关系是所述接收单元从所述网关设备之外的其他设备接收的,或者,所述第一对应关系是所述处理单元生成的。
  31. 根据权利要求19至30中任一项所述的装置,其特征在于,
    所述处理单元,还用于基于所述第一目的地址获取NOF存储节点的信息;基于所述第一NOF请求报文生成第二NOF请求报文;
    所述发送单元,还用于向所述NOF存储节点发送所述第二NOF请求报文。
  32. 根据权利要求31所述的装置,其特征在于,
    所述接收单元,还用于接收来自所述NOF存储节点的第二NOF响应报文;
    所述处理单元,还用于基于所述第二NOF响应报文生成第三NOF响应报文;
    所述发送单元,还用于向所述客户端发送所述第三NOF响应报文。
  33. 根据权利要求19-32中任一项所述的装置,其特征在于,所述第一RDMA存储节点为存储服务器、内存或者存储阵列。
  34. 一种网关设备,其特征在于,所述网关设备包括处理器和网络接口,所述处理器与存储器耦合,所述网络接口用于接收或发送报文,所述存储器中存储有至少一条计算机程序指令,所述至少一条计算机程序指令由所述处理器加载并执行,以使所述网关设备实现权利要求1至18中任一项所述的方法。
  35. 一种存储系统,其特征在于,所述存储系统包括如权利要求34所述的网关设备以及一个或多个RDMA存储节点。
  36. 一种计算机可读存储介质,其特征在于,所述存储介质中存储有至少一条指令,所述指令在计算机上运行时,使得计算机执行如权利要求1至权利要求18中任一项所述的方法。
  37. 一种计算机程序产品,其特征在于,所述计算机程序产品包括一个或多个计算机程序指令,当所述计算机程序指令被计算机加载并运行时,使得所述计算机执行权利要求1至权利要求18中任一项所述的方法。
PCT/CN2023/071947 2022-01-30 2023-01-12 报文处理方法、网关设备及存储系统 WO2023143103A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210114823.1 2022-01-30
CN202210114823.1A CN116566933A (zh) 2022-01-30 2022-01-30 报文处理方法、网关设备及存储系统

Publications (1)

Publication Number Publication Date
WO2023143103A1 true WO2023143103A1 (zh) 2023-08-03

Family

ID=87470454

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/071947 WO2023143103A1 (zh) 2022-01-30 2023-01-12 报文处理方法、网关设备及存储系统

Country Status (2)

Country Link
CN (1) CN116566933A (zh)
WO (1) WO2023143103A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180067685A1 (en) * 2015-11-19 2018-03-08 Huawei Technologies Co., Ltd. Method for Implementing NVME over Fabrics, Terminal, Server, and System
CN110199270A (zh) * 2017-12-26 2019-09-03 华为技术有限公司 存储系统中存储设备的管理方法及装置
CN108829353A (zh) * 2018-06-15 2018-11-16 郑州云海信息技术有限公司 一种基于NVMe的网络化存储系统及方法
CN113688072A (zh) * 2020-05-19 2021-11-23 华为技术有限公司 数据处理方法及设备
US20200319812A1 (en) * 2020-06-03 2020-10-08 Intel Corporation Intermediary for storage command transfers

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117742606A (zh) * 2023-12-20 2024-03-22 成都北中网芯科技有限公司 一种dpu芯片及其报文存储模块和报文复制方法

Also Published As

Publication number Publication date
CN116566933A (zh) 2023-08-08

Similar Documents

Publication Publication Date Title
CN108701004B (zh) 一种数据处理的系统、方法及对应装置
JP4651692B2 (ja) ネットワークトラフィックのインテリジェントロードバランシング及びフェイルオーバー
JP4722157B2 (ja) ネットワークトラフィックのインテリジェントロードバランシング及びフェイルオーバー
US7996569B2 (en) Method and system for zero copy in a virtualized network environment
JP4840943B2 (ja) ネットワークトラフィックのインテリジェントロードバランシング及びフェイルオーバー
US8180949B1 (en) Resource virtualization switch
US9450780B2 (en) Packet processing approach to improve performance and energy efficiency for software routers
JP2022122873A (ja) 高性能コンピューティング環境においてパーティションメンバーシップに関連して定義されるマルチキャストグループメンバーシップを提供するシステムおよび方法
US9813283B2 (en) Efficient data transfer between servers and remote peripherals
US10466935B2 (en) Methods for sharing NVM SSD across a cluster group and devices thereof
US10348624B2 (en) Virtual machine data flow management method and system
US10541928B2 (en) Policy aware framework for application input output management
EP4318251A1 (en) Data access system and method, and device and network card
JP5994190B2 (ja) パケット転送方法およびシステム
WO2023143103A1 (zh) 报文处理方法、网关设备及存储系统
WO2020134144A1 (zh) 数据或报文转发的方法、节点和系统
CN110688237B (zh) 转发报文的方法、中间设备和计算机设备
EP3086512B1 (en) Implementation method and apparatus for vlan to access vf network and fcf
US20240048612A1 (en) Computing Node Management System and Method for Managing a Plurality of Computing Nodes
CN110471627B (zh) 一种共享存储的方法、系统及装置
CN113835618A (zh) 数据存储装置、存储系统和提供虚拟化存储的方法
WO2022267909A1 (zh) 一种数据读写方法以及相关装置
US7460528B1 (en) Processing data packets at a storage service module of a switch
US7382776B1 (en) Performing block storage virtualization at a switch
WO2020215455A1 (zh) 一种基于Virtio端口传输数据的方法和系统

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23745986

Country of ref document: EP

Kind code of ref document: A1