CN111459417B - Lock-free transmission method and system for NVMeoF storage network - Google Patents


Info

Publication number
CN111459417B
Authority
CN
China
Prior art keywords
nvmeof
queue
linked list
head node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010338868.8A
Other languages
Chinese (zh)
Other versions
CN111459417A (en
Inventor
李琼
宋振龙
赵曦
谢徐超
谢旻
袁远
黎铁军
肖立权
魏登萍
任静
李世杰
陈浩稳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202010338868.8A priority Critical patent/CN111459417B/en
Publication of CN111459417A publication Critical patent/CN111459417A/en
Application granted granted Critical
Publication of CN111459417B publication Critical patent/CN111459417B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The application discloses a lock-free transmission method and system for an NVMeoF storage network. The host side creates NVMeoF queues equal in number to the CPU cores and allocates a blank memory region in array format for each NVMeoF queue. When a command packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; each NVMeoF queue is polled, and the buffered command packets are sent to the target end over the network. By adding a linked-list buffer to each NVMeoF queue and polling the multiple linked lists, the application relieves the strong contention of multiple NVMeoF queues for a single send-list lock and resolves the I/O bottleneck under high I/O pressure.

Description

Lock-free transmission method and system for NVMeoF storage network
Technical Field
The application relates to remote storage and storage network technology, and in particular to a lock-free transmission method and system for an NVMeoF storage network.
Background
The NVMe protocol is designed for new high-speed non-volatile memories (such as flash and 3D XPoint). The efficient combination of the PCIe interface and the NVMe protocol reduces I/O protocol-stack overhead and storage access latency while improving I/O throughput and bandwidth, and has been widely adopted in data centers.
However, limited by the scalability of the PCIe bus, the NVMe protocol is not suitable for large-scale remote storage access across a network. The NVMeoF storage network protocol, which extends NVMe over RDMA networks, was therefore created; it lets a host communicate with remote NVMe devices in a storage system through various network fabrics, provides an effective technical approach to building high-performance, easily scalable network storage for data centers, and represents the future development trend.
NVMeoF (NVMe over Fabrics) can be implemented over different link-layer and physical-layer protocols; carrier networks include InfiniBand, RoCE, iWARP and Fibre Channel, as well as RDMA networks based on custom protocols, such as the custom high-speed interconnect used in the Tianhe supercomputer.
The I/O command transfer flow of RDMA-based NVMeoF network storage is shown in FIG. 1. The command packet (CMD Capsule) is the general name for the I/O request command packet in FIG. 1; in addition to the basic command ID, opcode, buffer address and command parameters, it may carry an optional additional SGL or command data. The response packet (RSP Capsule) is the general name for the I/O response packet in FIG. 1; in addition to basic command parameters, the SQ (submission queue) head pointer, command status and command ID, it may carry optional command data. An NVMeoF queue (nvme_fabrics_queue) encapsulates information such as the queue ID, queue size and NVMe command messages; the send list (send_list) is a linked list used to store pointers to command packets or response packets.
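For orientation, the capsule fields listed above can be modeled as plain records. This is a hypothetical Python sketch: field names follow the description in this document, not the exact NVMe-oF wire format.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CmdCapsule:
    """I/O request command packet (CMD Capsule), per the fields named above."""
    command_id: int               # basic command ID
    opcode: int                   # operation code
    buffer_addr: int              # cache/buffer address
    params: dict = field(default_factory=dict)  # command parameters
    sgl: Optional[bytes] = None   # optional additional SGL descriptor
    data: Optional[bytes] = None  # optional in-capsule command data


@dataclass
class RspCapsule:
    """I/O response packet (RSP Capsule), per the fields named above."""
    command_id: int               # command ID being answered
    sq_head: int                  # submission-queue head pointer
    status: int                   # command status
    params: dict = field(default_factory=dict)  # command parameters
    data: Optional[bytes] = None  # optional command data
```

The optional fields default to `None`, mirroring the "optional additional SGL or command data" wording in the description.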
The number of NVMeoF queues created is based on the number of cores of the current server CPU; with a multi-core CPU, the traditional connection model is a "multiple producers, single consumer" model, generally implemented with a single send linked list, as shown in FIG. 2. I/O command packets sent by the multiple NVMeoF queues on the host (Host) side must all go through the interconnect network interface card (network card for short), so the send list must be locked; since there is only one send list, every request must take the list lock to guarantee mutual exclusion when entering or leaving the list. Each command packet can be appended to the tail of the send list only after the lock is acquired, where it waits for the network card to transmit it. Similarly, I/O response packets generated by target-side queues must acquire the lock before being inserted into the send list. With many cores and processes, contention for the list lock is very frequent and seriously degrades I/O request processing efficiency, so the rate at which the host sends requests to the target and the target sends responses back to the host is too slow, and the performance of the underlying high-speed NVMe storage devices cannot be fully exploited. Lock contention is especially acute under high I/O pressure, creating a serious I/O bottleneck.
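The "multiple producers, single consumer" bottleneck can be reproduced in miniature: every producer thread below must take the same lock to append to the one shared send list, so all enqueues serialize on that lock. This is an illustrative Python sketch of the contended design, not the kernel implementation.

```python
import threading
from collections import deque

send_list = deque()                # the single shared send list
send_list_lock = threading.Lock()  # every NVMeoF queue contends for this one lock


def producer(queue_id, n_packets):
    """One simulated NVMeoF queue appending command packets to the shared list."""
    for i in range(n_packets):
        pkt = (queue_id, i)
        with send_list_lock:       # mutual exclusion on entering the list
            send_list.append(pkt)


# One producer thread per simulated NVMeoF queue (8 "cores")
threads = [threading.Thread(target=producer, args=(q, 100)) for q in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The single consumer (the "network card") drains the list under the same lock
sent = []
while True:
    with send_list_lock:           # mutual exclusion on leaving the list
        if not send_list:
            break
        sent.append(send_list.popleft())
```

Every `append` and `popleft` holds `send_list_lock`; with more producer threads the fraction of time spent waiting on the lock grows, which is exactly the contention the patent removes.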
Disclosure of Invention
The technical problem to be solved by the application is as follows: in view of the prior-art problem that strong lock contention causes an I/O bottleneck under high I/O pressure, the application provides a lock-free transmission method and system for an NVMeoF storage network.
In order to solve the above technical problem, the application adopts the following technical scheme:
A lock-free transmission method for an NVMeoF storage network, comprising the following implementation steps:
1) The host side creates NVMeoF queues equal in number to the CPU cores, and allocates a blank memory region in array format for each NVMeoF queue;
2) When a command packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; each NVMeoF queue is polled, and the buffered command packets are sent to the target end over the network.
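Steps 1) and 2) amount to: one queue per CPU core, each backed by its own pre-allocated array of packet slots and its own linked-list buffer. A hypothetical Python sketch follows; the slot count and slot size are illustrative values, not taken from the patent.

```python
import os

SLOTS_PER_QUEUE = 128   # illustrative queue depth (assumption)
SLOT_SIZE = 64          # illustrative capsule slot size in bytes (assumption)


def create_nvmeof_queues(num_cores=None):
    """Create one NVMeoF queue per CPU core, each with a blank array-format
    memory region divided into fixed-size slots and an independent send list."""
    num_cores = num_cores or os.cpu_count()
    queues = []
    for qid in range(num_cores):
        queues.append({
            "qid": qid,
            # step 1): a section of blank memory in array format
            "slots": [bytearray(SLOT_SIZE) for _ in range(SLOTS_PER_QUEUE)],
            # step 2): the queue's own linked-list buffer of pending packets
            "send_list": [],
        })
    return queues


queues = create_nvmeof_queues(4)
```

Because every queue owns its slots and its send list outright, no structure is shared between cores, which is what makes the later per-queue polling lock-free.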
Optionally, buffering through an independent linked list in step 2) specifically means buffering the command packet held in the array-format blank memory of the NVMeoF queue at the tail of the send linked list corresponding to that NVMeoF queue; sending the buffered command packets to the target end over the network specifically means sending the command packet at the head of the send linked list to the target end over the network.
Optionally, the host side further performs the step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue is allocated, it is added to a management linked list; 2. when a command packet arrives, the head node of the management linked list is taken out and deleted from the management linked list; if the management linked list is empty because the I/O pressure is too high, a temporary memory region is allocated to store the command packet; 3. the head node taken from the management linked list is assigned, and the message content of the command packet is stored at the address the head node points to; 4. the assigned head node, which is now on no management linked list, is added to the tail of the send linked list corresponding to the NVMeoF queue; 5. when the network card polls an NVMeoF queue, it takes the head node of the send linked list in that queue and deletes that node from the send linked list; 6. the network card transmits the message content stored at the address the head node points to; 7. after transmission completes, the memory the head node points to is cleared, and the head node is re-added to the tail of the management linked list so that the allocated memory can be reused; if the address in the head node is a temporarily allocated memory address, it is released immediately after the network card finishes transmitting.
Optionally, step 1) further includes a step in which the target end initializes its NVMeoF queues: it creates the same number of NVMeoF queues as the host side and allocates a blank memory region in array format for each NVMeoF queue.
Optionally, step 2) further includes a step in which the target end sends a response packet after receiving a command packet: when the response packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; each NVMeoF queue is polled, and the buffered response packets are sent to the host end over the network.
Optionally, when the target end sends a response packet after receiving a command packet, buffering through an independent linked list specifically means buffering the response packet held in the array-format blank memory of the NVMeoF queue at the tail of the send linked list corresponding to that NVMeoF queue; sending the buffered response packets to the host end over the network specifically means sending the response packet at the head of the send linked list to the host end over the network.
Optionally, the target end further performs the step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue is allocated, it is added to a management linked list; 2. when a response packet arrives, the head node of the management linked list is taken out and deleted from the management linked list; if the management linked list is empty because the I/O pressure is too high, a temporary memory region is allocated to store the response packet; 3. the head node taken from the management linked list is assigned, and the message content of the response packet is stored at the address the head node points to; 4. the assigned head node, which is now on no management linked list, is added to the tail of the send linked list corresponding to the NVMeoF queue; 5. when the network card polls an NVMeoF queue, it takes the head node of the send linked list in that queue and deletes that node from the send linked list; 6. the network card transmits the message content stored at the address the head node points to; 7. after transmission completes, the memory the head node points to is cleared, and the head node is re-added to the tail of the management linked list so that the allocated memory can be reused; if the address in the head node is a temporarily allocated memory address, it is released immediately after the network card finishes transmitting.
In addition, the application further provides a lock-free transmission system for an NVMeoF storage network, comprising a host end and a target end, wherein the host end is programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network, or the target end is programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network.
The application further provides a lock-free transmission system for an NVMeoF storage network, comprising a computing device, wherein the computing device is programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network, or a memory of the computing device stores a computer program programmed or configured to execute the aforementioned lock-free transmission method for an NVMeoF storage network.
Furthermore, the application provides a computer-readable storage medium storing a computer program programmed or configured to execute the aforementioned lock-free transmission method for an NVMeoF storage network.
Compared with the prior art, the application has the following advantages: the host side creates NVMeoF queues equal in number to the CPU cores and allocates a blank memory region in array format for each NVMeoF queue; when a command packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; the network card polls each NVMeoF queue and sends the buffered command packets to the target end over the network. By adding a linked-list buffer to each NVMeoF queue and polling the multiple linked lists, the application relieves the strong contention of multiple NVMeoF queues for a single send-list lock and resolves the I/O bottleneck under high I/O pressure.
Drawings
FIG. 1 is a schematic diagram of the conventional NVMeoF I/O command processing flow.
FIG. 2 shows the conventional I/O message transmission scheme of NVMeoF.
FIG. 3 is a schematic diagram of the basic principle of the method according to an embodiment of the application.
FIG. 4 shows the "single producer, single consumer" model of NVMeoF I/O messaging according to an embodiment of the application.
FIG. 5 is a flow chart of managing the array with a linked list according to an embodiment of the application.
FIG. 6 shows the improved NVMeoF I/O messaging mechanism according to an embodiment of the application.
Detailed Description
As shown in FIG. 3 and FIG. 4, the implementation steps of the lock-free transmission method for an NVMeoF storage network in this embodiment include:
1) The host side creates NVMeoF queues equal in number to the CPU cores, and allocates a blank memory region in array format for each NVMeoF queue;
2) When a command packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; the network card polls each NVMeoF queue and sends the buffered command packets to the target end over the network.
As shown in FIG. 3 and FIG. 4, in the lock-free transmission method of this embodiment, when the host (Host) and the target (Target) establish a connection, the host creates NVMeoF queues according to the number of CPU cores, and the target creates a corresponding number of queues according to the number of NVMeoF queues on the host side. Each queue allocates a blank memory region in array format when it is created; when a command packet or response packet arrives at a queue, the message is stored in that queue's own memory. The network card polls each queue and sends the buffered messages one by one, so each connection follows a "single producer, single consumer" model. I/O commands never contend for a send-list lock, avoiding the inefficiency caused by intense lock contention.
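With one send list per queue, each (queue, network card) pairing is a single-producer/single-consumer channel, so no shared lock is needed. The round-robin poll can be sketched as follows (illustrative Python; a deque per queue stands in for the per-queue linked list):

```python
from collections import deque

NUM_QUEUES = 4
# One independent send list per NVMeoF queue: single producer (the queue's
# CPU core) and single consumer (the polling network card) -- no shared lock.
per_queue_lists = [deque() for _ in range(NUM_QUEUES)]


def enqueue(qid, packet):
    """Producer side: each core appends only to its own queue's send list."""
    per_queue_lists[qid].append(packet)


def poll_and_send():
    """Consumer side: the network card round-robins over all queues and
    transmits at most one buffered packet from each queue per pass."""
    sent = []
    for q in per_queue_lists:
        if q:
            sent.append(q.popleft())
    return sent


for qid in range(NUM_QUEUES):
    enqueue(qid, f"cmd-{qid}")
first_pass = poll_and_send()
```

Contrast with the single shared send list of FIG. 2: here `enqueue` and `poll_and_send` never touch the same list from two writers, which is the essence of the "single producer, single consumer" model.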
As an optional implementation, this embodiment organizes the command packets to be sent by each NVMeoF queue with a send linked list, improving the polling efficiency of the network card. In this embodiment, buffering through an independent linked list in step 2) specifically means buffering the command packet held in the array-format blank memory of the NVMeoF queue at the tail of the send linked list corresponding to that NVMeoF queue; sending the buffered command packets to the target end over the network specifically means sending the command packet at the head of the send linked list to the target end over the network.
When the I/O pressure from upper layers is too high, the host side may receive messages faster than the network card can send them, causing memory overflow. To solve this problem, as shown in FIG. 5, the host side further performs the step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue is allocated, it is added to a management linked list; 2. when a command packet arrives, the head node of the management linked list is taken out and deleted from the management linked list; if the management linked list is empty because the I/O pressure is too high, a temporary memory region is allocated to store the command packet; 3. the head node taken from the management linked list is assigned, and the message content of the command packet is stored at the address the head node points to; 4. the assigned head node, which is now on no management linked list, is added to the tail of the send linked list corresponding to the NVMeoF queue; 5. when the network card polls an NVMeoF queue, it takes the head node of the send linked list in that queue and deletes that node from the send linked list; 6. the network card transmits the message content stored at the address the head node points to; 7. after transmission completes, the memory the head node points to is cleared, and the head node is re-added to the tail of the management linked list so that the allocated memory can be reused; if the address in the head node is a temporarily allocated memory address, it is released immediately after the network card finishes transmitting.
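The seven management steps above amount to recycling fixed slots through a free list that feeds a send list, with temporary allocation as overflow relief. Below is a simplified Python model of that lifecycle; the slot count, slot size and the `temporary` flag are illustrative, not part of the patent's text.

```python
from collections import deque


class QueueMemory:
    """Per-queue slot manager: pre-allocated slots cycle between a management
    (free) linked list and a send linked list; temporary buffers relieve overflow."""

    def __init__(self, num_slots=4, slot_size=64):
        # Step 1: all pre-allocated array slots start on the management list
        self.free_list = deque(bytearray(slot_size) for _ in range(num_slots))
        self.send_list = deque()
        self.slot_size = slot_size

    def on_packet(self, payload: bytes):
        # Step 2: take the head node off the management list, or allocate a
        # temporary buffer if high I/O pressure has emptied it
        if self.free_list:
            slot, temporary = self.free_list.popleft(), False
        else:
            slot, temporary = bytearray(self.slot_size), True
        # Step 3: store the message content at the node's memory
        slot[:len(payload)] = payload
        # Step 4: append the filled node to the tail of the send list
        self.send_list.append((slot, temporary))

    def poll_send(self):
        # Steps 5-6: the network card takes the head node and transmits it
        if not self.send_list:
            return None
        slot, temporary = self.send_list.popleft()
        sent = bytes(slot).rstrip(b"\x00")
        # Step 7: clear and recycle pre-allocated slots; temporary buffers
        # are simply released (dropped) after transmission
        if not temporary:
            slot[:] = bytes(len(slot))
            self.free_list.append(slot)
        return sent
```

With `num_slots=1`, a second packet arriving before the first is sent exercises the temporary-allocation path, and after both transmissions the free list is back to its original size, mirroring the "repeatedly available" guarantee of step 7.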
As an alternative implementation, the criterion for judging that the I/O pressure is too high may be chosen as required, and any other feasible I/O pressure metric may likewise be selected as needed.
Referring to FIG. 4, step 1) in this embodiment further includes the step in which the target end initializes its NVMeoF queues: it creates the same number of NVMeoF queues as the host side and allocates a blank memory region in array format for each NVMeoF queue.
In this embodiment, step 2) further includes the step in which the target end sends a response packet after receiving a command packet: when the response packet arrives, it is placed in the array-format blank memory corresponding to its NVMeoF queue and buffered through an independent linked list; each NVMeoF queue is polled, and the buffered response packets are sent to the host end over the network.
In this embodiment, when the target end sends a response packet after receiving a command packet, buffering through an independent linked list specifically means buffering the response packet held in the array-format blank memory of the NVMeoF queue at the tail of the send linked list corresponding to that NVMeoF queue; sending the buffered response packets to the host end over the network specifically means sending the response packet at the head of the send linked list to the host end over the network.
When the I/O pressure from upper layers is too high, the target side may receive messages faster than the network card can send them, causing memory overflow. To solve this problem, referring to FIG. 5, the target end in this embodiment further performs the step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue is allocated, it is added to a management linked list; 2. when a response packet arrives, the head node of the management linked list is taken out and deleted from the management linked list; if the management linked list is empty because the I/O pressure is too high, a temporary memory region is allocated to store the response packet; 3. the head node taken from the management linked list is assigned, and the message content of the response packet is stored at the address the head node points to; 4. the assigned head node, which is now on no management linked list, is added to the tail of the send linked list corresponding to the NVMeoF queue; 5. when the network card polls an NVMeoF queue, it takes the head node of the send linked list in that queue and deletes that node from the send linked list; 6. the network card transmits the message content stored at the address the head node points to; 7. after transmission completes, the memory the head node points to is cleared, and the head node is re-added to the tail of the management linked list so that the allocated memory can be reused; if the address in the head node is a temporarily allocated memory address, it is released immediately after the network card finishes transmitting.
In summary, the network card "single producer, single consumer" model finally obtained by the lock-free transmission method of this embodiment is shown in FIG. 6. As FIG. 6 shows, each NVMeoF queue corresponds to its own send linked list, so multiple NVMeoF queues never compete for one send linked list, realizing the "single producer, single consumer" model for the network card. Moreover, when memory runs short, both the host end and the target end can temporarily allocate a region of memory and release it after use, effectively solving the memory overflow problem. By adding a linked-list buffer to each NVMeoF queue and polling the multiple linked lists, the method relieves the strong contention of multiple NVMeoF queues for a single send-list lock and resolves the I/O bottleneck under high I/O pressure.
In addition, this embodiment provides a lock-free transmission system for an NVMeoF storage network, comprising a host end and a target end, wherein the host end is programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network, or the target end is programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network.
In addition, this embodiment provides a lock-free transmission system for an NVMeoF storage network, comprising a computing device programmed or configured to execute the steps of the aforementioned lock-free transmission method for an NVMeoF storage network.
In addition, this embodiment provides a lock-free transmission system for an NVMeoF storage network, comprising a computing device whose memory stores a computer program programmed or configured to execute the aforementioned lock-free transmission method for an NVMeoF storage network.
Furthermore, this embodiment provides a computer-readable storage medium storing a computer program programmed or configured to perform the aforementioned lock-free transmission method for an NVMeoF storage network.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM and optical storage) containing computer-usable program code. The application is described with reference to flowcharts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the application; it should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or another programmable data processing device to produce a machine, such that the instructions executed by the processor produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or another programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams. These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only a preferred embodiment of the application, and the protection scope of the application is not limited to the above examples; all technical solutions falling under the concept of the application belong to its protection scope. It should be noted that modifications and adaptations that do not depart from the principles of the application will occur to those skilled in the art, and such modifications and adaptations are also within the protection scope of the application.

Claims (9)

1. A lock-free transmission method for an NVMeoF storage network, characterized by comprising the following implementation steps:
1) the host side creates NVMeoF queues equal in number to the CPU cores, and applies for a section of blank memory in array format for each NVMeoF queue;
2) when a command packet arrives, the command packet is added into the array-format blank memory corresponding to the NVMeoF queue and is cached through an independent linked list; each NVMeoF queue is polled, and the cached command packets are sent to the target end through the network;
the host side further comprises a step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue has been applied for, it is added to a management linked list; 2. when a command packet arrives, the head node of the management linked list is taken out and then deleted from the management linked list; when I/O pressure is excessive and the management linked list is empty, a temporary memory space is applied for to store the command packet; 3. the head node taken from the management linked list is assigned a value, and the message content of the command packet is stored at the address pointed to by the head node; 4. the assigned head node, now in no management linked list, is added to the tail of the transmission linked list corresponding to the NVMeoF queue; 5. when the network card polls a given NVMeoF queue, the head node is taken out of the transmission linked list of that NVMeoF queue and, once taken out, is deleted from the transmission linked list; 6. the network card sends the message content stored at the address pointed to by the head node over the network; 7. after transmission is completed, the address pointed to by the head node is cleared and re-added to the tail of the management linked list, ensuring that the applied-for memory is reusable; if the address in the head node is a temporarily applied-for memory address, that address is released immediately after the network card completes transmission.
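The per-queue bookkeeping described in claim 1 (sub-steps 1-7) can be sketched as follows. This is an illustrative model only, not the patented implementation: all identifiers (`queue_t`, `queue_submit`, `queue_poll`, the pool size of 4, and the stand-in for the network card send) are hypothetical. It pre-allocates an array of nodes onto a management (free) linked list, moves a node to the transmission linked list while a packet is in flight, falls back to a temporary allocation when the pool is exhausted, and recycles or frees the node after the send completes:

```c
/* Hypothetical sketch of claim 1's linked-list memory management;
   the patent publishes no source code, so every name here is invented. */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 4
#define MSG_LEN   64

typedef struct node {
    char         msg[MSG_LEN]; /* "address pointed to by the head node" */
    int          is_temp;      /* 1 if allocated outside the array pool */
    struct node *next;
} node_t;

typedef struct { node_t *head, *tail; } list_t; /* singly linked list */

static void list_push_tail(list_t *l, node_t *n) {
    n->next = NULL;
    if (l->tail) l->tail->next = n; else l->head = n;
    l->tail = n;
}

static node_t *list_pop_head(list_t *l) {
    node_t *n = l->head;
    if (!n) return NULL;
    l->head = n->next;
    if (!l->head) l->tail = NULL;
    return n;
}

typedef struct {
    node_t pool[POOL_SIZE];  /* blank memory in array format (step 1)   */
    list_t mgmt;             /* management linked list of free nodes    */
    list_t send;             /* transmission linked list                */
} queue_t;

static void queue_init(queue_t *q) {
    memset(q, 0, sizeof *q);
    for (int i = 0; i < POOL_SIZE; i++)        /* sub-step 1: add every */
        list_push_tail(&q->mgmt, &q->pool[i]); /* array slot to mgmt    */
}

/* sub-steps 2-4: pop a free node (or a temporary one under I/O
   pressure), copy the packet into it, append it to the send list */
static void queue_submit(queue_t *q, const char *pkt) {
    node_t *n = list_pop_head(&q->mgmt);
    if (!n) {                                  /* management list empty */
        n = calloc(1, sizeof *n);
        n->is_temp = 1;
    }
    strncpy(n->msg, pkt, MSG_LEN - 1);
    list_push_tail(&q->send, n);
}

/* sub-steps 5-7: take the send-list head, "transmit" it, then clear
   and recycle the node (or free a temporary one) */
static int queue_poll(queue_t *q, char *out) {
    node_t *n = list_pop_head(&q->send);
    if (!n) return 0;
    strncpy(out, n->msg, MSG_LEN);             /* stand-in for NIC send */
    if (n->is_temp) { free(n); }
    else { memset(n->msg, 0, MSG_LEN); list_push_tail(&q->mgmt, n); }
    return 1;
}
```

Because each NVMeoF queue is owned by a single CPU core (one submitter, one poller), the two linked lists of a given queue are never touched concurrently, which is what lets the scheme dispense with locks.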
2. The lock-free transmission method for an NVMeoF storage network according to claim 1, wherein in step 2), caching through an independent linked list specifically means that a command packet in the array-format blank memory corresponding to an NVMeoF queue is cached at the tail of the transmission linked list corresponding to that NVMeoF queue; sending the cached command packet to the target end through the network specifically means that the command packet at the head of the transmission linked list is sent to the target end through the network.
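The per-core queue creation of step 1) and the round-robin polling of step 2) can likewise be sketched under the same caveat (illustrative only; the names and the flat ring buffer standing in for the array-format memory are hypothetical, chosen to make the per-core ownership visible):

```c
/* Hypothetical sketch: one queue per CPU core, drained by a poller. */
#include <assert.h>
#include <stdlib.h>

#define NCORES 4
#define QDEPTH 8

typedef struct {
    int slots[QDEPTH];   /* stand-in for the array-format blank memory */
    int head, tail;      /* single producer (the core), single consumer */
} core_queue_t;

/* step 1: create as many queues as CPU cores, each with its own memory */
static core_queue_t *create_queues(int ncores) {
    return calloc((size_t)ncores, sizeof(core_queue_t));
}

/* the owning core appends to its own queue only: no lock needed */
static int enqueue(core_queue_t *q, int cmd) {
    if (q->tail - q->head == QDEPTH) return 0;  /* queue full */
    q->slots[q->tail % QDEPTH] = cmd;
    q->tail++;
    return 1;
}

/* step 2: poll each queue in turn, draining one entry per pass */
static int poll_once(core_queue_t *qs, int ncores, int *out) {
    int n = 0;
    for (int c = 0; c < ncores; c++) {
        if (qs[c].head < qs[c].tail)
            out[n++] = qs[c].slots[qs[c].head++ % QDEPTH];
    }
    return n;
}
```

Since core `c` only ever writes `qs[c]` and the poller only ever advances `head`, no lock protects the shared array — the same rationale the claims give for creating one NVMeoF queue per CPU core.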
3. The lock-free transmission method for an NVMeoF storage network according to claim 1, wherein step 1) further comprises a step of initializing NVMeoF queues at the target end: creating the same NVMeoF queues as those at the host end, and applying for a section of blank memory in array format for each NVMeoF queue.
4. The lock-free transmission method for an NVMeoF storage network according to claim 3, wherein step 2) further comprises a step of the target end sending a response packet after receiving the command packet: when the response packet arrives, the response packet is added into the array-format blank memory corresponding to the NVMeoF queue and is cached through an independent linked list; each NVMeoF queue is polled, and the cached response packets are sent to the host end through the network.
5. The lock-free transmission method for an NVMeoF storage network according to claim 4, wherein, when the target end sends a response packet after receiving the command packet, caching through an independent linked list specifically means that the response packet in the array-format blank memory corresponding to the NVMeoF queue is cached at the tail of the transmission linked list corresponding to that NVMeoF queue; sending the cached response packet to the host end through the network specifically means that the response packet at the head of the transmission linked list is sent to the host end through the network.
6. The lock-free transmission method for an NVMeoF storage network according to claim 5, wherein the target end further comprises a step of managing the array-format blank memory of each NVMeoF queue with a linked list: 1. after the array-format blank memory of each NVMeoF queue has been applied for, it is added to a management linked list; 2. when a response packet arrives, the head node of the management linked list is taken out and then deleted from the management linked list; when I/O pressure is excessive and the management linked list is empty, a temporary memory space is applied for to store the response packet; 3. the head node taken from the management linked list is assigned a value, and the message content of the response packet is stored at the address pointed to by the head node; 4. the assigned head node, now in no management linked list, is added to the tail of the transmission linked list corresponding to the NVMeoF queue; 5. when the network card polls a given NVMeoF queue, the head node is taken out of the transmission linked list of that NVMeoF queue and, once taken out, is deleted from the transmission linked list; 6. the network card sends the message content stored at the address pointed to by the head node over the network; 7. after transmission is completed, the address pointed to by the head node is cleared and re-added to the tail of the management linked list, ensuring that the applied-for memory is reusable; if the address in the head node is a temporarily applied-for memory address, that address is released immediately after the network card completes transmission.
7. A lock-free transmission system for an NVMeoF storage network, comprising a host side and a target side, wherein the host side is programmed or configured to perform the steps of the lock-free transmission method for an NVMeoF storage network according to claim 1 or 2, or the target side is programmed or configured to perform the steps of the lock-free transmission method for an NVMeoF storage network according to any one of claims 3-6.
8. A lock-free transmission system for an NVMeoF storage network, comprising a computing device, wherein the computing device is programmed or configured to perform the steps of the lock-free transmission method for an NVMeoF storage network according to any one of claims 1-6, or a memory of the computing device stores a computer program programmed or configured to perform the lock-free transmission method for an NVMeoF storage network according to any one of claims 1-6.
9. A computer-readable storage medium having stored thereon a computer program programmed or configured to perform the lock-free transmission method for an NVMeoF storage network according to any one of claims 1-6.
CN202010338868.8A 2020-04-26 2020-04-26 Non-lock transmission method and system for NVMeoF storage network Active CN111459417B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010338868.8A CN111459417B (en) 2020-04-26 2020-04-26 Non-lock transmission method and system for NVMeoF storage network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010338868.8A CN111459417B (en) 2020-04-26 2020-04-26 Non-lock transmission method and system for NVMeoF storage network

Publications (2)

Publication Number Publication Date
CN111459417A CN111459417A (en) 2020-07-28
CN111459417B true CN111459417B (en) 2023-08-18

Family

ID=71683815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010338868.8A Active CN111459417B (en) 2020-04-26 2020-04-26 Non-lock transmission method and system for NVMeoF storage network

Country Status (1)

Country Link
CN (1) CN111459417B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112328178B (en) * 2020-11-05 2022-08-09 苏州浪潮智能科技有限公司 Method and device for processing IO queue full state of solid state disk
CN113176896B (en) * 2021-03-19 2022-12-13 中盈优创资讯科技有限公司 Method for randomly taking out object based on single-in single-out lock-free queue
CN114328317B (en) * 2021-11-30 2023-07-14 苏州浪潮智能科技有限公司 Method, device and medium for improving communication performance of storage system
CN115550377B (en) * 2022-11-25 2023-03-07 苏州浪潮智能科技有限公司 NVMF (network video and frequency) storage cluster node interconnection method, device, equipment and medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH1051472A (en) * 1996-05-31 1998-02-20 Internatl Business Mach Corp <Ibm> Method and device for transmitting and receiving packet
CN1859325A (en) * 2006-02-14 2006-11-08 华为技术有限公司 News transfer method based on chained list process
WO2007109920A1 (en) * 2006-03-27 2007-10-04 Zte Corporation A method for constructing and using a memory pool
CN101248618A (en) * 2005-12-07 2008-08-20 中兴通讯股份有限公司 Flow control transport protocol output stream queue management and data transmission processing method
CN106302238A (en) * 2015-05-13 2017-01-04 深圳市中兴微电子技术有限公司 A kind of queue management method and device
DE102017104817A1 (en) * 2016-04-13 2017-10-19 Samsung Electronics Co., Ltd. System and method for a high performance lockable scalable target
CN107924289A (en) * 2015-10-26 2018-04-17 株式会社日立制作所 Computer system and access control method
CN108694021A (en) * 2017-04-03 2018-10-23 三星电子株式会社 The system and method for configuring storage device using baseboard management controller
US10440145B1 (en) * 2016-09-13 2019-10-08 Amazon Technologies, Inc. SDK for reducing unnecessary polling of a network service

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10958729B2 (en) * 2017-05-18 2021-03-23 Intel Corporation Non-volatile memory express over fabric (NVMeOF) using volume management device
US11016911B2 (en) * 2018-08-24 2021-05-25 Samsung Electronics Co., Ltd. Non-volatile memory express over fabric messages between a host and a target using a burst mode

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhu Jiaping, Li Qiong, Song Zhenlong, Dong Dezun, Ou Yang, Xu Weixia. Research on NVMeoF network storage protocols and hardware offloading technology. The 22nd Annual Conference on Computer Engineering and Technology and the 8th Microprocessor Technology Forum. 2018, pp. 15-24. *

Also Published As

Publication number Publication date
CN111459417A (en) 2020-07-28

Similar Documents

Publication Publication Date Title
CN111459417B (en) Non-lock transmission method and system for NVMeoF storage network
US20200314181A1 (en) Communication with accelerator via RDMA-based network adapter
EP1581875B1 (en) Using direct memory access for performing database operations between two or more machines
US8249072B2 (en) Scalable interface for connecting multiple computer systems which performs parallel MPI header matching
EP1868093B1 (en) Method and system for a user space TCP offload engine (TOE)
CN107948094A (en) A kind of high speed data frame Lothrus apterus is joined the team the device and method of processing
CN102404212A (en) Cross-platform RDMA (Remote Direct Memory Access) communication method based on InfiniBand
CN102831018B (en) Low latency FIFO messaging system
CN109547519B (en) Reverse proxy method, apparatus and computer readable storage medium
CN113452591B (en) Loop control method and device based on CAN bus continuous data frame
CN110535811B (en) Remote memory management method and system, server, client and storage medium
CN105141603A (en) Communication data transmission method and system
CN113572582B (en) Data transmission and retransmission control method and system, storage medium and electronic device
CN111865813B (en) Data center network transmission control method and system based on anti-ECN mark and readable storage medium
EP2383647A1 (en) Networking system call data division for zero copy operations
CN116471242A (en) RDMA-based transmitting end, RDMA-based receiving end, data transmission system and data transmission method
CN115509644A (en) Calculation force unloading method and device, electronic equipment and storage medium
CN114244785B (en) 5G data flow out-of-order processing method and device
CN113204517B (en) Inter-core sharing method of Ethernet controller special for electric power
WO2022151475A1 (en) Message buffering method, memory allocator, and message forwarding system
EP3977705B1 (en) Streaming communication between devices
CN114186163A (en) Application layer network data caching method
CN111586040A (en) High-performance network data receiving method and system
US8190765B2 (en) Data reception management apparatus, systems, and methods
CN1231841C (en) Primary channel adapter and its packet receiving method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant