WO2024113832A1 - Method for processing node abnormal events, network card, and storage cluster - Google Patents
- Publication number
- WO2024113832A1 (application PCT/CN2023/103864)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- node
- host
- network card
- storage
- notification message
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
- H04L41/0663—Performing the actions predefined by failover planning, e.g. switching to standby network elements
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0805—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters by checking availability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Definitions
- the present application relates to the field of storage technology, and in particular to a method for processing abnormal node events, a network card, and a storage cluster.
- NVMe-oF: non-volatile memory express over fabrics
- Storage devices based on NVMe-oF usually adopt a multi-node architecture (a node can also be understood as a controller) for host access to the storage device, which improves the reliability and continuity of storage services.
- For example, the host accesses the memory of the storage device through two paths, one through node A and one through node B.
- If an abnormal event occurs in node A (such as a node failure, upgrade, or restart), the host can switch to the path where node B is located and access the memory through that path.
- a heartbeat connection is established between the host and node A.
- the host can detect whether an abnormal event occurs in node A through the heartbeat connection. If an abnormal event is detected in node A, the host switches to the path where node B is located and accesses the storage through the path where node B is located.
- The above method relies on a heartbeat timeout mechanism, so the host takes a long time to switch paths after a node becomes abnormal. Service traffic drops to zero for an extended period during the switch, which affects the reliability and continuity of the storage service.
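The cost of relying on a heartbeat timeout can be made concrete with a rough model. The sketch below is illustrative only: the function name, the interval, and the missed-heartbeat limit are assumptions, not values taken from this application.

```python
def heartbeat_detection_delay(interval_s: float, missed_limit: int) -> float:
    """Approximate worst-case delay before the host declares a path dead.

    Simplified model (an assumption): the host sends one heartbeat every
    `interval_s` seconds and gives up on the path only after `missed_limit`
    consecutive heartbeats go unanswered.
    """
    if interval_s <= 0 or missed_limit < 1:
        raise ValueError("interval_s must be positive and missed_limit at least 1")
    return interval_s * missed_limit


# Illustrative numbers only; real keep-alive settings vary by deployment.
delay = heartbeat_detection_delay(interval_s=5.0, missed_limit=3)
print(f"path switch can start only after about {delay:.0f} s")
```

Under this model, the path switch cannot begin until the full timeout has elapsed, regardless of how quickly the failure could have been observed inside the storage device.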
- This application provides a method for processing node abnormal events, a network card, and a storage cluster, which can effectively reduce the delay of host switching paths and improve the continuity and reliability of storage services.
- the technical solution is as follows:
- a method for processing a node abnormal event is provided, which is applied to a network card of a storage device, wherein the storage device includes the network card and a plurality of nodes, the network card is communicatively connected to a first node among the plurality of nodes, and the nodes are used to manage a memory, and the method includes:
- the node of the storage device is also the storage controller of the storage device, which can process the commands issued by the host, manage the memory, etc.
- the abnormal event related to the first node can be an abnormal event that occurs in the first node itself, such as a failure or restart of the first node, or an abnormal event that occurs in the communication link between the network card and the first node, such as a disconnection of the communication link, etc., and the present application is not limited thereto.
- The network card does not passively wait for a heartbeat message to reveal the fault. Because the network card is communicatively connected to the first node among the multiple nodes, when it detects an abnormal event related to the first node, it actively sends a notification message to the host to inform the host that the path where the first node is located is abnormal, so that the host can switch paths.
- This method can effectively reduce the delay of the host switching paths and improve the continuity and reliability of the storage service. It should be understood that since the network card is communicatively connected to the node and both belong to the storage device, once an abnormal event related to the node occurs, the network card can learn of the abnormal event promptly and respond quickly by informing the host.
- the notification message includes path status information, and the path status information indicates that an abnormality occurs on the path where the first node is located.
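The active notification described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the class names, the `PathStatus` values, and the `send_to_host` callback are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class PathStatus(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"


@dataclass(frozen=True)
class NotificationMessage:
    node_id: str        # node whose path is affected
    status: PathStatus  # path status information carried to the host


class NetworkCard:
    """Hypothetical NIC model that pushes a notification instead of waiting for a heartbeat."""

    def __init__(self, send_to_host: Callable[[NotificationMessage], None]) -> None:
        self._send_to_host = send_to_host

    def on_abnormal_event(self, node_id: str) -> NotificationMessage:
        # Build the path status information and send it to the host immediately.
        msg = NotificationMessage(node_id=node_id, status=PathStatus.ABNORMAL)
        self._send_to_host(msg)
        return msg


# The host side is modelled as a plain list that collects received messages.
received: List[NotificationMessage] = []
nic = NetworkCard(send_to_host=received.append)
nic.on_abnormal_event("node_A")
```

The transport used by `send_to_host` is left open here; the source allows the message to travel through either the transport layer or the application layer.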
- the network card is communicatively connected to the first node via a peripheral component interconnect express (PCIe) link.
- the method further includes: performing link anomaly detection on the PCIe link to determine whether an abnormal event related to the first node occurs.
- the performing link abnormality detection on the PCIe link includes:
- Polling-based detection and/or interrupt-based detection is performed on the PCIe link, and when the PCIe link is detected to be abnormal, it is determined that an abnormal event related to the first node has occurred.
- Because the network card and the first node are connected by a PCIe link, and the data transmission rate of the PCIe link is relatively high, the network card can quickly detect an abnormal event related to the first node through the PCIe link. For example, if the network card detects that the PCIe link is disconnected, or receives an error packet over the PCIe link, the network card considers that an abnormal event has likely occurred at the first node. The present application is not limited thereto.
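The two detection styles can be sketched side by side. The monitor below is a hypothetical model: `read_link_up` stands in for whatever link-status read the network card actually performs, and `handle_interrupt` for its error-interrupt handler. Neither name comes from the source.

```python
from typing import Callable, List


class PcieLinkMonitor:
    """Hypothetical monitor combining polling and interrupt-style detection."""

    def __init__(self, read_link_up: Callable[[], bool],
                 on_abnormal: Callable[[str], None]) -> None:
        self._read_link_up = read_link_up
        self._on_abnormal = on_abnormal
        self._reported = False

    def _report(self, reason: str) -> None:
        # Report an abnormal event only once, whichever mechanism sees it first.
        if not self._reported:
            self._reported = True
            self._on_abnormal(reason)

    def poll_once(self) -> None:
        """Polling path: sample the link state and report if the link is down."""
        if not self._read_link_up():
            self._report("link down (polling)")

    def handle_interrupt(self, error_code: int) -> None:
        """Interrupt path: called when the link raises an error interrupt."""
        self._report(f"link error interrupt (code {error_code})")


events: List[str] = []
link_state = {"up": True}
monitor = PcieLinkMonitor(lambda: link_state["up"], events.append)
monitor.poll_once()          # link healthy: nothing reported
link_state["up"] = False
monitor.poll_once()          # link down: reported once
monitor.handle_interrupt(7)  # already reported: suppressed
```

Polling bounds detection time by the polling period, while the interrupt path reacts as soon as the error is signalled; using both, as the text permits, covers failures that only one mechanism would observe.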
- the network card sends a notification message to the host, including any of the following:
- the network card sends the notification message to the host through the transport layer
- the network card sends the notification message to the host through the application layer.
- the network card can quickly send a notification message to the host through the transport layer or application layer, thereby reducing the delay of the host switching path.
- the method further comprises:
- the path status information is obtained from the management queue information of the network card, the path status information in the management queue information is configured by the first node according to a first command issued by the host, the first command carries the path status information, and the first command indicates that when an abnormal event related to the first node is detected, the notification message is sent to the host.
- the management queue information of the network card is configured with path status information, and the first command is an asynchronous event request command, so that the network card can actively send a notification message to the host when an abnormal event related to the first node is detected.
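The asynchronous event request pattern can be sketched as a request that is parked until an event occurs and only then completed. The classes below are hypothetical stand-ins for the management queue information; the field names and the status string are assumptions.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class AsyncEventRequest:
    command_id: int
    path_status_info: str  # carried by the first command


@dataclass
class Completion:
    command_id: int
    path_status_info: str


class AdminQueueInfo:
    """Hypothetical stand-in for the NIC's management queue information."""

    def __init__(self) -> None:
        self._pending: Deque[AsyncEventRequest] = deque()

    def configure(self, request: AsyncEventRequest) -> None:
        # In the source, the first node performs this configuration after
        # receiving the first command from the host.
        self._pending.append(request)

    def complete_on_event(self) -> Optional[Completion]:
        # On an abnormal event, complete one outstanding request so that the
        # host learns the path status without having sent a new command.
        if not self._pending:
            return None
        req = self._pending.popleft()
        return Completion(req.command_id, req.path_status_info)


admin = AdminQueueInfo()
admin.configure(AsyncEventRequest(command_id=1,
                                  path_status_info="path via node A: abnormal"))
first = admin.complete_on_event()   # completes the parked request
second = admin.complete_on_event()  # nothing outstanding
```

Because the request is already outstanding when the fault occurs, the notification needs no further action from the host or from the failed node.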
- the method further comprises:
- the path status information is obtained from the input/output queue context of the network card, the path status information in the input/output queue context is configured by the first node according to a second command issued by the host, the second command indicates that the path status information is generated based on the operating system type of the host, and the path status information is configured in the input/output queue context.
- the method further comprises:
- a third command is received from the host, where the third command is a read command or a write command, and the third command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- Path status information is configured in the input/output queue context of the network card, so that when the network card has detected an abnormal event related to the first node and receives the third command issued by the host, it can send a notification message to the host, allowing the host to switch paths.
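The input/output queue variant can be sketched as a lookup performed when a read or write command arrives. Everything concrete below is assumed for illustration: the OS names, the status strings, and the `FORWARDED_TO_NODE` result are not given in the source.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class IoQueueContext:
    host_os: str
    path_status_info: str


# Assumed, illustrative status strings; the source does not give actual values.
PATH_STATUS_BY_OS: Dict[str, str] = {
    "linux": "PATH_ERROR_LINUX",
    "windows": "PATH_ERROR_WINDOWS",
}


def configure_io_queue_context(host_os: str) -> IoQueueContext:
    """Models the second command: generate path status info for the host's OS type."""
    return IoQueueContext(host_os, PATH_STATUS_BY_OS[host_os])


def handle_io_command(opcode: str, ctx: IoQueueContext, node_abnormal: bool) -> str:
    """Models the third command: a read or write arriving at the network card."""
    if opcode not in ("read", "write"):
        raise ValueError("the third command is a read command or a write command")
    if node_abnormal:
        # Answer from the NIC itself with the preconfigured path status information.
        return ctx.path_status_info
    return "FORWARDED_TO_NODE"


ctx = configure_io_queue_context("linux")
print(handle_io_command("read", ctx, node_abnormal=True))
print(handle_io_command("write", ctx, node_abnormal=False))
```

In this variant the notification rides on the response to ordinary I/O, so the host learns of the fault on its next read or write rather than through a separate channel.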
- This process can effectively reduce the delay of the host switching path and improve the continuity and reliability of the storage service.
- an embodiment of the present application provides a method for processing a node abnormal event, which is applied to a host, wherein the host is communicatively connected to a network card of a storage device, wherein the storage device includes the network card and a plurality of nodes, wherein the network card is communicatively connected to a first node among the plurality of nodes, and the node is used to manage a memory, wherein the method includes:
- the memory is accessed through a path where a node other than the first node among the multiple nodes is located.
- the notification message includes path status information, and the path status information indicates that an abnormality occurs on the path where the first node is located.
- the host receives a notification message sent by the network card when an abnormal event related to the first node is detected, including any of the following:
- the host receives the notification message through the transport layer
- the host receives the notification message through the application layer
- the method further comprises:
- a first command is sent to the first node so that the first node configures the path status information carried by the first command in the management queue information, and the first command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- the method further comprises:
- a second command is issued to the first node, so that the first node generates the path status information according to the instruction of the second command and based on the operating system type of the host, and configures the path status information in the input/output queue context.
- the method further comprises:
- a third command is issued to the network card, where the third command is a read command or a write command, and the third command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- an embodiment of the present application provides a device for processing node abnormal events, wherein the device is configured in a network card in a storage device, the storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, the node is used to manage a memory, and the device includes at least one functional unit for executing a method for processing node abnormal events as provided in the aforementioned first aspect or any possible implementation of the first aspect.
- an embodiment of the present application provides a device for processing node abnormal events, wherein the device is configured on a host, the host is communicatively connected to a network card of a storage device, the storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, the node is used to manage a memory, and the device includes at least one functional unit for executing a method for processing node abnormal events as provided in the aforementioned second aspect or any possible implementation of the second aspect.
- an embodiment of the present application provides a network card, which is configured in a storage device, and the network card includes a processor, a memory and an interface, the interface is used to communicate with a node in the storage device, and the memory is used to store at least one piece of program code, and the at least one piece of program code is loaded by the processor and implements a method for handling node abnormal events as provided in the first aspect or any possible implementation method of the first aspect.
- an embodiment of the present application provides a storage cluster, which includes a network card, multiple nodes and a memory, the network card is communicatively connected to the nodes, the nodes are used to manage the memory, and the network card is used to execute a method for handling node abnormal events as provided in the first aspect or any possible implementation of the first aspect.
- the storage cluster is a centralized storage device
- the node is a storage controller
- the network card is connected to the node via a system bus
- the memory is connected to the node via a system bus.
- the storage cluster is a distributed storage system, which includes multiple independent storage devices, and each of the storage devices is connected to each other through a wired network or a wireless network to form a storage network; wherein each of the storage devices includes the network card, the node and the memory, and the network card is connected to the node through a system bus, and the memory is connected to the node through a system bus; or, each of the storage devices includes the network card and the node, and the memory is communicatively connected to the node in each of the storage devices.
- an embodiment of the present application provides a host, which includes a processor and a memory, and the processor is used to execute instructions stored in the memory so that the host executes a method for processing node abnormal events as provided in the aforementioned second aspect or any possible implementation of the second aspect.
- an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium is used to store at least one program code, the at least one program code is used to implement the method for processing node abnormal events provided by the aforementioned first aspect or any possible implementation of the first aspect.
- the at least one program code is used to implement the method for processing node abnormal events provided by the aforementioned second aspect or any possible implementation of the second aspect.
- the storage medium includes but is not limited to volatile memory, such as random access memory, and non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- an embodiment of the present application provides a computer program product, which, when the computer program product is run on a storage device, enables the storage device to implement the method for handling abnormal node events provided by the aforementioned first aspect or any possible implementation of the first aspect.
- when the computer program product is run on a host, it enables the host to implement the method for handling abnormal node events provided by the aforementioned second aspect or any possible implementation of the second aspect.
- the computer program product can be a software installation package, and when the aforementioned method needs to be implemented, the computer program product can be downloaded and executed.
- FIG. 1 is a schematic diagram of a storage architecture provided in an embodiment of the present application.
- FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
- FIG. 3 is a schematic diagram of another implementation environment provided by an embodiment of the present application.
- FIG. 4 is a schematic diagram of the structure of a host provided in an embodiment of the present application.
- FIG. 5 is a schematic diagram of the structure of a storage device provided in an embodiment of the present application.
- FIG. 6 is a schematic diagram of the structure of a distributed storage system provided in an embodiment of the present application.
- FIG. 7 is a schematic diagram of a logic unit of a network card provided in an embodiment of the present application.
- FIG. 8 shows a method for processing a node abnormal event provided in an embodiment of the present application.
- FIG. 9 shows another method for processing node abnormal events provided in an embodiment of the present application.
- FIG. 10 shows another method for processing node abnormal events provided in an embodiment of the present application.
- FIG. 11 is a schematic diagram of a method for processing a node abnormal event provided in an embodiment of the present application.
- FIG. 12 is a schematic diagram of another method for processing abnormal node events provided in an embodiment of the present application.
- FIG. 13 is a schematic diagram of the structure of a device for processing node abnormal events provided in an embodiment of the present application.
- FIG. 14 is a schematic diagram of the structure of another device for processing abnormal node events provided in an embodiment of the present application.
- NVMe is a set of hardware and software standards that allow solid state drives (SSDs) to use the Peripheral Component Interconnect Express (PCIe) bus; PCIe is the actual physical connection channel.
- NVM stands for non-volatile memory; the flash memory commonly used in SSDs is one form of NVM.
- NVMe mainly provides a low-latency, internally concurrent native interface specification for flash-based storage devices, and also provides native storage concurrency support for modern central processing units (CPUs), computer platforms and related applications, allowing host hardware and software to fully utilize the parallel storage capabilities of solid-state storage devices.
- NVMe commands refer to commands defined by the NVMe protocol.
- the commands in the NVMe protocol are divided into management (Admin) commands and input/output (I/O) commands.
- I/O commands are also called NVM commands.
- Admin commands are used to manage and control NVMe storage media.
- I/O commands are used to transfer data.
- I/O commands in the NVMe protocol include NVMe read commands and NVMe write commands.
- a queue pair is a pair of queues used to carry NVMe commands, consisting of a submission queue (SQ) and a completion queue (CQ).
- the host submits commands to the NVMe node (controller) through the SQ
- the NVMe controller submits the completion status to the CQ.
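The submission/completion flow of a queue pair can be sketched with two simple queues. This is a deliberately simplified model: real NVMe queues are fixed-size ring buffers with doorbell registers, and the dictionary fields used here are assumptions.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque


@dataclass
class QueuePair:
    """Simplified queue pair: one submission queue (SQ) and one completion queue (CQ)."""
    sq: Deque[dict] = field(default_factory=deque)
    cq: Deque[dict] = field(default_factory=deque)

    def submit(self, command_id: int, opcode: str) -> None:
        # Host side: place a command on the SQ.
        self.sq.append({"cid": command_id, "opcode": opcode})

    def controller_process_one(self) -> None:
        # Controller side: take the oldest command and post its completion status.
        cmd = self.sq.popleft()
        self.cq.append({"cid": cmd["cid"], "status": "success"})

    def reap(self) -> dict:
        # Host side: read the oldest completion from the CQ.
        return self.cq.popleft()


qp = QueuePair()
qp.submit(command_id=1, opcode="read")
qp.controller_process_one()
completion = qp.reap()
```

The command identifier ties each completion back to the command that produced it, which is what lets the host keep several commands in flight on the same queue pair.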
- the NVMe-oF specification is a high-speed storage protocol built on the basis of the NVMe protocol.
- NVMe-oF is used to access NVMe storage media across networks.
- NVMe-oF adds fabric-related commands based on NVMe, so that the application scenarios of NVMe are not limited to within a device, but can be extended to cross-network communications.
- the so-called "fabric" refers to the network between the host and the storage medium.
- Typical forms of fabric include Ethernet, Fibre Channel, InfiniBand (IB), remote direct memory access (RDMA), etc.
- In this application, RDMA over Converged Ethernet (RoCE) is used as an example to implement the fabric, but this application is not limited to this.
- RDMA is a technology that bypasses the operating system kernel of the remote device to access the memory of the remote device. Since RDMA technology usually does not need to go through the operating system, it not only saves a lot of CPU resources, but also improves throughput and reduces network communication latency.
- a namespace is a formatted amount of non-volatile memory that can be directly accessed by the host, and can also be understood as a storage space.
- a namespace appears to the host as a real physical disk. For example, if an SSD includes two namespaces, the host sees two physical disks and can format and partition them separately.
- the technical solution provided in the embodiment of the present application can be applied to the storage architecture based on NVMe-oF, and can improve the continuity and reliability of storage services.
- The application scenario involved in the present application is introduced below with reference to FIG. 1.
- FIG1 is a schematic diagram of a storage architecture provided by an embodiment of the present application.
- the storage architecture includes a host, a switch, and a storage device based on NVMe-oF. The storage device adopts a dual-node architecture to realize the host's access to the memory in the storage device.
- the storage device includes a network interface card (NIC), a node A, a node B and a memory. The data in the memory can be indexed by namespace, and the network card and the nodes are communicatively connected by PCIe links.
- the host 1 can access the memory through the four redundant paths where the nodes A and B are located in the storage device.
- For example, host 1 accesses the memory through the path where node A is located. After an abnormal event occurs in node A (such as a node failure, upgrade, or restart), host 1 can switch to the path where node B is located and access the memory through that path, so as to maintain the continuity of the storage service.
- a heartbeat connection is established between the host 1 and the node A, and the host 1 detects whether an abnormal event occurs in the node A through the heartbeat connection.
- When an abnormal event is detected in node A, host 1 switches to the path where node B is located and accesses the memory through that path.
- The above method relies on a heartbeat timeout mechanism, so the host takes a long time to switch paths after a node becomes abnormal. Service traffic drops to zero for an extended period during the switch, which affects the reliability and continuity of the storage service.
- the present application provides a method for handling node abnormal events, which can reduce the latency of the host switching path and improve the continuity and reliability of storage services when an abnormal event related to a node in a storage device is detected (or in a node reset scenario).
- FIG2 is a schematic diagram of an implementation environment provided by an embodiment of the present application.
- the implementation environment includes a host 100 and a storage device 200
- the storage device 200 includes a network card 201, multiple nodes 202 and a memory 203, and the host 100 and the storage device 200 are directly or indirectly connected via a wired network or a wireless network.
- The storage devices shown in FIG. 1 and FIG. 2 are both centralized storage devices (or storage arrays), for example in a storage area network (SAN).
- the node 202 is a storage controller
- the memory 203 is a persistent storage medium, such as a hard disk drive HDD or a solid-state drive SSD, etc.
- the present application is not limited thereto, and the storage device 200 can also be understood as a storage cluster including multiple nodes 202 (i.e., storage controllers).
- the host 100 refers to a device for running a storage service, such as a device for running a RoCE service, which is not limited to this.
- the host 100 runs a storage service by accessing a storage device 200.
- the host 100 has a path switching capability and can switch from one path to another to access the storage device 200 to improve the continuity and reliability of the storage service.
- the host 100 is a terminal device or a server running a client, etc., and the present application is not limited to this.
- the protocol stack of the host 100 includes: a file system, block input/output (block I/O), a small computer system interface (SCSI), NVMe, a driver, and physical devices, etc., and the present application is not limited to this.
- the number of hosts 100 can be one or more, and the present application is not limited to this.
- the storage device 200 is used to provide accessible storage space for the host 100, such as providing read and write access to disk space, etc.
- the storage device 200 includes a network card 201, multiple nodes 202, and a memory 203.
- the network card 201 is connected to the node 202 through a system bus (such as a PCIe link), and the memory 203 is connected to the node 202 through a system bus.
- the node 202 is also a storage controller (controller) of the storage device, which can process commands issued by the host, manage the memory 203, and so on.
- It should be understood that the number of network cards 201 and nodes 202 shown in the figure, and the connection relationship between the network card 201 and the node 202, are only schematic illustrations.
- a network card can be connected to one or more nodes, and a node can also be connected to one or more network cards. This application does not limit this.
- the network card 201 has the ability to handle abnormal events related to the node 202, including the ability to detect abnormal events related to the node 202 and the ability to notify the host 100 of abnormal events.
- the network card 201 can send a notification message to the host 100 when an abnormal event related to the first node is detected, informing the host 100 that an abnormality has occurred in the path where the first node is located, so that the host 100 accesses the memory 203 through the path where the nodes other than the first node among the multiple nodes 202 are located.
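On the host side, reacting to the notification amounts to marking the reported path abnormal and selecting another node. The selector below is a hypothetical sketch: the class name and the choice of "first healthy node in configured order" are assumptions, since the source does not specify a selection policy.

```python
from typing import Dict, List


class MultipathHost:
    """Hypothetical host-side path selector driven by NIC notifications."""

    def __init__(self, node_ids: List[str]) -> None:
        if not node_ids:
            raise ValueError("at least one node is required")
        self._healthy: Dict[str, bool] = {n: True for n in node_ids}
        self.active = node_ids[0]

    def on_notification(self, node_id: str) -> str:
        """Mark the reported path abnormal and switch if it was the active one."""
        if node_id not in self._healthy:
            return self.active  # notification for an unknown node: ignore
        self._healthy[node_id] = False
        if self.active == node_id:
            candidates = [n for n, ok in self._healthy.items() if ok]
            if not candidates:
                raise RuntimeError("no healthy path left")
            self.active = candidates[0]  # assumed policy: first healthy node
        return self.active


host = MultipathHost(["node_A", "node_B"])
before = host.active
after = host.on_notification("node_A")
```

The switch happens as soon as the notification arrives, so the delay no longer depends on a heartbeat timeout.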
- the implementation environment also includes a switch 300, and the host 100 can access the storage device 200 through the switch 300.
- the present application is not limited to this. It should be understood that the switch 300 is an optional device, and the host 100 can also directly access the storage device 200.
- the wireless network or wired network uses standard communication technologies and/or protocols.
- the network is typically the Internet, but can also be any network, including but not limited to a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), a mobile, wired or wireless network, a private network, a virtual private network, or any combination thereof.
- technologies and/or formats including hypertext markup language (HTML), extensible markup language (XML), etc. are used to represent data exchanged over the network.
- SSL: secure socket layer
- TLS: transport layer security
- VPN: virtual private network
- IPsec: Internet protocol security
- the storage device 200 includes multiple nodes 202, so that the host 100 can access the storage device 200 through multiple paths. It should be understood that the present application does not limit the number of storage devices 200. When there are multiple storage devices 200, the method for handling node abnormal events provided by the present application is also applicable. This situation is introduced below with reference to the distributed architecture shown in FIG. 3.
- FIG3 is a schematic diagram of another implementation environment provided by an embodiment of the present application.
- the implementation environment includes a host 100 and a distributed storage system 400.
- the host 100 and the distributed storage system 400 are directly or indirectly connected via a wired network or a wireless network.
- the distributed storage system 400 is a storage cluster including a plurality of independent storage devices 200. Each storage device 200 is connected via a wired network or a wireless network to form a storage network.
- the host 100 runs the storage service by accessing the distributed storage system 400.
- the host 100 has a path switching capability and can switch from one storage device 200 to another storage device 200 to run the storage service, so as to improve the continuity and reliability of the storage service.
- the number of hosts 100 can be one or more, and this application does not limit this.
- each storage device 200 includes a network card 201, at least one node 202, and a memory 203.
- the network card 201 and the node 202 are connected through a system bus (such as a PCIe link), and the memory 203 and the node 202 are connected through a system bus.
- Node 202 is used to manage the memory 203. It should be understood that the number of network cards 201 and nodes 202 shown in the figure, as well as the connection relationship between the network card 201 and the node 202 are only schematic illustrations.
- a network card can be connected to one or more nodes, and a node can also be connected to one or more network cards. This application does not limit this.
- the first network card and the first node are located in the first storage device, and the other nodes selected by the host are located in the second storage device.
- the other nodes selected by the host are also the nodes selected by the host switching path.
- the first network card detects an abnormal event related to the first node, and the first network card sends a notification message to the host 100, informing the host 100 that the path where the first node is located is abnormal, so that the host 100 accesses the second storage device to run the storage service (it should be understood that the combination of each node 202 in the distributed storage system 400 is similar to the multiple nodes 202 in the implementation environment shown in Figure 2 above, which will not be repeated here).
- the implementation environment further includes a switch 300, which is similar to the implementation environment shown in Figure 2 above, so it is not repeated.
- the wireless network or wired network uses standard communication technology and/or protocol, which is similar to the implementation environment shown in Figure 2 above, so it is not repeated.
- each storage device includes a network card and a node, and the memory of the distributed storage system is communicatively connected to the nodes in each storage device.
- the memory of the distributed storage system can be located inside the storage device or outside the storage device.
- the present application does not limit the number of memories in the distributed storage system. The number of memories can be one or more, and can be configured according to actual needs, which will not be repeated here.
- Fig. 4 is a schematic diagram of the structure of a host provided by an embodiment of the present application.
- the host 100 includes a memory 101, a processor 102, a communication interface 103 and a bus 104.
- the memory 101, the processor 102 and the communication interface 103 are connected to each other through the bus 104.
- the memory 101 may be a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM) or other types of dynamic storage devices that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compressed optical disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto.
- the memory 101 is used to store at least one section of program code.
- the processor 102 and the communication interface 103 are used to execute the steps involved in the host in the following method for processing node abnormal events.
- the processor 102 may be a network processor (NP), a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the program of the present application.
- the processor 102 may be a single-CPU processor or a multi-CPU processor.
- the number of the processors 102 may be one or more.
- the memory 101 and the processor 102 may be separately arranged or integrated.
- the communication interface 103 uses a transceiver module such as a transceiver to implement communication between the host 100 and other devices or communication networks. For example, a command can be sent to the storage device 200 through the communication interface 103, and for another example, a notification message sent by the storage device 200 can be received, but the present application is not limited thereto.
- the bus 104 may include a path for transmitting information between various components of the host 100 (e.g., the memory 101, the processor 102, and the communication interface 103).
- FIG5 is a schematic diagram of the structure of a storage device provided in an embodiment of the present application.
- the storage device 200 is a centralized storage device based on NVMe-oF, including a network card 201, multiple nodes 202, a memory 203, and a bus 204.
- the network card 201, multiple nodes 202, and the memory 203 are connected to each other through the bus 204.
- the network card 201 is used to realize the communication between the storage device 200 and other devices or communication networks.
- the storage device 200 can send a notification message to the host 100 through the network card 201.
- the network card 201 has the processing capability for abnormal events related to the node, including the detection capability for abnormal events related to the node and the capability to notify the host 100 of abnormal events.
- the network card 201 includes a processor 2011, a memory 2012 and an interface 2013, the interface 2013 is used to communicate with at least one node 202, and the memory 2012 is used to store at least one section of program code, which is loaded by the processor 2011 and implements the steps involved in the network card in the following method for processing abnormal events of nodes.
- the processor 2011 can be an NPU, a CPU, etc.
- the processor 2011 can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU, MPU), and the present application is not limited thereto.
- network card 201 is an RDMA network card (RNIC) or other smart network card (smart NIC), but the present application is not limited to this.
- Node 202 is used to manage memory 203 and process commands sent by host 100. For example, node 202 writes data to memory 203 or reads data from memory 203 according to the I/O command sent by host 100, and the present application is not limited to this. In practical applications, node 202 can have various forms. Schematically, node 202 is the main body for processing NVMe-oF protocol. For example, node 202 includes a CPU and a memory, the CPU is used to perform operations such as address conversion and reading and writing data, and the memory is used to temporarily store data to be written to memory 203, or read from memory 203 to be sent to host 100, and the present application is not limited to this.
- the memory 203 includes at least one solid state drive SSD for storing data, wherein the SSD is a memory that mainly uses flash memory as permanent storage.
- the bus 204 may include a path for transmitting information between various components of the storage device 200 (e.g., the network card 201, the plurality of nodes 202, and the memory 203).
- FIG6 is a schematic diagram of the structure of a distributed storage system provided by an embodiment of the present application.
- the distributed storage system 400 includes a plurality of independent storage devices 200, each storage device includes a network card 201, at least one node 202 and a memory 203, and in each storage device 200, the network card 201 is connected to the node 202 via a system bus, and the memory 203 is connected to the node 202 via a system bus.
- Node 202 is used to manage the memory 203.
- Each storage device 200 is connected via a network, wherein the network may be a wide area network or a local area network, etc., but the present application is not limited thereto.
- each storage device 200 is connected to the network via its network card 201.
- the structure of any storage device 200 in the storage cluster shown in FIG6 is the same as that of the storage device shown in FIG5, so it will not be repeated here.
- the present application does not limit the location of the memory and the number of memories in the distributed storage system. That is, Figure 6 is only one form of the distributed storage system provided by the present application, and does not constitute a limitation on the present application, and will not be repeated here.
- the network card 201 of the storage device 200 has the ability to process abnormal events related to the node 202, including the ability to detect abnormal events related to the node 202 and the ability to notify the host 100 of abnormal events.
- the capabilities of the network card 201 of the storage device 200 are introduced by taking the logical level as an example.
- FIG7 is a schematic diagram of a logic unit of a network card provided by an embodiment of the present application.
- the network card is communicatively connected to a first node among multiple nodes through a PCIe link, and a fault reflector is deployed on the network card.
- the fault reflector is run by a processor of the network card to provide processing capabilities for abnormal events related to the node.
- the fault reflector includes a detection logic unit and an execution logic unit.
- the detection logic unit is used to detect abnormal events related to the node.
- the detection logic unit is used to detect abnormal events related to the first node, and in the case of detecting abnormal events related to the first node, notify the execution logic unit.
- the detection logic unit can use an interface call or other message communication methods inside the chip to notify the execution logic unit, which is not limited to this.
- the detection logic unit is executed by a CPU or an MPU, and the present application is not limited to this.
- the execution logic unit is used to execute the step of sending a notification message to the host 100 according to the notification of the detection logic unit to notify the host 100 that an abnormality has occurred in the path where the first node is located.
- the execution logic unit is executed by a CPU or an NPU, and the present application is not limited to this.
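- the division of the fault reflector into a detection logic unit and an execution logic unit can be sketched as follows. This is only an illustrative model under stated assumptions: the class names mirror the units described above, while `link_is_up`, `send_to_host` and the message fields are hypothetical stand-ins for the PCIe link check and the transport to the host.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExecutionLogicUnit:
    """Sends a notification to the host when told about an abnormal event."""
    send_to_host: Callable[[dict], None]   # hypothetical transport hook

    def on_abnormal_event(self, node_id: int) -> None:
        self.send_to_host({"node": node_id, "path_status": "abnormal"})


@dataclass
class DetectionLogicUnit:
    """Checks the link to a node and notifies the execution logic unit."""
    link_is_up: Callable[[int], bool]      # hypothetical PCIe link probe
    executor: ExecutionLogicUnit

    def check(self, node_id: int) -> bool:
        if self.link_is_up(node_id):
            return False
        # Interface call from the detection unit to the execution unit.
        self.executor.on_abnormal_event(node_id)
        return True


sent: List[dict] = []
reflector = DetectionLogicUnit(
    link_is_up=lambda node_id: node_id != 1,   # pretend node 1's link is down
    executor=ExecutionLogicUnit(send_to_host=sent.append),
)
reflector.check(0)
reflector.check(1)
```

Here the notification between the two units is a plain method call, which corresponds to the "interface call" option; another in-chip messaging method could replace it without changing the structure.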
- Fig. 8 is a method for processing a node abnormality event provided by an embodiment of the present application. As shown in Fig. 8, the interaction between the host 100 and the storage device 200 is taken as an example for introduction. The method includes the following steps 801 to 805.
- a host establishes a communication connection with a first node among multiple nodes of a storage device through a network card of the storage device.
- the storage device includes a plurality of nodes, the first node is any one of the plurality of nodes, wherein the node is also referred to as a storage controller.
- the network card of the storage device is connected to the first node for communication.
- the host sends a communication connection request to the first node through the network card, and the first node establishes a communication connection with the host according to the received communication connection request, so that the host can access the storage through the path where the first node is located.
- an NVMe connection is established between the host and the first node based on the NVMe protocol, so that the host can issue NVMe commands to the first node, and the NVMe commands include management commands and input/output commands.
- the network card of the storage device is an RNIC, and an RDMA connection can be established between the host and the first node based on the RDMA protocol, so that the host can issue RDMA commands to the first node to implement the RDMA function, which not only saves a lot of CPU resources, but also improves throughput and reduces network communication delay.
- a network card of a storage device detects an abnormal event related to a first node.
- the network card of the storage device performs link abnormality detection on the PCIe link between the network card and the first node to determine whether an abnormal event related to the first node occurs.
- the abnormal event related to the first node can be an abnormal event that occurs in the first node itself, such as a failure or restart of the first node, or an abnormal event that occurs in the PCIe link between the network card and the first node, such as a PCIe link disconnection, etc., and the present application is not limited thereto. It should be understood that the network card can infer that an abnormal event has occurred in the first node through the abnormality of the PCIe link.
- for example, when the network card detects that the PCIe link is disconnected, or receives an error packet through the PCIe link, the network card considers it likely that an abnormal event has occurred in the first node, and the present application is not limited thereto.
- because the network card and the first node are communicatively connected by a PCIe link, and the data transmission rate of the PCIe link is high, the network card can quickly detect an abnormal event related to the first node through the PCIe link.
- the network card performs polling-based detection and/or interrupt-based detection on the PCIe link, and when an abnormality is detected in the PCIe link, it determines that an abnormal event related to the first node has occurred.
- the polling mechanism refers to the network card monitoring the operating status of the external device of the network card in a polling manner.
- the interrupt detection mechanism refers to the external device of the network card actively reporting an interrupt signal to the network card when an abnormal event occurs, so that the network card is informed of the abnormal event of the external device.
- the network card can also use other methods to detect abnormal events related to the first node, for example, establishing a heartbeat connection between the network card and the first node, detecting abnormal events related to the first node through the heartbeat connection, etc.
- the present application is not limited to this.
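- the two detection mechanisms described above can be illustrated with a minimal sketch. It is not the implementation of the present application: `LinkMonitor`, `LinkState`, the `read_state` probe and the `on_abnormal` callback are hypothetical names introduced only for illustration, and a real network card would read the PCIe link status from hardware.

```python
from enum import Enum
from typing import Callable, List


class LinkState(Enum):
    UP = "up"
    DOWN = "down"


class LinkMonitor:
    """Tracks one NIC-to-node link and reports the first abnormality."""

    def __init__(self, read_state: Callable[[], LinkState],
                 on_abnormal: Callable[[str], None]) -> None:
        self._read_state = read_state      # hypothetical link status probe
        self._on_abnormal = on_abnormal    # hypothetical reporting hook
        self._reported = False

    def _report(self, reason: str) -> None:
        if not self._reported:             # report each failure only once
            self._reported = True
            self._on_abnormal(reason)

    def poll_once(self) -> None:
        """Polling: the network card actively queries the link state."""
        if self._read_state() is LinkState.DOWN:
            self._report("poll: link down")

    def handle_interrupt(self, cause: str) -> None:
        """Interrupt: the external device reports the event itself."""
        self._report(f"interrupt: {cause}")


events: List[str] = []
states = iter([LinkState.UP, LinkState.DOWN, LinkState.DOWN])
monitor = LinkMonitor(lambda: next(states), events.append)
for _ in range(3):
    monitor.poll_once()
monitor.handle_interrupt("error packet")   # ignored, already reported
```

In this sketch both mechanisms feed the same reporting path, so whichever one notices the abnormality first triggers the single notification.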
- when the network card of the storage device detects an abnormal event related to the first node, the network card sends a notification message to the host, where the notification message indicates that an abnormality occurs in the path where the first node is located.
- when the network card of the storage device detects that an abnormal event occurs in the first node, it generates a notification message based on information pre-configured in the network card and sends the notification message to the host, wherein the notification message includes path status information, and the path status information indicates that an abnormal event occurs in the path where the first node is located.
- the network card sends a notification message to the host in at least one of the following ways:
- the first one is that when the network card detects an abnormal event related to the first node, it actively generates a notification message and sends the notification message to the host.
- the second type is that when the network card detects an abnormal event related to the first node and receives a read command or a write command sent by the host, the network card actively generates a notification message and sends the notification message to the host.
- the network card can actively notify the host when detecting an abnormal event related to the first node, so that the host can access the storage through the path where the nodes other than the first node are located among the multiple nodes. It should be noted that the above two methods of sending notification messages will be described in detail in subsequent embodiments and will not be repeated here.
- the network card of the storage device can detect abnormal events related to the first node, and actively notify the host when abnormal events related to the first node are detected.
- the network card can periodically detect abnormal events related to the first node.
- the network card may also continuously detect abnormal events related to the first node, but the present application is not limited thereto.
- a fault reflector is deployed on the network card of the storage device, which can provide a processing function for abnormal events related to the node. Accordingly, the above step 802 can be executed by the detection logic unit in the fault reflector, and the above step 803 can be executed by the execution logic unit in the fault reflector.
- the host receives the notification message.
- the host accesses the storage through a path where a node other than the first node among the multiple nodes is located.
- the host obtains path status information from the notification message. Since the path status information indicates that an abnormality has occurred in the path where the first node is located, the host learns that the memory may not be accessible through the path where the first node is located. Therefore, the host accesses the memory through the path where the nodes other than the first node among multiple nodes are located, thereby achieving continuity of storage services.
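- the host-side behaviour of steps 804 and 805 can be sketched as below. The sketch assumes a simplified notification represented as a dictionary; `MultipathHost` and the `node` and `path_abnormal` fields are hypothetical names, not part of the method described above.

```python
from typing import Dict, Optional


class MultipathHost:
    """Keeps per-path health and picks a usable path for storage access."""

    def __init__(self, node_ids):
        self.path_ok: Dict[int, bool] = {n: True for n in node_ids}
        self.active: Optional[int] = next(iter(self.path_ok), None)

    def on_notification(self, message: dict) -> None:
        # 'node' and 'path_abnormal' are hypothetical message fields.
        if message.get("path_abnormal"):
            self.path_ok[message["node"]] = False
            if self.active == message["node"]:
                self.active = self._pick_healthy()

    def _pick_healthy(self) -> Optional[int]:
        # First path still marked healthy, or None if no path is left.
        return next((n for n, ok in self.path_ok.items() if ok), None)


host = MultipathHost([0, 1, 2])
host.on_notification({"node": 0, "path_abnormal": True})
```

After the notification for node 0, the host in this sketch continues on the path of node 1, which corresponds to accessing the memory through a node other than the first node.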
- the network card of the storage device is communicatively connected to the first node, and can promptly send a notification message to the host when an abnormal event related to the first node is detected, informing the host that an abnormality has occurred in the path where the first node is located, so that the host can switch paths.
- This method can effectively reduce the delay of the host switching the path and improve the continuity and reliability of the storage service.
- the network card of the storage device sends a notification message to the host in at least one manner.
- the two aforementioned manners are respectively introduced below through the embodiments shown in FIG. 9 and FIG. 10.
- Fig. 9 is another method for processing node abnormal events provided by an embodiment of the present application. As shown in Fig. 9, the interaction between the host 100 and the storage device 200 is used as an example for introduction. The method includes the following steps 901 to 909.
- a host establishes a communication connection with a first node among multiple nodes of a storage device through a network card of the storage device.
- this step is the same as step 801 in the embodiment shown in Figure 8 above, so it will not be repeated here.
- the host sends a first command to the first node.
- the first command carries path status information.
- the first command indicates that a notification message is sent to the host when an abnormal event related to the first node is detected.
- the first command is a management command
- the host sends the first command to the first node through the management queue (admin queue), and the management queue is used to store NVMe management commands.
- the first command is an asynchronous event request (AER) command.
- the AER command is an asynchronous command used to notify the host of status, errors, health information, etc. when certain events occur. That is, the host does not require the completion of the AER command to be reported immediately; completion is reported only when an abnormal event occurs.
- the host can send at least one AER command to the node to enable the node to report asynchronous events. This command does not set a timeout.
- when there is an asynchronous event that needs to be reported to the host, the node generates a completion queue entry (CQE) and sends the CQE to the completion queue (CQ) of the host.
- the path status information can be configured according to actual needs, and this application does not limit this.
- the path status information is 03h, and its detailed definition refers to the following Table 1 (it should be understood that the following Table 1 is only a schematic illustration of the path status information, and other similar fields in the relevant protocol that can indicate the path status information can be applied to this application in the same way, and this application is not limited to this).
- the first command also carries a command identifier of the first command, and this application is not limited to this.
- Table 1: Asynchronous event information - error status
- the first node receives a first command sent by the host.
- the first node configures the path status information carried by the first command into the management queue information of the network card.
- the management queue information is information maintained by the network card and stored on the network card.
- the first node parses the received first command to obtain the path status information, calls the preset interface provided by the network card, and configures the path status information in the management queue information, so that when an abnormal event is detected in the first node later, the path status information is obtained from the management queue information to complete the first command.
- the first command also includes a command identifier of the first command, and the relevant information carried by the first command can be collectively referred to as the first preset information.
- the first node can configure the first preset information in the management queue information according to the received first command.
- the management queue information is a kind of context information related to the management queue, so the life cycle of the management queue information is consistent with the life cycle of the management queue.
- when the first command is the nth AER command issued by the host, the first node configures the path status information in the management queue information.
- n is a positive integer, for example, n is 1, that is, the first command is the first AER command issued by the host, and the present application is not limited to this, and can be set according to actual needs.
- the number of AER commands issued by the host to the first node is limited, or the number of AER events that the first node can process is limited.
- the above process can be understood as the first node transferring a certain AER command to the network card for processing, that is, by pre-setting which AER command the first node transfers to the network card for processing, the network card is enabled to report asynchronous events.
- the host sends a first command to the first node, so that the first node configures the path status information in the management queue information of the network card, so that the network card can send a corresponding notification message to the host when an abnormal event related to the first node is detected.
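- steps 902 to 904 can be sketched as follows, as a simplified model rather than the actual driver interface. `AdminQueueInfo`, `Node` and `offload_index` are hypothetical names; the assignment inside `on_aer_command` stands in for the preset interface provided by the network card, whose real signature is not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdminQueueInfo:
    """Management queue information kept on the network card (illustrative)."""
    path_status: Optional[int] = None
    command_id: Optional[int] = None


class Node:
    def __init__(self, nic_info: AdminQueueInfo, offload_index: int = 1):
        self.nic_info = nic_info
        self.offload_index = offload_index  # which AER (the nth) goes to the NIC
        self.aer_seen = 0
        self.pending_aers = []              # AERs the node keeps for itself

    def on_aer_command(self, command_id: int, path_status: int) -> None:
        self.aer_seen += 1
        if self.aer_seen == self.offload_index:
            # Stand-in for calling the preset interface of the network card.
            self.nic_info.path_status = path_status
            self.nic_info.command_id = command_id
        else:
            self.pending_aers.append(command_id)


info = AdminQueueInfo()
node = Node(info, offload_index=1)
node.on_aer_command(command_id=7, path_status=0x03)
node.on_aer_command(command_id=8, path_status=0x03)
```

With `offload_index=1`, the first AER command is handed to the network card and later AER commands stay with the node, matching the example in which n is 1.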
- the network card of the storage device detects an abnormal event related to the first node.
- this step is the same as step 802 in the embodiment shown in Figure 8 above, so it will not be repeated here.
- when the network card of the storage device detects an abnormal event related to the first node, the network card obtains path status information from the management queue information of the network card and generates a notification message.
- the pre-configured information in the management queue information includes path status information and may also include the command identifier of the first command. Accordingly, the notification message generated by the network card includes path status information and may also include the command identifier of the first command, which will not be repeated here.
- the notification message includes the following content:
- error status includes built-in error status 03h (i.e., path status information, which is pre-configured by the interface parameters of the driver and the network card, refer to the aforementioned step 902);
- DW2 submission queue identifier: 0 (pre-configured by the interface parameters of the driver and the network card, refer to the above step 902); submission queue head pointer (SQHD): 0 (pre-configured by the interface parameters of the driver and the network card, refer to the above step 902);
- DW3 Command identifier: obtained by parsing the first command (pre-configured by specifying interface parameters of the driver and network card).
- the information in the notification message other than the path status information is optional information and can be configured according to the needs.
- the present application is not limited to the content shown in the above example.
- the status of the command indicated in the CQE is defined according to the status field of the command.
- the present application is not limited to the content shown in the above example.
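- as a rough illustration of the notification message content listed above, the following sketch packs a 16-byte completion entry from four dwords. The bit positions are assumptions made for this example (event type in the low bits of DW0, event information in bits 15:8, queue head pointer and queue identifier sharing DW2, command identifier in the low 16 bits of DW3); `build_aer_cqe` and `ERROR_STATUS_EVENT_TYPE` are hypothetical names, and the relevant protocol should be consulted for the authoritative layout.

```python
import struct

ERROR_STATUS_EVENT_TYPE = 0x0  # assumed code for the "error status" event type


def build_aer_cqe(event_info: int, command_id: int,
                  sq_id: int = 0, sq_head: int = 0) -> bytes:
    """Pack a 16-byte completion entry as four little-endian dwords."""
    dw0 = (ERROR_STATUS_EVENT_TYPE & 0x7) | ((event_info & 0xFF) << 8)
    dw1 = 0
    dw2 = (sq_head & 0xFFFF) | ((sq_id & 0xFFFF) << 16)
    dw3 = command_id & 0xFFFF   # status bits are left at 0 in this sketch
    return struct.pack("<4I", dw0, dw1, dw2, dw3)


# Values taken from the pre-configured management queue information.
cqe = build_aer_cqe(event_info=0x03, command_id=7)
```

The event information 03h and the command identifier come from the pre-configured management queue information, so the network card can build this entry without involving the first node.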
- the network card of the storage device sends a notification message to the host, where the notification message indicates that an abnormality occurs in the path where the first node is located.
- the network card sends the notification message to the host through the transport layer.
- the notification message is a communication manager (CM) message based on RDMA, but the present application is not limited thereto.
- the host receives the notification message.
- the host accesses the storage through a path where a node other than the first node among the multiple nodes is located.
- steps 907 to 909 are the same as steps 803 to 805 in the embodiment shown in Figure 8, so they are not repeated here.
- path status information is configured in the management queue information of the network card, so that the network card can actively send a notification message to the host when an abnormal event related to the first node is detected, so that the host can switch the path.
- Fig. 10 is another method for processing node abnormal events provided by an embodiment of the present application. As shown in Fig. 10, the interaction between the host 100 and the storage device 200 is used as an example for introduction. The method includes the following steps 1001 to 1011.
- a host establishes a communication connection with a first node among multiple nodes of a storage device through a network card of the storage device.
- this step is the same as step 801 in the embodiment shown in Figure 8 above, so it will not be repeated here.
- the host sends a second command to the first node.
- the second command instructs to generate path status information based on the operating system type of the host, and configure the path status information in the input and output queue context of the network card.
- the second command is a management command, and the host sends the second command to the first node through the management queue.
- the input and output queue context is information maintained by the network card and stored on the network card. It is context information related to the input and output queue, and the input and output queue is used to store NVMe I/O commands.
- the first node receives a second command sent by the host.
- the first node generates path status information according to the instruction of the second command and based on the operating system type of the host, and configures the path status information in the input and output queue context.
- the first node parses the received second command, generates path status information based on the operating system type of the host according to the instruction of the second command, calls the preset interface provided by the network card, and configures the path status information in the input/output queue context, so that the network card responds to the I/O command issued by the host based on the input/output queue context.
- because the input/output queue context is a kind of context information related to the input/output queue, the life cycle of the input/output queue context is consistent with the life cycle of the input/output queue.
- the path status information generated in this step can be referred to as the second preset information.
- the path status information can be configured according to actual needs, and this application does not limit this.
- the path status information is 0x360h, where 0x3 indicates a path error (for its detailed definition, refer to Table 2 below. It should be understood that Table 2 below is only a schematic illustration of the path status information. Other similar fields in the relevant protocol that can indicate path status information can be applied to this application in the same way, and this application is not limited to this), and 0x60 indicates that the node has detected a path error, and this application is not limited to this. It should be noted that the specific meaning of the above path status information is only for illustrative purposes and does not constitute a limitation on this application.
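- the way the example value 360h combines the two parts described above (3 for the path error type and 60 for the specific error) can be shown with a short sketch. The split into a 3-bit type above an 8-bit code is an assumption made for this illustration, and `split_path_status` is a hypothetical helper.

```python
def split_path_status(value: int) -> tuple:
    """Split a combined status value into (status type, status code)."""
    status_type = (value >> 8) & 0x7   # upper part, e.g. 3 for a path error
    status_code = value & 0xFF         # lower part, e.g. 60h
    return status_type, status_code


status_type, status_code = split_path_status(0x360)
```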
- the host sends a second command to the first node, so that the first node configures the path status information in the input and output queue context of the network card, so that the network card can send a corresponding notification message to the host when it detects an abnormal event related to the first node and receives an I/O command sent by the host.
- the network card of the storage device detects an abnormal event related to the first node.
- this step is the same as step 802 in the embodiment shown in Figure 8 above, so it will not be repeated here.
- the host sends a third command to the network card, where the third command is a read command or a write command, and the third command indicates that a notification message is sent to the host when an abnormal event related to the first node is detected.
- the host sends a third command to the first node through the input and output queue, which is intercepted by the network card.
- the third command is also an I/O command.
- the network card of the storage device receives the third command.
- when the network card of the storage device has detected an abnormal event related to the first node, it obtains path status information from the input and output queue context and generates a notification message.
- the notification message also includes a command identifier of a third command, but the present application is not limited thereto.
- the notification message includes the following content:
- DW0, command specific: 0 (preconfigured through the interface parameters between the driver and the network card, see step 1004 above; operating-system-specific preconfiguration is supported);
- DW1, command specific: 0 (preconfigured through the interface parameters between the driver and the network card, see step 1004 above; operating-system-specific preconfiguration is supported);
- DW2, submission queue identifier: obtained by the network card by parsing the third command (for example, converted from the queue pair number in the base transport header of the third command; the conversion method is not limited and can be set as needed); submission queue head pointer (SQHD): generated dynamically, for example 0, and this application is not limited thereto;
- DW3, path status information: 0x360h (preconfigured through the interface parameters between the driver and the network card, see step 1004 above; operating-system-specific preconfiguration is supported); command identifier: obtained by parsing the third command.
- the information in the notification message other than the path status information is optional information and can be configured as needed. This application is not limited to the content shown in the above example.
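The DW0 to DW3 layout listed above can be sketched as a 16-byte record. This is a minimal illustration, not the patent's implementation: the bit positions (queue identifier in the upper half of DW2, command identifier in the lower half of DW3, a phase bit at bit 16, and the status field above it) follow the commonly documented NVMe completion entry layout and are an assumption here, as are the function names `build_io_cqe` and `parse_io_cqe`.

```python
import struct

SCT_PATH_RELATED = 0x3    # "0x3 indicates a path error" in the example above
SC_PATH_ERROR = 0x60      # "0x60 indicates that the node has detected a path error"


def build_io_cqe(sq_id, sq_head, command_id,
                 sct=SCT_PATH_RELATED, sc=SC_PATH_ERROR, phase=1):
    """Pack DW0..DW3 (little-endian) for an I/O response carrying a path status."""
    status = ((sct & 0x7) << 8) | (sc & 0xFF)        # 0x360 with the defaults
    dw0 = 0                                           # command specific
    dw1 = 0                                           # command specific
    dw2 = ((sq_id & 0xFFFF) << 16) | (sq_head & 0xFFFF)
    dw3 = (status << 17) | ((phase & 0x1) << 16) | (command_id & 0xFFFF)
    return struct.pack("<4I", dw0, dw1, dw2, dw3)


def parse_io_cqe(raw):
    """Inverse of build_io_cqe, as a host-side driver might decode the entry."""
    _dw0, _dw1, dw2, dw3 = struct.unpack("<4I", raw)
    status = dw3 >> 17
    return {
        "sq_id": dw2 >> 16,
        "sq_head": dw2 & 0xFFFF,
        "command_id": dw3 & 0xFFFF,
        "phase": (dw3 >> 16) & 0x1,
        "sct": (status >> 8) & 0x7,
        "sc": status & 0xFF,
    }
```

Under this assumed layout, the value 0x360 is simply the status code type 0x3 placed above the 8-bit status code 0x60, which is consistent with how the text splits the value.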
- the network card of the storage device sends a notification message to the host, where the notification message indicates that an abnormality occurs in the path where the first node is located.
- the network card sends the notification message to the host through the application layer.
- the notification message is an NVMe encapsulated message, but the present application is not limited thereto.
- the host receives the notification message.
- the host accesses the storage through a path where a node other than the first node among the multiple nodes is located.
- steps 1009 to 1011 are the same as steps 803 to 805 in the embodiment shown in Figure 8, so they are not repeated here.
- path status information is configured in the input and output queue context of the network card, so that the network card can send a notification message to the host when it detects an abnormal event related to the first node and receives a third command issued by the host, so that the host can switch the path. This process can effectively reduce the delay of the host switching path and improve the continuity and reliability of storage services.
- with reference to FIG. 11 and FIG. 12, the method for processing node abnormal events provided in this application is illustrated below, taking the combination of the embodiments shown in FIG. 9 and FIG. 10 as an example.
- FIG 11 is a schematic diagram of a method for processing a node abnormal event provided by an embodiment of the present application.
- the host establishes a communication connection with a first node among multiple nodes of the storage device through the network card of the storage device, creates a management queue and an input-output queue, wherein the communication connection includes an NVMe connection and an RDMA connection.
- the host sends a first command to the first node through the management queue, the first node receives the first command, parses the first command, obtains the first preset information, calls the preset interface provided by the network card, and configures the first preset information in the management queue information (the first preset information refers to the above step 904).
- the host sends a second command to the first node through the management queue; the first node receives the second command and, as indicated by the second command, generates the second preset information based on the operating system type of the host, calls the preset interface provided by the network card, and configures the second preset information in the input/output queue context (for the second preset information, see step 1004 above).
- the process shown in Figure 11 can be understood as an initialization process, and through the process shown in Figure 11, the network card of the storage device can notify the host of abnormal events related to the first node.
- the present application does not limit the order in which the first command and the second command are sent.
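The initialization flow of FIG. 11 can be sketched as two independent configuration calls on state kept by the network card. This is a hedged illustration only: `NicState`, `handle_first_command`, `handle_second_command`, and the per-OS table `OS_PATH_STATUS` are hypothetical names, and the per-OS values are placeholders (the text gives 03h and 0x360h as examples but no per-OS mapping).

```python
from dataclasses import dataclass, field


@dataclass
class NicState:
    """State the network card keeps for later notifications (hypothetical)."""
    admin_queue_info: dict = field(default_factory=dict)   # admin queue id -> first preset info
    io_queue_context: dict = field(default_factory=dict)   # I/O queue id -> second preset info


# Placeholder mapping from host OS type to path status; values are assumptions.
OS_PATH_STATUS = {"linux": 0x360, "default": 0x360}


def handle_first_command(nic, admin_qid, command_id, path_status):
    """First node parses the first command and stores its contents on the NIC."""
    nic.admin_queue_info[admin_qid] = {
        "command_id": command_id,
        "path_status": path_status,
    }


def handle_second_command(nic, io_qid, host_os):
    """First node derives the path status from the host OS type and stores it."""
    status = OS_PATH_STATUS.get(host_os, OS_PATH_STATUS["default"])
    nic.io_queue_context[io_qid] = {"path_status": status}
```

Because the two calls write to separate structures, applying them in either order leaves the network card in the same state, which matches the statement that the order of the first and second commands is not limited.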
- FIG 12 is a schematic diagram of another method for processing node abnormal events provided by an embodiment of the present application.
- a fault reflector is deployed on the network card, and the fault reflector includes a detection logic unit and an execution logic unit.
- the detection logic unit detects an abnormal event related to the first node, and when an abnormal event related to the first node is detected, the execution logic unit is notified that a related abnormal event has occurred on the first node.
- the detection logic unit can also mark the abnormal event related to the first node, thereby avoiding repeated processing of the abnormal event related to the first node.
- the execution logic unit generates a notification message based on the first preset information according to the notification of the detection logic unit, and sends the notification message to the host.
- that is, the notification message is an AER asynchronous event completion message (it should be noted that, in some embodiments, the network card is communicatively connected to multiple hosts, in which case the network card traverses all current management queues and sends notification messages to the multiple hosts connected to it; this application does not limit this).
- the network card receives the third command (i.e., the I/O command)
- a notification message is generated based on the second preset information and sent to the host.
- that is, the notification message is an I/O response message.
- a fault reflector (a logical unit comprising a detection logic unit and an execution logic unit) is deployed on the network card of the storage device, separately from the node. When an abnormal event related to the first node is detected, the fault reflector takes over NVMe commands: it promptly completes the pre-configured asynchronous event toward the host and returns a specific error code for newly received NVMe commands, which quickly triggers path switching on the host. This effectively reduces the host's path-switching delay, lets the host converge on a normal path within seconds, and improves the continuity and reliability of storage services.
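The division of work between the detection logic unit and the execution logic unit can be sketched as follows. This is a toy model under stated assumptions, not the patented implementation: the class `FaultReflector`, its method names, and the dictionary-shaped messages are all hypothetical, and real hardware would emit protocol messages rather than Python dictionaries.

```python
class FaultReflector:
    """Toy model of the detection and execution logic units on the network card."""

    def __init__(self, admin_queue_info, io_queue_context, send_to_host):
        self.admin_queue_info = admin_queue_info   # admin queue id -> first preset info
        self.io_queue_context = io_queue_context   # I/O queue id -> second preset info
        self.send_to_host = send_to_host           # callable taking one message dict
        self.node_abnormal = False                 # mark that avoids repeated handling

    # Detection logic unit: called with the observed link state.
    def on_link_state(self, link_ok):
        if link_ok or self.node_abnormal:
            return
        self.node_abnormal = True
        self._complete_pending_aers()

    # Execution logic unit, proactive case: one message per current admin queue.
    def _complete_pending_aers(self):
        for qid, info in self.admin_queue_info.items():
            self.send_to_host({
                "kind": "aer_completion",
                "queue": qid,
                "command_id": info["command_id"],
                "path_status": info["path_status"],
            })

    # Execution logic unit, I/O case: answer a read/write command with the path status.
    def on_io_command(self, io_qid, command_id):
        if not self.node_abnormal:
            return False                           # normal case: command goes to the node
        ctx = self.io_queue_context[io_qid]
        self.send_to_host({
            "kind": "io_response",
            "queue": io_qid,
            "command_id": command_id,
            "path_status": ctx["path_status"],
        })
        return True
```

The `node_abnormal` flag plays the role of the mark described above: a second report of the same fault does not generate a second round of asynchronous event completions, while every later I/O command is still answered with the path status.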
- M is a positive integer
- FIG13 is a schematic diagram of the structure of a device for processing node abnormal events provided by an embodiment of the present application.
- the device can implement the functions of the network card in the aforementioned storage device through software, hardware, or a combination of both.
- the device is configured on a network card on a storage device, and the storage device includes the network card and multiple nodes.
- the network card is connected to a first node among the multiple nodes in communication, and the node is used to manage the storage.
- the device includes:
- the sending unit 1301 is used to send a notification message to the host when an abnormal event related to the first node is detected, and the notification message indicates that an abnormality has occurred in the path where the first node is located, so that the host can access the storage through the path where a node other than the first node is located among the multiple nodes.
- the notification message includes path status information, and the path status information indicates that an abnormality occurs on the path where the first node is located.
- the network card is communicatively connected to the first node via a peripheral component interconnect bus PCIe link.
- the device further comprises: a detection unit, configured to:
- a link anomaly detection is performed on the PCIe link to determine whether an abnormal event related to the first node occurs.
- the detection unit is used to:
- a polling mechanism detection and/or an interruption detection mechanism detection is performed on the PCIe link, and when an abnormality of the PCIe link is detected, it is determined that an abnormal event related to the first node has occurred.
- the sending unit 1301 is used for any of the following:
- the network card sends the notification message to the host through the application layer
- the network card sends the notification message to the host through the transport layer.
- the apparatus further includes: an acquisition unit, configured to:
- the path status information is obtained from the management queue information of the network card.
- the path status information in the management queue information is configured by the first node according to a first command issued by the host.
- the first command carries the path status information.
- the first command indicates that when an abnormal event related to the first node is detected, the notification message is sent to the host.
- the acquisition unit is further used to:
- the path status information is obtained from the input/output queue context of the network card.
- the path status information in the input/output queue context is configured by the first node according to a second command issued by the host.
- the second command indicates that the path status information is generated based on the operating system type of the host and the path status information is configured in the input/output queue context.
- the apparatus further includes: a receiving unit, configured to:
- a third command sent by the host is received, where the third command is a read command or a write command, and the third command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- when the above-mentioned device detects an abnormal event related to the first node, it can promptly send a notification message to the host, informing the host that an abnormality has occurred in the path where the first node is located, so that the host can switch paths.
- This method can effectively reduce the delay of the host switching the path and improve the continuity and reliability of the storage service.
- the node abnormal event processing device provided in the above embodiment only uses the division of the above functional modules as an example when processing node abnormal events.
- the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
- the node abnormal event processing device provided in the above embodiment and the node abnormal event processing method embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment, which will not be repeated here.
- FIG14 is a schematic diagram of the structure of another device for processing node abnormal events provided by an embodiment of the present application.
- the device can implement the functions of the aforementioned host through software, hardware, or a combination of both.
- the device is configured on the host, the host is connected to the network card of the storage device, the storage device includes the network card and multiple nodes, the network card is connected to the first node of the multiple nodes, the node is used to manage the memory, and the device includes:
- the receiving unit 1401 is configured to receive a notification message sent by the network card when an abnormal event related to the first node is detected, wherein the notification message indicates that an abnormality occurs in a path where the first node is located;
- the access unit 1402 is configured to access the storage through a path where nodes other than the first node among the multiple nodes are located based on the notification message.
- the notification message includes path status information, and the path status information indicates that an abnormality occurs on the path where the first node is located.
- the receiving unit 1401 is used for any of the following:
- the apparatus further includes a sending unit, configured to:
- a first command is sent to the first node so that the first node configures the path status information carried by the first command in the management queue information, and the first command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- the sending unit is further configured to:
- a second command is issued to the first node, so that the first node generates the path status information according to the instruction of the second command and based on the operating system type of the host, and configures the path status information in the input and output queue context.
- the sending unit is further configured to:
- a third command is sent to the network card, where the third command is a read command or a write command, and the third command indicates that the notification message is sent to the host when an abnormal event related to the first node is detected.
- the above-mentioned device can receive the notification message sent by the network card of the storage device, so as to promptly know that the path where the first node is located is abnormal and perform path switching.
- This method can effectively reduce the delay of the host switching path and improve the continuity and reliability of the storage business.
- the node abnormal event processing device provided in the above embodiment only uses the division of the above functional modules as an example when processing node abnormal events.
- the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
- the node abnormal event processing device provided in the above embodiment and the node abnormal event processing method embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment, which will not be repeated here.
- the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to data used for analysis, stored data, displayed data, etc.), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
- the preset information involved in this application is obtained with full authorization.
- terms such as “first” and “second” are used to distinguish identical or similar items that have substantially the same effects and functions. It should be understood that there is no logical or temporal dependency between “first”, “second”, and “nth”, and that neither the quantity nor the execution order is limited. It should also be understood that although the description uses the terms first, second, etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another.
- the first node can be referred to as the second node, and similarly, the second node can be referred to as the first node.
- the first node and the second node can both be nodes, and in some cases, can be separate and different nodes.
- At least one in this application means one or more, and the term “plurality” in this application means two or more, for example, a plurality of nodes means two or more nodes.
- all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
- all or part of the embodiments may be implemented in the form of program structure information.
- the program structure information includes one or more program instructions.
- All or part of the steps to implement the above embodiments may be completed by hardware, or by a program to instruct the relevant hardware.
- the program may be stored in a computer-readable storage medium.
- the above-mentioned storage medium may be a read-only memory, a disk or an optical disk, etc.
Abstract
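Disclosed are a method for processing node abnormal events, a network card, and a storage cluster, belonging to the field of storage technology. The method is applied to a network card of a storage device, and the storage device further includes multiple nodes for managing a storage. The network card is communicatively connected to a first node among the multiple nodes and, when it detects an abnormal event related to the first node, proactively sends a notification message to the host, informing the host that the path where the first node is located is abnormal so that the host can switch paths. This effectively reduces the host's path-switching delay and improves the continuity and reliability of storage services.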
Description
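This application claims priority to Chinese patent application No. 202211509370.9, filed on November 29, 2022 and entitled "An RDMA Network Card", and to Chinese patent application No. 202310144857.X, filed on January 29, 2023 and entitled "Method for Processing Node Abnormal Events, Network Card, and Storage Cluster", both of which are incorporated herein by reference in their entirety.
This application relates to the field of storage technology, and in particular to a method for processing node abnormal events, a network card, and a storage cluster.
The non-volatile memory express over fabrics (NVMe-oF) specification is a storage networking protocol. At present, NVMe-oF-based storage devices usually adopt a multi-node architecture (a node can also be understood as a controller) for host access to the storage device, in order to improve the reliability and continuity of storage services. For example, the host accesses the storage of the storage device through two paths, one through node A and one through node B. When the host is accessing the storage through the path where node A is located and an abnormal event occurs on node A (such as a node failure, upgrade, or restart), the host can switch to the path where node B is located and access the storage through that path.
In the related art, a heartbeat connection is established between the host and node A. The host uses this heartbeat connection to detect whether an abnormal event has occurred on node A and, if so, switches to the path where node B is located and accesses the storage through that path.
However, this approach relies on a heartbeat timeout mechanism, so the host takes a long time to switch paths after a node becomes abnormal. Services drop to zero for a long time during the switch, which harms the reliability and continuity of storage services.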
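Summary of the Invention
This application provides a method for processing node abnormal events, a network card, and a storage cluster, which can effectively reduce the host's path-switching delay and improve the continuity and reliability of storage services. The technical solution is as follows:
In a first aspect, a method for processing node abnormal events is provided, applied to a network card of a storage device. The storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a storage. The method includes:
when an abnormal event related to the first node is detected, the network card sends a notification message to the host, the notification message indicating that the path where the first node is located is abnormal, so that the host accesses the storage through a path where a node other than the first node among the multiple nodes is located.
A node of the storage device is a storage controller of the storage device; it can process commands issued by the host, manage the storage, and so on. An abnormal event related to the first node may be an abnormal event of the first node itself, such as a failure or restart of the first node, or an abnormal event on the communication link between the network card and the first node, such as a disconnection of that link; this application is not limited thereto. In the above method, faults are not discovered by passively waiting on heartbeat messages. Instead, the network card is communicatively connected to the first node, so that when it detects an abnormal event related to the first node it proactively sends a notification message to the host, informing the host that the path where the first node is located is abnormal so that the host can switch paths. This effectively reduces the host's path-switching delay and improves the continuity and reliability of storage services. It should be understood that, because the network card is communicatively connected to the node and both belong to the storage device, the network card can learn of an abnormal event related to the node as soon as it occurs and respond quickly by informing the host.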
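In some embodiments, the notification message includes path status information, and the path status information indicates that the path where the first node is located is abnormal.
In some embodiments, the network card is communicatively connected to the first node through a peripheral component interconnect express (PCIe) link.
In some embodiments, the method further includes: performing link anomaly detection on the PCIe link to determine whether an abnormal event related to the first node has occurred.
In some embodiments, performing link anomaly detection on the PCIe link includes:
performing polling-based detection and/or interrupt-based detection on the PCIe link, and determining that an abnormal event related to the first node has occurred when an abnormality of the PCIe link is detected.
Because the network card and the first node are connected through a PCIe link, and the data transmission rate of a PCIe link is high, the network card can quickly detect an abnormal event related to the first node through the PCIe link. For example, if the network card detects that the PCIe link is disconnected, or receives erroneous packets over the PCIe link, the network card considers it highly likely that an abnormal event has occurred on the first node; this application is not limited thereto.
In some embodiments, the network card sending the notification message to the host includes either of the following:
the network card sends the notification message to the host through the transport layer;
the network card sends the notification message to the host through the application layer.
In this way, the network card can quickly send the notification message to the host through the transport layer or the application layer, thereby reducing the host's path-switching delay.
In some embodiments, the method further includes:
obtaining the path status information from the management queue information of the network card, where the path status information in the management queue information is configured by the first node according to a first command issued by the host, the first command carries the path status information, and the first command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
In this way, the path status information is configured in the management queue information of the network card, and the first command is an asynchronous event request command, so the network card can proactively send the notification message to the host when it detects an abnormal event related to the first node.
In some embodiments, the method further includes:
obtaining the path status information from the input/output queue context of the network card, where the path status information in the input/output queue context is configured by the first node according to a second command issued by the host, and the second command indicates generating the path status information based on the operating system type of the host and configuring the path status information in the input/output queue context.
In some embodiments, the method further includes:
receiving a third command issued by the host, where the third command is a read command or a write command, and the third command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
In this way, the path status information is configured in the input/output queue context of the network card, so the network card can send the notification message to the host when it has detected an abnormal event related to the first node and receives the third command issued by the host, allowing the host to switch paths. This process effectively reduces the host's path-switching delay and improves the continuity and reliability of storage services.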
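In a second aspect, an embodiment of this application provides a method for processing node abnormal events, applied to a host. The host is communicatively connected to a network card of a storage device, the storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a storage. The method includes:
receiving a notification message sent by the network card when an abnormal event related to the first node is detected, the notification message indicating that the path where the first node is located is abnormal;
based on the notification message, accessing the storage through a path where a node other than the first node among the multiple nodes is located.
In some embodiments, the notification message includes path status information, and the path status information indicates that the path where the first node is located is abnormal.
In some embodiments, the host receiving the notification message sent by the network card when an abnormal event related to the first node is detected includes either of the following:
the host receives the notification message through the transport layer;
the host receives the notification message through the application layer.
In some embodiments, the method further includes:
issuing a first command to the first node, so that the first node configures the path status information carried by the first command in the management queue information, where the first command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
In some embodiments, the method further includes:
issuing a second command to the first node, so that the first node, as indicated by the second command, generates the path status information based on the operating system type of the host and configures the path status information in the input/output queue context.
In some embodiments, the method further includes:
issuing a third command to the network card, where the third command is a read command or a write command, and the third command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.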
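In a third aspect, an embodiment of this application provides an apparatus for processing node abnormal events. The apparatus is configured on a network card in a storage device, the storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a storage. The apparatus includes at least one functional unit for performing the method for processing node abnormal events provided in the first aspect or any possible implementation of the first aspect.
In a fourth aspect, an embodiment of this application provides an apparatus for processing node abnormal events. The apparatus is configured on a host, the host is communicatively connected to a network card of a storage device, the storage device includes the network card and multiple nodes, the network card is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a storage. The apparatus includes at least one functional unit for performing the method for processing node abnormal events provided in the second aspect or any possible implementation of the second aspect.
In a fifth aspect, an embodiment of this application provides a network card configured on a storage device. The network card includes a processor, a memory, and an interface. The interface is used for communicative connection with a node in the storage device, the memory is used to store at least one piece of program code, and the at least one piece of program code is loaded by the processor to implement the method for processing node abnormal events provided in the first aspect or any possible implementation of the first aspect.
In a sixth aspect, an embodiment of this application provides a storage cluster. The storage cluster includes a network card, multiple nodes, and a storage. The network card is communicatively connected to the nodes, the nodes are used to manage the storage, and the network card is used to perform the method for processing node abnormal events provided in the first aspect or any possible implementation of the first aspect.
In some embodiments, the storage cluster is a centralized storage device, the nodes are storage controllers, the network card is connected to the nodes through a system bus, and the storage is connected to the nodes through a system bus.
In other embodiments, the storage cluster is a distributed storage system that includes multiple independent storage devices connected to one another through a wired or wireless network to form a storage network. Each storage device includes the network card, the node, and the storage, with the network card connected to the node through a system bus and the storage connected to the node through a system bus; or each storage device includes the network card and the node, and the storage is communicatively connected to the node in each storage device.
In a seventh aspect, an embodiment of this application provides a host. The host includes a processor and a memory, and the processor is used to execute instructions stored in the memory so that the host performs the method for processing node abnormal events provided in the second aspect or any possible implementation of the second aspect.
In an eighth aspect, an embodiment of this application provides a computer-readable storage medium used to store at least one piece of program code. The at least one piece of program code is used to implement the method for processing node abnormal events provided in the first aspect or any possible implementation of the first aspect, or the method provided in the second aspect or any possible implementation of the second aspect. The storage medium includes, but is not limited to, volatile memory such as random access memory, and non-volatile memory such as flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
In a ninth aspect, an embodiment of this application provides a computer program product. When the computer program product runs on a storage device, the storage device implements the method for processing node abnormal events provided in the first aspect or any possible implementation of the first aspect. Alternatively, when the computer program product runs on a host, the host implements the method provided in the second aspect or any possible implementation of the second aspect. The computer program product may be a software installation package, which can be downloaded and executed when the foregoing method needs to be implemented.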
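FIG. 1 is a schematic diagram of a storage architecture provided by an embodiment of this application;
FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of this application;
FIG. 3 is a schematic diagram of another implementation environment provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of a host provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of a storage device provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of a distributed storage system provided by an embodiment of this application;
FIG. 7 is a schematic diagram of the logic units of a network card provided by an embodiment of this application;
FIG. 8 shows a method for processing node abnormal events provided by an embodiment of this application;
FIG. 9 shows another method for processing node abnormal events provided by an embodiment of this application;
FIG. 10 shows another method for processing node abnormal events provided by an embodiment of this application;
FIG. 11 is a schematic diagram of a method for processing node abnormal events provided by an embodiment of this application;
FIG. 12 is a schematic diagram of another method for processing node abnormal events provided by an embodiment of this application;
FIG. 13 is a schematic structural diagram of an apparatus for processing node abnormal events provided by an embodiment of this application;
FIG. 14 is a schematic structural diagram of another apparatus for processing node abnormal events provided by an embodiment of this application.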
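To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the accompanying drawings.
For ease of understanding, the key terms and concepts involved in this application are explained first.
NVMe is a set of software and hardware standards that allow solid state disks (SSDs) to use the peripheral component interconnect express (PCIe) bus, while PCIe is the actual physical connection channel. NVM stands for non-volatile memory, the common flash form of SSDs. NVMe mainly provides a low-latency, internally parallel native interface specification for flash-based storage devices, and also provides native storage parallelism support for modern central processing units (CPUs), computer platforms, and related applications, so that host hardware and software can make full use of the parallel storage capability of solid-state storage devices.
An NVMe command is a command defined by the NVMe protocol. Commands in the NVMe protocol are divided into management (administrator, Admin) commands and input/output (I/O) commands; in some embodiments, I/O commands are also called NVM commands. Admin commands are used to manage and control the NVMe storage medium. I/O commands are used to transfer data. Illustratively, the I/O commands in the NVMe protocol include the NVMe read command and the NVMe write command.
A queue pair (QP) is a pair of queues used to carry NVMe commands, consisting of a submission queue (SQ) and a completion queue (CQ). Illustratively, the host submits commands to the NVMe node (controller) through the SQ, and the NVMe controller posts the completion status to the CQ.
The NVMe-oF specification is a high-speed storage protocol built on top of the NVMe protocol and is used to access NVMe storage media across a network. NVMe-oF adds fabric-related commands to NVMe, so that NVMe is no longer confined to use inside a single device but extends to communication across a network. Here, "fabric" refers to the network between the host and the storage medium. Typical forms of fabric include Ethernet, Fibre Channel, InfiniBand (IB), and remote direct memory access (RDMA). For example, the fabric may be implemented using RDMA over converged Ethernet (RoCE); this application is not limited thereto.
RDMA is a technology for accessing the memory of a remote device while bypassing the remote device's operating system kernel. Because RDMA usually does not go through the operating system, it saves a large amount of CPU resources, increases throughput, and reduces network communication latency.
A namespace is a formatted quantity of non-volatile memory that can be directly accessed by the host, and can also be understood as a storage space. In some embodiments, a namespace appears to the host as a real physical disk. For example, if one SSD includes two namespaces, the host can access two physical disks and can format and partition them separately.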
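The application scenarios and implementation environments involved in this application are introduced below.
The technical solutions provided by the embodiments of this application can be applied to an NVMe-oF-based storage architecture and can improve the continuity and reliability of storage services. The application scenario involved in this application is introduced below with reference to FIG. 1.
FIG. 1 is a schematic diagram of a storage architecture provided by an embodiment of this application. As shown in FIG. 1, the storage architecture includes hosts, a switch, and an NVMe-oF-based storage device, and the storage device adopts a dual-node architecture for host access to the storage in the storage device. The storage device includes a network card (network interface card, NIC), node A, node B, and a storage. Data in the storage can be indexed through namespaces, and the network card is communicatively connected to the nodes through PCIe links. Illustratively, host 1 can access the storage through four redundant paths that pass through node A and node B of the storage device. When host 1 is accessing the storage through a path where node A is located and an abnormal event occurs on node A (such as a node failure, upgrade, or restart), host 1 can switch to a path where node B is located and access the storage through that path, keeping the storage service continuous. In the related art, a heartbeat connection is established between host 1 and node A; host 1 uses it to detect whether an abnormal event has occurred on node A and, if so, switches to the path where node B is located and accesses the storage through that path. However, this approach relies on a heartbeat timeout mechanism, so the host takes a long time to switch paths after a node becomes abnormal. Services drop to zero for a long time during the switch, which harms the reliability and continuity of storage services.
Based on the application scenario shown in FIG. 1, this application provides a method for processing node abnormal events which, when an abnormal event related to a node in the storage device is detected (in other words, in a node reset scenario), reduces the host's path-switching delay and improves the continuity and reliability of storage services.
The implementation environments involved in this application are introduced below with reference to FIG. 2 and FIG. 3.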
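FIG. 2 is a schematic diagram of an implementation environment provided by an embodiment of this application. As shown in FIG. 2, the implementation environment includes a host 100 and a storage device 200. The storage device 200 includes a network card 201, multiple nodes 202, and a storage 203, and the host 100 and the storage device 200 are connected directly or indirectly through a wired or wireless network. It should be understood that FIG. 1 and FIG. 2 both show a centralized storage device (or storage array), for example a storage area network (SAN). Illustratively, a node 202 is a storage controller, and the storage 203 is a persistent storage medium such as a hard disk drive (HDD) or a solid state drive (SSD); this application is not limited thereto. The storage device 200 can also be understood as a storage cluster that includes multiple nodes 202 (that is, storage controllers).
The host 100 is a device used to run storage services, for example a device running RoCE services, which is not limited here. Illustratively, the host 100 runs storage services by accessing the storage device 200. In the embodiments of this application, the host 100 has a path-switching capability and can switch from one path to another to access the storage device 200, improving the continuity and reliability of storage services. For example, the host 100 is a terminal device or server running a client; this application is not limited thereto. Illustratively, the protocol stack of the host 100 includes a file system, block I/O, the small computer system interface (SCSI), NVMe, drivers, and physical devices; this application is not limited thereto. In addition, there may be one or more hosts 100, which is not limited in this application.
The storage device 200 is used to provide accessible storage space for the host 100, such as read and write access to disk space. In the embodiments of this application, the storage device 200 includes a network card 201, multiple nodes 202, and a storage 203. The network card 201 is connected to the nodes 202 through a system bus (such as a PCIe link), and the storage 203 is connected to the nodes 202 through a system bus. A node 202 is a storage controller of the storage device; it can process commands issued by the host, manage the storage 203, and so on. It should be understood that the numbers of network cards 201 and nodes 202 shown in the figure, and the connections between them, are only illustrative: one network card can be connected to one or more nodes, and one node can be connected to one or more network cards, which is not limited in this application. In the embodiments of this application, the network card 201 is capable of handling abnormal events related to a node 202, which includes detecting such events and notifying the host 100 of them. For example, taking the case where the network card 201 is communicatively connected to a first node among the multiple nodes 202, the network card 201 can, when it detects an abnormal event related to the first node, send a notification message to the host 100 informing it that the path where the first node is located is abnormal, so that the host 100 accesses the storage 203 through a path where a node other than the first node among the multiple nodes 202 is located.
In some embodiments, the implementation environment further includes a switch 300, and the host 100 can access the storage device 200 via the switch 300; this application is not limited thereto. It should be understood that the switch 300 is optional and the host 100 can also access the storage device 200 directly.
In some embodiments, the above wireless or wired network uses standard communication technologies and/or protocols. The network is usually the Internet, but can be any network, including but not limited to any combination of a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), mobile, wired, or wireless networks, private networks, or virtual private networks. In some implementations, technologies and/or formats including hypertext markup language (HTML) and extensible markup language (XML) are used to represent data exchanged over the network. In addition, conventional encryption technologies such as secure socket layer (SSL), transport layer security (TLS), virtual private network (VPN), and internet protocol security (IPsec) can be used to encrypt all or some of the links. In other embodiments, customized and/or dedicated data communication technologies can be used in place of, or in addition to, the above data communication technologies.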
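In the implementation environment shown in FIG. 2, the storage device 200 includes multiple nodes 202, so the host 100 can access the storage device 200 through multiple paths. It should be understood that this application does not limit the number of storage devices 200; the method for processing node abnormal events provided in this application applies equally when there are multiple storage devices 200. This case is introduced below with reference to the distributed architecture shown in FIG. 3.
FIG. 3 is a schematic diagram of another implementation environment provided by an embodiment of this application. As shown in FIG. 3, the implementation environment includes a host 100 and a distributed storage system 400, which are connected directly or indirectly through a wired or wireless network. Illustratively, the distributed storage system 400 is a storage cluster that includes multiple independent storage devices 200, which are connected to one another through a wired or wireless network to form a storage network.
The host 100 runs storage services by accessing the distributed storage system 400. The host 100 has a path-switching capability and can switch from one storage device 200 to another to run storage services, improving the continuity and reliability of storage services. There may be one or more hosts 100, which is not limited in this application.
The distributed storage system 400 is used to provide accessible storage space for the host 100, such as read and write access to disk space. Illustratively, each storage device 200 includes a network card 201, at least one node 202, and a storage 203. In each storage device 200, the network card 201 is connected to the node 202 through a system bus (such as a PCIe link), and the storage 203 is connected to the node 202 through a system bus. The node 202 is used to manage the storage 203. It should be understood that the numbers of network cards 201 and nodes 202 shown in the figure, and the connections between them, are only illustrative: one network card can be connected to one or more nodes, and one node can be connected to one or more network cards, which is not limited in this application.
In some embodiments, a first network card and the first node are located in a first storage device, and the other node selected by the host is located in a second storage device. The other node selected by the host is the node the host chooses when switching paths. For example, while the host 100 is accessing the first storage device, the first network card detects an abnormal event related to the first node and sends a notification message to the host 100 informing it that the path where the first node is located is abnormal, so that the host 100 accesses the second storage device to run storage services (it should be understood that the set of nodes 202 across the distributed storage system 400 is treated in the same way as the multiple nodes 202 in the implementation environment shown in FIG. 2, which is not repeated here).
In some embodiments, the implementation environment further includes a switch 300, in the same way as the implementation environment shown in FIG. 2, which is not repeated here. In some embodiments, the above wireless or wired network uses standard communication technologies and/or protocols, in the same way as the implementation environment shown in FIG. 2, which is not repeated here.
It should be noted that FIG. 3 shows only one form of the distributed architecture provided in this application. In other embodiments, each storage device in the distributed storage system includes a network card and a node, and the storage of the distributed storage system is communicatively connected to the node in each storage device. In other words, the storage of the distributed storage system can be located inside or outside the storage devices. In addition, this application does not limit the number of storages in the distributed storage system; there may be one or more, configured according to actual needs, which is not repeated here.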
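The hardware structure of the devices involved in the above implementation environments is introduced below.
FIG. 4 is a schematic structural diagram of a host provided by an embodiment of this application. As shown in FIG. 4, the host 100 includes a memory 101, a processor 102, a communication interface 103, and a bus 104. The memory 101, the processor 102, and the communication interface 103 are communicatively connected to one another through the bus 104.
The memory 101 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. Illustratively, the memory 101 is used to store at least one piece of program code; when the program code stored in the memory 101 is executed by the processor 102, the processor 102 and the communication interface 103 are used to perform the host-side steps of the method for processing node abnormal events described below.
The processor 102 may be a network processor (NP), a central processing unit (CPU), an application-specific integrated circuit (ASIC), or an integrated circuit for controlling the execution of the programs of the solutions of this application. The processor 102 may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. There may be one or more processors 102. The memory 101 and the processor 102 may be provided separately or integrated together.
The communication interface 103 uses a transceiver module, such as a transceiver, to implement communication between the host 100 and other devices or communication networks. For example, commands can be issued to the storage device 200 through the communication interface 103, and notification messages sent by the storage device 200 can be received through it; this application is not limited thereto.
The bus 104 may include a path for transferring information between the components of the host 100 (for example, the memory 101, the processor 102, and the communication interface 103).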
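FIG. 5 is a schematic structural diagram of a storage device provided by an embodiment of this application. As shown in FIG. 5, the storage device 200 is an NVMe-oF-based centralized storage device that includes a network card 201, multiple nodes 202, a storage 203, and a bus 204. The network card 201, the multiple nodes 202, and the storage 203 are communicatively connected to one another through the bus 204.
The network card 201 is used to implement communication between the storage device 200 and other devices or communication networks. For example, the storage device 200 can send a notification message to the host 100 through the network card 201. In the embodiments of this application, the network card 201 is capable of handling abnormal events related to a node, which includes detecting such events and notifying the host 100 of them. The network card 201 includes a processor 2011, a memory 2012, and an interface 2013. The interface 2013 is used for communicative connection with at least one node 202, and the memory 2012 is used to store at least one piece of program code, which is loaded by the processor 2011 to implement the network-card-side steps of the method for processing node abnormal events described below. The processor 2011 may be an NPU, a CPU, or the like. In addition, the processor 2011 may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU, MPU); this application is not limited thereto. In some embodiments, the network card 201 is an RDMA network card (RNIC) or another smart network card (smart NIC); this application is not limited thereto.
The node 202 is used to manage the storage 203 and process commands sent by the host 100. For example, according to an I/O command sent by the host 100, the node 202 writes data to the storage 203 or reads data from it; this application is not limited thereto. In practice, the node 202 can take many forms. Illustratively, the node 202 is the entity that processes the NVMe-oF protocol. For example, the node 202 includes a CPU and a memory: the CPU performs operations such as address translation and reading and writing data, and the memory temporarily holds data to be written to the storage 203 or data read from the storage 203 that is to be sent to the host 100; this application is not limited thereto.
The storage 203 includes at least one solid state drive (SSD) for storing data. An SSD is a storage that mainly uses flash memory for persistent storage.
The bus 204 may include a path for transferring information between the components of the storage device 200 (for example, the network card 201, the multiple nodes 202, and the storage 203).
FIG. 6 is a schematic structural diagram of a distributed storage system provided by an embodiment of this application. As shown in FIG. 6, the distributed storage system 400 includes multiple independent storage devices 200. Each storage device includes a network card 201, at least one node 202, and a storage 203; in each storage device 200, the network card 201 is connected to the node 202 through a system bus, and the storage 203 is connected to the node 202 through a system bus. The node 202 is used to manage the storage 203. The storage devices 200 are connected to one another through a network, which may be a wide area network, a local area network, or the like; this application is not limited thereto. Specifically, each storage device 200 connects to the network through its network card 201. It should be noted that the structure of any storage device 200 in the storage cluster shown in FIG. 6 is the same as that of the storage device shown in FIG. 5 and is not repeated here. It should also be noted that, as explained for FIG. 3, this application limits neither the location nor the number of storages in the distributed storage system; FIG. 6 shows only one form of the distributed storage system provided in this application and does not constitute a limitation on this application.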
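The method for processing node abnormal events provided in this application is introduced below through several method embodiments.
As described above for the storage device 200, the network card 201 of the storage device 200 is capable of handling abnormal events related to a node 202, which includes detecting such events and notifying the host 100 of them. The capabilities of the network card 201 are introduced below at the logical level with reference to FIG. 7.
FIG. 7 is a schematic diagram of the logic units of a network card provided by an embodiment of this application. As shown in FIG. 7, the network card is communicatively connected to a first node among the multiple nodes through a PCIe link, and a fault reflector is deployed on the network card. The fault reflector is run by the processor of the network card and provides the capability to handle abnormal events related to a node. The fault reflector includes a detection logic unit and an execution logic unit. The detection logic unit is used to detect abnormal events related to the first node and, when such an event is detected, to notify the execution logic unit. Illustratively, the detection logic unit can notify the execution logic unit through an interface call or another on-chip messaging method, which is not limited here. For example, the detection logic unit is executed by a CPU or an MPU; this application is not limited thereto. The execution logic unit is used to perform, upon notification from the detection logic unit, the step of sending a notification message to the host 100, so as to notify the host 100 that the path where the first node is located is abnormal. For example, the execution logic unit is executed by a CPU or an NPU; this application is not limited thereto.
On this basis, the method for processing node abnormal events provided in this application is introduced below through several method embodiments, taking the interaction between the host 100 and the storage device 200 as an example.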
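FIG. 8 shows a method for processing node abnormal events provided by an embodiment of this application. As shown in FIG. 8, taking the interaction between the host 100 and the storage device 200 as an example, the method includes the following steps 801 to 805.
801. The host establishes a communication connection with a first node among the multiple nodes of the storage device through the network card of the storage device.
In the embodiments of this application, the storage device includes multiple nodes, and the first node is any one of them; a node is also called a storage controller. The network card of the storage device is communicatively connected to the first node. The host sends a communication connection request to the first node through the network card, and the first node establishes a communication connection with the host according to the received request, so that the host can access the storage through the path where the first node is located.
Illustratively, an NVMe connection is established between the host and the first node based on the NVMe protocol, so that the host can issue NVMe commands to the first node; NVMe commands include management commands and input/output commands. In some embodiments, the network card of the storage device is an RNIC, and an RDMA connection can be established between the host and the first node based on the RDMA protocol, so that the host can issue RDMA commands to the first node and use RDMA functionality, which saves a large amount of CPU resources, increases throughput, and reduces network communication latency.
802. The network card of the storage device detects abnormal events related to the first node.
In the embodiments of this application, the network card of the storage device performs link anomaly detection on the PCIe link between the network card and the first node to determine whether an abnormal event related to the first node has occurred. An abnormal event related to the first node may be an abnormal event of the first node itself, such as a failure or restart of the first node, or an abnormal event on the PCIe link between the network card and the first node, such as a disconnection of the PCIe link; this application is not limited thereto. It should be understood that the network card can infer from a PCIe link abnormality that an abnormal event has occurred on the first node. For example, if the network card detects that the PCIe link is disconnected, or receives erroneous packets over the PCIe link, the network card considers it highly likely that an abnormal event has occurred on the first node; this application is not limited thereto. In addition, because the network card and the first node are connected through a PCIe link, and the data transmission rate of a PCIe link is high, the network card can quickly detect an abnormal event related to the first node through the PCIe link.
In some embodiments, the network card performs polling-based detection and/or interrupt-based detection on the PCIe link, and determines that an abnormal event related to the first node has occurred when an abnormality of the PCIe link is detected. Polling means that the network card monitors the running state of its attached devices by polling them. Interrupt-based detection means that an attached device of the network card proactively reports an interrupt signal to the network card when an abnormal event occurs, so that the network card learns that the attached device has experienced an abnormal event.
Of course, the network card can also detect abnormal events related to the first node in other ways. For example, a heartbeat connection can be established between the network card and the first node and used to detect abnormal events related to the first node; this application is not limited thereto.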
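The two detection mechanisms described for step 802 can be combined in one small monitor. The following is a minimal sketch under stated assumptions: `LinkMonitor`, `read_link_up`, and `on_abnormal` are hypothetical names, and reading the real PCIe link state is abstracted into a callable because the text does not specify how the network card obtains it.

```python
class LinkMonitor:
    """Reports a link abnormality once, whether found by polling or by interrupt."""

    def __init__(self, read_link_up, on_abnormal):
        self.read_link_up = read_link_up   # callable returning True while the link is healthy
        self.on_abnormal = on_abnormal     # callable invoked with a short reason string
        self.reported = False

    def _report(self, reason):
        # Report only the first abnormality so the event is not handled twice.
        if not self.reported:
            self.reported = True
            self.on_abnormal(reason)

    def poll_once(self):
        """Polling mechanism: the network card actively checks the link state."""
        if not self.read_link_up():
            self._report("poll: link down")

    def on_interrupt(self, error_code):
        """Interrupt mechanism: the attached device reports the abnormality itself."""
        self._report(f"interrupt: {error_code}")
```

In use, `poll_once` would be called on a timer while `on_interrupt` is wired to the device's error signal; whichever path sees the fault first triggers the notification, and the other becomes a no-op.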
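803. When the network card of the storage device detects an abnormal event related to the first node, it sends a notification message to the host, the notification message indicating that the path where the first node is located is abnormal.
In the embodiments of this application, when the network card of the storage device detects that an abnormal event has occurred on the first node, it generates a notification message based on information pre-configured in the network card and sends the notification message to the host. The notification message includes path status information, and the path status information indicates that the path where the first node is located is abnormal.
In addition, in this step, the network card sends the notification message to the host in at least one of the following ways:
First, when the network card detects an abnormal event related to the first node, it proactively generates a notification message and sends it to the host.
Second, when the network card has detected an abnormal event related to the first node and receives a read command or write command sent by the host, it proactively generates a notification message and sends it to the host.
In this way, the network card can proactively notify the host when it detects an abnormal event related to the first node, so that the host accesses the storage through a path where a node other than the first node among the multiple nodes is located. It should be noted that these two ways of sending the notification message are described in detail in subsequent embodiments and are not elaborated here.
Through steps 802 and 803, the network card of the storage device can detect abnormal events related to the first node and proactively notify the host when such an event is detected. The network card may detect abnormal events related to the first node periodically or continuously; this application is not limited thereto. In addition, as described for FIG. 7, a fault reflector is deployed on the network card of the storage device and provides the function of handling abnormal events related to a node. Accordingly, step 802 can be performed by the detection logic unit in the fault reflector, and step 803 can be performed by the execution logic unit in the fault reflector.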
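804. The host receives the notification message.
805. Based on the notification message, the host accesses the storage through a path where a node other than the first node among the multiple nodes is located.
In the embodiments of this application, the host obtains the path status information from the notification message. Because the path status information indicates that the path where the first node is located is abnormal, the host learns that the storage is very likely unreachable through that path, so the host accesses the storage through a path where a node other than the first node among the multiple nodes is located, keeping the storage service continuous.
In the above method for processing node abnormal events, the network card of the storage device is communicatively connected to the first node and can, when it detects an abnormal event related to the first node, promptly send a notification message to the host informing it that the path where the first node is located is abnormal so that the host can switch paths. This effectively reduces the host's path-switching delay and improves the continuity and reliability of storage services.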
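The host-side behavior of steps 804 and 805 amounts to marking a path as failed and redirecting I/O to a remaining one. The sketch below is illustrative only: `MultipathHost`, the node names, and the rule of picking the first healthy path are assumptions, since the text does not describe how the host chooses among the remaining paths.

```python
class MultipathHost:
    """Toy host that switches paths when notified that a node's path is abnormal."""

    def __init__(self, paths):
        self.paths = dict(paths)            # node name -> callable(request) -> result
        self.failed = set()
        self.active = next(iter(self.paths))

    def on_notification(self, node):
        """Step 804/805: record the abnormal path and move off it if it is in use."""
        self.failed.add(node)
        if self.active == node:
            healthy = [n for n in self.paths if n not in self.failed]
            if not healthy:
                raise RuntimeError("no healthy path left")
            self.active = healthy[0]

    def submit(self, request):
        """Send a request over whichever path is currently active."""
        return self.paths[self.active](request)
```

The point of the design is that the switch is driven by the notification itself rather than by a timeout, so the first request issued after `on_notification` already goes through the other node.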
基于上述图8所示实施例可知,存储设备的网卡向主机发送通知消息包括至少一种方式,下面分别通过图9和图10所示实施例,对上述涉及的两种方式分别进行介绍。
图9是本申请实施例提供的另一种节点异常事件的处理方法。如图9所示,以主机100与存储设备200之间的交互为例进行介绍,该方法包括下述步骤901至步骤909。
901、主机通过存储设备的网卡,与存储设备的多个节点中的第一节点建立通信连接。
其中,本步骤与前述图8所示实施例中步骤801同理,故不再赘述。
902、主机向第一节点下发第一命令,该第一命令携带路径状态信息,该第一命令指示在检测到与第一节点相关的异常事件的情况下,向主机发送通知消息。
其中,第一命令为管理命令,主机通过管理队列(admin queue)向第一节点下发第一命令,管理队列用于存放NVMe管理命令。示意性地,该第一命令为异步事件请求(asynchronous event request,AER)命令。其中,AER命令是一种异步命令,用于在某些事件发生时,通知主机关于状态、错误、健康信息等,也即是,主机不要求立即报告完成该AER命令,而是在发生异常事件的情况下再报告完成。应理解,主机能够向节点下发至少一个AER命令来使能节点报告异步事件,这个命令不设置超时时间,在有异步事件需要报告给主机时,节点产生一个完成队列条目(completion queue entry,CQE)信息,将该CQE信息发送给主机的完成队列(completion queue,CQ)。
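The path status information can be configured according to actual requirements, which this application does not limit. For example, the path status information is 03h; for its detailed definition, see Table 1 below. (It should be understood that Table 1 is only an illustrative description of path status information; other similar fields in related protocols that can indicate path status information are equally applicable to this application, and this application is not limited thereto.) In some embodiments, the first command also carries the command identifier of the first command; this application is not limited thereto.
Table 1. Asynchronous event information: error status
903. The first node receives the first command issued by the host.
904. The first node configures the path status information carried in the first command into the admin queue information of the NIC.
The admin queue information is maintained by the NIC and stored on the NIC. The first node parses the received first command to obtain the path status information and calls a preset interface provided by the NIC to configure the path status information into the admin queue information. Later, when an abnormal event on the first node is detected, the path status information is obtained from the admin queue information to complete the first command. In addition, as noted for step 902, the first command may also include its command identifier. The information carried by the first command may be collectively called first preset information; in this step, the first node can configure the first preset information into the admin queue information according to the received first command. It should be understood that the admin queue information is context information associated with the admin queue, so its life cycle is the same as that of the admin queue.
In some embodiments, the first node configures the path status information into the admin queue information when the first command is the n-th AER command issued by the host, where n is a positive integer. For example, n is 1, that is, the first command is the first AER command issued by the host; this application is not limited thereto, and n can be set according to actual requirements. Note that the number of AER commands the host issues to the first node is limited, or in other words the number of AER events the first node can handle is limited. The above process can be understood as the first node handing one particular AER command over to the NIC for processing: by presetting which AER command the first node hands over, the NIC is enabled to report asynchronous events.
Through steps 902 to 904, the host issues the first command to the first node so that the first node configures the path status information into the NIC's admin queue information. The NIC can then send the corresponding notification message to the host upon detecting an abnormal event related to the first node.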
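Steps 902 to 904 can be sketched as below. All names here (`Nic`, `Node`, `configure_admin_queue`, `delegate_nth`, the integer `queue_key`) are hypothetical; `configure_admin_queue` merely stands in for the "preset interface provided by the NIC", whose real signature this application does not give.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class AdminQueueInfo:
    """First preset information kept by the NIC for one admin queue."""
    path_status: int
    command_id: int


class Nic:
    def __init__(self) -> None:
        self.admin_queue_info: Dict[int, AdminQueueInfo] = {}

    def configure_admin_queue(self, queue_key: int, path_status: int,
                              command_id: int) -> None:
        # Hypothetical stand-in for the NIC's preset interface.
        self.admin_queue_info[queue_key] = AdminQueueInfo(path_status, command_id)

    def destroy_admin_queue(self, queue_key: int) -> None:
        # The context shares the life cycle of its admin queue.
        self.admin_queue_info.pop(queue_key, None)


class Node:
    def __init__(self, nic: Nic, delegate_nth: int = 1) -> None:
        self.nic = nic
        self.delegate_nth = delegate_nth   # which AER command is handed to the NIC
        self.aer_seen = 0

    def on_aer(self, queue_key: int, command_id: int, path_status: int) -> bool:
        """Return True if this AER command was handed over to the NIC."""
        self.aer_seen += 1
        if self.aer_seen != self.delegate_nth:
            return False                   # the node keeps this AER for itself
        self.nic.configure_admin_queue(queue_key, path_status, command_id)
        return True


nic = Nic()
node = Node(nic, delegate_nth=1)
first = node.on_aer(queue_key=1, command_id=0x11, path_status=0x03)
second = node.on_aer(queue_key=1, command_id=0x12, path_status=0x03)
```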
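905. The NIC of the storage device detects abnormal events related to the first node.
This step is the same as step 802 in the embodiment shown in FIG. 8 and is not repeated here.
906. Upon detecting an abnormal event related to the first node, the NIC of the storage device obtains the path status information from the NIC's admin queue information and generates a notification message.
As noted for step 904, the information preconfigured in the admin queue information includes the path status information and may also include the command identifier of the first command. Accordingly, the notification message generated by the NIC includes the path status information and may also include the command identifier of the first command, which is not elaborated here.
Taking a notification message that is a CQE as an example, the specific content of the notification message is illustrated below. Illustratively, the notification message includes the following:
1. DW0: error status, containing the built-in error status 03h (that is, the path status information, preconfigured through the interface parameters between the driver and the NIC; see step 902);
2. DW1: reserved: 00;
3. DW2: submission queue identifier: 0 (preconfigured through the interface parameters between the driver and the NIC; see step 902); submission queue head pointer (SQHD): 0 (preconfigured through the interface parameters between the driver and the NIC; see step 902);
4. DW3: command identifier: obtained by parsing the first command (preconfigured through the interface parameters between the driver and the NIC).
It should be understood that, because the purpose of the notification message is to inform the host that the path on which the first node is located is abnormal, all information in the notification message other than the path status information is optional and can be configured as needed. For example, the state of the command indicated in the CQE may be defined by the status field of the command; this application is not limited to the example above.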
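The four dwords listed above can be packed into a 16-byte CQE as in the following sketch. The bit positions are assumptions based on the commonly described NVMe completion entry layout (event type in DW0 bits 2:0, event information in DW0 bits 15:8, SQHD in the low half of DW2, SQ identifier in the high half, command identifier in the low half of DW3, phase bit at DW3 bit 16); the text above only names the fields, and the function name `build_aer_cqe` is hypothetical.

```python
import struct

# Assumed event-type code meaning "error status" in DW0 bits 2:0.
AER_TYPE_ERROR_STATUS = 0x0


def build_aer_cqe(event_info: int, sq_id: int, sq_head: int,
                  command_id: int, phase: int = 1) -> bytes:
    """Pack a 16-byte little-endian CQE for an AER completion (sketch)."""
    dw0 = (AER_TYPE_ERROR_STATUS & 0x7) | ((event_info & 0xFF) << 8)
    dw1 = 0                                                  # reserved
    dw2 = (sq_head & 0xFFFF) | ((sq_id & 0xFFFF) << 16)
    dw3 = (command_id & 0xFFFF) | ((phase & 0x1) << 16)      # status left at 0
    return struct.pack("<4I", dw0, dw1, dw2, dw3)


# Values taken from the preconfigured admin queue information: error status
# 03h, admin submission queue 0, head pointer 0, and the AER command's ID.
cqe = build_aer_cqe(event_info=0x03, sq_id=0, sq_head=0, command_id=0x0011)
dw0, dw1, dw2, dw3 = struct.unpack("<4I", cqe)
```

Because every field comes from preconfigured values, the NIC can build this entry without any help from the failed node.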
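907. The NIC of the storage device sends the notification message to the host, the notification message indicating that the path on which the first node is located is abnormal.
The NIC sends the notification message to the host through the transport layer. For example, the notification message is an RDMA-based communication manager (CM) packet; this application is not limited thereto.
908. The host receives the notification message.
909. Based on the notification message, the host accesses the memory through a path on which a node other than the first node among the multiple nodes is located.
Steps 907 to 909 are the same as steps 803 to 805 in the embodiment shown in FIG. 8 and are not repeated here.
In the above method for processing a node abnormal event, the path status information is configured in the NIC's admin queue information, so that upon detecting an abnormal event related to the first node the NIC can actively send a notification message to the host, allowing the host to switch paths. This effectively reduces the host's path-switching latency and improves the continuity and reliability of the storage service.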
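FIG. 10 shows another method for processing a node abnormal event provided by an embodiment of this application. As shown in FIG. 10, taking the interaction between host 100 and storage device 200 as an example, the method includes the following steps 1001 to 1011.
1001. The host establishes a communication connection, through the NIC of the storage device, with the first node among the multiple nodes of the storage device.
This step is the same as step 801 in the embodiment shown in FIG. 8 and is not repeated here.
1002. The host issues a second command to the first node. The second command indicates that path status information is to be generated based on the operating system type of the host and configured into the I/O queue context of the NIC.
The second command is an admin command, which the host issues to the first node through the admin queue. The I/O queue context is maintained by the NIC and stored on the NIC; it is context information associated with the I/O queue, which holds NVMe I/O commands.
1003. The first node receives the second command issued by the host.
1004. As indicated by the second command, the first node generates path status information based on the operating system type of the host and configures the path status information into the I/O queue context.
The first node parses the received second command and, as the second command indicates, generates path status information based on the operating system type of the host. It then calls a preset interface provided by the NIC to configure the path status information into the I/O queue context, so that the NIC can respond to I/O commands issued by the host based on that context. It should be understood that, because the I/O queue context is context information associated with the I/O queue, its life cycle is the same as that of the I/O queue. In some embodiments, the path status information generated in this step may be called second preset information.
The path status information can be configured according to actual requirements, which this application does not limit. For example, the path status information is 0x360, where 0x3 indicates a path error (for its detailed definition, see Table 2 below; it should be understood that Table 2 is only an illustrative description of path status information, and other similar fields in related protocols that can indicate path status information are equally applicable to this application), and 0x60 indicates that the node detected a path error; this application is not limited thereto. Note that the specific meanings of the path status information given above are examples only and do not limit this application.
Table 2. Status code: status code type values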
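The example value 0x360 can be read as a status code type (0x3) placed above an 8-bit status code (0x60). The sketch below encodes and decodes it under that assumption. The 3-bit width of the type field, the constant names, and the per-operating-system lookup table (which contains a single placeholder entry) are assumptions for illustration; the text above does not list which value each operating system type uses.

```python
from typing import Dict, Tuple

SCT_PATH_RELATED = 0x3              # "0x3 indicates a path error"
SC_NODE_DETECTED_PATH_ERROR = 0x60  # "0x60: the node detected a path error"


def encode_path_status(sct: int, sc: int) -> int:
    # Assumed layout: 3-bit status code type above an 8-bit status code.
    return ((sct & 0x7) << 8) | (sc & 0xFF)


def decode_path_status(value: int) -> Tuple[int, int]:
    return (value >> 8) & 0x7, value & 0xFF


def is_path_error(value: int) -> bool:
    return decode_path_status(value)[0] == SCT_PATH_RELATED


DEFAULT_PATH_STATUS = encode_path_status(SCT_PATH_RELATED,
                                         SC_NODE_DETECTED_PATH_ERROR)

# Hypothetical table: which value to preconfigure for which host OS type.
# The single entry is a placeholder, not a value taken from the text.
PATH_STATUS_BY_OS: Dict[str, int] = {"linux": DEFAULT_PATH_STATUS}


def path_status_for_host(os_type: str) -> int:
    return PATH_STATUS_BY_OS.get(os_type.lower(), DEFAULT_PATH_STATUS)
```

Keeping the per-OS choice in one table mirrors step 1004, where the node decides the value once and the NIC later replays it unchanged.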
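Through steps 1002 to 1004, the host issues the second command to the first node so that the first node configures the path status information into the NIC's I/O queue context. The NIC can then send the corresponding notification message to the host when it has detected an abnormal event related to the first node and receives an I/O command issued by the host.
1005. The NIC of the storage device detects abnormal events related to the first node.
This step is the same as step 802 in the embodiment shown in FIG. 8 and is not repeated here.
1006. The host issues a third command to the NIC. The third command is a read command or a write command and indicates that a notification message is to be sent to the host when an abnormal event related to the first node is detected.
The host issues the third command toward the first node through the I/O queue, and the NIC intercepts it. The third command is thus an I/O command.
1007. The NIC of the storage device receives the third command.
1008. Upon detecting an abnormal event related to the first node, the NIC of the storage device obtains the path status information from the I/O queue context and generates a notification message.
In some embodiments, the notification message also includes the command identifier of the third command; this application is not limited thereto.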
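Taking a notification message that is a CQE as an example, the specific content of the notification message is illustrated below. Illustratively, the notification message includes the following:
1. DW0: command specific: 0 (preconfigured through the interface parameters between the driver and the NIC; see step 1004; per-operating-system preconfiguration is supported);
2. DW1: command specific: 0 (preconfigured through the interface parameters between the driver and the NIC; see step 1004; per-operating-system preconfiguration is supported);
3. DW2: submission queue identifier: parsed by the NIC from the third command (for example, converted from the queue pair number in the base transport header of the third command; the conversion is not limited and can be set as needed); submission queue head pointer (SQHD): generated dynamically, for example 0; this application is not limited thereto;
4. DW3: path status information: 0x360 (preconfigured through the interface parameters between the driver and the NIC; see step 1004; per-operating-system preconfiguration is supported); command identifier: obtained by parsing the third command.
It should be understood that, because the purpose of the notification message is to inform the host that the path on which the first node is located is abnormal, all information in the notification message other than the path status information is optional and can be configured as needed; this application is not limited to the example above.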
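The I/O response described above can be sketched as a 16-byte CQE in which DW3 carries both the command identifier and the path status information. The bit positions are assumptions based on the commonly described NVMe completion entry layout (command identifier in DW3 bits 15:0, phase bit at bit 16, status field in bits 31:17); the function names are hypothetical.

```python
import struct


def build_io_error_cqe(path_status: int, sq_id: int, sq_head: int,
                       command_id: int, phase: int = 1) -> bytes:
    """Pack an I/O completion that carries a path error status (sketch)."""
    dw0 = 0                                              # command specific
    dw1 = 0                                              # command specific
    dw2 = (sq_head & 0xFFFF) | ((sq_id & 0xFFFF) << 16)
    dw3 = ((command_id & 0xFFFF)
           | ((phase & 0x1) << 16)
           | ((path_status & 0x7FFF) << 17))             # assumed status position
    return struct.pack("<4I", dw0, dw1, dw2, dw3)


def parse_status(cqe: bytes) -> int:
    # Recover the status field by dropping the command ID and the phase bit.
    dw3 = struct.unpack("<4I", cqe)[3]
    return dw3 >> 17


# The path status comes from the I/O queue context; the SQ identifier and the
# command identifier are parsed from the intercepted read or write command.
io_cqe = build_io_error_cqe(path_status=0x360, sq_id=1, sq_head=0,
                            command_id=0x002A)
```

A host driver that sees a path-related status on an I/O completion can retry the command on another path instead of failing it back to the application.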
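1009. The NIC of the storage device sends the notification message to the host, the notification message indicating that the path on which the first node is located is abnormal.
The NIC sends the notification message to the host through the application layer. For example, the notification message is an NVMe-encapsulated packet; this application is not limited thereto.
1010. The host receives the notification message.
1011. Based on the notification message, the host accesses the memory through a path on which a node other than the first node among the multiple nodes is located.
Steps 1009 to 1011 are the same as steps 803 to 805 in the embodiment shown in FIG. 8 and are not repeated here.
In the above method for processing a node abnormal event, the path status information is configured in the NIC's I/O queue context, so that when the NIC has detected an abnormal event related to the first node and receives the third command issued by the host, it can send a notification message to the host, allowing the host to switch paths. This effectively reduces the host's path-switching latency and improves the continuity and reliability of the storage service.
The embodiments shown in FIG. 9 and FIG. 10 describe the two ways in which the NIC sends the notification message to the host. It should be understood that these embodiments can be combined: upon detecting an abnormal event related to the first node, the NIC can both actively report an AER completion event to the host and, on receiving a third command, return an I/O response to the host. The detailed process is the same as in the embodiments shown in FIG. 9 and FIG. 10 and is not repeated here.
With reference to FIG. 11 and FIG. 12, the method for processing a node abnormal event provided by this application is illustrated below, taking the combination of the embodiments shown in FIG. 9 and FIG. 10 as an example.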
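FIG. 11 is a schematic diagram of a method for processing a node abnormal event provided by an embodiment of this application. As shown in FIG. 11, the host establishes a communication connection, through the NIC of the storage device, with the first node among the multiple nodes of the storage device, and creates an admin queue and an I/O queue; the communication connection includes an NVMe connection and an RDMA connection. Next, the host issues the first command to the first node through the admin queue. The first node receives the first command, parses it to obtain the first preset information, and calls the preset interface provided by the NIC to configure the first preset information into the admin queue information (for the first preset information, see step 904). In addition, the host issues the second command to the first node through the admin queue. The first node receives the second command and, as the second command indicates, generates the second preset information based on the operating system type of the host, then calls the preset interface provided by the NIC to configure the second preset information into the I/O queue context (for the second preset information, see step 1004). Note that the flow shown in FIG. 11 can be understood as an initialization flow, which enables the NIC of the storage device to notify the host of abnormal events related to the first node. Moreover, this application does not limit the order in which the first command and the second command are sent.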
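FIG. 12 is a schematic diagram of another method for processing a node abnormal event provided by an embodiment of this application. As shown in FIG. 12, a fault reflector is deployed on the NIC, and the fault reflector includes a detection logic unit and an execution logic unit. Illustratively, the detection logic unit detects abnormal events related to the first node and, upon detecting one, notifies the execution logic unit that an abnormal event related to the first node has occurred. In this process, the detection logic unit may also mark the abnormal event related to the first node, so that the same event is not handled repeatedly. Next, following the notification from the detection logic unit, the execution logic unit generates a notification message based on the first preset information and sends it to the host; this notification message is the AER asynchronous event completion message. (Note that in some embodiments the NIC is communicatively connected to multiple hosts; the NIC traverses all current admin queues and sends the notification message to the multiple hosts connected to the NIC, which this application does not limit.) In addition, when the NIC receives a third command (that is, an I/O command), it generates a notification message based on the second preset information and sends it to the host; this notification message is the I/O response message.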
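The combined behavior of the fault reflector can be sketched as follows. The class layout, the string tags `"aer_completion"` and `"io_response"`, and the use of per-host dictionaries to represent the admin queue information and the I/O queue context are all simplifying assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Message = Tuple[str, str, int]   # (host, message kind, path status information)


@dataclass
class FaultReflector:
    """Toy fault reflector deployed on the NIC (hypothetical model)."""
    admin_presets: Dict[str, int]   # host -> first preset information
    io_presets: Dict[str, int]      # host -> second preset information
    node_abnormal: bool = False     # mark that prevents repeated handling
    sent: List[Message] = field(default_factory=list)

    def detect(self, link_ok: bool) -> None:
        # Detection logic unit: act only on the first sighting of the fault.
        if link_ok or self.node_abnormal:
            return
        self.node_abnormal = True
        self._report_to_all_hosts()

    def _report_to_all_hosts(self) -> None:
        # Execution logic unit: walk every admin queue and complete its AER.
        for host, status in self.admin_presets.items():
            self.sent.append((host, "aer_completion", status))

    def on_io_command(self, host: str) -> bool:
        """Return True if the NIC answered itself, False if it forwarded."""
        if not self.node_abnormal:
            return False            # healthy node: pass the command through
        self.sent.append((host, "io_response", self.io_presets[host]))
        return True


reflector = FaultReflector(
    admin_presets={"hostA": 0x03, "hostB": 0x03},
    io_presets={"hostA": 0x360, "hostB": 0x360},
)
forwarded = reflector.on_io_command("hostA")   # node healthy, NIC does not answer
reflector.detect(link_ok=False)                # fault seen, AERs are completed
reflector.detect(link_ok=False)                # same fault again, ignored
answered = reflector.on_io_command("hostA")    # NIC answers with a path error
```

The two message kinds complement each other: the AER completion reaches hosts that are idle, while the I/O response covers commands that are already in flight.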
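In this way, a fault reflector (a logic unit comprising a detection logic unit and an execution logic unit) is deployed on the NIC of the storage device, separately from the nodes. When an abnormal event related to the first node is detected, the fault reflector takes over NVMe commands, quickly returns the preconfigured asynchronous event to the host, and returns a specific error code for newly received NVMe commands. This quickly triggers path switching on the host, which effectively reduces the host's path-switching latency, achieves convergence to a normal path within seconds, and improves the continuity and reliability of the storage service. Even in an extreme scenario, for example where the host has M redundant paths to the memory of the storage device (M being a positive integer) and abnormal events occur on the nodes of M-1 of those paths (for example, an entire power supply plane of the cluster fails), after M-1 rounds of I/O path feedback and active reporting the host can, as long as one node is normal, converge within seconds to a path through a normally working node. This triggers I/O switching within seconds and improves the continuity and reliability of the storage service.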
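FIG. 13 is a schematic structural diagram of an apparatus for processing a node abnormal event provided by an embodiment of this application. The apparatus can implement, through software, hardware, or a combination of both, the functions of the NIC in the storage device described above. As shown in FIG. 13, the apparatus is configured on a NIC of a storage device; the storage device includes the NIC and multiple nodes; the NIC is communicatively connected to a first node among the multiple nodes; and the nodes are used to manage a memory. The apparatus includes:
a sending unit 1301, configured to send a notification message to a host upon detecting an abnormal event related to the first node, the notification message indicating that the path on which the first node is located is abnormal, so that the host accesses the memory through a path on which a node other than the first node among the multiple nodes is located.
In some embodiments, the notification message includes path status information, which indicates that the path on which the first node is located is abnormal.
In some embodiments, the NIC and the first node are communicatively connected through a Peripheral Component Interconnect Express (PCIe) link.
In some embodiments, the apparatus further includes a detection unit, configured to:
perform link anomaly detection on the PCIe link to determine whether an abnormal event related to the first node has occurred.
In some embodiments, the detection unit is configured to:
perform polling-based detection and/or interrupt-based detection on the PCIe link and, when a PCIe link anomaly is detected, determine that an abnormal event related to the first node has occurred.
In some embodiments, the sending unit 1301 is configured to do either of the following:
send the notification message to the host through the application layer;
send the notification message to the host through the transport layer.
In some embodiments, the apparatus further includes an obtaining unit, configured to:
obtain the path status information from the admin queue information of the NIC, where the path status information in the admin queue information is configured by the first node according to a first command issued by the host, the first command carries the path status information, and the first command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
In some embodiments, the obtaining unit is further configured to:
obtain the path status information from the I/O queue context of the NIC, where the path status information in the I/O queue context is configured by the first node according to a second command issued by the host, and the second command indicates that the path status information is to be generated based on the operating system type of the host and configured into the I/O queue context.
In some embodiments, the apparatus further includes a receiving unit, configured to:
receive a third command issued by the host, where the third command is a read command or a write command and indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
Upon detecting an abnormal event related to the first node, the above apparatus can promptly send a notification message telling the host that the path on which the first node is located is abnormal, so that the host can switch paths. This effectively reduces the host's path-switching latency and improves the continuity and reliability of the storage service.
Note that when the apparatus for processing a node abnormal event provided by the above embodiment processes a node abnormal event, the division into the functional modules described above is only an example. In practice, the functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided by the above embodiment and the method embodiments for processing a node abnormal event belong to the same concept; for its specific implementation, see the method embodiments, which are not repeated here.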
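FIG. 14 is a schematic structural diagram of another apparatus for processing a node abnormal event provided by an embodiment of this application. The apparatus can implement, through software, hardware, or a combination of both, the functions of the host described above. As shown in FIG. 14, the apparatus is configured on a host; the host is communicatively connected to a NIC of a storage device; the storage device includes the NIC and multiple nodes; the NIC is communicatively connected to a first node among the multiple nodes; and the nodes are used to manage a memory. The apparatus includes:
a receiving unit 1401, configured to receive a notification message sent by the NIC upon detecting an abnormal event related to the first node, the notification message indicating that the path on which the first node is located is abnormal;
an access unit 1402, configured to access the memory, based on the notification message, through a path on which a node other than the first node among the multiple nodes is located.
In some embodiments, the notification message includes path status information, which indicates that the path on which the first node is located is abnormal.
In some embodiments, the receiving unit 1401 is configured to do either of the following:
receive the notification message through the transport layer;
receive the notification message through the application layer.
In some embodiments, the apparatus further includes a sending unit, configured to:
issue a first command to the first node, so that the first node configures the path status information carried in the first command into the admin queue information, where the first command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
In some embodiments, the sending unit is further configured to:
issue a second command to the first node, so that the first node, as indicated by the second command, generates the path status information based on the operating system type of the host and configures the path status information into the I/O queue context.
In some embodiments, the sending unit is further configured to:
issue a third command to the NIC, where the third command is a read command or a write command and indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
The above apparatus can receive the notification message sent by the NIC of the storage device, promptly learn that the path on which the first node is located is abnormal, and switch paths. This effectively reduces the host's path-switching latency and improves the continuity and reliability of the storage service.
Note that when the apparatus for processing a node abnormal event provided by the above embodiment processes a node abnormal event, the division into the functional modules described above is only an example. In practice, the functions can be assigned to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus provided by the above embodiment and the method embodiments for processing a node abnormal event belong to the same concept; for its specific implementation, see the method embodiments, which are not repeated here.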
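Note that the information (including but not limited to user equipment information and user personal information), data (including but not limited to data used for analysis, stored data, and displayed data), and signals involved in this application are all authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions. For example, the preset information involved in this application is obtained with full authorization.
In this application, terms such as "first" and "second" are used to distinguish between identical or similar items whose roles and functions are basically the same. It should be understood that there is no logical or temporal dependency among "first", "second", and "n-th", and that these terms do not limit quantity or execution order. It should also be understood that although the description uses the terms first, second, and so on to describe various elements, these elements should not be limited by the terms. The terms are only used to distinguish one element from another. For example, without departing from the scope of the various described examples, a first node may be called a second node, and similarly a second node may be called a first node. The first node and the second node may both be nodes and, in some cases, may be separate and different nodes.
In this application, the term "at least one" means one or more, and the term "multiple" means two or more. For example, multiple nodes means two or more nodes.
The above is only a specific implementation of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed in this application, and these modifications or replacements shall all fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
The above embodiments can be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented in whole or in part in the form of program structure information. The program structure information includes one or more program instructions. When the program instructions are loaded and executed on a computing device, the flows or functions according to the embodiments of this application are produced in whole or in part.
All or part of the steps of the above embodiments can be completed by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
In summary, the above embodiments are only intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.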
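Claims (19)
1. A method for processing a node abnormal event, characterized in that the method is applied to a NIC of a storage device, the storage device includes the NIC and multiple nodes, the NIC is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a memory, the method comprising: upon detecting an abnormal event related to the first node, sending, by the NIC, a notification message to a host, the notification message indicating that the path on which the first node is located is abnormal, so that the host accesses the memory through a path on which a node other than the first node among the multiple nodes is located.
2. The method according to claim 1, characterized in that the notification message includes path status information, and the path status information indicates that the path on which the first node is located is abnormal.
3. The method according to claim 1 or 2, characterized in that the NIC and the first node are communicatively connected through a Peripheral Component Interconnect Express (PCIe) link.
4. The method according to claim 3, characterized in that the method further comprises: performing link anomaly detection on the PCIe link to determine whether an abnormal event related to the first node has occurred.
5. The method according to claim 4, characterized in that performing link anomaly detection on the PCIe link comprises: performing polling-based detection and/or interrupt-based detection on the PCIe link and, when a PCIe link anomaly is detected, determining that an abnormal event related to the first node has occurred.
6. The method according to any one of claims 2 to 5, characterized in that sending, by the NIC, a notification message to a host comprises either of the following: sending, by the NIC, the notification message to the host through the transport layer; sending, by the NIC, the notification message to the host through the application layer.
7. The method according to any one of claims 2 to 6, characterized in that the method further comprises: obtaining the path status information from the admin queue information of the NIC, where the path status information in the admin queue information is configured by the first node according to a first command issued by the host, the first command carries the path status information, and the first command indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
8. The method according to any one of claims 2 to 6, characterized in that the method further comprises: obtaining the path status information from the I/O queue context of the NIC, where the path status information in the I/O queue context is configured by the first node according to a second command issued by the host, and the second command indicates that the path status information is to be generated based on the operating system type of the host and configured into the I/O queue context.
9. The method according to any one of claims 2 to 6 and 8, characterized in that the method further comprises: receiving a third command issued by the host, where the third command is a read command or a write command and indicates that the notification message is to be sent to the host when an abnormal event related to the first node is detected.
10. A method for processing a node abnormal event, characterized in that the method is applied to a host, the host is communicatively connected to a NIC of a storage device, the storage device includes the NIC and multiple nodes, the NIC is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a memory, the method comprising: receiving a notification message sent by the NIC upon detecting an abnormal event related to the first node, the notification message indicating that the path on which the first node is located is abnormal; and accessing the memory, based on the notification message, through a path on which a node other than the first node among the multiple nodes is located.
11. An apparatus for processing a node abnormal event, characterized in that the apparatus is configured on a NIC in a storage device, the storage device includes the NIC and multiple nodes, the NIC is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a memory, the apparatus comprising: a sending unit, configured to send a notification message to a host upon detecting an abnormal event related to the first node, the notification message indicating that the path on which the first node is located is abnormal, so that the host accesses the memory through a path on which a node other than the first node among the multiple nodes is located.
12. An apparatus for processing a node abnormal event, characterized in that the apparatus is configured on a host, the host is communicatively connected to a NIC of a storage device, the storage device includes the NIC and multiple nodes, the NIC is communicatively connected to a first node among the multiple nodes, and the nodes are used to manage a memory, the apparatus comprising: a receiving unit, configured to receive a notification message sent by the NIC upon detecting an abnormal event related to the first node, the notification message indicating that the path on which the first node is located is abnormal; and an access unit, configured to access the memory, based on the notification message, through a path on which a node other than the first node among the multiple nodes is located.
13. A NIC, characterized in that the NIC is configured on a storage device and includes a processor, a memory, and an interface, the interface is used for communicative connection with a node in the storage device, the memory is used to store at least one piece of program code, and the at least one piece of program code is loaded by the processor to implement the method for processing a node abnormal event according to any one of claims 1 to 9.
14. A storage cluster, characterized in that the storage cluster includes a NIC, multiple nodes, and a memory, the NIC is communicatively connected to the nodes, the nodes are used to manage the memory, and the NIC is used to perform the method for processing a node abnormal event according to any one of claims 1 to 9.
15. The storage cluster according to claim 14, characterized in that the storage cluster is a centralized storage device, the nodes are storage controllers, the NIC is connected to the nodes through a system bus, and the memory is connected to the nodes through a system bus.
16. The storage cluster according to claim 14, characterized in that the storage cluster is a distributed storage system, the distributed storage system includes multiple independent storage devices, and the storage devices are connected to one another through a wired network or a wireless network to form a storage network; wherein each storage device includes the NIC, the node, and the memory, the NIC is connected to the node through a system bus, and the memory is connected to the node through a system bus; or each storage device includes the NIC and the node, and the memory is communicatively connected to the node in each storage device.
17. A host, characterized in that the host includes a processor and a memory, and the processor is used to execute instructions stored in the memory, so that the host performs the method for processing a node abnormal event according to claim 10.
18. A computer-readable storage medium, characterized in that the computer-readable storage medium is used to store at least one piece of program code, and the at least one piece of program code is used to perform the method for processing a node abnormal event according to any one of claims 1 to 9, or to perform the method for processing a node abnormal event according to claim 10.
19. A computer program product, characterized in that when the computer program product runs on a storage device, the storage device is caused to perform the method for processing a node abnormal event according to any one of claims 1 to 9, or, when the computer program product runs on a host, the host is caused to perform the method for processing a node abnormal event according to claim 10.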
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211509370.9 | 2022-11-29 | ||
CN202211509370 | 2022-11-29 | ||
CN202310144857.XA CN118118321A (zh) | 2022-11-29 | 2023-01-29 | Method for processing node abnormal event, network interface card, and storage cluster |
CN202310144857.X | 2023-01-29 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024113832A1 (zh) | 2024-06-06 |
Family
ID=91218650
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2023/103864 WO2024113832A1 (zh) | 2022-11-29 | 2023-06-29 | Method for processing node abnormal event, network interface card, and storage cluster |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN118118321A (zh) |
WO (1) | WO2024113832A1 (zh) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160380804A1 (en) * | 2015-06-24 | 2016-12-29 | Fujitsu Limited | Storage control apparatus and storage control method |
CN107547252A (zh) * | 2017-06-29 | 2018-01-05 | 新华三技术有限公司 | 一种网络故障处理方法和装置 |
CN110740072A (zh) * | 2018-07-20 | 2020-01-31 | 华为技术有限公司 | 一种故障检测方法、装置和相关设备 |
CN113805788A (zh) * | 2020-06-12 | 2021-12-17 | 华为技术有限公司 | 一种分布式存储系统及其异常处理方法和相关装置 |
CN114257541A (zh) * | 2020-09-10 | 2022-03-29 | 华为技术有限公司 | 一种故障链路的切换方法、系统及相关设备 |
-
2023
- 2023-01-29 CN CN202310144857.XA patent/CN118118321A/zh active Pending
- 2023-06-29 WO PCT/CN2023/103864 patent/WO2024113832A1/zh unknown
Also Published As
Publication number | Publication date |
---|---|
CN118118321A (zh) | 2024-05-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 23895996 Country of ref document: EP Kind code of ref document: A1 |