CN110612510A - Input/output (I/O) isolation without a dedicated arbiter - Google Patents


Info

Publication number
CN110612510A
CN110612510A (application CN201880028528.6A)
Authority
CN
China
Prior art keywords
data
node
coordinator
isolation
disks
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201880028528.6A
Other languages
Chinese (zh)
Other versions
CN110612510B (en)
Inventor
V·戈埃尔
J·加赫洛特
S·马拉施
A·托利
N·S·梅赫拉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaruitai Technology Co Ltd
Original Assignee
Huaruitai Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaruitai Technology Co Ltd filed Critical Huaruitai Technology Co Ltd
Publication of CN110612510A
Application granted
Publication of CN110612510B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • G06F3/0665Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14Handling requests for interconnection or transfer
    • G06F13/16Handling requests for interconnection or transfer for access to memory bus
    • G06F13/1605Handling requests for interconnection or transfer for access to memory bus based on arbitration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0617Improving the reliability of storage systems in relation to availability
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614Improving the reliability of storage systems
    • G06F3/0619Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659Command handling arrangements, e.g. command buffers, queues, command scheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0689Disk arrays, e.g. RAID, JBOD
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1415Saving, restoring, recovering or retrying at system level
    • G06F11/142Reconfiguring to eliminate the error

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Hardware Redundancy (AREA)
  • Quality & Reliability (AREA)

Abstract

Methods, systems, and processes for performing input/output (I/O) isolation without requiring a dedicated arbiter are disclosed. A coordinating storage identifier is stored as metadata in a storage device. The coordinating storage identifier is associated with a node of a cluster coupled to the storage device, and an I/O isolation operation is performed using the coordinating storage identifier.

Description

Input/output (I/O) isolation without a dedicated arbiter
Technical Field
The present disclosure relates to distributed storage in a cluster environment. In particular, the present disclosure relates to performing input/output (I/O) isolation operations without a dedicated arbiter in such a clustered environment.
Description of the Related Art
A cluster is a distributed computing system having multiple nodes that work together to provide processing power and shared storage resources by distributing the processing load over more than one node and eliminating, or at least minimizing, single points of failure. Thus, applications running on the various nodes continue to function despite a problem with one node (or computing device) in the cluster. Clusters may be implemented in computing devices referred to as appliances.
An appliance is a hardware device with integrated software (e.g., firmware) designed to provide one or more business services. The appliance may also be configured with hardware and/or software that enables it to act as a client and/or a server. The end users of these clients and/or servers need not be aware of the technical details of the underlying operating system running on the appliance, because the hardware and/or software is pre-configured (e.g., by the manufacturer). In this way, the appliance is designed as a secure black box for the end user (e.g., a customer). Multiple independent nodes (e.g., compute nodes) may be configured to execute within a given appliance to provide high data availability.
Input/output (I/O) isolation (or simply isolation) refers to a method of isolating nodes of a cluster and/or protecting shared resources of the cluster when a node fails (or is suspected of failing). Since a cluster implemented in an apparatus may have multiple compute nodes (or simply nodes), there is a possibility that one of those nodes may fail at any given time. The failed node may control (or at least access) shared resources, such as shared storage used and required by other nodes in the cluster. The cluster must be able to take corrective action when a node fails, because data corruption can occur if two nodes located in different sub-clusters or network partitions attempt to control the shared storage in an uncoordinated manner. Thus, an isolation operation results in isolating (or terminating) one or more nodes in the cluster (e.g., to prevent uncoordinated access to the shared storage).
Typically, an external computing device implementing what is referred to as a coordination point may assist in isolation operations. A coordination point is a computing device (e.g., a server or storage device) that provides a locking mechanism to determine which node (or nodes) is allowed to isolate the shared storage (e.g., data drives) from other nodes in the cluster. For example, a node must eject (or remove) a peer's registration key from the coordination point before the node is allowed to isolate the peer from the shared storage. However, for clusters implemented in a device, it is impractical to perform isolation operations using a coordination point because, as described above, the device is designed as a secure, non-modifiable black box.
As previously mentioned, the apparatus is designed to be usable only with the device's own internal software and hardware components (including storage shelves). Typically, in existing device deployments, storage shelves (e.g., shared storage devices) are configured as backup target space, and thus the high availability solutions provided in such computing environments cannot themselves utilize a dedicated arbiter (e.g., a coordination point) for isolation operations. In addition, the internal services provided by the apparatus are typically not exposed to (and cannot be accessed by) the user or other computing devices (e.g., an external client). Thus, using a dedicated arbiter to perform isolation operations in a high availability cluster implemented within such a device is not feasible, as devices deployed in high availability computing environments do not have dedicated arbiters and cannot be upgraded to add one.
Disclosure of Invention
Methods, systems, and processes for performing input/output (I/O) isolation without requiring a dedicated arbiter are disclosed. One such method involves storing the coordinating storage identifier as metadata in a storage device. In this example, the coordinating storage identifier is associated with a node of a cluster coupled to the storage device, and the I/O isolation operation is performed using the coordinating storage identifier.
In one embodiment, the method accesses a configuration file that includes isolation mode metadata identifying the coordinating storage identifier. In this example, the coordinating storage identifier identifies a coordinator and data disk, and the coordinating storage identifier is a data key generated by a volume manager executed by the node.
In some embodiments, the method identifies one or more storage identifiers other than the coordinating storage identifier to identify one or more data disks, determines that another node other than the node has lost an isolation race, and determines whether the isolation engine has ejected the data key from the coordinator and data disk. In this example, the method ejects the data key from the coordinator and data disk if the isolation engine has not already done so.
In other embodiments, the method detects that the cluster is divided into two or more network partitions and ejects the data key from the coordinator and the data disk. In this example, upon ejection of the data key from the coordinator and data disk, the method determines the result of an isolation race performed as part of the I/O isolation operation.
In some embodiments, the method determines that another node is a failed node, sends a notification to a volume manager to eject the data key of the failed node from one or more data disks other than the coordinator and data disk, and receives confirmation from the volume manager that the data key has been ejected from the one or more data disks. In this example, the coordinator and data disk and the one or more data disks are shared by the node and the other node.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, features, and advantages of the present disclosure, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.
Drawings
The present disclosure may be better understood, and its numerous objects, features, and advantages made apparent to those skilled in the art by referencing the accompanying drawings.
Fig. 1A is a block diagram 100A of a computing system configured to perform input/output (I/O) isolation without a dedicated arbiter, according to one embodiment of the present disclosure.
Fig. 1B is a block diagram 100B of a compute cluster configured to perform input/output (I/O) isolation without a dedicated arbiter, according to one embodiment of the present disclosure.
Fig. 2A is a block diagram 200A of a computing cluster configured for startup and shutdown of nodes according to one embodiment of the disclosure.
Fig. 2B is a block diagram 200B of a compute cluster configured for network partitioning according to one embodiment of the present disclosure.
Fig. 3 is a table 300A showing an isolation operation table according to one embodiment of the present disclosure.
Fig. 4 is a flow diagram 400 illustrating a process for registering data keys on a designated coordinator and data disk according to one embodiment of the present disclosure.
Fig. 5 is a flow diagram 500 illustrating a process for removing one or more data keys from one or more data disks at cluster shutdown, according to one embodiment of the present disclosure.
Fig. 6 is a flow diagram 600 illustrating a process for declaring an isolation contention result when partitioning a network according to one embodiment of the present disclosure.
FIG. 7 is a flow diagram 700 illustrating a process for restarting data operations after performing I/O isolation operations according to one embodiment of the present disclosure.
FIG. 8 is a block diagram 800 of a computing system showing how a volume manager and isolation engine are implemented in software according to one embodiment of the present disclosure.
Fig. 9 is a block diagram 900 of a networking system showing how various devices communicate via a network, according to one embodiment of the present disclosure.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and the detailed description. It should be understood that the drawings and detailed description thereto are not intended to limit the disclosure to the particular form disclosed. On the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure as defined by the appended claims.
Detailed Description
Introduction
The device may be configured to execute critical applications (e.g., to protect vital customer data). Applications that are executed within such devices are expected to be highly available. For example, a device having a single compute node (e.g., a node having dedicated memory, processing power, etc.) may be upgraded by adding one or more additional compute nodes to provide high availability of data. Thus, a device may be configured to implement, for example, a cluster of nodes.
"split brain" refers to a situation (or scenario) where the risk of data availability (e.g., from shared storage) becomes inconsistent due to the maintenance of separate data sets whose extents overlap. For example, such overlap may potentially occur due to network partitions in which the sub-clusters cannot communicate with each other to synchronize their respective data sets. The data sets of each sub-cluster (or network partition) may be randomly served to clients by their own special data set updates without the need to coordinate with other data sets from other sub-clusters. Thus, when a split brain condition occurs in a cluster, the decision to decide which sub-cluster should continue to run is referred to as a partition arbitration process, or simply arbitration. Typically, arbitration in such a cluster environment is performed by performing I/O isolation operations using a coordination point, as previously described.
High availability data solutions such as those described above require that the computing setup be protected from data corruption by performing I/O isolation operations in the cluster. As described above, conventional I/O isolation operations require a dedicated arbiter (e.g., a coordination disk and/or a coordination point server) to reliably keep only one sub-cluster running in such situations (e.g., during a system hang). Device-based high availability solutions cannot use a dedicated arbiter (such as a coordination disk or coordination point server) for I/O isolation operations because the devices are designed to operate in a standalone fashion (e.g., using only internal software and hardware components, including storage shelves) and the storage shelves (e.g., shared storage devices) are configured as backup target space.
It should therefore be appreciated that it would be desirable to have an I/O isolation solution that can reliably handle actual split brain and system hang conditions and provide storage access to only one sub-cluster, preventing data corruption in a cluster with such a shared storage configuration. Methods, systems, and processes for performing I/O isolation operations without a dedicated arbiter are described herein.
Example System for input/output (I/O) isolation without a dedicated arbiter
FIG. 1A is a block diagram 100A of a computing system configured to perform input/output (I/O) isolation without a dedicated arbiter, according to one embodiment. As shown in fig. 1A, the apparatus 105 includes nodes 110(1) - (N), and is communicatively coupled to a shared storage 125 (e.g., directly, or via a network or some other type of interconnection). Each of nodes 110(1) - (N) is a separate and independent computing node because each of nodes 110(1) - (N) includes its own memory and processor. In this manner, each of nodes 110(1) - (N) functions as a separate computing device.
Node 110(1) is representative of the other nodes in device 105 (e.g., nodes 110(2)-(N)) and includes volume manager 115(1) and isolation engine 120(1). Volume manager 115(1) (e.g., a Cluster Volume Manager (CVM)) manages one or more data disks that are part of shared storage 125. Isolation engine 120(1) (e.g., VxFEN) performs I/O isolation operations. Each node (e.g., nodes 110(1)-(N)) in appliance 105 includes at least a volume manager and an isolation engine.
The apparatus 105 is communicatively coupled to shared storage 125. Shared storage 125 is shared by nodes 110(1)-(N) and may include a plurality of different storage devices (e.g., a coordinator and data disk and one or more data disks). As shown in FIG. 1A, the coordinator and data disk is identified by coordinating storage identifier 130, and the data disks are identified by storage identifiers 135(1)-(N). In certain embodiments, coordinating storage identifier 130 and storage identifiers 135(1)-(N) are data keys.
FIG. 1B is a block diagram 100B of a compute cluster configured to perform input/output (I/O) isolation without a dedicated arbiter, according to one embodiment. As shown in fig. 1B, cluster 140 includes at least nodes 110(1) and 110(2) communicatively coupled to shared storage 125 via network 160. Volume manager 115(1) of node 110(1) includes a data key generator 145(1). Data key generator 145(1) generates coordinating storage identifier 130 and/or storage identifier 135(1). Coordinating storage identifier 130 and storage identifier 135(1) are data keys that may be registered on Logical Unit Numbers (LUNs) (e.g., LUNs that identify the coordinator and data disk and a data disk, respectively).
Node 110(1) also includes configuration file 150(1), which includes isolation mode metadata 155(1). In one embodiment, isolation mode metadata 155(1) includes information indicating that I/O isolation operations may be performed by isolation engine 120(1) without using a dedicated arbiter. Similarly, node 110(2) includes volume manager 115(2) with data key generator 145(2), configuration file 150(2) with isolation mode metadata 155(2), and isolation engine 120(2). In another embodiment, isolation mode metadata 155(2) includes information indicating that I/O isolation operations may be performed by isolation engine 120(2) without using a dedicated arbiter.
Shared storage 125 stores, as metadata 165, coordinating storage identifier 130 and storage identifiers 135(1)-(N) generated by data key generators 145(1) and/or 145(2). In certain embodiments, coordinating storage identifier 130 is a data key registered on a LUN designated by device 105 as the coordinator and data disk, and storage identifier 135(1) is the same data key registered on a LUN that identifies a data disk (other than the coordinator and data disk). Thus, a data key is the mechanism that allows nodes executing in cluster 140 to access one or more disks in shared storage 125. For example, a data key generated by data key generator 145(1) and stored on a disk in shared storage 125 allows node 110(1) to access data on that disk. Similarly, another data key generated by data key generator 145(2) and stored on a disk in shared storage 125 allows node 110(2) to access data on that disk.
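The disclosure does not prescribe a key format, but the APGR001/BPGR001 examples used throughout suggest short, node-specific keys that fit the 8-byte field carried by a SCSI-3 persistent reservation. The sketch below illustrates one assumed encoding for such keys (first character distinguishes the node, the remainder is a cluster-wide suffix); the convention and helper names are hypothetical and shown for illustration only.

# Minimal sketch (not from the patent): deriving per-node data keys in the
# spirit of the APGR001/BPGR001 examples. The encoding convention and helper
# names are assumptions for illustration only.

def make_data_key(node_index: int, cluster_suffix: str = "PGR001") -> str:
    """Build an 8-character data key; the first character identifies the node."""
    if not 0 <= node_index < 26:
        raise ValueError("node_index must fit a single letter A-Z")
    key = chr(ord("A") + node_index) + cluster_suffix
    return key[:8].ljust(8)          # SCSI-3 PR keys are 8 bytes wide

def key_as_pr_value(key: str) -> int:
    """Convert the ASCII key to the 64-bit integer a PR REGISTER would carry."""
    return int.from_bytes(key.encode("ascii"), byteorder="big")

if __name__ == "__main__":
    for idx, node in enumerate(["node 110(1)", "node 110(2)"]):
        k = make_data_key(idx)
        print(f"{node}: key={k!r} PR value=0x{key_as_pr_value(k):016x}")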
In certain embodiments, volume managers 115(1) and 115(2) employ Small Computer System Interface 3 (SCSI-3) key-based reservations (referred to as Persistent Reservations (PR)) to provide access to shared storage 125 only to connected nodes (e.g., nodes 110(1) and 110(2)). Such persistent reservation operations may include at least a registration operation, a deregistration operation, a reservation operation, or a preemption operation, and may involve nodes 110(1) and 110(2) accessing LUNs on SCSI-3 disks that are part of shared storage 125 when performing the persistent reservation operations.
For example, SCSI-3 PR allows multiple nodes to access a shared storage device while preventing other nodes from accessing it. SCSI-3 PR reservations persist across SCSI bus resets (e.g., resets of Host Bus Adapters (HBAs)) and support multiple paths from a host to the disks (e.g., SCSI-3 compliant disks). As described above, SCSI-3 PR uses registrations and reservations to perform I/O isolation operations. Each node registers its own "key" (e.g., data key) with a SCSI-3 device (e.g., a shared storage device). Nodes that register keys form a membership and establish a reservation, which is typically set to "Write Exclusive Registrants Only" (WERO). The WERO setting allows only registered nodes to perform write operations. For a given disk, there can be many registrations but only one reservation.
In SCSI-3 PR based isolation, write access may be blocked by removing (or ejecting) a registration from the shared storage device. Only a registered node can "eject" or remove the registration of another node. A node that wishes to eject another node may issue a "preempt and abort" command. Ejection is final and atomic; an ejected node cannot eject another node. In a cluster environment, a node registers the same data key on all paths to a shared storage device. Thus, a single preempt-and-abort command may be used to eject a node from all paths to the shared storage device.
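For concreteness, the registration, WERO reservation, and preempt-and-abort steps described above can be driven from user space. The sketch below assumes the sg_persist utility from the Linux sg3_utils package and its standard persistent-reservation options; the patent itself does not name a tool, and the device path and key values are placeholders (keys are given in hexadecimal, as sg_persist expects).

# Hedged sketch: issuing SCSI-3 Persistent Reservation commands by shelling out
# to sg_persist (sg3_utils). Illustrative only; assumes sg_persist is installed.
import subprocess

def _sg_persist(*args: str) -> str:
    """Run sg_persist with the given arguments and return its stdout."""
    out = subprocess.run(["sg_persist", *args],
                         check=True, capture_output=True, text=True)
    return out.stdout

def register_key(device: str, key: str) -> None:
    # REGISTER AND IGNORE EXISTING KEY: idempotent, matching the
    # "ignore registration" behavior described later in this disclosure.
    _sg_persist("--out", "--register-ignore", f"--param-sark={key}", device)

def reserve_wero(device: str, key: str) -> None:
    # Reservation type 5 = Write Exclusive - Registrants Only (WERO).
    _sg_persist("--out", "--reserve", f"--param-rk={key}",
                "--prout-type=5", device)

def preempt_and_abort(device: str, own_key: str, victim_key: str) -> None:
    # Eject another node's registration (and abort its outstanding I/O).
    _sg_persist("--out", "--preempt-abort", f"--param-rk={own_key}",
                f"--param-sark={victim_key}", "--prout-type=5", device)

def read_keys(device: str) -> str:
    return _sg_persist("--in", "--read-keys", device)

if __name__ == "__main__":
    dev = "/dev/sdX"                  # placeholder coordinator and data disk
    key = "4150475230303120"          # hex for ASCII "APGR001 " (8 bytes)
    register_key(dev, key)
    reserve_wero(dev, key)
    print(read_keys(dev))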
The same SCSI-3 data disks controlled by volume managers 115(1) and 115(2) may be used for storage as well as for I/O isolation purposes. Since the volume manager is enabled for device storage (e.g., shared storage 125), isolation engines 120(1) and 120(2) may coordinate I/O isolation operations with volume managers 115(1) and 115(2) and perform the I/O isolation operations using the data key PR mechanism described above.
Example of Cluster startup and shutdown
FIG. 2A is a block diagram 200A of a computing cluster configured for startup and shutdown of nodes, and also illustrating data disk sharing, according to one embodiment. As shown in fig. 2A, nodes 110(1) and 110(2) share a single coordinator and data disk (e.g., coordinator and data disk 205) and data disks (e.g., data disks 210(1)-(N)) that are part of shared storage 125. In one embodiment, coordinator and data disk 205 is identified by coordinating storage identifier 130, and data disks 210(1)-(N) are identified by storage identifiers 135(1)-(N), respectively. When device 105 is configured in high availability mode, device 105 designates one data LUN managed by the volume manager as the isolation coordination point (e.g., coordinator and data disk 205).
In one embodiment, volume manager 115(1) is a client of isolation engine 120(1) and sits higher in the stack. Thus, during startup, when node 110(1) joins cluster 140, isolation engine 120(1) becomes operational before volume manager 115(1). Isolation engine 120(1) comes up first and places, registers, and/or stores a volume manager compatible key (e.g., a data key such as APGR001, generated by data key generator 145(1) of volume manager 115(1)) on the disk selected by appliance 105 to be shared by isolation engine 120(1) and volume manager 115(1) (e.g., coordinator and data disk 205). This exemplary process is represented in fig. 2A by the left dashed line (1).
Next, volume manager 115(1) comes up (or begins operation), and the same data key (e.g., APGR001) is placed, registered, and/or stored on all disks controlled by volume manager 115(1), including coordinator and data disk 205 (e.g., coordinator and data disk 205 and data disks 210(1)-(N)). It should be noted, however, that registering the same data key (e.g., APGR001, the data key previously registered on coordinator and data disk 205 by isolation engine 120(1)) on coordinator and data disk 205 by volume manager 115(1) is an idempotent operation (e.g., using the ignore registration option). In this example, APGR001 serves as coordinating storage identifier 130 and storage identifiers 135(1)-(N), and coordinator and data disk 205 is the designated LUN. This exemplary process is represented in fig. 2A by the left dashed line (2).
Similarly, in another embodiment, volume manager 115(2) is a client of isolation engine 120(2) and sits higher in the stack. Thus, during startup, when node 110(2) joins cluster 140, isolation engine 120(2) becomes operational before volume manager 115(2). Isolation engine 120(2) comes up first and places, registers, and/or stores a volume manager compatible key (e.g., a data key such as BPGR001, generated by data key generator 145(2) of volume manager 115(2)) on the disk selected by device 105 to be shared by isolation engine 120(2) and volume manager 115(2) (e.g., coordinator and data disk 205). This exemplary process is represented in fig. 2A by the right dashed line (1).
Next, volume manager 115(2) comes up (or begins operation), and the same data key (e.g., BPGR001) is placed, registered, and/or stored on all disks controlled by volume manager 115(2), including coordinator and data disk 205 (e.g., coordinator and data disk 205 and data disks 210(1)-(N)). Registering the same data key (e.g., BPGR001, the data key previously registered on coordinator and data disk 205 by isolation engine 120(2)) on coordinator and data disk 205 by volume manager 115(2) is an idempotent operation (e.g., using the ignore registration option). In this example, BPGR001 is another coordinating storage identifier and another set of storage identifiers, distinct from coordinating storage identifier 130 and storage identifiers 135(1)-(N), respectively, and coordinator and data disk 205 is the designated LUN. This exemplary process is represented in fig. 2A by the right dashed line (2).
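The startup ordering just described (the isolation engine registers the node's data key on the coordinator and data disk first, then the volume manager registers the same key on every disk it manages, with an idempotent ignore-registration on the coordinator and data disk) can be modeled in a few lines. The following is an in-memory sketch for illustration only; the class and function names are assumptions, not the appliance's actual interfaces.

# Minimal in-memory model (illustrative only) of the startup sequence:
# the isolation engine registers the node's data key on the coordinator and
# data disk (CDD) first, then the volume manager registers the same key on
# every disk it manages, including the CDD (idempotent register-and-ignore).

class Disk:
    def __init__(self, name: str):
        self.name = name
        self.keys: set[str] = set()      # registered SCSI-3 PR data keys

    def register_ignore(self, key: str) -> None:
        self.keys.add(key)               # re-registering the same key is a no-op

def node_startup(key: str, cdd: Disk, data_disks: list[Disk]) -> None:
    # Step 1 (dashed line (1)): isolation engine registers the key on the CDD.
    cdd.register_ignore(key)
    # Step 2 (dashed line (2)): volume manager registers the same key on all
    # disks it manages, CDD included; the CDD registration is idempotent.
    for disk in [cdd, *data_disks]:
        disk.register_ignore(key)

if __name__ == "__main__":
    cdd = Disk("coordinator and data disk 205")
    data = [Disk(f"data disk 210({i})") for i in (1, 2)]
    node_startup("APGR001", cdd, data)   # node 110(1)
    node_startup("BPGR001", cdd, data)   # node 110(2)
    print({d.name: sorted(d.keys) for d in [cdd, *data]})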
During shutdown, volume manager 115(1) comes down first and removes the data key (e.g., APGR001) from all data disks (e.g., data disks 210(1)-(N)), including coordinator and data disk 205. This exemplary process is represented in fig. 2A by the left dashed line (A). Isolation engine 120(1) comes down next and attempts to remove the data key (e.g., APGR001) from coordinator and data disk 205. If volume manager 115(1) has already removed the data key from coordinator and data disk 205, isolation engine 120(1) determines that the data key has been removed. If for some reason the data key is still present on coordinator and data disk 205, isolation engine 120(1) removes or ejects the data key (e.g., APGR001, coordinating storage identifier 130) from coordinator and data disk 205. This exemplary process is represented in fig. 2A by the left dashed line (B).
Similarly, during shutdown, volume manager 115(2) comes down first (e.g., because it is higher in the stack) and removes the data key (e.g., BPGR001) from all data disks (e.g., data disks 210(1)-(N)), including coordinator and data disk 205. This exemplary process is represented in fig. 2A by the right dashed line (A). Isolation engine 120(2) comes down next and attempts to remove the data key (e.g., BPGR001) from coordinator and data disk 205. If volume manager 115(2) has already removed the data key from coordinator and data disk 205, isolation engine 120(2) determines that the data key has been removed. If for some reason the data key is still present on coordinator and data disk 205, isolation engine 120(2) removes or ejects the data key (e.g., BPGR001, a coordinating storage identifier associated with node 110(2), distinct from coordinating storage identifier 130 associated with node 110(1)) from coordinator and data disk 205. This exemplary process is represented in fig. 2A by the right dashed line (B).
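The shutdown ordering is the mirror image: the volume manager removes the node's key from every disk it manages first, and the isolation engine then clears the key from the coordinator and data disk only if it is still present. Continuing the same illustrative in-memory style (again, all names are assumptions, not the appliance's code):

# Minimal in-memory model (illustrative only) of the shutdown sequence:
# the volume manager comes down first and removes the node's key from all
# disks; the isolation engine comes down last and clears the key from the
# coordinator and data disk (CDD) only if it is still present.

class Disk:
    def __init__(self, name: str):
        self.name = name
        self.keys: set[str] = set()

def volume_manager_shutdown(key: str, disks: list[Disk]) -> None:
    for disk in disks:                   # includes the CDD
        disk.keys.discard(key)

def isolation_engine_shutdown(key: str, cdd: Disk) -> None:
    if key in cdd.keys:                  # leftover key (e.g., VM exited early)
        cdd.keys.discard(key)

if __name__ == "__main__":
    cdd = Disk("coordinator and data disk 205")
    data = [Disk(f"data disk 210({i})") for i in (1, 2)]
    for d in [cdd, *data]:
        d.keys.update({"APGR001", "BPGR001"})
    volume_manager_shutdown("APGR001", [cdd, *data])   # node 110(1) comes down
    isolation_engine_shutdown("APGR001", cdd)          # nothing left to clear
    print({d.name: sorted(d.keys) for d in [cdd, *data]})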
Examples of network partitioning
In the event of a failure of communication between nodes, such as the failure of a portion of the network during a network partition, each of two or more sub-clusters of nodes may determine that the other sub-cluster of nodes has failed (or may have failed). For example, a race (also referred to as an "isolation race") may occur between two (or more) sub-clusters of nodes, where the control module of each sub-cluster determines that the other sub-cluster has failed.
Fig. 2B is a block diagram 200B of a compute cluster configured for network partitioning, according to one embodiment. As shown in fig. 2B, if there is a network partition (e.g., nodes 110(1) and 110(2) are now parts of different sub-clusters), the isolation engine on the winning node (e.g., isolation engine 120(1) on node 110(1)) ejects or removes the data key of the failed node (e.g., another coordinating storage identifier, such as data key BPGR001 of node 110(2)) from coordinator and data disk 205, declares the result of the isolation race, and notifies volume manager 115(1) of that result. This exemplary process is represented by dashed line (1) in fig. 2B.
Next, based on the isolation race result, volume manager 115(1) removes the failed node's data key (e.g., another storage identifier, such as data key BPGR001 of node 110(2)) from all data disks (e.g., data disks 210(1)-(N)), including coordinator and data disk 205 (from which, in this example, the key has already been removed by isolation engine 120(1) as part of the example process shown by dashed line (1)). This second, subsequent example process is represented by dashed line (2) in fig. 2B. Thus, it should be appreciated that because isolation engine 120(1) performs I/O isolation operations together with volume manager 115(1), node 110(2) is blocked and denied access to all data disks (e.g., data disks 210(1)-(N)) without using a dedicated arbiter. This exemplary process is represented by dashed line (3) in fig. 2B.
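Because there is a single coordinator and data disk, the isolation race reduces to an atomic preempt on that one disk: whichever node's isolation engine ejects its peer's key from it first wins, and the winner's volume manager then ejects the loser's key from the remaining data disks. The sketch below models that flow with the same illustrative in-memory conventions used above; in a real deployment the SCSI-3 preempt-and-abort on the disk itself decides the tie atomically, and all names here are assumptions.

# Minimal in-memory model (illustrative only) of the isolation race on a
# network partition: the winner's isolation engine ejects the loser's key from
# the coordinator and data disk (CDD), then the winner's volume manager ejects
# the loser's key from the remaining data disks, fencing the loser off.

class Disk:
    def __init__(self, name: str):
        self.name = name
        self.keys: set[str] = set()

def fencing_race(own_key: str, peer_key: str, cdd: Disk) -> bool:
    """Return True if this node wins (its own key is still registered)."""
    if own_key not in cdd.keys:
        return False                     # peer already ejected us; we lost
    cdd.keys.discard(peer_key)           # preempt-and-abort the peer's key
    return True

def clear_loser(peer_key: str, data_disks: list[Disk]) -> None:
    # Volume manager action after the race result is announced (dashed line (2)).
    for disk in data_disks:
        disk.keys.discard(peer_key)

if __name__ == "__main__":
    cdd = Disk("coordinator and data disk 205")
    data = [Disk(f"data disk 210({i})") for i in (1, 2)]
    for d in [cdd, *data]:
        d.keys.update({"APGR001", "BPGR001"})
    # Node 110(1) reaches the CDD first and wins the race.
    if fencing_race("APGR001", "BPGR001", cdd):
        clear_loser("BPGR001", data)
    # Node 110(2)'s attempt now fails: its key is gone from the CDD.
    assert fencing_race("BPGR001", "APGR001", cdd) is False
    print({d.name: sorted(d.keys) for d in [cdd, *data]})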
Examples of input/output (I/O) isolation without a dedicated arbiter
In one embodiment, node 110(1) stores the coordinating storage identifier (e.g., coordinating storage identifier 130) as metadata (e.g., metadata 165) in a storage device (e.g., coordinator and data disk 205, which is part of shared storage 125). In this example, coordinating storage identifier 130 is associated with node 110(1) of cluster 140, which is coupled to shared storage 125. Isolation engine 120(1), in conjunction with volume manager 115(1), performs I/O isolation operations using the coordinating storage identifier (e.g., as described with respect to FIGS. 2A and 2B).
In another embodiment, node 110(1) accesses configuration file 150(1), which includes isolation mode metadata 155(1). Isolation mode metadata 155(1) identifies the coordinating storage identifier (e.g., one of the LUNs managed by the CVM, designated by device 105 as the VxFEN coordination point). The coordinating storage identifier identifies coordinator and data disk 205 and is a data key (e.g., as shown in fig. 1B) generated (e.g., using data key generator 145(1)) by volume manager 115(1), which is executed by and implemented on node 110(1).
In some embodiments, isolation engine 120(1) identifies one or more storage identifiers other than the coordinating storage identifier (which may be the same data key, e.g., APGR001) that identify one or more data disks (e.g., data disks 210(1)-(N)), determines that a node other than node 110(1) (e.g., node 110(2)) has lost the isolation race, and determines whether volume manager 115(1) has removed the data key from coordinator and data disk 205. If the data key has not been removed from coordinator and data disk 205, isolation engine 120(1) removes the data key from coordinator and data disk 205.
In other embodiments, isolation engine 120(1) detects that the cluster is divided into two or more network partitions (e.g., as shown in fig. 2B) and ejects the data key from coordinator and data disk 205. Upon ejection of the data key from coordinator and data disk 205, isolation engine 120(1) determines the result of the isolation race performed as part of the I/O isolation operation. In certain embodiments, isolation engine 120(1) determines that node 110(2) is a failed node, sends a notification to volume manager 115(1) to eject the data key from one or more data disks other than the coordinator and data disk (e.g., data disks 210(1)-(N)), and receives a confirmation from volume manager 115(1) that the data key has been ejected from the one or more data disks. If node 110(2) is the winning node and node 110(1) is the failed node, isolation engine 120(2) may perform this process with volume manager 115(2) (e.g., using the other coordinating storage identifier/storage identifier, such as data key BPGR001, as shown in figs. 2A and 2B).
FIG. 3 is a table 300A illustrating an isolation operation table, according to one embodiment. Isolation operation table 305 includes at least a volume manager action field 310 and an isolation engine action field 315. Although the process shown in isolation operation table 305 is for node 110(1) and data key APGR001, the same process applies to node 110(2) and data key BPGR001. First, isolation engine 120(1) registers the volume manager compliant data key (e.g., APGR001) on the designated coordination LUN. Volume manager 115(1) then registers the same data key (e.g., APGR001) on the LUNs (e.g., data disks 210(1)-(N)) managed by volume manager 115(1), including the coordination LUN (e.g., coordinator and data disk 205).
As the stack comes down, volume manager 115(1) removes the data key from the LUNs it manages, and isolation engine 120(1) removes the data key (e.g., APGR001) from the coordination LUN if volume manager 115(1) has not already removed it. Next, upon a network partition or node hang, isolation engine 120(1) ejects data keys from the coordination LUN and declares the isolation race result. Finally, upon receiving notification of the race result from the winning node (e.g., node 110(1)), volume manager 115(1) ejects the data keys of the losing node from the remaining LUNs (e.g., data disks 210(1)-(N)) and then resumes (data) operations.
Similarly, isolation engine 120(2) can register, place, and/or store a volume manager compliant data key (e.g., BPGR001) on the designated coordination LUN. Volume manager 115(2) may then register the same data key (e.g., BPGR001) on the LUNs (e.g., data disks 210(1)-(N)) managed by volume manager 115(2), including the coordination LUN (e.g., coordinator and data disk 205). As the stack comes down, volume manager 115(2) may remove the data key from the LUNs it manages, and isolation engine 120(2) removes the data key (e.g., BPGR001) from the coordination LUN if volume manager 115(2) has not already removed it. Next, upon a network partition or node hang, isolation engine 120(2) may eject data keys from the coordination LUN and declare the isolation race result. Finally, upon receiving notification of the race result from the winning node (e.g., node 110(2)), volume manager 115(2) may eject the data keys of the losing node from the remaining LUNs (e.g., data disks 210(1)-(N)) and then resume (data) operations.
Example procedures for input/output (I/O) isolation without a dedicated arbiter
FIG. 4 is a flow diagram 400 illustrating a process for registering data keys on a designated coordinator and data disk, according to one embodiment. The process starts at 405 by detecting a cluster start. At 410, the process registers a data key (e.g., a coordinating storage identifier such as coordinating storage identifier 130) for a given node (a CVM compliant key) on the (designated) Coordinator and Data Disk (CDD). At 415, the process determines whether a cluster shutdown has been initiated (e.g., due to a network partition, node hang, etc.). If no cluster shutdown has been initiated, the process loops back to 415. However, if a cluster shutdown has been initiated, the process determines at 420 whether the data key is still present on the CDD.
If the data key is not on the (designated) CDD, the process ends. However, if the data key is present on the CDD (e.g., on coordinator and data disk 205, which device 105 originally designated as the coordination LUN during high availability mode configuration), then the process ends at 425 by removing the data key (e.g., coordinating storage identifier 130) from the CDD.
FIG. 5 is a flow diagram 500 illustrating a process for removing one or more data keys from one or more data disks at cluster shutdown, according to one embodiment. The process starts at 505 by detecting a cluster start. At 510, the process registers, places, and/or stores the same data key (e.g., the data key used as the coordinating storage identifier, here serving as storage identifiers 135(1)-(N)) on the managed data disks (e.g., LUNs managed by a volume manager, such as LUNs that identify data disks 210(1)-(N)). At 515, the process determines whether a cluster shutdown has been initiated (e.g., as a result of a network partition, node hang, etc.). If no cluster shutdown has been initiated, the process loops back to 515. However, if a cluster shutdown has been initiated, the process ends at 520 by removing the data key (e.g., storage identifier) from the managed data disks.
Fig. 6 is a flow diagram 600 illustrating a process for declaring an isolation contention result when partitioning a network, according to one embodiment. The process begins at 605 by determining whether a network partition exists (e.g., as shown in fig. 2B). If there is no network partition, the process loops back to 605. However, if there is a network partition, the process initiates an I/O isolation operation at 610 and attempts to eject the data key from the designated coordinated storage unit (e.g., from the coordinator and data disk 205) at 615. The process ends at 620 by declaring an isolation contention result.
FIG. 7 is a flow diagram 700 illustrating a process for restarting data operations after performing I/O isolation operations, according to one embodiment. The process begins at 705 by receiving a notification of the isolation race result from the winning node. At 710, the process ejects one or more data keys from one or more storage units (e.g., data disks 210(1)-(N)) associated with the managed storage identifiers (e.g., storage identifiers 135(1)-(N)). The process ends at 715 by restarting data operations.
Exemplary computing Environment
FIG. 8 is a block diagram 800 of a computing system illustrating how a volume manager and isolation engine may be implemented in software according to one embodiment. Computing system 800 may include nodes 110(1) - (N) and broadly represent any single or multi-processor computing device or system capable of executing computer-readable instructions. Examples of computing system 800 include, but are not limited to, any one or more of the following various devices: workstations, personal computers, laptops, client-side terminals, servers, distributed computing systems, handheld devices (e.g., personal digital assistants and mobile phones), network devices, storage controllers (e.g., array controllers, tape drive controllers, or hard drive controllers), and so forth. In its most basic configuration, computing system 800 may include at least one processor 855 and memory 860. By executing software to execute the volume manager 115 and/or the isolation engine 120, the computing system 800 becomes a special purpose computing device configured to perform I/O isolation without requiring a special purpose arbiter.
Processor 855 generally represents any type or form of processing unit capable of processing data or interpreting and executing instructions. In certain embodiments, processor 855 may receive instructions from a software application or module. These instructions may cause processor 855 to perform the functions of one or more of the embodiments described and/or illustrated herein. For example, processor 855 may perform and/or be a means for performing all or some of the operations described herein. The processor 855 may also perform and/or be a means for performing any other operations, methods, or processes described and/or illustrated herein. Memory 860 generally represents any type or form of volatile or non-volatile storage device or medium capable of storing data and/or other computer-readable instructions. Examples include, but are not limited to, Random Access Memory (RAM), Read Only Memory (ROM), flash memory, or any other suitable memory device. Although not required, in some embodiments, computing system 800 may include both volatile and nonvolatile memory units. In one example, program instructions implementing volume manager 115 and/or isolation engine 120 may be loaded into memory 860.
In certain embodiments, computing system 800 may include one or more components or elements in addition to processor 855 and/or memory 860. For example, as shown in fig. 8, computing system 800 may include a memory controller 820, an input/output (I/O) controller 835, and a communications interface 845, each of which may be interconnected via a communications infrastructure 805. Communication infrastructure 805 generally represents any type or form of infrastructure capable of facilitating communication between one or more components of a computing device. Examples of communication infrastructure 805 include, but are not limited to, a communication bus (such as an Industry Standard Architecture (ISA), Peripheral Component Interconnect (PCI), PCI Express (PCIe), or similar bus) and a network.
Memory controller 820 generally represents any type/form of device capable of processing memory or data or controlling communication between one or more components of computing system 800. In certain embodiments, memory controller 820 may control communication between processor 855, memory 860, and I/O controller 835 via communication infrastructure 805. In certain embodiments, memory controller 820 may perform one or more of the operations or features described and/or illustrated herein, alone or in combination with other elements, and/or may be a means for performing one or more of the operations or features described and/or illustrated herein, alone or in combination with other elements.
I/O controller 835 generally represents any type or form of module capable of coordinating and/or controlling the input and output functions of an appliance and/or computing device. For example, in certain embodiments, I/O controller 835 may control or facilitate data transfer between one or more elements of computing system 800, such as processor 855, memory 860, communications interface 845, display adapter 815, input interface 825, and storage interface 840.
Communications interface 845 represents, in a broad sense, any type or form of communications device or adapter capable of facilitating communications between computing system 800 and one or more other devices. Communication interface 845 may facilitate communication between computing system 800 and a private or public network including additional computing systems. Examples of communications interface 845 include, but are not limited to, a wired network interface (such as a network interface card), a wireless network interface (such as a wireless network interface card), a modem, and any other suitable interface. Communication interface 845 can provide a direct connection to a remote server via a direct link to a network, such as the internet, and can also provide such a connection indirectly through, for example, a local area network such as an ethernet network, a personal area network, a telephone or cable network, a cellular telephone connection, a satellite data connection, or any other suitable connection.
Communications interface 845 may also represent a host adapter configured to facilitate communications between computing system 800 and one or more additional network or storage devices via an external bus or communications channel. Examples of host adapters include: Small Computer System Interface (SCSI) host adapters, Universal Serial Bus (USB) host adapters, Institute of Electrical and Electronics Engineers (IEEE) 1394 host adapters, Serial Advanced Technology Attachment (SATA), Serial Attached SCSI (SAS), and external SATA (eSATA) host adapters, Advanced Technology Attachment (ATA) and Parallel ATA (PATA) host adapters, Fibre Channel interface adapters, Ethernet adapters, and the like. Communication interface 845 may also allow computing system 800 to perform distributed or remote computing (e.g., by receiving/transmitting instructions from/to a remote device for execution).
As shown in fig. 8, computing system 800 may also include at least one display device 810 coupled to communication infrastructure 805 via a display adapter 815. Display device 810 generally represents any type or form of device capable of visually displaying information forwarded by display adapter 815. Similarly, display adapter 815 generally represents any type or form of device configured to forward graphics, text, and other data from communication infrastructure 805 (or from a frame buffer, as known in the art) for display on display device 810. Computing system 800 may also include at least one input device 830 coupled to communication infrastructure 805 via an input interface 825. Input device 830 generally represents any type or form of input device capable of providing input, generated by a computer or human, to computing system 800. Examples of input device 830 include a keyboard, a pointing device, a voice recognition device, or any other input device.
Computing system 800 can also include storage 850 coupled to communication infrastructure 805 via storage interface 840. Storage device 850 generally represents any type or form of storage device or medium capable of storing data and/or other computer-readable instructions. For example, storage device 850 may include a magnetic disk drive (e.g., a so-called hard drive), a floppy disk drive, a magnetic tape drive, an optical disk drive, a flash memory drive, and so forth. Storage interface 840 generally represents any type or form of interface or device for transferring and/or transmitting data between storage device 850 and other components of computing system 800. Storage 850 may be configured to read from and/or write to removable storage units configured to store computer software, data, or other computer-readable information. Examples of suitable removable storage units include floppy disks, magnetic tape, optical disks, flash memory devices, etc. Storage device 850 may also include other similar structures or devices for allowing computer software, data, or other computer-readable instructions to be loaded into computing system 800. For example, storage device 850 may be configured to read and write software, data, or other computer-readable information. Storage 850 may also be part of computing system 800 or may be a separate device accessed through other interface systems.
Many other devices or subsystems may be connected to computing system 800. Conversely, the components and devices illustrated in fig. 8 need not all be present to practice the embodiments described and/or illustrated herein. The devices and subsystems referenced above may also be interconnected in different ways from that shown in fig. 8. Computing system 800 may also employ any number of software configurations, firmware configurations, and/or hardware configurations. For example, one or more of the embodiments disclosed herein may be encoded as a computer program (also referred to as computer software, software applications, computer-readable instructions, or computer control logic) on a computer-readable storage medium. Examples of computer-readable storage media include magnetic storage media (e.g., hard disk drives and floppy disks), optical storage media (e.g., CD-ROMs or DVD-ROMs), electronic storage media (e.g., solid state drives and flash memory media), and so forth. Such computer programs may also be transferred to computing system 800 for storage in memory or on a carrier medium via a network such as the internet.
A computer readable medium embodying a computer program may be loaded into computing system 800. All or a portion of the computer program stored on the computer-readable medium may then be stored in memory 860 and/or in various portions of storage device 850, coordinator and data disk 205, and/or data disks 210(1)-(N). When executed by processor 855, a computer program loaded into computing system 800 may cause processor 855 to perform the functions of one or more of the embodiments described and/or illustrated herein and/or may cause processor 855 to be a means for performing the functions of one or more of the embodiments described and/or illustrated herein. Additionally or alternatively, one or more of the embodiments described and/or illustrated herein may be implemented in firmware and/or hardware. For example, computing system 800 may be configured as an Application Specific Integrated Circuit (ASIC) adapted to implement one or more of the embodiments disclosed herein.
Exemplary networked Environment
Fig. 9 is a block diagram of a networked system showing how various computing devices communicate via a network, according to one embodiment. In certain embodiments, a Network Attached Storage (NAS) device may be configured to communicate with nodes 110(1) - (N), the coordinator and data disks 205, the data disks 210(1) - (N), and/or the independent I/O quarantine system 905 using various protocols such as Network File System (NFS), Server Message Block (SMB), or Common Internet File System (CIFS). Network 160 generally represents any type or form of computer network or architecture capable of facilitating communication between device 105, the coordinator and data disks 205, data disks 210(1) - (N), and/or independent I/O isolation systems 905.
In certain embodiments, a communication interface, such as communication interface 845 in FIG. 8, may be used to provide connectivity between device 105, coordinator and data disk 205, data disks 210(1)-(N), and/or independent I/O isolation system 905 and network 160. In at least certain embodiments, data disks 210(1)-(N) are referred to as such herein because each such data disk is dedicated to storing data. Similarly, coordinator and data disk 205 is referred to as such herein because coordinator and data disk 205 both stores data and is used to perform I/O isolation operations without a dedicated arbiter. The embodiments described and/or illustrated herein are not limited to the internet or any particular network-based environment.
In some embodiments, network 160 may be a Storage Area Network (SAN). In other embodiments, the independent I/O isolation system 905 may be part of node 110(1) - (N), or may be separate. If separate, the independent I/O isolation system 905 and the nodes 110(1) - (N) and the device 105 may be communicatively coupled via the network 160. In one embodiment, all or a portion of one or more of the disclosed embodiments may be encoded as a computer program and loaded onto and executed by node 110(1) - (N) and/or independent I/O isolation system 905 or any combination thereof. All or a portion of one or more embodiments disclosed herein may also be encoded as a computer program, stored on the node 110(1) - (N), the independent I/O isolation system 905, the data disks 210(1) - (N), and/or the coordinator and data disks 205, and distributed via the network 160.
In some examples, all or a portion of nodes 110(1) - (N), independent I/O isolation system 905, data disks 210(1) - (N), and/or coordinator and data disks 205 may represent portions of a cloud computing or network-based environment. Cloud computing environments may provide various services and applications via the internet. These cloud-based services (e.g., software as a service, platform as a service, infrastructure as a service, etc.) may be accessible through a web browser or other remote interface.
The various functions described herein may be provided through a remote desktop environment or any other cloud-based computing environment. Further, one or more of the components described herein may transform data, physical devices, and/or representations of physical devices from one form to another. For example, volume manager 115 and/or isolation engine 120 may transform the behavior of nodes 110(1)-(N) such that the nodes perform I/O isolation without a dedicated arbiter. In particular, multiple SCSI-3 compliant engines and/or modules (such as volume manager 115 and isolation engine 120) may work together on the same set of SCSI-3 disks.
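The sketch below uses simple in-memory sets to stand in for SCSI-3 key registrations and illustrates how two such cooperating modules might carry out an isolation race on the shared disk set; the function names and key values are hypothetical and do not represent the actual interfaces of volume manager 115 or isolation engine 120.

def isolation_engine_eject(loser_key, coordinator_keys):
    # Isolation engine: eject the losing node's data key from the coordinator
    # and data disk (if it has not been ejected already) and report the result.
    coordinator_keys.discard(loser_key)
    return loser_key not in coordinator_keys

def volume_manager_eject(loser_key, data_disk_keys):
    # Volume manager: on notification from the isolation engine, eject the
    # losing node's key from the remaining data disks and confirm completion.
    for keys in data_disk_keys:
        keys.discard(loser_key)
    return all(loser_key not in keys for keys in data_disk_keys)

# Registered data keys, one set per shared disk (two nodes, three data disks).
coordinator_keys = {"key-node-1", "key-node-2"}                      # coordinator and data disk
data_disk_keys = [{"key-node-1", "key-node-2"} for _ in range(3)]    # data disks

# After a network partition, the winner of the isolation race fences node 2.
if isolation_engine_eject("key-node-2", coordinator_keys):
    confirmed = volume_manager_eject("key-node-2", data_disk_keys)
    assert confirmed  # node 2's key is gone from every shared disk; its I/O is fenced off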
Although the present disclosure has been described in connection with several embodiments, it is not intended to be limited to the specific form set forth herein. On the contrary, the present disclosure is intended to cover such alternatives, modifications, and equivalents as may be reasonably included within the scope of the present disclosure as defined by the appended claims.

Claims (20)

1. A computer-implemented method, comprising:
storing a coordinator storage identifier as metadata in a storage device, wherein
the coordinator storage identifier is associated with a node of a cluster coupled to the storage device; and
performing an input/output (I/O) isolation operation using the coordinator storage identifier.
2. The computer-implemented method of claim 1, further comprising:
accessing a configuration file, wherein
the configuration file includes isolation mode metadata,
the isolation mode metadata identifies the coordinator storage identifier, and
the coordinator storage identifier identifies a coordinator and data disk.
3. The computer-implemented method of claim 2, wherein
the coordinator storage identifier is a data key,
the data key is generated by a volume manager, and
the volume manager is executed by the node.
4. The computer-implemented method of claim 3, wherein
one or more storage identifiers other than the coordinator storage identifier identify one or more data disks.
5. The computer-implemented method of claim 4, further comprising:
determining that another node other than the node has lost an isolation race;
determining whether an isolation engine has ejected the data key from the coordinator and data disk; and
ejecting the data key from the coordinator and data disk if the isolation engine has not ejected the data key from the coordinator and data disk.
6. The computer-implemented method of claim 5, further comprising:
detecting that the cluster is divided into a plurality of network partitions; and
ejecting the data key from the coordinator and data disk.
7. The computer-implemented method of claim 6, further comprising:
determining a result of the isolation race upon ejection of the data key from the coordinator and data disk, wherein
the isolation race is performed as part of the I/O isolation operation;
determining that the other node is a failed node;
sending a notification to the volume manager of the node to eject the data key of the failed node from the one or more data disks other than the coordinator and data disk; and
receiving confirmation from the volume manager that the data key has been ejected from the one or more data disks.
8. The computer-implemented method of claim 7, wherein
the coordinator and data disk and the one or more data disks are shared by the node and the another node,
the I/O isolation operation is performed without one or more dedicated arbiters, and
the I/O isolation operation is performed by the isolation engine and the volume manager on the same set of Small Computer System Interface (SCSI)-3 compliant disks.
9. A non-transitory computer readable storage medium comprising program instructions executable to:
store a coordinator storage identifier as metadata in a storage device, wherein
the coordinator storage identifier is associated with a node of a cluster coupled to the storage device; and
perform an input/output (I/O) isolation operation using the coordinator storage identifier.
10. The non-transitory computer-readable storage medium of claim 9, wherein the program instructions are further executable to:
access a configuration file, wherein
the configuration file includes isolation mode metadata,
the isolation mode metadata identifies the coordinator storage identifier, and
the coordinator storage identifier identifies a coordinator and data disk.
11. The non-transitory computer-readable storage medium of claim 10, wherein
the coordinator storage identifier is a data key,
the data key is generated by a volume manager, and
the volume manager is executed by the node.
12. The non-transitory computer readable storage medium of claim 11, wherein
one or more storage identifiers other than the coordinator storage identifier identify one or more data disks.
13. The non-transitory computer-readable storage medium of claim 12, wherein the program instructions are further executable to:
determine that another node other than the node has lost an isolation race;
determine whether an isolation engine has ejected the data key from the coordinator and data disk;
eject the data key from the coordinator and data disk if the isolation engine has not ejected the data key from the coordinator and data disk;
detect that the cluster is divided into a plurality of network partitions;
eject the data key from the coordinator and data disk;
determine a result of the isolation race upon ejection of the data key from the coordinator and data disk, wherein
the isolation race is performed as part of the I/O isolation operation;
determine that the other node is a failed node;
send a notification to the volume manager of the node to eject the data key of the failed node from the one or more data disks other than the coordinator and data disk; and
receive confirmation from the volume manager that the data key has been ejected from the one or more data disks.
14. The non-transitory computer readable storage medium of claim 13, wherein
the coordinator and data disk and the one or more data disks are shared by the node and the another node,
the I/O isolation operation is performed without one or more dedicated arbiters, and
the I/O isolation operation is performed by the isolation engine and the volume manager on the same set of Small Computer System Interface (SCSI)-3 compliant disks.
15. A system, comprising:
one or more processors; and
a memory coupled to the one or more processors, wherein the memory stores program instructions executable by the one or more processors to:
store a coordinator storage identifier as metadata in a storage device, wherein
the coordinator storage identifier is associated with a node of a cluster coupled to the storage device; and
perform an input/output (I/O) isolation operation using the coordinator storage identifier.
16. The system of claim 15, wherein the program instructions are further executable by the one or more processors to:
access a configuration file, wherein
the configuration file includes isolation mode metadata,
the isolation mode metadata identifies the coordinator storage identifier, and
the coordinator storage identifier identifies a coordinator and data disk.
17. The system of claim 16, wherein
the coordinator storage identifier is a data key,
the data key is generated by a volume manager, and
the volume manager is executed by the node.
18. The system of claim 17, wherein
one or more storage identifiers other than the coordinator storage identifier identify one or more data disks.
19. The system of claim 18, wherein the program instructions are further executable by the one or more processors to:
determine that another node other than the node has lost an isolation race;
determine whether an isolation engine has ejected the data key from the coordinator and data disk;
eject the data key from the coordinator and data disk if the isolation engine has not ejected the data key from the coordinator and data disk;
detect that the cluster is divided into a plurality of network partitions;
eject the data key from the coordinator and data disk;
determine a result of the isolation race upon ejection of the data key from the coordinator and data disk, wherein
the isolation race is performed as part of the I/O isolation operation;
determine that the other node is a failed node;
send a notification to the volume manager of the failed node to eject the data key from the one or more data disks other than the coordinator and data disk; and
receive confirmation from the volume manager that the data key has been ejected from the one or more data disks.
20. The system of claim 19, wherein
the coordinator and data disk and the one or more data disks are shared by the node and the another node,
the I/O isolation operation is performed without one or more dedicated arbiters, and
the I/O isolation operation is performed by the isolation engine and the volume manager on the same set of Small Computer System Interface (SCSI)-3 compliant disks.
CN201880028528.6A 2017-03-31 2018-03-29 Input/output (I/O) isolation without a dedicated arbiter Active CN110612510B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/476415 2017-03-31
US15/476,415 US11079971B2 (en) 2017-03-31 2017-03-31 Input/output (i/o) fencing without dedicated arbitrators
PCT/US2018/025232 WO2018183733A1 (en) 2017-03-31 2018-03-29 Input/output(i/o) fencing without dedicated arbitrators

Publications (2)

Publication Number Publication Date
CN110612510A true CN110612510A (en) 2019-12-24
CN110612510B CN110612510B (en) 2023-04-28

Family

ID=62104367

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880028528.6A Active CN110612510B (en) 2017-03-31 2018-03-29 Input/output (I/O) isolation without a dedicated arbiter

Country Status (4)

Country Link
US (1) US11079971B2 (en)
EP (1) EP3602268B1 (en)
CN (1) CN110612510B (en)
WO (1) WO2018183733A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463023A (en) * 2020-10-18 2021-03-09 苏州浪潮智能科技有限公司 Data processing method, device and equipment for read-write disk and readable medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11079971B2 (en) * 2017-03-31 2021-08-03 Veritas Technologies Llc Input/output (i/o) fencing without dedicated arbitrators
US11025422B2 (en) * 2019-07-23 2021-06-01 Nasuni Corporation Cloud-native global file system with constant-time rekeying

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7168088B1 (en) * 1995-11-02 2007-01-23 Sun Microsystems, Inc. Method and apparatus for reliable disk fencing in a multicomputer system
US7631066B1 (en) * 2002-03-25 2009-12-08 Symantec Operating Corporation System and method for preventing data corruption in computer system clusters
US20090327798A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Cluster Shared Volumes
US20110179231A1 (en) * 2010-01-21 2011-07-21 Sun Microsystems, Inc. System and method for controlling access to shared storage device
US8281071B1 (en) * 2010-02-26 2012-10-02 Symantec Corporation Systems and methods for managing cluster node connectivity information
US20130054888A1 (en) * 2011-08-26 2013-02-28 Vmware, Inc. Configuring object storage system for input/output operations
CN103577247A (en) * 2013-11-13 2014-02-12 南京斯坦德通信股份有限公司 Virtual machine calculation and storage cluster based on Rocks cluster technology and building method thereof
US8707082B1 (en) * 2009-10-29 2014-04-22 Symantec Corporation Method and system for enhanced granularity in fencing operations
CN106155745A (en) * 2016-07-08 2016-11-23 北京百度网讯科技有限公司 The upgrade method of basic input output system, device and system
US20170093746A1 (en) * 2015-09-30 2017-03-30 Symantec Corporation Input/output fencing optimization

Family Cites Families (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6021508A (en) * 1997-07-11 2000-02-01 International Business Machines Corporation Parallel file system and method for independent metadata logging
US6708269B1 (en) * 1999-12-30 2004-03-16 Intel Corporation Method and apparatus for multi-mode fencing in a microprocessor system
US6715050B2 (en) * 2001-05-31 2004-03-30 Oracle International Corporation Storage access keys
US7711820B2 (en) * 2004-11-08 2010-05-04 Cisco Technology, Inc. High availability for intelligent applications in storage networks
US7478221B1 (en) * 2005-05-03 2009-01-13 Symantec Operating Corporation System and method for using consistent virtual addresses to communicate in cooperative multi-layer virtualization environments
US7778157B1 (en) * 2007-03-30 2010-08-17 Symantec Operating Corporation Port identifier management for path failover in cluster environments
US8145938B2 (en) * 2009-06-01 2012-03-27 Novell, Inc. Fencing management in clusters
US8484510B2 (en) * 2009-12-15 2013-07-09 Symantec Corporation Enhanced cluster failover management
US8060773B1 (en) * 2009-12-16 2011-11-15 Symantec Corporation Systems and methods for managing sub-clusters within a multi-cluster computing system subsequent to a network-partition event
US8527672B2 (en) * 2010-11-05 2013-09-03 International Business Machines Corporation Fencing direct memory access data transfers in a parallel active messaging interface of a parallel computer
US8560628B2 (en) * 2011-01-11 2013-10-15 International Business Machines Corporation Supporting autonomous live partition mobility during a cluster split-brained condition
WO2012167268A1 (en) * 2011-06-03 2012-12-06 Oracle International Corporation System and method for authenticating components in a network
US9268590B2 (en) * 2012-02-29 2016-02-23 Vmware, Inc. Provisioning a cluster of distributed computing platform based on placement strategy
US9081507B2 (en) * 2012-03-14 2015-07-14 Symantec Corporation Shared storage access management systems and methods
US9026860B2 (en) * 2012-07-31 2015-05-05 International Business Machines Corporation Securing crash dump files
US9753954B2 (en) * 2012-09-14 2017-09-05 Cloudera, Inc. Data node fencing in a distributed file system
US9146790B1 (en) * 2012-11-02 2015-09-29 Symantec Corporation Performing fencing operations in multi-node distributed storage systems
US9124534B1 (en) * 2013-02-27 2015-09-01 Symantec Corporation Systems and methods for managing sub-clusters within dependent clustered computing systems subsequent to partition events
US9450852B1 (en) * 2014-01-03 2016-09-20 Juniper Networks, Inc. Systems and methods for preventing split-brain scenarios in high-availability clusters
US9852034B2 (en) * 2014-03-24 2017-12-26 International Business Machines Corporation Efficient high availability for a SCSI target over a fibre channel
US9836366B2 (en) * 2015-10-27 2017-12-05 Netapp, Inc. Third vote consensus in a cluster using shared storage devices
US10205782B2 (en) * 2016-04-29 2019-02-12 Netapp, Inc. Location-based resource availability management in a partitioned distributed storage environment
US10148745B2 (en) * 2016-06-30 2018-12-04 Veritas Technologies Llc Application aware input/output fencing
US10412066B1 (en) * 2017-01-31 2019-09-10 Veritas Technologies Llc Hierarchical input/output fencing in clustered environments
US11079971B2 (en) * 2017-03-31 2021-08-03 Veritas Technologies Llc Input/output (i/o) fencing without dedicated arbitrators

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7168088B1 (en) * 1995-11-02 2007-01-23 Sun Microsystems, Inc. Method and apparatus for reliable disk fencing in a multicomputer system
US7631066B1 (en) * 2002-03-25 2009-12-08 Symantec Operating Corporation System and method for preventing data corruption in computer system clusters
US20090327798A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Cluster Shared Volumes
US8707082B1 (en) * 2009-10-29 2014-04-22 Symantec Corporation Method and system for enhanced granularity in fencing operations
US20110179231A1 (en) * 2010-01-21 2011-07-21 Sun Microsystems, Inc. System and method for controlling access to shared storage device
US8281071B1 (en) * 2010-02-26 2012-10-02 Symantec Corporation Systems and methods for managing cluster node connectivity information
US20130054888A1 (en) * 2011-08-26 2013-02-28 Vmware, Inc. Configuring object storage system for input/output operations
CN103577247A (en) * 2013-11-13 2014-02-12 南京斯坦德通信股份有限公司 Virtual machine calculation and storage cluster based on Rocks cluster technology and building method thereof
US20170093746A1 (en) * 2015-09-30 2017-03-30 Symantec Corporation Input/output fencing optimization
CN106155745A (en) * 2016-07-08 2016-11-23 北京百度网讯科技有限公司 The upgrade method of basic input output system, device and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463023A (en) * 2020-10-18 2021-03-09 苏州浪潮智能科技有限公司 Data processing method, device and equipment for read-write disk and readable medium
CN112463023B (en) * 2020-10-18 2022-08-19 苏州浪潮智能科技有限公司 Data processing method, device and equipment for read-write disk and readable medium

Also Published As

Publication number Publication date
EP3602268B1 (en) 2023-11-29
CN110612510B (en) 2023-04-28
US11079971B2 (en) 2021-08-03
EP3602268A1 (en) 2020-02-05
WO2018183733A1 (en) 2018-10-04
US20180285221A1 (en) 2018-10-04

Similar Documents

Publication Publication Date Title
US10412066B1 (en) Hierarchical input/output fencing in clustered environments
US10990462B2 (en) Application aware input/output fencing
US10320702B2 (en) Input/output fencing optimization
US9052935B1 (en) Systems and methods for managing affinity rules in virtual-machine environments
US10127124B1 (en) Performing fencing operations in multi-node distributed storage systems
US10284564B1 (en) Systems and methods for dynamically validating remote requests within enterprise networks
CN110612510B (en) Input/output (I/O) isolation without a dedicated arbiter
WO2019168957A1 (en) Systems and methods for running applications on a multi-tenant container platform
US8060773B1 (en) Systems and methods for managing sub-clusters within a multi-cluster computing system subsequent to a network-partition event
US9098392B1 (en) Systems and methods for changing fencing modes in clusters
US9749278B1 (en) Persistent connections for email web applications
US8452931B1 (en) Systems and methods for simultaneously providing multiple data protection functions
US9124534B1 (en) Systems and methods for managing sub-clusters within dependent clustered computing systems subsequent to partition events
JP2023517531A (en) System and method for protecting folders from unauthorized file modification
US9230121B1 (en) Techniques for persistently toggling a FIPS-140 cryptographic mode of a clustered storage system
US10325096B1 (en) Modifying a portion of a read-only file system
US10212602B2 (en) Systems and methods for determining security reputations of wireless network access points
US8225009B1 (en) Systems and methods for selectively discovering storage devices connected to host computing devices
US9806958B1 (en) Systems and methods for enabling multiple-perspective administration of computing systems
US10616214B1 (en) Systems and methods for preventing loss of possession factors
US9773108B1 (en) Systems and methods for performing operations on restricted mobile computing platforms

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant