WO2007013961A2 - Architecture and method for configuring a simplified cluster over a network with fencing and quorum - Google Patents

Architecture and method for configuring a simplified cluster over a network with fencing and quorum

Info

Publication number
WO2007013961A2
Authority
WO
WIPO (PCT)
Prior art keywords
cluster
quorum
storage system
storage
cluster member
Prior art date
Application number
PCT/US2006/028148
Other languages
French (fr)
Other versions
WO2007013961A3 (en)
Inventor
Pranoop Erasani
Original Assignee
Network Appliance, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Network Appliance, Inc. filed Critical Network Appliance, Inc.
Priority to EP06800150A priority Critical patent/EP1907932A2/en
Publication of WO2007013961A2 publication Critical patent/WO2007013961A2/en
Publication of WO2007013961A3 publication Critical patent/WO2007013961A3/en


Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 11/00 Error detection; Error correction; Monitoring
                    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
                        • G06F 11/14 Error detection or correction of the data by redundancy in operation
                            • G06F 11/1479 Generic software techniques for error detection or fault masking
                                • G06F 11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
                            • G06F 11/1402 Saving, restoring, recovering or retrying
                                • G06F 11/1415 Saving, restoring, recovering or retrying at system level
                                    • G06F 11/142 Reconfiguring to eliminate the error
                                        • G06F 11/1425 Reconfiguring to eliminate the error by reconfiguration of node membership
                        • G06F 11/16 Error detection or correction of the data by redundancy in hardware
                            • G06F 11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
                                • G06F 11/202 Error detection or correction of the data by redundancy in hardware using active fault-masking, where processing functionality is redundant
                                    • G06F 11/2023 Failover techniques
                                        • G06F 11/2033 Failover techniques switching over of hardware resources
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
                • H04L 41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
                    • H04L 41/08 Configuration management of networks or network elements
                        • H04L 41/0893 Assignment of logical groups to network elements
                        • H04L 41/0894 Policy-based network configuration management

Definitions

  • This invention relates to data storage systems and more particularly to providing failure fencing of network files and quorum capability in a simplified networked data storage system.
  • a storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks.
  • the storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment.
  • the storage system may be embodied as a storage system including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g. the disks.
  • Each "on-disk" file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file.
  • a directory may be implemented as a specially formatted file in which information about other files and directories are stored.
  • the client may comprise an application executing on a computer that "connects" to a storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet.
  • NAS systems generally utilize file-based access protocols; therefore, each client may request the services of the storage system by issuing file system protocol messages (in the form of packets) to the file system over the network.
  • file system protocols such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the storage system may be enhanced for networking clients.
  • a SAN is a high-speed network that enables establishment of direct connections between a storage system and its storage devices.
  • the SAN may thus be viewed as an extension to a storage bus and, as such, an operating system of the storage system (a storage operating system, as hereinafter defined) enables access to stored information using block-based access protocols over the "extended bus.”
  • the extended bus is typically embodied as Fiber Channel (FC) or Ethernet media (i.e., network) adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or TCP/IP/Ethernet.
  • a SAN arrangement or deployment allows decoupling of storage from the storage system, such as an application server, and placing of that storage on a network.
  • the SAN storage system typically manages specifically assigned storage resources.
  • While storage can be grouped (or pooled) into zones (e.g., through conventional logical unit number or "lun" zoning, masking and management techniques), the storage devices are still pre-assigned by a user that has administrative privileges (e.g., a storage system administrator, as defined hereinafter) to the storage system.
  • the storage system may operate in any type of configuration including a NAS arrangement, a SAN arrangement, or a hybrid storage system that incorporates both NAS and SAN aspects of storage.
  • Access to disks by the storage system is governed by an associated "storage operating system,” which generally refers to the computer-executable code operable on a storage system that manages data access, and may implement file system semantics.
  • the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., of Sunnyvale, California that implements the Write Anywhere File Layout (WAFL™) file system is an example of such a storage operating system implemented as a microkernel.
  • the storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
  • clients requesting services from applications whose data is stored on a storage system are typically served by coupled server nodes that are clustered into one or more groups.
  • Examples of such node groups are Unix®-based host-clustering products.
  • the groups typically share access to the data stored on the storage system from a direct access storage/storage area network (DAS/SAN).
  • each node is typically directly coupled to a dedicated disk assigned for the purpose of determining access to the storage system.
  • the detecting node asserts a claim upon the disk.
  • the node that asserts a claim to the disk first is granted continued access to the storage system.
  • the node(s) that failed to assert a claim over the disk may have to leave the cluster.
  • the disk helps in determining the new membership of the cluster.
  • the new membership of the cluster receives and transmits data requests from its respective client to the associated DAS storage system with which it is interfaced without interruption.
  • storage systems that are interfaced with multiple independent clustered hosts use Small Computer System Interface (SCSI) reservations to place a reservation on the disk to gain access to the storage system.
  • messages which assert such reservations are usually made over a SCSI transport bus, which has a finite length.
  • SCSI transport coupling has a maximum operable length, which thus limits the distance by which a cluster of nodes can be geographically distributed.
  • a node may be located in one geographic location that experiences a large-scale power failure. It would be advantageous in such an instance to have redundant nodes deployed in different locations. In other words, in a high availability environment, it is desirable that one or more clusters or nodes are deployed in a geographic location which is widely distributed from the other nodes to avoid a catastrophic failure.
  • the typical reservation mechanism is not suitable due to the finite length of the SCSI bus.
  • a fiber channel coupling could be used to couple the disk to the nodes. Although this may provide some additional distance, the fiber channel coupling itself can be comparatively expensive and has its own limitations with respect to length.
  • fencing techniques are employed. However, such fencing techniques had not generally been available to a host-cluster where the cluster is operating in a networked storage environment.
  • a fencing technique for use in a networked storage environment is described in co-pending, commonly-owned United States Patent Application No. 11/187,781 of Erasani et al., for A CLIENT FAILURE FENCING MECHANISM FOR FENCING NETWORKED FILE SYSTEM DATA IN HOST-CLUSTER ENVIRONMENT, filed on even date herewith, which is presently incorporated by reference as though fully set forth herein, and United States Patent Application No.
  • the present invention overcomes the disadvantages of the prior art by providing a clustered networked storage environment that includes a quorum facility that supports a file system protocol, such as the network file system (NFS) protocol, as a shared data source in a clustered environment.
  • a plurality of nodes interconnected as a cluster is configured to utilize the storage services provided by an associated networked storage system.
  • Each node in the cluster is an identically configured redundant node that may be utilized in the case of failover or for load balancing with respect to the other nodes in the cluster.
  • the nodes are hereinafter referred to as "cluster members."
  • Each cluster member is supervised and controlled by cluster software executing on one or more processors in the cluster member.
  • cluster membership is also controlled by an associated network accessed quorum device.
  • the arrangement of the nodes in the cluster, and the cluster software executing on each of the nodes, as well as the quorum device, are hereinafter collectively referred to as the "cluster infrastructure.”
  • the clusters are coupled with the associated storage system through an appropriate network such as a wide area network, a virtual private network implemented over a public network (Internet), or a shared local area network.
  • the clients are typically configured to access information stored on the storage system as directories and files.
  • the cluster members typically communicate with the storage system over a network by exchanging discrete frames or packets of data according to predefined protocols, such as the NFS over Transmission Control Protocol/Internet Protocol (TCP/IP).
  • each cluster member further includes a novel set of software instructions referred to herein as the "quorum program".
  • the quorum program is invoked when a change in cluster membership occurs, or when the cluster members are not receiving reliable information about the continued viability of the cluster, or for a variety of other reasons.
  • the cluster member is programmed to assert a claim on the quorum device configured in accordance with the present invention.
  • the node asserts a claim on the quorum device, illustratively by attempting to place a SCSI reservation on the device.
  • the quorum device is a virtual disk embodied in a logical unit (LUN) exported by the networked storage system.
  • the LUN is created as a quorum device upon which a SCSI-3 reservation can be placed by an initiator.
  • the LUN is created for this purpose as a SCSI target that exists solely as a quorum device.
  • the storage system generates the LUN as the quorum device as an export to the clustered host side of the environment.
  • a cluster member asserting a claim on the quorum device is an initiator and communicates with the SCSI target quorum device by establishing an iSCSI session.
  • the iSCSI session provides a communication path between the cluster member initiator and the quorum device target over a TCP connection.
  • the TCP connection is provided for by the network which couples the storage system to the host clustered side of the environment.
  • establishing "quorum” means that in a two node cluster, the surviving node places a SCSI reservation on the LUN acting as the quorum device and thereby maintains continued access to the storage system.
  • In a multiple node cluster, i.e., one with greater than two nodes, several cluster members can have registrations with the quorum device, but only one will be able to place a reservation on the quorum device.
  • In a multiple node partition, i.e., when the cluster is partitioned into two sub-clusters of two or more cluster members each, each of the sub-clusters nominates a cluster member from its group to place the reservation and clear the registrations of the "losing" cluster members.
  • Those that are successful in having their representative node place the reservation first thus establish a "quorum," which is a new cluster that has continued access to the storage system.
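  • By way of illustration only, the following minimal Python sketch (with hypothetical names; it does not appear in the patent) models the tie-break just described: cluster members register with the quorum device, each sub-cluster nominates a representative, the first representative to place the reservation wins, and the registrations of the "losing" members are cleared.

        # Minimal, hypothetical model of the quorum tie-break described above.
        class QuorumDevice:
            """Stands in for the LUN acting as the quorum device."""
            def __init__(self):
                self.registrations = set()   # members registered with the device
                self.reservation = None      # member currently holding the reservation

            def register(self, member):
                self.registrations.add(member)

            def try_reserve(self, member):
                # Only the first registered claimant succeeds; later claims fail.
                if self.reservation is None and member in self.registrations:
                    self.reservation = member
                    return True
                return False

            def clear_losers(self, winning_members):
                # The winning sub-cluster clears the losers' registrations.
                self.registrations &= set(winning_members)

        def resolve_partition(device, subclusters):
            """Return the sub-cluster that establishes quorum (first to reserve)."""
            for members in subclusters:            # iteration order models who claims first
                representative = members[0]        # each sub-cluster nominates one member
                if device.try_reserve(representative):
                    device.clear_losers(members)
                    return members                 # the new cluster with continued access
            return None

        device = QuorumDevice()
        sub_a, sub_b = ["node1", "node2"], ["node3", "node4"]
        for member in sub_a + sub_b:
            device.register(member)
        print("surviving sub-cluster:", resolve_partition(device, [sub_a, sub_b]))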
  • SCSI Persistent Reservations are used by cluster members to assert a claim on the quorum device.
  • the sequence for invocation of the novel quorum program is to open an iSCSI session, send a command regarding a SCSI reservation of the quorum device (LUN), and wait for a response.
  • the response is either that the SCSI reservation is successful and that cluster member now holds the quorum or that the reservation was unsuccessful and that cluster member must standby for further instruction.
  • the cluster member which opened the iSCSI session then closes the session.
  • the quorum program is a simple user interface that can be readily provided on the host side of the storage environment. Certain required configuration on the storage system side is also provided as described further herein. For example, the LUN which is created as the quorum device is mapped to the cluster members that are allowed access to it. This group of cluster members thus functions as an iSCSI group of initiators.
  • the quorum program can be configured to use SCSI Reserve/Release reservations, instead of Persistent Reservations.
  • the present invention allows SCSI reservation techniques to be employed in a networked storage environment, to provide a quorum facility for clustered hosts associated with the storage system.
  • Fig. 1 is a schematic block diagram of a prior art storage system which utilizes a directly attached quorum disk;
  • Fig. 2 is a schematic block diagram of a prior art storage system which uses a remotely deployed quorum disk that is coupled to each cluster member via fiber channel;
  • Fig. 3 is a schematic block diagram of an exemplary storage system environment for use with an illustrative embodiment of the present invention;
  • Fig. 4 is a schematic block diagram of the storage system with which the present invention can be used;
  • Fig. 5 is a schematic block diagram of the storage operating system in accordance with the embodiment of the present invention;
  • Fig. 6 is a flow chart detailing the steps of a procedure performed for configuring the storage system and creating the LUN to be used as the quorum device in accordance with an embodiment of the present invention;
  • Fig. 7 is a flow chart detailing the steps of a procedure for downloading parameters into cluster members for a user interface in accordance with an embodiment of the present invention;
  • Fig. 8 is a flow chart detailing the steps of a procedure for processing a SCSI reservation command directed to a LUN created in accordance with an embodiment of the present invention; and
  • Fig. 9 is a flowchart detailing the steps of a procedure for an overall process for a simplified architecture for providing fencing techniques and a quorum facility in a network-attached storage system in accordance with an embodiment of the present invention.
  • Fig. 1 is a schematic block diagram of a storage environment 100 that includes a cluster 120 having nodes, referred to herein as "cluster members" 130a and 130b, each of which is an identically configured redundant node that utilizes the storage services of an associated storage system 200.
  • the cluster 120 is depicted as a two-node cluster, however, the architecture of the environment 100 can vary from that shown while remaining within the scope of the present invention.
  • the present invention is described below with reference to an illustrative two-node cluster; however, clusters can be made up of three, four or many nodes. In cases in which there is a cluster having a number of members that is greater than two, a quorum disk may not be needed.
  • the cluster may still use a quorum disk to grant access to the storage system for various reasons.
  • the solution provided by the present invention can also be applied to clusters comprised of more than two nodes.
  • Cluster members 130a and 130b comprise various functional components that cooperate to provide data from storage devices of the storage system 200 to a client 150.
  • the cluster member 130a includes a plurality of ports that couple the member to the client 150 over a computer network 152.
  • the cluster member 130b includes a plurality of ports that couple that member with the client 150 over a computer network 154.
  • each cluster member 130 for example, has a second set of ports that connect the cluster member to the storage system 200 by way of a network 160.
  • the cluster members 130a and 130b, in the illustrative example, communicate over the network 160 using Transmission Control Protocol/Internet Protocol (TCP/IP).
  • In addition to the ports which couple the cluster member 130a to the client 150 and to the network 160, the cluster member 130a also has a number of program modules executing thereon. For example, cluster software 132a performs overall configuration, supervision and control of the operation of the cluster member 130a. An application 134a running on the cluster member 130a communicates with the cluster software to perform the specific function of the application running on the cluster member 130a. This application 134a may be, for example, an Oracle® database application.
  • a SCSI-3 protocol driver 136a is provided as a mechanism by which the cluster member 130a acts as an initiator and accesses data provided by a data server, or "target.”
  • the target in this instance is a directly coupled, directly attached quorum disk 172.
  • the SCSI protocol driver 136a and the associated SCSI bus 138a can attempt to place a SCSI-3 reservation on the quorum disk 172.
  • the SCSI bus 138a has a particular maximum usable length for its effectiveness. Therefore, there is only a certain distance by which the cluster member 130a can be separated from its directly attached quorum disk 172.
  • cluster member 130b includes cluster software 132b which is in communication with an application program 134b.
  • the cluster member 130b is directly attached to quorum disk 172 in the same manner as cluster member 130a. Consequently, cluster members 130a and 130b must be within a particular distance of the directly attached quorum disk 172, and thus within a particular distance of each other. This limits the geographic distribution physically attainable by the cluster architecture.
  • Another example of a prior art system is provided in Figure 2, in which like components have the same reference characters as in Fig. 1. It is noted however, that the client 150 and the associated networks have been omitted from Fig. 2 for clarity of illustration; it should be understood that a client is being served by the cluster 120.
  • cluster members 130a and 130b are coupled to a directly attached quorum disk 172.
  • cluster member 130a for example, has a fiber channel driver 140a providing fiber channel-specific access to a quorum disk 172, via fiber channel coupling 142a.
  • cluster member 130b has a fiber channel driver 140b, which provides fiber channel-specific access to the disk 172 by fiber channel coupling 142b.
  • the fiber channel couplings 142a and 142b are particularly costly and could result in significantly increased costs in a large deployment.
  • Fig. 3 is a schematic block diagram of a storage environment 300 that includes a cluster 320 having cluster members 330a and 330b, each of which is an identically configured redundant node that utilizes the storage services of an associated storage system 400.
  • the cluster 320 is depicted as a two-node cluster, however, the architecture of the environment 300 can widely vary from that shown while remaining within the scope of the present invention.
  • Cluster members 330a and 330b comprise various functional components that cooperate to provide data from storage devices of the storage system 400 to a client 350.
  • the cluster member 330a includes a plurality of ports that couple the member to the client 350 over a computer network 352.
  • the cluster member 330b includes a plurality of ports that couple the member to the client 350 over a computer network 354.
  • each cluster member 330a and 330b has a second set of ports that connect the cluster member to the storage system 400 by way of network 360.
  • the cluster members 330a and 330b in the illustrative example, communicate over the network 360 using TCP/IP. It should be understood that although networks 352, 354 and 360 are depicted in Fig. 3 as individual networks, these networks may in fact comprise a single network or any number of multiple networks, and the cluster members 330a and 330b can be interfaced with one or more such networks in a variety of configurations while remaining within the scope of the present invention.
  • In addition to the ports which couple the cluster member 330a, for example, to the client 350 and to the network 360, the cluster member 330a also has a number of program modules executing thereon.
  • cluster software 332a performs overall configuration, supervision and control of the operation of the cluster member 330a.
  • An application 334a, running on the cluster member 330a, communicates with the cluster software to perform the specific function of the application running on the cluster member 330a.
  • This application 334a may be, for example, an Oracle® database application.
  • fencing program 340a described in the above-identified commonly-owned United States Patent Application No. 11/187,781 is provided. The fencing program 340a allows the cluster member 330a to send fencing instructions to the storage system 400.
  • When cluster membership changes, such as when a cluster member fails, or upon the addition of a new cluster member, or upon a failure of the communication link between cluster members, for example, it may be desirable to "fence off" a failed cluster member to avoid that cluster member writing spurious data to a disk.
  • the fencing program executing on a cluster member not affected by the change in cluster membership (i.e., the "surviving" cluster member) notifies the NFS server in the storage system that a modification must be made in one of the export lists such that a target cluster member, for example, cannot write to given exports of the storage system, thereby fencing off that member from that data.
  • the notification is to change the export lists within an export module of the storage system 400 in such a manner that the cluster member can no longer have write access to particular exports in the storage system 400.
  • the cluster member 330a also includes a quorum program 342a as described in further detail herein.
  • cluster member 330b includes cluster software 332b which is in communication with an application program 334b.
  • a fencing program 340b as herein before described, executes on the cluster member 330b.
  • the cluster members 330a and 330b are illustratively coupled by cluster interconnect 370 across which identification signals, such as a heartbeat, from the other cluster member will indicate the existence and continued viability of the other cluster member.
  • Cluster member 330b also has a quorum program 342b in accordance with the present invention executing thereon.
  • the quorum programs 342a and 342b communicate over a network 360 with a storage system 400. These communications include asserting a claim upon the vdisk (LUN) 380, which acts as the quorum device in accordance with an embodiment of the present invention as described in further detail hereinafter.
  • Other communications can also occur between the cluster members 330a and 330b and the LUN serving as quorum device 380 within the scope of the present invention. These other communications include test messages.
  • Fig. 4 is a schematic block diagram of a multi-protocol storage system 400 configured to provide storage service relating to the organization of information on storage devices, such as disks 402.
  • the storage system 400 is illustratively embodied as a storage appliance comprising a processor 422, a memory 424, a plurality of network adapters 425, 426 and a storage adapter 428 interconnected by a system bus 423.
  • the multiprotocol storage system 400 also includes a storage operating system 500 that provides a virtualization system (and, in particular, a file system) to logically organize the information as a hierarchical structure of named directory, file and virtual disk (vdisk) storage objects on the disks 402.
  • the multi-protocol storage system 400 presents (exports) disks to SAN clients through the creation of LUNs or vdisk objects.
  • a vdisk object (hereinafter "vdisk") is a special file type that is implemented by the virtualization system and translated into an emulated disk as viewed by the SAN clients.
  • the multi-protocol storage system thereafter makes these emulated disks accessible to the SAN clients through controlled exports, as described further herein.
  • the memory 424 comprises storage locations that are addressable by the processor and adapters for storing software program code and data structures.
  • the processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the various data structures.
  • the storage operating system 500, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the storage system by, inter alia, invoking storage operations in support of the storage service implemented by the system. It will be apparent to those skilled in the art that other processing and memory implementations, including various computer readable media, may be used for storing and executing program instructions pertaining to the inventive system and method described herein.
  • the network adapter 425 couples the storage system to a plurality of clients 460a,b over point-to-point links, wide area networks, virtual private networks implemented over a public network (Internet) or a shared local area network, hereinafter referred to as an illustrative Ethernet network 465. Therefore, the network adapter 425 may comprise a network interface card (NIC) having the mechanical, electrical and signaling circuitry needed to connect the system to a network switch, such as a conventional Ethernet switch 470. For this NAS-based network environment, the clients are configured to access information stored on the multi-protocol system as files.
  • the clients 460 communicate with the storage system over network 465 by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).
  • the clients 460 may be general-purpose computers configured to execute applications over a variety of operating systems, including the UNIX® and Microsoft® Windows™ operating systems. Client systems generally utilize file-based access protocols when accessing information (in the form of files and directories) over a NAS-based network. Therefore, each client 460 may request the services of the storage system 400 by issuing file access protocol messages (in the form of packets) to the system over the network 465. For example, a client 460a running the Windows operating system may communicate with the storage system 400 using the Common Internet File System (CIFS) protocol.
  • a client 460b running the UNIX operating system may communicate with the multi-protocol system using either the Network File System (NFS) protocol over TCP/IP or the Direct Access File System (DAFS) protocol over a virtual interface (VI) transport in accordance with a remote DMA (RDMA) protocol over TCP/IP.
  • the storage network "target” adapter 426 also couples the multi-protocol storage system 400 to clients 460 that may be further configured to access the stored information as blocks or disks.
  • the storage system is coupled to an illustrative Fiber Channel (FC) network 485.
  • FC is a networking standard describing a suite of protocols and media that is primarily found in SAN deployments.
  • the network target adapter 426 may comprise a FC host bus adapter (HBA) having the mechanical, electrical and signaling circuitry needed to connect the system 400 to a SAN network switch, such as a conventional FC switch 480.
  • the clients 460 generally utilize block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol, as discussed previously herein, when accessing information (in the form of blocks, disks or vdisks) over a SAN-based network.
  • clients 460 operating in a SAN environment are initiators that initiate requests and commands for data.
  • the multi-protocol storage system is thus a target configured to respond to the requests issued by the initiators in accordance with a request/response protocol.
  • the initiators and targets have endpoint addresses that, in accordance with the FC protocol, comprise worldwide names (WWN).
  • WWN is a unique identifier, e.g., a Node Name or a Port Name, consisting of an 8-byte number.
  • the multi-protocol storage system 400 supports various SCSI-based protocols used in SAN deployments, and in other deployments including SCSI encapsulated over TCP (iSCSI) and SCSI encapsulated over FC (FCP).
  • The initiators (hereinafter clients 460) may thus request the services of the target (hereinafter storage system 400) by issuing iSCSI and FCP messages over the network 465, 485 to access information stored on the disks. It will be apparent to those skilled in the art that the clients may also request the services of the integrated multi-protocol storage system using other block access protocols.
  • the multi-protocol storage system provides a unified and coherent access solution to vdisks/LUNs in a heterogeneous SAN environment.
  • the storage adapter 428 cooperates with the storage operating system 500 executing on the storage system to access information requested by the clients.
  • the information may be stored on the disks 402 or other similar media adapted to store information.
  • the storage adapter includes I/O interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC serial link topology.
  • the information is retrieved by the storage adapter and, if necessary, processed by the processor 422 (or the adapter 428 itself) prior to being forwarded over the system bus 423 to the network adapters 425, 426, where the information is formatted into packets or messages and returned to the clients.
  • Storage of information on the system 400 is preferably implemented as one or more storage volumes (e.g., VOL1-2 450) that comprise a cluster of physical storage disks 402, defining an overall logical arrangement of disk space.
  • the disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID).
  • RAID implementations enhance the reliability/integrity of data storage through the writing of data "stripes" across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data.
  • the redundant information enables recovery of data lost when a storage device fails. It will be apparent to those skilled in the art that other redundancy techniques, such as mirroring, may be used in accordance with the present invention.
  • each volume 450 is constructed from an array of physical disks 402 that are organized as RAID groups 440, 442, and 444.
  • the physical disks of each RAID group include those disks configured to store striped data (D) and those configured to store parity (P) for the data, in accordance with an illustrative RAID 4 level configuration. It should be noted that other RAID level configurations (e.g. RAID 5) are also contemplated for use with the teachings described herein.
  • a minimum of one parity disk and one data disk may be employed.
  • a typical implementation may include three data and one parity disk per RAID group and at least one RAID group per volume.
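  • For illustration (this example is not taken from the patent), RAID 4 protects such a three-data, one-parity group by storing the byte-wise XOR of the data blocks as the parity block, so any single lost block can be rebuilt from the surviving blocks:

        # Illustrative XOR parity for a 3 data + 1 parity RAID 4 stripe.
        from functools import reduce

        def parity(blocks):
            """Byte-wise XOR of the given equal-length blocks."""
            return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

        d1, d2, d3 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x0a\x0b\x0c\x0d"
        p = parity([d1, d2, d3])

        # If the disk holding d2 fails, d2 is recovered from the survivors and parity.
        recovered = parity([d1, d3, p])
        assert recovered == d2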
  • FIG. 5 is a schematic block diagram of an exemplary storage operating system 500 that may be advantageously used in the present invention.
  • a storage operating system 500 comprises a series of software modules organized to form an integrated network protocol stack, or generally, a multi-protocol engine that provides data paths for clients to access information stored on the multi-protocol storage system 400 using block and file access protocols.
  • the protocol stack includes media access layer 510 of network drivers (e.g., gigabit Ethernet drivers) that interfaces through network protocol layers, such as IP layer 512 and its supporting transport mechanism, the TCP layer 514.
  • a file system protocol layer provides multi-protocol file access and, to that end, includes support for the NFS protocol 520, the CIFS protocol 522, and the hypertext transfer protocol (HTTP) 524.
  • An iSCSI driver layer 528 provides block protocol access over the TCP/IP network protocol layers, while an FC driver layer 530 operates with the network adapter to receive and transmit block access requests and responses to and from the storage system.
  • the FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the LUNs (vdisks) and, thus, manage exports of vdisks to either iSCSI or FCP or, alternatively to both iSCSI and FCP when accessing a single vdisk on the storage system.
  • the operating system includes a disk storage layer 540 that implements a disk storage protocol such as a RAID protocol, and a disk driver layer 550 that implements a disk access protocol such as, e.g. a SCSI protocol.
  • the virtualization system 570 includes a file system 574 interacting with virtualization modules illustratively embodied as, e.g., vdisk module 576 and SCSI target module 578. Additionally, the SCSI target module 578 includes a set of initiator data structures 580 and a set of LUN data structures 584. These data structures store various configuration and tracking data utilized by the storage operating system for use with each initiator (client) and LUN (vdisk) associated with the storage system. Vdisk module 576, the file system 574, and the SCSI target module 578 can be implemented in software, hardware, firmware, or a combination thereof.
  • the vdisk module 576 communicates with the file system 574 to enable access by administrative interfaces in response to a storage system administrator issuing commands to a storage system 400.
  • the vdisk module 576 manages all SAN deployments by, among other things, implementing a comprehensive set of vdisk (LUN) commands issued by the storage system administrator.
  • These vdisk commands are converted into primitive file system operations ("primitives") that interact with a file system 574 and the SCSI target module 578 to implement the vdisks.
  • the SCSI target module 578 initiates emulation of a disk or LUN by providing a mapping and procedure that translates LUNs into the special vdisk file types.
  • the SCSI target module is illustratively disposed between the FC and iSCSI drivers 530 and 528 respectively and file system 574 to thereby provide a translation layer of a virtualization system 570 between a SAN block (LUN) and a file system space, where LUNs are represented as vdisks.
  • the SCSI target module 578 has a set of APIs that are based on the SCSI protocol that enable a consistent interface to both the iSCSI and FC drivers 528, 530 respectively.
  • An iSCSI Software Target (ISWT) driver 579 is provided in association with the SCSI target module 578 to allow iSCSI-driven messages to reach the SCSI target. It is noted that by "disposing" SAN virtualization over the file system 574 the storage system 400 reverses approaches taken by prior systems to thereby provide a single unified storage platform for essentially all storage access protocols.
  • the file system 574 provides volume management capabilities for use in block based access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, such as naming of storage objects, the file system 574 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of the storage bandwidth of the disks, and (iii) reliability guarantees such as mirroring and/or parity (RAID) to thereby present one or more storage objects laid on the file system.
  • the file system 574 illustratively implements the WAFL® file system having an on-disk format representation that is block-based using, e.g., 4 kilobyte (KB) blocks and using inodes to describe files.
  • the WAFL® file system uses files to store metadata describing the layout of its file system; these metadata files include, among others, an inode file.
  • a file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk.
  • a description of the structure of the file system, including on-disk inodes and the inode file, is provided in commonly owned U.S. Patent No.
  • the teachings of this invention can be employed in a hybrid system that includes several types of different storage environments such as the particular storage environment 300 of Fig. 3.
  • the invention can be used by a storage system administrator that deploys a system implementing and controlling a plurality of satellite storage environments that, in turn, deploy thousands of drives in multiple networks that are geographically dispersed.
  • the term "storage system” as used herein should, therefore, be taken broadly to include such arrangements.
  • a host-clustered storage environment includes a quorum facility that supports a file system protocol, such as the NFS protocol, as a shared data source in a clustered environment.
  • a plurality of nodes interconnected as a cluster is configured to utilize the storage services provided by an associated networked storage system.
  • Each node in the cluster hereinafter referred to as a "cluster member,” is supervised and controlled by cluster software executing on one or more processors in the cluster member.
  • cluster membership is also controlled by an associated network accessed quorum device.
  • the arrangement of the nodes in the cluster, and the cluster software executing on each of the nodes, as well as the quorum device, are hereinafter collectively referred to as the "cluster infrastructure.”
  • each cluster member further includes a novel set of software instructions referred to herein as the "quorum program.”
  • the quorum program is invoked when a change in cluster membership occurs, or when the cluster members are not receiving reliable information about the continued viability of the cluster, or for a variety of other reasons.
  • the cluster member is programmed to assert a claim on the quorum device configured in accordance with the present invention.
  • the cluster member asserts a claim on the quorum device illustratively by attempting to place a SCSI reservation on the device.
  • the quorum device is a vdisk embodied in a LUN exported by the networked storage system.
  • the LUN is created as a quorum device upon which a SCSI-3 reservation can be placed by an initiator.
  • the LUN is created for this purpose as a SCSI target that exists solely as a quorum device.
  • the storage system generates the LUN as the quorum device as an export to the clustered host side of the environment.
  • a cluster member asserting a claim on the quorum device, which is accomplished illustratively by placing a SCSI reservation on the LUN serving as the quorum device, is an initiator and communicates with the SCSI target quorum device by establishing an iSCSI session.
  • the iSCSI session provides a communication path between the cluster member initiator and the quorum device target, preferably over a TCP connection.
  • the TCP connection is provided for by the network which couples the storage system to the host clustered side of the environment.
  • For purposes of a more complete description, it is noted that a more recent version of the SCSI standard is known as SCSI-3.
  • a target organizes and advertises the presence of data using containers called “logical units” (LUNs).
  • An initiator requests services from a target by building a SCSI-3 "command descriptor block (CDB)."
  • Some CDBs are used to read or write data within a LUN.
  • Others are used to query the storage system to determine the available set of LUNs, or to clear error conditions and the like.
  • the SCSI-3 protocol defines the rules and procedures by which initiators request or receive services from targets.
  • cluster nodes are configured to act as "initiators” to assert claims on a quorum device that is the "target” using a SCSI-3 based reservation mechanism.
  • the quorum device in that instance acts as a tie breaker in the event of failure and ensures that the sub-cluster that has the claim upon the quorum disk will be the one to survive. This ensures that multiple independent clusters do not survive in case of a cluster failure. To allow otherwise could mean that a failed cluster member may continue to survive, but may send spurious messages and possibly write incorrect data to one or more disks of the storage system.
  • There are two different types of reservations supported by the SCSI-3 specification: SCSI Reserve/Release reservations and Persistent Reservations. The two reservation schemes cannot be used together. If a disk is reserved using SCSI Reserve/Release, it will reject all Persistent Reservation commands. Likewise, if a drive is reserved using a Persistent Reservation, it will reject SCSI Reserve/Release commands.
  • SCSI Reserve/Release is essentially a lock/unlock mechanism. SCSI Reserve locks a drive and SCSI Release unlocks it. A drive that is not reserved can be used by any initiator. However, once an initiator issues a SCSI Reserve command to a drive, the drive will only accept commands from that initiator. Therefore, only one initiator can access the device if there is a reservation on it. The device will reject most commands from other initiators (commands such as SCSI Inquiry will still be processed) until the initiator issues a SCSI Release command to it or the drive is reset through either a soft reset or a power cycle, as will be understood by those skilled in the art.
  • Persistent Reservations allow initiators to reserve and unreserve a drive similar to the SCSI Reserve/Release functionality. However, they also allow initiators to determine who has a reservation on a device and to break the reservation of another initiator, if needed. Reserving a device is a two-step process. Each initiator can register a key (an eight byte number) with the device. Once the key is registered, the initiator can try to reserve that device. If there is already a reservation on the device, the initiator can preempt it and atomically change the reservation to claim it as its own. The initiator can also read off the key of another initiator holding a reservation, as well as a list of all other keys registered on the device. If the initiator is programmed to.
  • Persistent Reservations support various access modes ranging from exclusive read/write to read-shared/write- exclusive for the device being reserved.
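  • The following small model (illustrative Python, not the SCSI-3 specification itself) captures the Persistent Reservation behavior described above: an initiator first registers an eight-byte key, may then reserve the device, can read the registered keys, and can preempt and atomically take over another initiator's reservation.

        # Hypothetical model of Persistent Reservation register/reserve/preempt.
        class PersistentReservationDevice:
            def __init__(self):
                self.keys = {}        # initiator -> registered eight-byte key
                self.holder = None    # initiator currently holding the reservation

            def register(self, initiator, key):
                assert len(key) == 8, "keys are eight-byte numbers"
                self.keys[initiator] = key

            def reserve(self, initiator):
                # Step two: only a registered initiator may reserve a free device.
                if initiator in self.keys and self.holder is None:
                    self.holder = initiator
                    return True
                return False

            def read_keys(self):
                return dict(self.keys)

            def preempt(self, initiator, victim_key):
                # Break the current reservation and atomically claim it as its own.
                if initiator in self.keys and self.holder is not None \
                        and self.keys.get(self.holder) == victim_key:
                    del self.keys[self.holder]
                    self.holder = initiator
                    return True
                return False

        device = PersistentReservationDevice()
        device.register("cluster-member-a", b"\x00" * 7 + b"\x01")
        device.register("cluster-member-b", b"\x00" * 7 + b"\x02")
        assert device.reserve("cluster-member-a")
        assert not device.reserve("cluster-member-b")      # device already reserved
        assert device.preempt("cluster-member-b", device.read_keys()["cluster-member-a"])
        print("reservation holder:", device.holder)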
  • SCSI Persistent Reservations are used by cluster members to assert a claim on the quorum device.
  • Illustratively, only one Persistent Reservation command will occur during any one session. Accordingly, the sequence for invocation of the novel quorum program is to open an iSCSI session, send a command regarding a SCSI reservation of the quorum device (LUN), and wait for a response.
  • the response is either that the SCSI reservation is successful and that cluster member now holds the quorum or that the reservation was unsuccessful and that cluster member must standby for further instruction.
  • the cluster member which opened the iSCSI session then closes the session.
  • the quorum program is a user interface that can be readily provided on the host side of the storage environment. Certain required configuration on the storage system side is also provided as described further herein. For example, the LUN which is created as the quorum device is mapped to the cluster members that are allowed access to it. This group of cluster members thus functions as an iSCSI group of initiators.
  • the quorum program can be configured to use SCSI Reserve/Release reservations, instead of Persistent Reservations.
  • a basic configuration is required for the storage system before the quorum facility can be used for the intended purpose. This configuration includes creating the LUN that will be used as the quorum device in accordance with the invention.
  • Fig. 6 illustrates a procedure 600, the steps of which can be used to implement the required configuration on the storage system. The procedure starts with step 602 and continues to step 604. Step 604 requires that the storage system is iSCSI licensed.
  • An exemplary command line for performing this step is as follows: storagesystem>license add XXXXXX. In this command, where XXXXXX appears, the iSCSI license key should be inserted, as will be understood by those skilled in the art. It is noted that in another case, the general iSCSI access could be licensed, but specifically for quorum purposes. A separate license such as an "iSCSI admin" license can be issued royalty free, similar to certain HTTP licenses, as will be understood by those skilled in the art.
  • the next step 606 is to check and set the iSCSI target nodename.
  • An exemplary command line for performing this step is as follows: storagesystem>iscsi nodename
  • the programmer should insert the identification of the iSCSI target nodename which in this instance will be the name of the storage system.
  • the storage system name may have the following format, however, any suitable format may be used: iqn.1992-08.com.sn.335xxxxxx.
  • the nodename may be entered by setting the hostname as the suffix instead of the serial number. The hostname can be used rather than the iSCSI nodename of the storage system as the iSCSI target's address.
  • Step 608 provides that an igroup is to be created comprising the initiator nodes.
  • the initiator nodes in the illustrative embodiment of the invention are the cluster members, such as cluster members 330a and 330b of Fig. 3. If the initiator names for the cluster members, for example, are iqn.1992-08.com.cl1 and iqn.1992-08.com.cl2, then the following command lines can be used, by way of example, to create an igroup in accordance with step 608:
  • Storagesystem>igroup create -i scntap-grp
    Storagesystem>igroup show
    scntap-grp (iSCSI) (ostype: default):
        iqn.1992-08.com.cl1
    Storagesystem>igroup add scntap-grp iqn.1992-08.com.cl2
  • In step 610, the actual LUN is created.
  • more than one LUN can be created if desired in a particular application of the invention.
  • An exemplary command line for creating the LUN, which is illustratively located at /vol/vol0/scntaplun, is as follows: Storagesystem>lun create -s 1g /vol/vol0/scntaplun
  • steps 608 and 610 can be performed in either order. However, both must be successful before proceeding further.
  • In step 612, the created LUN is mapped to the igroup created in step 608. This can be accomplished using the storage system's lun map command line, which names the LUN path, the igroup and the desired LUN ID.
  • Step 612 ensures that the LUN is available to the initiators in the specified group at the LUN ID as specified.
  • the iSCSI Software Target (ISWT) driver is configured for at least one network adapter.
  • As a target driver, part of the ISWT's responsibility is driving certain hardware for the purposes of providing access to the storage system managed LUNs by the iSCSI initiators. This allows the storage system to provide the iSCSI target service over any or all of its standard network interfaces, and a single network interface can be used simultaneously for both iSCSI requests and other types of network traffic (e.g. NFS and/or CIFS requests).
  • the command line which can be used to check the interface is as follows: storagesystem>iscsi show adapter. This indicates which adapters are set up in step 614.
  • The next step (step 616) is to start the iSCSI driver so that iSCSI client calls are ready to be served.
  • In step 616, to start the iSCSI service, the following command line can be used: storagesystem>iscsi start.
  • the procedure 600 completes at step 618.
  • the procedure thus creates the LUN to be used as a network-accessed quorum device in accordance with the invention and allows it to come online and to be accessible so that it is ready when needed to establish a quorum.
  • a LUN may also be created for other purposes which are implemented using the quorum program of the present invention as set forth in each of the cluster members that interface with the LUN.
  • a user interface is to be downloaded from a storage system provider's website or in another suitable manner understood by those skilled in the art, into the individual cluster members that are to have the quorum facility associated herewith.
  • This is illustrated in the flowchart 700 of Fig. 7.
  • the "quorum program" in one or more of the cluster members may be either accompanied by or replaced by a host-side iSCSI driver, such as iSCSI driver 136a (Fig. 1), which is configured to access the LUN serving as the quorum disk in accordance with the present invention.
  • the procedure 700 begins with the start step 702 and continues to step 704 in which an iSCSI parameter is to be supplied at the administrator level.
  • step 704 indicates that the LUN ID should be supplied to the cluster members.
  • This is the identification number of the target LUN in a storage system that is to act as the quorum device.
  • This target LUN will have already been created and will have an identification number pursuant to the procedure 600 of Fig. 6.
  • the next parameter that is to be supplied to the administrator is the target node name.
  • the target nodename is a string which indicates the storage system which exports the LUN.
  • a target nodename string may be, for example, "iqn.1992.08.com.sn.33583650".
  • the target hostname string is to be supplied to the cluster member in accordance with step 708.
  • the target hostname string is simply the host name.
  • the initiator session ID or "ISID" is to be supplied.
  • the initiator nodename string is supplied, which indicates which cluster member is involved so that when a response is sent back to the cluster member from the storage system, the cluster member is appropriately identified and addressed.
  • the initiator nodename string may be for example "iqn.1992.08.com.itst”.
  • the quorum program is downloaded from a storage system provider's website, or in another suitable manner, known to those skilled in the art, into the memory of the cluster member.
  • the quorum program is invoked when the cluster infrastructure determines that a new quorum is to be established.
  • the quorum program contains instructions to send a command line with various input options to specify commands to carry out Persistent Reservation actions on the SCSI target device using a quorum enable command.
  • the quorum enable command includes the following information:
  • The options include the following:
        -h : requests that the usage screen is printed;
        -a : sets an APTPL bit to activate persistence in case of a power loss;
        -f file_name : specifies the file in which to read or write data;
        -o blk_ofst : specifies the block offset at which to read or write data;
        -n num_blks : specifies the number of 512 byte blocks to read or write (128 max);
        -t target_hostname : specifies the target host name, with a default as defined by the operator;
        -T target_iscsi_node_name : specifies a target iSCSI nodename, with an appropriate default;
        -I initiator_iscsi_node_name : specifies the initiator iSCSI node name (default: iqn.1992-08.com..itst);
        -i ISID : specifies an Initiator Session ID
  • the reservation types that can be implemented by the quorum enable command are as follows:
  • Operation is one of the following:
        rk - Read Keys
        re - Read Capabilities
        rr - Read Reservations
        rg - Register
        rv - Reserve
        rl - Release
        cl - Clear
        pt - Preempt
        pa - Preempt Abort
        ri - Register Ignore
        in - Inquiry LUN Serial No.
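  • By way of illustration only (this exact invocation does not appear in the text and the command name shown is assumed), a reservation request built from the options and operations listed above might take a form such as: quorum_enable -t <target_hostname> -T <target_iscsi_nodename> -I <initiator_iscsi_nodename> -i <ISID> rv, where "rv" requests a Reserve; substituting "rl", "pt" or "cl" would request a Release, Preempt or Clear instead.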
  • the quorum enable command is embodied in the quorum program 342a in cluster member 330a of Fig. 3 for example, and is illustratively based on the assumption that only one Persistent Reservation command will occur during any one session invocation. This avoids the need for the program to handle all aspects of iSCSI session management for purposes of simple invocation. Accordingly, the sequence for each invocation of quorum enable is set forth in the flowchart of Fig. 8.
  • the procedure 800 begins with the start step 802 and continues to step 804, in which an iSCSI session is created.
  • an initiator communicates with a target via an iSCSI session.
  • a session is roughly equivalent to a SCSI initiator-target nexus, and consists of a communication path between an initiator and a target, and the current state of that communication (e.g. set of outstanding commands, state of each in-progress command, flow control command window and the like).
  • a session includes one or more TCP connections. If the session contains multiple TCP connections then the session can continue uninterrupted even if one of its underlying TCP connections is lost. An individual SCSI command is linked to a single connection, but if that connection is lost e.g. due to a cable pull, the initiator can detect this condition and reassign that SCSI command to one of the remaining TCP connections for completion.
  • An initiator is identified by a combination of its iSCSI initiator nodename and a numerical initiator session ID or ISID, as described herein before.
  • the procedure 800 continues to step 806 where a test unit ready (TUR) is sent to make sure that the SCSI target is available. Assuming the SCSI target is available, the procedure proceeds to step 808 where the SCSI PR command is constructed.
  • iSCSI protocol messages are embodied as protocol data units or PDUs.
  • the PDU is the basic unit of communication between an iSCSI initiator and its target.
  • Each PDU consists of a 48-byte header and an optional data segment. Opcode and data segment length fields appear at fixed locations within the headers; the format of the rest of the header and the format and content of the data segment are opcode-specific.
  • the PDU is built to incorporate a Persistent Reservation command using quorum enable in accordance with the present invention.
  • the iSCSI PDU is sent to the target node, which in this instance is the LUN operating as a quorum device.
  • the LUN operating as a quorum device then returns a response to the initiator cluster member.
  • the response is parsed by the initiator cluster member and it is determined whether the reservation command operation was successful. If the operation is successful, then the cluster member holds the quorum. If the reservation was not successful, then the cluster member will wait for further information. In either case, in accordance with step 816, the cluster member closes the iSCSI session. In accordance with step 818 a response is returned to the target indicating that the session was terminated.
  • the procedure 800 completes at step 820.
  • this section provides some sample commands which can be used in accordance with the present invention to carry out Persistent Reservation actions on a SCSI target device using the quorum enable command. Notably, the commands do not supply the -T option. If the -T option is not included in the command line options then the program will use SENDTARGETS to determine the target iSCSI nodename, as will be understood by those skilled in the art.
  • Two separate initiators register a key with the SCSI target for the first time and instruct the target to persist the reservation (-a option): # quorum enable -a -t target_hostname -s serv_key1 -r 0 -i ISID -I initiator_iscsi_node_name -I 0 rg
  • cluster members 330a and 330b also include fencing programs 340a and 340b, respectively, which provide failure fencing techniques for file-based data, as well as a quorum facility as provided by the quorum programs 342a and 342b, respectively.
  • the procedure 900 begins at the start step 902 and proceeds to step 904.
  • an initial fence configuration is established for the host cluster.
  • all cluster members initially have read and write access to the exports of the storage system that are involved in a particular application of the invention. In accordance with step 906, a quorum device is provided by creating a LUN (vdisk) as an export on the storage system, upon which cluster members can place SCSI reservations as described in further detail herein.
  • a change in cluster membership is detected by a cluster member as in step 908. This can occur due to a failure of a cluster member, a failure of a communication link between cluster members, the addition of a new node as a cluster member or any other of a variety of circumstances which cause cluster membership to change.
  • the cluster members are programmed using the quorum program of the present invention to attempt to establish a new quorum, as in step 910, by placing a SCSI reservation on the LUN which has been created. This reservation is sent over the network using an iSCSI PDU as described herein.
  • a cluster member receives a response to its attempt to assert quorum on the LUN, as shown in step 912.
  • the response will either be that the cluster member is in the quorum or is not in a quorum.
  • at least one cluster member that holds quorum will then send a fencing message to the storage system over the network as shown in step 914.
  • the fencing message requests the NFS server of the storage system to change export lists of the storage system to disallow write access of the failed cluster member to given exports of the storage system.
  • a server API message is provided for this procedure as set forth in the above incorporated United States Patent Application Numbers 11/187,781 and 11/187,649.
  • the procedure 900 completes in step 916.
  • a new cluster has been established with the surviving cluster members and the surviving cluster members will continue operation until notified otherwise by the storage system or the cluster infrastructure.
  • This can occur in a networked environment using the simplified system and method of the present invention for interfacing a host cluster with a storage system in a networked storage environment.
  • the invention provides for quorum capability and fencing techniques over a network without requiring a directly attached storage system or a directly attached quorum disk, or a fiber channel connection.
  • the invention provides a simplified user interface for providing a quorum facility and for fencing cluster members, which is easily portable across all Unix®-based host platforms.
  • the invention can be implemented and used over TCP with assured reliability.
  • the invention also provides a means to provide a quorum device and to fence cluster members while enabling the use of NFS in a shared collaborative clustering environment.
  • while the present invention has been described in terms of files and directories, it may also be utilized to fence/unfence any form of networked data containers associated with a storage system.
  • the system of the present invention provides a simple and complete user interface that can be plugged into a host cluster framework which can accommodate different types of shared data containers.
  • the system and method of the present invention supports NFS as a shared data source in a high-availability environment that includes one or more storage system clusters and one or more host clusters having end-to-end availability in mission-critical deployments having substantially constant availability.
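As noted in the option list above, the quorum enable interface lends itself to a thin command-line front end. The following Python sketch shows one way such a front end might parse those options and operation codes; the program name, defaults, and the reliance on the parser's built-in "-h" usage output are illustrative assumptions, not the patented quorum program itself.

```python
# Hypothetical sketch of a "quorum enable"-style command-line front end.
# Option names follow the description above; "-h" (usage) is supplied
# automatically by argparse. Everything here is illustrative only.
import argparse

OPERATIONS = {
    "rk": "Read Keys", "re": "Read Capabilities", "rr": "Read Reservations",
    "rg": "Register", "rv": "Reserve", "rl": "Release", "cl": "Clear",
    "pt": "Preempt", "pa": "Preempt Abort", "ri": "Register Ignore",
    "in": "Inquiry LUN Serial No.",
}

def parse_args(argv=None):
    p = argparse.ArgumentParser(
        prog="quorum_enable",
        description="Issue one Persistent Reservation action against the quorum LUN.")
    p.add_argument("-a", action="store_true",
                   help="set the APTPL bit so the reservation persists across a power loss")
    p.add_argument("-f", dest="file_name", help="file in which to read or write data")
    p.add_argument("-o", dest="blk_ofst", type=int, default=0,
                   help="block offset at which to read or write data")
    p.add_argument("-n", dest="num_blks", type=int, default=1,
                   help="number of 512-byte blocks to read or write (128 max)")
    p.add_argument("-t", dest="target_hostname", help="target host name")
    p.add_argument("-T", dest="target_iscsi_node_name",
                   help="target iSCSI nodename (otherwise discovered via SendTargets)")
    p.add_argument("-I", dest="initiator_iscsi_node_name",
                   help="initiator iSCSI node name")
    p.add_argument("-i", dest="isid", help="Initiator Session ID (ISID)")
    p.add_argument("operation", choices=sorted(OPERATIONS),
                   help="Persistent Reservation operation to perform")
    return p.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(f"would perform '{OPERATIONS[args.operation]}' against "
          f"{args.target_hostname or '<SendTargets discovery>'}")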

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Hardware Redundancy (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A host-clustered networked storage environment includes a 'quorum program.' The quorum program is invoked when a change in cluster membership occurs, or when the cluster members are not receiving reliable information about the continued viability of the cluster, or for a variety of other reasons. When the quorum program is so invoked, the cluster member is programmed to assert a claim on a quorum device configured in accordance with the present invention. More specifically, the quorum device is a vdisk embodied as a logical unit (LUN) exported by the networked storage system. The LUN is created as a quorum device upon which a SCSI-3 reservation can be placed by an initiator. Thus, the LUN is created for this purpose as a SCSI target that exists solely as a quorum device. Fencing techniques are also provided in the networked environment such that failed cluster members can be fenced from given exports of the networked storage system.

Description

ARCHITECTURE AND METHOD FOR CONFIGURING A SIMPLIFIED CLUSTER OVER A NETWORK WITH FENCING
AND QUORUM
BACKGROUND OF THE INVENTION
Field of the Invention
This invention relates to data storage systems and more particularly to providing failure fencing of network files and quorum capability in a simplified networked data storage system.
Background Information
A storage system is a computer that provides storage service relating to the organization of information on writable persistent storage devices, such as memories, tapes or disks. The storage system is commonly deployed within a storage area network (SAN) or a network attached storage (NAS) environment. When used within a NAS environment, the storage system may be embodied as a storage system including an operating system that implements a file system to logically organize the information as a hierarchical structure of directories and files on, e.g. the disks. Each "on-disk" file may be implemented as a set of data structures, e.g., disk blocks, configured to store information, such as the actual data for the file. A directory, on the other hand, may be implemented as a specially formatted file in which information about other files and directories are stored. In the client/server model, the client may comprise an application executing on a computer that "connects" to a storage system over a computer network, such as a point-to-point link, shared local area network, wide area network or virtual private network implemented over a public network, such as the Internet. NAS systems generally utilize file-based access protocols; therefore, each client may request the services of the storage system by issuing file system protocol messages (in the form of packets) to the file system over the network. By supporting a plurality of file system protocols, such as the conventional Common Internet File System (CIFS), the Network File System (NFS) and the Direct Access File System (DAFS) protocols, the utility of the storage system may be enhanced for networking clients.
A SAN is a high-speed network that enables establishment of direct connections between a storage system and its storage devices. The SAN may thus be viewed as an extension to a storage bus and, as such, an operating system of the storage system (a storage operating system, as hereinafter defined) enables access to stored information using block-based access protocols over the "extended bus." In this context, the extended bus is typically embodied as Fiber Channel (FC) or Ethernet media (i.e., network) adapted to operate with block access protocols, such as Small Computer Systems Interface (SCSI) protocol encapsulation over FC or TCP/IP/Ethernet.
A SAN arrangement or deployment allows decoupling of storage from the storage system, such as an application server, and placing of that storage on a network. However, the SAN storage system typically manages specifically assigned storage resources. Although storage can be grouped (or pooled) into zones (e.g., through conventional logical unit number or "lun" zoning, masking and management techniques), the storage devices are still pre-assigned by a user that has administrative privileges, (e.g., a storage system administrator, as defined hereinafter) to the storage system.
Thus, the storage system, as used herein, may operate in any type of configuration including a NAS arrangement, a SAN arrangement, or a hybrid storage system that incorporates both NAS and SAN aspects of storage.
Access to disks by the storage system is governed by an associated "storage operating system," which generally refers to the computer-executable code operable on a storage system that manages data access, and may implement file system semantics. In this sense, the NetApp® Data ONTAP™ operating system available from Network Appliance, Inc., of Sunnyvale, California that implements the Write Anywhere File Layout (WAFL™) file system is an example of such a storage operating system implemented as a microkernel. The storage operating system can also be implemented as an application program operating over a general-purpose operating system, such as UNIX® or Windows NT®, or as a general-purpose operating system with configurable functionality, which is configured for storage applications as described herein.
In many high availability server environments, clients requesting services from applications whose data is stored on a storage system are typically served by coupled server nodes that are clustered into one or more groups. Examples of these node groups are Unix®-based host-clustering products. The groups typically share access to the data stored on the storage system from a direct access storage/storage area network (DAS/SAN). Typically, there is a communication link configured to transport signals, such as a heartbeat, between nodes such that during normal operations, each node has notice that the other nodes are in operation.
The absence of a heartbeat signal indicates to a node that there has been a failure of some kind. Typically, only one member should be allowed access to the shared storage system. In order to resolve which of the two nodes can continue to gain access to the storage system, each node is typically directly coupled to a dedicated disk assigned for the purpose of determining access to the storage system. When a node is notified of a failure of another node, or detects the absence of the heartbeat from that node, the detecting node asserts a claim upon the disk. The node that asserts a claim to the disk first is granted continued access to the storage system. Depending on how the host-cluster framework is implemented, the node(s) that failed to assert a claim over the disk may have to leave the cluster. This can be achieved by the failed node committing "suicide," as will be understood by those skilled in the art, or by being explicitly terminated. Hence, the disk helps in determining the new membership of the cluster. Thus, the new membership of the cluster receives and transmits data requests from its respective client to the associated DAS storage system with which it is interfaced without interruption. Typically, storage systems that are interfaced with multiple independent clustered hosts use Small Computer System Interface (SCSI) reservations to place a reservation on the disk to gain access to the storage system. However, messages which assert such reservations are usually made over a SCSI transport bus, which has a finite length. Such SCSI transport coupling has a maximum operable length, which thus limits the distance by which a cluster of nodes can be geographically distributed. And yet, wide geographic distribution is sometimes important in a high availability environment to provide fault tolerance in case of a catastrophic failure in one geographic location. For example, a node may be located in one geographic location that experiences a large-scale power failure. It would be advantageous in such an instance to have redundant nodes deployed in different locations. In other words, in a high availability environment, it is desirable that one or more clusters or nodes are deployed in a geographic location which is widely distributed from the other nodes to avoid a catastrophic failure.
However, in terms of providing access for such clusters, the typical reservation mechanism is not suitable due to the finite length of the SCSI bus. In some instances, a fiber channel coupling could be used to couple the disk to the nodes. Although this may provide some additional distance, the fiber channel coupling itself can be comparatively expensive and has its own limitations with respect to length.
To further provide protection in the event of failed nodes, fencing techniques are employed. However, such fencing techniques had not generally been available to a host-cluster where the cluster is operating in a networked storage environment. A fencing technique for use in a networked storage environment is described in co-pending, commonly-owned United States Patent Application No. 11/187,781 of Erasani et al., for A CLIENT FAILURE FENCING MECHANISM FOR FENCING NETWORKED FILE SYSTEM DATA IN HOST-CLUSTER ENVIRONMENT, filed on even date herewith, which is presently incorporated by reference as though fully set forth herein, and United States Patent Application No. 11/187,649 for a SERVER API FOR FENCING CLUSTER HOSTS VIA EXPORT ACCESS RIGHTS, of Thomas Haynes et al., filed on even date herewith, which is also incorporated by reference as though fully set forth herein. There remains a need, therefore, for an improved architecture for a networked storage system having a host-clustered client which has a facility for determining which node has continued access to the storage system, that does not require a directly attached disk. There remains a further need for such a networked storage system, which also includes a feature that provides a technique for restricting access to certain data of the storage system.
SUMMARY OF THE INVENTION
The present invention overcomes the disadvantages of the prior art by providing a clustered networked storage environment that includes a quorum facility that supports a file system protocol, such as the network file system (NFS) protocol, as a shared data source in a clustered environment. A plurality of nodes interconnected as a cluster is configured to utilize the storage services provided by an associated networked storage system. Each node in the cluster is an identically configured redundant node that may be utilized in the case of failover or for load balancing with respect to the other nodes in the cluster. The nodes are hereinafter referred to as "cluster members." Each cluster member is supervised and controlled by cluster software executing on one or more processors in the cluster member. As described in further detail herein, cluster membership is also controlled by an associated network accessed quorum device. The arrangement of the nodes in the cluster, and the cluster software executing on each of the nodes, as well as the quorum device, are hereinafter collectively referred to as the "cluster infrastructure."
The clusters are coupled with the associated storage system through an appropriate network such as a wide area network, a virtual private network implemented over a public network (Internet), or a shared local area network. For a networked environment, the clients are typically configured to access information stored on the storage system as directories and files. The cluster members typically communicate with the storage system over a network by exchanging discrete frames or packets of data according to predefined protocols, such as the NFS over Transmission Control Protocol/Internet Protocol (TCP/IP).
According to illustrative embodiments of the present invention, each cluster member further includes a novel set of software instructions referred to herein as the "quorum program". The quorum program is invoked when a change in cluster membership occurs, or when the cluster members are not receiving reliable information about the continued viability of the cluster, or for a variety of other reasons. When the quorum program is so invoked, the cluster member is programmed to assert a claim on the quorum device configured in accordance with the present invention. The node asserts a claim on the quorum device, illustratively by attempting to place a SCSI reservation on the device. More specifically, the quorum device is a virtual disk embodied in a logical unit (LUN) exported by the networked storage system. The LUN is created as a quorum device upon which a SCSI-3 reservation can be placed by an initiator. Thus, the LUN is created for this purpose as a SCSI target that exists solely as a quorum device. In accordance with illustrative embodiments of the invention, the storage system generates the LUN as the quorum device as an export to the clustered host side of the environment. A cluster member asserting a claim on the quorum device is an initiator and communicates with the SCSI target quorum device by establishing an iSCSI session. The iSCSI session provides a communication path between the cluster member initiator and the quorum device target over a TCP connection. The TCP connection is provided for by the network which couples the storage system to the host clustered side of the environment.
As used herein, establishing "quorum" means that in a two node cluster, the surviving node places a SCSI reservation on the LUN acting as the quorum device and thereby maintains continued access to the storage system. In a multiple node cluster, i.e., greater than two nodes, several cluster members can have registrations with the quorum device, but only one will be able to place a reservation on the quorum device. In the case of a multiple node partition, i.e., the cluster is partitioned into two sub-clusters of two or more cluster members each, then each of the sub-clusters nominates a cluster member from its group to place the reservation and clear registrations of the "losing" cluster members. Those that are successful in having their representative node place the reservation first thus establish a "quorum," which is a new cluster that has continued access to the storage system.
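The nomination and race just described can be summarized in a short sketch. The four callables below are placeholders for the SCSI-3 Persistent Reservation operations (register, reserve, read keys, clear registration) that the quorum program would issue over iSCSI; they are assumptions for illustration, not the patented implementation.

```python
# Minimal sketch, assuming one representative per sub-cluster races for the
# reservation on the quorum LUN and the winner clears the losers' keys.
def try_to_establish_quorum(sub_cluster, my_name, register_key,
                            place_reservation, read_keys, clear_registration):
    """Return True if this member's sub-cluster wins the quorum race."""
    representative = min(sub_cluster)    # nominate one member per sub-cluster
    if my_name != representative:
        return False                     # stand by; our representative races
    register_key(my_name)                # register this member's key first
    if not place_reservation(my_name):
        return False                     # the other sub-cluster won the race
    for member in read_keys():           # clear the "losing" registrations
        if member not in sub_cluster:
            clear_registration(member)
    return True

# Toy usage with an in-memory stand-in for the quorum LUN:
keys, holder = set(), []
won = try_to_establish_quorum(
    ["member-a", "member-b"], "member-a",
    register_key=keys.add,
    place_reservation=lambda k: (holder.append(k) or True) if not holder else False,
    read_keys=lambda: set(keys),
    clear_registration=keys.discard)
```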
In accordance with one embodiment of the invention, SCSI Persistent Reservations are used by cluster members to assert a claim on the quorum device. Illustratively, only one Persistent Reservation command will occur during any one session. Accordingly, the sequence for invocation of the novel quorum program is to open an iSCSI session, send a command regarding a SCSI reservation of the quorum device (LUN), and wait for a response. The response is either that the SCSI reservation is successful and that cluster member now holds the quorum or that the reservation was unsuccessful and that cluster member must standby for further instruction. After obtaining a response, the cluster member which opened the iSCSI session then closes the session. The quorum program is a simple user interface that can be readily provided on the host side of the storage environment. Certain required configuration on the storage system side is also provided as described further herein. For example, the LUN which is created as the quorum device is mapped to the cluster members that are allowed access to it. This group of cluster members thus functions as an iSCSI group of initiators. In accordance with another aspect of the present invention, the quorum program can be configured to use SCSI Reserve/Release reservations, instead of Persistent Reservations.
Further details regarding creating a LUN and mapping that LUN to a particular client on a storage system are provided in commonly owned United States Patent Application No. 10/619,122 filed on July 14, 2003, by Lee et al., for SYSTEM AND MESSAGE FOR OPTIMIZED LUN MASKING, which is presently incorporated herein as though fully set forth in its entirety.
By utilizing the teachings of the present invention, the present invention allows SCSI reservation techniques to be employed in a networked storage environment, to provide a quorum facility for clustered hosts associated with the storage system.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and further advantages of the invention may be understood by referring to the following description in conjunction with the accompanying drawings in which like reference numerals indicate identical or functionally similar elements:
Fig. 1 is a schematic block diagram of a prior art storage system which utilizes a directly attached quorum disk;
Fig. 2 is a schematic block diagram of a prior art storage system which uses a remotely deployed quorum disk that is coupled to each cluster member via fiber channel;
Fig. 3 is a schematic block diagram of an exemplary storage system environment for use with an illustrative embodiment of the present invention;
Fig. 4 is a schematic block diagram of the storage system with which the present invention can be used;
Fig. 5 is a schematic block diagram of the storage operating system in accordance with the embodiment of the present invention;
Fig. 6 is a flow chart detailing the steps of a procedure performed for configuring the storage system and creating the LUN to be used as the quorum device in accordance with an embodiment of the present invention;
Fig. 7 is a flow chart detailing the steps of a procedure for downloading parameters into cluster members for a user interface in accordance with an embodiment of the present invention;
Fig. 8 is a flow chart detailing the steps of a procedure for processing a SCSI reservation command directed to a LUN created in accordance with an embodiment of the present invention; and
Fig. 9 is a flowchart detailing the steps of a procedure for an overall process for a simplified architecture for providing fencing techniques and a quorum facility in a network-attached storage system in accordance with an embodiment of the present invention.
DETAILED DESCRIPTION OF AN ILLUSTRATIVE
EMBODIMENT
A. Cluster Environment
Fig. 1 is a schematic block diagram of a storage environment 100 that includes a cluster 120 having nodes, referred to herein as "cluster members" 130a and 130b, each of which is an identically configured redundant node that utilizes the storage services of an associated storage system 200. For purposes of clarity of illustration, the cluster 120 is depicted as a two-node cluster, however, the architecture of the environment 100 can vary from that shown while remaining within the scope of the present invention. The present invention is described below with reference to an illustrative two-node cluster; however, clusters can be made up of three, four or many nodes. In cases in which there is a cluster having a number of members that is greater than two, a quorum disk may not be needed. In some other instances, however, in clusters having more than two nodes, the cluster may still use a quorum disk to grant access to the storage system for various reasons. Thus, the solution provided by the present invention can also be applied to clusters comprised of more than two nodes.
Cluster members 130a and 130b comprise various functional components that cooperate to provide data from storage devices of the storage system 200 to a client 150. The cluster member 130a includes a plurality of ports that couple the member to the client 150 over a computer network 152. Similarly, the cluster member 130b includes a plurality of ports that couple that member with the client 150 over a computer network 154. In addition, each cluster member 130, for example, has a second set of ports that connect the cluster member to the storage system 200 by way of a network 160. The cluster members 130a and 130b, in the illustrative example, communicate over the network 160 using Transmission Control Protocol/Internet Protocol (TCP/IP). It should be understood that although networks 152, 154 and 160 are depicted in Fig. 1 as individual networks, these networks may in fact comprise a single network or any number of multiple networks, and the cluster members 130a and 130b can be interfaced with one or more of such networks in a variety of configurations while remaining within the scope of the present invention.
In addition to the ports which couple the cluster member 130a to the client 150 and to the network 160, the cluster member 130a also has a number of program modules executing thereon. For example, cluster software 132a performs overall configuration, supervision and control of the operation of the cluster member 130a. An application 134a running on the cluster member 130a communicates with the cluster software to perform the specific function of the application running on the cluster member 130a. This application 134a may be, for example, an Oracle® database application.
In addition, a SCSI-3 protocol driver 136a is provided as a mechanism by which the cluster member 130a acts as an initiator and accesses data provided by a data server, or "target." The target in this instance is a directly coupled, directly attached quorum disk 172. Thus, using the SCSI protocol driver 136a and the associated SCSI bus 138a, the cluster member 130a can attempt to place a SCSI-3 reservation on the quorum disk 172. As noted before, however, the SCSI bus 138a has a particular maximum usable length for its effectiveness. Therefore, there is only a certain distance by which the cluster member 130a can be separated from its directly attached quorum disk 172.
Similarly, cluster member 130b includes cluster software 132b which is in communication with an application program 134b. The cluster member 130b is directly attached to quorum disk 172 in the same manner as cluster member 130a. Consequently, cluster members 130a and 130b must be within a particular distance of the directly attached quorum disk 172, and thus within a particular distance of each other. This limits the geographic distribution physically attainable by the cluster architecture.
Another example of a prior art system is provided in Figure 2, in which like components have the same reference characters as in Fig. 1. It is noted however, that the client 150 and the associated networks have been omitted from Fig. 2 for clarity of illustration; it should be understood that a client is being served by the cluster 120.
In the prior art system illustrated in Fig. 2, cluster members 130a and 130b are coupled to a directly attached quorum disk 172. In this system, cluster member 130a, for example, has a fiber channel driver 140a providing fiber channel-specific access to a quorum disk 172, via fiber channel coupling 142a. Similarly, cluster member 130b has a fiber channel driver 140b, which provides fiber channel-specific access to the disk 172 by fiber channel coupling 142b. Though it allows some additional distance of separation from cluster members 130a and 130b, the fiber channel coupling 142a and 142b is particularly costly and could result in significantly increased costs in a large deployment.
Thus, it should be understood that the systems of Fig. 1 and Fig. 2 have disadvantages in that they impose geographical limitations or higher costs, or both.
In accordance with illustrative embodiments of the present invention, Fig. 3 is a schematic block diagram of a storage environment 300 that includes a cluster 320 having cluster members 330a and 330b, each of which is an identically configured redundant node that utilizes the storage services of an associated storage system 400. For purposes of clarity of illustration, the cluster 320 is depicted as a two-node cluster, however, the architecture of the environment 300 can widely vary from that shown while remaining within the scope of the present invention.
Cluster members 330a and 330b comprise various functional components that cooperate to provide data from storage devices of the storage system 400 to a client 350. The cluster member 330a includes a plurality of ports that couple the member to the client 350 over a computer network 352. Similarly, the cluster member 330b includes a plurality of ports that couple the member to the client 350 over a computer network 354. In addition, each cluster member 330a and 330b, for example, has a second set of ports that connect the cluster member to the storage system 400 by way of network 360. The cluster members 330a and 330b, in the illustrative example, communicate over the network 360 using TCP/IP. It should be understood that although networks 352, 354 and 360 are depicted in Fig. 3 as individual networks, these networks may in fact comprise a single network or any number of multiple networks, and the cluster members 330a and 330b can be interfaced with one or more such networks in a variety of configurations while remaining within the scope of the present invention.
In addition to the ports which couple the cluster member 330a, for example, to the client 350 and to the network 360, the cluster member 330a also has a number of program modules executing thereon. For example, cluster software 332a performs overall configuration, supervision and control of the operation of the cluster member 330a. An application 334a running on the cluster member 330a communicates with the cluster software to perform the specific function of the application running on the cluster member 330a. This application 334a may be, for example, an Oracle® database application. In addition, fencing program 340a described in the above-identified commonly-owned United States Patent Application No. 11/187,781 is provided. The fencing program 340a allows the cluster member 330a to send fencing instructions to the storage system 400. More specifically, when cluster membership changes, such as when a cluster member fails, or upon the addition of a new cluster member, or upon a failure of the communication link between cluster members, for example, it may be desirable to "fence off" a failed cluster member to avoid that cluster member writing spurious data to a disk, for example. In this case, the fencing program executing on a cluster member not affected by the change in cluster membership (i.e., the "surviving" cluster member) notifies the NFS server in the storage system that a modification must be made in one of the export lists such that a target cluster member, for example, cannot write to given exports of the storage system, thereby fencing off that member from that data. The notification is to change the export lists within an export module of the storage system 400 in such a manner that the cluster member can no longer have write access to particular exports in the storage system 400. In addition, in accordance with an illustrative embodiment of the invention, the cluster member 330a also includes a quorum program 342a as described in further detail herein.
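To make the fencing notification concrete, the sketch below shows how a surviving member might compose and send a fence request over the network. The actual server API for changing export lists is defined in the incorporated applications (11/187,781 and 11/187,649); the message layout, field names, port, and use of a plain TCP socket here are assumptions made purely for illustration.

```python
# Illustrative fencing-request sender; not the storage system's real API.
import json
import socket

def send_fence_request(storage_system: str, port: int,
                       fenced_member: str, exports: list[str]) -> str:
    """Ask the storage system to remove write access for a failed member."""
    request = {
        "action": "fence",          # restrict the member to read-only access
        "client": fenced_member,    # cluster member being fenced off
        "exports": exports,         # exports whose export lists should change
    }
    with socket.create_connection((storage_system, port), timeout=10) as sock:
        sock.sendall(json.dumps(request).encode() + b"\n")
        return sock.makefile().readline().strip()

# Hypothetical usage (hostname, port and export path are placeholders):
# send_fence_request("storagesystem", 10000, "cluster-member-b", ["/vol/vol0/shared"])
```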
Similarly, cluster member 330b includes cluster software 332b which is in communication with an application program 334b. A fencing program 340b as herein before described, executes on the cluster member 330b. The cluster members 330a and 330b are illustratively coupled by cluster interconnect 370 across which identification signals, such as a heartbeat, from the other cluster member will indicate the existence and continued viability of the other cluster member.
Cluster member 330b also has a quorum program 342b in accordance with the present invention executing thereon. The quorum programs 342a and 342b communicate over a network 360 with a storage system 400. These communications include asserting a claim upon the vdisk (LUN) 380, which acts as the quorum device in accordance with an embodiment of the present invention as described in further detail hereinafter. Other communications can also occur between the cluster members 330a and 330b and the LUN serving as quorum device 380 within the scope of the present invention. These other communications include test messages.
B. Storage System
Fig. 4 is a schematic block diagram of a multi-protocol storage system 400 configured to provide storage service relating to the organization of information on storage devices, such as disks 402. The storage system 400 is illustratively embodied as a storage appliance comprising a processor 422, a memory 424, a plurality of network adapters 425, 426 and a storage adapter 428 interconnected by a system bus 423. The multiprotocol storage system 400 also includes a storage operating system 500 that provides a virtualization system (and, in particular, a file system) to logically organize the information as a hierarchical structure of named directory, file and virtual disk (vdisk) storage objects on the disks 402.
Whereas clients of a NAS-based network environment have a storage viewpoint of files, the clients of a SAN-based network environment have a storage viewpoint of blocks or disks. To that end, the multi-protocol storage system 400 presents (exports) disks to SAN clients through the creation of LUNs or vdisk objects. A vdisk object (hereinafter "vdisk") is a special file type that is implemented by the virtualization system and translated into an emulated disk as viewed by the SAN clients. The multi-protocol storage system thereafter makes these emulated disks accessible to the SAN clients through controlled exports, as described further herein.
In the illustrative embodiment, the memory 424 comprises storage locations that are addressable by the processor and adapters for storing software program code and data structures. The processor and adapters may, in turn, comprise processing elements and/or logic circuitry configured to execute the software code and manipulate the various data structures. The storage operating system 500, portions of which are typically resident in memory and executed by the processing elements, functionally organizes the storage system by, inter alia, invoking storage operations in support of the storage service implemented by the system. It will be apparent to those skilled in the art that other processing and memory implementations, including various computer readable media, may be used for storing and executing program instructions pertaining to the inventive system and method described herein.
The network adapter 425 couples the storage system to a plurality of clients 460a,b over point-to-point links, wide area networks, virtual private networks implemented over a public network (Internet) or a shared local area network, hereinafter referred to as an illustrative Ethernet network 465. Therefore, the network adapter 425 may comprise a network interface card (NIC) having the mechanical, electrical and signaling circuitry needed to connect the system to a network switch, such as a conventional Ethernet switch 470. For this NAS-based network environment, the clients are configured to access information stored on the multi-protocol system as files. The clients 460 communicate with the storage system over network 465 by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol/Internet Protocol (TCP/IP).
The clients 460 may be general-purpose computers configured to execute applications over a variety of operating systems, including the UNIX® and Microsoft® Windows™ operating systems. Client systems generally utilize file-based access protocols when accessing information (in the form of files and directories) over a NAS-based network. Therefore, each client 460 may request the services of the storage system 400 by issuing file access protocol messages (in the form of packets) to the system over the network 465. For example, a client 460a running the Windows operating system may communicate with the storage system 400 using the Common Internet File System (CIFS) protocol. On the other hand, a client 460b running the UNIX operating system may communicate with the multi-protocol system using either the Network File System (NFS) protocol over TCP/IP or the Direct Access File System (DAFS) protocol over a virtual interface (VI) transport in accordance with a remote DMA (RDMA) protocol over TCP/IP. It will be apparent to those skilled in the art that other clients running other types of operating systems may also communicate with the integrated multi-protocol storage system using other file access protocols.
The storage network "target" adapter 426 also couples the multi-protocol storage system 400 to clients 460 that may be further configured to access the stored information as blocks or disks. For this SAN-based network environment, the storage system is coupled to an illustrative Fiber Channel (FC) network 485. FC is a networking standard describing a suite of protocols and media that is primarily found in SAN deployments. The network target adapter 426 may comprise a FC host bus adapter (HBA) having the mechanical, electrical and signaling circuitry needed to connect the system 400 to a SAN network switch, such as a conventional FC switch 480.
The clients 460 generally utilize block-based access protocols, such as the Small Computer Systems Interface (SCSI) protocol, as discussed previously herein, when accessing information (in the form of blocks, disks or vdisks) over a SAN-based network. SCSI is a peripheral input/output (I/O) interface with a standard, device independent protocol that allows different peripheral devices, such as disks 402, to attach to the storage system 400. As noted herein, in SCSI terminology, clients 460 operating in a SAN environment are initiators that initiate requests and commands for data. The multi-protocol storage system is thus a target configured to respond to the requests issued by the initiators in accordance with a request/response protocol. The initiators and targets have endpoint addresses that, in accordance with the FC protocol, comprise worldwide names (WWN). A WWN is a unique identifier, e.g., a Node Name or a Port Name, consisting of an 8-byte number.
The multi-protocol storage system 400 supports various SCSI-based protocols used in SAN deployments, and in other deployments including SCSI encapsulated over TCP (iSCSI) and SCSI encapsulated over FC (FCP). The initiators (hereinafter clients 460) may thus request the services of the target (hereinafter storage system 400) by issuing iSCSI and FCP messages over the network 465, 485 to access information stored on the disks. It will be apparent to those skilled in the art that the clients may also request the services of the integrated multi-protocol storage system using other block access protocols. By supporting a plurality of block access protocols, the multi-protocol storage system provides a unified and coherent access solution to vdisks/LUNs in a heterogeneous SAN environment.
The storage adapter 428 cooperates with the storage operating system 500 executing on the storage system to access information requested by the clients. The information may be stored on the disks 402 or other similar media adapted to store information. The storage adapter includes I/O interface circuitry that couples to the disks over an I/O interconnect arrangement, such as a conventional high-performance, FC serial link topology. The information is retrieved by the storage adapter and, if necessary, processed by the processor 422 (or the adapter 428 itself) prior to being forwarded over the system bus 423 to the network adapters 425, 426, where the information is formatted into packets or messages and returned to the clients.
Storage of information on the system 400 is preferably implemented as one or more storage volumes (e.g., VOL1-2 450) that comprise a cluster of physical storage disks 402, defining an overall logical arrangement of disk space. The disks within a volume are typically organized as one or more groups of Redundant Array of Independent (or Inexpensive) Disks (RAID). RAID implementations enhance the reliability/integrity of data storage through the writing of data "stripes" across a given number of physical disks in the RAID group, and the appropriate storing of redundant information with respect to the striped data. The redundant information enables recovery of data lost when a storage device fails. It will be apparent to those skilled in the art that other redundancy techniques, such as mirroring, may be used in accordance with the present invention.
Specifically, each volume 450 is constructed from an array of physical disks 402 that are organized as RAID groups 440, 442, and 444. The physical disks of each RAID group include those disks configured to store striped data (D) and those configured to store parity (P) for the data, in accordance with an illustrative RAID 4 level configuration. It should be noted that other RAID level configurations (e.g. RAID 5) are also contemplated for use with the teachings described herein. In the illustrative embodiment, a minimum of one parity disk and one data disk may be employed. However, a typical implementation may include three data and one parity disk per RAID group and at least one RAID group per volume.
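The role of the parity disk in such a RAID group can be illustrated with a toy sketch: the parity block is simply the XOR of the data blocks in a stripe, so any single lost block can be rebuilt from the survivors. This is a teaching illustration only, not the storage system's RAID implementation.

```python
# RAID-4 style parity as XOR of equal-length data blocks in a stripe.
def parity(blocks: list[bytes]) -> bytes:
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripe = [b"DATA-1..", b"DATA-2..", b"DATA-3.."]   # three data blocks
p = parity(stripe)                                 # stored on the parity disk

# Lose one data block and rebuild it from the parity block and the survivors.
rebuilt = parity([p, stripe[0], stripe[2]])
assert rebuilt == stripe[1]
```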
C. Storage Operating System
Fig. 5 is a schematic block diagram of an exemplary storage operating system 500 that may be advantageously used in the present invention. A storage operating system 500 comprises a series of software modules organized to form an integrated network protocol stack, or generally, a multi-protocol engine that provides data paths for clients to access information stored on the multi-protocol storage system 400 using block and file access protocols. The protocol stack includes media access layer 510 of network drivers (e.g., gigabit Ethernet drivers) that interfaces through network protocol layers, such as IP layer 512 and its supporting transport mechanism, the TCP layer 514. A file system protocol layer provides multi-protocol file access and, to that end, includes support for the NFS protocol 520, the CIFS protocol 522, and the hypertext transfer protocol (HTTP) 524.
An iSCSI driver layer 528 provides block protocol access over the TCP/IP network protocol layers, while an FC driver layer 530 operates with the network adapter to receive and transmit block access requests and responses to and from the storage system. The FC and iSCSI drivers provide FC-specific and iSCSI-specific access control to the LUNs (vdisks) and, thus, manage exports of vdisks to either iSCSI or FCP or, alternatively to both iSCSI and FCP when accessing a single vdisk on the storage system. In addition, the operating system includes a disk storage layer 540 that implements a disk storage protocol such as a RAID protocol, and a disk driver layer 550 that implements a disk access protocol such as, e.g. a SCSI protocol.
Bridging the disk software modules with the integrated network protocol stack layer is a virtualization system 570. The virtualization system 570 includes a file system 574 interacting with virtualization modules illustratively embodied as, e.g., vdisk module 576 and SCSI target module 578. Additionally, the SCSI target module 578 includes a set of initiator data structures 580 and a set of LUN data structures 584. These data structures store various configuration and tracking data utilized by the storage operating system for use with each initiator (client) and LUN (vdisk) associated with the storage system. Vdisk module 576, the file system 574, and the SCSI target module 578 can be implemented in software, hardware, firmware, or a combination thereof.
The vdisk module 576 communicates with the file system 574 to enable access by administrative interfaces in response to a storage system administrator issuing commands to a storage system 400. In essence, the vdisk module 576 manages all SAN deployments by, among other things, implementing a comprehensive set of vdisk (LUN) commands issued by the storage system administrator. These vdisk commands are converted into primitive file system operations ("primitives") that interact with a file system 574 and the SCSI target module 578 to implement the vdisks. The SCSI target module 578 initiates emulation of a disk or LUN by providing a mapping and procedure that translates LUNs into the special vdisk file types. The SCSI target module is illustratively disposed between the FC and iSCSI drivers 530 and 528 respectively and file system 574 to thereby provide a translation layer of a virtualization system 570 between a SAN block (LUN) and a file system space, where LUNs are represented as vdisks. To that end, the SCSI target module 578 has a set of APIs that are based on the SCSI protocol that enable consistent interface to both the iSCSI and FC drivers 528, 530 respectively. An iSCSI Software Target (ISWT) driver 579 is provided in association with the SCSI target module 578 to allow iSCSI-driven messages to reach the SCSI target. It is noted that by "disposing" SAN virtualization over the file system 574 the storage system 400 reverses approaches taken by prior systems to thereby provide a single unified storage platform for essentially all storage access protocols.
The file system 574 provides volume management capabilities for use in block based access to the information stored on the storage devices, such as disks. That is, in addition to providing file system semantics, such as naming of storage objects, the file system 574 provides functions normally associated with a volume manager. These functions include (i) aggregation of the disks, (ii) aggregation of the storage bandwidth of the disks, and (iii) reliability guarantees such as mirroring and/or parity (RAID) to thereby present one or more storage objects laid on the file system.
The file system 574 illustratively implements the WAFL® file system having an on-disk format representation that is block based using, e.g., 4 kilobyte (KB) blocks and using inodes to describe files. The WAFL® file system uses files to store metadata describing the layout of its file system; these metadata files include, among others, an inode file. A file handle, i.e., an identifier that includes an inode number, is used to retrieve an inode from disk. A description of the structure of the file system, including on-disk inodes and the inode file, is provided in commonly owned U.S. Patent No. 5,819,292, titled METHOD FOR MAINTAINING CONSISTENT STATES OF A FILE SYSTEM AND FOR CREATING USER-ACCESSIBLE READ-ONLY COPIES OF A FILE SYSTEM by David Hitz et al., issued October 6, 1998, which patent is hereby incorporated by reference as though fully set forth herein.
It should be understood that the teachings of this invention can be employed in a hybrid system that includes several types of different storage environments such as the particular storage environment 300 of Fig. 3. The invention can be used by a storage system administrator that deploys a system implementing and controlling a plurality of satellite storage environments that, in turn, deploy thousands of drives in multiple networks that are geographically dispersed. Thus, the term "storage system" as used herein, should, therefore, be taken broadly to include such arrangements.
D. Quorum Facility
In an illustrative embodiment of the invention, a host-clustered storage environment includes a quorum facility that supports a file system protocol, such as the NFS protocol, as a shared data source in a clustered environment. A plurality of nodes interconnected as a cluster is configured to utilize the storage services provided by an associated networked storage system. Each node in the cluster, hereinafter referred to as a "cluster member," is supervised and controlled by cluster software executing on one or more processors in the cluster member. As described in further detail herein, cluster membership is also controlled by an associated network accessed quorum device. The arrangement of the nodes in the cluster, and the cluster software executing on each of the nodes, as well as the quorum device, are hereinafter collectively referred to as the "cluster infrastructure."
According to illustrative embodiments of the present invention, each cluster member further includes a novel set of software instructions referred to herein as the "quorum program." The quorum program is invoked when a change in cluster membership occurs, or when the cluster members are not receiving reliable information about the continued viability of the cluster, or for a variety of other reasons. When the quorum program is so invoked, the cluster member is programmed to assert a claim on the quorum device configured in accordance with the present invention. The cluster member asserts a claim on the quorum device illustratively by attempting to place a SCSI reservation on the device. More specifically, the quorum device is a vdisk embodied in a LUN exported by the networked storage system. The LUN is created as a quorum device upon which a SCSI-3 reservation can be placed by an initiator. Thus, the LUN is created for this purpose as a SCSI target that exists solely as a quorum device.
In accordance with illustrative embodiments of the invention, the storage system generates the LUN as the quorum device as an export to the clustered host side of the environment. A cluster member asserting a claim on the quorum device, which is accomplished illustratively by placing a SCSI reservation on the LUN serving as a quorum device, is an initiator and communicates with the SCSI target quorum device by establishing an iSCSI session. The iSCSI session provides a communication path between the cluster member initiator and the quorum device target, preferably over a TCP connection. The TCP connection is provided for by the network which couples the storage system to the host clustered side of the environment.
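The per-invocation flow (open an iSCSI session over TCP, confirm the target LUN is ready, send one Persistent Reservation command, parse the response, and close the session, as in procedure 800) can be sketched as follows. The `transport` object and its methods are placeholders standing in for a real iSCSI initiator stack; they are assumptions for illustration, not the patented program.

```python
# Minimal sketch of one quorum-claim invocation: one session, one command.
from dataclasses import dataclass

@dataclass
class QuorumParams:
    lun_id: int
    target_hostname: str
    target_nodename: str
    initiator_nodename: str
    isid: str

def claim_quorum(params: QuorumParams, transport) -> bool:
    """Return True if this cluster member now holds the quorum."""
    session = transport.open_session(params.target_hostname,
                                     params.target_nodename,
                                     params.initiator_nodename,
                                     params.isid)             # iSCSI login over TCP
    try:
        if not transport.test_unit_ready(session, params.lun_id):
            return False                                      # quorum LUN unavailable
        response = transport.send_pr_command(session, params.lun_id,
                                             operation="rv")  # attempt the reservation
        return response.ok                                    # True -> holds the quorum
    finally:
        transport.close_session(session)                      # always log out
```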
For purposes of a more complete description, it is noted that a more recent version of the SCSI standard is known as SCSI-3. A target organizes and advertises the presence of data using containers called "logical units" (LUNs). An initiator requests services from a target by building a SCSI-3 "command descriptor block (CDB)." Some CDBs are used to write data within a LUN. Others are used to query the storage system to determine the available set of LUNs, or to clear error conditions and the like.
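For reference, the sketch below packs a PERSISTENT RESERVE OUT CDB and its 24-byte parameter list following the standard SPC-3 layout (opcode 0x5F, service action, scope/type, parameter list length; then the 8-byte reservation key and the APTPL flag that the "-a" option described earlier turns on). The surrounding iSCSI transport is omitted, and the example key value is arbitrary.

```python
# Sketch of SCSI-3 PERSISTENT RESERVE OUT CDB and parameter list construction.
import struct

PR_OUT_OPCODE = 0x5F
SERVICE_ACTIONS = {"register": 0x00, "reserve": 0x01, "release": 0x02,
                   "clear": 0x03, "preempt": 0x04, "preempt_abort": 0x05,
                   "register_ignore": 0x06}
WRITE_EXCLUSIVE = 0x01      # a simple reservation type, used here as a lock

def pr_out_cdb(service_action: str, scope: int = 0,
               res_type: int = WRITE_EXCLUSIVE) -> bytes:
    """10-byte PERSISTENT RESERVE OUT CDB."""
    return struct.pack(">BBB2xIB",
                       PR_OUT_OPCODE,
                       SERVICE_ACTIONS[service_action] & 0x1F,   # byte 1: service action
                       ((scope & 0xF) << 4) | (res_type & 0xF),  # byte 2: scope | type
                       24,                                       # parameter list length
                       0)                                        # control byte

def pr_out_parameters(reservation_key: int, service_action_key: int = 0,
                      aptpl: bool = False) -> bytes:
    """24-byte parameter list: the two 8-byte keys plus the APTPL flag."""
    return struct.pack(">QQ4xB3x",
                       reservation_key,
                       service_action_key,
                       0x01 if aptpl else 0x00)

# Example: register a key (persisting through power loss), then reserve the LUN.
register = (pr_out_cdb("register"), pr_out_parameters(0x1122334455667788, aptpl=True))
reserve = (pr_out_cdb("reserve"), pr_out_parameters(0x1122334455667788))
```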
The SCSI-3 protocol defines the rules and procedures by which initiators request or receive services from targets. In a clustered environment, when a quorum facility is to be employed, cluster nodes are configured to act as "initiators" to assert claims on a quorum device that is the "target" using a SCSI-3 based reservation mechanism. The quorum device in that instance acts as a tie breaker in the event of failure and ensures that the sub-cluster that has the claim upon the quorum disk will be the one to survive. This ensures that multiple independent clusters do not survive in case of a cluster failure. To allow otherwise could mean that a failed cluster member may continue to survive, but may send spurious messages and possibly write incorrect data to one or more disks of the storage system.
There are two different types of reservations supported by the SCSI-3 specification. The first type of reservation is known as SCSI Reserve/Release reservations. The second is known as Persistent Reservations. The two reservation schemes cannot be used together. If a disk is reserved using SCSI Reserve/Release, it will reject all Persistent Reservation commands. Likewise, if a drive is reserved using Persistent Reservation, it will reject SCSI Reserve/Release.
SCSI Reserve/Release is essentially a lock/unlock mechanism. SCSI Reserve locks a drive and SCSI Release unlocks it. A drive that is not reserved can be used by any initiator. However, once an initiator issues a SCSI Reserve command to a drive, the drive will only accept commands from that initiator. Therefore, only one initiator can access the device if there is a reservation on it. The device will reject most commands from other initiators (commands such as SCSI Inquiry will still be processed) until the initiator issues a SCSI Release command to it or the drive is reset through either a soft reset or a power cycle, as will be understood by those skilled in the art.
Persistent Reservations allow initiators to reserve and unreserve a drive similar to the SCSI Reserve/Release functionality. However, they also allow initiators to determine who has a reservation on a device and to break the reservation of another device, if needed. Reserving a device is a two step process. Each initiator can register a key (an eight byte number) with the device. Once the key is registered, the initiator can try to reserve that device. If there is already a reservation on the device, the initiator can preempt it and atomically change the reservation to claim it as its own. The initiator can also read off the key of another initiator holding a reservation, as well as a list of all other keys registered on the device. If the initiator is programmed to understand the format of the keys, it can determine who currently has the device reserved. Persistent Reservations support various access modes ranging from exclusive read/write to read-shared/write-exclusive for the device being reserved.
In accordance with one embodiment of the invention, SCSI Persistent Reservations are used by cluster members to assert a claim on the quorum device. Illustratively, only one Persistent Reservation command will occur during any one session. Accordingly, the sequence for invocation of the novel quorum program is to open an iSCSI session, send a command regarding a SCSI reservation of the quorum device (LUN), and wait for a response. The response is either that the SCSI reservation is successful and that cluster member now holds the quorum or that the reservation was unsuccessful and that cluster member must standby for further instruction. After obtaining a response, the cluster member which opened the iSCSI session then closes the session. The quorum program is a user interface that can be readily provided on the host side of the storage environment. Certain required configuration on the storage system side is also provided as described further herein. For example, the LUN which is created as the quorum device is mapped to the cluster members that are allowed access to it. This group of cluster members thus functions as an iSCSI group of initiators. In accordance with another aspect of the present invention, the quorum program can be configured to use SCSI Reserve/Release reservations, instead of Persistent Reservations.
Furthermore, in accordance with the present invention, a basic configuration is required for the storage system before the quorum facility can be used for the intended purpose. This configuration includes creating the LUN that will be used as the quorum device in accordance with the invention. Fig. 6 illustrates a procedure 600, the steps of which can be used to implement the required configuration on the storage system. The procedure starts with step 602 and continues to step 604. Step 604 requires that the storage system is iSCSI licensed. An exemplary command line for performing this step is as follows: storagesystem>license add XXXXXX In this command, where XXXXXX appears, the iSCSI license key should be inserted, as will be understood by those skilled in the art. It is noted that in another case, the general iSCSI access could be licensed, but specifically for quorum purposes. A separate license such as an "iSCSI admin" license can be issued royalty free, similar to certain HTTP licenses, as will be understood by those skilled in the art. The next step 606 is to check and set the iSCSI target nodename.
An exemplary command line for performing this step is as follows:
storagesystem>iscsi nodename
The programmer should insert the identification of the iSCSI target nodename, which in this instance will be the name of the storage system. By way of example, the storage system name may have the following format, although any suitable format may be used: iqn.1992-08.com.sn.335xxxxxx. Alternatively, the nodename may be entered by setting the hostname as the suffix instead of the serial number. The hostname can be used rather than the iSCSI nodename of the storage system as the iSCSI target's address.
Step 608 provides that an igroup is to be created comprising the initiator nodes. The initiator nodes in the illustrative embodiment of the invention are the cluster members, such as cluster members 330a and 330b of Fig. 3. If the initiator names for the cluster members are, for example, iqn.1992-08.com.cl1 and iqn.1992-08.com.cl2, then the following command lines can be used, by way of example, to create an igroup in accordance with step 608:
Storagesystem>igroup create -i scntap-grp
Storagesystem>igroup show scntap-grp
scntap-grp (iSCSI) (ostype: default):
Storagesystem>igroup add scntap-grp iqn.1992-08.com.cl1
Storagesystem>igroup add scntap-grp iqn.1992-08.com.cl2
Storagesystem>igroup show scntap-grp
scntap-grp (iSCSI) (ostype: default):
    iqn.1992-08.com.cl1
    iqn.1992-08.com.cl2
In accordance with step 610, the actual LUN is created. In certain embodiments of the invention, more than one LUN can be created if desired in a particular application of the invention. An exemplary command line for creating the LUN, which is illustratively located at \vol\vol0\scntaplun, is as follows:
Storagesystem>lun create -s 1g \vol\vol0\scntaplun
Storagesystem>lun show
\vol\vol0\scntaplun    1g (1073741824)    (r/w, online)
It is noted that steps 608 and 610 can be performed in either order. However, both must be successful before proceeding further. In step 612, the LUN created in step 610 is mapped to the igroup created in step 608, as illustrated by the exemplary command lines below.
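By way of non-limiting illustration, and assuming a mapping command whose syntax is consistent with the exemplary command lines above (the exact syntax is an assumption and may differ on a given storage operating system), the mapping of step 612 may be performed with a command of the following general form:
Storagesystem>lun map \vol\vol0\scntaplun scntap-grp 0
The mapping can then be verified using the following command line: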
Storagesystem>lun show -v \vol\vol0\scntaplun
\vol\vol0\scntaplun    1g (1073741824)    (r/w, online)
Step 612 ensures that the LUN is available to the initiators in the specified igroup at the LUN ID as specified. In accordance with step 614, the iSCSI Software Target (ISWT) driver is configured for at least one network adapter. As a target driver, part of the ISWT's responsibility is to drive certain hardware for the purpose of providing access to the storage system managed LUNs by the iSCSI initiators. This allows the storage system to provide the iSCSI target service over any or all of its standard network interfaces, and a single network interface can be used simultaneously for both iSCSI requests and other types of network traffic (e.g., NFS and/or CIFS requests).
The command line which can be used to check the interface is as follows:
storagesystem>iscsi show adapter
This indicates which adapters are set up in step 614.
Now that the LUN has been mapped to the igroup and the iSCSI driver has been set up and implemented, the next step (step 616) is to start the iSCSI driver so that iSCSI client calls are ready to be served. At step 616, to start the iSCSI service, the following command line can be used:
storagesystem>iscsi start
The procedure 600 completes at step 618. The procedure thus creates the LUN to be used as a network-accessed quorum device in accordance with the invention and allows it to come online and be accessible so that it is ready when needed to establish a quorum. As noted, in addition to providing a quorum facility, a LUN may also be created for other purposes which are implemented using the quorum program of the present invention as set forth in each of the cluster members that interface with the LUN.
Once the storage system is appropriately configured, a user interface is to be downloaded, from a storage system provider's website or in another suitable manner understood by those skilled in the art, into the individual cluster members that are to have the quorum facility associated herewith. This is illustrated in the flowchart 700 of Fig. 7. In another embodiment of the invention, the "quorum program" in one or more of the cluster members may be either accompanied by or replaced by a host-side iSCSI driver, such as iSCSI driver 136a (Fig. 1), which is configured to access the LUN serving as the quorum disk in accordance with the present invention. The procedure 700 begins with the start step 702 and continues to step 704, in which an iSCSI parameter is to be supplied at the administrator level. More specifically, step 704 indicates that the LUN ID should be supplied to the cluster members. This is the identification number of the target LUN in a storage system that is to act as the quorum device. This target LUN will have already been created and will have an identification number pursuant to the procedure 600 of Fig. 6.
In step 706, the next parameter to be supplied is the target nodename. The target nodename is a string which indicates the storage system which exports the LUN. A target nodename string may be, for example, "iqn.1992-08.com.sn.33583650".
Next, the target hostname string is to be supplied to the cluster member in accordance with step 708. The target hostname string is simply the host name.
In accordance with step 710, the initiator session ID, or "ISID", is to be supplied. This is a 6 byte initiator session ID which takes the form, for example: 11:22:33:44:55:66.
In accordance with step 712, the initiator nodename string is supplied, which indicates which cluster member is involved so that when a response is sent back to the cluster member from the storage system, the cluster member is appropriately identified and addressed. The initiator nodename string may be, for example, "iqn.1992-08.com.itst". The setup procedure 700 of Fig. 7 completes at step 714.
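For purposes of illustration only, the parameters supplied in steps 704 through 712 may be viewed as a small per-cluster-member configuration record. The following Python sketch is hypothetical: the field names and the representation are an illustrative choice rather than part of the procedure itself, and the example values are patterned on those given above.

# Hypothetical per-cluster-member quorum configuration; names and values are illustrative only.
from dataclasses import dataclass

@dataclass
class QuorumConfig:
    lun_id: int                 # step 704: LUN ID of the target LUN acting as the quorum device
    target_nodename: str        # step 706: iSCSI nodename of the storage system exporting the LUN
    target_hostname: str        # step 708: host name of the storage system
    isid: bytes                 # step 710: 6-byte initiator session ID
    initiator_nodename: str     # step 712: identifies the cluster member itself

cfg = QuorumConfig(
    lun_id=0,
    target_nodename="iqn.1992-08.com.sn.33583650",
    target_hostname="storagesystem",
    isid=bytes.fromhex("112233445566"),
    initiator_nodename="iqn.1992-08.com.itst",
)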
Once the storage system has been configured in accordance with procedure 600 of Fig. 6 and the cluster member has been supplied with the appropriate information in accordance with procedure 700 of Fig. 7, then the quorum program is downloaded, from a storage system provider's website or in another suitable manner known to those skilled in the art, into the memory of the cluster member.
As noted, the quorum program is invoked when the cluster infrastructure determines that a new quorum is to be established. When this occurs, the quorum program contains instructions to send a command line with various input options that specify commands to carry out Persistent Reservation actions on the SCSI target device, using a quorum enable command. The quorum enable command includes the following information:
Usage: quorum enable [-t target_hostname] [-T target_iscsi_node_name]
[-I initiator_iscsi_node_name] [-i ISID] [-l lun] [-r resv_key] [-s serv_key]
[-f file_name] [-o blk_ofst] [-n num_blks] [-y type] [-a] [-v] [-h] operation
The options include "-h", which requests that the usage screen be printed; the "-a" option sets an APTPL bit to activate persistence of the reservation through a power loss; the "-f file_name" option specifies the file in which to read or write data; the "-o blk_ofst" option specifies the block offset at which to read or write data; the "-n num_blks" option specifies a number of 512-byte blocks to read or write (128 max); the "-t target_hostname" option specifies the target host name, with a default as defined by the operator; the "-T target_iscsi_node_name" option specifies a target iSCSI nodename, with an appropriate default; the "-I initiator_iscsi_node_name" option specifies the initiator iSCSI nodename (default: iqn.1992-08.com.itst); the "-i ISID" option specifies an Initiator Session ID (default 0); the "-l lun" option specifies the LUN (default 0); the "-r resv_key" option specifies the reservation key (default 0); the "-s serv_key" option specifies the service action reservation key (default 0); the "-y type" option specifies the reservation type (default 5); and the "-v" option requests verbose output.
The reservation types that can be implemented by the quorum enable command are as follows:
Reservation types
1- Write Exclusive
2- Obsolete
3- Exclusive Access
4- Obsolete
5- Write Exclusive Registrants Only
6- Exclusive Access Registrants Only
7- Write Exclusive All Registrants
8- Exclusive Access All Registrants
Operation is one of the following:
rk - Read Keys
re - Read Capabilities
rr - Read Reservations
rg - Register
rv - Reserve
rl - Release
cl - Clear
pt - Preempt
pa - Preempt Abort
ri - Register Ignore
in - Inquiry LUN Serial No.
These codes conform to the SCSI-3 specification as will be understood by those skilled in the art. The quorum enable command is embodied in the quorum program 342a in cluster member 330a of Fig. 3 for example, and is illustratively based on the assumption that only one Persistent Reservation command will occur during any one session invocation. This avoids the need for the program to handle all aspects of iSCSI session management for purposes of simple invocation. Accordingly, the sequence for each invocation of quorum enable is set forth in the flowchart of Fig. 8.
The procedure 800 begins with the start step 802 and continues to step 804, which is to create an iSCSI session. As noted herein, an initiator communicates with a target via an iSCSI session. A session is roughly equivalent to a SCSI initiator-target nexus, and consists of a communication path between an initiator and a target, and the current state of that communication (e.g., the set of outstanding commands, the state of each in-progress command, the flow control command window, and the like).
A session includes one or more TCP connections. If the session contains multiple TCP connections then the session can continue uninterrupted even if one of its underlying TCP connections is lost. An individual SCSI command is linked to a single connection, but if that connection is lost e.g. due to a cable pull, the initiator can detect this condition and reassign that SCSI command to one of the remaining TCP connections for completion.
An initiator is identified by a combination of its iSCSI initiator nodename and a numerical initiator session ID, or ISID, as described hereinbefore. After establishing this session, the procedure 800 continues to step 806, where a test unit ready (TUR) is sent to make sure that the SCSI target is available. Assuming the SCSI target is available, the procedure proceeds to step 808, where the SCSI PR command is constructed. As will be understood by those skilled in the art, iSCSI protocol messages are embodied as protocol data units, or PDUs.
The PDU is the basic unit of communication between an iSCSI initiator and its target. Each PDU consists of a 48-byte header and an optional data segment. Opcode and data segment length fields appear at fixed locations within the headers; the format of the rest of the header and the format and content of the data segment are opcode specific. Thus, the PDU is built to incorporate a Persistent Reservation command using quorum enable in accordance with the present invention. Once this is built, in accordance with step 810, the iSCSI PDU is sent to the target node, which in this instance is the LUN operating as a quorum device. The LUN operating as a quorum device then returns a response to the initiator cluster member. In accordance with step 814, the response is parsed by the initiator cluster member and it is determined whether the reservation command operation was successful. If the operation is successful, then the cluster member holds the quorum. If the reservation was not successful, then the cluster member will wait for further information. In either case, in accordance with step 816, the cluster member closes the iSCSI session. In accordance with step 818, a response is returned to the target indicating that the session was terminated. The procedure 800 completes at step 820.
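By way of illustration only, the following Python sketch restates the sequence of procedure 800. Every function and field name is hypothetical and the network I/O is stubbed; the sketch is not an implementation of any particular iSCSI stack, and merely mirrors the steps of Fig. 8: open a session, send a test unit ready, build a single Persistent Reservation command as an iSCSI PDU, send it to the quorum LUN, parse the response to decide whether quorum is held, and close the session.

# Hypothetical sketch of procedure 800; all names are illustrative and the I/O is stubbed.
from dataclasses import dataclass

@dataclass
class Pdu:
    opcode: int                  # opcode field of the 48-byte basic header (value illustrative)
    data_segment_length: int     # data segment length field of the header
    data: bytes                  # optional data segment carrying the reservation parameters

def open_iscsi_session(target_hostname, initiator_nodename, isid):
    # Step 804: establish the initiator-target nexus (stubbed).
    return {"target": target_hostname, "initiator": initiator_nodename, "isid": isid}

def test_unit_ready(session):
    # Step 806: confirm that the SCSI target is available (stubbed to succeed).
    return True

def build_pr_pdu(resv_key, serv_key, resv_type):
    # Step 808: wrap one Persistent Reservation command (e.g., type 5, Write Exclusive
    # Registrants Only) in an iSCSI PDU.
    params = resv_key.to_bytes(8, "big") + serv_key.to_bytes(8, "big") + bytes([resv_type])
    return Pdu(opcode=0x01, data_segment_length=len(params), data=params)

def send_pdu(session, pdu, lun):
    # Step 810: send the PDU to the LUN acting as the quorum device (stubbed response).
    return {"status": "GOOD"}    # a conflicting reservation would yield "RESERVATION CONFLICT"

def quorum_enable(target_hostname, initiator_nodename, isid, lun=0):
    session = open_iscsi_session(target_hostname, initiator_nodename, isid)
    if not test_unit_ready(session):
        return False
    pdu = build_pr_pdu(resv_key=1, serv_key=1, resv_type=5)
    response = send_pdu(session, pdu, lun)
    holds_quorum = (response["status"] == "GOOD")   # steps 812-814: parse the response
    # Steps 816-818: close the session in either case.
    return holds_quorum

A cluster member that receives a successful result from such a routine would treat itself as holding the quorum; an unsuccessful result corresponds to standing by for further instruction.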
Examples
For purposes of illustration, this section provides some sample commands which can be used in accordance with the present invention to carry out Persistent Reservation actions on a SCSI target device using the quorum enable command. Notably, the commands do not supply the -T option. If the -T option is not included in the command line options, then the program will use SENDTARGETS to determine the target iSCSI nodename, as will be understood by those skilled in the art.
i). Two separate initiators register a key with the SCSI target for the first time and instruct the target to persist the reservation (-a option):
# quorum enable -a -t target_hostname -s serv_key1 -r 0 -i ISID -I initiator_iscsi_node_name -l 0 rg
# quorum enable -a -t target_hostname -s serv_key2 -r 0 -i ISID -I initiator_iscsi_node_name -l 0 rg
ii). Create a WERO reservation on LUN0 (-l option):
# quorum enable -t target_hostname -r resv_key -s serv_key -i ISID -I initiator_iscsi_node_name -y 5 -l 0 rv
iii). Change the reservation from WERO to WEAR on LUN0:
# quorum enable -t target_hostname -r resv_key -s serv_key -i ISID -I initiator_iscsi_node_name -y 7 -l 0 pt
iv). Clear all the reservations/registrations on LUN0:
# quorum enable -r resv_key -a -i ISID -I initiator_iscsi_node_name cl
v). Write 2k of data to LUN0 starting at block 0 from the file foo:
# quorum enable -f /u/home/temp/foo -n 4 -o 0 -i ISID -t target_hostname -I initiator_iscsi_node_name wr
In accordance with an illustrative embodiment of the invention, the storage environment 300 of Fig. 3 can be configured such that each cluster member 330a and 330b also includes a fencing program 340a and 340b, respectively, which provides failure fencing techniques for file-based data, as well as a quorum facility as provided by the quorum programs 342a and 342b, respectively. A flowchart further detailing the method of this embodiment of the invention is depicted in Fig. 9.
The procedure 900 begins at the start step 902 and proceeds to step 904. In accordance with step 904, an initial fence configuration is established for the host cluster. Typically, all cluster members initially have read and write access to the exports of the storage system that are involved in a particular application of the invention. In accordance with step 906, a quorum device is provided by creating a LUN (vdisk) as an export on the storage system, upon which cluster members can place SCSI reservations as described in further detail herein.
During operation, as data is served by the storage system, a change in cluster membership is detected by a cluster member as in step 908. This can occur due to a failure of a cluster member, a failure of a communication link between cluster members, the addition of a new node as a cluster member or any other of a variety of circumstances which cause cluster membership to change. Upon detection of this change in cluster membership, the cluster members are programmed using the quorum program of the present invention to attempt to establish a new quorum, as in step 910, by placing a SCSI reservation on the LUN which has been created. This reservation is sent over the network using an iSCSI PDU as described herein.
Thereafter, a cluster member receives a response to its attempt to assert quorum on the LUN, as shown in step 912. The response will either be that the cluster member is in the quorum or is not in the quorum. At least one cluster member that holds quorum will then send a fencing message to the storage system over the network, as shown in step 914. The fencing message requests the NFS server of the storage system to change the export lists of the storage system to disallow write access of the failed cluster member to given exports of the storage system. A server API message is provided for this procedure as set forth in the above-incorporated United States Patent Application Numbers 11/187,781 and 11/187,649.
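For exposition only, the following Python sketch outlines the host-side flow of procedure 900. All function names are hypothetical, the reservation attempt and the fencing message are stubbed, and the actual server API message format is the one described in the incorporated applications rather than anything shown here.

# Hypothetical sketch of the host-side flow of procedure 900; names are illustrative only.
def attempt_quorum_reservation(quorum_lun_id):
    # Step 910: place a SCSI reservation on the quorum LUN via an iSCSI PDU (stubbed).
    return True                  # True when this cluster member obtains the quorum (step 912)

def send_fencing_message(storage_system, failed_members, exports):
    # Step 914: ask the storage system's NFS server to modify its export lists so that the
    # failed members lose write access to the given exports (stubbed).
    print("fence", failed_members, "from", exports, "on", storage_system)

def on_membership_change(storage_system, quorum_lun_id, failed_members, exports):
    # Step 908: a change in cluster membership has been detected.
    if attempt_quorum_reservation(quorum_lun_id):
        send_fencing_message(storage_system, failed_members, exports)
        return "quorum held; failed members fenced"
    return "quorum not held; stand by for further instruction"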
Once the cluster member with quorum has fenced off the failed cluster members, or those identified by the cluster infrastructure, the procedure 900 completes in step 916. Thus, a new cluster has been established with the surviving cluster members, and the surviving cluster members will continue operation until notified otherwise by the storage system or the cluster infrastructure. This can occur in a networked environment using the simplified system and method of the present invention for interfacing a host cluster with a storage system in a networked storage environment.

The invention provides for quorum capability and fencing techniques over a network without requiring a directly attached storage system, a directly attached quorum disk, or a Fibre Channel connection. Thus, the invention provides a simplified user interface for providing a quorum facility and for fencing cluster members, which is easily portable across all Unix®-based host platforms. In addition, the invention can be implemented and used over TCP with ensured reliability. The invention also provides a means to provide a quorum device and to fence cluster members while enabling the use of NFS in a shared collaborative clustering environment. It should be noted that while the present invention has been written in terms of files and directories, the present invention also may be utilized to fence/unfence any form of networked data containers associated with a storage system. It should be further noted that the system of the present invention provides a simple and complete user interface that can be plugged into a host cluster framework which can accommodate different types of shared data containers. Furthermore, the system and method of the present invention supports NFS as a shared data source in a high-availability environment that includes one or more storage system clusters and one or more host clusters having end-to-end availability in mission-critical deployments having substantially constant availability.

The foregoing has been a detailed description of the invention. Various modifications and additions can be made without departing from the spirit and scope of the invention. Furthermore, it is expressly contemplated that the various processes, layers, modules and utilities shown and described according to this invention can be implemented as software, consisting of a computer readable medium including programmed instructions executing on a computer, as hardware or firmware using state machines and the like, or as a combination of hardware, software and firmware. Accordingly, this description is meant to be taken only by way of example and not to otherwise limit the scope of the invention.
What is claimed is:


CLAIMS
1. A method of providing a quorum facility in a networked, host-clustered storage environment, comprising the steps of:
providing a plurality of nodes configured in a cluster for sharing data, each node being a cluster member;
providing a storage system that supports a plurality of data containers, said storage system supporting a protocol to provide access to each respective data container associated with the storage system;
creating a logical unit (LUN) on the storage system as a quorum device;
mapping the logical unit to an iSCSI group of initiators, which group is made up of the cluster members;
coupling the cluster to the storage system;
providing a quorum program in each cluster member such that when a change in cluster membership is detected, a surviving cluster member is instructed to send a message to an iSCSI target to place a SCSI reservation on the LUN; and
if a cluster member of the igroup is successful in placing the SCSI reservation on the LUN, then quorum is established for that cluster member.
2. The method as defined in claim 1 wherein said protocol used by said networked storage system is the Network File System protocol.
3. The method as defined in claim 1 wherein the cluster is coupled to the storage system over a network using Transmission Control Protocol / Internet Protocol.
4. The method as defined in claim 1 wherein said cluster member transmits said message that includes an iSCSI Protocol Data Unit.
5. The method as defined in claim 1 further comprising the step of said cluster members sending messages including instructions other than placing SCSI reservations on said quorum device.
6. The method as defined in claim 1 wherein said SCSI reservation is a Persistent Reservation.
7. The method as defined in claim 1 wherein said SCSI reservation is a Reserve/Release reservation.
8. The method as defined in claim 1 including the further step of employing an iSCSI driver in said cluster member to communicate with said LUN instead of or in addition to said quorum program.
9. A method for performing fencing and quorum techniques in a clustered storage environment, comprising the steps of:
providing a plurality of nodes configured in a cluster for sharing data, each node being a cluster member;
providing a storage system that supports a plurality of data containers, said storage system supporting a protocol that configures export lists that assign each cluster member certain access permission rights, including read/write access permission or read-only access permission, as to each respective data container associated with the storage system;
creating a logical unit (LUN) configured as a quorum device;
coupling the cluster to the storage system;
providing a fencing program in each cluster member such that when a change in cluster membership is detected, a surviving member sends an application program interface message to said storage system commanding said storage system to modify one or more of said export lists such that the access permission rights of one or more identified cluster members are modified; and
providing a quorum program in each cluster member such that when a change in cluster membership is detected, a surviving cluster member transmits a message to an iSCSI target to place a SCSI reservation on the LUN.
10. A system of performing quorum capability in a storage system environment, comprising:
one or more storage systems coupled to one or more clusters of interconnected cluster members to provide storage services to one or more clients;
a logical unit exported by said storage system and said logical unit being configured as a quorum device; and
a quorum program running on one or more cluster members including instructions such that when cluster membership changes, each cluster member asserts a claim on the quorum device by sending an iSCSI Protocol Data Unit message to place an iSCSI reservation on the logical unit serving as a quorum device.
11. The system as defined in claim 10 wherein said one or more storage systems are coupled to said one or more clusters by way of one or more networks that use the Transmission Control Protocol/Internet Protocol.
12. The system as defined in claim 10 wherein said storage system is configured to utilize the Network File System protocol.
13. The system as defined in claim 10 further comprising:
a fencing program running on one or more cluster members including instructions for issuing a host application program interface message when a change in cluster membership is detected, said application program interface message commanding said storage system to modify one or more of said export lists such that the access permission rights of one or more identified cluster members are modified.
14. The system as defined in claim 10 further comprising an iSCSI driver deployed in at least one of said cluster members configured to communicate with said LUN.
15. A computer readable medium for providing quorum capability in a clustered environment with a networked storage system, including program instructions for performing the steps of:
creating a logical unit exported by the storage system which serves as a quorum device;
generating a message from a cluster member in a clustered environment to place a reservation on said logical unit which serves as a quorum device; and
generating a response to indicate whether said cluster member was successful in obtaining quorum.
16. The computer readable medium for providing quorum capability in a clustered environment with networked storage, as defined in claim 15, including program instructions for performing the further step of issuing a host application program interface message when a change in cluster membership is detected, said application program interface message commanding said storage system to modify one or more export lists such that access permission rights of one or more identified cluster members are modified.
17. A computer readable medium for providing quorum capability in a clustered environment with a networked storage system, comprising program instructions for performing the steps of:
detecting that cluster membership has changed;
generating a message including a SCSI reservation to be placed on a logical unit serving as a quorum device in said storage system; and
upon obtaining quorum, generating a message that one or more other cluster members are to be fenced off from a given export.
18. The computer readable medium as defined in claim 17 further comprising instructions for generating an application program interface message including a command for modifying export lists of the storage system such that an identified cluster member no longer has read-write access to given exports of the storage system.
19. The computer readable medium as defined in claim 17 further comprising a cluster member obtaining quorum by successfully placing a SCSI reservation on a logical unit serving as a quorum device before such a reservation is placed thereupon by another cluster member.
20. The computer readable medium as defined in claim 17 further comprising instructions in a multiple node cluster having more than two cluster members to establish a quorum in a partitioned cluster by appointing a representative cluster member and having that cluster member place a SCSI reservation on a logical unit serving as a quorum device prior to a reservation being placed by another cluster member.
PCT/US2006/028148 2005-07-22 2006-07-21 Architecture and method for configuring a simplified cluster over a network with fencing and quorum WO2007013961A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP06800150A EP1907932A2 (en) 2005-07-22 2006-07-21 Architecture and method for configuring a simplified cluster over a network with fencing and quorum

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US11/187,729 US20070022314A1 (en) 2005-07-22 2005-07-22 Architecture and method for configuring a simplified cluster over a network with fencing and quorum
US11/187,729 2005-07-22

Publications (2)

Publication Number Publication Date
WO2007013961A2 true WO2007013961A2 (en) 2007-02-01
WO2007013961A3 WO2007013961A3 (en) 2008-05-29

Family

ID=37680410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2006/028148 WO2007013961A2 (en) 2005-07-22 2006-07-21 Architecture and method for configuring a simplified cluster over a network with fencing and quorum

Country Status (3)

Country Link
US (1) US20070022314A1 (en)
EP (1) EP1907932A2 (en)
WO (1) WO2007013961A2 (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7096213B2 (en) * 2002-04-08 2006-08-22 Oracle International Corporation Persistent key-value repository with a pluggable architecture to abstract physical storage
US7711539B1 (en) * 2002-08-12 2010-05-04 Netapp, Inc. System and method for emulating SCSI reservations using network file access protocols
US7631016B2 (en) * 2005-05-04 2009-12-08 Oracle International Corporation Providing the latest version of a data item from an N-replica set
US7437426B2 (en) * 2005-09-27 2008-10-14 Oracle International Corporation Detecting and correcting node misconfiguration of information about the location of shared storage resources
US8484365B1 (en) * 2005-10-20 2013-07-09 Netapp, Inc. System and method for providing a unified iSCSI target with a plurality of loosely coupled iSCSI front ends
US8788685B1 (en) * 2006-04-27 2014-07-22 Netapp, Inc. System and method for testing multi-protocol storage systems
US7904690B2 (en) * 2007-12-14 2011-03-08 Netapp, Inc. Policy based storage appliance virtualization
US7890504B2 (en) * 2007-12-19 2011-02-15 Netapp, Inc. Using the LUN type for storage allocation
US7543046B1 (en) 2008-05-30 2009-06-02 International Business Machines Corporation Method for managing cluster node-specific quorum roles
US7840730B2 (en) * 2008-06-27 2010-11-23 Microsoft Corporation Cluster shared volumes
US9588806B2 (en) 2008-12-12 2017-03-07 Sap Se Cluster-based business process management through eager displacement and on-demand recovery
WO2010084522A1 (en) * 2009-01-20 2010-07-29 Hitachi, Ltd. Storage system and method for controlling the same
US20100275219A1 (en) * 2009-04-23 2010-10-28 International Business Machines Corporation Scsi persistent reserve management
US8145938B2 (en) * 2009-06-01 2012-03-27 Novell, Inc. Fencing management in clusters
US8417899B2 (en) * 2010-01-21 2013-04-09 Oracle America, Inc. System and method for controlling access to shared storage device
US8381017B2 (en) 2010-05-20 2013-02-19 International Business Machines Corporation Automated node fencing integrated within a quorum service of a cluster infrastructure
WO2011146883A2 (en) * 2010-05-21 2011-11-24 Unisys Corporation Configuring the cluster
US20120102561A1 (en) * 2010-10-26 2012-04-26 International Business Machines Corporation Token-based reservations for scsi architectures
GB2496840A (en) 2011-11-15 2013-05-29 Ibm Controlling access to a shared storage system
US9229648B2 (en) * 2012-07-31 2016-01-05 Hewlett Packard Enterprise Development Lp Storage array reservation forwarding
US9146790B1 (en) * 2012-11-02 2015-09-29 Symantec Corporation Performing fencing operations in multi-node distributed storage systems
US9354992B2 (en) * 2014-04-25 2016-05-31 Netapp, Inc. Interconnect path failover
US10152601B2 (en) * 2014-06-05 2018-12-11 International Business Machines Corporation Reliably recovering stored data in a dispersed storage network
US9459809B1 (en) * 2014-06-30 2016-10-04 Emc Corporation Optimizing data location in data storage arrays
CN104363269B (en) * 2014-10-27 2018-03-06 华为技术有限公司 It is a kind of to pass through FC link transmissions, the method and device of reception NAS data
US10082985B2 (en) * 2015-03-27 2018-09-25 Pure Storage, Inc. Data striping across storage nodes that are assigned to multiple logical arrays
US9930140B2 (en) * 2015-09-15 2018-03-27 International Business Machines Corporation Tie-breaking for high availability clusters
US10176069B2 (en) * 2015-10-30 2019-01-08 Cisco Technology, Inc. Quorum based aggregator detection and repair
US11340967B2 (en) * 2020-09-10 2022-05-24 EMC IP Holding Company LLC High availability events in a layered architecture
US11397545B1 (en) 2021-01-20 2022-07-26 Pure Storage, Inc. Emulating persistent reservations in a cloud-based storage system

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5765034A (en) * 1995-10-20 1998-06-09 International Business Machines Corporation Fencing system for standard interfaces for storage devices
WO2001038992A2 (en) * 1999-11-29 2001-05-31 Microsoft Corporation Quorum resource arbiter within a storage network
EP1117042A2 (en) * 2000-01-10 2001-07-18 Sun Microsystems, Inc. Emulation of persistent group reservations
EP1124172A2 (en) * 2000-02-07 2001-08-16 Emc Corporation Controlling access to a storage device
US20020095470A1 (en) * 2001-01-12 2002-07-18 Cochran Robert A. Distributed and geographically dispersed quorum resource disks
US6487622B1 (en) * 1999-10-28 2002-11-26 Ncr Corporation Quorum arbitrator for a high availability system
US20020188590A1 (en) * 2001-06-06 2002-12-12 International Business Machines Corporation Program support for disk fencing in a shared disk parallel file system across storage area network
US20040139237A1 (en) * 2002-06-28 2004-07-15 Venkat Rangan Apparatus and method for data migration in a storage processing device

Family Cites Families (49)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU651321B2 (en) * 1989-09-08 1994-07-21 Network Appliance, Inc. Multiple facility operating system architecture
US5163131A (en) * 1989-09-08 1992-11-10 Auspex Systems, Inc. Parallel i/o network file server architecture
US5963962A (en) * 1995-05-31 1999-10-05 Network Appliance, Inc. Write anywhere file-system layout
EP0702815B1 (en) * 1993-06-03 2000-08-23 Network Appliance, Inc. Write anywhere file-system layout
DE69431186T2 (en) * 1993-06-03 2003-05-08 Network Appliance Inc Method and file system for assigning file blocks to storage space in a RAID disk system
US5761739A (en) * 1993-06-08 1998-06-02 International Business Machines Corporation Methods and systems for creating a storage dump within a coupling facility of a multisystem enviroment
WO1995029437A1 (en) * 1994-04-22 1995-11-02 Sony Corporation Device and method for transmitting data, and device and method for recording data
US5996075A (en) * 1995-11-02 1999-11-30 Sun Microsystems, Inc. Method and apparatus for reliable disk fencing in a multicomputer system
US7168088B1 (en) * 1995-11-02 2007-01-23 Sun Microsystems, Inc. Method and apparatus for reliable disk fencing in a multicomputer system
US5892955A (en) * 1996-09-20 1999-04-06 Emc Corporation Control of a multi-user disk storage system
US6128734A (en) * 1997-01-17 2000-10-03 Advanced Micro Devices, Inc. Installing operating systems changes on a computer system
US6108699A (en) * 1997-06-27 2000-08-22 Sun Microsystems, Inc. System and method for modifying membership in a clustered distributed computer system and updating system configuration
US5975738A (en) * 1997-09-30 1999-11-02 Lsi Logic Corporation Method for detecting failure in redundant controllers using a private LUN
US5999712A (en) * 1997-10-21 1999-12-07 Sun Microsystems, Inc. Determining cluster membership in a distributed computer system
US6748438B2 (en) * 1997-11-17 2004-06-08 International Business Machines Corporation Method and apparatus for accessing shared resources with asymmetric safety in a multiprocessing system
US5941972A (en) * 1997-12-31 1999-08-24 Crossroads Systems, Inc. Storage router and method for providing virtual local storage
US6748429B1 (en) * 2000-01-10 2004-06-08 Sun Microsystems, Inc. Method to dynamically change cluster or distributed system configuration
US6654902B1 (en) * 2000-04-11 2003-11-25 Hewlett-Packard Development Company, L.P. Persistent reservation IO barriers
US6708265B1 (en) * 2000-06-27 2004-03-16 Emc Corporation Method and apparatus for moving accesses to logical entities from one storage element to another storage element in a computer storage system
JP2002222061A (en) * 2001-01-25 2002-08-09 Hitachi Ltd Method for setting storage area, storage device, and program storage medium
US7016946B2 (en) * 2001-07-05 2006-03-21 Sun Microsystems, Inc. Method and system for establishing a quorum for a geographically distributed cluster of computers
US6757695B1 (en) * 2001-08-09 2004-06-29 Network Appliance, Inc. System and method for mounting and unmounting storage volumes in a network storage environment
US20030061491A1 (en) * 2001-09-21 2003-03-27 Sun Microsystems, Inc. System and method for the allocation of network storage
US6877109B2 (en) * 2001-11-19 2005-04-05 Lsi Logic Corporation Method for the acceleration and simplification of file system logging techniques using storage device snapshots
US7296068B1 (en) * 2001-12-21 2007-11-13 Network Appliance, Inc. System and method for transfering volume ownership in net-worked storage
US7650412B2 (en) * 2001-12-21 2010-01-19 Netapp, Inc. Systems and method of implementing disk ownership in networked storage
US6947957B1 (en) * 2002-06-20 2005-09-20 Unisys Corporation Proactive clustered database management
US20040006587A1 (en) * 2002-07-02 2004-01-08 Dell Products L.P. Information handling system and method for clustering with internal cross coupled storage
US7107385B2 (en) * 2002-08-09 2006-09-12 Network Appliance, Inc. Storage virtualization by layering virtual disk objects on a file system
US7873700B2 (en) * 2002-08-09 2011-01-18 Netapp, Inc. Multi-protocol storage appliance that provides integrated support for file and block access protocols
US20040153558A1 (en) * 2002-10-31 2004-08-05 Mesut Gunduc System and method for providing java based high availability clustering framework
US7451359B1 (en) * 2002-11-27 2008-11-11 Oracle International Corp. Heartbeat mechanism for cluster systems
US7523201B2 (en) * 2003-07-14 2009-04-21 Network Appliance, Inc. System and method for optimized lun masking
US7593996B2 (en) * 2003-07-18 2009-09-22 Netapp, Inc. System and method for establishing a peer connection using reliable RDMA primitives
US7716323B2 (en) * 2003-07-18 2010-05-11 Netapp, Inc. System and method for reliable peer communication in a clustered storage system
US7120821B1 (en) * 2003-07-24 2006-10-10 Unisys Corporation Method to revive and reconstitute majority node set clusters
US7333993B2 (en) * 2003-11-25 2008-02-19 Network Appliance, Inc. Adaptive file readahead technique for multiple read streams
WO2005086756A2 (en) * 2004-03-09 2005-09-22 Scaleout Software, Inc. Scalable, software based quorum architecture
JP4327630B2 (en) * 2004-03-22 2009-09-09 株式会社日立製作所 Storage area network system, security system, security management program, storage device using Internet protocol
JP2005284437A (en) * 2004-03-29 2005-10-13 Hitachi Ltd Storage system
JP2005310025A (en) * 2004-04-26 2005-11-04 Hitachi Ltd Storage device, computer system, and initiator license method
US20050283641A1 (en) * 2004-05-21 2005-12-22 International Business Machines Corporation Apparatus, system, and method for verified fencing of a rogue node within a cluster
US7260678B1 (en) * 2004-10-13 2007-08-21 Network Appliance, Inc. System and method for determining disk ownership model
US7472307B2 (en) * 2004-11-02 2008-12-30 Hewlett-Packard Development Company, L.P. Recovery operations in storage networks
US7721292B2 (en) * 2004-12-16 2010-05-18 International Business Machines Corporation System for adjusting resource allocation to a logical partition based on rate of page swaps and utilization by changing a boot configuration file
US20060212870A1 (en) * 2005-02-25 2006-09-21 International Business Machines Corporation Association of memory access through protection attributes that are associated to an access control level on a PCI adapter that supports virtualization
US20060242453A1 (en) * 2005-04-25 2006-10-26 Dell Products L.P. System and method for managing hung cluster nodes
US7516285B1 (en) * 2005-07-22 2009-04-07 Network Appliance, Inc. Server side API for fencing cluster hosts via export access rights
US7653682B2 (en) * 2005-07-22 2010-01-26 Netapp, Inc. Client failure fencing mechanism for fencing network file system data in a host-cluster environment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5765034A (en) * 1995-10-20 1998-06-09 International Business Machines Corporation Fencing system for standard interfaces for storage devices
US6487622B1 (en) * 1999-10-28 2002-11-26 Ncr Corporation Quorum arbitrator for a high availability system
WO2001038992A2 (en) * 1999-11-29 2001-05-31 Microsoft Corporation Quorum resource arbiter within a storage network
EP1117042A2 (en) * 2000-01-10 2001-07-18 Sun Microsystems, Inc. Emulation of persistent group reservations
EP1124172A2 (en) * 2000-02-07 2001-08-16 Emc Corporation Controlling access to a storage device
US20020095470A1 (en) * 2001-01-12 2002-07-18 Cochran Robert A. Distributed and geographically dispersed quorum resource disks
US20020188590A1 (en) * 2001-06-06 2002-12-12 International Business Machines Corporation Program support for disk fencing in a shared disk parallel file system across storage area network
US20040139237A1 (en) * 2002-06-28 2004-07-15 Venkat Rangan Apparatus and method for data migration in a storage processing device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"SunCluster3.0 12/01 Concepts Guide" INTERNET CITATION, [Online] December 2001 (2001-12), XP002275987 Retrieved from the Internet: URL:http://docs-pdf.sun.com/816-2027/816-2027.pdf> [retrieved on 2004-04-02] *
SUN: "Sun Clusters" INTERNET CITATION, [Online] October 1997 (1997-10), XP002171157 Retrieved from the Internet: URL:www.sun.com/software/cluster/2.2/wp-cl usters-arch.pdf> [retrieved on 2001-07-04] *

Also Published As

Publication number Publication date
EP1907932A2 (en) 2008-04-09
WO2007013961A3 (en) 2008-05-29
US20070022314A1 (en) 2007-01-25

Similar Documents

Publication Publication Date Title
WO2007013961A2 (en) Architecture and method for configuring a simplified cluster over a network with fencing and quorum
US7653682B2 (en) Client failure fencing mechanism for fencing network file system data in a host-cluster environment
US7516285B1 (en) Server side API for fencing cluster hosts via export access rights
US7162658B2 (en) System and method for providing automatic data restoration after a storage device failure
US7467191B1 (en) System and method for failover using virtual ports in clustered systems
US6606690B2 (en) System and method for accessing a storage area network as network attached storage
US8090908B1 (en) Single nodename cluster system for fibre channel
US7689803B2 (en) System and method for communication using emulated LUN blocks in storage virtualization environments
EP1747657B1 (en) System and method for configuring a storage network utilizing a multi-protocol storage appliance
US7272674B1 (en) System and method for storage device active path coordination among hosts
RU2302034C9 (en) Multi-protocol data storage device realizing integrated support of file access and block access protocols
US7437423B1 (en) System and method for monitoring cluster partner boot status over a cluster interconnect
US6732104B1 (en) Uniform routing of storage access requests through redundant array controllers
US7260737B1 (en) System and method for transport-level failover of FCP devices in a cluster
US6757753B1 (en) Uniform routing of storage access requests through redundant array controllers
US7886182B1 (en) Enhanced coordinated cluster recovery
US7779201B1 (en) System and method for determining disk ownership model
US20070088917A1 (en) System and method for creating and maintaining a logical serial attached SCSI communication channel among a plurality of storage systems
JP2005071333A (en) System and method for reliable peer communication in clustered storage
US7593996B2 (en) System and method for establishing a peer connection using reliable RDMA primitives
US7739543B1 (en) System and method for transport-level failover for loosely coupled iSCSI target devices
US20070061454A1 (en) System and method for optimized lun masking
US7487381B1 (en) Technique for verifying a configuration of a storage environment
US8015266B1 (en) System and method for providing persistent node names

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2006800150

Country of ref document: EP

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06800150

Country of ref document: EP

Kind code of ref document: A2