EP1532520A2 - Fast path for performing data operations - Google Patents

Fast path for performing data operations

Info

Publication number
EP1532520A2
EP1532520A2 EP02806874A EP02806874A EP1532520A2 EP 1532520 A2 EP1532520 A2 EP 1532520A2 EP 02806874 A EP02806874 A EP 02806874A EP 02806874 A EP02806874 A EP 02806874A EP 1532520 A2 EP1532520 A2 EP 1532520A2
Authority
EP
European Patent Office
Prior art keywords
data
computer program
program product
path
data operation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02806874A
Other languages
English (en)
French (fr)
Inventor
Richard Testardi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Incipient Inc
Original Assignee
Incipient Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/218,189 (US7013379B1)
Priority claimed from US10/218,186 (US6973549B1)
Priority claimed from US10/218,192 (US6986015B2)
Priority claimed from US10/218,098 (US7173929B1)
Priority claimed from US10/218,195 (US6959373B2)
Application filed by Incipient Inc
Publication of EP1532520A2

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/062Securing storage systems
    • G06F3/0622Securing storage systems in relation to access
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F11/2082Data synchronisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2089Redundant storage control functionality
    • G06F11/2092Techniques of failing over between control units
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/10Address translation
    • G06F12/1027Address translation using associative or pseudo-associative address translation means, e.g. translation look-aside buffer [TLB]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0635Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • G06F3/0665Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1471Saving, restoring, recovering or retrying involving logging of persistent data for recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F2003/0697Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers device management, e.g. handlers, drivers, I/O schedulers

Definitions

  • This application generally relates to computer data storage, and more particularly to performing data operations in connection with computer data storage.
  • Computer systems may include different resources used by one or more host processors.
  • Resources and host processors in a computer system may be interconnected by one or more communication connections. These resources may include, for example, data storage devices such as disk drives. These data storage systems may be coupled to one or more host processors and provide storage services to each host processor.
  • An example data storage system may include one or more data storage devices that are connected together and may be used to provide common data storage for one or more host processors in a computer system.
  • a host processor may perform a variety of data processing tasks and operations using the data storage system. For example, a host processor may perform basic system I/O operations in connection with data requests, such as data read and write operations and also administrative tasks, such as data backup and mirroring operations.
  • Host processor systems may store and retrieve data using a storage device containing a plurality of host interface units, disk drives, and disk interface units. The host systems access the storage device through a plurality of channels provided therewith. Host systems provide data and access control information through the channels to the storage device and storage device provides data to the host systems also through the channels.
  • the host systems do not address the disk drives of the storage device directly, but rather, access what appears to the host systems as a plurality of logical disk units or logical volumes. The logical disk units may or may not correspond to the actual disk drives. Allowing multiple host systems to access the single storage device unit allows the host systems to share data stored therein.
  • Data operations issued from a host may utilize switching fabric comprising a combination of hardware and/or software in routing a data operation and associated communications between a host and a target data storage device.
  • the switching fabric may include hardware, such as switching hardware, and software.
  • Software used in routing operations between a host and a data storage device may utilize a layered approach. Calls may be made between multiple software layers in the switching fabric in connection with routing a request to a particular device.
  • One drawback with the layering approach is the overhead in performing the calls that may result in increasing the amount of time to dispatch the data operation to the data storage device.
  • The invention is a method for processing a data operation. It is determined whether the data operation meets at least one predetermined criterion characterizing the data operation as being a commonly performed non-complex data operation using a primitive operation. The data operation is routed to a fast path for processing if it meets the at least one predetermined criterion, and to a general control path for processing otherwise.
  • Machine executable code determines whether the data operation meets at least one predetermined criterion characterizing the data operation as being a commonly performed non-complex data operation using a primitive operation. Machine executable code routes the data operation to a fast path for processing if it meets the at least one predetermined criterion, and to a general control path for processing otherwise.
  • A method executed in a computer system for performing a data operation. The data operation is received by a switching fabric. At least one processing step for performing the data operation is determined in accordance with a current state of at least one mapping table. At least one mapping primitive operation for processing the data operation is determined. The mapping primitive is used to perform virtual to physical address translation by the switching fabric using at least one mapping table. The mapping primitive operation is executed and a physical address associated with the data operation is obtained.
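The routing decision described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `DataOperation` type, the `FAST_PATH_PRIMITIVES` set, and the `mapped` flag are hypothetical names chosen here, since the text only states that a "predetermined criterion" identifies commonly performed non-complex operations.

```python
from dataclasses import dataclass

# Hypothetical set of primitive operations handled by the fast path.
# The source does not enumerate them; reads and writes are assumed here.
FAST_PATH_PRIMITIVES = {"read", "write"}


@dataclass
class DataOperation:
    kind: str            # e.g. "read", "write", "snapshot"
    mapped: bool = True  # assumed criterion: address mapping is already resolved


def is_fast_path_candidate(op: DataOperation) -> bool:
    """Return True if the operation meets the (assumed) predetermined criteria."""
    return op.kind in FAST_PATH_PRIMITIVES and op.mapped


def route(op: DataOperation) -> str:
    """Route to the fast path when the criteria hold, else to the general control path."""
    return "fast_path" if is_fast_path_candidate(op) else "control_path"
```

In this sketch any operation that fails the check, whether because it is complex or because its mapping is not yet resolved, falls through to the general control path.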
  • Machine executable code receives, by a switching fabric, the data operation.
  • Machine executable code determines at least one processing step for performing the data operation in accordance with a current state of at least one mapping table.
  • Machine executable code determines at least one mapping primitive operation for processing the data operation.
  • the mapping primitive is used to perform virtual to physical address translation by the switching fabric using at least one mapping table.
  • Machine executable code executes the mapping primitive operation and obtains a physical address associated with the data operation.
  • a volume descriptor associated with said virtual address is determined.
  • the volume descriptor includes a variable size extent table.
  • the variable size extent table includes a plurality of portions. Each of the portions is associated with a varying range of virtual addresses.
  • a first extent included in the variable size extent table corresponding to the virtual address is determined.
  • a corresponding physical address is determined for the virtual address using mapping table information associated with the first extent.
  • Machine executable code determines a volume descriptor associated with the virtual address.
  • the volume descriptor includes a variable size extent table.
  • the variable size extent table includes a plurality of portions. Each of the portions is associated with a varying range of virtual addresses.
  • Machine executable code determines a first extent included in the variable size extent table corresponding to the virtual address.
  • Machine executable code determines a corresponding physical address for the virtual address using mapping table information associated with the first extent.
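The lookup through a variable size extent table might look like the sketch below. It assumes extents are kept sorted by starting virtual address and searched with a binary search; the `Extent` and `VolumeDescriptor` classes and their field names are illustrative and do not come from the source.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Extent:
    start: int       # first virtual block address covered by this extent
    length: int      # number of blocks; differs from extent to extent
    device: str      # hypothetical physical device identifier
    phys_start: int  # physical block at which the extent begins


class VolumeDescriptor:
    """Holds a variable size extent table for one virtual volume (illustrative)."""

    def __init__(self, extents):
        self.extents = sorted(extents, key=lambda e: e.start)
        self._starts = [e.start for e in self.extents]

    def find_extent(self, vaddr: int) -> Extent:
        # Locate the last extent whose start is <= vaddr.
        i = bisect.bisect_right(self._starts, vaddr) - 1
        if i < 0:
            raise KeyError(vaddr)
        ext = self.extents[i]
        if vaddr >= ext.start + ext.length:
            raise KeyError(vaddr)  # address falls in a gap or past the end
        return ext

    def translate(self, vaddr: int):
        """Map a virtual address to a (device, physical address) pair."""
        ext = self.find_extent(vaddr)
        return ext.device, ext.phys_start + (vaddr - ext.start)
```

For example, with one 100-block extent on `pd1` starting at physical block 5000, virtual address 42 translates to `("pd1", 5042)`.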
  • mapping tables used in performing the address translation are determined.
  • the mapping tables include an extent table corresponding to a logical block address range and a storage redirect table that includes physical storage location information associated with the logical block address range.
  • the extent table is divided into a plurality of portions.
  • a fast path is used in performing the virtual address translation if an associated data operation meets predetermined criteria independent of at least one of a general control path and another fast path. Otherwise a general control path is used.
  • a portion of the extent table corresponding to a current data operation is loaded into a memory local to the fast path.
  • the portion of the extent table is included in a memory managed using a cache management technique.
  • Machine executable code determines mapping tables used in performing said address translation.
  • the mapping tables include an extent table corresponding to a logical block address range and a storage redirect table that includes physical storage location information associated with the logical block address range.
  • the extent table is divided into a plurality of portions. Machine executable code uses a fast path in performing the virtual address translation if an associated data operation meets predetermined criteria independent of at least one of a general control path and another fast path, and otherwise uses a general control path.
  • Machine executable code loads into a memory local to the fast path a portion of said extent table corresponding to a current data operation.
  • the portion of the extent table is included in a memory managed using a cache management technique.
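The caching of extent table portions in memory local to the fast path can be illustrated with a small sketch. The source only says that "a cache management technique" is used; least-recently-used eviction is an assumption made here, and `ExtentPortionCache` and `load_portion` are hypothetical names.

```python
from collections import OrderedDict


class ExtentPortionCache:
    """Fast-path-local cache of extent table portions (LRU eviction assumed)."""

    def __init__(self, capacity, load_portion):
        self.capacity = capacity
        # load_portion: callable that fetches a portion by index, for example
        # from the control path or from metadata storage (hypothetical).
        self.load_portion = load_portion
        self._cache = OrderedDict()
        self.misses = 0

    def get(self, portion_index):
        if portion_index in self._cache:
            # Hit: mark the portion as most recently used.
            self._cache.move_to_end(portion_index)
            return self._cache[portion_index]
        # Miss: load the portion needed by the current data operation.
        self.misses += 1
        portion = self.load_portion(portion_index)
        self._cache[portion_index] = portion
        if len(self._cache) > self.capacity:
            # Evict the least recently used portion.
            self._cache.popitem(last=False)
        return portion
```

Only the portions touched by recent operations stay resident, so the fast path needs memory for a working set rather than for the whole extent table.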
  • a message is sent from a requester to at least one other user of the shared data accessing the shared data for read access.
  • the requester receives approval messages from each of the at least one other user.
  • the requester obtains a lock on a first copy of the shared data included in a global storage location upon receiving the approval messages wherein the requester releases the lock when the lock is requested by another.
  • the requester in response to obtaining the lock, modifies the first copy of shared data.
  • Machine executable code sends a message from a requester to at least one other user of the shared data accessing the shared data for read access.
  • Machine executable code receives approval messages for the requester from each of the at least one other user.
  • Machine executable code obtains a lock for the requester on a first copy of the shared data included in a global storage location upon receiving the approval messages wherein the requester releases the lock when the lock is requested by another.
  • Machine executable code in response to obtaining the lock, causes the requester to modify the first copy of shared data.
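The message exchange for modifying shared data can be sketched in a single process as follows. This is a simplified model under stated assumptions: real requesters and users would exchange messages over a coherency channel, and the `Reader`, `GlobalStore`, and `modify_shared` names, as well as the idea that approving a request means dropping the local read copy, are illustrative and not taken from the source.

```python
class Reader:
    """Another user currently holding the shared data for read access."""

    def __init__(self, name, cached_copy):
        self.name = name
        self.cached_copy = cached_copy

    def approve_write(self, requester_name):
        # Assumption: approving means giving up the local read copy.
        self.cached_copy = None
        return True


class GlobalStore:
    """Global storage location holding the first copy of the shared data."""

    def __init__(self, data):
        self.data = data
        self.lock_holder = None

    def acquire(self, name):
        # A holder keeps the lock until another party requests it;
        # the new request causes the previous holder to release it.
        previous = self.lock_holder
        self.lock_holder = name
        return previous


def modify_shared(requester_name, readers, store, new_data):
    # 1. Send a message to every other user with read access and
    # 2. collect an approval from each of them.
    approvals = [r.approve_write(requester_name) for r in readers]
    if not all(approvals):
        return False
    # 3. Obtain the lock on the global copy, then 4. modify it.
    store.acquire(requester_name)
    store.data = new_data
    return True
```

Because the lock is released only when another requester asks for it, repeated updates by the same requester in this model do not need to reacquire the lock from anyone else.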
  • Figure 1 is an example of an embodiment of a computer system according to the present invention
  • Figure 2 is an example of an embodiment of a data storage system
  • Figure 3 is an example of a logical view of the devices as seen from the host computer systems of Figure 1;
  • Figure 4A is an example of how a host may communicate with a physical device
  • Figure 4B is an example of another embodiment of how a plurality of hosts may communicate with physical devices
  • Figure 4C is an example of yet another embodiment of how a plurality of hosts may
  • Figure 5 is a flowchart of steps of an embodiment for processing a data operation within the computer system of Figure 1;
  • Figure 6 is a flowchart of steps of an embodiment for processing results of a data operation;
  • Figure 7 is a flowchart of more detailed steps for processing a data operation
  • Figure 8 is an example of a model of application programming interfaces that may be used in connection with fast paths
  • Figure 9 is an example of an embodiment of tables used in connection with mapping a virtual address to a physical address in the computer system of Figure 1;
  • Figure 10 is an example of an embodiment of mapping virtual to physical storage using the volume segment descriptors
  • Figure 11 is an example of an embodiment of using the mapping tables in connection with a multipath technique
  • Figure 12 is an example of updated tables in connection with a multipath operation
  • Figure 13 is an example of information that may be cached within a fast path (FP);
  • Figure 14 is an example of information that may be included in mapping table entries
  • Figure 15 is an example of information that may be included in a host I/O request
  • Figure 16 is a flowchart of steps of one embodiment for processing a received I/O request as may be performed by the FP;
  • Figure 17 is a flowchart of steps of one embodiment for processing a received I/O request as may be performed by the CP;
  • Figure 18 is an example of an embodiment illustrating the pending I/O lists within the switching fabric as maintained by the CP and FP;
  • Figure 19 is an example of an embodiment of mapping tables at initialization within the FP
  • Figures 20-21 are examples of an embodiment of a snapshot operation within the computer system of Figure 1;
  • Figure 22 is an example of an embodiment of an incremental operation of a virtual volume within the computer system of Figure 1;
  • Figures 23 and 24 are examples of an embodiment of online migration
  • Figures 25A and 25B are examples of an embodiment of metadata
  • Figure 26 is an example of an embodiment of how a variable size extent maps to fixed portions of metadata
  • Figure 27 is an example of a state transition diagram that may be associated with a distributed virtualization engine (DVE);
  • Figure 28 is an example of an embodiment of two DVEs exchanging messages in connection with acquiring a lock
  • Figure 29 is an example of a flowchart of steps in connection with performing a snapshot operation
  • Figures 30 and 31 are examples of an embodiment in connection with performing operations with mirrored devices
  • Figure 32 is an example of an embodiment in connection with performing an asynchronous replication operation
  • Figure 33 is an example of an embodiment of a compound example of a snapshot during a migration; and
  • Figure 34 is an example of an embodiment of a data structure for the rmap table.
  • the computer system 10 includes a data storage system 12 connected to host systems 14a-14n and data management system 16 through communication medium 18.
  • the data management system 16 and the N hosts 14a-14n may access the data storage system 12, for example, in performing input/output (I/O) operations or data requests.
  • the communication medium 18 may be any one of a variety of networks or other type of communication connections as known to those skilled in the art.
  • the communication medium 18 may be a network connection, bus, and/or other type of data link, such as a hardwire or other connections known in the art.
  • the communication medium 18 may be the Internet, an intranet, network or other connection(s) by which the host systems 14a-14n and the data management system 16 may access and communicate with the data storage system 12, and may also communicate with others included in the computer system 10.
  • the components comprising the computer system 10 may comprise, for example, a storage area network (SAN) or other configuration.
  • Each of the host systems 14a-14n, the data management system 16, and the data storage system 12 included in the computer system 10 may be connected to the communication medium 18 by any one of a variety of connections as may be provided and supported in accordance with the type of communication medium 18.
  • the processors included in the host computer systems 14a-14n and the data management system 16 may be any one of a variety of commercially available single or multi-processor systems, such as an Intel-based processor, IBM mainframe, or other type of commercially available processor able to support incoming traffic in accordance with each particular embodiment and application.
  • each of the host systems 14a-14n and the data management system 16, as well as those components that may be included in the data storage system 12 are described herein in more detail, and may vary with each particular embodiment.
  • Each of the host computers 14a-14n, as well as the data management system 16, may all be located at the same physical site, or, alternatively, may also be located in different physical locations.
  • The connections between the host computer systems, the data management system, and the data storage system of the computer system 10 may use a variety of different communication protocols, such as SCSI (Small Computer System Interface), ESCON, Fibre Channel, GIGE (Gigabit Ethernet), and the like.
  • Some or all of the connections by which the hosts, data management system 16 and data storage system 12 may be connected to the communication medium 18 may pass through other communication devices, such as a Fibre Channel switch, or other switching equipment that may exist such as a phone line, a repeater, a multiplexer or even a satellite.
  • Each of the host computer systems as well as the data management system may perform different types of data operations in accordance with different types of administrative tasks.
  • any one of the host computers 14a-14n may issue a data request to the data storage system 12 to perform a data operation.
  • an application executing on one of the host computers 14a-14n may perform a backup, mirroring or other administrative operation and may do so while performing data requests to the data storage system 12.
  • the data management system 16 may be responsible for performing administrative operations in connection with the other components and switching fabric included in the computer system 10.
  • the data management system 16 may be responsible for performing administrative operations in connection with system configuration changes as well as performing periodic administrative operations, such as automated backups, performance tuning, reporting, and the like.
  • Functionality included in the data management system may also include abstracting components accessed within the computer system.
  • Referring to Figure 2, shown is an example of an embodiment of the data storage system 12 that may be included in the computer system 10 of Figure 1.
  • the switching fabric may be characterized as hardware and/or software that performs switching of voice, data, video and the like from one place to another.
  • the switching fabric 20 performs switching of data between components in the computer system 10, such as between a host and a physical device.
  • the components included in the switching fabric 20 may vary with each particular embodiment and device in accordance with the different protocols used in a particular embodiment. Additionally, the type of connections and components used may vary with certain system parameters and requirements, such as those related to bandwidth and throughput required in accordance with a rate of I/O requests as may be issued by the host computer systems, for example, to the data storage system 12.
  • Host systems provide data and access control information through channels to the data storage system, and the data storage system may also provide data to the host systems through the channels.
  • the host systems do not address the disk drives of the storage systems directly, but rather access to data may be provided to one or more host systems from what the host systems view as a plurality of logical devices or logical volumes (LVs).
  • the LVs may or may not correspond to the actual disk drives.
  • one or more LVs may reside on a single physical disk drive. Data in a single storage system may be accessed by multiple hosts allowing the hosts to share the data residing therein.
  • Referring to Figure 3, shown is an example of a logical view of devices in one embodiment as may be viewed from the hosts included in the computer system 10 of Figure 1.
  • Hosts 14a-14n are included in the illustration 30 as described previously in connection with the system 10 of Figure 1.
  • the illustration 30 includes a portion of the components of the computer system 10 previously described in connection with Figure 1.
  • the illustration 30 includes the hosts 14a-14n and storage related components included in the data storage system 12.
  • Also shown are logical components or devices LV 32a-32n, which are not actual physical components included in the computer system but rather represent a logical view of a portion of physical devices PD1 through PDn. The same LVs may be used and accessed by one or more of the host computer systems 14a-14n.
  • Each of the LVs maps to a portion of a physical device or a plurality of physical devices also included in the data storage system 12.
  • the data storage system 12 may also include switching fabric 20 which may include one or more switches and other associated hardware and software in connection with facilitating data transmissions between each of the host computer systems and the physical devices.
  • part of the functionality of the switching fabric is to map a particular logical address of an LV to its actual physical location on one or more of the physical devices 22a-22n.
  • binding a particular physical device or portions thereof to an LV may be performed in connection with data management system functionality.
  • VEs virtualization engines
  • DVE distributed virtualization engine
  • the DVE collectively exposes LVs to a set of hosts and may be used in accessing a set of physical devices.
  • the VEs may utilize a coherency channel, for example, using a storage area network (SAN) and/or a local area network (LAN), to present a single system image to the hosts as well as to the administrator (data management system).
  • the VEs may have a partially shared back end of physical volumes or devices. Multiple VEs may be physically located within the same hardware box or unit or be physically located in separate hardware units.
  • VEs may have redundant power supplies, cords, and cabling to the hosts.
  • Software associated with each VE in an embodiment may execute independently and perhaps redundantly providing for a single system image to each of the hosts in the computer system.
  • the DVE may be characterized as being responsible for functionality associated with data virtualization, such as in connection with virtualizing storage data accesses across the computer system 10.
  • the DVE may also be characterized as supporting a number of higher level functions and operations, such as, for example, replication, snapshots, on-line migration, and the like.
  • any one or more of the DVEs may be implemented in portions anywhere between the host application and the actual physical storage device or devices.
  • a portion of the functionality of a DVE may be included in the host side filter driver, in an appliance between the host and the storage system, in an existing switch, or within a data storage device itself.
  • a DVE may be implemented anywhere between the host application and associated physical storage devices as, for example, described elsewhere herein.
  • a preferred embodiment may include functionality described herein associated with a DVE within the SAN switching fabric itself, such as within a switch.
  • the switch implementation platform may take advantage of the DVE's distributed coherency and scalability, for example, between multiple switches within the SAN fabric as well as between multiple ports within a given switch.
  • the DVE preserves a single distributed coherent view of storage to the hosts. It should be noted that the DVE's overall bandwidth capabilities are scaled in accordance with the number of port switches through the use of per port fast path processing power.
  • Referring to Figure 4A, shown is an example 40 of how a host, such as 14a, may communicate with a physical device, such as 22a or 22b.
  • the components included in illustration 40 represent an example of how a particular host may issue a data operation in connection with a particular physical device.
  • An actual embodiment may include more computer systems, for example, as described previously in connection with the computer system 10 of Figure 1.
  • the number of components included in the illustration 40 has been reduced in order to facilitate the explanation of how the switching fabric may operate in connection with data transfers between a host and a physical device.
  • the host 14a may perform a data operation in connection with one or more physical devices, such as physical device 22a and 22b.
  • DVE 34a includes fast paths FP1-1 and FP1-2 as well as one or more control paths (CPs), such as CP1 through CP3.
  • a switch may connect the FP1-1 hardware and/or software implementation to physical device 12a as well as facilitate communications with the host 14a.
  • Although the arrows show communications as flowing from the host to the physical devices, the reverse communication path of forwarding data from the physical device through one of the FPs or CPs to the host also exists in the system.
  • the communication path from the host may be only through the FP.
  • the CP may communicate to the host through the FPs such that only an FP serves as an "exposed" communication endpoint for host communications.
  • an embodiment of a DVE may include one or more CPs.
  • a DVE may include a plurality of CPs in which exactly one may be active at a time with the other available, for example, for failover purposes.
  • the number of CPs in an embodiment of a DVE may be less than the number of FPs.
  • a DVE may include one or more CPs and one or more FPs.
  • the FP may optionally be implemented in hardware, software, or some combination thereof.
  • the CP and FP may be implemented each on different CPUs.
  • An embodiment may include a portion of hardware in an implementation of the FP, for example, in connection with functionality associated with the FP and its communication port(s).
  • a path designated using an FP may be used when performing I/O operations, such as read and write operations that may be applied to LVs.
  • a large portion of the data operations may be handled by the FP.
  • the FPs handle a bulk of the I/O bandwidth from the hosts with no CP intervention meaning that the overall bandwidth capabilities scale with the number of FPs in the DVE.
  • the particular I/O operations that may be handled by the FP are described in more detail in paragraphs that follow.
  • the FP is a streamlined implementation of hardware and/or software that may be used in connection with optimizing and performing a portion of I/O operations.
  • I/O operations from a host are initially directed to an FP. If an FP is able to dispatch the I/O operation further to a particular physical device using a mapping table which is populated by the CP in this example, the FP does such dispatching without further intervention by the CP. Otherwise, the I/O operation may be forwarded to the CP for processing operations.
  • completions of an I/O operation directed from a physical device to a host are directed to the FP in this embodiment. If the completion is successful, the FP may return any I/O operation data and associated status to the host. This may be done without any CP intervention. Otherwise, for example, in the event of an error in performing the I/O operation, completion may be forwarded to the CP for processing.
  • Metadata may be that information such as included in mapping tables. Metadata may be characterized as data about or describing data.
  • the CP may handle all error processing, all coherency and synchronization operations in connection with other CPs and all intervolume coherency, for example, as may be included in complex systems such as those using mirroring, striping, snapshots, on-line migrations, and the like. All errors may be returned to the host or forwarded through the CP.
  • An FP may also notify a CP about I/Os, for example, in connection with gathering statistics or error recovery purposes.
  • the DVE 34a includes an FP or a fast path connection between a host and each of the physical devices that may be accessed by the host. As also shown in Figure 4A, each of the FPs is connected to an associated CP and each of the CPs also has connections to the other CPs.
  • the assignment or association of hosts to FPs may vary in accordance with platform configuration.
  • which FPs are used by which hosts may be in accordance with where FPs are located within the switching fabric and how the hosts connect to the fabric.
  • the FP is included in the fabric switch and there is preferably one FP per switch port and hosts are physically connected to one or more switch ports.
  • the embodiment 42 includes a 16 port switch with 12 ports, 45a-45l, connected in pairs to 6 hosts, H1-H6, with the remaining 4 ports, 45m-45p, connected in pairs to two data storage devices Dev1 and Dev2.
  • the FPs may be logically, and possibly physically, located on each of the host ports, and each host communicates with two FPs.
  • FIG. 4C shows another embodiment of how a plurality of hosts may communicate with physical devices.
  • Figure 4C shows a configuration 46 which is a modification of the configuration 42 from Figure 4B but with one of the hosts (H4) removed and two switches (47a and 47b) in place of the host, each of the two switches having 16 ports.
  • Each of the two switches 47a and 47b plugs into one of the locations 45g and 45h included in the original switch 43.
  • 15 hosts (H10-H24) may be connected up to the fabric with each of the 15 hosts (H10-H24) being connected to a first port in the first switch 47a and a second port in the second switch 47b, such as host H10 is connected to 47c and 47d.
  • Each of the hosts H10-H24 now shares FP7 and FP8.
  • FPs may also be included in a "shared appliance" within the switching fabric, resulting in a configuration similar to that of Figure 4C in which hosts share access to the same FPs.
  • FIG. 5 shows a flowchart 50 of steps of one embodiment for processing a data operation within the computer system 10 of Figure 1. It should be noted that the processing described in connection with flowchart 50 generalizes the processing just described in connection with forwarding an I/O operation between an FP and/or a CP from a host to a particular physical storage device.
  • a data operation request is received at step 52 and is forwarded from a host to the data storage system.
  • At step 54, a determination is made as to whether this is an FP appropriate operation. It should be noted that the details of step 54 are described in more detail in following paragraphs.
  • If a determination at step 54 is made that this is an FP appropriate operation, control proceeds to step 60 where the data request is dispatched and issued to the appropriate physical device using the FP. Otherwise, control proceeds to step 56 where the I/O or data operation is forwarded to the CP for processing. Accordingly, at step 58, the CP issues the data request to the appropriate physical device. It should be noted that part of the processing included in the steps of flowchart 50 is a mapping from the logical address to the physical address as well as other processing operations.
  • a flowchart 70 of steps of a method performed in connection with processing the results of a data operation generally describes those steps that may be performed in an embodiment when forwarding results from a physical device back to a host through a DVE such as 34a.
  • the results of the data operation are determined and received at the DVE. In particular, it is determined at step 74 as to whether the data operation has been successful. If the data operation has been successful, control proceeds to step 78 where the results are forwarded back to the host using the FP connection. Otherwise, control proceeds to step 76 to forward results to the CP for error processing and/or recovery.
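  • The request flow of flowchart 50 and the completion flow of flowchart 70 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in the patent: the function names, the dictionary used as a mapping table, and the tuple return values are all hypothetical.

```python
# Hypothetical sketch of flowcharts 50 and 70; every name here is illustrative.

def process_request(op, virtual_lba, mapping_table):
    """Steps 52-60: dispatch on the FP when possible, otherwise fault to the CP."""
    # Step 54 (simplified assumption): the operation is FP appropriate when it
    # is a plain read or write and the CP has already populated a mapping.
    if op in ("read", "write") and virtual_lba in mapping_table:
        device, physical_lba = mapping_table[virtual_lba]  # logical -> physical
        return ("FP", device, physical_lba)                # step 60
    return ("CP", None, None)                              # steps 56 and 58


def process_completion(succeeded):
    """Steps 74-78: successful results return via the FP, failures go to the CP."""
    return "FP" if succeeded else "CP"


mapping = {100: ("22a", 5000)}  # populated by the CP in this sketch
print(process_request("read", 100, mapping))
print(process_request("read", 101, mapping))
print(process_completion(False))
```

  • The design point the sketch tries to capture is that the FP only consults state the CP has prepared in advance; anything it cannot resolve locally is handed to the CP rather than handled in place.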
  • flowchart 80 describes in more detail the steps of determining whether or not to use the FP or the CP in connection with processing and forwarding an I/O request between a host and a physical data storage device.
  • the I/O operation is received.
  • At step 84, a determination is made as to whether or not the I/O operation specifies a virtual device identifier (DID). If a determination is made that the current I/O operation involves a physical device, control proceeds to step 86 where a transparent I/O operation is routed directly to the physical device, for example, using the FP hardware to forward the physical address of an I/O request.
  • An I/O operation to a physical device may be handled transparently, that is, without requiring FP processing.
  • An I/O operation to a virtual device is handled by the FP and CP.
  • Both virtual and physical devices may exist on the same SAN and may be addressable by Fibre Channel device identifiers (FC DIDs).
  • Physical devices correspond to physical disks, as may be, for example, plugged into a SAN.
  • a DID indicates an address associated with, for example, a disk or host bus adapter that is plugged into the switching fabric.
  • An I/O operation may be specified using the DID or other SAN address in accordance with the particular SAN (storage area network) protocol such as an IP address for iSCSI.
  • the VE fabricates a virtual DID such that the virtual DID may be accessed, for example, using a name server as a physical DID may be accessed.
  • If the determination at step 84 is that there is no virtual DID, then the I/O operation is to a real physical device connected to the switching fabric and control proceeds to step 86 to route the I/O operation to the correct outbound port of the switch.
  • If a determination is made at step 84 that the I/O operation involves a virtual DID, control proceeds to step 88 where processing steps may be taken to remap the virtual DID to a physical device.
  • At step 88, a determination is made as to whether this I/O operation involves an access other than a read or a write. If this I/O operation involves access other than a read or write, control proceeds to step 90 where the CP is used in connection with processing the data operation. Otherwise, if this is a read or a write operation, control proceeds to step 92 where a look up of the TE or target exposure is performed. This is performed using the DID (or other SAN Address) of the virtual device addressed by the intercepted I/O.
  • At step 93, a determination is made as to whether the LUN is masked. If so, control proceeds to step 90 where the current I/O faults to the CP for further processing.
  • An embodiment may include, as part of the determination of whether the LUN is masked, values used in connection with determining the security of a device, such as whether a host has permission to perform I/O operations.
  • An embodiment may also include as part of step 93 processing a determination of whether a particular host has a LUN reserved, such as in connection with processing SCSI Reservations, and SCSI Unit Attention conditions, such as when each host is notified of particular occurrences like a LUN size change, and the like.
  • At step 94, a determination is made as to whether the particular I/O operation involves a LUN of a device which is currently connected to the host. If not, control proceeds to step 90 where the CP is used in connection with processing the I/O operation. Otherwise, control proceeds to step 96 where the LV is determined for the particular LUN.
  • Control proceeds to step 98 where the appropriate segment descriptor is determined for the particular I/O operation.
  • At step 100, it is determined whether the I/O operation spans multiple segments. If so, control proceeds to use the CP for processing at step 90. Otherwise, control proceeds to step 102 where a further determination is made as to whether the I/O logical block address or LBA extent is cached. If the I/O LBA extent is not cached, control proceeds to step 104 where an inquiry is made by the FP using the CP to obtain the LBA extent. The FP may proceed to obtain the LBA extent from the CP, for example, by performing a routine call and returning the LBA extent as a routine result or parameter.
  • Control proceeds to step 106 where the extent's redirect index is determined. Control proceeds to step 108 where a determination is made as to whether the I/O spans extents. If so, control proceeds to step 90 where the CP is used in processing. Otherwise, control proceeds to step 110 where the extent's redirect entry of additional processing information is obtained.
  • the extent redirect index used at step 106 may be used as an index into an array, for example, or other equivalent data structure of redirect entries to access additional information, as at step 110, as may be used to process a particular I/O operation.
  • the extent redirect index may be, for example, 4 bits used to access, for example, directly or indirectly, a hundred bytes of other information.
  • the array of extent redirect entries is used and described in more detail elsewhere herein.
  • Control proceeds to step 112 where a determination is made as to whether the fast path may be used in processing a read or write operation to this particular device.
  • One of the additional pieces of information that may be included in an embodiment of a redirect entry is a set of flags indicating which particular operations are allowed to be performed using a fast path to a particular device. In one embodiment, these flags may indicate which operations are disallowed, such as "fault on read" (FoR) and "fault on write" (FoW).
  • a given virtual volume segment may be divided into a set of variable length extents. Each of these extents may have an associated "redirect entry". These extents may correspond to a state of virtualization.
  • the redirect entry associated with an extent may indicate state information about a portion of a volume, for example, such as whether that portion of a volume has been migrated, snapshot, and the like, depending on the progress of an operation.
  • multiple extents may reference the same redirect entry in accordance with the particular state of the different portions. For example, blocks 0..12 inclusively may reference redirect entry 0. Blocks 13..17 inclusively may reference redirect entry 1, and blocks 18 and 19 may also reference redirect entry 0.
  • the redirect entries indicate which operations may be performed using the FP in accordance with the state of a particular portion of a virtual segment. Additionally, the redirect entry may indicate where the actual data is located (storage descriptor) for a particular portion of an LV, such as whether the data has already been pushed to a particular physical device.
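  • The block-range example above (blocks 0..12 and 18..19 referencing redirect entry 0, blocks 13..17 referencing redirect entry 1) can be sketched as a lookup over variable length extents. The representation below, two parallel lists searched with `bisect`, is an assumption chosen for brevity; the source does not specify how extents are stored.

```python
from bisect import bisect_right

# Each extent is recorded by its first block and the redirect index it references.
# Layout taken from the example in the text; the list encoding is hypothetical.
EXTENT_STARTS = [0, 13, 18]
REDIRECT_INDEX = [0, 1, 0]
LAST_BLOCK = 19


def redirect_for_block(block):
    """Return the redirect index of the extent containing the given block."""
    if not 0 <= block <= LAST_BLOCK:
        raise ValueError("block lies outside the segment")
    # bisect_right finds the first extent starting after the block, so the
    # extent just before that position is the one containing the block.
    return REDIRECT_INDEX[bisect_right(EXTENT_STARTS, block) - 1]


for b in (0, 12, 13, 17, 18, 19):
    print(b, redirect_for_block(b))
```

  • Because two non-adjacent extents share entry 0, a state change that affects both needs only one redirect entry to be updated.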
  • If a determination is made at step 112 that the operation is one of the particular read or write operations that may not use the fast path (for example, a write when the FoW flag is set), control proceeds to step 90 where the CP is used in processing the I/O request. Otherwise, control proceeds to step 114 where the storage descriptor is obtained. At step 116, a determination is made as to whether the FP capacity is exceeded. It should be noted that the particular FP capacity or capability may vary in accordance with each embodiment. For example, in one embodiment, an FP may have a limit on the size of an I/O operation it is capable of processing. An embodiment may have other limitations or restrictions.
  • an FP may not perform I/O operations that must be sent to two different devices, as may be the case when an I/O operation spans a RAID0 stripe and part of the I/O operation is associated with disk A and another part with disk B.
  • Each particular embodiment may determine what limits or tasks may be streamlined and performed by an FP allowing for customization of FP operations to those most prevalent within each particular implementation. The remaining operations may be handed over to the CP for processing.
  • If the FP capacity is exceeded, control proceeds to the CP for processing. Otherwise, control proceeds to step 118 where a determination is made as to whether the particular I/O operation is for a mirroring device or involves a write to a journal. If so, control proceeds to step 120 where a further determination is made as to whether there is a serialization conflict.
  • a serialization conflict may be determined in connection with mirrored devices. For example, one rule in an embodiment for writing to a mirrored device is that only one FP within a particular VE may write to a particular LBA (logical block address) range at a time to ensure integrity of mirrors.
  • a serialization conflict may occur when, within a single FP, one or more hosts have issued two write operations to overlapping LBA ranges. When this serialization conflict is detected, such as may be in connection with a failover, the conflicting I/O operation may be routed to the CP for later retry. If a serialization conflict is determined at step 120, control proceeds to step 90 where the CP is used for processing the I/O request.
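  • The overlapping-write check behind a serialization conflict can be sketched as an interval test over the writes an FP currently has in flight. The class and method names below are hypothetical, and LBA ranges are modeled as half-open intervals `[start, start + length)`, which is an assumption of this sketch.

```python
def ranges_overlap(a_start, a_len, b_start, b_len):
    """True when [a_start, a_start+a_len) and [b_start, b_start+b_len) intersect."""
    return a_start < b_start + b_len and b_start < a_start + a_len


class WriteSerializer:
    """Tracks in-flight writes on one FP (illustrative only)."""

    def __init__(self):
        self.active = []  # list of (start LBA, length) for outstanding writes

    def try_begin(self, start, length):
        """Admit the write on the FP, or return False so it is routed to the CP."""
        if any(ranges_overlap(start, length, s, n) for s, n in self.active):
            return False  # serialization conflict: CP handles it for later retry
        self.active.append((start, length))
        return True

    def finish(self, start, length):
        self.active.remove((start, length))


fp = WriteSerializer()
print(fp.try_begin(0, 10))   # admitted
print(fp.try_begin(5, 10))   # overlaps the first write
print(fp.try_begin(10, 5))   # adjacent, no overlap
```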
  • a variety of different tests may be included in an embodiment in determining whether to use the fast path or FP in routing a particular I/O request to a physical device.
  • the processing of the steps of flowchart 80 may be characterized as filtering out or detecting those operations which are not common or are more complex than those which the FP may handle in an expedient fashion. Those operations that involve other processing and are not able to be performed in a streamlined fashion are forwarded to the CP. For example, upon a determination at step 122 that the write journal is full, processing steps taken by the CP may, for example, involve emptying a portion of the journaling entries prior to performing the I/O operation.
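  • The filtering performed by flowchart 80 can be summarized as an ordered series of tests, any one of which faults the I/O to the CP. The sketch below assumes each test has already been evaluated and stored as a boolean on a hypothetical `IORequest` record; the field names are illustrative, and treating the journal-full test of step 122 as applying only to mirror or journal writes is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IORequest:
    virtual_did: bool = True            # step 84
    op: str = "read"                    # step 88
    lun_masked: bool = False            # step 93
    lun_connected: bool = True          # step 94
    spans_segments: bool = False        # step 100
    spans_extents: bool = False         # step 108
    fault_flag_set: bool = False        # step 112 (FoR for reads, FoW for writes)
    exceeds_fp_capacity: bool = False   # step 116
    mirror_or_journal: bool = False     # step 118
    serialization_conflict: bool = False  # step 120
    journal_full: bool = False          # step 122


def route(io):
    """Return 'transparent', 'CP', or 'FP' for one I/O request."""
    if not io.virtual_did:
        return "transparent"  # step 86: straight to the physical device
    faults = (
        io.op not in ("read", "write"),
        io.lun_masked,
        not io.lun_connected,
        io.spans_segments,
        io.spans_extents,
        io.fault_flag_set,
        io.exceeds_fp_capacity,
        io.mirror_or_journal and io.serialization_conflict,
        io.mirror_or_journal and io.journal_full,
    )
    return "CP" if any(faults) else "FP"  # any fault leads to step 90


print(route(IORequest()))
print(route(IORequest(op="write", spans_extents=True)))
print(route(IORequest(virtual_did=False)))
```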
  • FIG. 8 shows an example 200 of a model of application programming interfaces or APIs that may be included in an embodiment of the switching fabric when implementing the fast paths (FPs) as described herein.
  • the FP or fast path may be implemented in software using a set of platform dependent APIs.
  • These platform dependent APIs may be used by platform independent CP software through the use of the FP API 206.
  • included are various CPs 202a-202n that interface with the FP API 206.
  • the FP API 206 may be a platform independent interface with different platform dependent hardware configurations 204a-204n.
  • the FP API 206 may provide an interface linking the different hardware platforms, such as 204a-204n, to platform independent CP software, such as 202a-202n, that may in turn interface with one or more applications 210, such as a particular database software, running on a host computer system.
  • a CP such as 202a, may utilize the platform dependent APIs through the FP API 206 to communicate with any one or more of a variety of different hardware platforms 204a to 204n. Any one of the CPs 202b-202n may also utilize the same platform dependent API included in the FP API 206 to communicate with particular hardware platforms 204a-204n.
  • the CP software and/or hardware and FP API 206 may be included in the switching fabric within the DVE.
  • an embodiment may also include all or portions of this and other hardware and/or software anywhere between the host application software and the physical storage. For example, a portion or all of the foregoing may be included in a host-side filter driver.
  • the FP API 206 may be supplied by a platform vendor.
  • An embodiment may also include some additional code to "shim" the different APIs together, such as to get the FP API 206 to work with the CP software.
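  • The layering of FIG. 8, in which platform independent CP software talks to platform dependent fast path implementations only through the FP API, can be sketched with an abstract interface. The source does not list the operations of FP API 206, so the method names `load_mapping` and `lookup`, and the `InMemoryPlatform` stand-in, are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod


class FastPathAPI(ABC):
    """Platform independent interface, playing the role of FP API 206."""

    @abstractmethod
    def load_mapping(self, virtual_lba, device, physical_lba):
        """Install one mapping supplied by the CP."""

    @abstractmethod
    def lookup(self, virtual_lba):
        """Return (device, physical_lba), or None when the FP must fault."""


class InMemoryPlatform(FastPathAPI):
    """Stand-in for one platform dependent implementation (such as 204a)."""

    def __init__(self):
        self._table = {}

    def load_mapping(self, virtual_lba, device, physical_lba):
        self._table[virtual_lba] = (device, physical_lba)

    def lookup(self, virtual_lba):
        return self._table.get(virtual_lba)


class ControlPath:
    """Platform independent CP software; it sees only the FastPathAPI."""

    def __init__(self, fp):
        self.fp = fp

    def publish(self, virtual_lba, device, physical_lba):
        self.fp.load_mapping(virtual_lba, device, physical_lba)


platform = InMemoryPlatform()
ControlPath(platform).publish(7, "22a", 70)
print(platform.lookup(7), platform.lookup(8))
```

  • A shim in this picture would be a small adapter class that implements `FastPathAPI` on top of a vendor-supplied API with different method names.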
  • the techniques described herein of using the FP may be used in an embodiment that includes file system storage and block storage techniques.
  • virtual block storage is addressed using LVs
  • virtual file storage may be addressed using logical files.
  • the techniques described herein may be used in connection with file level protocols, such as NFS, CIFS and the like, as well as block level protocols, such as SCSI, FC, iSCSI, and the like, with appropriate modifications as may be made by one of ordinary skill in the art.
  • file level protocol may have one volume segment descriptor for each file and accordingly use the Rmap and storage descriptor table described elsewhere herein.
  • the example 240 includes an LBA Rmap table 242 and a storage redirect table 244.
  • the tables 242 and 244 may be used in mapping a virtual address range of a volume descriptor to a storage descriptor identifying a physical device location.
  • a virtual address reference associated with a particular volume segment descriptor as described in more detail elsewhere herein may include, for example, an identifier of a device, a starting offset within a particular segment, and the length representing an ending offset or span from the starting location.
  • a starting offset in terms of a logical block address or LBA value may be used to index into the LBA Rmap 242.
  • the length of the I/O operation may specify the span or length at which an ending offset within an LBA range may be determined.
  • a particular LBA range from 0 to LBA_MAX is represented by the LBA Rmap 242.
  • a starting offset may be a value from 0 to LBA_MAX.
  • the length of the data associated with the I/O operation may be used in determining an ending offset from the starting value.
  • a particular LBA range from zero to LBA_MAX may be partitioned into a plurality of extents.
  • An extent represents a particular subset of an LBA range. Examples of extents corresponding to particular LBA ranges are indicated as volume extent A and volume extent B on the LBA Rmap 242.
  • each volume segment descriptor describes a volume segment which is a contiguous range of LBAs included in a virtual volume.
  • the volume segment descriptor may include those tables in the example 240, in particular the LBA Rmap 242 and the storage redirect table 244.
  • the volume segment descriptor is the only location within the system for mapping virtual to physical addresses that includes the LBA range of specific information storage.
  • Each entry in the LBA Rmap 242 associates its volume extent or a particular LBA range, such as volume extent A, with a storage redirect table entry representing the state of that particular portion of physical storage corresponding to the LBA range for that particular extent.
  • a first portion or range of addresses is defined.
  • an index value of 1 is an index into the storage redirect table 244 containing an entry corresponding to the state of that particular portion of the LBA range associated with volume extent A.
  • the storage redirect table having an index of 1 246 includes state information that describes the state of that portion of the storage associated with volume extent A.
  • the portion of the LBA range identified by volume extent B also has a redirect index value of 1 meaning that volume extent A and volume extent B have a state represented by entry 246 of the storage redirect table 244.
  • two extents may have the same reference to the same redirect table entry or Rmap value.
  • a particular extent corresponding to an LBA range may be associated with a different entry in the redirect table to reflect its current state.
  • extents included in the LBA Rmap 242 may be variable in size. Each extent may correspond to any particular size between zero and LBA_MAX and identify a particular entry in the storage redirect table. Each entry in the storage redirect table 244 describes the state of the physical storage portion corresponding to the extent. Details of how the LBA Rmap and extents may be used are described in more detail elsewhere herein.
  • Each entry in the storage redirect table 244, such as entry 246, may include a storage descriptor as well as faulting mode flags, such as the FOW (fault on write) flag and the FOR (fault on read) flag used in connection with FP and CP processing.
  • Other information may also be kept in the storage redirect table entries that may vary in accordance with each embodiment.
  • the FOW and FOR flags may be used, for example, as in connection with processing steps of the flowchart 80 of Figure 7 when deciding whether to use the CP or the FP for processing an I/O operation.
  • the information used in performing processing steps of Figure 7 may be obtained from the storage redirect table 244.
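  • How the LBA Rmap 242 and the storage redirect table 244 might combine during fast path processing can be sketched as follows. The extent boundaries, the descriptor strings, and the `resolve` function are invented for the example; only the two-table structure, the shared index of 1 for two extents, and the FOW/FOR flags come from the text.

```python
from dataclasses import dataclass


@dataclass
class RedirectEntry:
    storage_descriptor: str
    fow: bool = False    # fault on write
    for_: bool = False   # fault on read ("for" is a reserved word in Python)


# LBA Rmap: (first LBA, last LBA, redirect index). Boundaries are invented;
# the first and last extents play the roles of volume extents A and B.
LBA_RMAP = [(0, 99, 1), (100, 499, 0), (500, 999, 1)]
REDIRECT_TABLE = {
    0: RedirectEntry("device-A", fow=True),
    1: RedirectEntry("device-B"),
}


def resolve(op, start, length):
    """Return ('FP', storage descriptor) or ('CP', reason) for one I/O."""
    end = start + length - 1
    for first, last, index in LBA_RMAP:
        if first <= start <= last:
            if end > last:
                return ("CP", "spans extents")
            entry = REDIRECT_TABLE[index]
            if (op == "write" and entry.fow) or (op == "read" and entry.for_):
                return ("CP", "fault flag set")
            return ("FP", entry.storage_descriptor)
    return ("CP", "no extent")


print(resolve("read", 10, 5))
print(resolve("write", 100, 10))
print(resolve("read", 95, 10))
```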
  • FIG. 7 describes how to actually access the storage corresponding to a particular LBA range of the volume.
  • a storage descriptor may be used to locate data associated with a particular LBA range in more complex storage systems which may include mirroring, striping, and the like.
  • Metadata may include, for example, the state information included in the storage redirect table 244 as well as the state information included in the LBA Rmap 242. It should be noted that entries such as those included in the storage redirect table 244 as well as the LBA Rmap 242 are not modified by the FP but rather in one particular embodiment may only be modified by the CP when the FP faults, for example, in performing an I/O operation.
  • the Rmap table 242 may include a fixed number of extents that may be specified, for example, as a bounded resource requirement where each extent may be of a variable size and each have a value or range associated with it. A new extent may be added and an associated value or range may also be added to the Rmap at any time. Additionally, the value of an extent or part of an extent may also be changed at any particular time.
  • all Rmap or resource map management information and operations involved in the management of the metadata may be performed by the CP.
  • the CP is solely responsible for reading and writing the age list and other metadata.
  • the FP may read the LBA Rmap 242, as accessed through the CP. It should be noted that in this embodiment, the CP reads and writes both the age list (described elsewhere herein) and the LBA Rmap. The FP does not directly access metadata information. Rather, in this embodiment, the FP can query LBA Rmap information and other metadata from the CP.
  • the CP may also communicate LBA Rmap information to the FP through an FP API.
  • the illustration 260 includes a virtual volume 262 that has address range or LBA range 0 through N.
  • the LBA range 0 through M is associated with a first volume segment descriptor 264.
  • the upper portion of the LBA range M+1 through N is associated with volume segment descriptor 2 266.
  • This mapping for any LBA within the range 0..M causes volume segment descriptor 1 264 and associated tables to determine that physical device P1 268 includes corresponding data portions. Similarly, using the tables from volume segment descriptor 2 266 for an incoming virtual address falling in the LBA range M+1 through N, a portion of the physical device P2 270 may be determined as the physical storage location of the data.
  • When an incoming I/O operation specifies a range of blocks falling between 0 and M, volume segment descriptor 1 264 may be used. Similarly, when a particular I/O operation includes an LBA range within the range M+1 through N, volume segment descriptor 2 266 may be used.
  • the foregoing also represents how a single virtual volume may correspond to portions of multiple physical devices. In other words, the tables associated with the volume segment descriptors may be used in mapping logical or virtual devices to physical devices. In this instance, a single virtual device is mapped to portions of a plurality of physical devices. Similarly, a single virtual volume may correspond only to a portion of a single physical device using the techniques described herein.
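  • The split of one virtual volume across two volume segment descriptors can be sketched with concrete numbers. The values chosen for M and N and the per-device base offsets are assumptions for the example; the source leaves them symbolic.

```python
# Hypothetical values: the text only says 0..M maps to P1 and M+1..N to P2.
M, N = 999, 1999

# (first LBA, last LBA, physical device, base offset on that device)
SEGMENTS = [
    (0, M, "P1", 0),        # volume segment descriptor 1
    (M + 1, N, "P2", 0),    # volume segment descriptor 2
]


def to_physical(virtual_lba):
    """Map a virtual LBA to (device, LBA on that device)."""
    for first, last, device, base in SEGMENTS:
        if first <= virtual_lba <= last:
            return device, base + (virtual_lba - first)
    raise ValueError("LBA outside the virtual volume")


print(to_physical(0))
print(to_physical(M))
print(to_physical(M + 1))
```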
  • multipathing may refer to alternate paths to the same physical device.
  • a first path to a first physical device may be used.
  • a second alternate path may be used to send data to the same physical device.
  • The storage redirect table and the LBA Rmap may be used in specifying an alternate path.
  • the CP may determine that there are two paths to the same physical device.
  • An incoming virtual address V1 is determined to be in the volume descriptor that includes LBA Rmap 282. In particular, it refers to the second entry in the LBA Rmap table 282.
  • the second entry of the LBA Rmap table includes a 1 as indicated by element 281.
  • When an I/O operation uses the path specified by storage redirect table entry 1, an I/O failure may occur and the CP may get involved to perform a path test to device 290 along the path specified by the storage redirect table entry 1.
  • the CP may determine that storage redirect table entries 1 and 2 specify two different paths to the same device 290.
  • the CP may determine that the particular path specified by storage redirect table entry 1 has indeed failed. The CP may then reconfigure the destination of the volume segment descriptor to use the second path specified by the storage redirect table entry 2. An I/O error may be returned to the host and the host may retry the I/O. On retry, the FP sends the I/O to the newly configured and presumably good path specified by the storage redirect table entry 2. The CP may indicate the use of this alternate path by modifying entry 281 of the LBA Rmap table 282 to indicate a 2 rather than a 1.
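  • The per-entry failover just described can be sketched as the CP rewriting one LBA Rmap slot. The table contents, the path names, and the `path_is_good` callback standing in for the CP's path test are all hypothetical.

```python
# Redirect entries 1 and 2 reach the same device 290 over different paths.
REDIRECT_TABLE = {
    0: {"device": "other", "path": "path-0"},
    1: {"device": "290", "path": "path-1"},
    2: {"device": "290", "path": "path-2"},
}
lba_rmap = [0, 1]  # the second entry (element 281) references redirect entry 1


def cp_fail_over(rmap, slot, path_is_good):
    """Repoint rmap[slot] at another entry for the same device over a good path.

    Returns True when the Rmap was changed.
    """
    current = REDIRECT_TABLE[rmap[slot]]
    if path_is_good(current["path"]):
        return False  # the path test passed, so nothing needs to change
    for index, entry in REDIRECT_TABLE.items():
        same_device = entry["device"] == current["device"]
        if index != rmap[slot] and same_device and path_is_good(entry["path"]):
            rmap[slot] = index
            return True
    return False  # no alternate path is available


changed = cp_fail_over(lba_rmap, 1, lambda path: path != "path-1")
print(changed, lba_rmap)
```

  • After the update, a host retry that faults nowhere else is dispatched by the FP along the path of redirect entry 2 without further CP involvement.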
  • an embodiment may preferably use another technique in connection with specifying multiple or alternate paths.
  • the foregoing technique may be characterized as one which specifies path changes on a local level, or per entry.
  • all entries referencing a particular path that has been modified need to be updated causing a failover to the CP to update each entry of the LBA Rmap referencing a particular path.
  • An embodiment may utilize an alternate technique in specifying such a global change by redefining a particular path associated with a physical volume using techniques external to the LBA Rmap, such as global or system configuration data.
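  • The global alternative can be sketched by keeping the active path outside the LBA Rmap, so that a single configuration change reroutes every entry that references the device. The `active_path` dictionary and the function names are assumptions; the source says only that global or system configuration data may be used.

```python
# Global configuration data: the active path for each physical device.
active_path = {"290": "path-1"}

# Redirect entries name only the device; none of them embeds a path.
redirect_table = {1: {"device": "290"}, 3: {"device": "290"}}


def path_for(redirect_index):
    """Resolve the path for a redirect entry through the global configuration."""
    return active_path[redirect_table[redirect_index]["device"]]


def global_fail_over(device, new_path):
    """One update reroutes every redirect entry that references the device."""
    active_path[device] = new_path


print(path_for(1), path_for(3))
global_fail_over("290", "path-2")
print(path_for(1), path_for(3))
```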
  • FIG. 12 shows an example of the updated LBA Rmap table as modified by the CP, for example, in connection with the multipathing example just described upon detection of a failure by the CP. It should be noted that alternatively the storage descriptor within an entry of the redirect table may also be modified to specify an alternate path to take to the particular device rather than modifying the LBA Rmap itself. Figure 12 shows an example of performing and specifying an alternate path at a global level.
  • the FP may cache a portion of the LBA Rmap which is included in the CP.
  • the LBA Rmap in the CP may be a cache of the LBA Rmap included on a form of media or other storage. This three level caching of the variable length extents allows the FP LBA Rmap to be very efficient in terms of resource utilization and speed.
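  • The three level arrangement (FP cache, CP cache, media) can be sketched as chained read-through caches. The class names and the miss counters are illustrative assumptions, and eviction is omitted for brevity.

```python
class Media:
    """Authoritative copy of the extents on persistent storage (stand-in)."""

    def __init__(self, extents):
        self.extents = extents

    def get(self, key):
        return self.extents[key]


class ExtentCache:
    """Read-through cache; used here for both the CP level and the FP level."""

    def __init__(self, backing):
        self.backing = backing
        self.entries = {}
        self.misses = 0

    def get(self, key):
        if key not in self.entries:
            self.misses += 1
            self.entries[key] = self.backing.get(key)  # fault to the next level
        return self.entries[key]


media = Media({0: "extent-0", 1: "extent-1"})
cp_cache = ExtentCache(media)
fp_cache = ExtentCache(cp_cache)

fp_cache.get(0)   # misses in the FP and the CP, then served from media
fp_cache.get(0)   # served from the FP cache alone
print(fp_cache.misses, cp_cache.misses)
```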
  • the FP 300 may include one or more of the mapping tables 310 as well as a pending I/O list 320.
  • the mapping tables 310 may include information such as the LBA Rmap and the storage redirect table described elsewhere herein.
  • the pending I/O list may include an entry, such as 322a for each of the pending or outstanding I/Os. In this particular embodiment, an entry is added to the pending I/O list when an I/O request is received from "upstream", for example, from a host.
  • the entry may also be removed from the list when a message is sent from the switching fabric to the request issuer, such as the host, that the I/O operation has completed. For the duration that the I/O operation is outstanding, the I/O is said to have a status of active.
  • While the I/O status is active, the FP keeps track of any supporting I/Os sent "downstream" or to the particular physical devices. These supporting I/Os may be maintained in a separate "downstream" pending I/O list. Supporting I/Os may include, for example, any type of handshaking messages and protocols in accordance with each particular embodiment. For example, in connection with performing a write operation, once the FP receives the data, the FP may issue a write command to a device, receive a "ready to transfer" command from the device itself, actually perform a write of the data, and then receive a return status prior to any other information being returned to the initiating host. The FP keeps track of all of these supporting I/Os sent, for example, to the devices.
  • An entry included in the pending I/O list 320, such as 322a, may include an exchange ID, state, and other information.
  • The exchange ID in this particular example may represent corresponding protocol dependent information allowing the FP to process subsequent command sequences using the exchange ID to properly identify any mappings. For example, if a particular lookup service has been used, the actual physical device determined from the logical device may be used in connection with the exchange ID such that a name resolution is not performed each time in connection with performing I/O operations.
  • Mapping information may be determined when the initial sequence of a command is intercepted based on, for example, a target LUN, LBA and the like. In connection with subsequent sequences, this mapping information may be obtained using the exchange ID, which is common across all command sequences, rather than performing perhaps multiple processing steps in connection with associated mapping information.
  • The state information included in the record 322a may describe the state of the I/O operation, for example, as pending, queued, completed, failed or other type of status as appropriate in accordance with each particular embodiment.
  • Each entry may also include other information as needed in connection with performing other supporting I/O operations and other processing steps in connection with performing I/O operations.
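As a minimal sketch of the pending I/O list just described, the following Python fragment keys each entry by its exchange ID and caches the mapping resolved on the first command sequence so that later sequences need no further name resolution. The class and field names (`PendingIOList`, `PendingIO`, `mapping`) are assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class IOState(Enum):
    PENDING = "pending"
    QUEUED = "queued"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class PendingIO:
    exchange_id: int
    state: IOState
    mapping: dict  # hypothetical: mapping resolved when the first sequence arrived


class PendingIOList:
    def __init__(self):
        self._entries = {}

    def add(self, exchange_id, mapping):
        """Record an I/O when it is received from upstream."""
        self._entries[exchange_id] = PendingIO(exchange_id, IOState.PENDING, mapping)

    def lookup_mapping(self, exchange_id):
        """Later command sequences reuse the cached mapping via the exchange ID."""
        return self._entries[exchange_id].mapping

    def complete(self, exchange_id):
        """Remove the entry once completion has been reported to the issuer."""
        entry = self._entries.pop(exchange_id)
        entry.state = IOState.COMPLETED
        return entry

    def __len__(self):
        return len(self._entries)
```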
  • mappings such as the information contained in the LBA Rmap as well as the storage redirect table, may be maintained coherently.
  • a subset of these mappings may be included in the FP for use by the FP and for communications between the CP and the FP.
  • Mappings are read by the CP and populated to the FP.
  • the FP does not modify the metadata, for example, in the tables in this particular embodiment. Rather, the CP may modify any information in the tables, for example, when the FP faults to the CP in connection with processing an I/O operation.
  • A virtual device may be described by more than one mapping entry. It is the CP's responsibility to ensure that all of the statuses of the various mapping entries are synchronized with one another. In other words, it is up to the CP to uniformly enforce the different state rules such that, for example, one half of a mirroring device is not indicated as up and running while another portion of the same device is indicated by another entry as being down. It is up to the CP to enforce coherent and synchronized statuses across the different entries of the different devices. For example, when the CP changes a device's state or finds that one particular device is inaccessible or down, the CP should also modify any other relevant mapping entries to indicate that this particular device is down. The CP is involved in state changes.
  • The FP may maintain a cache of the redirect table and a portion of the rmap table in use by the FP.
  • the cache is local to the FP, for example, in memory only accessible by the FP.
  • the portion of the rmap table that is cached within the FP is synchronized with the complete copy maintained by the CP. Additionally, copies of mapping tables maintained by each CP are also synchronized.
  • The DVEs may choose whether to participate in coherency operations in connection with the mapping entry. For example, a DVE not accessing a particular virtual device does not need to participate in ensuring that data included in particular tables such as mapping is coherent in connection with information stored in other tables.
  • Age lists may be used in connection with mirrors requiring fast re-sync ability. The use of age lists and mirroring operations are described elsewhere herein.
  • mapping table entries may include information from the previously described Rmap and storage redirect tables described elsewhere herein.
  • A particular mapping table entry may correspond to a volume descriptor or VSEG.
  • the VSIZE 356 indicates the size of the portion of the virtual device described by the mapping included in the table or descriptor 350.
  • the LBA RMAP OF EXTENTS 360 defines the range or resource map of the device extents of this particular volume segment descriptor.
  • The STORAGE REDIRECT TABLE DATA 370 includes information needed to physically identify the location of a particular storage area corresponding to a particular virtual device location and address. Additionally, other information included in the storage redirect table includes an indicator as to whether certain operations are valid and may be performed by the FP rather than the CP, as well as the age list.
  • the DVE supports the FP operation in connection with performing online migration, LUN pooling, snap shots, incremental storage, RAID 0, RAID 1 and RAID 10, as well as asynchronous replication and atomic group operations. It should be noted that RAID 0 requires I/O striping, RAID 1 requires write splitting. RAID 10 requires the use of the I/O striping and the write splitting. Performing asynchronous replication requires the use of the write splitting and the write journaling.
  • An I/O request 400 may include a VDEVICE 402, an LBA 404, a
  • the VDEVICE 402 may include a virtual device destination for the I/O operation.
  • the TYPE 406 may identify a type of I/O operation. Data as described by fields 406 may be included, for example, in a control data block or CDB indicating whether the
  • the LBA 404 may include the starting LBA if the I/O operation of the type 406 is a read or write operation. Otherwise, the LBA field 404 may be not applicable or otherwise null.
  • the SIZE field 408 may specify the size of the data involved in the I/O operation. The data may be stored in a data buffer that is a number of bytes specified by SIZE 408 of a read or a write operation. Otherwise, the SIZE field 408 may include information that is not used.
  • A particular I/O request may be said to have "hit" a corresponding mapping table entry if the particular mapping table entry may be used for processing the I/O request. In particular, the I/O type of the received I/O operation may be a read or write operation, and a device of the I/O request corresponds to that which is described by the mapping table entry. Additionally, the I/O request specifies a portion of data with a starting LBA whose entire size is within a single Rmap entry. In other words, the data associated with the I/O request may not span multiple Rmap entries in order for there to be a hit on a particular entry of the Rmap table. Generally, the processing steps just described herein in connection with having a "hit" on a mapping table entry or Rmap entry are those processing steps described previously in connection with Figure 7.
  • an embodiment of an FP may divide an I/O operation into multiple pieces in the event an I/O operation spans multiple extents such that each piece "hits" within a single LBA Rmap entry.
  • an embodiment of the FP may also not include such functionality and optionally choose to cause such I/O operations to fault to the CP for processing.
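The "hit" test and the optional splitting of an I/O that spans multiple extents may be sketched as follows. This is an illustrative sketch only: it assumes the LBA Rmap is represented as two parallel lists of extent start LBAs (sorted) and extent lengths, and the function names are hypothetical.

```python
import bisect


def find_extent(extent_starts, lba):
    """Index of the last extent whose start is <= lba, or -1 if none."""
    return bisect.bisect_right(extent_starts, lba) - 1


def is_hit(extent_starts, extent_lengths, lba, size):
    """True when the whole request lies within a single Rmap extent."""
    i = find_extent(extent_starts, lba)
    if i < 0 or size <= 0:
        return False
    return lba + size <= extent_starts[i] + extent_lengths[i]


def split_io(extent_starts, extent_lengths, lba, size):
    """Divide a request into pieces that each hit exactly one extent."""
    pieces = []
    while size > 0:
        i = find_extent(extent_starts, lba)
        end = extent_starts[i] + extent_lengths[i] if i >= 0 else 0
        if i < 0 or lba >= end:
            raise LookupError("LBA %d is not mapped" % lba)
        n = min(size, end - lba)
        pieces.append((i, lba, n))  # (extent index, starting LBA, block count)
        lba += n
        size -= n
    return pieces
```

An FP that does not implement `split_io` would simply fault any request for which `is_hit` is false to the CP.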
  • downstream I/Os may be issued by the FP without CP intervention.
  • the stripe destination for the I/O request is determined. If the I/O spans multiple stripes, the CP may handle the I/O operation. In other words, this operation in one embodiment can be performed by the CP rather than the FP.
  • For the write splitting option, for each mirror side that is writeable, a physical I/O operation is dispatched to the physical device offset by the LBA.
  • The FP may complete the corresponding virtual I/Os. However, if any of the physical I/Os completes unsuccessfully, there is a miss to the CP, for example, in connection with error processing. It should be noted that the FP may be responsible for a small degree of write serialization for the write splitting option. The write journaling option similarly follows from this functionality.
  • Each volume segment descriptor describes a virtual volume segment which is a contiguous range of LBAs of a particular virtual volume starting at a particular address for a length.
  • a volume may be described by multiple volume segment descriptors in which each of the volume segment descriptors describes non-overlapping LBA ranges of a particular virtual volume.
  • the virtual volume segment descriptor or VSEG as described elsewhere herein includes an LBA Rmap.
  • the volume segment descriptor in this embodiment is the only place where the LBA range specific information is stored.
  • Each entry of the LBA Rmap associates its volume extent or a particular LBA range with a storage redirect table entry.
  • The storage redirect table entry specifies various faulting modes, such as whether to fault on read or write in connection with the FP processing operations for a particular volume extent, as well as the corresponding storage descriptor indicating where data is actually stored for a particular volume. Note that each of the storage descriptors describes storage of the same virtual length as the virtual volume segment being mapped.
  • Storage descriptors may indicate striping, mirroring and the like as part of the storage descriptor's internal implementation, not visible outside of the storage descriptor field included in the storage redirect table entry.
  • a storage descriptor may also include a set of physical storage elements that are used to store data for a virtual volume segment.
  • a storage descriptor can typically describe both a RAID 0 and a RAID 1 mapping over a set of physical storage elements.
  • Each physical storage element may be, for example, a physical device.
  • Each storage element may be referenced by one storage descriptor.
  • FPs may play a role in connection with I/O write serialization in connection with mirrors as described elsewhere herein.
  • The CP is responsible for ensuring that only one FP has write permission at a time to any particular mirrored volume. However, additional serialization within the FP may be required. If the FP cannot provide the serialization in connection with mirroring, for example, then the FP rejects mirrored I/O operations associated with mapping table entries that require waiting. Consequently, these mirrored I/O operations are faulted back to the CP for processing so the CP can serialize them. It should be noted that in this case the mirror runs at CP speeds and the CP may become a bottleneck.
  • This serialization is the synchronization operation of one embodiment that may be included in the FP processing.
  • All other synchronization and coherency may be put into effect by the CP by revoking I/O authority associated with fast path mapping table entries and causing the FP to fault I/Os to the CP for queuing or other dispatching.
  • This is consistent with the goal of keeping the FP simple and lightweight in its handling of I/O operations as described herein. Heavier processing, such as that involved in synchronization operations, is faulted to the CP for processing. In the case of the accelerated mirror problem, though, the FP plays a role to ensure correct operation. The problem that the FP is trying to avoid may result in a form of silent data corruption with inconsistent mirrors. This may happen, for example, in an instance where two outstanding I/Os to overlapping block ranges pass through the same FP.
  • The FP needs a way to determine at FP dispatch if a particular I/O operation, such as a write, overlaps any currently outstanding write operations. If an I/O operation does overlap any currently outstanding writes, this I/O operation must be queued until sometime later. This may be done by faulting this I/O operation to the CP for processing.
  • The conflicting I/O operation may be over-queued to the CP longer than absolutely necessary. It should be noted that in an embodiment this over-queueing may be performed with negligible effects on overall performance due to the fact that this may occur infrequently.
  • When the FP receives an I/O operation, it adds the operation to the virtual or upstream pending I/O list. If the I/O misses in the fast path mapping table, then it is faulted to the CP for processing. Similarly, if there is an outstanding write I/O to an overlapping LBA range in the virtual upstream pending I/O list, the incoming I/O operation is faulted to the CP. If there is no fault to the CP for processing, an atomic update of the physical or downstream pending I/O list is performed and then the I/O is redispatched to the downstream I/O processing to the physical device.
  • The atomicity requirement may be met in other ways as long as the CP can tell that the FP has dispatched, or is in the process of dispatching, physical I/Os for that corresponding virtual I/O. This is typically accomplished with a "timestamp" on the upstream pending I/O which indicates that it is "in progress", and its effects on the downstream pending I/O list might not be fully known yet. Again, the CP waits for these to drain if it wants to perform serialization itself, which it must do if a conflicting I/O is ever faulted to the CP.
  • When the CP is serializing I/Os, such as in connection with mirrors, the CP ensures both that the corresponding pending physical or downstream I/Os overlapping the LBA range have drained and completed and are no longer pending and, additionally, that the FP is prevented from itself initiating new pending physical I/Os overlapping the same LBA range.
  • both of these processing steps may be performed with the primitives defined elsewhere herein, such as, for example, querying the pending I/O table and revoking the fast path mapping table entry.
  • serialization occurs either in the FP or in the CP, but not both.
  • the CP ensures this by revoking any mapping table entries that give the FP authority to itself redispatch downstream I/Os while the CP is performing the serialization.
  • The FP performs serialization on the virtual or upstream side at initial dispatch time. If that serialization fails, or if the CP has to perform any type of manual I/O dispatching, the FP will be put on hold and the CP will take over the role of serialization. Note that, as stated earlier, an FP need not implement serialization if it does not need these operations to be fast and scalable. In other words, if the task of writing to a mirror is not allocated to the FP, then the FP need not be concerned in an embodiment with serialization.
  • a fault in the FP may occur because no mapping table entry exists.
  • a fault may occur within the FP and default to the CP for processing because a particular mapping table permission was violated such as performing a write in a read only extent.
  • A fault may occur because of serialization rule violations as just described herein. The FP allows these to be dealt with in a variety of different ways.
  • Figures 16 and 17 summarize processing steps as may be performed by the FP and the CP, respectively, in connection with performing I/O write serialization in an embodiment that includes mirroring.
  • the FP receives an I/O request.
  • This I/O request may be deemed a virtual or upstream I/O request dispatch, for example, from a host received by the FP within the switching fabric.
  • the FP determines if there is an FP map table miss or whether the received I/O request overlaps an LBA range in the pending virtual I/O list.
  • An FP map table miss may occur because there is no mapping table entry within the FP for the virtual address of the corresponding I/O request, or because the mapping table permissions have been violated, such as when the mapping table indicates that the FP may not be used in connection with a write operation and the I/O request is for a write operation. If at step 424 one of the conditions results in a positive or yes determination, control proceeds to step 428 where the operation is faulted to the CP for processing, as it is determined that the current I/O request may not be processed by the FP. Otherwise, at step 424, control proceeds to step 426 where the FP atomically updates the physical pending I/O list and also dispatches the corresponding I/O downstream.
  • atomically performed at step 426 is an access to the shared resource which is the physical or downstream pending I/O list. Additionally, the I/O operation is redispatched downstream or physically to the devices. These two operations are performed atomically at step 426.
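The FP dispatch decision summarized above (steps 424, 426 and 428) may be sketched as follows. This is a simplified illustration under stated assumptions: the mapping table is reduced to a hypothetical list of `(start, length)` ranges the FP is permitted to service, atomicity is modeled with a `threading.Lock`, and the class and return values (`FastPath`, `"DISPATCHED"`, `"FAULT_TO_CP"`) are invented for the example.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class IO:
    io_id: int
    op: str   # "read" or "write"
    lba: int
    size: int


def overlaps(a, b):
    """True when the LBA ranges of two I/Os intersect."""
    return a.lba < b.lba + b.size and b.lba < a.lba + a.size


class FastPath:
    def __init__(self, ranges):
        self.ranges = ranges            # hypothetical: (start, length) the FP may service
        self.upstream_pending = []      # virtual pending I/O list
        self.downstream_pending = []    # physical pending I/O list
        self._lock = threading.Lock()

    def _map_hit(self, io):
        return any(s <= io.lba and io.lba + io.size <= s + n for s, n in self.ranges)

    def dispatch(self, io):
        # Check for an outstanding upstream write that overlaps the incoming I/O.
        conflict = any(p.op == "write" and overlaps(p, io)
                       for p in self.upstream_pending)
        self.upstream_pending.append(io)
        # Step 424: map table miss or overlap -> step 428: fault to the CP.
        if not self._map_hit(io) or conflict:
            return "FAULT_TO_CP"
        # Step 426: update the downstream list atomically, then dispatch.
        with self._lock:
            self.downstream_pending.append(io)
        return "DISPATCHED"
```

In this sketch a faulted I/O stays on the upstream list, so later overlapping requests also fault until the CP resolves the conflict.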
  • The CP may serialize I/Os, for example, either because the FP is incapable of doing the serialization or because the FP faulted an I/O to the CP, such as when a serialization violation is detected.
  • the CP receives an I/O request such as a write I/O request.
  • a determination is made as to whether there are I/O requests in the physical or downstream pending I/O list overlapping the LBA range of the received I/O request.
  • FPs may be characterized as operating with "authority" independent of other FPs.
  • An FP may be authorized by a CP to perform certain operations with certain data, such as metadata, which the FP obtains from the CP.
  • the FP stores such data in its local cache.
  • the FP continues processing once it has been so authorized by a CP independent of other FPs.
  • The FP also continues to use information in its local cache until, for example, the CP invalidates information included in the FP's local cache.
  • the CP may "revoke" the FP's authority, for example, by invalidating information in the FP's local cache, modifying an entry in the LBA Rmap causing a fault to the CP, and the like.
  • Control proceeds to step 468 where the CP proceeds to issue pending physical I/O requests by adding the appropriate items to the physical or downstream pending I/O list and further dispatching the I/O request downstream.
  • various operations may be performed in connection with performing the processing steps described in flowcharts 460 and 420 such as, for example, clearing the pending I/O table using APIs provided herein and revoking an FP mapping table entry, for example, and causing an operation to fault to the CP by an invalid or a miss on an FP map table.
  • Referring to FIG. 18, shown is an example of an embodiment of I/O operations and the switching fabric.
  • The example 500 illustrates the use of "upstream" and "downstream" I/O operations and pending I/O lists with respect to the previous descriptions herein.
  • An I/O operation incoming to the switching fabric, such as from a host, may be referred to as an "upstream" I/O operation handled by the FP or CP.
  • A "downstream" I/O operation is an I/O operation that is initiated by the FP or CP to the data storage system in connection with processing an upstream I/O request.
  • a received Write I/O request may result in a plurality of downstream I/O requests in accordance with particular protocols and message exchanges in each particular embodiment.
  • The FP in one embodiment described herein may include functionality in mapping logical or virtual devices to physical devices. This may be accomplished using the FP mapping table entries, including the LBA Rmap and Storage Redirect tables described herein. Also included in the FP is a list of pending I/Os which may be used in connection with error recovery operations. Operations that cannot be performed by the FP may be faulted to the CP for processing.
  • the FP may use the following API when interacting with the CP in performing various processing steps as described elsewhere herein. Other embodiments may use other APIs for CP/FP communications than as described herein.
  • The CpMappingMiss() routine may be called from the FP to indicate to the CP that a particular I/O could not be mapped by the FP.
  • the CP may return CONTINUE, IGNORE, or QUEUE.
  • CONTINUE includes a new virtual device mapping from the CP which may have been added, for example, to the FP mapping table.
  • IGNORE indicates that no mapping is valid for this particular I/O operation and the FP should take appropriate action.
  • QUEUE indicates that the I/O operation should be queued to the CP for manual processing via CpQueueIO() described elsewhere herein.
  • CpQueueIO() is called by the FP to the CP to queue an I/O request, for example, as may be received from a host, for manual processing by the CP.
  • the CP may manually dispatch supporting "downstream" I/Os (between the switching fabric and the storage for example).
  • The CP will subsequently set the I/O completion status, for example, as may be returned to the issuing host, and call FPQueueIOComplete() to complete the "upstream" I/O back to its initiator.
  • CpDispatchIOComplete() indicates to the CP by the FP that a "downstream" I/O initiated with FPDispatchIO(), described elsewhere herein, has completed.
  • the FP has already set the downstream I/O completion status for return to the CP.
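How an FP might act on the three results returned for a mapping miss can be sketched as follows. The sketch is not the described API itself: the CP routines are modeled as plain Python callables passed in by the caller, and the enum, function name and return strings are hypothetical.

```python
from enum import Enum


class MissResult(Enum):
    CONTINUE = 1  # a new mapping is supplied by the CP
    IGNORE = 2    # no mapping is valid for this I/O
    QUEUE = 3     # hand the I/O to the CP for manual processing


def handle_mapping_miss(io, cp_mapping_miss, fp_map_table, cp_queue_io):
    """Call the (stand-in) mapping-miss routine and act on its result."""
    result, new_mapping = cp_mapping_miss(io)
    if result is MissResult.CONTINUE:
        # Install the mapping so the I/O can be retried on the fast path.
        fp_map_table[io["vdevice"]] = new_mapping
        return "retry"
    if result is MissResult.QUEUE:
        cp_queue_io(io)
        return "queued"
    # IGNORE: the FP takes appropriate action; here the I/O is rejected.
    return "rejected"
```

Passing the CP routines as parameters keeps the sketch self-contained and makes each outcome easy to exercise in isolation.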
  • The following are APIs that may be called from the CP to the FP in connection with performing various operations described herein.
  • FPDiscover() to return a list of physical devices which the CP may access for storage operations.
  • FPExpose() to "expose" a virtual device making the device available for storage operations.
  • An embodiment may use a locking mechanism to ensure that a mapping entry is not removed while still in use.
  • FPQueryPendingIOs() returns a list of pending I/Os from the FP.
  • FPQueryStatistics() to return statistics from the FP.
  • the FP may keep and track statistical information in connection with performing I/O operations. This API may be used to obtain particular information.
  • FPDispatchIO() may be used to queue a downstream I/O from the CP for dispatch by the FP.
  • This API may be used by the CP in manually dispatching supporting I/Os, to maintain metadata state, and to establish backend enforcement, such as administrative or other commands to a storage device.
  • the FP sets the downstream I/O completion status and a call to
  • FPPutData() to set data for an I/O operation queued to the FP.
  • FPQueueIOComplete() indicates to the FP that an upstream I/O queued to the CP with CpQueueIO() has its completion status set and the FP may complete the upstream I/O back to the initiator.
  • Primitives may be used in mapping an "upstream" I/O operation to one or more "downstream" I/O operations.
  • An embodiment may include one or more primitives forming a hierarchy in which a higher level primitive may be implemented by using one or more lower level primitives.
  • the CP and the FP may both perform all of, or a portion of, the primitives.
  • Other embodiments may include other primitives than those that are described in following paragraphs.
  • the goal of primitives is to define one or more basic low-level operations to avoid multiple calls, for example, by the FP or CP in performing an upstream I/O operation.
  • These primitives should also be as flexible as possible so that the CP and/or the FP may build other complex higher level operations using these primitives.
  • An embodiment may have the FP, for example, perform the simpler operations that may be performed with a primitive and the CP may perform more complex operations requiring use of multiple primitives.
  • An embodiment may include an LBA/LUN remapping primitive, which is the primitive used by the FP and the CP to dispatch a received I/O to an LBA on a physical device. Additionally, this primitive also includes receiving a return data request and I/O completion status and, if successful, returning success by the FP to the request initiator. Otherwise, control is passed to the CP for unmapped or unsuccessful I/Os.
  • the LBA/LUN remapping primitive may be used in performing the virtual to physical address mapping using the Rmap and storage redirect tables described elsewhere herein. Whether an embodiment includes additional primitives depends on the functionality included in an embodiment.
  • The FP may accept an I/O from a host and perform a lookup using the mapping tables in the FP based on: whether it is a read or write operation, the starting and ending LBAs, and the destination or target virtual device ID. If there is no corresponding table entry, the I/O is forwarded to the CP for processing. If the I/O is a write and write operations may be performed for the particular LBA range, or the I/O is a read and read operations may be performed for the particular LBA range, then the downstream I/O is issued to the destination device, possibly with a new destination LBA.
  • the foregoing steps are a portion of the processing steps previously described in connection with Figure 7.
  • Information about the I/O is recorded in the pending I/O lists described elsewhere herein.
  • Information may include, for example, an exchange ID that may be used by the CP if needed, for example, in connection with error processing for status return codes for the I/O operation.
  • mapping tables may indicate that the read may occur from any one of N target devices.
  • The FP may implement a read load balancing policy. If there is no response from a downstream device, the DVE may not know until the host (request initiator) sends an abort or a retry request. At that point, this request may be sent to the CP for error processing and redispatch of a downstream I/O request.
  • a message is received from the downstream I/O device(s), the downstream and upstream pending I/O lists are cleaned up by removing entries as appropriate, and any success or other status is returned to the requestor/initiator, such as a host.
  • a second primitive, the I/O striping primitive may be included in an embodiment that implements I/O striping.
  • An embodiment may also perform multiple LBA/LUN remapping operations rather than use this I/O striping primitive since the I/O striping primitive is built on the LBA/LUN remapping primitive.
  • Information about the physical location of each of the disk stripes for example, may be stored in the storage descriptor accessed by the redirect table with one access of the redirect table.
  • The I/O striping primitive may be included in embodiments using RAID 0 striping, for example.
  • This second primitive of I/O striping is an extension of the first primitive, LBA/LUN remapping in which a set of downstream devices may be specified and LBA computations performed by shifting and masking, for example, based on the size of the stripes.
  • An embodiment may allocate processing of I/O operations spanning multiple stripe boundaries to the CP.
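The shift-and-mask LBA computation for striping may be illustrated as follows. This sketch assumes a stripe size that is a power of two (expressed as `stripe_shift`, the base-2 logarithm of the stripe size in blocks) and a simple round-robin layout across devices; the layout and function names are assumptions made for the example, not details of the described embodiment.

```python
def stripe_target(lba, stripe_shift, n_devices):
    """Map a virtual LBA to (device index, device LBA) by shifting and masking."""
    mask = (1 << stripe_shift) - 1
    stripe_number = lba >> stripe_shift      # which stripe the LBA falls in
    offset = lba & mask                      # block offset within that stripe
    device = stripe_number % n_devices       # round-robin across devices
    device_lba = ((stripe_number // n_devices) << stripe_shift) | offset
    return device, device_lba


def spans_stripes(lba, size, stripe_shift):
    """True when the request crosses a stripe boundary (candidate for the CP)."""
    return (lba >> stripe_shift) != ((lba + size - 1) >> stripe_shift)
```

With 8-block stripes (`stripe_shift=3`) over four devices, virtual LBA 21 lies at offset 5 of stripe 2 and therefore maps to device 2 at device LBA 5.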
  • a RAID5 format may be implemented using this second primitive for reads.
  • Data in an embodiment may be initially written in RAID 1 and then, using the CP, migrated to RAID 5 as it falls out of use.
  • The LBA Rmap and corresponding redirect table entries may be used to migrate the data back to RAID 1 if the data is subsequently modified.
  • the write gate functionality may utilize an entry in the storage redirect table, as described elsewhere herein, such that a write operation causes a fault to the CP to migrate data back to a RAID-1 organization to allow the write operations.
  • An embodiment may also include a third higher level primitive called the write splitting primitive which is the ability to perform the LBA/LUN remapping of a virtual I/O and simultaneously initiate a second mapped write I/O to another physical device with the same data.
  • This primitive may also include the ability to receive and correlate I/O completion status information from all devices written to and, if all are successful, return success to the request originator. Otherwise, control may be passed to the CP for processing.
  • the FP performs local serialization of overlapping I/Os here for proper function.
  • a mapping table entry may indicate if a write operation to a particular virtual address needs to be split to one or more additional devices. When this happens, the original I/O is mapped and reissued using the first primitive. Additionally, one or more additional downstream I/Os are also issued with the appropriate mapping information for each device obtained from the mapping table. Multiple linked entries are made in the downstream pending I/O table, one for each downstream I/O.
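A minimal sketch of the write splitting primitive follows: one linked downstream entry is created per mirror side, the completion statuses are correlated, and success is returned only if every write succeeded. The per-device offset, the `write_fn` callable standing in for the physical write, and the return strings are all hypothetical.

```python
def split_write(mirror_targets, lba, data, write_fn):
    """Issue the same data to every mirror side and correlate the results.

    mirror_targets: list of (device, lba_offset) pairs, one per writeable side.
    write_fn: stand-in for the physical write; returns True on success.
    """
    # One linked entry per downstream I/O, as in the downstream pending table.
    linked = [{"device": dev, "lba": lba + offset, "status": None}
              for dev, offset in mirror_targets]
    for entry in linked:
        entry["status"] = write_fn(entry["device"], entry["lba"], data)
    if all(entry["status"] for entry in linked):
        return "SUCCESS", linked
    # Any unsuccessful physical I/O passes control to the CP.
    return "FAULT_TO_CP", linked
```

The sketch issues the writes sequentially for clarity; an actual FP would dispatch them concurrently and correlate completions as they arrive.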
  • the CP may use timer indicators, such as time stamps, for pending I/Os and the FP may record the fact that an I/O is pending. Time stamps may be stored with conesponding pending I/O entries when received by the FP.
  • The time stamps may be used to indicate a relative age of the I/O operation and may be used by the CP in coordinating its own functions in connection with outstanding I/O operations. For example, prior to updating an Rmap entry, the CP determines whether there are any pending I/O operations referencing the Rmap entry. The CP waits until all pending I/O operations referencing the Rmap entry have drained prior to updating the Rmap entry. The CP may use the time stamp associated with a pending I/O operation in performing this coordination by comparing the time stamp of the pending I/O operation to the current time stamp. This may be used as an alternative to other techniques, for example, such as keeping a reference count in the FP for each of the Rmap entries, which may require more storage.
  • A fourth and highest level primitive, the write journaling primitive, may also be included in an embodiment. It extends write splitting (the third primitive) while maintaining a journal of writes that have occurred to each physical device.
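One possible reading of the time stamp comparison is sketched below: the CP notes a cutoff time, then treats an Rmap entry as safe to update once no pending I/O referencing that entry carries an earlier time stamp. Integer logical time stamps and the dictionary layout of a pending I/O are assumptions made purely for illustration.

```python
def pending_before(pending_ios, rmap_entry_id, cutoff):
    """Pending I/Os that reference the entry and were stamped before the cutoff."""
    return [io for io in pending_ios
            if io["rmap_entry"] == rmap_entry_id and io["timestamp"] < cutoff]


def safe_to_update(pending_ios, rmap_entry_id, cutoff):
    """True once every older I/O referencing the entry has drained."""
    return not pending_before(pending_ios, rmap_entry_id, cutoff)
```

Compared with a per-entry reference count, this approach stores only one time stamp per pending I/O, at the cost of scanning the pending list.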
  • the journal also described elsewhere herein, may be on media or some form of storage (for persistent resynchronization functionality).
  • the journal may be fixed in size and writes to a full journal may be forwarded to the CP for processing. Typically, the CP will then "swap out" the full journal with an empty one so that the FP can keep running.
  • the destination of a write splitting operation may be either a non-journalling device or a write journal device.
  • A write journal may be characterized as a portion of media where a record is made of each write operation including, for example, a copy of the data, destination device and location information. Once the journal fills up, the write operation is transferred to the CP for processing. It should be noted that each FP may have its own journal to avoid locking issues between FPs.
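The fixed-size journal behavior may be sketched as follows: each record holds the destination device, location and a copy of the data; a write to a full journal is faulted to the CP, which swaps in an empty journal so the FP can keep running. The in-memory list stands in for persistent media, and all names are hypothetical.

```python
class WriteJournal:
    def __init__(self, capacity):
        self.capacity = capacity
        self.records = []

    def append(self, device, lba, data):
        """Record a write; return False when the journal is already full."""
        if len(self.records) >= self.capacity:
            return False
        self.records.append((device, lba, bytes(data)))
        return True


def journaled_write(fp_state, device, lba, data):
    """FP side: journal the write, or fault to the CP when the journal is full."""
    if fp_state["journal"].append(device, lba, data):
        return "JOURNALED"
    return "FAULT_TO_CP"


def cp_swap_journal(fp_state):
    """CP side: replace the full journal with an empty one of the same size."""
    full = fp_state["journal"]
    fp_state["journal"] = WriteJournal(full.capacity)
    return full
```

Keeping one `fp_state` per FP mirrors the note above that each FP may have its own journal to avoid locking between FPs.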
  • a portion of the information needed to implement each of these primitives may be stored in the redirect table and the storage descriptor, such as related to the physical locations and policies of each system.
  • the FP may perform the I/O operation, for example, by dispatching a read or write operation using the first primitive above. If an embodiment includes striping, the FP may perform this operation using the second primitive.
  • the write splitting primitive may be used.
  • an FP may support operations such as, for example, LUN pooling, multi-pathing, snapshots, on-line migration, incremental storage, RAID0 using I/O striping, RAID1 using the write splitting primitive to implement synchronous replication with a fast resynchronization, RAID 10 using the I/O striping and write splitting, asynchronous ordered replication (AOR) using the write splitting and write journaling primitives, and others.
  • the CP may support operation of any functions not supported or performed by the FP, such as any optional primitive functionality of primitives 2-4 above not included in an embodiment.
  • An embodiment may implement primitives in any combination of hardware and/or software.
  • One embodiment may implement the foregoing primitives in silicon or hardware to maximize speed. This may be particularly important, for example, in connection with FP processing since an embodiment may allocate to FP processing those I/O operations which are commonly performed.
  • the processing typically associated with the FP may be characterized as "light weight" processing operations as well.
  • An embodiment that allocates to the FP light weight processing operations associated with primitives and is interested in increased performance may choose to implement primitives completely in hardware.
  • any vendor's storage descriptor may be used.
  • the storage descriptor information, such as an indicator for a particular vendor as to whether RAID0, RAID1 and the like are supported, may be included in the storage redirect table 284.
  • caching techniques may be used such that the FP caches only a portion of the LBA map table 282 as needed. Any one of a variety of different caching techniques and policies may be included in an embodiment of the FP.
  • the FP may implement an LRU or "least recently used" policy for determining which portion of the LBA map table to displace on loading a newer portion of the LBA map table.
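An LRU policy of the kind mentioned above might look like the following sketch. The class `ExtentCache`, the `load_from_cp` callback, and the use of extent identifiers as cache keys are assumptions for illustration; the source only states that the FP caches a portion of the LBA map table and may displace portions on a least recently used basis.

```python
from collections import OrderedDict


class ExtentCache:
    """Hypothetical FP-side cache holding a bounded number of LBA map extents."""

    def __init__(self, max_extents, load_from_cp):
        self.max_extents = max_extents
        self.load_from_cp = load_from_cp  # called on a miss (a fault to the CP)
        self.extents = OrderedDict()      # least recently used entry is first
        self.faults = 0

    def lookup(self, extent_id):
        if extent_id in self.extents:
            # Hit: mark this extent as most recently used.
            self.extents.move_to_end(extent_id)
            return self.extents[extent_id]
        # Miss: load the mapping from the CP, then evict the LRU extent if over budget.
        self.faults += 1
        mapping = self.load_from_cp(extent_id)
        self.extents[extent_id] = mapping
        if len(self.extents) > self.max_extents:
            self.extents.popitem(last=False)
        return mapping
```

The cache size plays the role of the working set window described in the text; a working set algorithm could adjust `max_extents` over time.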
  • the associated Rmap and redirect tables may be loaded into cache local to the FP.
  • the storage redirect table associated with the VSEG may be loaded along with an "empty" Rmap table that includes a single extent.
  • portions of the Rmap are loaded as needed in connection with performing mapping for an I/O operation.
  • the storage redirect table in its entirety is loaded on the first fault within the FP.
  • the LBA map table 282 is formed of one or more extents. The number of extents that are currently loaded for a particular FP may be referred to as the working set or size window. As known to those skilled in the art, a working set algorithm that may be used in connection with page replacement may be used in determining when to increase or decrease this size or window associated with the working set algorithm as used with the FP cache. It should be noted that a single extent is the smallest unit within an Rmap table.
  • FIG. 19 shows an example of an embodiment of the mapping tables at initialization or start-up within the FP.
  • the storage redirect table and an Rmap table having a single extent are loaded into the FP as shown in Figure 19.
  • the number of extents within the Rmap table may increase as well as the number of entries in the storage redirect table in accordance with the different states of the different devices included in the computer system.
  • a first state of a first entry may represent those portions of a device that have already been migrated from one device to another
  • a second state of a second entry may represent those portions of a device that have not yet been migrated
  • a third state may represent those portions of a device that are currently in the process of being migrated.
  • a first state of a first entry may be associated with those portions on a device that have not yet been pushed to a snapshot device and a second state of a second entry in the storage redirect table may be associated with those portions of a device that have already been pushed to the snapshot device.
  • a DVE may implement a copy on write operation in connection with performing a snapshot.
  • a snapshot involves mapping two virtual volumes initially to the same physical storage. When the original virtual volume is subsequently written to, the old data that was "snapshot" is copied from the original physical storage to a backing or snapshot storage device.
  • V snap is a snapshot of the virtual volume V at a particular time T.
  • all of V's Rmap entries in the table Rmap1 reference redirect table1 entry zero.
  • redirect table1 entry zero indicates that only incoming I/O operations that are "read only" are directed towards P1.
  • all of V snap's Rmap2 entries reference redirect table2 entry zero, also causing all "read only" operations to be mapped to physical device P1.
  • V snap is then equal by definition to the virtual volume V. Physical volume P2 is initially unused.
  • the CP then changes V snap's Rmap2 entry, as indicated by element 524, from a zero to a 1 now indexing into the first entry of redirect table2.
  • Redirect table2 entry 1 indicates that I/O operations are directed towards physical device P2 and this is for read only access thereby preserving V snap's view of the original virtual volume V data from time T.
  • the CP also changes V's Rmap1 entry for the corresponding disk extent of the write I/O operation to identify entry 1 of redirect table1 as indicated by element 526.
  • Redirect table1 entry 1 indicates that I/O operations are directed towards physical device P1 and that read and write operations may be performed to device P1. This particular write I/O operation, for example, is allowed to proceed onto device P1 as indicated by arrow 528. Additionally, any subsequent writes to that same extent in which the write I/O operation has previously been made are also allowed to proceed.
  • redirect table entry zero indicates the state of those portions of the disk that have not yet been pushed.
  • Redirect table1 entry 1 is used and associated with those extents that have already been pushed to the snapshot device. If a write is made to a particular LBA, an entry in the Rmap table for the corresponding variable length extent is modified from a zero to a 1. It should be noted that as different write I/O operations are performed, a variable length extent may be formed in the Rmap table of a size equivalent to a particular write I/O operation. As additional data is pushed with subsequent copy on write operations, there may be neighboring extents within the Rmap table that may be coalesced or merged to form a single extent. Thus, as more write operations are performed in connection with the snapshot, there exists fragmentation within a particular embodiment of an Rmap table.
  • a completion thread may be executed and started when a snapshot actually begins.
  • This completion thread may run as a background process within the FP and be scheduled so as not to interfere with other operations within a computer system.
  • the completion thread may start at the top of the Rmap table at the beginning of an associated LBA range and push those portions associated with each extent that have not already been pushed to V snap. This allows for a closing up or coalescing of holes that may be created by write I/O operations.
  • as the completion thread works its way through the Rmap table, it performs writes of any portions of the corresponding VSEG address space that have not already been pushed to the snapshot device.
  • the completion thread may be at a particular point P within the virtual address range [0... LBAMAX] as represented by an Rmap.
  • the state of the LBA range up to point P may be represented in an Rmap by a single entry or single extent. This single extent corresponds to that portion or entry in the redirect table indicating that the data has already been copied to V snap.
  • the source, or V's, mapping tables reference the source storage, which in this case is P1, either through a read only or a read write redirect indicated by the LBA range in Rmap1, depending on whether the snapshot data has already been pushed to the destination or not.
  • the target or snapshot device's mapping tables, which in this example are Rmap2 and storage redirect table2, indicate either the source storage P1 if the snapshot data has not yet been pushed, or the destination storage P2 if the data has already been pushed to the snapshot device.
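The copy on write snapshot mapping described above can be sketched as follows. This is a simplified model under stated assumptions: extents are fixed-size and indexed by integer (the source uses variable length extents), devices are plain Python lists, the CP fault handling runs inline, and the class name `SnapshotPair` and its methods are hypothetical.

```python
class SnapshotPair:
    """Hypothetical model of volume V and its snapshot V snap over devices P1 and P2."""

    def __init__(self, p1_data):
        n = len(p1_data)
        self.p1 = list(p1_data)     # original physical storage
        self.p2 = [None] * n        # snapshot (backing) storage, initially unused
        # Redirect tables: entry index -> (device, access mode).
        self.redirect1 = [("P1", "ro"), ("P1", "rw")]  # for V
        self.redirect2 = [("P1", "ro"), ("P2", "ro")]  # for V snap
        # Rmaps: one redirect-table index per extent; everything starts at entry zero.
        self.rmap1 = [0] * n
        self.rmap2 = [0] * n
        self.faults = 0

    def _device(self, name):
        return self.p1 if name == "P1" else self.p2

    def write_v(self, extent, data):
        device, mode = self.redirect1[self.rmap1[extent]]
        if mode == "ro":
            # Fault to the CP: push the old data, then flip both Rmap entries to 1.
            self.faults += 1
            self.p2[extent] = self.p1[extent]
            self.rmap2[extent] = 1
            self.rmap1[extent] = 1
            device, mode = self.redirect1[self.rmap1[extent]]
        self._device(device)[extent] = data

    def read_v(self, extent):
        device, _ = self.redirect1[self.rmap1[extent]]
        return self._device(device)[extent]

    def read_snap(self, extent):
        device, _ = self.redirect2[self.rmap2[extent]]
        return self._device(device)[extent]
```

Only the first write to an extent faults; later writes to the same extent proceed directly to P1 because the Rmap1 entry already references the read write redirect entry.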
  • entries in the storage redirect table may be combined if duplicates, and may also be removed once an operation is complete such that there are no longer any portions of a device in the state represented by a particular entry.
  • Mappings may be modified synchronously prior to the host completing. In the instance where multiple VEs write to the same source volume, only one of them at a time performs write operations. In the VE "fault handler", each VE must acquire a lock (an oplock for the LBA range of interest, as described in more detail elsewhere herein) and in doing so, will prevent "concurrent faults" to the same LBA range on other VEs. The first VE that acquired the lock handles the fault, pushes the snapshot data, and updates the LBA Rmap. All subsequent VEs, upon inspecting the LBA Rmap, see that the data has already been pushed with the LBA Rmap also updated.
  • mapping tables are included in metadata that may be subject to modification by one or more processes in the computer system of Figure 1.
  • a locking technique may be used in connection with synchronizing accesses to shared metadata.
  • FIG. 22 shows an example 540 of how an incremental of a virtual volume may be implemented in connection with using the Rmap and redirect tables as included in the VSEG described elsewhere herein.
  • the incremental of a virtual volume is similar to a snapshot operation by involving initially mapping two virtual volumes to the same physical storage. However, unlike the snapshot operation described previously, subsequent modifications to an original virtual volume may be stored in a private backing store rather than on an original physical volume.
  • the original physical volume becomes read only.
  • the fact that an original physical volume is now read only allows multiple incremental virtual volumes to be based on the same original physical volume, all of which continue to be read write without adversely impacting each other.
  • Incrementals may be used to allow multiple instantaneous copies of a single virtual volume to seamlessly diverge in time.
  • although the example 540 that will be described shows only a single incremental virtual volume, any number of incremental virtual volumes may be included in an embodiment.
  • entry P2 may be read/write rather than read-only allowing data to be directly written to the device P2. This allows an embodiment to utilize the incremental approach on the destination of the snapshot.
  • the incremental virtual volume in this example is denoted as V Inc and the original physical volume is denoted as V Base.
  • a fault to the CP occurs because it is indicated by entry 0 of Rmap1 that only read operations are allowed to device P1 as may be performed by the FP.
  • the CP modifies the entry in the Rmap1 table, as indicated by entry 545, from a zero to a 1 to allow read write operations to occur to device P2. By modifying the entry in the Rmap1 table from a zero to a 1, the write operation is "redirected" via redirect table1 to physical device P2. The write operation is then allowed to proceed as indicated by arrow 548.
  • V Inc and V Base are initially set to the same physical storage.
  • the new data is rerouted to a second physical device.
  • old data from V Base is not pushed. Rather, any new or incremental data is simply rerouted to an incremental or second device which in this case is indicated by V Inc.
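The incremental behavior described above, in which new data is rerouted rather than old data pushed, can be sketched as follows. The model is an assumption-laden simplification: extents are fixed-size and written whole, devices are Python lists, the CP fault handling runs inline, and `IncrementalVolume` is a hypothetical name.

```python
class IncrementalVolume:
    """Hypothetical incremental virtual volume over a shared, read-only base."""

    def __init__(self, base):
        self.p1 = base                    # original physical volume, shared and never written
        self.p2 = [None] * len(base)      # private backing store for this incremental
        self.redirect = [("P1", "ro"), ("P2", "rw")]
        self.rmap = [0] * len(base)       # every extent initially maps to entry 0

    def write(self, extent, data):
        if self.redirect[self.rmap[extent]][1] == "ro":
            # Fault to the CP: change the Rmap entry from 0 to 1.
            # Unlike a snapshot, no old data is copied anywhere.
            self.rmap[extent] = 1
        self.p2[extent] = data

    def read(self, extent):
        device, _ = self.redirect[self.rmap[extent]]
        return (self.p1 if device == "P1" else self.p2)[extent]
```

Because the base is only ever read, several `IncrementalVolume` instances can share it and diverge independently.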
  • An online migration operation of physical storage for a virtual volume involves the use of a copy agent that may be included in the CP and three entries in the storage redirect table, indicated as redirect table1 in this example.
  • Entry 0 of storage redirect table1 indicates that for device P1, read and write operations are enabled.
  • Entry 0 represents a state of data that has not yet been migrated from device P1 to P2.
  • Redirect table1 entry 1 represents a state of data which is currently in the process of being migrated.
  • Redirect table1 entry 2 represents a state of data that has already been migrated.
  • the number of extents indicated by Rmap1 may include at most three extents.
  • the first extent comprises all of those portions of the Rmap1 table indicated by entry 2 of the redirect table, corresponding to data that has already been migrated.
  • Data in the second extent, represented by redirect table1 entry 1, may also be referred to as the copy barrier, which indicates that portion of the data which is currently in the process of being migrated. Any data subsequent to that in a particular LBA range is indicated as being associated with redirect table entry zero, representing that data which has not yet been copied.
  • the size of the second extent may represent the granularity of the data that is currently being copied or migrated.
  • the CP is currently migrating data from physical volume P1 to P2.
  • the CP is responsible for establishing a copy barrier range by setting the corresponding disk extent to having a redirect table entry of 1 indicating a read only operation for device P1. This is indicated by the entry 562.
  • the entry 562 has a redirect entry 1.
  • the CP then copies the data in the copy barrier range from device P1 to P2 as indicated by the arrow 564.
  • the CP may then advance the copy barrier range by 1) setting the rmap entry 562 to 1, 2) copying the data from P1 to P2, and 3) setting the rmap entry 566 to 2.
  • Setting a corresponding disk extent indicated by the entry 562 in the table to refer to redirect table entry 2 causes read and write operations to proceed to the second device P2. Any data that has already successfully been migrated to device P2 is accessed through table entry 2. Any data that has not yet begun being migrated to the physical device P2 is accessed through table entry zero with read write operations to P1. Data that is in the process of being migrated within the copy barrier range is accessed through entry 1 with read only operations to device P1.
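The advancing copy barrier described above can be sketched as follows. This is an illustrative model only: the Rmap is held as one redirect-table index per block rather than as variable length extents, devices are Python lists, and the function names `migrate` and `extent_count` are hypothetical.

```python
# Redirect table1 as described in the text: index -> (device, access mode).
REDIRECT = [
    ("P1", "rw"),  # entry 0: not yet migrated
    ("P1", "ro"),  # entry 1: inside the copy barrier, being migrated
    ("P2", "rw"),  # entry 2: already migrated
]


def migrate(p1, p2, rmap, barrier_blocks):
    """Copy p1 to p2 one barrier range at a time; return the Rmap seen during each copy."""
    observed = []
    for start in range(0, len(p1), barrier_blocks):
        end = min(start + barrier_blocks, len(p1))
        # 1) Establish the copy barrier: the range becomes read only on P1.
        for i in range(start, end):
            rmap[i] = 1
        observed.append(list(rmap))
        # 2) Copy the data in the barrier range from P1 to P2.
        p2[start:end] = p1[start:end]
        # 3) Mark the range as migrated: reads and writes now go to P2.
        for i in range(start, end):
            rmap[i] = 2
    return observed


def extent_count(rmap):
    """Number of runs of equal entries, i.e. extents after coalescing neighbors."""
    return 1 + sum(1 for a, b in zip(rmap, rmap[1:]) if a != b)
```

Because the barrier sweeps in one direction, the Rmap never holds more than three extents at a time: migrated, in progress, and not yet migrated.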
  • the granularity of data that is actually copied may vary in accordance with each particular embodiment.
  • the amount of data pushed in a single instance may be a 64K byte size.
  • its size may be "bounded" in accordance with a granularity associated with data copy operations.
  • a write operation may be, for example, writing a 10K byte block of data
  • the smallest amount of data that may be copied in connection with a snapshot or a migration may be a 64K byte block of data.
  • the 10K byte write I/O operation may be bounded within a 64K byte block of data that is actually copied.
  • the granularity size is 64K bytes in this example and may vary in accordance with each particular embodiment.
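Bounding a write to the copy granularity amounts to rounding its byte range outward to granularity boundaries. The sketch below shows that arithmetic; the function name `bounded_copy_range` and the byte-offset interface are assumptions, and the 64K value is just the example granularity from the text.

```python
GRANULARITY = 64 * 1024  # example granularity from the text; may vary per embodiment


def bounded_copy_range(offset, length, granularity=GRANULARITY):
    """Return the (start, end) byte range, aligned to the granularity, that covers a write."""
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // granularity) * granularity
    # Ceiling division written with floor division on negated values.
    end = -(-(offset + length) // granularity) * granularity
    return start, end
```

A 10K byte write that falls inside one 64K block causes exactly that block to be copied, while a write straddling a boundary causes two blocks to be copied.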
  • data, such as metadata, that may be used by FPs as well as by CPs within a single DVE may need to be coherent. Additionally, the same global metadata may be accessed for update by multiple DVEs, also requiring synchronized access. Different types of synchronization and/or locking mechanisms may be used in performing intra-DVE and inter-DVE synchronization to manage the data coherency between copies of metadata.
  • a single CP may manage one or more associated FPs to maintain CP and FP data coherency and synchronization, for example, in connection with metadata associated with a virtual volume descriptor, such as the RMAP and storage redirect tables.
  • the CP may communicate with the one or more FPs and, for example, request that one or more FPs remove entries from their local FP caches.
  • the FP and the CP may communicate using one or more APIs as also described elsewhere herein in connection with performing metadata accesses.
  • only CPs may modify global metadata that may require the CP to gain exclusive access over a portion of the metadata using a locking technique described in more detail elsewhere herein. Accesses to metadata may also involve reading, for example, which does not necessarily require exclusive access by a particular CP.
  • a single DVE may have a one-to-one relationship with a CP at execution time. It should be noted that this relationship may change over time, for example, when a CP fails.
  • a CP may be used interchangeably with a DVE for purposes of this one-to-one relationship. For example, the foregoing paragraphs state that DVEs may communicate using a messaging protocol which means that CPs of each of the DVEs may communicate.
  • an embodiment may select to minimize the number of CPs such that there may be reduced inter-CP communication, for example, in connection with performing operations requiring cluster-like communications between CPs as described elsewhere herein.
  • An embodiment may include multiple CPs within a single DVE to share the load within a single DVE, but from a viewpoint external to the DVE, there may be a single CP.
  • FIG. 25A shows an example of an embodiment 600 of how metadata may be distributed in an arrangement in the computer system of Figure 1.
  • DVE 610 may include multiple DVEs each having a plurality of CPs and associated FPs.
  • Oplocks, which are described elsewhere herein, are the mechanism by which access to global metadata is synchronized and controlled to maintain data coherency of the metadata being accessed, for example, by multiple CPs in connection with write metadata operations.
  • Each of the CPs, such as 604a and 606a, may cache a local copy of metadata which may be a portion of the global metadata.
  • Each of the CPs may be associated with one or more FPs, for example, such as CP 604a may be associated with two FPs, 604b and 604b.
  • Each of the FPs may also maintain in a local FP cache a portion or a subset of the metadata.
  • the FP caches the storage redirect table and a portion of the Rmap table that the FP is currently using. Caching techniques that may be used in an embodiment of an FP are also described elsewhere herein.
  • the CP maintains cache coherency between the FP cache contents and the contents of the CP's own cache.
  • the arrangement 600 in Figure 25A illustrates a hierarchical data arrangement in connection with metadata that may be included in an embodiment.
  • the CP and its associated FPs maintain master/slave vertical coherency from the CP to the FP. In other words, any mappings found in the FP mapping tables are guaranteed to be valid by the CP which itself has populated the FP tables.
  • the FP mapping table is a cache or a subset of a portion of the information available within the CP.
  • CPs, of which there may be many, for example, in a distributed system may maintain peer-to-peer horizontal coherency between themselves. In other words, they agree cooperatively using, for example, cluster semantics on what mappings are valid.
  • membership management and distributive techniques may be used in connection with the cluster-type environment.
  • Each CP may be thought of as having a globally coherent copy of a subset of an authoritative mapping table and each FP as having a locally coherent subset of the table maintained by the CP with which it is associated.
  • CPs may communicate with each other when necessary and scale horizontally in a symmetric distributed system.
  • Each FP communicates with its associated CP.
  • the FPs form an asymmetric distributed system off of each of the CPs.
  • the CP modifies the metadata information.
  • the CP handles all I/O errors, all coherency and synchronization with other CPs through the use of metadata, and all inter-volume coherency. All errors returned to a host originate from the software CP.
  • the FPs are not involved in synchronization or coherency issues in connection with the metadata.
  • the CP, in direct contrast, is intimately involved in the synchronization and coherency of the metadata.
  • Intra-DVE locks are used to ensure only one thread within a DVE is modifying or accessing global metadata at a time.
  • Inter-DVE locks are used to ensure that only one DVE is modifying or accessing a portion of global metadata at a time. Therefore, true mutual exclusion, from all threads on all DVEs, is obtained when a thread acquires both the intra-DVE and inter-DVE locks protecting a piece of global metadata.
  • the intra-DVE locking technique may use mutual exclusion thread locks that may be included in a particular platform and may vary with embodiment in accordance with the functionality provided.
  • Intra-DVE locks may be based upon metadata apportioned using the variable length extents described, for example, in connection with the Rmap table and the storage redirect table which are divided into variable length extents in RAM as used with mapping.
  • the intra-DVE locks may be associated with each portion of metadata accessible for global access to maintain control of metadata within a DVE. As described elsewhere herein, there may be many processes within a single DVE competing for a single lock, such as sweep threads, migration threads and the like, all executing simultaneously.
  • the intra-DVE locking mechanism is local to each DVE and may be stored in volatile storage, for example, such as RAM, rather than a form of persistent non-volatile storage, such as on media or disk, for use in connection with system failure and recovery operations.
  • critical sections may be used to implement exclusive access for intra-DVE locking.
  • the critical sections may be used to lock a range of an rmap between contending threads.
  • Other embodiments may use other techniques in connection with implementing an intra-DVE locking mechanism that may vary in accordance with each embodiment.
  • FIG. 25B shows a more detailed representation of one embodiment of the global metadata and oplocks included in the global metadata and oplocks store 602.
  • inter-DVE oplocks may be used as an inter-DVE locking mechanism in contexts for synchronization without being associated with metadata, such as with mirror write serialization operations.
  • when a CP wants to modify a portion of metadata, the CP first acquires the corresponding intra-DVE lock and then acquires the inter-DVE oplock and corresponding global metadata. Each piece of data that is globally accessed by multiple DVEs may have an associated oplock.
  • Included in the global storehouse 602 may be, for example, LBA Rmap table metadata and oplocks 626a, storage redirect metadata and oplocks 626b, journal metadata and oplocks
  • journal metadata and the global cluster membership are non-volatile.
  • a portion of data may be either volatile or non-volatile.
  • Associated with each portion of data may be an oplock that is either volatile or non-volatile.
  • Non-volatile data is recorded in some form of permanent storage that retains its state, for example, when there is a failure.
  • ownership information is recorded in an oplock journal also stored in metadata.
  • the LBA Rmap or rmap table metadata and oplocks 626a includes rmap metadata and associated oplocks.
  • rmap metadata is non-volatile metadata because upon failure, a node performing clean-up operations needs to know, for example, which portions of an LV have already been migrated.
  • ownership information may also be recorded indicating which DVE is the current "owner" that has acquired the oplock. This may be used in connection with performing data recovery operations described elsewhere herein.
  • the journal metadata and oplocks 626c includes journal metadata and oplocks.
  • a single journal may be associated with each DVE describing or journaling the operations performed by, or in the process of being performed by, each DVE as known to one of ordinary skill in the art.
  • the journals may be stored in global storehouse 602 in non-volatile storage since these journals may be played back and used in performing data recovery. For example, a first DVE may "clean up" after a second DVE goes off-line. The first DVE may walk through the operations the second DVE was in the process of performing. Once the first DVE is done, the journal associated with the second DVE may be released.
  • the second DVE may evict the first DVE from the cluster, and inherit its own journal back, as well as that of the
  • the global storehouse may also include oplocks used for inter-DVE synchronization which may or may not be used in protecting associated metadata.
  • the global storehouse may also contain other global metadata protected using other types of inter-DVE locking mechanisms. It may also contain global metadata that is not protected by a lock, for example, when machine instructions accessing the global metadata implicitly lock the data.
  • the global metadata and the oplocks 602 may be stored in any one of a variety of different locations. For those oplocks that are non-volatile, a persistent storage location may be used to store the oplocks, ownership and associated information used in connection with performing data recovery operations.
  • the global metadata and the oplocks may be stored in any location and may be provided by a service, for example, in connection with APIs, to modify and access the data. It should be noted that within a particular embodiment of a computer system, there may be multiple information stores including multiple copies, as well as different portions of, the global metadata and oplocks 602.
  • in the process of, for example, removing or modifying an entry from a global mapping table, each CP must ensure that each of its slave FPs has already removed the entry from its own FP table after obtaining the corresponding locks, such as intra-DVE locks and inter-DVE oplocks. Note that adding entries to an FP mapping table can be done as needed since the worst case is that there is no matching entry and the I/O would be handled by the CP.
  • the CP may coherently modify an FP table entry from an upstream source to a downstream destination by first deleting the old FP table entry, for example an entry used in connection with an RMAP or the storage redirect table. By deleting the old FP entry, new I/Os are prevented from being started with the old mapping. Any subsequent initiations or accesses to this particular entry from the upstream source will be forwarded to the CP, since a fault will occur in the FP when there is no current entry. Next, the CP may query the FP's pending I/O list to determine if there are any I/Os that are outstanding on the downstream pending I/O list for this particular FP entry.
  • the I/O operations may be aborted and the entry in the pending I/O table may also be deleted or removed, or the CP may wait for those operations to fully complete. This prevents pending I/Os from resuming or henceforth completing using the old mapping.
  • the CP may then delete its own copy of a particular entry in a table.
  • the CP may then further synchronize with the other CPs, such as using messaging, to make the new CP entry valid and modify, for example, the global metadata using the inter-DVE oplocks. Subsequently, the CP modifies its own copy of the data and traditionally updates any copy of this particular table entry in each of the FPs.
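The ordering of steps for coherently changing a mapping can be sketched as follows. This is a single-process illustration under assumptions: the classes `FP` and `CP`, their method names, and the dictionary-based tables are hypothetical, lock acquisition and inter-CP messaging are omitted, and draining pending I/Os is modeled by simply completing them.

```python
class FP:
    """Hypothetical fast path with a mapping table and a pending I/O list."""

    def __init__(self):
        self.table = {}    # extent -> mapping
        self.pending = {}  # io_id -> extent the I/O was dispatched with

    def start_io(self, io_id, extent):
        if extent not in self.table:
            return "fault"  # no current entry: the I/O is forwarded to the CP
        self.pending[io_id] = extent
        return self.table[extent]

    def complete_io(self, io_id):
        self.pending.pop(io_id, None)

    def pending_for(self, extent):
        return [io for io, e in self.pending.items() if e == extent]


class CP:
    """Hypothetical control path that owns the authoritative copy of the mappings."""

    def __init__(self, fps):
        self.fps = fps
        self.table = {}

    def remap(self, extent, new_mapping):
        # 1) Delete the old FP entries so no new I/O starts with the old mapping.
        for fp in self.fps:
            fp.table.pop(extent, None)
        # 2) Drain I/Os still pending against the old mapping
        #    (stand-in for waiting for completion or aborting).
        for fp in self.fps:
            for io_id in fp.pending_for(extent):
                fp.complete_io(io_id)
        # 3) Update the CP's own copy (global metadata and oplocks omitted here).
        self.table[extent] = new_mapping
        # 4) Repopulate the FP tables with the new mapping.
        for fp in self.fps:
            fp.table[extent] = new_mapping
```

Step 4 is optional in the sense the text describes: an FP entry that is missing only costs a fault to the CP, so entries can be added as needed.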
  • There is a potential problem when a mapping is changed while I/Os are outstanding, for example, when an I/O is dispatched to a downstream device as a result of a mapping.
  • the I/O has not yet completed but the mapping has changed and the mapping table entry is deleted.
  • This problem may occur because there is no positive acknowledgement to an abort command and the DVE may not be sure that the I/O is not still being processed.
  • This problem may be referred to as the ghost I/O problem in which I/Os, such as write operations, may be initiated by a DVE but not complete prior to a DVE going offline, or being unavailable.
  • An embodiment may attempt to prevent such I/Os from completing.
  • an embodiment may attempt to abort ghost I/Os using any one or more of a variety of different techniques having different associated costs and conditions. For example, an embodiment may abort all I/O operations for a particular target device, or initiated by a particular device for a specified time period. This may be performed by coordinating with other DVEs to stop I/O operations in accordance with certain conditions. Subsequently, messaging, as described elsewhere herein, may be used to coordinate a restart of sending I/O operations among DVEs. If any I/O operations have been aborted that should not have been, the initiator may subsequently detect the abort and reissue the I/O operation. Other techniques may be employed in an embodiment.
  • a host may issue a write I/O request causing a fault to the CP.
  • the CP may then obtain exclusive access to a particular portion of the global metadata by obtaining the intra-DVE and inter-DVE locks needed.
  • the CP communicates to those CPs using only the particular portion which the first CP wishes to lock.
  • Portions of metadata may have an associated inter-DVE oplock. Additionally, there may be a list of those nodes that maintain a copy of the metadata locally in memory of all of the DVEs that are caching that particular metadata. In order for a CP to modify a particular piece of global metadata, it obtains the corresponding oplock for that metadata by obtaining permission through messaging techniques described elsewhere herein.
  • Included in the global metadata 602 are RMAPs and storage redirect tables each having associated volatile oplocks.
  • the LBA RMAP or RMAP tables of a volume segment descriptor include variable length extents when represented in memory.
  • the metadata RMAP is divided into fixed size portions or chunks rather than variable length extents.
  • each oplock or locking mechanism is associated with a fixed corresponding RMAP portion.
  • the variable length extents included in an RMAP, for example, as may be maintained within a CP or an FP, may be mapped to one or more fixed size chunks within the global metadata.
  • the CP obtains the volatile oplocks for the fixed size portions associated with the corresponding metadata.
  • FIG. 26 shows an example 640 of how a variable size extent may map to one or more chunks or portions.
  • the illustration 640 shows an RMAP 646 that includes three extents of variable lengths. Extent noted by element 642 may need to be accessed by a CP, for example, in connection with modifying the contents of the RMAP referring to a particular entry in the storage redirect table.
  • the CP obtains access to the oplocks corresponding to the portion 644.
  • the portion 644 represents three fixed size segments or portions each having their own associated oplock.
  • the CP obtains each of the three oplocks associated with the portion 644 in order to modify the global metadata conesponding to portion 642 which it may store locally within the CP itself.
  • the boundaries of a particular oplock may be refened to as lock boundaries.
  • the CP may obtain the oplock to the next successive boundary including the LBA range desired.
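  • The mapping from a variable length extent to the fixed size chunks whose oplocks must be obtained can be sketched as follows. This is a minimal illustration only; the function names and the chunk size used in the usage note are assumptions and do not come from the description above.

```python
def chunks_for_extent(start_lba: int, length: int, chunk_size: int) -> range:
    """Return the indices of the fixed-size chunks overlapped by an extent.

    Each returned index stands for one chunk, and therefore one oplock,
    that a CP would need to obtain before modifying the extent.
    """
    if length <= 0 or chunk_size <= 0:
        raise ValueError("length and chunk_size must be positive")
    first = start_lba // chunk_size
    last = (start_lba + length - 1) // chunk_size
    return range(first, last + 1)


def lock_boundaries(start_lba: int, length: int, chunk_size: int) -> tuple:
    """Widen an LBA range outward to the enclosing lock boundaries."""
    chunks = chunks_for_extent(start_lba, length, chunk_size)
    return chunks.start * chunk_size, chunks.stop * chunk_size
```

  • With a hypothetical chunk size of 128 blocks, an extent starting at LBA 100 with length 250 overlaps chunks 0, 1 and 2, so three oplocks are obtained and the locked range widens to the boundaries 0 and 384.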
  • a list of DVEs in a particularly relevant state may be maintained. This may be stored in volatile memory local to each DVE. For example, in connection with performing a write operation, it may be desirable to know who is sharing or using a particular portion of metadata.
  • When a DVE initially boots or starts up, it progresses from boot to the uninterested state, where it is not part of the cluster and does not care to know or be communicated with regarding metadata modifications. The DVE may then want to join the cluster and progress to the joined state.
  • When in the joined state, a DVE is part of the cluster but has not yet begun using or accessing any of the metadata the oplock may be protecting. From the joined state, a DVE may want to move to the sharing state to indicate that it is caching or accessing metadata that the oplock may be protecting.
  • Sharing may be associated with performing a read operation and accessing that part of the metadata. From the sharing state, a DVE may want to acquire the particular oplock or other type of lock associated with that particular metadata for example in performing a write of the metadata associated with, for example, an RMAP table entry. This DVE may then progress to the acquired state.
  • a "join" list and a "share” list may be maintained locally in each DVE in volatile memory.
  • Each DVE may use its own list, for example, in determining to what other DVEs to send an acquire message request.
  • the DVE may broadcast state change messages to other DVEs in the "join list”.
  • the DVEs may communicate using the VI or Virtual Interconnect messaging protocol which is an ordered reliable datagram messaging mechanism. This is only one type of messaging protocol and mechanism that may be used to facilitate communications between each of the DVEs in its cluster-like environment.
  • Messages that may be included and exchanged between different DVEs may include a "join" message notification when a DVE wants to join the cluster protocol.
  • the DVE may enter the sharing state and accordingly send a corresponding share message to other DVEs.
  • a complementary unshare operation may be exchanged between DVEs when a particular DVE ceases caching metadata associated with a particular oplock.
  • Acquire may be a message sent from one DVE to other DVEs indicating that the DVE sending the acquire message wishes to acquire the oplock for a particular metadata.
  • Release may be a message exchanged between CPs to indicate that a particular CP that is sending the message has released the metadata from update. It should be noted that an embodiment may not include an explicit release message. Rather, an oplock may be considered taken by a first requester until it is next requested and acquired by a second requester. Alternatively, the first requester may release the oplock when the first requester is done with the metadata by issuing an explicit release message.
  • An example of the former technique for acquiring/releasing an oplock is described in more detail elsewhere herein.
  • acknowledgment messages, such as a positive acknowledgment and a negative acknowledgment message, may be included in an embodiment.
  • One of the acknowledgment messages may be sent from a CP for example in response to another CP's request to acquire a particular oplock to modify metadata.
  • An oplock is used cooperatively among the one or more DVEs for inter-DVE coherency and synchronization of metadata.
  • An oplock is hosted, for example, on the DVE that acquired it most recently. That DVE can often take or reacquire the oplock with a simple write to a private journal. If the oplock is volatile, there is no need to write to a journal.
  • a DVE may communicate with the oplock's DVE host and thereby become the oplock's new DVE host. What will now be described is one embodiment of the inter-DVE and intra-DVE oplock structures.
  • the global cluster membership list may be denoted as a "jlist" of all the nodes (DVEs) in the cluster having an associated lock referred to as the "jlock". Also included in the global storehouse may be an eviction list or "elist" to which DVEs are added when they are to be evicted, such as when a first DVE does not receive an acknowledgement message from a second DVE in response to a message from the first DVE. The first DVE may conclude that the second DVE is offline and begin cluster eviction and recovery.
  • the jlock is an inter-DVE lock associated with the global cluster membership list.
  • 626d may be represented as:
      jlock -- oplock for jlist
      jlist -- "join broadcast list" (lists all nodes); this is the global cluster list or membership list of DVEs
      elist -- "eviction list"
      elock -- oplock for eviction list
  • an embodiment may use a different locking mechanism besides oplocks in connection with the locks for the jlist and the elist referenced above.
  • an oplock may be a particular lock optimized for distributed access where some locality of reference exists.
  • An embodiment may use oplocks for inter-DVE locks.
  • Oplocks may be volatile or non-volatile. If an oplock is volatile, there is no backup media copy. Alternatively, if an oplock is non-volatile, there is a backup copy stored, an identifier as to which DVE is the owner, and a journal of oplock operations. If a node goes off-line, such as in the event of a disaster, another node inherits the off-line node's journals and performs any clean-up needed for any non-volatile oplocks, such as may be associated with mirrored writes. In the event that a DVE goes off-line, its volatile locks are automatically released by virtue of the protocol described elsewhere herein in that a DVE acquires a lock by obtaining permission from all other DVEs in the sharing state for the associated data.
  • Oplocks may be used as an alternative to other locking mechanisms, such as critical sections, semaphores and the like.
  • the use of oplocks keeps a list of all readers. When one DVE decides that it needs to write to the commonly accessed data, it obtains permission from all other readers first. In other words, with oplocks, only writers need to acquire and release the locks.
  • This policy is in contrast to an embodiment using an alternative locking mechanism, such as a critical section, in which both readers and writers acquire and release a lock when accessing the shared resource to ensure exclusive access to a shared resource for both reading and writing.
  • the oplocks for each piece of metadata, such as a fixed portion of the Rmap table, include an indication of who is the acquirer or owner of the oplock. It should be noted that the acquirer or the owner of the oplock may also be referred to as a host of the oplock.
  • Each of the non-volatile inter-DVE oplocks may be represented by the following:
      owner (current and recent, if known)
      slist -- "share broadcast list" (all joined DVEs)
      alist -- "acquire broadcast list" (all sharing DVEs)
      dirty -- indicates dirty (unrestrictive) metadata needs to be flushed
  • slist and alist may be maintained privately (per-node), in-memory and per-oplock.
  • Current owner is the present owner of the oplock. Recent owner may refer to a previous owner, as in the instance where a node goes down and the current owner is performing cleanup for the recent owner. In the foregoing, jlist is the list of all possible nodes in the cluster; "join" requests are broadcast to the DVEs in this list.
  • Slist is the subset of nodes which have actually "joined" the cluster, to which "share" requests are broadcast.
  • Alist is the further subset of nodes which are actually "sharing" access to metadata; "acquire" requests are broadcast to these DVEs.
  • volatile oplocks may be represented by a slightly modified version of the above structure described for non-volatile oplocks.
  • the volatile oplock structure may be the above structure without the ownership information.
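  • As a rough illustration of the two structures just described, the following sketch models a volatile oplock as the per-node slist, alist and dirty flag, and a non-volatile oplock as the same record plus ownership information. The class and field names are assumptions chosen for readability, and DVEs are identified by plain strings here.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class VolatileOplock:
    # "share broadcast list": all joined DVEs (kept per node, in memory)
    slist: Set[str] = field(default_factory=set)
    # "acquire broadcast list": all sharing DVEs (kept per node, in memory)
    alist: Set[str] = field(default_factory=set)
    # True when dirty (unrestrictive) metadata still needs to be flushed
    dirty: bool = False


@dataclass
class NonVolatileOplock(VolatileOplock):
    # Ownership information exists only for non-volatile oplocks.
    current_owner: Optional[str] = None
    recent_owner: Optional[str] = None
```

  • Deriving the non-volatile form from the volatile one mirrors the statement that the volatile structure is the same structure without the ownership information.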
  • An update to data associated with an oplock may be characterized as unrestrictive (dirty) or restrictive.
  • In a restrictive update, a requesting node acquires the associated lock and notifies all other nodes of the update. All other nodes stall their I/O operations until the update is performed.
  • In an unrestrictive update, I/O operations are not stalled.
  • the update may be performed by each node at some point, for example, as performed by a background task update when there are idle processor cycles.
  • an unrestrictive acquisition and update may be associated with locks for metadata which grant new or additional authority.
  • a restrictive acquisition and corresponding restrictive update may be associated with locks for metadata which restrict or take away authority.
  • an Rmap update may be a restrictive update performed by the CP, such as when an Rmap entry is updated to further restrict the types of operations that may be performed by the FP (e.g., change from "FP can perform R and W operations" to "FP can only perform read operations").
  • an unrestrictive Rmap entry update may be, for example, a modification by the CP of an entry to increase the types of operations that the FP may perform (e.g., change from "FP can perform only read operations" to "FP can perform read and write operations").
  • With restrictive updates, all copies of associated data as referenced by all CPs are invalidated and replaced with the new updated version prior to performing additional I/O operations.
  • For example, consider an unrestrictive update by node B in which node B must obtain node A's permission to acquire the lock. Node B sends a message to node A requesting to acquire the lock. Node A sends an acknowledgement to node B. Node B updates the metadata, and this is an unrestrictive update. Node B sends node A a message regarding the unrestrictive update of the metadata. Node A records the unrestrictive update in node A's journal and sends an acknowledgement back to node B. Node A then purges all outdated copies of the metadata as time allows.
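  • The distinction between restrictive and unrestrictive Rmap updates can be expressed as a simple rule on the set of operations the FP is allowed to perform. The sketch below is an assumption-laden illustration: the function name is hypothetical, and treating a change that both adds and removes authority as restrictive is a conservative choice that the description above does not spell out.

```python
def classify_update(old_ops: frozenset, new_ops: frozenset) -> str:
    """Classify an Rmap entry change by its effect on FP authority.

    An update that takes away any operation the FP could previously
    perform is restrictive; one that only grants authority is
    unrestrictive. Mixed changes are treated as restrictive here
    (an assumption, chosen to be conservative).
    """
    removed = old_ops - new_ops
    return "restrictive" if removed else "unrestrictive"
```

  • For instance, changing an entry from read and write to read only is classified as restrictive, so peers would stall I/O until the update completes, whereas changing from read only to read and write is unrestrictive and may be applied by peers as time allows.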
  • Uninterested state (primary dormant state for oplocks of unshared metadata)
      on "join" received, respond with "nakx" (not interested);
      on "leave" received, ignore;
      main {
        if (need to access metadata) {
          clear slist;
          clear alist;
          goto Want to join;
        }
      }
  • Joined state (primary dormant state for oplocks of shared metadata)
      on "join" received, add sender to slist; respond with "ack";
      on "leave" received, remove sender from slist;
      on "share" received, add sender to alist; respond with "ack";
      main {
        if (need to cache metadata) {
          goto Want to share;
        }
        if (no longer need any access to metadata) {
          async broadcast "leave" to slist;
          goto Uninterested;
        }
      }
  • Sharing state
      on "join" received, add sender to slist; respond with "ack";
      on "leave" received, remove sender from slist;
      on "share" received, add sender to alist; respond with "ack";
      on "unshare" received, remove sender from alist;
      on "acquire" received, notice below;
      main {
        if (acquire received) {
          PURGE METADATA CACHE;
          async broadcast "unshare" to alist;
          if (dirty) {
            FLUSH METADATA JOURNAL;
            dirty = false;
  • the variable "purge" is set to indicate that the oplock was successfully acquired but that the previous node holding the oplock flushed some dirty metadata that was protected by the oplock prior to releasing the oplock. Accordingly, the current node purges the cached metadata and rereads the metadata from the media or non-volatile storage. Purge is set in the "Want to acquire" description elsewhere herein when the previous lock owner released the
  • Non-volatile is a characteristic of an oplock specified when the oplock was previously created such that a record of the oplock is stored, such as on media, in the event of node owner failure.
  • an oplock acquisition may be done in a restrictive or unrestrictive manner for each acquire.
  • An unrestrictive acquisition may be characterized as stating that metadata is being updated by a first node but that the other nodes do not need to learn about this update immediately. This allows communication to other nodes that the lock was acquired and metadata changed in a less restrictive fashion.
  • An unrestrictive acquisition may be used, for example, in connection with metadata updates that grant new authority to other DVEs in the cluster, as opposed to revoking existing authority.
  • an explicit release of a lock in this embodiment triggers a retry for other nodes attempting to share or acquire a lock that another node has already acquired.
  • other nodes may retry after a predetermined time period.
  • a metadata structure may be one or more arrays associated with a device. Each array associated with a device may correspond to a logical device identifier. A particular portion of metadata may be accessed by a triple represented as: global_id, local_id, index where global_id corresponds to a volume segment number, local_id corresponds to a particular attribute, and index corresponds to a particular portion, such as a particular 32 megabyte extent.
  • Local_id may correspond to a particular attribute, such as rmap information for a particular volume segment.
  • the metadata structure may be a two-dimensional array in which an element is accessed by [first_index, second_index].
  • the global_id and local_id may be used in obtaining a hash value corresponding to the first_index value. Any one of a variety of different hashing techniques may be used. For example, if the metadata structure is a 2-dimensional array, the global_id and local_id may be used to obtain a first_index value such as represented by:
  • each oplock may be similarly referenced by the tuple and each oplock may be a record or structure including the ownership information, and the like, as described elsewhere herein.
  • An oplock may be associated with an element in the array, an entire array, or an entire instance of metadata.
  • An embodiment may represent metadata and oplocks using data structures other than those described herein as known to those of ordinary skill in the art.
  • the data structure used to implement DVE oplocks allows access to a particular oplock by a guid.luid[index] tuple as described elsewhere herein in connection with a metadata data structure.
  • RMAP metadata may include an array of redirect values, whose elements are addressed by VSEGguid.RMAPluid[BBAindex/BLOCKSIZE], in which VSEGguid is the volume segment identifier, RMAPluid refers to the RMAP identifier, BBAindex refers to the beginning block address index, and BLOCKSIZE refers to the size of a block of data.
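  • The guid.luid[index] addressing can be illustrated with a small in-memory store in which the array index is the beginning block address divided by the block size. The class name, the use of a dictionary in place of an array, and the block size in the usage note are assumptions made for this sketch only.

```python
class RmapStore:
    """Toy store addressing redirect values by VSEGguid.RMAPluid[BBA // BLOCKSIZE]."""

    def __init__(self, block_size: int):
        if block_size <= 0:
            raise ValueError("block_size must be positive")
        self.block_size = block_size
        self._values = {}  # (guid, luid, index) -> redirect value

    def _key(self, vseg_guid: str, rmap_luid: str, bba: int) -> tuple:
        # Integer division maps every address in a block to the same element.
        return (vseg_guid, rmap_luid, bba // self.block_size)

    def set_redirect(self, vseg_guid: str, rmap_luid: str, bba: int, value) -> None:
        self._values[self._key(vseg_guid, rmap_luid, bba)] = value

    def get_redirect(self, vseg_guid: str, rmap_luid: str, bba: int):
        return self._values.get(self._key(vseg_guid, rmap_luid, bba))
```

  • With a hypothetical block size of 8, addresses 17 and 23 both fall in element 2 and therefore share one redirect value and one oplock, while address 24 falls in element 3.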
  • the following RMAP oplock policy may be employed cooperatively between nodes:
  • a DVE is in a sharing state for the corresponding oplock.
  • a DVE acquires the corresponding oplock. This causes copies of the metadata to be "purged" from all the RMAP caches of other reading nodes (peers) sharing the oplock. If the oplock is acquired with a restrictive update, the peers also synchronize, waiting for all upstream I/Os that might be using the old RMAP redirect value to complete (based on RMAP VSEG, upstream I/O BBA range, and upstream I/O timestamp). Once all peers have acknowledged the purge (and synchronization) as complete, the node now owning the oplock can update the RMAP metadata knowing no other node is using it.
  • reader nodes may return to reading the metadata as in the corresponding "shared reader" state of the oplock. Note that if only one node is using an oplock, all subsequent transitions from shared to acquired state require no inter-node coherency traffic.
  • an upstream I/O is timestamped prior to reading the (potentially cached) redirect values from the RMAP metadata.
  • the timestamp may be used to "synchronize" I/Os that might be using old RMAP redirect values when making restrictive updates to the RMAP metadata.
  • the timestamp is used in determining which I/Os need to drain by comparing the I/O's timestamp to a current timestamp value for those I/Os referencing the RMAP value being updated.
  • the timestamp may be used as an alternative to a usage count on each and every generation of RMAP redirect values.
  • timestamps may have an advantage of reducing the amount of memory used within an FP.
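  • A sketch of timestamp-based draining follows. It selects the in-flight upstream I/Os that may still be using an old redirect value, based on VSEG, BBA range and timestamp as described above. The record layout, the function name, and the choice of an inclusive timestamp comparison are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class UpstreamIO:
    vseg: str        # volume segment the I/O targets
    start_bba: int   # first block address touched
    length: int      # number of blocks touched
    timestamp: int   # taken before the redirect value was read


def ios_to_drain(in_flight: Iterable[UpstreamIO], vseg: str,
                 start_bba: int, length: int, update_timestamp: int) -> List[UpstreamIO]:
    """Return the I/Os that must complete before a restrictive update."""
    end_bba = start_bba + length
    return [
        io for io in in_flight
        if io.vseg == vseg
        and io.timestamp <= update_timestamp          # may have read the old value
        and io.start_bba < end_bba                    # ranges overlap
        and start_bba < io.start_bba + io.length
    ]
```

  • Only one timestamp per I/O is kept in this scheme, which is consistent with the memory advantage noted above relative to a usage count on every generation of redirect values.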
  • An embodiment of a system may utilize many oplocks.
  • the RMAP metadata described above is an oplock protecting each array element of each VSEG's metadata RMAP state.
  • Oplocks may also be used for other "lba range" specific functions, like mirror write serialization oplocks, as well as oplocks protecting various fields of LV, VSEG, and SD metadata state.
  • Although oplocks may be used to protect metadata state associated with DVE objects, oplocks may also be used in connection with other data objects, such as in the case of the mirror write serialization oplocks as an "lba range" mutual exclusion access mechanism for concurrent mirror writes.
  • the acquired oplock and metadata are not affected as part of the recovery process of the failed node. If the node that has acquired the oplock fails, the recovery processing steps taken depend on whether the oplock is volatile or non-volatile.
  • In the case of a volatile oplock, the oplock is implicitly released and some other node can immediately acquire it. This presumes that the failed node (that previously owned the oplock) needs no other cleanup. In the case of a non-volatile oplock, the failed node may have left the collective system in a state that needs cleaning up. When the failed node is subsequently evicted from the cluster, the recovering node performs cleanup prior to explicitly releasing the non-volatile oplock. Additional processing steps as may be performed by a recovering node are described elsewhere herein in more detail.
  • volatile oplocks are released not by an explicit "release" message being broadcast, but rather in that another node is now free to request and acquire the oplock when a current owner no longer refuses another's request to acquire the lock.
  • a message may be broadcast when a node that has acquired the oplock is done in order to signal other nodes that they may now attempt to acquire the oplock and also obtain a new copy of the data associated with the oplock.
  • an embodiment may not broadcast a cluster-wide message when a node that has acquired the lock is done. However, an embodiment may choose to broadcast such a message as a way to notify other nodes that they may try to acquire the oplock.
  • an embodiment may have other nodes awaken themselves and retry at predetermined time intervals.
  • a single DVE at a time may make changes to particular shared objects. The other DVEs may pause I/Os to the affected objects waiting for the single DVE to complete its metadata updates, at which time the "following" DVEs will reload the affected objects, in a restrictive or unrestrictive fashion.
  • Oplock broadcast messages may be used in performing a DVE cluster node eviction.
  • when a DVE broadcasts an oplock request (join, share, or acquire) to a set of peers, and one or more of the peers do not respond, those peers may be "evicted" from the DVE cluster.
  • An embodiment may use other cluster techniques, such as quorum rules for performing an operation. It should be noted that when a node is evicted, the evicting (or "recovering") node becomes the caretaker of the evicted node's cluster resources. If a cascaded eviction occurs, the evicting node may become caretaker of the evicted node's resources and also any nodes evicted, directly or indirectly, by the evicted node.
  • Oplocks as described herein may be volatile or non-volatile.
  • with volatile oplocks, when the node owning the oplock dies, the oplock is implicitly released, since an oplock is only owned by virtue of the owning node defending the oplock against other peer nodes' "share" or "acquire" requests with a negative response.
  • Non-volatile oplocks behave exactly like volatile ones, except a) their ownership records are recorded in a journal (for performance) backed by metadata, and b) the most recent owner of an oplock is always considered a member of the "join set", and hence is always included in subsequent requests to "share" the oplock.
  • an evicting node defends the non-volatile oplock while the evicting node is cleaning up for the evicted node.
  • Eviction attempts of a given node may be globally serialized, and if two nodes attempt to evict the same other node, only one of them actually performs eviction steps and performs any clean-up needed before the other evicted node can re-attempt its oplock broadcast. If an oplock broadcast results in an eviction, the broadcast processing may be retried from the beginning.
  • the oplock state hierarchy described elsewhere herein may minimize inter-node coherency traffic in the performance path.
  • These tiers correspond to, for example, "joining" an oplock, "sharing" an oplock (for caching), and "acquiring" an oplock (for update).
  • Elevating from each tier to the next tier requires a broadcast message.
  • the set of recipients at each level is always a subset of the set of recipients at the previous level. In the ideal case, the "shared" to "acquired” transition will require no inter-node coherency traffic at all.
  • a node can "join” (express interest in potentially later sharing) an oplock with a broadcast to the entire set of potential peers. The response from each peer to the broadcast indicates if the peer "cares" about the join. A join may occur only at boot/configure time. Typically, a node "joins" oplocks for all VSEGS that it has configured. It should be noted that the "working set" of VSEGS for a node may be pre-configured at boot time with additional VSEGS configured on-demand, such as in connection with a first I/O operation to a particular VSEG's LV.
  • a node can "share” (express interest in potentially later acquiring) an oplock with a broadcast to the set of nodes that want to know, such as those nodes in a "join” state.
  • a node can "acquire" an oplock with a broadcast message to the set of nodes that are currently sharing the oplock. In the ideal case, this is just the node itself, so no inter-node coherency traffic is required. For example, consider a pair of hosts in a cluster accessing LVs through a pair of DVEs. Each host has a single path to one of the DVEs. One host accesses the LV while the other is waiting. This means that the active host's DVE shares the oplock (is in the sharing state). The other host's DVE is in the join state since it is servicing no I/O operations.
  • the active host's DVE can then "acquire" the associated lock without talking to the passive host's DVE, since the broadcast to elevate to the "acquired" tier is only made to the set of nodes currently "sharing" the oplock.
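  • The narrowing set of broadcast recipients at each tier can be sketched as below. The state names follow the description above, while the ranking table, the function name, and the DVE names in the usage note are assumptions introduced for illustration.

```python
# Higher rank means the node has elevated further up the oplock tiers.
RANK = {"uninterested": 0, "joined": 1, "sharing": 2, "acquired": 3}

# Minimum rank a peer must hold to receive each kind of request.
REQUIRED = {"join": 0, "share": 1, "acquire": 2}


def recipients(request: str, sender: str, states: dict) -> list:
    """Return the peers that a join, share, or acquire broadcast is sent to.

    `states` maps each potential peer name to its current oplock state.
    """
    need = REQUIRED[request]
    return sorted(
        name for name, state in states.items()
        if name != sender and RANK[state] >= need
    )
```

  • In the two-host example above, if the active host's DVE is sharing and the passive host's DVE is only joined, an acquire by the active DVE has an empty recipient list, so the elevation to the acquired tier involves no inter-node coherency traffic.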
  • Oplocks are volatile unless otherwise specified.
  • an oplock may be used as a locking mechanism to synchronize access to associated data, which may be volatile or non-volatile. Additionally, an oplock may not be used to synchronize access of any particular piece of data, such as the migration thread (task set) oplock above.
  • the oplock's function may be characterized as a flag used in process or thread restart, for example, in the event that a DVE fails. All of the foregoing metadata associated with the oplocks in the table is non-volatile metadata in this embodiment except for the Mirror Write serialization, Group Atomic Operations and Migration Thread oplocks. The former two are volatile metadata in this embodiment. Other embodiments may have other metadata characterized as volatile or non-volatile in accordance with the requirements of each embodiment.
  • an embodiment may include different oplocks than as described above in accordance with each particular embodiment.
  • the specific reference above regarding how the oplock may be addressed may vary with the oplock data structure included in each embodiment.
  • the particular oplock structure referenced in the foregoing table is described elsewhere herein in more detail.
  • an age list may be used in recording differences in mirror sides and used in connection with resynchronizing a mirror side brought back on-line.
  • the value stored in an age list may be referred to as a generation number.
  • a DVE identifier corresponding to the DVE which updated the age list may be stored.
  • the particular generation number may be valid only when associated with that particular DVE.
  • the current DVE generation number is incremented whenever a mirror side state change occurs.
  • the Rmap values for the remaining live sides of the mirror, i.e., for the mirror's VSEG
  • the live mirror side's age list is updated to the current (new) generation number, and then the Rmap value for the faulted extent is updated to allow subsequent writes without faulting
  • only a single DVE may update a specific LBA range of a mirror at a time.
  • This embodiment allows only one DVE to write to a given range of a mirror, and further only one FP within that single DVE.
  • the DVE updates its per-DVE current generation number.
  • the global copy of the RMAP metadata may be set to fault-on-write for all extents if the embodiment also supports fast resynchronization, as described elsewhere herein.
  • Mirror side state changes and the use of the associated oplock will now be described.
  • a mirror side state change from "alive" to "dead" may be initiated by the notification of a failed write I/O to a mirrored side device.
  • a failed read need not technically change the state of the mirror, but an embodiment may prevent other unsuccessful reads when a failure of a first read has been determined.
  • the CP indicates to the FP which mirror sides may be read from and which ones may be written to. This information may be included, for example, in the storage redirect table.
  • the state change from "alive" to "dead" for a mirror side is completed before upstream status can be returned for the failed write I/O.
  • the DVE that detects a failed write I/O may acquire the mirror side state change oplock. If, upon acquiring the oplock, it finds that the mirror side has already been declared "dead" by some other DVE, then this is a "false alarm", so it reloads the metadata for the mirror side, releases the oplock, and continues.
  • the metadata which is reloaded may be characterized as storage descriptor metadata that describes which mirror sides are "dead" or inactive, and which are "alive" or active. It should be noted that, in this case, another DVE has already declared the particular mirror side as "dead" and has already completed the appropriate processing steps.
  • alternatively, the DVE that detects a failed write I/O may find that the mirror side is still "alive".
  • in that case, the DVE performs steps in connection with declaring the mirror side "dead" or off-line.
  • all of the "other" DVEs have been notified to pause I/Os to the virtual volume segment or VSEG, unload the mirror side metadata state information from all cache copies, and then wait on resharing the oplock, reloading the metadata, and unpausing I/O operations to the mirror side.
  • the DVE which acquired the oplock also pauses I/O operations to the mirror side and unloads all copies of associated metadata.
  • the DVE that has acquired the lock then increments the generation number and sets the RMAP for the live sides of the mirror to fault-on-write, so that new writes are intercepted, fault to the CP, and record the fact that the dead mirror side is now out-of-date. This may include making a copy of the current "age list" for the dead mirror side, if one was not currently being maintained.
  • the DVE that has acquired the oplock marks the mirror side as "dead" by, for example, updating metadata included in the storage redirect table. The oplock may then be released, and the operations may continue using the new state information from the storage descriptor as pointed to by the redirect table entries described herein. Upstream status for the failed write I/O may then be returned.
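  • The sequence for handling a failed write to a mirror side can be summarized in a single-node simulation. Everything here is a simplified assumption: the oplock, the peers and the metadata are reduced to an in-memory record with a log of the steps taken, and the names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MirrorState:
    sides: dict                                        # side name -> "alive" or "dead"
    generation: int = 0                                # per-DVE generation number
    fault_on_write: set = field(default_factory=set)   # live sides set to fault-on-write
    log: list = field(default_factory=list)            # steps taken, for inspection


def declare_side_dead(mirror: MirrorState, failed_side: str) -> bool:
    """Handle a failed write; return True if this call changed the state."""
    mirror.log.append("acquire oplock")
    if mirror.sides[failed_side] == "dead":
        # False alarm: some other DVE already completed the state change.
        mirror.log.append("reload metadata")
        mirror.log.append("release oplock")
        return False
    mirror.log.append("pause I/O and unload cached metadata")
    mirror.generation += 1
    # New writes to the surviving sides now fault to the CP.
    mirror.fault_on_write = {
        side for side, state in mirror.sides.items()
        if state == "alive" and side != failed_side
    }
    mirror.sides[failed_side] = "dead"
    mirror.log.append("release oplock")
    return True
```

  • Calling the function a second time for the same side takes the false-alarm path, leaving the generation number unchanged.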
  • Mirror Write Serialization and Reconciliation
  • this embodiment utilizes a non-volatile "mirrored write" oplock covering each extent (fixed size) of each mirrored VSEG's lba range. Sharing this oplock gives a DVE "write authority" for that extent of the mirrored VSEG's lba range.
  • Intra-DVE locks are used to distribute write authority further among the FP's that may be associated with each CP.
  • When a DVE wants to write to an extent of the mirrored VSEG, the DVE acquires the oplock, thereby revoking sharing authority from any peer DVEs. It releases the lock immediately (still sharing it), and thereby implicitly keeps write authority until another DVE has acquired the associated lock.
  • a DVE "shares", "acquires" and "releases" before it can assume it has write authority.
  • if the DVE goes off-line while writing the extent (or more precisely, while holding the lock giving it write authority over the extent), the records of the non-volatile oplock ownership are in the DVE's journal.
  • the evicting DVE performs mirror reconciliation at failover time by copying from one side to the other mirror sides, and then releases the oplock.
  • non-volatile oplocks, including the mirrored write oplocks, are swept or unshared so that a DVE only has records of owning oplocks for extents that were recently written, for example, in the last minute.
  • Ownership records for non-volatile oplocks may be updated by the owning node at acquire time. The ownership information may change, for example, when the lock is unshared as by the sweep process, or acquired by another node.
  • RMAP state change in connection with the Rmap metadata and associated oplocks is described elsewhere herein.
  • a redirect entry is created before any RMAP entries or Rmap values reference a particular redirect entry.
  • In addition, RMAP values can only be changed by a fault handler in the CP. A redirect entry cannot be freed until there are no RMAP value references to the particular redirect entry.
  • a pause/reload technique may be used in connection with the redirect entries.
  • the system is in steady state, and all DVEs are sharing the Redirs oplock.
  • when a high-level function, such as a snapshot or migration thread, needs to create a new redirect entry, the function acquires the oplock.
  • all of the "other" DVEs pause I/Os to the VSEG and then wait on resharing the oplock, reloading the redirs, and unpausing the VSEG.
  • the local DVE (which acquired the oplock) similarly pauses its I/O operations.
  • the acquiring DVE updates the redirs metadata and releases the oplock. Operations may be resumed using the new updated information.
  • Metadata may also be maintained in connection with each LV, in which a list of hosts currently allowed to access the LV is associated with each LV. This involves using the previously listed SCSI reserve oplock. Reservation conflicts may be handled when an I/O is faulted to the CP and also in the FP.
  • a LUN Masking FP API may be used to indicate to an FP using a mask which hosts hold a reservation to perform fastpath I/O. Other hosts' I/Os fault to the CP.
  • any host may issue a SCSI "reserve" command to a disk to indicate that it wants to access the disk and to prevent any other host from accessing the disk. Once this has been done, if another host tries to access the disk, the other host receives a special return error status which may be referred to as a "reservation conflict", indicating that the other host's request to access the disk is denied.
  • when a reserve or release command is received, the command, if successful, changes the reservation state of the LV.
  • when a reserve or release command is received, such as in connection with a SCSI device, an intra-DVE lock may also be acquired to ensure mutual exclusion within a DVE.
  • a device module, such as a SCSI device module, acquires the SCSI Reserve oplock for the LV.
  • a Unit Attention condition In a clustered environment, where multiple DVEs are may be accessing the same LV, when the LV experiences a Unit Attention condition. An embodiment may receive this condition, for example, if removable medium has been changed on a device. Others accessing this LV may be notified accordingly since, for example, the previously sent write I/O may be meant for another piece of media that was removed.
  • the metadata is a list of associated nodes that are notified upon the occunence of such a condition.
  • An embodiment may respond with Check Condition/Unit Attention to only the first I/O from each initiator to the LV, regardless of which DVE the I/O was subsequently .processed by.
  • the list of hosts indicated by the associated metadata indicates which nodes are notified and subsequently, the host is removed from the list. The list may be initially the set of logged in hosts at the time of the condition.
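The per-initiator notification list described above can be illustrated with a short sketch. The class name `UnitAttentionList`, its methods, and the status strings are hypothetical; in the described system the list is shared metadata guarded by an oplock, which this single-process sketch does not model.

```python
class UnitAttentionList:
    """Tracks which hosts still need a Unit Attention response for one LV."""

    def __init__(self, logged_in_hosts):
        self.logged_in = set(logged_in_hosts)
        self.pending = set()  # hosts not yet notified of the condition

    def raise_condition(self):
        # The list starts as the set of hosts logged in at the time
        # of the condition.
        self.pending = set(self.logged_in)

    def status_for_io(self, host):
        # Only the first I/O from each initiator receives the
        # Check Condition; the host is then removed from the list.
        if host in self.pending:
            self.pending.discard(host)
            return "CHECK_CONDITION/UNIT_ATTENTION"
        return "GOOD"
```

Because the list is keyed by initiator rather than by DVE, the same host gets exactly one notification no matter which DVE handles its next I/O.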
  • an associated oplock may be used in connection with handling this and other group atomic operations.
  • Group atomic operations may be characterized as a set of operations that are perceived as occurring atomically. For example, taking a snapshot of a set of devices associated with a database may be perceived as an atomic operation by pausing I/O operations to the devices, taking a snap shot of each device and then restarting or resuming I/O operations.
  • The following may be performed: a) pause the corresponding LVs at the FPs and CPs of all DVEs (this does not necessarily imply waiting for already issued I/Os to drain, except as required by snapshot processing described elsewhere herein); b) take the snapshot; c) resume the I/O operations to the LVs.
  • If a DVE fails while the LVs are paused, it is the responsibility of the recovering DVE (which evicted the failed DVE) to continue the operation and resume the LVs.
  • The LVs are not available during this time (however long it takes to detect that the previous DVE had failed).
  • A DVE may first acquire some other non-volatile oplock (typically for the taskset) before entering into the a), b), c) sequence above.
  • This oplock may be used, for example, when adding a VSEG to an LV (growing the volume).
  • This oplock may also be used, for example, to split a VSEG in two (not changing the volume, but changing the number of VSEGs) or to merge a VSEG. All of the foregoing may be coordinated between DVEs using this oplock. In other words, a DVE may acquire this lock when performing one of these or possibly other operations in connection with an LV when the set of VSEGs associated with the LV is being updated. Like the group atomic operations described elsewhere herein, there is an associated taskset that is non-volatile. If a DVE fails in the middle of the operation after acquiring this lock, the processing is performed by the recovering DVE. As part of the LV reconfiguration processing, I/O operations to the LV are paused as well.
  • The former two may be implemented as volatile oplocks and the latter oplock as a non-volatile oplock.
  • The former two oplocks are acquired as part of the taskset oplock and are accordingly reacquired and re-released in connection with a failed DVE.
  • an embodiment may have a migration thread and others as described elsewhere herein executing on each DVE node, for example, when performing clustered migrations.
  • An oplock may be associated with a taskset that includes a migration thread. The oplock is non-volatile such that if a node goes off-line, another node that detects the failure takes over the failed node's migration process.
  • An embodiment may associate a single nonvolatile oplock with a taskset, and the node with the thread running acquires the oplock.
  • The evicting node restarts the migration thread and possibly others associated with the taskset when performing cleanup for the dead node (since it will find that oplock in the dead node's journal).
  • the migration thread and others included in the taskset are able to execute on any node.
  • Other embodiments may require that a migration thread execute on a particular DVE and accordingly may require modifications that may vary from what is described herein.
  • A taskset may be referred to as a set of tasks to be performed.
  • the set of tasks may include, for example, relatively “quick” or short tasks, such as a snapshot, or relatively “slow” tasks, such as migration which may take hours to complete.
  • A DVE may acquire a taskset oplock (non-volatile) when a taskset is started, and release the associated oplock (and unshare and leave) when the taskset has completed. If the DVE goes off-line prior to this, a recovering DVE continues the taskset, such as by performing an ongoing migration or completing a partially completed group atomic operation. It should be noted that taskset oplocks are not contended for; that is, no two DVEs start the same taskset except during failover recovery when one of the DVEs is off-line.
  • FIG. 28 shows an example of an embodiment 750 that includes two DVEs that handle an I/O request from a host.
  • the example shown in the illustration 750 is a simplified view of how different DVEs may access physical devices.
  • the actual mapping mechanism is not shown as part of the DVE accessing particular physical device.
  • The details within a particular DVE, such as whether there are one or more CPs and FPs, are not shown in detail.
  • The oplock mechanism for modifying global data will be explained at the level of inter-DVE communication. It is assumed that each particular DVE monitors all intra-DVE communication and accesses in order to synchronize access to any type of data.
  • Each of DVE 1 and DVE 2 has a copy of the same RMAPs in connection with a V and a V Snap (snapshot) device, as described elsewhere herein in connection with performing a copy-on-write snapshot operation.
  • Both the RMAP and the storage redirect tables may be modified in connection with performing a snapshot operation.
  • The RMAPs that are in the metadata use fixed length extents, and those RMAPs which are in memory, for example, within the DVEs corresponding to the V and the V Snap devices, use variable length extents, as also described elsewhere herein.
  • the host issues both a read and a write operation simultaneously.
  • The host writes to a part of the virtual volume through DVE 1, which ends up faulting to a CP within DVE 1 since a snapshot is being performed for two virtual devices using physical devices P1 and P2.
  • the host issues an I/O read request to DVE2 to the same portion of a physical device.
  • the FP of DVE2 may be used to do the read operation.
  • DVE1 issues a message to acquire the corresponding oplock associated with the particular RMAP portions for the extent associated with the I/O write operation.
  • DVE1 may send a point-to-point message to each DVE indicated as sharing this particular portion of the global metadata, using its local share list. Essentially, DVE1 is asking permission to acquire the lock for the particular metadata portions or RMAP portions it needs in order to perform its modifications on the metadata.
  • The acquire message that is sent to DVE2 is also a request for DVE2 to invalidate its corresponding portion of the RMAP and its cache, as well as to synchronize any other references to that particular RMAP portion in the CP and FP portions included within DVE2.
  • In response to receiving the acquire message, DVE2 purges the requested RMAP portions, including those within the CP and the FP. When all of the portions or copies have been purged within DVE2, DVE2 then sends DVE1 a message acknowledging that DVE1 may acquire the lock and update the metadata. In connection with performing a snapshot operation, DVE1 performs a write operation to update portions on P1 and P2 corresponding, respectively, to portions for V and V Snap.
  • A DVE releasing a lock may broadcast a message to all other nodes having a local copy or to those other nodes that have registered themselves as wanting to receive such notification. Upon receiving this release notification, a node may reshare and reread the updated data from the global storehouse described elsewhere herein. After DVE1 acquires the lock for a portion of the virtual device on P1, DVE1 also acquires the corresponding lock on V Snap which, in this instance, is device P2. Data is then pushed from physical device P1 to P2 if the global Rmap entry indicates a state of zero, meaning that the data has not yet been copied.
  • The Rmap in the global metadata for device P2 is then modified to reflect the state change, namely that the data has now been copied to the snapshot device P2.
  • DVE1 then also updates its internal portions which reference this particular Rmap location, such as within the CP and the FP.
  • DVE1 may now release the lock associated with the Rmap portion on device P2. Releasing the lock may mean that another DVE may now acquire the lock, rather than DVE1 issuing an explicit release lock message.
  • DVE1 waits for the reads to device P1 to drain and then changes portions of the global Rmap table of device P1 to have the appropriate redirect table entry indicating that the data has been pushed from P1 to P2. Note that this update is made to the global metadata.
  • DVE1 again may update any local copies of CP or FP data for this portion of the Rmap, and then DVE1 may release the lock, for example, by allowing another DVE to acquire the lock for the corresponding Rmap portion.
  • DVE1 attempts to acquire the oplock for P1, for example, by issuing an acquire message and receiving the appropriate acknowledgement back from the other DVEs.
  • DVE1 also attempts to acquire the corresponding lock on physical device P2.
  • A determination is made as to whether the data has already been pushed from device P1 to P2. If not, control proceeds to step 774 where the data is pushed from device P1 to P2.
  • Control proceeds to step 768 where the global metadata for the Rmap of P2 is updated to indicate that the data has been pushed, for example, by updating the particular Rmap entry index to be one.
  • Control proceeds to step 770 where it is determined if any reads to device P1 are in the process of being performed. If so, control proceeds to step 776 where DVE1 waits for the reads to device P1 to drain.
  • Control proceeds to step 772 where the global metadata Rmap portions for device P1 are updated to indicate that the data has been pushed to device P2.
  • RMAP values may be updated in a restrictive and an unrestrictive fashion as described elsewhere herein.
  • When making a restrictive update to an rmap value as described above, the DVE must wait for all I/Os that were issued using the old rmap value to drain. This must occur prior to making any subsequent changes to the system state that are dependent on the new rmap value.
  • For example, when faulting on a write to the source of a snapshot: a) the snapshot data is pushed, then b) the destination rmap is updated to reflect the new location of the data (and that writes are now allowed), and c) the source rmap is updated to reflect that the write is now allowed (to the original location of the data).
  • The steps are performed in the foregoing order. Additionally, after step b), and before proceeding to step c), there is a wait for any I/Os issued referencing the destination rmap to the old location to drain. Otherwise, at step c), writes may be allowed to data that the destination side of the snap is still reading, resulting in corruption.
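The ordering constraint in steps a) through c) can be made concrete with a small sketch. Everything here is an assumption made for illustration: the function name `handle_snapshot_write_fault`, the use of lists for device contents and rmap state (0 meaning "not yet pushed", 1 meaning "pushed"), and the `drain` callback standing in for the wait on outstanding I/Os. Lock acquisition is omitted.

```python
def handle_snapshot_write_fault(extent, source, dest, src_rmap, dst_rmap, drain):
    """Process a faulted write to the source of a snapshot, in order."""
    steps = []
    # a) Push the snapshot data if it has not been copied yet.
    if dst_rmap[extent] == 0:
        dest[extent] = source[extent]
        steps.append("push")
    # b) Update the destination rmap to point at the new location.
    dst_rmap[extent] = 1
    steps.append("update_dest_rmap")
    # Wait for I/Os that were issued using the old destination rmap
    # value; they may still be reading the source location.
    drain(extent)
    steps.append("drain")
    # c) Only now allow writes to the original location.
    src_rmap[extent] = 1
    steps.append("update_source_rmap")
    return steps
```

Placing the drain between b) and c) is the point of the sketch: if the source rmap were updated first, a write could overwrite data that a reader on the snapshot side is still fetching from the old location.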
  • the foregoing also applies on a larger scale in embodiments using a single DVE and a single CP execution thread, as well as multiple CP execution threads, such as migration threads and fault handler threads.
  • The lock for the portion of the global metadata on P2 may be released after step 768 and, similarly, the lock for the global metadata portion on device P1 may be released after step 772.
  • A lock is released not by the action of sending a particular message from a first DVE currently holding the lock to other DVEs. Rather, one of the other DVEs may now successfully acquire the lock in connection with the particular metadata portions from the first DVE. Any message sent in connection with a release operation is not an explicit release of the lock. Rather, it may serve as a signal to "wake up" other nodes that they may now attempt to acquire the lock and should accordingly obtain a fresh copy of the global data.
  • the volatile oplock may be automatically released in that now another node is free to acquire the lock.
  • An embodiment may have other nodes routinely retry to acquire the lock after a certain amount of time has passed from a prior acquisition attempt. Thus, the sending of the release message may be omitted from an embodiment, for example, if the other nodes attempt to retry to acquire a lock and otherwise obtain an updated global copy of the data.
  • the DVE may broadcast a message (e.g., asynchronously at a lower priority) so that other DVEs know in a timely manner that they may attempt to acquire the volatile oplock. Relying solely on timeouts for the retries may be not as efficient as the broadcast technique. However, in the instance where a DVE that has acquired a volatile oplock goes off-line, timeouts may be relied on for subsequent attempts to acquire the oplock since the volatile oplock is released when the DVE goes off-line using the technique of acquiring the lock by obtaining permission from all others in the shared state as described elsewhere herein.
  • DVE1 may be turned off or inaccessible, for example, in connection with a power failure. Assume that a host, for example, has not received an acknowledgement that a previously requested write operation has successfully completed. Subsequently, the host may retry the write operation if there is a time out and reissue the write request. If, for example, DVE1 has a power failure, all intra-DVE oplocks and volatile inter-DVE locks of DVE1 are released as they are volatile or non-persistent. However, non-volatile inter-DVE locks that have been acquired by DVE1 are still locked. Using these acquired inter-DVE locks, another DVE may perform "clean-up" operations in connection with DVE1.
  • Another DVE may be elected, as a member of the cluster, to clean up after another DVE, such as DVE1, that has failed.
  • The DVE performing the cleanup may be a predetermined cluster member, or it may be the first DVE that determines that DVE1 has failed and evicts the failed DVE from the cluster. This may vary in accordance with policies included in each embodiment.
  • The cleanup may be performed by using the list of inter-DVE non-volatile oplocks which DVE1 had acquired. If DVE2 is performing the cleanup of DVE1 upon DVE1 failing, DVE2 first inherits all of DVE1's non-volatile inter-DVE oplocks. DVE2 implicitly acquires each of the oplocks by inheriting those of the failed node. In other words, DVE2 acquires the locks without first asking and obtaining permission from all the other nodes. DVE2 is recorded as the owner in the ownership information for the non-volatile inter-DVE oplock. DVE2 now defends the implicitly acquired locks.
  • DVE2 examines the list of inter-DVE non-volatile oplocks and, for each non-volatile inter-DVE oplock owned by the failed DVE, completes the write, update of the global metadata, or other operation associated with the oplock. DVE2 then releases the implicitly acquired locks. Only non-volatile locks play a role in recovery operations as described above.
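A minimal sketch of this clean-up, assuming the oplock ownership information is modeled as a plain dictionary: the function name `clean_up_failed_dve`, the `"owner"` and `"volatile"` keys, and the `complete_operation` callback are all hypothetical and stand in for the journaled ownership records and the real recovery work.

```python
def clean_up_failed_dve(oplocks, failed, recoverer, complete_operation):
    """Recover after `failed` goes off-line; return the inherited lock names.

    oplocks maps a lock name to {"owner": str or None, "volatile": bool}.
    """
    inherited = []
    for name, lock in oplocks.items():
        if lock["owner"] != failed:
            continue
        if lock["volatile"]:
            # Volatile oplocks are simply gone once the holder is off-line.
            lock["owner"] = None
        else:
            # Non-volatile oplocks are inherited without asking the
            # other nodes for permission.
            lock["owner"] = recoverer
            inherited.append(name)
    for name in inherited:
        # Finish the write or metadata update the lock was protecting,
        # then release by clearing the ownership information.
        complete_operation(name)
        oplocks[name]["owner"] = None
    return inherited
```

All locks are inherited before any operation is completed, so in this sketch the recovering DVE "defends" every inherited lock for the whole clean-up rather than one at a time.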
  • a DVE that is returning to service after a failure performs operations similar to those in connection with a DVE coming online initially; the DVE joins the cluster.
  • When a DVE starts up or boots up, such as initially or subsequent to going off-line, the DVE performs certain steps that may be represented as follows for DVE A coming on-line:
  • DVE A waits a predetermined time period to acquire its own journal
  • If DVE A's journal indicates a "dirty" shutdown with tasks that were in progress, replay the corresponding journal entries for those tasks. 4. If any of the journals that DVE A inherited also show a dirty shutdown, replay the corresponding journal entries for those tasks.
  • a first DVE sends a message to a second DVE that never responds
  • the first DVE evicts the peer from the cluster.
  • the first DVE acquires all of the second DVE's journals, locks, etc., and performs clean-up operations, including processing of steps 4 and 5 above.
  • the first DVE does this clean-up while continuing itself to run on-line as a member of the cluster.
  • the evicting DVE inherits responsibility for all journals of the DVE that has been evicted. This may be characterized as a cascading eviction. For example, A evicts B and A goes off-line before cleaning up B.
  • C then evicts A and performs clean-up operations for both A and B.
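The cascading eviction above amounts to transferring every journal held by the evicted DVE, including journals it had itself inherited. A small sketch, assuming a hypothetical `journals_by_dve` dictionary and `evict` function:

```python
def evict(journals_by_dve, evictor, evicted):
    """Move every journal held by `evicted` (own and inherited) to `evictor`."""
    inherited = journals_by_dve.pop(evicted, [])
    journals_by_dve.setdefault(evictor, []).extend(inherited)
    return inherited
```

With A, B, and C each starting with one journal, `evict(j, "A", "B")` followed by `evict(j, "C", "A")` leaves C responsible for all three journals.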
  • journals record important operations in progress, such as write operations, that either may be re-issued or "undone" in the event that the DVE performing them goes off-line.
  • Any mirror reconciliation is also performed. If there was a mirroring operation being performed, only a portion of the mirroring operation or update may have been performed. For example, there may be two mirroring devices, M1 and M2. In connection with performing a write operation, DVE1 may update mirror device M1 but fail prior to completing the write to device M2. When DVE1 fails, if DVE1 was potentially writing to a mirror, there may be a need to reconcile the mirroring devices such that the data on the mirroring devices is coherent.
  • When performing a mirroring operation, a DVE acquires the necessary locks, such as the inter-DVE non-volatile oplocks, in order to write to all mirroring devices. Only one FP is allowed to write at a time to a particular mirror or mirroring device. The locks are acquired and held until another node issues a request to obtain the lock.
  • An embodiment may include a sweeping process that runs on each of the DVEs. The sweeping process may be executed, for example, each minute to release the inter-DVE non-volatile oplocks. The DVE may reacquire the locks as needed. As described elsewhere herein, a non-volatile inter-DVE oplock may be released by clearing the ownership information.
  • A DVE returning to service may perform recovery operations for mirrored devices, for example, such that the DVE coming on line may be brought up to date with the operations that have occurred while it was offline or out of service.
  • An embodiment may not want to reconcile the entire volume or device for all mirroring devices.
  • A fast reconciliation may be desirable by only copying those portions that have changed.
  • Reconciliation occurs when a DVE fails, and uses non-volatile oplocks.
  • Resynchronization occurs when a mirror side fails and comes back on line.
  • Age lists may be used in performing the resynchronization operation when a mirror side comes back on-line.
  • An embodiment may include and utilize an age list in connection with performing a fast resynchronization for failed mirroring devices brought back on-line. An example of a failed write in connection with a mirrored device will now be described.
  • A host initiates a write request to a DVE which causes multiple downstream write I/O requests to a plurality of mirror devices, Ma and Mb.
  • Mb goes offline due to a device failure.
  • The copy of Mb's data needs to be resynchronized with the other mirror devices, such as Ma.
  • Age lists provide for a fast resynchronization of data on the mirror devices by copying from Ma to Mb only those portions that are out of date because Mb was off-line.
  • Each of the mirroring devices has an associated age list that includes fixed size extents in metadata.
  • The agelist may be stored as inter-DVE metadata with associated locks in persistent storage.
  • The agelist remains the same. Initially, all elements of the agelist are assigned the current age.
  • A DVE has the concept of a current age counter which may be initially 0. This counter is used in connection with indicating an age of the mirror data.
  • When a mirror device such as Mb goes off-line, the DVE modifies the Rmap entries of the associated down mirror device to cause a fault to the CP when there is a write operation.
  • The DVE obtains the necessary intra-DVE and inter-DVE locks to modify the Rmap table to indicate a different redirect table entry causing a CP fault on a write operation to the mirror device.
  • The CP updates the agelist entry or entries corresponding to the address for the particular write operation to be the updated current age, which is 1 in this instance.
  • When Mb comes back on-line, all of Mb's extent portions having a corresponding agelist entry not equal to 0 are updated by migrating data from Ma to Mb. This may be done using the migration thread, for example, to push data from Ma to Mb for each entry in the agelist not equal to 0.
  • The DVE performing the clean-up must update all extents on the mirror side whose age list generation numbers do not match the generation number of the live side mirror.
  • The age list generation numbers may be maintained on a per-volume basis rather than a per-side/device basis. It should be noted that the agelist metadata may be associated with fixed size extent portions of a mirrored storage device.
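The age-list comparison used for fast resynchronization can be sketched as below. This is an illustration under simplifying assumptions: each mirror side is a list with one element per fixed-size extent, each age list is a parallel list of integers, and the names `record_write` and `fast_resync` are hypothetical. Locking and the migration thread are not modeled.

```python
def record_write(age_list, extent, current_age):
    """CP fault handler: stamp the written extent with the current age."""
    age_list[extent] = current_age


def fast_resync(live_data, live_ages, stale_data, stale_ages):
    """Copy only the extents whose ages differ; return their indices."""
    copied = []
    for extent, age in enumerate(live_ages):
        if age != stale_ages[extent]:
            stale_data[extent] = live_data[extent]
            stale_ages[extent] = age
            copied.append(extent)
    return copied
```

For example, if Mb goes off-line at age 0, the age counter becomes 1, and only extent 2 of Ma is written while Mb is down, then `fast_resync` copies extent 2 alone instead of the whole device.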
  • Reads to a mirror may be load balanced round-robin between the mirror sides with the best load balancing priority, as may be specified in the mapping table.
  • This may be implemented in an embodiment using the storage redirect table by maintaining an index of the last mirror side to receive an operation. The index may be incremented to indicate the next mirror to use for the next I/O operation. When the index reaches the number of mirror sides, the index may be reset to indicate the first mirror side.
  • Other embodiments may use other techniques to implement load balancing. This technique allows the CP to specify and modify which mirror sides are remote and, accordingly, have a high "cost" to use, and also whether any form of round-robin or other balancing technique is appropriate.
  • If each mirror side is given a unique value or cost, then the lowest cost mirror side may be selected.
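The read balancing just described can be sketched as a round-robin over the lowest-cost mirror sides. The class name `MirrorReadBalancer` and the integer cost values are assumptions for illustration; the source keeps the index in the storage redirect table, which is not modeled here.

```python
class MirrorReadBalancer:
    """Round-robin over the mirror sides that share the lowest cost."""

    def __init__(self, costs):
        best = min(costs)
        # Indices of the mirror sides with the best (lowest) cost.
        self.candidates = [i for i, c in enumerate(costs) if c == best]
        self.index = 0

    def next_side(self):
        side = self.candidates[self.index]
        self.index += 1
        # Wrap around once the index reaches the number of candidates.
        if self.index == len(self.candidates):
            self.index = 0
        return side
```

When every side has a unique cost the candidate list has one element, so the balancer degenerates to always selecting the lowest-cost side, matching the last case in the text.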
  • Mirroring operations may be implemented using the write splitting functionality, such as the write splitting primitive, described elsewhere herein.
  • The CP may implement locking for shared mirrors, which may be accessed by multiple FPs, such that only one FP is enabled for write at any given time for an extent of a virtual volume included in a mirror.
  • The locks for shared mirrors may be implemented as inter-DVE oplocks described elsewhere herein in more detail.
  • reads to an extent are not synchronized with writes.
  • the reads may return old data, new data, or a combination of old and new on a block-by-block basis.
  • The combination or mix of data returned for a read may change over time so that two concurrent reads to the same portion may return different data if there are also outstanding writes completing as data is being read.
  • another DVE may remove the failed DVE from the cluster, such as described elsewhere herein.
  • the recovering DVE may assume ownership of all the failed DVE's inter-DVE oplocks in addition to its journals.
  • The failed DVE may also have outstanding writes, which may result in a mirror being out of synchronization with other sides of the same mirror.
  • Order of completion in this embodiment is unspecified and read return data is unspecified. If a write was outstanding to a mirror device on the failed DVE, the requester may eventually time out and/or abort the write, and reissue the write. The write operation may then be blocked since the recovering DVE cleaning up after the failed DVE blocks writes to the mirror until the mirrors are resynchronized. However, writes may be allowed.
  • FIG. 30 shows an example of an embodiment of a device V that has two corresponding mirror devices P1 and P2. Initially, both P1 and P2 are on-line, read and write operations are allowed to the devices, and both mirror sides are up-to-date. Assume P2 fails. This initial state is shown in Figure 30.
  • the CP marks P2 as DEAD status/offline and updates the generation number to "n+1", as indicated by 802.
  • The CP then updates P1's age list to indicate, using the new generation number, that P1 has newer data for the extent just faulted on when writing, as indicated by 804.
  • The CP marks all the Rmap entries, except the one just faulted on as indicated by 806, to indicate that a resynchronization is to be performed if the mirror side is subsequently brought back online.
  • the CP then allows the write operation to complete.
  • The penalty for supporting fast resynchronization is that the first write to an extent following a mirror side state change of either ALIVE to DEAD, or DEAD to RESYNC, causes a fault to the CP, with other writes using the FP. Later, when a write is made to an extent after P2 has been declared DEAD or offline, as above, there is a fault to the CP and P1's age lists are updated to reflect the new dirty regions.
  • Fast resynchronization may be implemented by the CP by comparing age lists when P2 is brought back online. Fast resynchronization involves resynchronizing the mirrors to have the same set of data.
  • The Resynch state of P2 allows P2 to participate in write splitting without being involved in processing read operations until P2 is brought back on-line.
  • An age map may be used in synchronous mirroring as described in connection with, for example, Figures 30 and 31.
  • The relative age of extents of various mirror sides may be recorded. If one mirror side is offline and operations are performed to other mirror sides, when the down mirror side is brought back on-line, it is resynchronized with the other mirror sides. In one embodiment, this may be performed using the technique described herein, in which only the extents that are out of date are copied. This may also be referred to as a fast resynchronization.
  • A new generation number may be assigned to the age maps. The current generation number is incremented whenever any mirror side changes state. Subsequently, the first write to the remaining mirror sides is intercepted and the age map is updated to indicate that the remaining mirrors have been modified relative to the offline mirror side.
  • Fast reconciliation involves reconciling shared metadata using the inter-DVE oplocks held by the DVE to extents for which writes may have been outstanding when the DVE failed.
  • the DVE performing the cleanup of the failed DVE inherits the failed DVE's inter-DVE oplocks and therefore knows which extents are suspects for reconciliation.
  • the inter-DVE oplocks may actually be implemented so as to journal their state sequentially to media, like traditional DRL, while offering significantly more flexibility at failover time.
  • the inter-DVE locking techniques that may be used in an embodiment are described elsewhere herein.
  • Each DVE may have its own non-volatile oplock journal. Additionally, a data journal may be maintained for each FP. It should be noted that the per-DVE non-volatile oplock journal and the per-FP data journals are maintained independently of one another. The non-volatile oplock journals may be used in connection with performing recovery operations for a failed DVE. The data journal of an FP may be used for asynchronous ordered replication.
  • asynchronous I/O operations are recorded in the journal and then to the actual device.
  • Inbound I/Os for each FP may be paused at discrete points in time, such as every minute or other time interval in accordance with system parameters, such as incoming I/O rate, bandwidth, and the like.
  • Existing journals for each FP may then be swapped out and inbound I/Os then resumed. I/O operations subsequent to the resume may be redirected to a new journal. Meanwhile, there is a wait for the existing journal I/O operations to commit to the existing FP journals.
  • FIG. 32 shows an example of an embodiment 850 in connection with performing an asynchronous replication operation for FP journalling as described above.
  • There are some aspects (such as write serialization at the FP) similar to those previously described in connection with mirroring, in that writes to virtual device V are split to two physical devices P1 and P2.
  • Writes to P1 are delivered natively, that is, writes are performed on P1.
  • the write entries are journalled to journal entries.
  • Each entry as shown in 852 has a header indicating where the write entry is supposed to go, such as the logical block address.
  • a database for example, may be implemented on more than one volume and may involve multiple servers.
  • the foregoing may be used as an alternative, for example, to ordering all I/O operations through a central point that may become a bottleneck in performance.
  • the foregoing techniques may be used to provide synchronization at discrete points in time that may be selected in accordance with parameters that may vary with each embodiment to minimize any negative performance impact.
  • The foregoing asynchronous mirroring may be implemented using the write splitting and write journaling primitives described elsewhere herein.
  • the FP may synchronously split writes to a private journal using a private index as described in connection with Figure 33.
  • control is passed or faulted to the CP which exchanges a new, empty journal for the old journal.
  • the CP may then copy the journal contents to a remote location using an asynchronous copy agent. It should be noted that in one embodiment, data from the journal is not being moved through the CP.
  • Multiple journals may be synchronized periodically. Multiple journals may occur, for example, with multiple volumes, or multiple FPs or DVEs to the same volume. The multiple journals may be synchronized by revoking mapping entries for all journals and waiting for downstream I/O operations to the journals from the FP to complete. The journals may then be swapped out and copied to a remote location followed by a synchronization barrier. The copy agent on the remote side knows that the remote image set is only valid when a barrier is reached. In one embodiment, journals may be implemented per-DVE or per-FP such that no DVEs and FPs communicate with each other to do journalling. Otherwise, there may be performance penalties. Synchronization may be performed at discrete points in time that are predefined as described elsewhere herein.
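The journal swap and synchronization barrier can be sketched as follows. The names `FpJournal`, `sync_point`, `valid_image_entries`, and the `BARRIER` sentinel are hypothetical, the remote location is modeled as a Python list, and the pause of inbound I/O and the revocation of mapping entries are assumed to have happened before `sync_point` is called.

```python
BARRIER = object()  # sentinel marking a synchronization barrier


class FpJournal:
    """Private journal of one FP; entries are (lba, data) pairs."""

    def __init__(self):
        self.active = []

    def write(self, lba, data):
        self.active.append((lba, data))

    def swap(self):
        # Exchange a new, empty journal for the old one.
        old, self.active = self.active, []
        return old


def sync_point(journals, remote_log):
    """Swap out every journal, copy the old ones, then emit a barrier."""
    swapped = [journal.swap() for journal in journals]
    for entries in swapped:
        remote_log.extend(entries)
    remote_log.append(BARRIER)


def valid_image_entries(remote_log):
    """Entries the remote side may apply: everything up to the last barrier."""
    last_barrier = -1
    for i, entry in enumerate(remote_log):
        if entry is BARRIER:
            last_barrier = i
    return [e for e in remote_log[:last_barrier + 1] if e is not BARRIER]
```

Entries that arrive after the last barrier are ignored by `valid_image_entries`, which reflects the statement that the remote image set is only valid when a barrier is reached.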
  • The recovering DVE takes over the data journals of the failed DVE, as well as the non-volatile oplock journals.
  • its journals may have incomplete data for I/Os for which a status has not yet been returned to the host.
  • the state of the actual disk blocks on the data storage device may be characterized as "unknown".
  • the host may issue a retry of the I/O operation.
  • differences are detected and reconciled between the N sides of the mirror. Similar reconciliation may be performed for journals.
  • the DVE performing cleanup in connection with a failed DVE, through non-volatile oplocks, knows which block ranges the failed DVE may have been modifying, and may read the data from the device and write it to the journal, making the journal complete for those block ranges.
  • Reconciliation for a failed DVE being brought up to date may use the non-volatile oplocks as a form of dirty region logging to detect those portions.
  • DVE A may traverse the list of non-volatile oplocks to identify those which DVE B owned when it failed. Accordingly, DVE A may update DVE B's journal for any write operations, for example, that DVE B may have been in the process of completing. All volatile inter-DVE oplocks are released when DVE B goes off-line.
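The cleanup described above, in which a surviving DVE uses the non-volatile oplocks of a failed DVE as a dirty-region log, can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Oplock` record, the dictionary-based `device` and `journal`, and the function name `complete_journal` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Oplock:
    """Hypothetical non-volatile oplock record: owner plus a block range."""
    owner: str
    start: int
    length: int


def complete_journal(oplocks: Iterable[Oplock], failed_dve: str,
                     device: Dict[int, bytes],
                     journal: Dict[int, bytes]) -> List[int]:
    """Copy blocks the failed DVE may have been modifying into its journal.

    Returns the block numbers that were reconciled.
    """
    reconciled = []
    for lock in oplocks:
        if lock.owner != failed_dve:
            continue  # only ranges owned by the failed DVE are suspect
        for block in range(lock.start, lock.start + lock.length):
            # The on-disk state is "unknown" but readable: whatever is on
            # the device now becomes the journal's content for this block.
            journal[block] = device[block]
            reconciled.append(block)
    return reconciled


# Assumed example state: DVE B failed while holding an oplock on blocks 10-11.
oplocks = [Oplock("A", 0, 4), Oplock("B", 10, 2)]
device = {n: b"disk-%d" % n for n in range(16)}
journal_b = {10: b"stale"}
touched = complete_journal(oplocks, "B", device, journal_b)
```

Only the ranges covered by DVE B's oplocks are read back, which is what makes the oplock list behave like a dirty-region log: the rest of the device does not need to be scanned.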
  • FIG. 33 shown is an embodiment 900 of a compound example of performing a snapshot during a migration.
  • the example 900 illustrates an initial state of the rmap1 and rmap2 tables. Data is being migrated from P1 to P2 and V snap is also a snapshot of V.
  • the rmaps are modified in accordance with the state changes as described elsewhere herein in connection with performing the snapshot and migration operations, for example, when there is a write operation to a portion of data in V.
  • the foregoing illustrates that the FP can handle more complex and compound examples such as depicted in Figure 33. It should be noted that entries 0 and 2 in redirect table 2 may be combined if the CP can handle this compression since these entries are the same in this example.
  • the Rmap describes the variable length extents included in the VSEG descriptor's LBA range.
  • the Rmap shown is also a cache in the FP which is a portion of the potentially larger RMAP included in the CP, which itself may be implemented as a cache of media-based Rmap information.
  • the Volume to VSEG descriptor mapping has been eliminated as if there is only a single Volume Segment per Volume to keep the foregoing examples simple. Additionally, the foregoing examples treat each physical volume (P) as if fully consumed without having some offset into the device as the base address.
  • an acquiring DVE obtains all necessary locks, purges its local cache and all copies of the data cached elsewhere, such as by other DVEs, updates the global copy of the associated data, if any, and issues a broadcast to signal to other nodes that the lock is now available and that a new global copy of the associated metadata is also available. If there is no such broadcast sent to interested nodes (sharing list nodes), such as when an acquiring node has failed, other nodes may have individual timers. These timers may be used by each node as a default signaling mechanism to attempt to acquire locks.
  • Rmap data structure is shown elsewhere herein, for example, in Figure 9 as element 242.
  • the key value 1002 may be a value, such as an LBA.
  • the rmap data structure 1001 in this example may be implemented as a multi-level page table structure in which successive portions of the key 1002 are used as indices into a series of cascaded arrays. The arrays at a first level point to other arrays at a next level until a leaf is reached. As known to those of ordinary skill in the art, this may be referred to as a trie data structure.
  • a look up in the rmap data structure 1001 may be performed to determine a particular defined range, if any, into which the key value falls.
  • one or more ranges of values may be defined, such as 1006a, in which a starting value, length and associated value are specified.
  • each range may correspond to an LBA range of each extent, for example, as in the rmap 242 described previously in connection with Figure 9.
  • the value, such as "A" in 1006a may correspond to the index into the storage redirect table, as also described elsewhere herein.
  • Bits of a key value 1002 are used in traversing a path of connected arrays at each level.
  • This particular key value 1002 as described herein is a small key value for purposes of illustrations. Embodiments may use other key values including a varying number of bits, such as 16 or 256.
  • 2 bits of the key value 1002 are used to map and determine which next array, if any, to follow in determining whether, for a given key value, there is a defined range and obtaining associated information regarding that range, such as the value A of 1006a which may correspond to the index into the storage redirect table for a given LBA address.
  • the rmap 1001 includes arrows with solid lines defined when traversing the arrays with a starting value of one of the ranges. Additionally, the rmap 1001 includes single dashed line arrows defined providing paths to the range leaf nodes, such as 1006a, for values of each range other than the starting value.
  • a key value is 0x11 having the binary representation "0001 0001"
  • the first two bits of the key value "00" are used to select the corresponding element of 1022a which points to 1022b.
  • the next two bits of the key value "01" of 1022b point to 1020c.
  • the next two bits of the key value "00" point to array 1020d.
  • the final two bits of the key value "01" point to the leaf range 1006a via the solid arrow 1006b.
  • the final two bits of the key value "10" lead to 1006a via final connector 1006c.
  • for a key value of 0x10, there is no arrow from the first array element of 1020d corresponding to the two-bit key value "00". Accordingly, a determination is made that there is no defined range that includes the key value 0x10.
  • when a given two-bit portion of the key value at a current level is associated with only one leaf node range, intervening arrays between the current level array and the leaf node may be omitted and a direct connection may be made to the leaf.
  • arrow 1008d provides a direct connection to the corresponding leaf range node 1008a.
  • a lookup is then performed to determine if the key value is indeed included in the range of 1008a since more than one key value may be possible depending on the current level and not all key values may actually be included in the range of leaf node 1008a.
  • This trie may also be further compressed and collapsed in that arrays 1020a and 1020b and all pointers included therein may be omitted and replaced with double dashed arrow 1012. All valid key values with the first four bits of "0011" fall within the range of leaf node 1010a. All valid key values with the first two bits "01" may also be mapped directly to 1010a. A determination is then made as to whether the key is actually in the range by obtaining information from the leaf node and determining if the key lies between the start value and "start value+length-1". As just described, the rmap 1001 may be referred to as a compressed trie in which arrays at intervening levels may be removed as a space optimization also providing a time saving optimization when performing a look-up. The foregoing description uses a technique in which "legs" of the trie may be pruned if the leg has only a single hit by collapsing the leg up to the parent pointer.
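The compressed-trie lookup walked through above can be sketched in a few lines. This is an illustrative sketch only: the 8-bit key and 2 bits per level follow the 0x11 example, but the Python representation (nested lists for arrays, a `RangeLeaf` dataclass for leaves), the length of range "A", and the start, length and value of range "B" are assumptions chosen to reproduce the behaviour described, not values taken from the figures.

```python
from dataclasses import dataclass
from typing import Optional

KEY_BITS = 8        # small key, as in the 0x11 example
BITS_PER_LEVEL = 2  # two key bits select one of four array elements


@dataclass(frozen=True)
class RangeLeaf:
    start: int
    length: int
    value: str

    def contains(self, key: int) -> bool:
        # Range check against start .. start+length-1.
        return self.start <= key <= self.start + self.length - 1


def lookup(root, key: int) -> Optional[str]:
    """Return the value of the range containing key, or None."""
    node = root
    shift = KEY_BITS - BITS_PER_LEVEL
    while isinstance(node, list):
        if shift < 0:
            return None  # ran out of key bits without reaching a leaf
        node = node[(key >> shift) & 0b11]
        shift -= BITS_PER_LEVEL
    # A collapsed leg can be reached by keys outside the range, so the
    # leaf itself must confirm membership.
    if isinstance(node, RangeLeaf) and node.contains(key):
        return node.value
    return None


# Range "A" starts at 0x11; a length of 2 is assumed so 0x12 also hits it.
leaf_a = RangeLeaf(start=0x11, length=2, value="A")
# Hypothetical range "B", reachable through two collapsed legs.
leaf_b = RangeLeaf(start=0x34, length=0x40, value="B")

level4 = [None, leaf_a, leaf_a, None]   # final bits 01 and 10 reach "A"
level3 = [level4, None, None, None]
level2 = [None, level3, None, leaf_b]   # prefix 0011 collapsed to the leaf
root = [level2, leaf_b, None, None]     # prefix 01 collapsed to the leaf
```

With this layout, `lookup(root, 0x11)` and `lookup(root, 0x12)` both reach leaf "A", `lookup(root, 0x10)` finds no arrow and reports no range, and `lookup(root, 0x30)` reaches leaf "B" through the collapsed leg but is rejected by the range check because 0x30 is below the assumed start of 0x34.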
  • the C language fault handler performs the updating of the global metadata and pushing snapshot data, for example in connection with performing a write using a snapshot device described elsewhere herein.
  • FP fast path
  • CP control path
  • the FP in an embodiment may implement one or more "primitive" operations.
  • the primitive operations, used as building blocks, may be used together to perform more complex operations.
  • the CP, for example, utilizing an FP interface, may issue instructions to the FP to perform a set of the primitives in a carefully orchestrated way, so as to perform higher level data operations, such as snapshots, migrations, replications, and other operations.
  • the CP can do the foregoing such that multiple FPs and CPs can provide access to the same data concurrently and redundantly.
  • the FP does not have specific knowledge as to what particular more complex data operation may be performed. Rather, the CP has knowledge of how the individual primitive operations piece together to complete the more complex data operation.
  • the CP invokes the one or more FPs to perform the various primitive operations as may be defined in accordance with the FP API as described elsewhere herein.
  • in a traditional volume manager, there may be independent modules used to perform different complex operations, such as snapshot, migration, mirroring, striping, and the like.
  • Each of the foregoing modules may perform independent virtual to physical LBA translations.
  • Each of these independent modules may be called in a predetermined sequence in connection with performing any I/O operation.
  • Each module may accordingly perform the relevant processing in connection with the current I/O operation.
  • the CP determines what particular I/O primitives and computations for virtual to physical LBA translations are necessary to complete a particular I/O operation.
  • These I/O primitives may be implemented in hardware and/or software. Consider, for example, the following. An incoming I/O operation may be initially routed to the
  • the CP determines that the following translations from virtual LBAs to physical LBAs are needed to complete the
  • the CP determines the foregoing translations and associated states of the LBA Rmap table entries prior to invoking any FPs for processing. Since the CP has knowledge about what other processes or threads may be accessing a particular LBA range, device, etc., the CP may coordinate activities to be performed by the FPs in connection with completing this I/O operation as well as other ongoing activities. In this instance, the CP may determine that the foregoing virtual address LBA ranges may be accessed and used in connection with performing this current I/O operation. The CP may then invoke and authorize multiple FPs to perform, in parallel, the translations and associated I/O operation for the above virtual addresses, except v401-v500. As indicated by the "fault" above, the CP may need to perform an action, such as load a table entry, prior to authorizing an FP to perform an operation in connection with virtual addresses v401-v500.
  • an action such as load a table entry
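The division of one request into ranges that FPs may process immediately and a range that faults to the CP can be sketched as below. Only the fault on v401-v500 comes from the text; the other table rows, all physical addresses, and the names `Extent` and `plan_io` are hypothetical and serve purely as an illustration.

```python
from typing import List, Optional, Tuple

# (first virtual LBA, last virtual LBA, first physical LBA or None).
# None marks an entry the CP must load before any FP may use it.
Extent = Tuple[int, int, Optional[int]]


def plan_io(table: List[Extent], v_first: int, v_last: int):
    """Split a virtual LBA request into FP-ready pieces and faulted pieces."""
    ready, faulted = [], []
    for ext_first, ext_last, p_first in table:
        lo, hi = max(v_first, ext_first), min(v_last, ext_last)
        if lo > hi:
            continue  # extent does not overlap the request
        if p_first is None:
            faulted.append((lo, hi))  # CP must act before authorizing an FP
        else:
            offset = lo - ext_first
            ready.append((lo, hi, p_first + offset))
    return ready, faulted


# Hypothetical table: the physical addresses are invented for illustration.
table = [(1, 100, 1001), (101, 400, 5101), (401, 500, None)]
ready, faulted = plan_io(table, 51, 450)
```

Each tuple in `ready` is independent of the others, which is what would allow a CP to hand the pieces to multiple FPs in parallel while it resolves the faulted range itself.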
  • CP and FP may be characterized as different from the architecture associated with a volume manager which sends every I/O operation through a central code path.
  • the CP and FP embodiment separates the I/O operations into those that may be performed by the FP and those that may not.
  • most I/O operations may be processed in a streamlined fashion as described herein by the FP.
  • the foregoing provides a scaleable technique for use with I/O operations.
  • the relationship between the CP and one or more associated FPs may be characterized as a master-slave relationship.
  • the CP is the master that coordinates and controls the one or more FPs to perform tasks.
  • the CP's responsibilities include coordination of FP processing to perform an I/O operation.
  • the CP may be deemed a taskmaster and coordinator in connection with other operations that need to be performed in a system, such as migration.
  • the CP enlists the assistance of the one or more FPs also in performing the migration, for example.
  • the CP coordinates and balances the performance of other tasks, such as migration, and incoming I/O operations.
  • when the CP instructs an FP to perform an operation, such as a mapping primitive operation, the CP grants authority to the FP to perform the operation.
  • the FP as described herein also has its own local cache that may include data used by the FP in performing the operation.
  • the FP continues to operate using the current data in its local FP cache independent of other FP caches and the CP cache until the CP revokes the authority of the FP, for example, by invalidating the contents of the FP's local cache.
  • the FP may then continue to complete its current I/O operation but not begin any new I/O operations.
  • the FP may subsequently acknowledge the invalidation message by sending an acknowledgement to the CP.
  • the CP then takes appropriate subsequent action.
  • the CP may wait for pending I/Os to drain from the FP and CP's pending I/O lists if there is a restrictive update being performed.
  • the FP does not synchronize its cache with any other FP cache providing each of the FPs with the independence needed to make the CP and FP techniques described herein scalable.
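The grant, revoke, acknowledge and drain sequence between a CP and one FP can be sketched as a small single-threaded model. Everything here is an assumption made for illustration: the class and method names, the dictionary used as the FP's local cache, and the string results `"fault"` and `"ack"` are not taken from the FP API described elsewhere in the source.

```python
class FastPath:
    """Toy FP: a private mapping cache plus a set of in-flight I/Os."""

    def __init__(self):
        self.cache = {}       # virtual LBA -> physical LBA, granted by the CP
        self.pending = set()  # I/O ids started but not yet completed

    def grant(self, entries):
        """The CP grants authority by loading mapping entries."""
        self.cache.update(entries)

    def begin_io(self, io_id, vlba):
        """Start an I/O if authorized; otherwise fault to the CP."""
        if vlba not in self.cache:
            return "fault"
        self.pending.add(io_id)
        return self.cache[vlba]

    def complete_io(self, io_id):
        self.pending.discard(io_id)

    def invalidate(self):
        """The CP revokes authority; in-flight I/Os are left to finish."""
        self.cache.clear()
        return "ack"


class ControlPath:
    def revoke(self, fp):
        """Invalidate an FP's cache and report whether it has drained."""
        ack = fp.invalidate()
        return ack == "ack" and not fp.pending


fp, cp = FastPath(), ControlPath()
fp.grant({7: 7007})
first = fp.begin_io("io-1", 7)   # authorized: returns the mapping
drained = cp.revoke(fp)          # io-1 is still in flight
second = fp.begin_io("io-2", 7)  # authority revoked: faults to the CP
fp.complete_io("io-1")
drained_after = cp.revoke(fp)    # nothing pending any more
```

Note that the FP consults only its own cache: no FP-to-FP synchronization appears anywhere in the model, which mirrors the independence the text credits for scalability.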
EP02806874A 2001-12-10 2002-12-09 Schnell-pfad, um datenoperationen auszuführen Withdrawn EP1532520A2 (de)

Applications Claiming Priority (15)

Application Number Priority Date Filing Date Title
US218195 1988-07-13
US218189 1998-12-22
US34005001P 2001-12-10 2001-12-10
US340050P 2002-03-15
US36894002P 2002-03-29 2002-03-29
US368940P 2002-03-29
US218192 2002-08-13
US10/218,189 US7013379B1 (en) 2001-12-10 2002-08-13 I/O primitives
US10/218,186 US6973549B1 (en) 2001-12-10 2002-08-13 Locking technique for control and synchronization
US218186 2002-08-13
US10/218,192 US6986015B2 (en) 2001-12-10 2002-08-13 Fast path caching
US10/218,098 US7173929B1 (en) 2001-12-10 2002-08-13 Fast path for performing data operations
US218098 2002-08-13
US10/218,195 US6959373B2 (en) 2001-12-10 2002-08-13 Dynamic and variable length extents
PCT/US2002/039232 WO2003071419A2 (en) 2001-12-10 2002-12-09 Fast path for performing data operations

Publications (1)

Publication Number Publication Date
EP1532520A2 true EP1532520A2 (de) 2005-05-25

Family

ID=27761690

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02806874A Withdrawn EP1532520A2 (de) 2001-12-10 2002-12-09 Schnell-pfad, um datenoperationen auszuführen

Country Status (3)

Country Link
EP (1) EP1532520A2 (de)
AU (1) AU2002366270A1 (de)
WO (1) WO2003071419A2 (de)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005115506A (ja) * 2003-10-06 2005-04-28 Hitachi Ltd ストレージシステム
US7318134B1 (en) 2004-03-16 2008-01-08 Emc Corporation Continuous data backup using distributed journaling
US7599951B1 (en) 2004-03-25 2009-10-06 Emc Corporation Continuous data backup
US7398348B2 (en) 2004-08-24 2008-07-08 Sandisk 3D Llc Method and apparatus for using a one-time or few-time programmable memory with a host device designed for erasable/rewritable memory
US7493458B1 (en) 2004-09-15 2009-02-17 Emc Corporation Two-phase snap copy
US8037154B2 (en) * 2005-05-19 2011-10-11 International Business Machines Corporation Asynchronous dual-queue interface for use in network acceleration architecture
CN105359114B (zh) * 2014-03-31 2018-09-18 甲骨文国际公司 用于在寻址方案之间进行迁移的方法和系统
WO2018023198A1 (en) 2016-08-04 2018-02-08 Astenjohnson, Inc. Reinforced element for industrial textiles
CN111061465B (zh) * 2019-11-22 2023-06-13 上海乐白机器人有限公司 机器人编程的反向映射方法、系统、电子设备、存储介质
CN113064546B (zh) * 2020-01-02 2024-02-09 阿里巴巴集团控股有限公司 一种文件系统的管理方法、装置、文件系统及存储介质
CN117093278B (zh) * 2023-10-16 2024-03-15 荣耀终端有限公司 内核关机方法、电子设备及存储介质

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6728832B2 (en) * 1990-02-26 2004-04-27 Hitachi, Ltd. Distribution of I/O requests across multiple disk units
EP0570516A4 (en) * 1991-02-06 1998-03-11 Storage Technology Corp Disk drive array memory system using nonuniform disk drives
US5566331A (en) * 1994-01-24 1996-10-15 University Corporation For Atmospheric Research Mass storage system for file-systems
US5712976A (en) * 1994-09-08 1998-01-27 International Business Machines Corporation Video data streamer for simultaneously conveying same one or different ones of data blocks stored in storage node to each of plurality of communication nodes
US5634028A (en) * 1994-12-15 1997-05-27 International Business Machines Corporation Compact track address translation mapping system and method
US6466978B1 (en) * 1999-07-28 2002-10-15 Matsushita Electric Industrial Co., Ltd. Multimedia file systems using file managers located on clients for managing network attached storage devices
WO2001033361A1 (en) * 1999-11-01 2001-05-10 Mangosoft Corporation Internet-based shared file service with native pc client access and semantics
US6684209B1 (en) * 2000-01-14 2004-01-27 Hitachi, Ltd. Security method and system for storage subsystem
JP2001350707A (ja) * 2000-06-06 2001-12-21 Hitachi Ltd 情報処理システム、記憶装置の割り当て方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO03071419A2 *

Also Published As

Publication number Publication date
AU2002366270A8 (en) 2003-09-09
AU2002366270A1 (en) 2003-09-09
WO2003071419A3 (en) 2005-03-31
WO2003071419A2 (en) 2003-08-28

Similar Documents

Publication Publication Date Title
US6959373B2 (en) Dynamic and variable length extents
US7173929B1 (en) Fast path for performing data operations
US6973549B1 (en) Locking technique for control and synchronization
US6986015B2 (en) Fast path caching
US7013379B1 (en) I/O primitives
US11449401B2 (en) Moving a consistency group having a replication relationship
US9811431B1 (en) Networked based replication of distributed volumes
US9910739B1 (en) Inverse star replication
US9594822B1 (en) Method and apparatus for bandwidth management in a metro cluster environment
US7191304B1 (en) Efficient and reliable virtual volume mapping
US7337351B2 (en) Disk mirror architecture for database appliance with locally balanced regeneration
US10061666B1 (en) Method and apparatus for adding a director to storage with network-based replication without data resynchronization
US8156195B2 (en) Systems and methods for obtaining ultra-high data availability and geographic disaster tolerance
US7542987B2 (en) Automatic site failover
JP2008524724A (ja) クラスタ・ストレージ環境内で並列データ移行を実施する方法
US11409708B2 (en) Gransets for managing consistency groups of dispersed storage items
US11579983B2 (en) Snapshot performance optimizations
US11709780B2 (en) Methods for managing storage systems with dual-port solid-state disks accessible by multiple hosts and devices thereof
EP1532520A2 (de) Schnell-pfad, um datenoperationen auszuführen
US11875060B2 (en) Replication techniques using a replication log
US11586376B2 (en) N-way active-active storage configuration techniques
US20220138105A1 (en) Resolving cache slot locking conflicts for remote replication
US11513900B2 (en) Remote replication of snapshots taken while replication was inactive
US20240126470A1 (en) Synchronous replication

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

AK Designated contracting states

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SI SK TR

RIC1 Information provided on ipc code assigned before grant

Ipc: 7G 06F 3/06 A

17P Request for examination filed

Effective date: 20050930

RBV Designated contracting states (corrected)

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SI SK TR

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20070703