WO2004061642A2 - Universal multi-path driver for storage systems - Google Patents

Universal multi-path driver for storage systems

Info

Publication number
WO2004061642A2
Authority
WO
WIPO (PCT)
Prior art keywords
machine
irp
processor
causes
program code
Prior art date
Application number
PCT/US2003/039869
Other languages
French (fr)
Other versions
WO2004061642A3 (en
Inventor
Giridhar Athreya
Chris B. Legg
Juan Carlos Ortiz
Original Assignee
Unisys Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisys Corporation
Priority to JP2004565492A (JP2006513469A)
Priority to EP03814802A (EP1579334B1)
Priority to AU2003297111A (AU2003297111A1)
Priority to DE60313040T (DE60313040T2)
Publication of WO2004061642A2
Publication of WO2004061642A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/10 Program control for peripheral devices
    • G06F13/102 Program control for peripheral devices where the programme performs an interfacing function, e.g. device driver
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F2003/0697 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers; device management, e.g. handlers, drivers, I/O schedulers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device

Definitions

  • Embodiments of the invention relate to the field of storage systems, and more specifically, to drivers for storage systems.
  • An embodiment of the invention is a technique to manage multiple paths for input/output (I/O) devices.
  • An I/O request packet (IRP) from a higher level driver is received.
  • a plurality of paths to a plurality of device objects is managed in response to the IRP using a plurality of lower level drivers.
  • the device objects correspond to physical devices having M device types.
  • the lower level drivers control the physical devices.
  • Figure 1A is a diagram illustrating a system in which one embodiment of the invention can be practiced.
  • Figure 1B is a diagram illustrating a server/client system according to one embodiment of the invention.
  • Figure 2 is a diagram illustrating a storage management driver according to one embodiment of the invention.
  • Figure 3 is a diagram illustrating multipaths to physical devices according to one embodiment of the invention.
  • Figure 4 is a diagram illustrating a universal multipath driver according to one embodiment of the invention.
  • Figure 5 is a flowchart illustrating a process to dispatch according to one embodiment of the invention.
  • Figure 6 is a flowchart illustrating a process to respond to a start device minor IRP according to one embodiment of the invention.
  • Figure 7 is a flowchart illustrating a process to interface to lower level drivers according to one embodiment of the invention.
  • Figure 8 is a flowchart illustrating a process to monitor paths according to another embodiment of the invention.
  • Figure 9 is a flowchart illustrating a process to balance load according to one embodiment of the invention.
  • FIG. 1A is a diagram illustrating a system 10 in which one embodiment of the invention can be practiced.
  • the system 10 includes a server/client 20, a network 30, a switch 40, tape drives 50_1 and 50_2, a tape library 60, and a storage subsystem 70. Note that the system 10 is shown for illustrative purposes only. The system 10 may contain more or fewer components than shown. The system 10 may be used in a direct attached storage or storage area network (SAN) configuration. The topology of the system 10 may be arbitrated loop or switched fabric.
  • the server/client 20 is a computer system typically used in an enterprise environment. It may be a server that performs dedicated functions such as a Web server, an electronic mail (e-mail) server, or a client in a networked environment with connection to other clients or servers.
  • the server/client 20 usually requires large storage capacity for its computing needs. It may be used in a wide variety of applications such as finance, scientific research, multimedia, academic and government work, databases, entertainment, etc.
  • the network 30 is any network that connects the server/client 20 to other servers/ clients or systems.
  • the network 30 may be a local area network (LAN), a wide area network (WAN), an intranet, the Internet, or any other type of network.
  • the network 30 may contain a number of network devices (not shown) such as gateways, adapters, routers, etc. to interface to a number of telecommunication networks such as Asynchronous Transfer Mode (ATM) or Synchronous Optical Network (SONET).
  • the switch 40 is an interconnecting device that interconnects the server/client 20 to various storage devices or other devices or subsystems.
  • the switch 40 may be a hub, a switching hub, a multiple point-to-point switch, or a director, etc. It typically has a large number of ports ranging from a few ports to hundreds of ports. The complexity may range from a simple arbitrated loop to highly available point-to-point.
  • the throughput of the switch 40 may range from 200 megabytes per second (MBps) to 1 gigabyte per second (GBps).
  • the tape drives 50_1 and 50_2 are storage devices with high capacity for backup and archival tasks.
  • the capacity for a tape used in the tape drives may range from tens to hundreds of Gigabytes (GB).
  • the transfer rates may range from 10 to 50 MBps.
  • the tape library 60 includes multiple tape drives with automated tape loading.
  • the capacity of the tape library 60 may range from 1 to 1,000 Terabytes (TB) with an aggregate data rate of 50-300 MBps.
  • the tape drives 50_1 and 50_2 and tape library 60 use sequential accesses.
  • the storage subsystem 70 includes a disk subsystem 72, a redundant array of inexpensive disks (RAID) subsystem 74, and a storage device 76.
  • the disk subsystem 72 may be a single drive or an array of disks.
  • the RAID subsystem 74 is an array of disks with additional complexity and features to increase manageability, performance, capacity, reliability, and availability.
  • the storage device 76 may be any other storage systems including magnetic, optic, electro-opti
  • the tape drives 50_1 and 50_2, tape library 60, disk subsystem 72, redundant array of inexpensive disks (RAID) subsystem 74, and storage device 76 form physical devices that are attached to the server/client 20 to provide archival storage. These devices typically include different device types.
  • the server/client 20 has the ability to interface to all of these device types (e.g., tape drives, tape library, disk RAID) over multiple paths.
  • FIG. 1B is a diagram illustrating a server/client system 20 in which one embodiment of the invention can be practiced.
  • the server/client system 20 includes a processor 110, a processor bus 120, a memory control hub (MCH) 130, a system memory 140, an input/output control hub (ICH) 150, a peripheral bus 160, host bus adapters (HBAs) 165_1 to 165_M, a mass storage device 170, and input/output devices 180_1 to 180_K.
  • the server/client system 20 may include more or fewer elements than these.
  • the processor 110 represents a central processing unit of any type of architecture, such as embedded processors, mobile processors, micro-controllers, digital signal processors, superscalar computers, vector processors, single instruction multiple data (SIMD) computers, complex instruction set computers (CISC), reduced instruction set computers (RISC), very long instruction word (VLIW), or hybrid architecture.
  • the processor bus 120 provides interface signals to allow the processor 110 to communicate with other processors or devices, e.g., the MCH 130.
  • the processor bus 120 may support a uni-processor or multiprocessor configuration.
  • the processor bus 120 may be parallel, sequential, pipelined, asynchronous, synchronous, or any combination thereof.
  • the MCH 130 provides control and configuration of memory and input/output devices such as the system memory 140 and the ICH 150.
  • the MCH 130 may be integrated into a chipset that integrates multiple functionalities such as the isolated execution mode, host-to-peripheral bus interface, and memory control.
  • the MCH 130 interfaces to the peripheral bus 160.
  • the system may also include peripheral buses such as Peripheral Component Interconnect (PCI), accelerated graphics port (AGP), Industry Standard Architecture (ISA) bus, and Universal Serial Bus (USB).
  • the system memory 140 stores system code and data.
  • the system memory 140 is typically implemented with dynamic random access memory (DRAM) or static random access memory (SRAM).
  • the system memory 140 may include program code or code segments implementing one embodiment of the invention.
  • the system memory 140 includes a storage management driver 145. Any one of the elements of the storage management driver 145 may be implemented by hardware, software, firmware, microcode, or any combination thereof.
  • the system memory 140 may also include other programs or data which are not shown, such as an operating system.
  • the storage management driver 145 contains program code that, when executed by the processor 110, causes the processor 110 to perform operations as described below.
  • the ICH 150 has a number of functionalities that are designed to support I/O functions.
  • the ICH 150 may also be integrated into a chipset together with or separate from the MCH 130 to perform I/O functions.
  • the ICH 150 may include a number of interface and I/O functions such as PCI bus interface to interface to the peripheral bus 160, processor interface, interrupt controller, direct memory access (DMA) controller, power management logic, timer, system management bus (SMBus), universal serial bus (USB) interface, mass storage interface, low pin count (LPC) interface, etc.
  • the HBAs 165_1 to 165_M are adapters that interface to the switch 40 (Figure 1A).
  • the HBAs 165_1 to 165_M are typically add-on cards that interface to the peripheral bus 160 or any other bus accessible to the processor 110.
  • the HBAs may have their own processor with local memory or frame buffer to store temporary data.
  • the protocols supported by the HBAs may be Small Computer System Interface (SCSI), Internet Protocol (IP), and Fibre Channel (FC).
  • the transfer rates may be hundreds of MBps with full duplex.
  • the media may include copper and multi-mode optics.
  • the mass storage device 170 stores archive information such as code, programs, files, data, applications, and operating systems.
  • the mass storage device 170 may include a compact disk (CD) ROM 172, a digital video/versatile disc (DVD) 173, a floppy drive 174, a hard drive 176, a flash memory 178, and any other magnetic or optic storage devices.
  • the mass storage device 170 provides a mechanism to read machine-accessible media.
  • the machine-accessible media may contain computer readable program code to perform tasks as described in the following.
  • the I/O devices 180_1 to 180_K may include any I/O devices to perform I/O functions.
  • Examples of the I/O devices 180_1 to 180_K include controllers for input devices (e.g., keyboard, mouse, trackball, pointing device), media cards (e.g., audio, video, graphics), network cards, and any other peripheral controllers.
  • Elements of one embodiment of the invention may be implemented by hardware, firmware, software or any combination thereof.
  • hardware generally refers to an element having a physical structure such as electronic, electromagnetic, optical, electro-optical, mechanical, electro-mechanical parts, etc.
  • software generally refers to a logical structure, a method, a procedure, a program, a routine, a process, an algorithm, a formula, a function, an expression, etc.
  • firmware generally refers to a logical structure, a method, a procedure, a program, a routine, a process, an algorithm, a formula, a function, an expression, etc., that is implemented or embodied in a hardware structure (e.g., flash memory, ROM, EROM).
  • firmware may include microcode, writable control store, and micro-programmed structure.
  • the elements of an embodiment of the present invention are essentially the code segments to perform the necessary tasks.
  • the software/firmware may include the actual code to carry out the operations described in one embodiment of the invention, or code that emulates or simulates the operations.
  • the program or code segments can be stored in a processor or machine accessible medium or transmitted by a computer data signal embodied in a carrier wave, or a signal modulated by a carrier, over a transmission medium.
  • the "processor readable or accessible medium” or “machine readable or accessible medium” may include any medium that can store, transmit, or transfer information.
  • Examples of the processor readable or machine accessible medium include an electronic circuit, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk (CD) ROM, an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, etc.
  • the computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc.
  • the code segments may be downloaded via computer networks such as the Internet, Intranet, etc.
  • the machine accessible medium may be embodied in an article of manufacture.
  • the machine accessible medium may include data that, when accessed by a machine, cause the machine to perform the operations described in the following.
  • the machine accessible medium may also include program code embedded therein.
  • the program code may include machine readable code to perform the operations described in the following.
  • the term "data” here refers to any type of information that is encoded for machine-readable purposes. Therefore, it may include program, code, data, file, etc.
  • All or part of an embodiment of the invention may be implemented by hardware, software, or firmware, or any combination thereof.
  • the hardware, software, or firmware element may have several modules coupled to one another.
  • a hardware module is coupled to another module by mechanical, electrical, optical, electromagnetic or any physical connections.
  • a software module is coupled to another module by a function, procedure, method, subprogram, or subroutine call, a jump, a link, a parameter, variable, and argument passing, a function return, etc.
  • a software module is coupled to another module to receive variables, parameters, arguments, pointers, etc. and/or to generate or pass results, updated variables, pointers, etc.
  • a firmware module is coupled to another module by any combination of hardware and software coupling methods above.
  • a hardware, software, or firmware module may be coupled to any one of another hardware, software, or firmware module.
  • a module may also be a software driver or interface to interact with the operating system running on the platform.
  • a module may also be a hardware driver to configure, set up, initialize, send and receive data to and from a hardware device.
  • An apparatus may include any combination of hardware, software, and firmware modules.
  • One embodiment of the invention may be described as a process which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a program, a procedure, a method of manufacturing or fabrication, etc.
  • FIG. 2 is a diagram illustrating the storage management driver 145 according to one embodiment of the invention.
  • the storage management driver 145 includes operating system components 210, higher-level drivers 218, a universal multipath driver 220, and a lower level driver 250.
  • the OS components 210 include an I/O manager 212, a plug and play (PnP) manager 214, and a power manager 216.
  • the I/O manager 212 provides a consistent interface to all kernel-mode drivers. It defines a set of standard routines that other drivers can support. All I/O requests from the I/O manager 212 are sent as I/O request packets (IRPs).
  • PnP manager 214 supports automatic installation and configuration of drivers when their corresponding devices are plugged into the system.
  • the power manager 216 manages power usage of the system including power modes such as sleep, hibernation, or shutdown.
  • the higher-level drivers 218 include a class driver and may also include any driver that interfaces directly to the OS components 210.
  • the I/O manager 212, the PnP manager 214, and the power manager 216 are from Microsoft Windows 2000, CE, and .NET.
  • the universal multipath driver (UMD) 220 is a driver that provides multipath management to the storage devices shown in Figure 1B such as the tape drives, the tape library, and the disk subsystem.
  • the UMD 220 responds to an IRP sent by the higher level driver 218 and interfaces to the lower level driver 250.
  • the lower level driver 250 includes drivers that are directly responsible for the control and management of the devices attached to the system.
  • the lower level driver 250 includes a tape drive device driver 252, a tape library device driver 254, and an HBA driver 256 which are drivers for device 165_i, library 165_j, and HBA 165_k, respectively.
  • the HBA 165_k in turn directly controls the corresponding storage device(s) shown in Figure 1B.
  • FIG. 3 is a diagram illustrating multipaths to physical devices according to one embodiment of the invention.
  • a path is a physical connection between an HBA and the corresponding device.
  • an HBA is interfaced to a number of devices via multiple paths through the switch 40.
  • the HBA 165_j is connected to the tape drives 50_1 and 50_2 and the tape library 60 through the paths 311, 312, and 313; and the HBA 165_k is connected to the same devices through the paths 321, 322, and 323, respectively.
  • the universal multipath driver (UMD) 220 provides multipath management, failover and failback, and load balancing. This is accomplished by maintaining a list of devices attached to the system. The devices are identified by their device name, device identifier, and device serial number. This information is typically provided by the peripheral devices upon inquiry by the corresponding lower-level drivers.
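  • As an illustration only, the device list described above might be sketched as follows. The class names, field names, and sample identifiers are hypothetical, and the sketch assumes that a device reports the same serial number on every path, so a repeated serial number denotes an additional path rather than an additional device.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DeviceEntry:
    """Attributes the text says identify a device (field names are assumed)."""
    name: str
    device_id: str
    serial_number: str


@dataclass
class DeviceList:
    """Hypothetical registry keyed by serial number, with the paths per device."""
    entries: dict = field(default_factory=dict)
    paths: dict = field(default_factory=dict)

    def add_path(self, entry: DeviceEntry, path_id: int) -> None:
        # Assumption: one serial number means one physical device, so a
        # second report of the same serial number only adds another path.
        self.entries.setdefault(entry.serial_number, entry)
        self.paths.setdefault(entry.serial_number, []).append(path_id)

    def paths_to(self, serial_number: str) -> list:
        # Return a copy so callers cannot mutate the registry by accident.
        return list(self.paths.get(serial_number, []))


devices = DeviceList()
tape = DeviceEntry("tape drive 50_1", "TAPE-A", "SN-0001")  # made-up values
devices.add_path(tape, 311)  # path through one HBA
devices.add_path(tape, 321)  # path through the other HBA
```

  • Keying on the serial number is one possible design choice; the text leaves open whether the device identifier, the serial number, or both are used for matching.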
  • FIG. 4 is a diagram illustrating the universal multipath driver (UMD) 220 according to one embodiment of the invention.
  • the UMD 220 includes a driver entry 410, a major function group 420, a system thread 480, and a path monitor 490.
  • the driver entry 410 provides an entry point for the UMD 220 in response to an IRP issued by the higher level driver 218.
  • the driver entry 410 includes a driver object pointer 415 that provides address reference or points to the major function group 420.
  • the driver entry 410 also causes creation of the system thread 480. The system thread 480 invokes the path monitor 490.
  • the major function group 420 includes a number of functions, routines, or modules that manage the multiple paths connected to the physical devices. In one embodiment, these functions, modules, or routines are compatible with the Microsoft Developer Network (MSDN) library.
  • the major function group 420 includes a dispatch function 430, a filter SCSI function 440, a filter add device function 450, a filter unload function 460, and a power dispatch function 470.
  • the dispatch function 430, the filter SCSI function 440, and the filter add device function 450 interface to the lower level driver 250.
  • the dispatch function 430 dispatches the operations in response to receiving an IRP from the higher level driver 218.
  • the PnP manager sends a major PnP IRP request during enumeration, resource rebalancing, and any other time that plug-and-play activity occurs on the system.
  • the filter SCSI function 440 sets up IRPs with device-specific I/O control codes, requesting support from the lower-level drivers.
  • the filter add device function 450 creates and initializes a new filter device object for the corresponding physical device object, then it attaches the device object to the device stack of the drivers for the device.
  • the filter unload function 460 frees any objects and releases any driver-allocated resources. It terminates the system thread 480.
  • the path monitor 490 monitors the multiple paths in the system and determines whether any failover has occurred.
  • Path failover occurs when a peripheral device is no longer reachable via one of the paths. This may be a result of disconnection or any other malfunction or errors.
  • the failed path is placed into a bad path list.
  • path failback can then be initiated.
  • when failback is completed, the path is removed from the bad path list.
  • an alternate path to an alternate device may be established for the failed device.
  • the alternate device may be active or passive prior to the failover.
  • FIG. 5 is a flowchart illustrating the process 430 to dispatch according to one embodiment of the invention.
  • the process 430 responds to a minor IRP (Block 510).
  • a minor IRP may be a start device minor IRP (Block 520), a remove device minor IRP (Block 530), a device relation minor IRP (Block 540), a query id minor IRP (Block 550), a stop device minor IRP (Block 560), or a device usage notification (Block 570).
  • the process 430 performs operations in response to these minor IRPs accordingly.
  • for the remove device minor IRP (Block 530), the process 430 removes an entry from a device list (Block 532). This entry contains the device attributes such as name, serial number, and device ID.
  • the process 430 then detaches the attached device (Block 534). This can be performed by sending a command to the lower level driver that is responsible for controlling the attached device. The process 430 is then terminated.
  • for the device relation minor IRP (Block 540), the process 430 allocates a device relation structure in paged memory (Block 542) and is then terminated.
  • for the query id minor IRP (Block 550), the process 430 creates a device ID (Block 552), returns the device ID (Block 554) and is then terminated.
  • for the stop device minor IRP (Block 560), the process 430 removes an entry from the device list (Block 562) and is then terminated.
  • for the device usage notification (Block 570), the process 430 forwards the IRP to the next driver in the stack (Block 572) and is then terminated.
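  • The dispatch of Figure 5 can be pictured with a small sketch. This is not the driver's code: the enum, the function name, and the returned labels are hypothetical, and the pairing of each action with a minor IRP is inferred from the block numbering (for example, Blocks 532 and 534 under Block 530).

```python
from enum import Enum, auto


class MinorIrp(Enum):
    """The six minor IRPs named for the dispatch process."""
    START_DEVICE = auto()
    REMOVE_DEVICE = auto()
    DEVICE_RELATION = auto()
    QUERY_ID = auto()
    STOP_DEVICE = auto()
    DEVICE_USAGE_NOTIFICATION = auto()


def dispatch(minor, device_list, key):
    """Return a label for the action taken; labels are illustrative only."""
    if minor is MinorIrp.START_DEVICE:
        return "start device"                 # Block 520, detailed in Figure 6
    if minor is MinorIrp.REMOVE_DEVICE:
        device_list.pop(key, None)            # Block 532: drop the list entry
        return "detach device"                # Block 534
    if minor is MinorIrp.DEVICE_RELATION:
        return "allocate device relation"     # Block 542
    if minor is MinorIrp.QUERY_ID:
        return f"device id for {key}"         # Blocks 552 and 554
    if minor is MinorIrp.STOP_DEVICE:
        device_list.pop(key, None)            # Block 562
        return "stop device"
    return "forward to next driver"           # Block 572


device_list = {"SN-0001": "tape drive 50_1"}  # made-up entry
queried = dispatch(MinorIrp.QUERY_ID, device_list, "SN-0001")
removed = dispatch(MinorIrp.REMOVE_DEVICE, device_list, "SN-0001")
forwarded = dispatch(MinorIrp.DEVICE_USAGE_NOTIFICATION, device_list, "SN-0001")
```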
  • Figure 6 is a flowchart illustrating the process 520 to respond to a start device minor IRP according to one embodiment of the invention.
  • the process 520 starts the device using the lower level driver (Block 610). This may be performed by sending a command to the lower level driver that directly controls the device, or by writing control parameters to the appropriate calling function.
  • the process 520 obtains the device name (Block 620). Then, the process 520 sends control command to the lower level driver to obtain the SCSI address of the device (Block 630).
  • the process 520 obtains the device identifier (ID) (Block 640). Then, the process 520 obtains the device serial number (Block 650).
  • the process 520 determines if the device code matches an entry in the device list (Block 660).
  • the device code may be any one of the device ID or the device serial number or both. If so, the process 520 creates a new bus physical device object (Block 670) and is then terminated. Otherwise, the process 520 is terminated.
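  • A rough sketch of Blocks 620 through 670, under stated assumptions: the `query` callable stands in for commands sent to the lower level driver, the dictionary returned on a match stands in for the bus physical device object, and all names and sample values are hypothetical. Starting the device itself (Block 610) is omitted.

```python
def start_device(query, known_codes):
    """Collect the device attributes, then match them against the device list.

    query       -- callable taking an attribute name and returning its value
                   (a stand-in for requests to the lower level driver)
    known_codes -- set of device IDs and serial numbers already on the list
    """
    name = query("name")                    # Block 620
    scsi_address = query("scsi_address")    # Block 630
    device_id = query("device_id")          # Block 640
    serial_number = query("serial_number")  # Block 650

    # Block 660: the device code may be the device ID, the serial number, or both.
    if device_id in known_codes or serial_number in known_codes:
        # Block 670: placeholder for creating a bus physical device object.
        return {
            "name": name,
            "scsi_address": scsi_address,
            "device_id": device_id,
            "serial_number": serial_number,
        }
    return None


replies = {
    "name": "tape drive 50_1",
    "scsi_address": "0:0:1:0",
    "device_id": "TAPE-A",
    "serial_number": "SN-0001",
}  # made-up replies from a lower level driver
pdo = start_device(replies.get, {"SN-0001"})
unmatched = start_device(replies.get, set())
```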
  • Figure 7 is a flowchart illustrating the process 440 to interface to lower level drivers according to one embodiment of the invention.
  • the process 440 determines if a device property flag indicating that the device property has been obtained is set (Block 710). If not, the process 440 obtains the supported device name of the attached device (Block 715). Then, the process 440 determines if the supported device name is on the device list (Block 720). If so, the process 440 asserts a device support flag and is then terminated. Otherwise, the process 440 negates the device support flag (Block 730) and is then terminated.
  • the process 440 determines if the filter device object is attached (Block 735). If so, the process 440 determines if there is a claim, release, or an inquiry (Block 740). If so, the process 440 determines if the device property flag is set (Block 745). Otherwise, the process 440 returns an error status (Block 755) and is then terminated. If there is no claim, release, or inquiry, the process 440 returns an error status (Block 755) and is then terminated. If the flag is not set in Block 745, the process 440 sends the request to the next driver (Block 750) and is then terminated.
  • the process 440 determines if the higher level driver claims the bus physical device object (Block 760). If not, the process 440 is terminated. Otherwise, the process 440 returns a success status (Block 765). Then, the process 440 processes the I/O requests or balances the load (Block 770) and is then terminated. The details of the Block 770 are shown in Figure 9.
  • Figure 8 is a flowchart illustrating the process 490 to monitor paths according to another embodiment of the invention.
  • the process 490 determines if the failover of a path is detected (Block 810). This can be performed by determining if the path is in a list of bad paths or paths having disconnected status. If not, the process 490 returns to Block 810 to continue polling for failover. Otherwise, when the failover is detected, the process 490 determines the connection status of the path or the corresponding device (Block 820). This can be done by checking the status of the device as returned by the lower-level driver or the OS driver. When a failover is detected, an alternate path to an alternate device may be established for the failed device. The alternate device may be active or passive prior to the failover. Then, the process 490 determines if the connection status is a connected status (Block 830). A connected status indicates that the device is back on line. If not, the process 490 returns to Block 820 to continue determining the connection status.
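  • The polling behaviour of Figure 8 might look roughly like the following sketch. The class, its method names, and the `probe` callable are hypothetical stand-ins: `probe` plays the role of the connection status returned by a lower-level driver, and a single call to `poll_once` corresponds to one pass through Blocks 820 and 830 rather than an endless loop.

```python
class PathMonitor:
    """Toy model of a bad path list with failback on reconnection."""

    def __init__(self, probe):
        self.probe = probe       # callable(path_id) -> True when connected
        self.bad_paths = set()   # paths that have failed over

    def report_failure(self, path_id):
        # Failover: the failed path is placed on the bad path list.
        self.bad_paths.add(path_id)

    def poll_once(self):
        """Check every bad path once and return those that came back on line."""
        recovered = sorted(p for p in self.bad_paths if self.probe(p))
        # Failback completed: recovered paths leave the bad path list.
        self.bad_paths.difference_update(recovered)
        return recovered


status = {311: False}                       # made-up connection status table
monitor = PathMonitor(lambda path_id: status[path_id])
monitor.report_failure(311)
first_pass = monitor.poll_once()            # path 311 is still disconnected
status[311] = True                          # the device comes back on line
second_pass = monitor.poll_once()           # path 311 fails back
```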
  • Figure 9 is a flowchart illustrating the process 770 to balance load according to one embodiment of the invention.
  • the process 770 maintains a queue list of I/O requests to the paths (Block 910). This can be done by storing information on each of the I/O requests in the queue list. The information may include a pointer or an address reference to the device object, a device priority, a path affinity code, a number of I/O requests for a path, or a channel characteristic. The channel characteristic may include the channel speed or transfer rate, the channel capacity, etc. Then, the process 770 distributes the I/O requests over the paths to balance the load according to a balancing policy (Block 920). This can be done by selecting a next path in the queue list (Block 930) using a balancing criterion. A number of balancing criteria or policies can be used.
  • the process 770 selects the next path on a rotation basis (Block 940). This can be performed by moving the head of a queue back to the end and advancing the queued requests up by one position.
  • in a path affinity policy, the process 770 selects the next path according to the path affinity code (Block 950).
  • the path affinity code indicates a degree of affinity of the next path with respect to the current path.
  • the process 770 selects the next path according to the number of I/O requests assigned to that path (Block 960). Typically, the path having the fewest I/O requests is selected.
  • the process 770 selects the next path according to the device priority (Block 970).
  • the device priority may be determined in advance during initialization or configuration, or may be dynamically determined based on the nature of the I/O tasks.
  • in a size policy, the process 770 selects the next path according to the block size of the I/O requests (Block 980). Typically, the path having the largest block size is selected.
  • in a channel policy, the process 770 selects the next path according to the channel characteristic (Block 990). For example, the path having a channel with a fast transfer rate may be selected.
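  • Three of the balancing policies of Figure 9 can be sketched as interchangeable selection functions. The sketch is illustrative only: the `Path` fields, the policy keys ("rotation", "requests", "channel"), and the sample numbers are hypothetical, and the channel policy is assumed to prefer the highest transfer rate.

```python
from dataclasses import dataclass


@dataclass
class Path:
    path_id: int
    pending: int = 0         # I/O requests currently assigned to the path
    transfer_rate: int = 0   # channel characteristic, e.g. in MBps


def rotation(paths):
    # Block 940: take the head of the queue and move it back to the end.
    head = paths.pop(0)
    paths.append(head)
    return head


def fewest_requests(paths):
    # Block 960: prefer the path with the fewest assigned I/O requests.
    return min(paths, key=lambda p: p.pending)


def fastest_channel(paths):
    # Block 990: prefer the channel with the highest transfer rate (assumed).
    return max(paths, key=lambda p: p.transfer_rate)


POLICIES = {
    "rotation": rotation,
    "requests": fewest_requests,
    "channel": fastest_channel,
}


def assign(paths, policy):
    """Pick a path under the named policy and charge one request to it."""
    chosen = POLICIES[policy](paths)
    chosen.pending += 1
    return chosen.path_id


paths = [Path(311, pending=2, transfer_rate=100),
         Path(321, pending=0, transfer_rate=200)]
picks = [assign(paths, "rotation"), assign(paths, "rotation"),
         assign(paths, "requests"), assign(paths, "channel")]
```

  • Keeping each policy as a separate function of the path queue lets a policy be swapped without touching the code that assigns requests; the affinity, priority, and size policies would fit the same shape with additional fields.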

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Stored Programmes (AREA)
  • Hardware Redundancy (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

An embodiment of the invention is a technique to manage multipaths for input/output (I/O) devices. An I/O request packet (IRP) from a higher level driver is received. A plurality of paths to a plurality of device objects is managed in response to the IRP using a plurality of lower level drivers. The device objects correspond to physical devices having M device types. The lower level drivers control the physical devices.

Description

UNIVERSAL MULTI-PATH DRIVER FOR STORAGE SYSTEMS
CROSS REFERENCE TO RELATED APPLICATIONS
[0001] This application is related to the following patent applications: Serial No. (Attorney Docket No. 02-031/006342.P005) entitled "FAILOVER AND FAILBACK USING A UNIVERSAL MULTI-PATH DRIVER FOR STORAGE DEVICES"; Serial No. (Attorney Docket No. 02-32/006342.P006) entitled "LOAD BALANCING IN A UNIVERSAL MULTI-PATH DRIVER FOR STORAGE DEVICES", all filed on the same date and assigned to the same assignee as the present application, the contents of each of which are herein incorporated by reference.
BACKGROUND
FIELD OF THE INVENTION
[0002] Embodiments of the invention relate to the field of storage systems, and more specifically, to a driver for storage systems.
DESCRIPTION OF RELATED ART
[0003] Storage technology has become important for many data intensive applications. Currently, there are various storage devices having different capacities and streaming rates to accommodate various applications. Examples of these storage devices include redundant array of independent disks (RAIDs), tape drives, disk drives, and tape libraries. Techniques to interface to these devices include direct-attached storage, and storage area networks (SANs).
[0004] Existing techniques to interface to these storage devices have a number of drawbacks. First, they do not provide management of different types of devices in the same driver. A system typically has to install several different types of drivers, one for each type of storage device. This creates complexity in management and system administration, increases cost in software acquisition and maintenance, and reduces system reliability and re-configurability. Second, they do not provide failover among different storage devices, reducing system fault-tolerance and increasing server downtime. Third, they do not provide load balancing among different storage devices, causing performance degradation when there is skew in storage utilization.
SUMMARY OF THE INVENTION
[0005] An embodiment of the invention is a technique to manage multiple paths for input/output (I/O) devices. An I/O request packet (IRP) from a higher level driver is received. A plurality of paths to a plurality of device objects is managed in response to the IRP using a plurality of lower level drivers. The device objects correspond to physical devices having M device types. The lower level drivers control the physical devices.
BRIEF DESCRIPTION OF THE DRAWINGS
[0006] The invention may best be understood by referring to the following description and accompanying drawings that are used to illustrate embodiments of the invention. In the drawings:
[0007] Figure 1A is a diagram illustrating a system in which one embodiment of the invention can be practiced.
[0008] Figure 1B is a diagram illustrating a server/client system according to one embodiment of the invention.
[0009] Figure 2 is a diagram illustrating a storage management driver according to one embodiment of the invention.
[0010] Figure 3 is a diagram illustrating multipaths to physical devices according to one embodiment of the invention.
[0011] Figure 4 is a diagram illustrating a universal multipath driver according to one embodiment of the invention.
[0012] Figure 5 is a flowchart illustrating a process to dispatch according to one embodiment of the invention.
[0013] Figure 6 is a flowchart illustrating a process to respond to a start device minor IRP according to one embodiment of the invention.
[0014] Figure 7 is a flowchart illustrating a process to interface to lower level drivers according to one embodiment of the invention.
[0015] Figure 8 is a flowchart illustrating a process to monitor paths according to another embodiment of the invention.
[0016] Figure 9 is a flowchart illustrating a process to balance load according to one embodiment of the invention.
DESCRIPTION
[0017] An embodiment of the invention is a technique to manage multipaths for input/output (I/O) devices. An I/O request packet (IRP) from a higher level driver is received. A plurality of paths to a plurality of device objects is managed in response to the IRP using a plurality of lower level drivers. The device objects correspond to physical devices having M device types. The lower level drivers control the physical devices.
[0018] In the following description, numerous specific details are set forth. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in order not to obscure the understanding of this description.
[0019] Figure 1A is a diagram illustrating a system 10 in which one embodiment of the invention can be practiced. The system 10 includes a server/client 20, a network 30, a switch 40, tape drives 50_1 and 50_2, a tape library 60, and a storage subsystem 70. Note that the system 10 is shown for illustrative purposes only. The system 10 may contain more or fewer components than shown. The system 10 may be used in a direct-attached storage or storage area network (SAN) configuration. The topology of the system 10 may be arbitrated loop or switched fabric.
[0020] The server/client 20 is a computer system typically used in an enterprise environment. It may be a server that performs dedicated functions such as a Web server, an electronic mail (e-mail) server, or a client in a networked environment with connection to other clients or servers. The server/client 20 usually requires large storage capacity for its computing needs. It may be used in a wide variety of applications such as finance, scientific research, multimedia, academic and government work, databases, entertainment, etc.
[0021] The network 30 is any network that connects the server/client 20 to other servers/clients or systems. The network 30 may be a local area network (LAN), a wide area network (WAN), an intranet, the Internet, or any other type of network. The network 30 may contain a number of network devices (not shown) such as gateways, adapters, routers, etc. to interface to a number of telecommunication networks such as Asynchronous Transfer Mode (ATM) or Synchronous Optical Network (SONET).
[0022] The switch 40 is an interconnecting device that interconnects the server/client 20 to various storage devices or other devices or subsystems. The switch 40 may be a hub, a switching hub, a multiple point-to-point switch, or a director, etc. Its port count typically ranges from a few ports to hundreds of ports. The complexity may range from a simple arbitrated loop to highly available point-to-point. The throughput of the switch 40 may range from 200 megabytes per second (MBps) to 1 gigabyte per second (GBps).
[0023] The tape drives 50_1 and 50_2 are storage devices with high capacity for backup and archival tasks. The capacity for a tape used in the tape drives may range from tens to hundreds of Gigabytes (GB). The transfer rates may range from 10 to 50 MBps. The tape library 60 includes multiple tape drives with automated tape loading. The capacity of the tape library 60 may range from 1 to 1,000 Terabytes (TB) with an aggregate data rate of 50-300 MBps. The tape drives 50_1 and 50_2 and tape library 60 use sequential accesses. The storage subsystem 70 includes a disk subsystem 72, a redundant array of inexpensive disks (RAID) subsystem 74, and a storage device 76. The disk subsystem 72 may be a single drive or an array of disks. The RAID subsystem 74 is an array of disks with additional complexity and features to increase manageability, performance, capacity, reliability, and availability. The storage device 76 may be any other storage system, including magnetic, optical, electro-optical, etc.
[0024] The tape drives 50_1 and 50_2, tape library 60, disk subsystem 72, redundant array of inexpensive disks (RAID) subsystem 74, and storage device 76 form physical devices that are attached to the server/client 20 to provide archival storage. These devices typically include different device types. The server/client 20 has the ability to interface to all of these device types (e.g., tape drives, tape library, disk, RAID) in multiple paths.
[0025] Figure 1B is a diagram illustrating a server/client system 20 in which one embodiment of the invention can be practiced. The server/client system 20 includes a processor 110, a processor bus 120, a memory control hub (MCH) 130, a system memory 140, an input/output control hub (ICH) 150, a peripheral bus 160, host bus adapters (HBAs) 165_1 to 165_M, a mass storage device 170, and input/output devices 180_1 to 180_K. Note that the server/client system 20 may include more or fewer elements than these.
[0026] The processor 110 represents a central processing unit of any type of architecture, such as embedded processors, mobile processors, micro-controllers, digital signal processors, superscalar computers, vector processors, single instruction multiple data (SIMD) computers, complex instruction set computers (CISC), reduced instruction set computers (RISC), very long instruction word (VLIW), or hybrid architecture.
[0027] The processor bus 120 provides interface signals to allow the processor 110 to communicate with other processors or devices, e.g., the MCH 130. The processor bus 120 may support a uni-processor or multiprocessor configuration. The processor bus 120 may be parallel, sequential, pipelined, asynchronous, synchronous, or any combination thereof.
[0028] The MCH 130 provides control and configuration of memory and input/output devices such as the system memory 140 and the ICH 150. The MCH 130 may be integrated into a chipset that integrates multiple functionalities such as the isolated execution mode, host-to-peripheral bus interface, and memory control. The MCH 130 interfaces to the peripheral bus 160. For clarity, not all the peripheral buses are shown. It is contemplated that the subsystem 40 may also include peripheral buses such as Peripheral Component Interconnect (PCI), accelerated graphics port (AGP), Industry Standard Architecture (ISA) bus, and Universal Serial Bus (USB), etc.
[0029] The system memory 140 stores system code and data. The system memory 140 is typically implemented with dynamic random access memory (DRAM) or static random access memory (SRAM). The system memory 140 may include program code or code segments implementing one embodiment of the invention. The system memory 140 includes a storage management driver 145. Any one of the elements of the storage management driver 145 may be implemented by hardware, software, firmware, microcode, or any combination thereof. The system memory 140 may also include other programs or data which are not shown, such as an operating system. The storage management driver 145 contains program code that, when executed by the processor 110, causes the processor 110 to perform operations as described below.
[0030] The ICH 150 has a number of functionalities that are designed to support I/O functions. The ICH 150 may also be integrated into a chipset together with or separate from the MCH 130 to perform I/O functions. The ICH 150 may include a number of interface and I/O functions such as PCI bus interface to interface to the peripheral bus 160, processor interface, interrupt controller, direct memory access (DMA) controller, power management logic, timer, system management bus (SMBus), universal serial bus (USB) interface, mass storage interface, low pin count (LPC) interface, etc.
[0031] The HBAs 165_1 to 165_M are adapters that interface to the switch 40 (Figure 1A). The HBAs 165_1 to 165_M are typically add-on cards that interface to the peripheral bus 160 or any other bus accessible to the processor 110. The HBAs may have their own processor with local memory or frame buffer to store temporary data. The protocols supported by the HBAs may be Small Computer System Interface (SCSI), Internet Protocol (IP), and Fibre Channel (FC). The transfer rates may be hundreds of MBps with full duplex. The media may include copper and multi-mode optics.
[0032] The mass storage device 170 stores archive information such as code, programs, files, data, applications, and operating systems. The mass storage device 170 may include compact disk (CD) ROM 172, a digital video/versatile disc (DVD) 173, floppy drive 174, hard drive 176, flash memory 178, and any other magnetic or optic storage devices. The mass storage device 170 provides a mechanism to read machine-accessible media. The machine-accessible media may contain computer readable program code to perform tasks as described in the following.
[0033] The I/O devices 180_1 to 180_K may include any I/O devices to perform I/O functions. Examples of I/O devices 180_1 to 180_K include controllers for input devices (e.g., keyboard, mouse, trackball, pointing device), media cards (e.g., audio, video, graphics), network cards, and any other peripheral controllers.
[0034] Elements of one embodiment of the invention may be implemented by hardware, firmware, software or any combination thereof. The term hardware generally refers to an element having a physical structure such as electronic, electromagnetic, optical, electro-optical, mechanical, electro-mechanical parts, etc. The term software generally refers to a logical structure, a method, a procedure, a program, a routine, a process, an algorithm, a formula, a function, an expression, etc. The term firmware generally refers to a logical structure, a method, a procedure, a program, a routine, a process, an algorithm, a formula, a function, an expression, etc. that is implemented or embodied in a hardware structure (e.g., flash memory, ROM, EROM). Examples of firmware may include microcode, writable control store, micro-programmed structure. When implemented in software or firmware, the elements of an embodiment of the present invention are essentially the code segments to perform the necessary tasks. The software/firmware may include the actual code to carry out the operations described in one embodiment of the invention, or code that emulates or simulates the operations. The program or code segments can be stored in a processor or machine accessible medium or transmitted by a computer data signal embodied in a carrier wave, or a signal modulated by a carrier, over a transmission medium. The "processor readable or accessible medium" or "machine readable or accessible medium" may include any medium that can store, transmit, or transfer information. Examples of the processor readable or machine accessible medium include an electronic circuit, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable ROM (EROM), a floppy diskette, a compact disk (CD) ROM, an optical disk, a hard disk, a fiber optic medium, a radio frequency (RF) link, etc.
The computer data signal may include any signal that can propagate over a transmission medium such as electronic network channels, optical fibers, air, electromagnetic, RF links, etc. The code segments may be downloaded via computer networks such as the Internet, Intranet, etc. The machine accessible medium may be embodied in an article of manufacture. The machine accessible medium may include data that, when accessed by a machine, cause the machine to perform the operations described in the following. The machine accessible medium may also include program code embedded therein. The program code may include machine readable code to perform the operations described in the following. The term "data" here refers to any type of information that is encoded for machine-readable purposes. Therefore, it may include program, code, data, file, etc.
[0035] All or part of an embodiment of the invention may be implemented by hardware, software, or firmware, or any combination thereof. The hardware, software, or firmware element may have several modules coupled to one another. A hardware module is coupled to another module by mechanical, electrical, optical, electromagnetic or any physical connections. A software module is coupled to another module by a function, procedure, method, subprogram, or subroutine call, a jump, a link, a parameter, variable, and argument passing, a function return, etc. A software module is coupled to another module to receive variables, parameters, arguments, pointers, etc. and/or to generate or pass results, updated variables, pointers, etc. A firmware module is coupled to another module by any combination of hardware and software coupling methods above. A hardware, software, or firmware module may be coupled to any one of another hardware, software, or firmware module. A module may also be a software driver or interface to interact with the operating system running on the platform. A module may also be a hardware driver to configure, set up, initialize, send and receive data to and from a hardware device. An apparatus may include any combination of hardware, software, and firmware modules.
[0036] One embodiment of the invention may be described as a process which is usually depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process is terminated when its operations are completed. A process may correspond to a method, a program, a procedure, a method of manufacturing or fabrication, etc.
[0037] Figure 2 is a diagram illustrating the storage management driver 145 according to one embodiment of the invention. The storage management driver 145 includes operating system components 210, higher-level drivers 218, a universal multipath driver 220, and a lower level driver 250.
[0038] The OS components 210 include an I/O manager 212, a plug and play (PnP) manager 214, and a power manager 216. The I/O manager 212 provides a consistent interface to all kernel-mode drivers. It defines a set of standard routines that other drivers can support. All I/O requests from the I/O manager 212 are sent as I/O request packets (IRPs). The PnP manager 214 supports automatic installation and configuration of drivers when their corresponding devices are plugged into the system. The power manager 216 manages power usage of the system including power modes such as sleep, hibernation, or shutdown. The higher-level drivers 218 include a class driver and may also include any driver that interfaces directly to the OS components 210. In one embodiment, the I/O manager 212, the PnP manager 214, and the power manager 216 are from Microsoft Windows 2000, CE, and .NET.
[0039] The universal multipath driver (UMD) 220 is a driver that provides multipath management to the storage devices shown in Figure 1B such as the tape drives, the tape library, and the disk subsystem. The UMD 220 responds to an IRP sent by the higher level driver 218 and interfaces to the lower level driver 250.
[0040] The lower level driver 250 includes drivers that are directly responsible for the control and management of the devices attached to the system. The lower level driver 250 includes a tape drive device driver 252, a tape library device driver 254, and an HBA driver 256, which are drivers for device 165_i, library 165_j, and HBA 165_k, respectively. The HBA 165_k in turn directly controls the corresponding storage device(s) shown in Figure 1B.
[0041] Figure 3 is a diagram illustrating multipaths to physical devices according to one embodiment of the invention. A path is a physical connection between an HBA and the corresponding device. Typically, an HBA is interfaced to a number of devices via multiple paths through the switch 40. For example, the HBA 165_j is connected to the tape drives 50_1 and 50_2 and the tape library 60 through the paths 311, 312, and 313; and the HBA 165_k is connected to them through the paths 321, 322, and 323, respectively.
[0042] The universal multipath driver (UMD) 220 provides multipath management, failover and failback, and load balancing. This is accomplished by maintaining a list of devices attached to the system. The devices are identified by their device name, device identifier, and device serial number. This information is typically provided by the peripheral devices upon inquiry by the corresponding lower-level drivers.
[0043] Figure 4 is a diagram illustrating the universal multipath driver (UMD) 220 according to one embodiment of the invention. The UMD 220 includes a driver entry 410, a major function group 420, a system thread 480, and a path monitor 490.
[0044] The driver entry 410 provides an entry point for the UMD 220 in response to an IRP issued by the higher level driver 218. The driver entry 410 includes a driver object pointer 415 that provides an address reference or points to the major function group 420. The driver entry 410 also causes creation of the system thread 480. The system thread 480 invokes the path monitor 490.
[0045] The major function group 420 includes a number of functions, routines, or modules that manage the multiple paths connected to the physical devices. In one embodiment, these functions, modules, or routines are compatible with the Microsoft Developer Network (MSDN) library. The major function group 420 includes a dispatch function 430, a filter SCSI function 440, a filter add device function 450, a filter unload function 460, and a power dispatch function 470.
[0046] The dispatch function 430, the filter SCSI function 440, and the filter add device function 450 interface to the lower level driver 250. The dispatch function 430 dispatches the operations in response to receiving an IRP from the higher level driver 218. In one embodiment, the PnP manager sends a major PnP IRP request during enumeration, resource rebalancing, and any other time that plug-and-play activity occurs on the system. The filter SCSI function 440 sets up IRP's with device- or device-specific I/O control codes, requesting support from the lower-level drivers. The filter add device function 450 creates and initializes a new filter device object for the corresponding physical device object, then it attaches the device object to the device stack of the drivers for the device. The filter unload function 460 frees any objects and releases any driver-allocated resources. It terminates the system thread 480.
[0047] The path monitor 490 monitors the multiple paths in the system and determines if there is any failover. Path failover occurs when a peripheral device is no longer reachable via one of the paths. This may be a result of disconnection or any other malfunction or errors. When failover occurs, the failed path is placed into a bad path list. When a bad path becomes functional again, path failback can then be initiated. When failback is completed, the path is removed from the bad path list. When a failover is detected, an alternate path to an alternate device may be established for the failed device. The alternate device may be active or passive prior to the failover.
[0048] Figure 5 is a flowchart illustrating the process 430 to dispatch according to one embodiment of the invention.
[0049] Upon START, the process 430 responds to a minor IRP (Block 510). A minor IRP may be a start device minor IRP (Block 520), a remove device minor IRP (Block 530), a device relation minor IRP (Block 540), a query id minor IRP (Block 550), a stop device minor IRP (Block 560), or a device usage notification (Block 570). The process 430 performs operations in response to these minor IRPs accordingly.
[0050] The details of operations for the start device minor IRP in Block 520 are shown in Figure 6. In response to the remove device minor IRP, the process 430 removes an entry from a device list (Block 532). This entry contains the device attributes such as name, serial number, and device ID. Next, the process 430 detaches the attached device (Block 534). This can be performed by sending a command to the lower level driver that is responsible for controlling the attached device. The process 430 is then terminated.
[0051] In response to the device relation minor IRP, the process 430 allocates a device relation structure in paged memory (Block 542) and is then terminated. In response to the query id minor IRP, the process 430 creates a device ID (Block 552), returns the device ID (Block 554) and is then terminated. In response to the stop device minor IRP (Block 560), the process 430 removes an entry from the device list (Block 562) and is then terminated. In response to the device usage notification minor IRP (Block 570), the process 430 forwards the IRP to the next driver in the stack (Block 572) and is then terminated.
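The dispatch behavior of paragraphs [0049] to [0051] can be sketched as a small Python model. This is an illustrative sketch only, not the driver's actual code: the class and method names, the dictionary layout of the device list, and the `UMD-<serial>` device ID format are all assumptions introduced here for the example.

```python
from enum import Enum, auto


class MinorIrp(Enum):
    """Minor IRP kinds listed for Blocks 520 to 570 (names are illustrative)."""
    START_DEVICE = auto()
    REMOVE_DEVICE = auto()
    DEVICE_RELATIONS = auto()
    QUERY_ID = auto()
    STOP_DEVICE = auto()
    USAGE_NOTIFICATION = auto()


class DispatchSketch:
    """Toy model of process 430; every name here is hypothetical."""

    def __init__(self, device_list):
        # Maps a serial number to device attributes (assumed layout).
        self.device_list = dict(device_list)
        # Records requests that would go to the lower level driver or stack.
        self.actions = []

    def dispatch(self, minor, serial):
        if minor is MinorIrp.START_DEVICE:
            self.actions.append(("start", serial))        # Block 520 (Figure 6)
        elif minor is MinorIrp.REMOVE_DEVICE:
            self.device_list.pop(serial, None)            # Block 532
            self.actions.append(("detach", serial))       # Block 534
        elif minor is MinorIrp.DEVICE_RELATIONS:
            objects = sorted(self.device_list)            # Block 542
            return {"count": len(objects), "objects": objects}
        elif minor is MinorIrp.QUERY_ID:
            return "UMD-" + serial                        # Blocks 552 and 554 (assumed ID format)
        elif minor is MinorIrp.STOP_DEVICE:
            self.device_list.pop(serial, None)            # Block 562
        elif minor is MinorIrp.USAGE_NOTIFICATION:
            self.actions.append(("forward", serial))      # Block 572
        return None
```

The device relation structure is modeled as a count plus a list of object references, which mirrors the "count and at least a device object pointer" wording of the claims.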
[0052] Figure 6 is a flowchart illustrating the process 520 to respond to a start device minor IRP according to one embodiment of the invention.
[0053] Upon START, the process 520 starts the device using the lower level driver (Block 610). This may be performed by sending a command to the lower level driver that directly controls the device, or by writing control parameters to the appropriate calling function. Next, the process 520 obtains the device name (Block 620). Then, the process 520 sends a control command to the lower level driver to obtain the SCSI address of the device (Block 630). Next, the process 520 obtains the device identifier (ID) (Block 640). Then, the process 520 obtains the device serial number (Block 650).
[0054] Next, the process 520 determines if the device code matches an entry in the device list (Block 660). The device code may be any one of the device ID or the device serial number or both. If so, the process 520 creates a new bus physical device object (Block 670) and is then terminated. Otherwise, the process 520 is terminated.
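The start device sequence of Figure 6 can be sketched as follows. This is a minimal illustration under stated assumptions: `DeviceInfo`, `FakeLowerDriver`, `start_device`, and the returned dictionary standing in for a bus physical device object are all hypothetical, and the device code is taken to be the pair of device ID and serial number, which is one of the options the description allows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    name: str            # Block 620
    scsi_address: tuple  # Block 630
    device_id: str       # Block 640
    serial: str          # Block 650


class FakeLowerDriver:
    """Stand-in for a lower level driver; a real one would query the hardware."""

    def __init__(self, info):
        self.info = info
        self.started = False

    def start(self):
        self.started = True

    def query(self):
        return self.info


def start_device(driver, device_list):
    """Return a stand-in bus physical device object, or None if unsupported."""
    driver.start()                           # Block 610
    info = driver.query()                    # Blocks 620 to 650
    code = (info.device_id, info.serial)     # device code: ID and serial (assumption)
    if code in device_list:                  # Block 660
        # Block 670: a dict stands in for the new bus physical device object.
        return {"bus_pdo_for": info.name, "scsi_address": info.scsi_address}
    return None
```

A device whose code is not in the list is started but gets no bus physical device object, matching the "Otherwise, the process 520 is terminated" branch.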
[0055] Figure 7 is a flowchart illustrating the process 440 to interface to lower level drivers according to one embodiment of the invention.
[0056] Upon START, the process 440 determines if a device property flag indicating that the device property has been obtained is set (Block 710). If not, the process 440 obtains the supported device name of the attached device (Block 715). Then, the process 440 determines if the supported device name is on the device list (Block 720). If so, the process 440 asserts a device support flag and is then terminated. Otherwise, the process 440 negates the device support flag (Block 730) and is then terminated.
[0057] If the device property flag is set, the process 440 determines if the filter device object is attached (Block 735). If so, the process 440 determines if there is a claim, release, or an inquiry (Block 740). If so, the process 440 determines if the device property flag is set (Block 745). Otherwise, the process 440 returns an error status (Block 755) and is then terminated. If there is no claim, release, or inquiry, the process 440 returns an error status (Block 755) and is then terminated. If the flag is not set in Block 745, the process 440 sends the request to the next driver (Block 750) and is then terminated.
[0058] If the filter device object is not attached, the process 440 determines if the higher level driver claims the bus physical device object (Block 760). If not, the process 440 is terminated. Otherwise, the process 440 returns a success status (Block 765). Then, the process 440 processes the I/O requests or balances the load (Block 770) and is then terminated. The details of the Block 770 are shown in Figure 9.
[0059] Figure 8 is a flowchart illustrating the process 490 to monitor paths according to another embodiment of the invention.
[0060] Upon START, the process 490 determines if the failover of a path is detected (Block 810). This can be performed by determining if the path is in a list of bad paths or paths having disconnected status. If not, the process 490 returns to Block 810 to continue polling the failover. Otherwise, when the failover is detected, the process 490 determines the connection status of the path or the corresponding device (Block 820). This can be done by checking the status of the device as returned by the lower-level driver or the OS driver. When a failover is detected, an alternate path to an alternate device may be established for the failed device. The alternate device may be active or passive prior to the failover. Then, the process 490 determines if the connection status is a connected status (Block 830). A connected status indicates that the device is back on line. If not, the process 490 returns to Block 820 to continue determining the connection status.
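The bad path list and the polling of Figure 8 can be sketched as below. This is a simplified model, not the patented implementation: the `PathMonitor` class, the `probe` callback that reports connection status, and the single-pass `poll_once` method are assumptions made for illustration. A real monitor would run in the system thread and loop over these steps.

```python
class PathMonitor:
    """Toy model of process 490; all names are hypothetical."""

    def __init__(self, probe):
        # probe(path) returns True when the path reports a connected status.
        self.probe = probe
        self.bad_paths = []

    def report_failure(self, path):
        """Failover: place the failed path into the bad path list."""
        if path not in self.bad_paths:
            self.bad_paths.append(path)

    def poll_once(self):
        """One pass over Blocks 810 to 830; returns the paths that failed back."""
        # Block 810: only paths in the bad path list need attention.
        # Blocks 820 and 830: check whether each bad path is connected again.
        recovered = [p for p in self.bad_paths if self.probe(p)]
        for p in recovered:
            # Failback completed: remove the path from the bad path list.
            self.bad_paths.remove(p)
        return recovered
```

Paths that are still disconnected stay in the list, so the next call to `poll_once` checks them again, which corresponds to returning to Block 820.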
[0061] Figure 9 is a flowchart illustrating the process 770 to balance load according to one embodiment of the invention.
[0062] Upon START, the process 770 maintains a queue list of I/O requests to the paths (Block 910). This can be done by storing information on each of the I/O requests in the queue list. The information may include a pointer or an address reference to the device object, a device priority, a path affinity code, a number of I/O requests for a path, or a channel characteristic. The channel characteristic may include the channel speed or transfer rate, the channel capacity, etc. Then, the process 770 distributes the I/O requests over the paths to balance the load according to a balancing policy (Block 920). This can be done by selecting a next path in the queue list (Block 930) using a balancing criterion. A number of balancing criteria or policies can be used.
[0063] In a round robin policy, the process 770 selects the next path on a rotation basis (Block 940). This can be performed by moving the head of a queue back to the end and advancing the queued requests up by one position. In a path affinity policy, the process 770 selects the next path according to the path affinity code (Block 950). The path affinity code indicates a degree of affinity of the next path with respect to the current path. In a request policy, the process 770 selects the next path according to the number of I/O requests assigned to that path (Block 960). Typically, the path having the fewest I/O requests is selected. In a priority policy, the process 770 selects the next path according to the device priority (Block 970). The device priority may be determined in advance during initialization or configuration, or may be dynamically determined based on the nature of the I/O tasks. In a size policy, the process 770 selects the next path according to the block size of the I/O requests (Block 980). Typically, the path having the largest block size is selected. In a channel policy, the process 770 selects the next path according to the channel characteristic (Block 990). For example, the path having a channel with a fast transfer rate may be selected.
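Four of the balancing policies can be illustrated with a short Python sketch. The `Path` record and the function names are hypothetical, and two conventions are assumed for the example: a larger `priority` value means a preferred device, and the channel policy prefers the highest transfer rate.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Path:
    name: str
    pending: int = 0        # number of I/O requests assigned to the path
    priority: int = 0       # device priority (larger is preferred; assumption)
    transfer_rate: int = 0  # channel characteristic in MBps


def round_robin(queue):
    """Block 940: take the head of the queue and move it back to the end."""
    path = queue[0]
    queue.rotate(-1)  # the head goes to the end; the others advance one position
    return path


def least_requests(paths):
    """Block 960: pick the path with the fewest assigned I/O requests."""
    return min(paths, key=lambda p: p.pending)


def highest_priority(paths):
    """Block 970: pick the path whose device has the highest priority."""
    return max(paths, key=lambda p: p.priority)


def fastest_channel(paths):
    """Block 990: pick the path with the highest channel transfer rate."""
    return max(paths, key=lambda p: p.transfer_rate)
```

Keeping each policy as a separate function over the same path records means the selection step of Block 930 can switch policies without changing the queue list itself.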
[0064] While the invention has been described in terms of several embodiments, those of ordinary skill in the art will recognize that the invention is not limited to the embodiments described, but can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative instead of limiting.

Claims

CLAIMS
What is claimed is:
1. A method comprising: receiving an input/output (I/O) request packet (IRP) from a higher level driver; and managing a plurality of paths to a plurality of device objects in response to the IRP using a plurality of lower level drivers, the device objects corresponding to physical devices having M device types, the lower level drivers controlling the physical devices.
2. The method of claim 1 wherein receiving the IRP comprises: receiving the IRP from the higher level driver, the higher level driver being a class driver.
3. The method of claim 1 wherein managing the plurality of paths comprises: responding to a major function IRP; and monitoring a device path in the N paths.
4. The method of claim 3 wherein managing the N paths further comprises: adding a new device object; and freeing the device objects.
5. The method of claim 3 wherein responding to the major function IRP comprises: responding to the major function IRP from one of an I/O manager, a Plug and Play (PnP) manager, and a power manager.
6. The method of claim 5 wherein responding to the major function IRP comprises: responding to one of a start device minor IRP, a remove device minor IRP, a device relation minor IRP, an identifier (ID) query minor IRP, a stop device minor IRP, and a usage notification minor IRP.
7. The method of claim 6 wherein responding to the start device minor IRP comprises: obtaining a peripheral address of an attached device using one of the lower level drivers; obtaining a device code of the attached device using the peripheral address; and determining if the device code matches an entry in a device list.
8. The method of claim 7 further comprising: creating a new bus physical device object if the device code matches the entry.
9. The method of claim 6 wherein responding to the remove device minor IRP comprises: removing an entry from a device list, the entry corresponding to an attached device; and detaching the attached device.
10. The method of claim 6 wherein responding to the device relation minor IRP comprises: allocating a device relation structure from a paged memory containing a count and at least a device object pointer.
11. The method of claim 6 wherein responding to the ID query minor IRP comprises: creating a device ID; and returning the device ID to the higher-level driver.
12. The method of claim 6 wherein responding to the stop device minor IRP comprises: removing an entry from a device list, the entry corresponding to an attached device.
13. The method of claim 6 wherein responding to the usage notification minor IRP comprises: forwarding the usage notification minor IRP to a next driver.
14. The method of claim 5 wherein responding to the major function IRP comprises: obtaining a supported device name of an attached device; asserting a device support flag if the supported device name is in the device list; and negating the device support flag if the supported device name is not in the list.
15. The method of claim 5 wherein responding to the major function IRP comprises: determining if a filter device object is attached; returning an error status if there is one of a claim, a release, and an inquiry for a supported device; and sending the major function IRP to a next driver.
16. The method of claim 5 wherein responding to the major function IRP comprises: processing I/O requests in a queue list.
17. The method of claim 3 wherein monitoring the path comprises: polling path status of a path, the path status including a disconnect status and a connect status for a physical device corresponding to the path; and determining if the path having the disconnect status becomes having the connect status; and adjusting the path.
18. The method of claim 17 wherein polling the path status comprises: determining if at least a first device name corresponding to a disconnected device is present in a first list.
19. The method of claim 17 wherein adjusting the path comprises: removing the first device name from the first list.
20. The method of claim 1 wherein managing the N paths comprises: managing the N paths to a plurality of device objects in response to the IRP, the device objects corresponding to physical devices having M device types, the M devices types including at least two of a disk device, a redundant array of inexpensive disks (RAID) subsystem, a tape device, and a tape library.
21. An article of manufacture comprising: a machine-accessible medium including data that, when accessed by a machine, causes the machine to perform operations comprising: receiving an input/output (I/O) request packet (IRP) from a higher level driver; and managing a plurality of paths to a plurality of device objects in response to the IRP using a plurality of lower level drivers, the device objects corresponding to physical devices having M device types, the lower level drivers controlling the physical devices.
22. The article of manufacture of claim 21 wherein the data causing the machine to perform receiving the IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: receiving the IRP from the higher level driver, the higher level driver being a class driver.
23. The article of manufacture of claim 21 wherein the data causing the machine to perform managing the plurality of paths comprises data that, when accessed by the machine, causes the machine to perform operations comprising: responding to a major function IRP; and monitoring a device path in the N paths.
24. The article of manufacture of claim 23 wherein the data causing the machine to perform managing the N paths further comprises data that, when accessed by the machine, causes the machine to perform operations comprising: adding a new device object; and freeing the device objects.
25. The article of manufacture of claim 23 wherein the data causing the machine to perform responding to the major function IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: responding to the major function IRP from one of an I/O manager, a Plug and Play (PnP) manager, and a power manager.
26. The article of manufacture of claim 25 wherein the data causing the machine to perform responding to the major function IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: responding to one of a start device minor IRP, a remove device minor IRP, a device relation minor IRP, an identifier (ID) query minor IRP, a stop device minor IRP, and a usage notification minor IRP.
27. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the start device minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: obtaining a peripheral address of an attached device using one of the lower level drivers; obtaining a device code of the attached device using the peripheral address; and determining if the device code matches an entry in a device list.
28. The article of manufacture of claim 27 wherein the data further comprises data that, when accessed by the machine, causes the machine to perform operations comprising: creating a new bus physical device object if the device code matches the entry.
29. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the remove device minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: removing an entry from a device list, the entry corresponding to an attached device; and detaching the attached device.
30. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the device relation minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: allocating a device relation structure from a paged memory containing a count and at least a device object pointer.
31. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the ID query minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: creating a device ID; and returning the device ID to the higher level driver.
32. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the stop device minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: removing an entry from a device list, the entry corresponding to an attached device.
33. The article of manufacture of claim 26 wherein the data causing the machine to perform responding to the usage notification minor IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: forwarding the usage notification minor IRP to a next driver.
34. The article of manufacture of claim 25 wherein the data causing the machine to perform responding to the major function IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: obtaining a supported device name of an attached device; asserting a device support flag if the supported device name is in the device list; and negating the device support flag if the supported device name is not in the list.
35. The article of manufacture of claim 25 wherein the data causing the machine to perform responding to the major function IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: determining if a filter device object is attached; returning an error status if there is one of a claim, a release, and an inquiry for a supported device; and sending the major function IRP to a next driver.
36. The article of manufacture of claim 25 wherein the data causing the machine to perform responding to the major function IRP comprises data that, when accessed by the machine, causes the machine to perform operations comprising: processing I/O requests in a queue list.
37. The article of manufacture of claim 23 wherein the data causing the machine to perform monitoring the path comprises data that, when accessed by the machine, causes the machine to perform operations comprising: polling path status of a path, the path status including a disconnect status and a connect status for a physical device corresponding to the path; and determining if the path having the disconnect status becomes having the connect status; and adjusting the path.
38. The article of manufacture of claim 37 wherein the data causing the machine to perform polling the path status comprises data that, when accessed by the machine, causes the machine to perform operations comprising: determining if at least a first device name corresponding to a disconnected device is present in a first list.
39. The article of manufacture of claim 37 wherein the data causing the machine to perform adjusting the path comprises data that, when accessed by the machine, causes the machine to perform operations comprising: removing the first device name from the first list.
40. The article of manufacture of claim 21 wherein the data causing the machine to perform managing the N paths comprises data that, when accessed by the machine, causes the machine to perform operations comprising: managing the N paths to a plurality of device objects in response to the IRP, the device objects corresponding to physical devices having M device types, the M devices types including at least two of a disk device, a redundant array of inexpensive disks (RAID) subsystem, a tape device, and a tape library.
41. A system comprising: a processor; a plurality of physical devices coupled to the processor via a plurality of adapters, the physical devices having M device types; and a memory coupled to a processor, the memory containing program code that, when executed by the processor, causes the processor to: receive an input/output (I/O) request packet (IRP) from a higher level driver, and manage a plurality of paths to a plurality of device objects in response to the IRP using a plurality of lower level drivers, the device objects corresponding to the physical devices, the lower level drivers controlling the physical devices.
42. The system of claim 41 wherein the program code causing the processor to receive the IRP comprises program code that, when executed by the processor, causes the processor to: receive the IRP from the higher level driver, the higher level driver being one of a class driver.
43. The system of claim 41 wherein the program code causing the processor to manage the plurality of paths comprises program code that, when executed by the processor, causes the processor to: respond to a major function IRP; and monitor a device path in the N paths.
44. The system of claim 43 wherein the program code causing the machine to manage the N paths further comprises program code that, when executed by the processor, causes the processor to: add a new device object; and free the device objects.
45. The system of claim 43 wherein the program code causing the machine to respond to the major function IRP comprises program code that, when executed by the processor, causes the processor to: respond to the major function IRP from one of an I/O manager, a Plug and Play (PnP) manager, and a power manager.
46. The system of claim 45 wherein the program code causing the machine to respond to the major function IRP comprises program code that, when executed by the processor, causes the processor to: respond to one of a start device minor IRP, a remove device minor IRP, a device relation minor IRP, an identifier (ID) query minor IRP, a stop device minor IRP, and a usage notification minor IRP.
47. The system of claim 46 wherein the program code causing the machine to respond to the start device minor IRP comprises program code that, when executed by the processor, causes the processor to: obtain a peripheral address of an attached device using one of the lower level drivers; obtain a device code of the attached device using the peripheral address; and determine if the device code matches an entry in a device list.
48. The system of claim 47 wherein the program code further comprises program code that, when executed by the processor, causes the processor to: create a new bus physical device object if the device code matches the entry.
49. The system of claim 46 wherein the program code causing the machine to respond to the remove device minor IRP comprises program code that, when executed by the processor, causes the processor to: remove an entry from a device list, the entry corresponding to an attached device; and detach the attached device.
50. The system of claim 46 wherein the program code causing the machine to respond to the device relation minor IRP comprises program code that, when executed by the processor, causes the processor to: allocate a device relation structure from a paged memory containing a count and at least a device object pointer.
51. The system of claim 46 wherein the program code causing the machine to respond to the ID query minor IRP comprises program code that, when executed by the processor, causes the processor to: create a device ID; and return the device ID to the higher level driver.
52. The system of claim 46 wherein the program code causing the machine to respond to the stop device minor IRP comprises program code that, when executed by the processor, causes the processor to: remove an entry from a device list, the entry corresponding to an attached device.
53. The system of claim 46 wherein the program code causing the machine to respond to the usage notification minor IRP comprises program code that, when executed by the processor, causes the processor to: forward the usage notification minor IRP to a next driver.
54. The system of claim 45 wherein the program code causing the machine to respond to the major function IRP comprises program code that, when executed by the processor, causes the processor to: obtain a supported device name of an attached device; assert a device support flag if the supported device name is in the device list; and negate the device support flag if the supported device name is not in the list.
55. The system of claim 45 wherein the program code causing the machine to respond to the major function IRP comprises program code that, when executed by the processor, causes the processor to: determine if a filter device object is attached; return an error status if there is one of a claim, a release, and an inquiry for a supported device; and send the major function IRP to a next driver.
56. The system of claim 45 wherein the program code causing the machine to respond to the major function IRP comprises program code that, when executed by the processor, causes the processor to: process I/O requests in a queue list.
57. The system of claim 43 wherein the program code causing the machine to monitor the path comprises program code that, when executed by the processor, causes the processor to: poll path status of a path, the path status including a disconnect status and a connect status for a physical device corresponding to the path; and determine if the path having the disconnect status becomes having the connect status; and adjust the path.
58. The system of claim 57 wherein the program code causing the machine to poll the path status comprises program code that, when executed by the processor, causes the processor to: determine if at least a first device name corresponding to a disconnected device is present in a first list.
59. The system of claim 57 wherein the program code causing the machine to adjust the path comprises program code that, when executed by the processor, causes the processor to: remove the first device name from the first list.
60. The system of claim 41 wherein the program code causing the machine to manage the N paths comprises program code that, when executed by the processor, causes the processor to: manage the N paths to a plurality of device objects in response to the IRP, the device objects corresponding to physical devices having M device types, the M devices types including at least two of a disk device, a redundant array of inexpensive disks (RAID) subsystem, a tape device, and a tape library.
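
The path monitoring recited in claims 17 to 19 (polling the path status, detecting a path whose status changes from disconnected to connected, and removing the corresponding device name from a first list) can be sketched in a few lines of Python. This is an illustration only: `poll_paths`, the `is_connected` callback, and the use of a plain list for the "first list" are hypothetical and are not taken from the patent.

```python
def poll_paths(disconnected, is_connected):
    """Return device names that have reconnected and drop them from the list.

    disconnected: mutable list of device names whose paths are disconnected
                  (stands in for the "first list" of the claims).
    is_connected: callable taking a device name and returning True when the
                  path to that device now reports the connect status.
    """
    restored = [name for name in disconnected if is_connected(name)]
    for name in restored:
        disconnected.remove(name)  # adjust the path: the name leaves the list
    return restored
```

A caller would invoke `poll_paths` periodically and hand each restored name back to whatever path selection logic is in use.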
PCT/US2003/039869 2002-12-16 2003-12-15 Universal multi-path driver for storage systems WO2004061642A2 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
JP2004565492A JP2006513469A (en) 2002-12-16 2003-12-15 General-purpose multipath driver for storage systems
EP03814802A EP1579334B1 (en) 2002-12-16 2003-12-15 Universal multi-path driver for storage systems
AU2003297111A AU2003297111A1 (en) 2002-12-16 2003-12-15 Universal multi-path driver for storage systems
DE60313040T DE60313040T2 (en) 2002-12-16 2003-12-15 UNIVERSAL MULTI-DRIVER FOR STORAGE SYSTEMS

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US10/321,029 2002-12-16
US10/321,029 US7222348B1 (en) 2002-12-16 2002-12-16 Universal multi-path driver for storage systems

Publications (2)

Publication Number Publication Date
WO2004061642A2 true WO2004061642A2 (en) 2004-07-22
WO2004061642A3 WO2004061642A3 (en) 2005-02-17

Family

ID=32710749

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2003/039869 WO2004061642A2 (en) 2002-12-16 2003-12-15 Universal multi-path driver for storage systems

Country Status (6)

Country Link
US (1) US7222348B1 (en)
EP (1) EP1579334B1 (en)
JP (1) JP2006513469A (en)
AU (1) AU2003297111A1 (en)
DE (1) DE60313040T2 (en)
WO (1) WO2004061642A2 (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7610589B2 (en) * 2003-08-22 2009-10-27 Hewlett-Packard Development Company, L.P. Using helper drivers to build a stack of device objects
US7406617B1 (en) * 2004-11-22 2008-07-29 Unisys Corporation Universal multi-path driver for storage systems including an external boot device with failover and failback capabilities
US8874806B2 (en) * 2005-10-13 2014-10-28 Hewlett-Packard Development Company, L.P. Methods and apparatus for managing multipathing software
US8402172B2 (en) * 2006-12-22 2013-03-19 Hewlett-Packard Development Company, L.P. Processing an input/output request on a multiprocessor system
US7944572B2 (en) * 2007-01-26 2011-05-17 Xerox Corporation Protocol allowing a document management system to communicate inter-attribute constraints to its clients
JP4994128B2 (en) * 2007-06-28 2012-08-08 株式会社日立製作所 Storage system and management method in storage system
US8789070B2 (en) * 2007-12-06 2014-07-22 Wyse Technology L.L.C. Local device virtualization
US8136126B2 (en) * 2008-01-31 2012-03-13 International Business Machines Corporation Overriding potential competing optimization algorithms within layers of device drivers
US20090240844A1 (en) * 2008-03-21 2009-09-24 Inventec Corporation Method for adding hardware
US8321878B2 (en) 2008-10-09 2012-11-27 Microsoft Corporation Virtualized storage assignment method
CN102141897B (en) * 2010-02-02 2013-01-02 慧荣科技股份有限公司 Method for improving access efficiency, relevant personal computer and storage medium
JP5422611B2 (en) * 2011-06-24 2014-02-19 株式会社日立製作所 Computer system, host bus adapter control method and program thereof
US9483331B1 (en) * 2012-12-27 2016-11-01 EMC IP Holding Company, LLC Notifying a multipathing driver of fabric events and performing multipathing management operations in response to such fabric events
WO2015130314A1 (en) * 2014-02-28 2015-09-03 Hewlett-Packard Development Company, L.P. Mapping mode shift
WO2016159930A1 (en) 2015-03-27 2016-10-06 Hewlett Packard Enterprise Development Lp File migration to persistent memory
CN107209720B (en) 2015-04-02 2020-10-13 慧与发展有限责任合伙企业 System and method for page caching and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6085156A (en) * 1998-03-20 2000-07-04 National Instruments Corporation Instrumentation system and method having instrument interchangeability
US6311228B1 (en) * 1997-08-06 2001-10-30 Microsoft Corporation Method and architecture for simplified communications with HID devices

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5339449A (en) * 1989-06-30 1994-08-16 Digital Equipment Corporation System and method for reducing storage channels in disk systems
US6845508B2 (en) * 1997-12-19 2005-01-18 Microsoft Corporation Stream class driver for computer operating system
US6233625B1 (en) * 1998-11-18 2001-05-15 Compaq Computer Corporation System and method for applying initialization power to SCSI devices
US6643748B1 (en) * 2000-04-20 2003-11-04 Microsoft Corporation Programmatic masking of storage units
JP3466998B2 (en) * 2000-07-06 2003-11-17 株式会社東芝 Communication device and control method thereof
US6904477B2 (en) * 2001-04-13 2005-06-07 Sun Microsystems, Inc. Virtual host controller interface with multipath input/output
JP2002328823A (en) * 2001-04-27 2002-11-15 Toshiba Corp Non-covalent parallel database serve system, data writing method and matching processing method in the same system
US7134040B2 (en) * 2002-04-17 2006-11-07 International Business Machines Corporation Method, system, and program for selecting a path to a device to use when sending data requests to the device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7484040B2 (en) 2005-05-10 2009-01-27 International Business Machines Corporation Highly available removable media storage network environment
US7779204B2 (en) 2005-05-10 2010-08-17 International Business Machines Corporation System and computer readable medium for highly available removable storage network environment
JP2009508212A (en) * 2005-09-09 2009-02-26 マイクロソフト コーポレーション Plug and play device redirection for remote systems
US10331501B2 (en) 2010-12-16 2019-06-25 Microsoft Technology Licensing, Llc USB device redirection for remote systems

Also Published As

Publication number Publication date
WO2004061642A3 (en) 2005-02-17
JP2006513469A (en) 2006-04-20
US7222348B1 (en) 2007-05-22
AU2003297111A8 (en) 2004-07-29
EP1579334A2 (en) 2005-09-28
AU2003297111A1 (en) 2004-07-29
DE60313040D1 (en) 2007-05-16
DE60313040T2 (en) 2007-12-13
EP1579334B1 (en) 2007-04-04

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AU JP

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
WWE Wipo information: entry into national phase

Ref document number: 2004565492

Country of ref document: JP

WWE Wipo information: entry into national phase

Ref document number: 2003814802

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 2003814802

Country of ref document: EP

WWG Wipo information: grant in national office

Ref document number: 2003814802

Country of ref document: EP