WO2005089339A2 - Disk controller methods and apparatus with improved striping redundancy operations and interfaces

Publication number
WO2005089339A2
Authority
WO
WIPO (PCT)
Prior art keywords
data
drive
drives
array
read
Prior art date
Application number
PCT/US2005/008647
Other languages
English (en)
French (fr)
Other versions
WO2005089339A3 (en)
Inventor
Michael C. Stolowitz
Original Assignee
Netcell Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netcell Corporation filed Critical Netcell Corporation
Publication of WO2005089339A2 publication Critical patent/WO2005089339A2/en
Publication of WO2005089339A3 publication Critical patent/WO2005089339A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2211/00Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
    • G06F2211/10Indexing scheme relating to G06F11/10
    • G06F2211/1002Indexing scheme relating to G06F11/1076
    • G06F2211/1054Parity-fast hardware, i.e. dedicated fast hardware for RAID systems with parity

Definitions

  • This data transfer rate is in the range of 50 to 100 MBPS (megabytes per second) for currently available products.
  • the Disk Array [0004] A system might use multiple disk drives, i.e. an array of drives, if the required capacity, performance or reliability exceeds that available from a single drive. Capacity enhancement is the most common motivation. Two drives of a given size can store twice as much data as either single drive; compare Fig. 1A to Fig. 1B. Reliability enhancement is less obvious. Two drives of a given type will have twice the failure rate of a single drive. On the other hand, a system may be arranged so that the second drive always has an exact copy of the data on the first drive.
  • Performance is the third reason that a system might require the use of a drive array.
  • a high speed streaming application may require a higher sustained bandwidth than a single drive can deliver.
  • a system with N drives can potentially provide N times the sustained bandwidth of a single drive.
  • the access time for a given drive, determined primarily by seek and rotational delays, limits the number of IO operations that can be performed per second ("IOPS").
  • An N-drive array can potentially support N times the IOPS performance of a single drive, in the best case.
  • An example is illustrated in Fig. 1D.
  • Data Architectures [0006] The simple addition of a second drive to a system will immediately double the capacity, but there may not be a performance improvement.
  • Striping is a known technique which distributes data over the available drives so that retrieving the data will require the participation of all of the available drives, thereby allowing the system to attain performance approaching the aggregate performance of the drives.
  • the smallest addressable unit of storage on a typical disk drive is the sector.
  • a sector of data is typically a power-of-two number of bytes in length. In this application, a sector size of 512 bytes will be used for purposes of illustration but not limitation.
  • To stripe data, an order or sequence is assigned to the drives and a stripe width is selected. A pair of drives may be identified as 0 and 1.
  • the stripe width might be 4K bytes which is eight sectors of 512 bytes each. With these selections, the first 4K block of user data ("User 0" in the drawing) is stored in the first 4K block of drive 0.
  • the second 4K block of user data ("User 1") is in the first 4K block of drive 1.
  • the third 4K block of user data is stored in the second 4K block of drive 0 and the fourth 4K block is stored in the second 4K block of drive 1.
  • This arrangement is illustrated in Fig. 2. This process is repeated, alternating the storage of 4K blocks between the two drives, until the ends of the drives are reached. If the system has a large number of small accesses of one or two sectors, the two drives may be accessed concurrently to attain twice the random access performance of a single drive. If the system is accessing relatively large data blocks, say of 100K, the two drives may once again be operated concurrently to attain nearly twice the sustained performance of a single drive. Consequences of the stripe width selection will be discussed below.
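
The two-drive, 4K-stripe arrangement described above can be sketched as a simple address mapping. This is a minimal illustration only; the function names `locate_block` and `locate_sector` are hypothetical and do not appear in the application.

```python
SECTOR_BYTES = 512
STRIPE_SECTORS = 8   # 4K stripe width = eight 512-byte sectors
NUM_DRIVES = 2       # drives are identified as 0 and 1


def locate_block(user_block, num_drives=NUM_DRIVES):
    """Map a user 4K block number to (drive, block index on that drive)."""
    return user_block % num_drives, user_block // num_drives


def locate_sector(user_lba, num_drives=NUM_DRIVES, stripe_sectors=STRIPE_SECTORS):
    """Map a user sector address to (drive, sector address on that drive)."""
    user_block, offset = divmod(user_lba, stripe_sectors)
    drive, drive_block = locate_block(user_block, num_drives)
    return drive, drive_block * stripe_sectors + offset
```

With these assumptions, user blocks 0, 1, 2, 3 land on drive 0 block 0, drive 1 block 0, drive 0 block 1 and drive 1 block 1, matching the alternation described for Fig. 2.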
  • a drive may be added to keep a continuously updated backup copy of a primary drive.
  • the backup drive is an exact copy of the primary drive.
  • This technique is often known as "mirroring" or RAID1.
  • Data may be read from either drive until one of the drives fails at which point the remaining drive is selected for reads.
  • the increased reliability results in a 100% increase in the cost of storage, i.e. one mirror drive is required for each primary drive.
  • each 4K block of the redundant drive receives the XOR of the corresponding 4K blocks of the other two drives.
  • the contents of any 4K block of the failed drive can be reconstructed by computing the XOR of the corresponding 4K blocks of the remaining data drive and the redundant drive.
  • the XOR of all of the data blocks in the stripe is stored on the redundant drive.
  • any block in the stripe can be reconstructed by XORing the remaining blocks of the stripe (including the redundant drive block). The added cost of the redundancy is reduced to 1 / N where N is the number of data drives.
  • Figure 4 illustrates an array with three data drives plus a redundant drive.
  • the redundant drive contains the XOR of the corresponding blocks in the three data drives.
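
The XOR redundancy and reconstruction described above can be illustrated with a short sketch for three data drives plus a redundant drive. The helper name `xor_blocks` and the sample block contents are assumptions made for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks (shortened to 8 bytes here instead of 4K) and their parity.
data = [bytes([value] * 8) for value in (0x11, 0x22, 0x44)]
parity = xor_blocks(data)

# Suppose the drive holding data[1] fails: XOR the surviving blocks,
# including the redundant block, to reconstruct the missing one.
rebuilt = xor_blocks([data[0], data[2], parity])
```

The same function serves both to generate the redundant block and to rebuild a lost block, because XOR is its own inverse.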
  • These drive functional assignments are typically rotated between stripes because the parity drive tends to become the bottleneck for applications with a high percentage of writes and this rotation tends to balance the load.
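
One possible way to rotate the parity assignment between stripes is sketched below. The rotation order used here is an assumption chosen for illustration; the text above states only that the assignments are typically rotated, not which order is used.

```python
def stripe_layout(stripe, num_drives):
    """Return (parity_drive, data_drives) for a stripe number.

    Assumed rotation: parity starts on the last drive and moves down by one
    drive per stripe, wrapping around.
    """
    parity = (num_drives - 1 - stripe) % num_drives
    data = [d for d in range(num_drives) if d != parity]
    return parity, data
```

Over any run of `num_drives` consecutive stripes, each drive holds parity exactly once, which is what spreads the write load.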
  • the performance of the redundant array is the same as the striped array performance without redundancy.
  • a system with two drives can provide either redundancy by using one drive to mirror the other, or it can double the capacity and provide up to a 2X performance improvement.
  • the issuing of extra disk write commands required to maintain a copy or the extra operations required to distribute or collect data striped between two drives can easily be handled by the driver software using a disk controller that does not provide any specialized array functions.
  • Each of the remaining drives is read with the data transferring by DMA to memory. Even though the three drives may have identical average access characteristics, the read operations will actually complete at different times for various reasons, including the fact that the initial states of the head position and rotational position are independent.
  • FIG. 35A shows this asynchronous data transfer from Data 0, Data 1 and PAR (parity or redundant drive) drives via respective DMA channels to corresponding buffer memory. The Data 2 drive has failed. Once the three blocks are stored in a buffer, the XOR operation can be performed to reconstruct the missing data.
  • Synchronous Redundant Data Transfers An alternative technique is known as Synchronous Redundant Data Transfers or SRDT. With Synchronous Redundant Data transfers: 1. The read commands are issued to all three (or N) drives. Read data is not immediately transferred when less than three (or N) drives have data available in their internal buffers. 2. However, when the read data from all three drives is available in their respective internal buffers, the XOR process can begin. An XOR engine fetches a first element from each drive; computes the XOR of the three elements; and outputs the first element of the result to the buffer within the controller / adapter.
  • This redundancy operation is "on the fly" as it occurs as data is moved from the drive to the buffer, as distinguished from first storing data in the buffer, and then having to read it out to do the redundancy operations as described above.
  • the element size is a single sixteen bit word, the width of the interface.
  • the element fetching is accomplished by asserting the DIOR strobe to the three drives simultaneously.
  • the use of the common DIOR strobe makes the data transfer "synchronous". In the scheme described above under redundancy hardware, the XOR process could not start until the data from the last drive had been transferred to the memory.
  • the process begins as soon as the data from the last drive is available in that drive's internal buffer. Assuming that the read strobes are generated at the maximum rate supported by the drives, the advantages of the Synchronous Data Transfer and the On-The-Fly redundancy computation are as follows: 1. From the time when the last-to-finish drive has the read data ready in its buffer, the XOR is computed and the result is transferred with the same latency as the transferring of data from a single drive prior to the failure. The additional latency of fetching the three blocks from the buffer, computing the XOR, and storing the result to the buffer are all eliminated. 2. The total amount of data transferred to the buffer is the 4K block that was originally transferred.
  • the total buffer bandwidth required in this example is that bandwidth required to support a single drive.
  • 3. The data from the three drives is reduced to a single stream. Only a single DMA context (address and count) is required for the operation rather than one DMA context per drive as was required in the original example. This efficient operation, however, is dependent on using a storage element size equal to the width of the drive interface ("narrow striping"), and it is limited to synchronous transfers invoked by applying a common DIOR strobe to all of the drives in the array.
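
The on-the-fly computation described above can be modeled in software as a generator that consumes one 16-bit element from every drive per common strobe and emits a single result stream. This is a behavioral sketch only; the name `xor_on_the_fly` and the sample word values are hypothetical, and real hardware performs this in the data path rather than in code.

```python
def xor_on_the_fly(*drive_streams):
    """Yield one 16-bit result word per common read strobe.

    Each iteration models one strobe: one word is fetched from every drive,
    the words are XORed, and only the result goes to the buffer.
    """
    for words in zip(*drive_streams):
        result = 0
        for word in words:
            result ^= word
        yield result & 0xFFFF


data0 = [0x0001, 0x00F0, 0xAAAA]
data1 = [0x0010, 0x0F00, 0x5555]
data2 = [0x0100, 0xF000, 0xFFFF]                     # contents of the failed drive
par = [a ^ b ^ c for a, b, c in zip(data0, data1, data2)]

# Only the surviving drives are read; a single output stream is produced.
recovered = list(xor_on_the_fly(data0, data1, par))
```

Because the three input streams collapse into one output stream, only one destination address and count are needed, which mirrors the single DMA context noted above.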
  • a RAID disk drive controller implements disk storage operations, including striping and redundancy operations with multiple disk drives connected via respective source synchronous ports, e.g. SATA ports.
  • Configurable data path switch logic provides dynamic configuration of two or more attached drives into one or more arrays.
  • Data transfers are synchronized locally by leveraging the SATA port transport layer FIFO. Synchronous transfers allow on-the-fly redundancy (XOR) operations for improved performance and reduced hardware complexity.
  • XOR accumulator hardware (FIG. 42-FIG. 43) reduces buffer requirements for multiple DMA channels otherwise required for synchronization, and various narrow and wide striping modes are supported. Improvements in partial stripe update performance are provided.
  • FIGS. 1A-1D illustrate various disk drive configurations.
  • FIG. 2 illustrates striping data over two data drives.
  • FIG. 3 illustrates striping over two data drives plus a redundant drive.
  • FIG. 4 illustrates striping over three data drives plus a redundant drive.
  • FIG. 5 illustrates narrow striping, one word wide.
  • FIG. 6 is a simplified schematic diagram of a disk array system showing read data paths for synchronizing UDMA disk drive data.
  • FIG. 7 is a simplified schematic diagram of a disk array system showing write data paths for writing to UDMA drives.
  • FIG. 8 is a simplified schematic diagram of a disk array write data path
  • FIG. 9 is a simplified schematic diagram of a disk array read data path with
  • FIG. 10 is a timing diagram illustrating a disk array READ operation.
  • FIG. 11 is a functional view of a disk array controller showing the logical data path and DMA channel for each physical port.
  • FIG. 12 shows the actual implementation with a single physical DMA channel and the data paths for all of the physical ports being provided by the Array Switch.
  • DMA contexts for each of the physical ports are stored in a RAM.
  • FIG. 13 illustrates the Array Switch data path setup for a Disk Read.
  • FIG. 14 illustrates the Array Switch data path setup for a Disk Write.
  • FIG. 15A is a table illustrating data mapping over one or more disk drives in a JBOD - RAID0 mapping, including data striping over two, three, and four drives.
  • FIG. 15B illustrates the data mapping for RAID1 or Mirroring.
  • FIG. 16A illustrates the RAIDXL data mapping over two, three, or four drives for non-redundant arrays.
  • FIG. 16B illustrates RAIDXL data mapping over two data drives plus a parity drive or three data drives plus a parity drive.
  • FIG. 17 illustrates a possible RAID5 data mapping for two data drives plus a parity drive or three data drives plus a parity drive.
  • FIGS. 18-22 illustrate various array switch configurations and operations.
  • FIG. 23 is a simplified block diagram of a disk array controller providing a host interface for interaction with a host bus, and a drive interface for interaction with a plurality of attached disk drives.
  • FIG. 24A is a conceptual diagram illustrating direct connections between logical data ports and physical data ports; and it shows corresponding Mapping Register contents.
  • FIG. 24B is a conceptual diagram illustrating one example of assignments of four logical ports to the available five physical data ports; and it shows corresponding
  • FIG. 24C is a conceptual diagram illustrating a two-drive array where each of the drives is assigned to one of the five available physical data ports; and it shows corresponding Mapping Register contents.
  • FIG. 24D is a conceptual diagram illustrating a single-drive system where logical ports 0-3 transfer data on successive cycles to physical port #3; and it shows corresponding Mapping Register contents.
  • FIG. 25A illustrates XOR logic in the disk write direction in the drive configuration of figure 24A; and it shows corresponding Mapping Register contents.
  • FIG. 25B illustrates the XOR logic in the Disk Read direction for the same data path as Figures 24A and 25A except that the drive attached to physical port 2 has now failed; and again the Mapping Register contents are shown.
  • FIG. 26 is one example of a Mapping Register structure
  • Register controls the configuration of the data paths between the logical and physical data ports in one embodiment of the array controller.
  • FIG. 27A is a conceptual diagram of multiplexer circuitry in the logical port #1 read data path.
  • FIG. 27B illustrates disk read XOR logic in one embodiment of the array controller.
  • FIG. 28 illustrates decoder logic for the Logical Port #1 (PPJ-1) field of the
  • FIG. 29A illustrates logical port to physical port data path logic in one embodiment of the array controller (illustrated for physical port #2 only).
  • FIG. 29B illustrates disk write XOR logic in one embodiment of the array controller.
  • FIG. 30 illustrates disk address, strobe and chip select logic to enable global access commands to a currently selected array.
  • FIG. 31 illustrates interrupt signal logic for associating with logical drives.
  • FIG. 32 illustrates a hardware implementation of logical addressing.
  • FIG. 33 is a simplified conceptual diagram of an embodiment of a disk array system employing a plurality of SATA ports and drives in accordance with the present invention.
  • FIG. 34A shows detail of the host interfaces of FIG. 33.
  • FIG. 34B shows a SATA port interface in the disk read direction.
  • FIG. 34C shows a SATA port interface in the disk write direction.
  • FIGS. 35A-35B illustrate a prior art read operation where a drive has failed.
  • FIG. 36 illustrates a new read operation where a drive has failed in accordance with the present invention.
  • FIGS. 37A-37C illustrate a prior art read-modify-write (R/M/W) operation in a serial interface disk drive array.
  • FIG. 38 illustrates a new read-modify-write operation in a serial interface disk drive array in accordance with the present invention, step 1.
  • FIG. 39 illustrates a new read-modify-write operation in a serial interface disk drive array in accordance with the present invention, step 2.
  • FIG. 40 illustrates one embodiment of a disk array switch configured in the disk write direction in accordance with the present invention.
  • FIG. 41 illustrates one embodiment of a disk array switch configured in the disk read direction in accordance with the present invention.
  • FIG. 42 illustrates one embodiment of disk write accumulator logic in accordance with the present invention.
  • FIG. 43 illustrates one embodiment of disk read accumulator logic in accordance with the present invention.

Detailed Description of Preferred Embodiments

  • the FIFO shall provide an "almost full" signal that is asserted with enough space remaining in the FIFO to accept the maximum number of words that a drive may send once "pause" has been asserted. Data is removed from the FIFOs synchronously using most of the steps of the method described in U.S. Pat. No. 6,018,778.
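
The "almost full" threshold described above can be sketched as a small model. The class name `DriveFifo`, the depth of 16 words and the overrun of 4 words are assumed values for illustration; the text does not give specific sizes.

```python
from collections import deque


class DriveFifo:
    """Behavioral model of a per-drive FIFO with an "almost full" flag."""

    def __init__(self, depth, max_overrun):
        self.depth = depth              # total capacity in words
        self.max_overrun = max_overrun  # words a drive may still send after "pause"
        self.words = deque()

    @property
    def almost_full(self):
        # Asserted while the remaining space is just enough for the overrun.
        return self.depth - len(self.words) <= self.max_overrun

    def push(self, word):
        if len(self.words) >= self.depth:
            raise OverflowError("FIFO overrun: data lost")
        self.words.append(word)

    def pop(self):
        return self.words.popleft()


fifo = DriveFifo(depth=16, max_overrun=4)
pushed = 0
while not fifo.almost_full:          # drive streams until "pause" is requested
    fifo.push(pushed)
    pushed += 1
for _ in range(fifo.max_overrun):    # worst case: drive sends its full overrun
    fifo.push(pushed)
    pushed += 1
```

The worst-case overrun exactly fills the FIFO without raising `OverflowError`, which is the property the threshold is meant to guarantee.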
  • FIG. 6 illustrates an array 10 of disk drives.
  • the UDMA protocol is used by way of illustration and not limitation.
  • Drive 12 has a data path 14 to provide read data to an interface 16 that implements the standard UDMA protocol.
  • a second drive 20 has a data path 22 coupled to a corresponding UDMA interface 24, and so on.
  • the number of drives may vary; four are shown for illustration.
  • Each physical drive is attached to a UDMA interface.
  • Each drive is coupled via its UDMA interface to a data input port of a memory such as a FIFO, although other types of memories can be used.
  • disk drive 12 is coupled via UDMA interface 16 to a first FIFO 26, while disk drive 20 is coupled via its UDMA interface 24 to a second FIFO 28 and so on.
  • the UDMA interface accepts data from the drive and pushes it into the FIFO on the drive's read strobe. See signal 60 from drive 12 to FIFO 26 write WR input; signal 62 from drive 20 to FIFO 28 write WR input, and so on.
  • Each FIFO has a data output path, for example 46, 48, sixteen bits wide in the presently preferred embodiment. All of the drive data paths are merged, as indicated at box 50, in parallel fashion. In other words, a "broadside" data path is provided from the FIFOs to a buffer 52 that has a width equal to N times m bits, where N is the number of attached drives and m is the width of the data path from each drive (although they need not necessarily all have the same width). In the illustrated configuration, four drives are in use, each having a 16-bit data path, for a total of 64 bits into buffer 52 at one time.
  • Segments of the data words read from the buffer are pushed into each of the FIFOs using a common strobe 72, coupled to the write control input WR of each FIFO as illustrated. See data paths 74, 76, 78, 80. In this way, the write data is "striped" over the drives of the array. Should any of the FIFOs become “full” the process is stalled. This is implemented by the logic represented by block 82 generating the "any are full" signal.
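
The "broadside" merge on reads and the segment split on writes can be sketched as bit packing. The placement of drive 0 in the low-order bits is an assumption made for this sketch, as are the function names `merge_segments` and `split_word`.

```python
def merge_segments(segments, width=16):
    """Pack one segment per drive into a single wide buffer word.

    Assumed ordering: drive 0 occupies the least significant bits.
    """
    mask = (1 << width) - 1
    word = 0
    for index, segment in enumerate(segments):
        word |= (segment & mask) << (index * width)
    return word


def split_word(word, num_drives, width=16):
    """Split a wide buffer word into one segment per drive (write direction)."""
    mask = (1 << width) - 1
    return [(word >> (index * width)) & mask for index in range(num_drives)]
```

For four drives with 16-bit paths, `merge_segments` produces a 64-bit word, and `split_word` is its inverse, so data striped on a write is reassembled unchanged on a read.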
  • interfaces 16, 24 etc. implementing the UDMA protocol will pop data from the FIFOs and transfer it to the drives. While these transfers might start simultaneously, they will not be synchronous as each of the interfaces will respond independently to "pause" and "stop" requests from its drive.
  • Data stored in a disk array may be protected from loss due to the failure of any single drive by providing redundant information.
  • stored data includes user data as well as redundant data sufficient to enable reconstruction of all of the user data in the event of a failure of any single drive of the array.
  • U.S. Pat. No. 6,237,052 B1 teaches that redundant data computations may be performed "On-The-Fly" during a synchronous data transfer.
  • the combination of the three concepts: Synchronous Data Transfers, "On-The-Fly" redundancy, and the UDMA adapter using a FIFO per drive provides a high performance redundant disk array data path using a minimum of hardware.
  • In FIG. 8, data flow in the write direction is shown.
  • the drawing illustrates a series of drives 300, each connected to a corresponding one of a series of UDMA interfaces 320.
  • Each drive has a corresponding FIFO 340 in the data path as before.
  • data words are read from the buffer 350. Segments of these data words, e.g. see data paths 342, 344, are written to each of the drives. At this point, a logical XOR operation can be performed between the corresponding bits of the segments "on the fly".
  • XOR logic 360 is arranged to compute the boolean XOR of the corresponding bits of each segment, producing a sequence of redundant segments that are stored preliminarily in a FIFO 370, before transfer via UDMA interface 380 to a redundant or parity drive 390. Thus the XOR data is stored synchronously with the data segments.
  • In FIG. 9, a similar diagram illustrates data flow in the read direction.
  • the array of drives 300, corresponding interfaces 320 and FIFO memories 340 are shown as before.
  • the XOR is computed across the data segments read from each of the data drives and the redundant drive.
  • the data segments are input via paths 392 to XOR logic 394 to produce XOR output at 396. If one of the data drives has failed (drive 322 in FIG. 9), the result of the XOR computation at 394 will be the original sequence of segments that were stored on the now failed drive 322.
  • This sequence of segments is substituted for the now absent sequence from the failed drive and stored along with the other data in the buffer 350. This substitution can be effected by appropriate adjustments to the data path. This data reconstruction does not delay the data transfer to the buffer, as more fully explained in my previous patents.
  • FIG. 10 is a timing diagram illustrating FIFO related signals in the disk read direction in accordance with the invention.
  • each drive is likely to have a different read access time.
  • DMARQ a DMA request
  • DMACK a DMA acknowledge
  • Drive 0 happens to finish first and transfers data until it fills the FIFO. It is followed by Drives 2, 1, and 3 in that order. In this case, Drive 3 happened to be last.
  • the access for the purpose of computing an XOR is a third access adding 50% to the bandwidth requirements of the buffer.
  • the read/modify/write operations required by a local processor to perform this task were too slow, so specialized DMA hardware engines have been designed for this process.
  • the time required to compute the XOR is reduced, but a third pass over the data in the buffer is still required.
  • new data is written to the disk immediately.
  • the writes to the parity drive must be postponed until the XOR computation has been completed. These write backs accumulate and the parity drive becomes a bottleneck for write operations. Many designs try to solve this problem by distributing the parity over all of the drives of the array in RAID 5.
  • a disk array controller is implemented on a computer motherboard. It can also be implemented as a Host Bus Adapter (HBA), for example, to interface with a PCI host bus.
  • the disclosed Array Switch includes features that facilitate support for RAID5.
  • RAID5 is an optimization for small random accesses whereas RAIDXL is an optimization of large sequential accesses.
  • RAID5 performance is usually measured in IOPS (IO Operations Per Second) as opposed to MBPS (Megabytes per second).
  • These features use the XOR hardware that already exists for the "On the fly XOR" with a new single sector buffer, accumulator, and appropriate sequencing to achieve the RAID5 functionality.
  • the principal RAID5 functions supported are FULL STRIPE READ WITH FAILED DRIVE, READ FROM FAILED DRIVE, FULL STRIPE WRITE, and PARTIAL STRIPE UPDATE.
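
A partial stripe update relies on the general XOR identity that the new parity equals the old parity XOR the old data XOR the new data. The sketch below shows that identity only; it does not model the single sector buffer or accumulator sequencing of the disclosed hardware, and the names `xor_bytes` and `updated_parity` are hypothetical.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("blocks must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def updated_parity(old_parity, old_data, new_data):
    """Parity after replacing one data block, without reading the other blocks."""
    return xor_bytes(old_parity, xor_bytes(old_data, new_data))


stripe = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
old_parity = xor_bytes(xor_bytes(stripe[0], stripe[1]), stripe[2])

new_block = b"\xaa\x55"                 # replaces stripe[1]
new_parity = updated_parity(old_parity, stripe[1], new_block)
```

Only the block being replaced and the parity block need to be read, which is why the update touches two drives regardless of the stripe's width.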
  • the Array Switch implements data paths.
  • the configuration of the Array Switch defines arrays consisting of one or more drives.
  • An array of individual drives is known as a JBOD, Just a Bunch of Drives. Multiple drives are involved for RAID0, RAID1, RAIDXL and RAID5. In a given instant, the Array Switch might be performing any of these functions. It is capable of supporting all of them concurrently.
  • Array A subset of the drives attached to a controller. Arrays may consist of one to four drives.
  • Logical Drive The data drives of an array are numbered starting with zero. The logical drives are L0, L1, L2, and L3.
  • Parity Drive An array may have a single redundant or parity drive.
  • the parity drive shall be PAR.
  • SATA Port A SATA port provides an interface to a disk drive meeting the requirements of the Serial ATA specifications.
  • the SATA ports are SATA 0, SATA 1, SATA 2, and SATA 3.
  • Physical Drive A disk drive shall take its identification from the SATA port to which it is attached.
  • the Physical drives shall be Drive 0, Drive 1, Drive 2, and Drive 3.
  • Sector The sector is the smallest addressable block of disk data. For our purposes, a sector shall be 512 bytes.
  • LBA The sectors of a disk drive are identified by a Logical Block Address or LBA. LBAs are assigned from zero up to the maximum number required to address the capacity of the drive.
  • Striping In arrays with more than one data drive, the data is distributed over the data drives of the array. A stripe width is selected. The capacity of each of the logical drives is viewed as a series of blocks of the stripe width. The capacity of the array is mapped to the first of these blocks on each of the logical drives of the array in logical drive sequence, then to the second blocks, etc.
  • the stripe width shall be a "power of two" number of sectors.
  • the stripe width shall be one DWORD.
  • RAID1 In this mode, there is a single logical drive and a PAR drive that holds a mirror image of the contents of the logical drive. In this mode, the pair of drives looks like a single drive to the host system.
  • RAIDXL This mapping uses DWORD interleave, with or without a parity drive. In this mode, the drives of the array including the parity drive appear to the host system as a single drive.
  • Data Mapping Diagram - Notes. In drawing Figures 15A-17: • Drive numbers shown are the "logical" drives of an array. • All numeric values are sequence numbers of blocks of stripe width. • The [+] indicates the XOR of the blocks listed. • The notation (n), (n+1), (n+2)... indicates that the logical drives data includes portions of these blocks. • The notation [n] indicates that the logical drive has the nth relative segment of the DWORD interleave.
  • Figure 15B shows the data mapping for RAID1. Regardless of the stripe size, the second drive is an exact copy of the first. (I worry about using the term "Mirror Image".)
  • Figure 16A shows the Mapping for RAIDXL without the redundant drive.
  • Figure 16B shows the data mapping for RAIDXL with redundancy.
  • the user data is striped DWORD wide which is 32 bits. (Note that a mode may be provided for 16 bit striping to provide a migration path for data stored on the current controller).
  • the first two sectors of user data are stored on the first two sectors of the pair of drives.
  • Drive 0 has the even words of user data sectors 0 and 1 as indicated by the notation 0,1 [0].
  • the Array Switch has four sets of registers, one for each SATA port. These registers are used to define individual drives or arrays. Each register set has Mapping,
  • the host software driver configures the Array Switch by loading the mapping registers. Thereafter, the system illustrated in FIGS. 11 and 12, by way of illustration and not limitation, executes disk access operations in accordance with the current configurations and provides improved
  • the Mapping field has one byte for each physical port. This field is used to indicate if the corresponding physical port/drive is used by the array defined by the particular register set. If the drive is used, it will indicate if it is the parity drive or a data drive. In the case of a data drive, it will indicate the logical drive number and whether or not the drive has failed.
  • the burst length is essentially the number of logical data drives. It is the burst length in sectors that are to be transferred contiguously.
  • the Fast Read flag indicates, for a redundant array in which no drive has failed, whether or not the parity data is to be read and checked on a read. If not, the read will be faster due to the reduced rotational latency, i.e. "fast reads."
  • the Command register is loaded with one of the Array Switch primitives which are defined below.
  • JBOD Just a bunch of drives
  • FIG. 18A This is the default contents of the Array Switch following a power on reset.
  • Each SATA port / physical drive is an independent single drive array using the corresponding DMA channel in the Host interface for data transfers.
  • the software driver will send commands to the SATA port.
  • When requested by the SATA port, the Array Switch will transfer single sector packets between the SATA port and a buffer on the PCI Bus using the DMA channel.
  • RAID1 See FIG. 18B [00112] This configuration always involves two drives, a data drive and a parity drive.
  • the map shows that logical drive 0 is attached to physical port 0 and that the parity drive is attached to physical port 1.
  • the parity drive maintains an exact copy of the data drive.
  • the two drive array appears to the Host as a single drive.
  • the Array Switch logic causes the write command to be broadcast to the two drives and the same data is written to both of them.
  • For a "Fast Read" the command is sent only to the data drive and only the data drive is accessed.
  • For a "Slow Read" the command is broadcast to the two drives. Both drives are accessed and the contents of the PAR drive are checked against the contents of the data drive.
  • RAIDXL (Common Information): See FIG. 19 [00114]
  • the data is striped DWORD wide across the data drives of the array.
  • the first DWORD is stored on the first logical drive; the next DWORD is stored on the next logical drive, etc. until each logical drive has received one DWORD.
  • the DWORD computed by the bitwise XOR of the DWORDS on the data drives is stored on the parity drive (if present). This distribution is repeated throughout the disk.
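
The DWORD-wide distribution described above can be sketched as follows. Little-endian DWORD packing and the function name `raidxl_stripe` are assumptions for illustration; the text specifies only the round-robin DWORD order and the bitwise XOR for the parity drive.

```python
import struct


def raidxl_stripe(user_data, num_data_drives, with_parity=True):
    """Distribute user data DWORD by DWORD across the data drives.

    Returns (per-drive lists of DWORDs, parity DWORD list or None).
    """
    group_bytes = 4 * num_data_drives
    if len(user_data) % group_bytes:
        raise ValueError("data must be a whole number of DWORD groups")
    dwords = struct.unpack("<%dI" % (len(user_data) // 4), user_data)
    # Logical drive i receives DWORDs i, i + N, i + 2N, ...
    drives = [list(dwords[i::num_data_drives]) for i in range(num_data_drives)]
    parity = None
    if with_parity:
        parity = []
        for row in zip(*drives):
            value = 0
            for dword in row:
                value ^= dword
            parity.append(value)
    return drives, parity


drives, parity = raidxl_stripe(struct.pack("<6I", 1, 2, 3, 4, 5, 6), 3)
```

Here six DWORDs over three data drives give each drive two DWORDs, with one parity DWORD per row of three.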
  • a given sector of user data will be distributed across each of the data drives of the array. In order to access a given sector, all of the drives of the array must be accessed.
  • the minimum addressable data block of the drives that make up the array is also one sector, so any physical disk access will involved at least one sector from each data drive making the minimum transfer length equal to the N sectors where N is the number of data drives. Since all of the segments of a given stripe are stored at the same relative locations on each of the data drives, the read or write commands required to access these segments are identical. This allows common commands to be broadcast to all of the drives for any access making the array appear to the Host as a single drive. On a given access, it is possible that one or more of the drives will have an error. Any drive error must be resolved through accesses to specific drive exhibiting the error.
  • If the parity drive is present, there is the option on a read access of reading the parity drive as well as the data drives.
  • the XOR of the data is recomputed and compared with the data from the parity drive. An error is indicated if they do not match.
  • the computation does not increase the access time, but accessing an additional drive does increase the average rotational latency for the access.
  • the option of checking the parity data may be declined by asserting "fast reads."
  • the parity drive is always accessed on writes in either case.
  • the benefit of the parity drive is that it allows the array to continue to operate even though one of the drives has failed.
  • the failure of the parity drive is a trivial case.
  • the array simply becomes a RAIDXL without parity. If one of the data drives fails, the array switch is reprogrammed to indicate the position of the failed drive. The indicated drive will no longer receive any of the broadcast commands or participate in any of the data transfers.
  • the parity drive will be accessed on reads and writes regardless of the state of "fast read.” On a read, the XOR of all of the remaining data drives and the parity drive will be computed. The result of this computation is equivalent to the data that was or would have been stored on the failed drive. This data is inserted in place of the data that would have been read from the failed drive. On a write, the parity drive receives the result of the computation based on all of the write data in the usual fashion even though the data for the failed drive is discarded.
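The read-side parity check and the failed-drive reconstruction described above can be sketched as follows. This is a simplified model that works on one DWORD per drive; `read_row`, `xor_all` and the `failed` and `fast_read` parameters are illustrative names, not part of the original.

```python
from functools import reduce


def xor_all(values):
    """Bitwise XOR of a sequence of integers (0 for an empty sequence)."""
    return reduce(lambda a, b: a ^ b, values, 0)


def read_row(data, parity, failed=None, fast_read=False):
    """Read one DWORD from each logical data drive.

    data:   one DWORD per logical data drive (the failed entry is ignored)
    parity: the DWORD from the parity drive, or None if there is none
    failed: index of the failed data drive, or None
    """
    if failed is not None:
        # Parity is always read here, regardless of the fast-read setting.
        survivors = [d for i, d in enumerate(data) if i != failed]
        row = list(data)
        row[failed] = xor_all(survivors) ^ parity  # rebuilt data
        return row
    if not fast_read and parity is not None:
        # Optional check: recompute the XOR and compare with the parity drive.
        if xor_all(data) != parity:
            raise IOError("parity mismatch")
    return list(data)
```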
  • the software driver receives access requests from the operating system. These requests are extended at each end by a sector or two as required to get to a stripe boundary. A command for the drives is built by dividing the resulting LBA address and sector count by the number of data drives. Note that the extended commands will divide evenly. If there are three data drives, a given drive's LBA and count will be only 1/3 of the user's LBA and count because each drive stores only 1/3 of the data. The Array Switch will merge the data streams from the array into a single stream for transfer to or from the user's buffer.
  • the software driver must deal with the fact that it has extended the user's request.
  • the DMA is going to transfer the total requested count.
  • the driver shall create a scatter list that directs any sectors appended to the front or back end of the requested data to a discard buffer.
  • the requested data is transferred directly.
  • appended sectors at either end will require a read / modify / write operation for that end.
  • the driver would first read the target stripe that includes added sectors into a buffer. It would then build a gather that picks up the extension sectors from these stripe buffers as required and the user's data directly.
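The request extension and the read-side scatter list described above can be sketched as follows. This is a rough illustration, assuming sector-granular LBAs and a stripe of one sector per data drive; the function names and the tuple layout of the scatter list are hypothetical.

```python
def build_drive_command(lba, count, n_data):
    """Extend a host request to stripe boundaries and derive the drive command.

    Returns (drive_lba, drive_count, lead_pad, tail_pad), where lead_pad and
    tail_pad are the numbers of sectors appended at the front and back.
    """
    if count <= 0 or n_data <= 0:
        raise ValueError("count and n_data must be positive")
    lead_pad = lba % n_data          # sectors added in front
    start = lba - lead_pad           # extended start, on a stripe boundary
    end = lba + count
    tail_pad = (-end) % n_data       # sectors added at the back
    total = end + tail_pad - start   # divides evenly by n_data
    return start // n_data, total // n_data, lead_pad, tail_pad


def read_scatter_list(count, lead_pad, tail_pad):
    """Scatter list for a read: appended sectors go to a discard buffer."""
    entries = []
    if lead_pad:
        entries.append(("discard", lead_pad))
    entries.append(("user", count))
    if tail_pad:
        entries.append(("discard", tail_pad))
    return entries
```

With three data drives, a four-sector request starting at LBA 4 is extended by one sector at each end, so each drive sees a two-sector command starting at its own LBA 1.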
  • RAIDXL - 3 Drives - No Parity See FIG. 19B [00122]
  • the Map shows that logical drive 0 is on physical port 0, logical drive 1 is on physical port 1, and logical drive 2 is on physical port 2.
  • the minimum transfer length is one sector per data drive for a total of 3. Commands written to physical port 0 are broadcast to all of them. Data transfer requests from any port are ignored until all three ports are ready to transfer data. Data is transferred using a single DMA channel corresponding to the Array Switch register set.
  • the gray bar indicates that physical ports 1 and 2 are used by the current array.
  • the referenced Synchronous Redundant Data Transfer patent teaches simple striping where the number of data drives is a power of two and the data from the drives is simply interleaved to reconstitute the user data.
  • new hardware is introduced which extends the concept to arrays in which the number of data drives may be other than a power of two.
  • the data path to the physical drive port is DWORD wide and the data path to the Host System interface is QWORD wide.
  • the minimum access is one sector per drive in which three sectors of user data are distributed over single sectors on each of the three data drives.
  • the hardware will read data synchronously from each of the physical drive data ports twice, yielding a total of six DWORDS.
  • the minimum transfer length is one sector per data drive for a total of 4. Commands written to physical port 0 are broadcast to all of them. Data transfer requests from any port are ignored until all four ports are ready to transfer data. Data is transferred using a single DMA channel corresponding to the Array Switch register set. The gray bar indicates that physical ports 1, 2, and 3 are used by the current array.
  • RAIDXL - 3 Drives - Parity See FIG. 19D [00124] The Map shows that logical drive 0 is on physical port 0, logical drive 1 is on physical port 1 , and the parity drive is on physical port 2. The minimum transfer length is one sector per data drive for a total of 2. Commands written to physical port 0 are broadcast to both of them.
  • RAID0 functionality is implemented entirely within the software driver using SATA ports / disk drives in the JBOD mode. The present system does not provide any hardware support for RAID0.
  • Upon receipt of an access request from the operating system, the software driver will break the request into a sequence of accesses to the drives of the array. This includes locating segments of the user buffer, error handling, and reporting completion.
  • RAID5 Common Information: [00129] RAID5 differs from RAID0 in two ways. First, there is a parity drive.
  • the bitwise XOR of the information stored in a given stripe of the data drives is computed and stored on the corresponding stripe of a parity drive.
  • the logical to physical assignments of the data and parity drives is rotated between stripes in such a way as to distribute the parity information over all of the drives of the array.
  • Normal read accesses (that do not involve the parity drive) of a RAID5 array are the same as read accesses of a RAID0 array except for the fact that the software driver must allow for the rotation of logical to physical drives between stripes.
  • RAID5 disk write operations, or reads with one drive failed, do involve accesses of the parity drive and are more complex.
  • the present system does include hardware functions to assist the software driver in these operations.
  • In RAIDXL, a contiguous stream of user data is interleaved between the data drives. For this reason, it is convenient for any access to use a single DMA channel transferring data between a single user buffer and the array.
  • the segment of a stripe stored on a given data drive may be many Kilobytes of data. Accessing a stripe requires data transfers between data buffers and drives where the data buffers are located many Kilobytes apart. For this reason, a DMA channel is used for each data drive. The segments could be transferred sequentially sharing a single DMA channel, but this limits the performance and implies a huge amount of buffering in order to perform the XOR computations.
  • a DMA channel must be programmed for each of the logical drives including Channel 1 whose drive has failed. What must happen here is that the segment of the stripe stored on Drive 0 will be transferred to the buffer indicated by DMA Channel 0. The segment stored on Drive 0 must be XORed with the segment stored on the parity drive, reconstructing the data that was or would have been stored on Drive 1. The result must be sent to the buffer indicated by DMA Channel 1.
  • the Array Switch has a single sector buffer into which it can store the output of the XOR logic.
  • the command is written to physical port 0 and will be broadcast to ports 1 and 3.
  • the Array Switch will wait until all of the accessed drives are ready to transfer at least one sector. It then will transfer one sector from the SATA port of the first non-failed logical data drive to the host using that drive's DMA channel.
  • the XOR logic snoops the transfer capturing a copy of the sector in its XOR buffer.
  • a stripe is accessed in which logical drive 0 is assigned to SATA 0, logical drive 1 is assigned to SATA 1, and the parity drive is assigned to SATA 3.
  • Logical drive 2 was assigned to SATA 2, but this drive has failed. Since the data format was striped over three drives, the minimum transfer length is still 3 even though one of the data drives has failed.
  • a DMA channel must be programmed for each of the logical drives including Channel 2 whose drive has failed.
  • the Array Switch will wait until all of the accessed drives are ready to transfer at least one sector. It then will transfer one sector from the SATA port of the first non-failed logical data drive to the host using that drive's DMA channel.
  • the XOR logic snoops the transfer, capturing a copy of the sector in its XOR buffer. In sequence, it will transfer one sector each from the balance of the non-failed data drives, accumulating the XOR of each new sector with the current contents of the buffer. Then it will take one sector from the parity drive, XOR it with the contents of the buffer, and send the result to the host using the DMA channel of the failed drive.
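The sector-by-sector sequence for a RAID5 read with one failed data drive can be sketched as follows. This is a simplified software model of the behaviour described above, assuming sectors are `bytes` objects of equal length and DMA channel buffers are a plain list; `degraded_read` and `xor_bytes` are illustrative names.

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def degraded_read(data_sectors, parity_sector, failed):
    """Model one sector time of a read with logical drive `failed` missing.

    data_sectors: one sector per logical data drive; the entry at index
                  `failed` is ignored (that drive returns nothing).
    Returns the sectors delivered to each DMA channel, indexed by logical drive.
    """
    dma = [None] * len(data_sectors)
    xor_buffer = None
    for drive, sector in enumerate(data_sectors):
        if drive == failed:
            continue
        dma[drive] = sector  # normal transfer on this drive's DMA channel
        # The XOR logic snoops the transfer and accumulates into its buffer.
        xor_buffer = sector if xor_buffer is None else xor_bytes(xor_buffer, sector)
    # The parity sector comes last; the XOR result replaces the missing data.
    if xor_buffer is None:
        dma[failed] = parity_sector
    else:
        dma[failed] = xor_bytes(xor_buffer, parity_sector)
    return dma
```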
  • logical drive 0 is assigned to SATA 0 and logical drive 1 is assigned to SATA 1.
  • the parity drive is on SATA 2. Since the data format is striped over two drives, the minimum transfer length is 2. DMA Channels must be programmed for each of the data drives.
  • the command is written to SATA 0 and will be broadcast to SATA 0, SATA 1 , and SATA 2.
  • the Array Switch will wait until all of the active SATA ports are ready to receive data. It will then transfer one sector to logical drive 0 on SATA 0 using the buffer indicated by DMA channel 0. [00142] This transfer is snooped by the Array Switch and a copy of the sector is captured in the XOR buffer. One sector each is then transferred for each of the remaining data drives, in sequence.
  • logical drive 0 is assigned to SATA 0, logical drive 1 is assigned to SATA 1, and logical drive 2 is assigned to SATA 2.
  • the parity drive is on SATA 3. Since the data format is striped over three drives, the minimum transfer length is 3. DMA Channels must be programmed for each of the data drives.
  • the command is written to SATA 0 and will be broadcast to SATA 0, SATA 1 , SATA 2, and SATA 3.
  • the Array Switch will wait until all of the active SATA ports are ready to receive data. It will then transfer one sector to logical drive 0 on SATA 0 using the buffer indicated by DMA channel 0. This transfer is snooped by the Array Switch and a copy of the sector is captured in the XOR buffer.
  • In RAID5, a user might write a single sector. It is possible to update the parity drive without doing a read / modify / write of the entire stripe. Only the data on the target data drive and parity drive are going to change. This leaves the other drives available for concurrent read accesses or the other pair of drives available for a concurrent partial stripe update. Following the update, the parity drive must contain the XOR of the entire stripe including the new data, but it already has the XOR of the entire stripe with the current data.
  • the traditional RAID5 approach to this problem is to read the data segment that is going to be replaced and to read the current parity data. The two are XORed giving the parity of the balance of the stripe with the target data segment removed from the result.
  • the new data segment is then written to the array and it is XORed with the result of the previous computation yielding the XOR of the updated stripe including the updated data segment.
  • This is written to the parity drive.
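The traditional partial stripe update just described relies on the fact that XORing the old data into the old parity removes its contribution. A minimal sketch, assuming segments are equal-length `bytes` objects (the function names are illustrative):

```python
def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def partial_stripe_update(old_data, old_parity, new_data):
    """Return the parity segment to write after replacing one data segment."""
    # Step 1: parity of the rest of the stripe, with the old segment removed.
    rest = xor_bytes(old_data, old_parity)
    # Step 2: fold the new segment in to get the parity of the updated stripe.
    return xor_bytes(rest, new_data)
```

The result equals the XOR of the whole updated stripe even though the other data drives were never read.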
  • the Array Switch hardware includes features designed to facilitate this process. [00145] As described above, a partial stripe update involves only one data drive and a parity drive. In the example shown, the data drive is treated as a logical drive 0 attached to SATA 0 and the parity drive is assigned to SATA 3. With only one data drive, the minimum transfer length is one.
  • the read command is written to SATA 0 and broadcast to SATA 0 and SATA 3. During the read phase, the Array switch waits until both SATA ports are ready to transfer data.
  • the data is then transferred synchronously with the XOR being computed "on the fly” and the result is stored in the buffer indicated by the DMA channel for SATA 3.
  • This result gives the software driver the parity data for the current track with the data segment that is going to be updated already removed. This accomplishes the two read operations and the first XOR computation of the traditional RAID5 read modify write in a single action.
  • the software driver programs the DMA channel 0 with the address of the buffer holding the new data and DMA channel 3 with the address of the buffer holding the contents of the parity computation just completed.
  • the commands are written to logical port 0 and broadcast to SATA 0 and SATA 3.
  • the Array Switch will wait until both drives are ready to receive data. It will then transfer one sector to the SATA port of the data drive using its DMA channel. The transfer is snooped and a copy of the sector is captured in the buffer. It will then transfer one sector using the DMA channel of the parity drive's SATA port.
  • the Array Switch will compute the XOR of this data with the contents of the buffer, sending the result of the computation to the parity drive's SATA port. At this point, one sector has been read from each DMA channel's buffers and one sector has been written to each SATA port. This process is repeated until all of the data has been transferred.
  • Mapping registers may be organized in logical port sequence. It appears now that a preferred solution may be to organize mapping registers in a physical port sequence. There would be a field in each register for each physical port. The entry in the field identifies the corresponding logical drive (or indicates that the physical port is not used in that array).
  • the typical RAID controller for a small computer system includes an interface to a host system and an interface to a drive array.
  • FIG. 23 is a simplified block diagram of a disk array controller 10 providing a host interface 16 for interaction with a host bus 12, and a drive interface 22 for interaction with a plurality of attached disk drives 14.
  • the controller preferably includes a control processor 20 and a buffer memory 18 for temporary storage of data moving between the host bus and the drives.
  • a physical port is required for the attachment of a mass storage device such as a disk drive to a system. While some interfaces are capable of supporting concurrent data transfers to multiple devices, the physical ports tend to become a bottleneck. For this reason, a high performance RAID controller may have a physical port per mass storage device as shown in Figure 24A.
  • Figure 24A also shows the corresponding contents of a Mapping Register 24, further described below with reference to FIG. 26.
  • One of the performance benefits of RAID comes from the striping of data across the drives of the array. For example, reading data from four drives at once yields a four times improvement over the transfer rate of a single drive. For the example shown in FIG. 24A, the sixteen bit data arriving from four drives is merged in logical drive order into sixty-four bit data that is sent to the buffer (18 in FIG. 23). User data was striped, i.e. it was distributed a segment at a time (e.g. 16-bit word) across an array of drives in a predetermined sequence. We identify that sequence as starting with Logical Drive #0 and proceeding through Logical Drive #n-1, where n is the number of drives in the array.
  • FIG. 26 conceptually represents a data path switch described later in detail.
  • the data path switch 26 provides dynamically configurable data paths between logical data ports and physical data ports.
  • Figure 24A, with its direct connection between logical data ports and physical data ports, is only a conceptual diagram. In real applications, the number of available physical data ports will be greater than the number of logical data ports. There may be ports that are reserved as "hot spares" or the physical ports may be grouped into different sub-arrays that are accessed independently.
  • Figure 24B is an example of one of the possible assignments of four logical data ports (Logical Port #0 to Logical Port #3) to the available five physical data ports (Physical Port #0 to Physical Port #4).
  • the large arrow 30 simply indicates the assignment of Logical Port #1 to Physical Port #2.
  • Figure 24B also shows the corresponding contents of a Mapping Register 24.
  • the second field from the right in the register corresponds to Logical Port #1 , and it contains the value "2" indicating the Physical Port #2, as indicated by arrow 30.
  • the data path switch 26 implements logical to physical port assignments as fully described later.
  • Figure 24C shows an example of a two-drive array where each of the drives is assigned to one of the five available physical ports, namely Physical Port #1 and Physical Port #2.
  • In order to assemble a 64-bit word for the buffer, each of the 16-bit drives must be read twice. On the first read, the data for Logical Ports #0 and #1 are obtained from Physical Ports #2 and #1, respectively. On the second read, Logical Ports #2 and #3 obtain data from Physical Ports #2 and #1, respectively. These operations are orchestrated by the processor 20. Again, the Mapping Register shows the assignments to Physical Ports #1 and #2.
  • Figure 24D shows an example of an array with a single drive connected to physical port #3. For this configuration, the data for logical ports #0 through #3 is obtained by reading the same physical port four times.
  • Figure 25A shows the four-drive array of Figure 24A with the addition of logic 36 to compute a redundant data pattern that is stored on the drive attached to physical port #4.
  • the logical XOR between the corresponding bits of the data from the logical data ports has the advantage over an arithmetic operation in that the XOR operation does not have to propagate a carry. Due to the use of the XOR, the fifth drive is often referred to as either the "redundant" drive or the "parity" drive.
  • FIG. 25B shows the same four-drive array as defined in FIG. 25A, with the data paths 40, 42 etc. shown for the disk read direction. In this case, the drive attached to physical port #2 has failed. Accordingly, the corresponding data path 44, which does not function, is shown in dashed lines.
  • the XOR function is computed across the data from the remaining data drives (Physical Ports #0, #1 and #3) and from the redundant drive, Physical Port #4. This computation reconstructs the data that was stored on the failed drive and the result is directed to logical port #2 via data path 46 in place of the now unavailable data from the failed drive.
  • the Mapping Register consists of five fields, one for each of five logical data ports, L0-L4 in this example. Each logical data port's corresponding field in the register is loaded with the number of the physical data port to which it is connected. The data in the field for logical data port 0 is represented symbolically as PP_L0, indicating that it is the Physical Port associated with Logical Port 0. The values in the next four fields are identified as PP_L1, PP_L2, PP_L3, and PP_L4 respectively.
  • the fifth logical data port is a pseudo port. The PP_L4 value is used to assign a physical data port for the Parity drive.
  • the Mapping Register fields can be of almost any size.
  • An eight-bit field, for example, would support an array of up to 256 physical ports. In the illustrative embodiment, with only five physical ports, a three bit field is sufficient. The five fields pack nicely into a sixteen bit register with a bit to spare, noted by an "r" in the Figures for "reserved". Any type of non-volatile memory can be used to store the mapping register information.
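The packing of five three-bit fields into a sixteen-bit register can be sketched as follows. The bit layout is an assumption: the text only implies that the field for Logical Port #0 is the rightmost one, so this sketch places PP_L0 in the three least significant bits and leaves bit 15 as the reserved bit. The function names are hypothetical.

```python
FIELD_BITS = 3                      # enough for physical ports 0-4 plus reserved codes
NUM_FIELDS = 5                      # PP_L0 .. PP_L4
FIELD_MASK = (1 << FIELD_BITS) - 1  # 0b111


def pack_mapping(fields):
    """Pack [PP_L0, ..., PP_L4] into a 16-bit value (assumed layout)."""
    if len(fields) != NUM_FIELDS:
        raise ValueError("expected five fields")
    reg = 0
    for n, value in enumerate(fields):
        if not 0 <= value <= FIELD_MASK:
            raise ValueError("field does not fit in three bits")
        reg |= value << (n * FIELD_BITS)
    return reg  # 15 bits used; bit 15 stays clear (reserved)


def unpack_mapping(reg):
    """Recover [PP_L0, ..., PP_L4] from a packed register value."""
    return [(reg >> (n * FIELD_BITS)) & FIELD_MASK for n in range(NUM_FIELDS)]
```

Under this assumed layout, a four-drive array with no parity drive, with fields 0, 1, 2, 3 and 7, packs to `0x7688`.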
  • To demonstrate the use of the Mapping Register, we will briefly revisit each of the configurations described so far.
  • In FIG. 24A, a Mapping Register 24 is shown.
  • the value of PP_L0 is 0 indicating the logical data port #0 is connected to physical port #0.
  • the next three values are 1 , 2, and 3 indicating that the next three logical data ports are connected to the next three physical data ports.
  • the value of PP_L4 is 7. This is not a legal physical port number in this example.
  • the value "7" is used to indicate that there is no parity drive in this configuration.
  • the specific value chosen is not critical, as long as it is not an actual physical port number.
  • FIG. 24B shows the values stored in the Mapping Register.
  • Figure 24C shows the Mapping Register configured for a two-drive array. Note that logical data ports #2 and #3 are associated with the same physical ports as logical ports #0 and #1. The first two logical ports transfer data on the first physical port cycle while the second two logical ports transfer data on the second physical port cycle.
  • Figure 24D shows the Mapping Register configured for the single drive case. Logical ports #0 through #3 transfer data on successive cycles to physical port #3. All of the variations of Figure 24 are different data path configurations shown independent of the redundant data logic.
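The way one Mapping Register covers four-drive, two-drive and single-drive arrays can be sketched as follows. This is a software model under two assumptions: each physical port is modeled as an iterator of 16-bit words, and logical port 0 occupies the least significant sixteen bits of the 64-bit buffer word. The hardware reads several ports in the same cycle, whereas this sketch visits them in sequence, which gives the same result. `assemble_qword` is a hypothetical name.

```python
def assemble_qword(mapping, port_streams):
    """Build one 64-bit buffer word from four logical data ports.

    mapping:      [PP_L0, PP_L1, PP_L2, PP_L3], physical port per logical port
    port_streams: dict of physical port number -> iterator of 16-bit words
    """
    word = 0
    for logical, physical in enumerate(mapping):
        # A physical port named more than once is simply read again.
        data = next(port_streams[physical]) & 0xFFFF
        word |= data << (16 * logical)
    return word
```

For the two-drive case of Figure 24C the mapping is `[2, 1, 2, 1]`, so each drive supplies two consecutive words per 64-bit buffer word.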
  • Figure 25A shows the XOR logic in the Disk Write direction for the same data drive configuration as Figure 24A.
  • the XOR is computed over the data from all four of the logical data ports. The result is stored on the drive attached to the physical port specified in logical port #4 field of the Mapping Register.
  • PP_L4 has a value of "4" instead of "7" indicating that there is a parity drive and that it is attached to port #4.
  • Figure 25B shows the XOR logic in the Disk Read direction for the same data path as Figures 24A and 25A, except that the drive attached to physical port #2 has now failed.
  • the contents of the Logical Data Port 2 field, PP_L2, have been replaced with a "5".
  • the legal physical port numbers are 0 through 4.
  • the "5" is a reserved value used to indicate that a drive has failed. Any logical data port accessing the pseudo physical port number 5 will take its data from the output of the XOR.
  • each of the four logical data ports must be able to receive data from any of the five physical data ports or, in the case of a failed drive, from the Disk Read XOR.
  • each of the logical data ports has a corresponding six-to-one multiplexor 50, sixteen bits wide.
  • the multiplexor 50 for logical port 1 is shown in the Figure 27A, but the others (for Logical Ports #0, #2 and #3) are identical.
  • the selector or "S" input of the multiplexor is connected to the Logical Port #1 field of the Mapping Register, "PP_L1".
  • the PP_L1 values of 0 through 4 select data from physical ports #0 through #4 respectively, while the value "5" selects the output of the Disk Read XOR.
  • Figure 27B shows the Disk Read XOR logic 52.
  • the Disk Read XOR 52 is a five-input XOR circuit, sixteen bits wide in the preferred embodiment (corresponding to the attached disk drive data paths). (This is equivalent to sixteen XORs, each with five inputs.) Each of the five inputs is logically qualified or "gated" by a corresponding AND gate, also sixteen bits wide, for example AND gate 54. (This is equivalent to sixteen AND gates, each with two inputs.) The five AND gates are qualified by the corresponding five physical port select signals, PP0_SEL through PP4_SEL. The generation of these signals will be described below.
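The disk read data path (six-to-one multiplexors plus the gated Disk Read XOR) can be modeled in software as follows. This sketch assumes that a physical port's select signal is asserted whenever some field of the Mapping Register names that port; the text defers the actual generation of these signals, so that rule is an assumption. The function name `disk_read` is hypothetical, and only the values 0 through 5 are handled for the data port fields.

```python
FAILED = 5  # reserved value: take the data from the Disk Read XOR output


def disk_read(mapping, physical_data):
    """Return the 16-bit words seen by logical data ports 0-3.

    mapping:       [PP_L0, PP_L1, PP_L2, PP_L3, PP_L4]
    physical_data: five 16-bit words, one per physical port 0-4
    """
    # Assumed rule: a port is selected if any mapping field points at it.
    selected = [port in mapping for port in range(5)]
    xor_out = 0
    for port in range(5):
        if selected[port]:  # the AND gate passes this input to the XOR
            xor_out ^= physical_data[port]
    # One six-to-one multiplexor per logical data port.
    return [xor_out if field == FAILED else physical_data[field]
            for field in mapping[:4]]
```

With the drive on physical port #2 failed and the parity drive on port #4, the mapping `[0, 1, 5, 3, 4]` makes logical port #2 receive the XOR of ports #0, #1, #3 and #4, which is the reconstructed data.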
  • the data path to each of the physical ports may come from any of the four logical data ports, or from the Disk Write XOR. Examples were shown with reference to FIGS. 24A-24D. While a field of the Mapping Register identifies the data source for each of the logical data ports, we do not have a field that provides the corresponding data for each of the physical ports. This information can be derived from the fields that we do have.
  • Each of the three bit binary encoded fields of the Mapping register is decoded with a "one of eight" decoder. Figure 28 shows such a decoder 66 for the Logical Port #1 field.
  • the value PP_L1 is decoded into L1_P0, L1_P1 , L1_P2 ... L1_P7 where the names indicate a path from a source to a destination. L1_P2, for example, indicates a path from Logical Port #1 to Physical Port #2.
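The decoding step, and the way it recovers the source of each physical port from the logical port fields, can be sketched as follows. This is an illustrative model only; `one_of_eight` and `physical_sources` are hypothetical names, and `decoded[n][m]` stands for the signal the text calls Ln_Pm.

```python
def one_of_eight(value):
    """Decode a three-bit field into eight one-hot outputs."""
    return [value == i for i in range(8)]


def physical_sources(mapping):
    """List, for each physical port 0-4, the logical ports routed to it.

    mapping is [PP_L0, ..., PP_L4]; reserved values such as 5 or 7 decode
    to outputs that no physical port uses.
    """
    decoded = [one_of_eight(v) for v in mapping]  # decoded[n][m] is Ln_Pm
    return {m: [n for n in range(len(mapping)) if decoded[n][m]]
            for m in range(5)}
```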
  • In FIG. 29A, sample circuitry is shown for multiplexing of the data paths 70 from the logical data ports to the physical data ports (#0-#4).
  • the multiplexor 72 for physical port #2 is shown in the figure, but the multiplexors for the other four ports (not shown) are identical.
  • Each of the multiplexors 72 consists of an AND / OR array with five AND gates 74, all sixteen bits wide, and a corresponding OR gate 76.
  • Each of the AND gates is the equivalent of sixteen AND gates, each with two inputs.
  • the OR gate is equivalent to sixteen OR gates, each with five inputs.
  • the AND gates from the logical data ports are qualified by the corresponding outputs of the five decoders, i.e.
  • each of the decoders 66 has an enable input "EN" that qualifies all of its outputs.
  • One of the advantages of the present invention is the ability to easily configure, or reconfigure, the organization of attached drives into defined arrays simply by storing appropriate configuration bytes into the mapping register.
  • commands such as read and write
  • Figure 30 shows one implementation to address this issue.
  • the address, strobe, and chip select signals CS0, CS1, DA0, DA1, DA2, DIOW and DIOR are shown for the first two of the five physical ports (P0 and P1). Note that these address and strobe signals are common to all five ports. They are buffered individually so that a failure of a given drive cannot block the propagation of these signals to the other drives. See buffers 80, 82.
  • the output drivers for the two chip select signals CS0#, CS1# of a given port are qualified by the Pn_SEL signal for that port; see gates 84, 86. Any port not selected by the current contents of the Mapping Register will not have either of its chip selects asserted and therefore will ignore the read and write strobes.
  • a "global read" does not make any sense as it implies that potentially conflicting data values are returned on a common bus.
  • a "global read" causes a read strobe, Figure 30 Pn_DIOR#, to be "broadcast" to all of the physical data ports.
  • Those attached storage devices qualified by a chip select Pn_CS0#, Pn_CS1# will return data to the physical port where it is latched at the trailing edge of the Pn_DIOR# strobe. No attempt is made to return a data value to the local processor as a result of this read cycle.
  • the local processor will then read each of the ports one at a time using a different address which does not cause a repeat of the Pn_DIOR# strobe cycle and without changing any of the latched data. These cycles do allow the local processor to fetch the potentially unique values stored in each of the data latches.
  • the Pn_DIOR# cycle, which may require up to 600 ns, is only executed once.
  • the values latched in each of the ports may be fetched in 15 ns each for a significant time savings over repeating the Pn_DIOR# cycle five times.
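The time saving can be checked with simple arithmetic. The sketch below uses the two figures quoted above (up to 600 ns per strobe cycle, 15 ns per latch fetch) and assumes all five ports are read; the function names are illustrative.

```python
STROBE_NS = 600      # upper bound for one Pn_DIOR# cycle, from the text
LATCH_FETCH_NS = 15  # time to fetch one latched value, from the text
PORTS = 5


def global_read_ns(ports=PORTS):
    """One broadcast strobe, then one fast fetch per port."""
    return STROBE_NS + ports * LATCH_FETCH_NS


def repeated_read_ns(ports=PORTS):
    """A full strobe cycle repeated for every port."""
    return ports * STROBE_NS
```

With these figures the global read sequence takes at most 675 ns, against 3000 ns for five separate strobe cycles.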
  • the "global read" and "global write" apparatus allows the local processor to send commands to and receive control status from the currently selected array in the minimum possible amount of time.
  • the control interface updates automatically without other code changes.
  • each of the drives has an interrupt output used to signal the need for service from the controller.
  • Figure 31 shows the use of a multiplexor 90 controlled by PP_L0 value from the Mapping Register to select the interrupt of the physical port associated with logical data port zero.
  • Each of the other logical data ports has an identical multiplexor (not shown) that uses the corresponding PP_Ln value to locate its interrupt.
  • the buffer 92 takes the selected interrupts from each of the logical data port multiplexors (90 etc.).
  • the interrupts appear in logical data port order starting with logical data port zero in the bit zero position.
  • the same technique can be used to sort both internal and external signals from the physical data ports including drive cable ID signals and internal FIFO status signals. This feature enables the local firmware to use a common sequence of code for multiple arrays with different numbers of physical ports.
  • the selected interrupts from the logical data ports can be logically ANDed 94 and ORed 96 as shown in Figure 31 to provide signals "Interrupt ALL" and "Interrupt ANY".
  • When the local processor has issued a command, and before any data has been transferred, it might want to know about an interrupt from ANY of the drives as one or more drives may have rejected the command or had some other error. Once the drives have begun to transfer data, the local processor will want to know when ALL of the drives have asserted their interrupt signals as this indicates the completion of the command. Note that this type of implementation makes the software independent of the number of drives. (For a two-drive array, the interrupt signal from each device appears twice while in a single drive array, the same drive appears four times. The ANY and ALL signals still function correctly.)
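The interrupt sorting and the ANY / ALL signals can be sketched as follows. This model assumes every logical data port field names a real physical port (no failed-drive code) and represents interrupt lines as booleans; the function names are hypothetical.

```python
def sorted_interrupts(mapping, physical_irq):
    """Interrupt lines in logical port order (index n = logical data port n).

    mapping:      [PP_L0, PP_L1, PP_L2, PP_L3]
    physical_irq: interrupt state of each physical port, indexed by port
    """
    return [physical_irq[pp] for pp in mapping]


def interrupt_any(mapping, physical_irq):
    """True if at least one drive of the array requests service."""
    return any(sorted_interrupts(mapping, physical_irq))


def interrupt_all(mapping, physical_irq):
    """True once every drive of the array has asserted its interrupt."""
    return all(sorted_interrupts(mapping, physical_irq))
```

Because a two-drive array names each physical port twice, the same two interrupt lines are simply sampled twice and both signals behave as they do for a four-drive array.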
  • each of the physical data ports appears at a unique location within the local processor address space.
  • the decoded output is remapped according to the contents of the Mapping Register.
  • the Mapping Register is loaded with an "identity" pattern, i.e. logical device 0 points to physical port 0, logical device 1 points to physical port 1 , etc. This makes the physical ports appear in order starting with first physical port location in the processor's address space. In normal operation the Mapping Register will be loaded with a logical to physical drive map.
  • the local processor may access the interrupting drive through the unique address space that accessed physical port 2 when the identity map is loaded. This makes the servicing of logical drives independent of the physical data port to which they are attached.
  • One hardware implementation of the logical addressing feature is shown in Figure 32.
  • When the processor accesses the address region for the device port space, the one-of-eight decoder 100 decodes processor address lines five through seven, defining thirty-two byte spaces for each of the devices. The decoding of each space asserts the corresponding port N decode signal, Pn_DEC. The decoding of the virtual port number seven is the signal for a global access.
  • the P7_DEC signal is ORed with each of the other decode signals 102 so that the resulting port select signals Pn_SEL (n = 0-4) are asserted both for a specific access of that port and for a global access.
  • Each of the port select signals is then steered by the PP_Ln values from the Mapping Register.
  • the one-of-eight decoder 104 takes the P2_SEL signal and routes it according to the PP_L2 value from the Mapping Register, producing a set of signals of the form L2_P0_CS, indicating a chip select for physical port zero from logical port two.
  • the one-of-eight decoders for the other four logical ports are identical (not shown).
  • Each physical port has a five-input OR gate, for example 106.
  • the OR gate 106 for physical port #2 is shown. It ORs together the five different sources for a chip select to physical port #2. Note that for a single-drive sub-array, the chip select will be asserted by all four logical devices and for a dual drive sub-array, the chip select is asserted by two of the logical devices.
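The logical addressing path (address decode, global access, steering by the Mapping Register and the per-port OR gate) can be sketched as follows. This is a software model of the behaviour described above, assuming the port number is taken from address bits five through seven; `port_selects` and `chip_selects` are illustrative names.

```python
GLOBAL_PORT = 7  # virtual port number used for a global access


def port_selects(address):
    """Decode address lines 5-7 into Pn_SEL for n = 0-4."""
    port = (address >> 5) & 0x7  # one 32-byte space per port
    return [port == n or port == GLOBAL_PORT for n in range(5)]


def chip_selects(address, mapping):
    """Chip select state for physical ports 0-4.

    mapping is [PP_L0, ..., PP_L4]. Each asserted Pn_SEL is steered to the
    physical port named by PP_Ln; fields holding a reserved value (5 or
    above) reach no physical port.
    """
    sel = port_selects(address)
    cs = [False] * 5
    for logical, physical in enumerate(mapping):
        if sel[logical] and physical < 5:
            cs[physical] = True  # OR of the Ln_Pm_CS terms for this port
    return cs
```

An access to the space of logical port 2 therefore reaches whichever physical port PP_L2 names, and a global access reaches every physical port that the current mapping uses.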
  • mapping register it can be called a logical mapping register. As explained, it provides a field for each logical drive in a defined array, and in that field, a value indicates a corresponding physical port number.
  • a register provides a field for each physical port or attached drive, and in each field, a value indicates a corresponding logical port number.
  • This alternative mapping register is illustrated in the following example.
  • the first block of data (as well as the 5th, 9th, etc) is stored on the drive connected to physical port #1.
  • the second block (as well as 6th, 10th, etc ) is stored on the drive connected to physical port #2.
  • the third block of data (as well as 7th, 11th, etc) is stored on the drive connected to physical port #4.
  • the first block of data goes on logical drive 0, the second on logical drive 1, the third on logical drive 2 and the fourth on logical drive 3.
  • the two alternative types of mapping registers for this case are as follows:
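  • The relationship between the two register types can be illustrated with a short sketch. It derives a physical mapping register (one field per physical port) from a logical mapping register (one field per logical drive). Only the three port assignments stated above are used; the count of five physical ports, the `UNUSED` marker, and the function name are assumptions made for illustration.

```python
UNUSED = None   # hypothetical marker for a port with no logical drive assigned

def to_physical_map(logical_map, num_ports):
    """Invert logical drive -> physical port into physical port -> logical drive."""
    physical_map = [UNUSED] * num_ports
    for logical_drive, port in enumerate(logical_map):
        if physical_map[port] is not UNUSED:
            raise ValueError(f"physical port {port} assigned twice")
        physical_map[port] = logical_drive
    return physical_map

# Logical drives 0, 1 and 2 reside on physical ports 1, 2 and 4 respectively.
logical_map = [1, 2, 4]
physical_map = to_physical_map(logical_map, num_ports=5)
```

  • Either form carries the same information; the logical form is convenient when steering a logical access to a port, while the physical form is convenient when a port needs to know which logical drive it serves.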
  • FIG. 33 is a simplified conceptual diagram of an embodiment of a disk array controller employing a plurality of SATA ports and drives in accordance with the present invention.
  • a series of buffers 420 are shown conceptually. In practice, these buffers could be anywhere in the available host system address space.
  • This diagram illustrates a separate buffer 420, host interface 450, DMA channel 422, data path switch logic 460, and SATA port 424 for each of the disk drives 428.
  • the actual relationships among the data channels, buffers, and drives depend upon the current configuration of the data path switch logic 460.
  • mapping can be reconfigured dynamically (for example in the event of a drive failure), and it can be reconfigured under software control.
  • the data path switch supports data paths for multiple drives and for multiple arrays concurrently, i.e. without software involvement once the paths are set up.
  • FIG. 34A illustrates the host interface 450 of FIG. 33 in greater detail.
  • the host interface includes a system bridge 500 providing data transfer between system main memory and a PCI bus 502.
  • the PCI bus is merely one example.
  • the PCI bus is coupled via a PCI bus interface 504 to various logical DMA channels 510, under control of a bus arbiter 506.
  • In operation, only one DMA channel at a time actually transfers data to or from the PCI bus. That said, multiple DMA channels can be concurrently "active" as further discussed later.
  • FIG. 34B shows a SATA port 520 in greater detail, indicating operation in the disk read direction.
  • the interface implements a physical layer 522 for physical connection to an attached drive 524; a link layer 526, and a transport layer 528.
  • the transport layer includes a FIFO memory 530 which provides a data buffer between the physical drive and the controller.
  • the SATA interface specification provides a handshake mechanism for either end of the link 532 to throttle (pause) a data transfer from the other end.
  • the SATA link is half duplex.
  • the protocol uses the reverse channel to handshake the transfer of each FIS. Due to the much higher speed of the link 532, up to 80 additional bytes may be received after requesting a pause in the transfer.
  • the transport layer FIFO 530, when receiving data from the drive, can generate an "almost full" indication (not shown) that will throttle the link using the back channel to prevent FIFO overflow and data loss.
  • the other side of this FIFO, data path 540, can be accessed with locally generated timing.
  • the control flag "EMPTY" from the FIFO, and the control signal "POP" to the FIFO are used to control accessing the data.
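  • The throttling behavior can be sketched as a small FIFO model. This is an illustration under stated assumptions: the capacity of 128 bytes and the class and attribute names are hypothetical, and the only figure taken from the text is the allowance of up to 80 bytes that may still arrive after a pause is requested.

```python
from collections import deque

class PortFifo:
    """Toy model of a transport-layer FIFO in the disk read direction."""

    def __init__(self, capacity, slack=80):
        self.capacity = capacity   # total bytes the FIFO can hold (assumed value)
        self.slack = slack         # bytes that may arrive after a pause request
        self.data = deque()

    @property
    def almost_full(self):
        # Raised early enough that the in-flight bytes still fit.
        return len(self.data) >= self.capacity - self.slack

    @property
    def empty(self):
        return not self.data

    def receive(self, byte):
        """Link side: accept one byte arriving from the drive."""
        if len(self.data) >= self.capacity:
            raise OverflowError("FIFO overflow: data would be lost")
        self.data.append(byte)

    def pop(self):
        """Controller side: POP one byte using locally generated timing."""
        return self.data.popleft()
```

  • Setting the threshold at capacity minus the slack means the pause can be requested late in the fill cycle without risking overflow, which is what decouples the controller-side timing from the link.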
  • FIG. 34C shows the same SATA port 520 during a disk write operation.
  • control signals "FULL" and "PUSH" are used to write data to the port FIFO, for subsequent transfer to the attached drive. Again, the data transfer is decoupled from the actual link to the drive, thus the transfer from the controller can be synchronous, leveraging the advantages of "on-the-fly" redundancy operations, implemented in the switch logic 460 in a preferred embodiment.
  • the data path switch logic in one embodiment was described in detail above with reference to FIGS. 24-32. For example, see data path switch 26. As noted, it provides dynamically configurable data paths between logical data ports or DMA channels and physical data ports. In one embodiment, the configuration is determined by mapping data stored in mapping registers. The mapping can be logical or physical as noted above.
  • FIG. 40 is a simplified diagram of a data path switch, illustrating configuration in the disk write direction, and showing four DMA channels (DMA0-DMA3) for data transfer with a host or buffer memory.
  • the switch preferably includes hardware for implementing synchronous transfers as well as on-the-fly redundancy (XOR) operations.
  • This block is inserted between the DMA channels and the SATA ports (SATA P0-SATA P3).
  • the specific numbers of DMA channels and SATA ports are merely for illustration and not limiting. While some of the other drawings show specific connections between DMA channels, SATA ports, and the XOR logic (for example FIGS. 13, 14, 24, 25), these paths are in fact all configurable using logic as discussed earlier. All data paths preferably are 32-bit for SATA or 16-bit for PATA / UDMA applications.
  • disk write data to any SATA port may come from any of the DMA sources A-D or from the XOR block 4010 output "X".
  • Each SATA port has a FULL status flag output as indicated.
  • the PUSH signals of all active status ports are asserted simultaneously, but only when the FULL flags of the active ports are all false.
  • the XOR block 4010 can compute the XOR of any combination of the DMA inputs by qualifying the appropriate combinations of the AND gates with enable logic signals, e.g. XB_ENA as needed for a particular disk drive array configuration or striping scheme.
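  • One synchronous step in the disk write direction can be sketched as follows. The sketch assumes port FIFOs modeled as Python lists with a fixed depth, and a hypothetical `source_select` table that stands in for the configured data paths; it is an illustration of the gating rule, not the switch hardware.

```python
from functools import reduce

def xor_block(dma_words, enables):
    """Model XOR block 4010: XOR of the DMA inputs whose enable is set."""
    return reduce(lambda acc, w: acc ^ w,
                  (w for w, en in zip(dma_words, enables) if en), 0)

def write_step(dma_words, enables, source_select, fifos, depth):
    """Push one word to every active port, or to none of them.

    source_select maps a port index to a DMA index, or to "X" for the XOR output.
    Returns True if the common PUSH was asserted.
    """
    if any(len(fifos[p]) >= depth for p in source_select):   # some FULL flag is set
        return False
    x = xor_block(dma_words, enables)
    for port, src in source_select.items():
        fifos[port].append(x if src == "X" else dma_words[src])
    return True
```

  • Because the push is all-or-nothing, every active port always holds the same number of elements, which keeps the data and redundancy streams aligned.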
  • the DMA channel data paths might be 32, 64, or 128 bits depending on the width of the data path to memory.
  • the data path switch block will pack or unpack the 32-bit SATA elements as required to build up the data path width.
  • the logic receives data words of the DMA width, and it outputs from 1 to 4 32-bit words at a time, synchronously to the array of drives and to the XOR logic.
  • this block receives 1 to 4 32-bit words at a time, synchronously, taking advantage of the SATA port FIFOs as discussed above.
  • FIG. 40 illustrates one embodiment of XOR accumulator circuitry for use in a data path switch.
  • the accumulator is an alternative to providing additional buffering for each DMA that was required because the DMA channels do not transfer data synchronously.
  • the shared Host bus interface (see FIG. 5A) ensures that the Host transfers from different buffers will not be overlapping.
  • the FIFO 4210 in this illustration could be one sector in length.
  • This apparatus might be used, for example, to write a full stripe of a three data drives plus redundant drive array using the following process:
  • 1. The process would wait until all active drives were indicating NOT FULL which, in this case, would mean that they could accept one more sector of data. (See FULL flag in FIG. 40.)
  • 2. Next, one sector of data is transferred from Buffer 0 using DMA Channel 0 along the path "A" shown in FIG. The data ("A") would also pass through the XOR 4222, enabled by A_ENA, into the FIFO 4210.
  • 3. One sector of data is transferred from Buffer 1 using DMA Channel 1 along the path "B" shown in FIG. 40 to SATA port 1. The data "B" would also pass through the XOR 4222, enabled by B_ENA. The current contents of the FIFO 4210 are enabled by X_ENA. Consequently "B" and "X" would be XORed with the result going into the FIFO 4210.
  • 4. One sector of data would be transferred from Buffer 2 using DMA Channel 2 along the path "C" to SATA port 2. The data would also pass through the XOR, enabled by C_ENA. The current contents of the FIFO would be enabled by X_ENA. "C" and "X" would then be XORed with the result, the XOR of the sectors from Buffer 0, Buffer 1, and Buffer 2, going synchronously to SATA 3.
  • the above scheme uses only one FIFO, a single sector in length. The process of moving one sector from each buffer would be repeated as required until all data had been transferred.
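  • The accumulator process for a full stripe write can be sketched in a few lines. This is a simplified model: the FIFO is represented as a bytes object that starts at zero, which gives the same result as loading the first sector without X_ENA, and the function name is hypothetical.

```python
def write_full_stripe(buffers):
    """Send one sector from each data buffer and return (data_out, parity).

    A single sector-long accumulator holds the running XOR, so no per-channel
    buffering is needed even though the sectors arrive one after another.
    """
    sector_len = len(buffers[0])
    fifo = bytes(sector_len)                    # running XOR, initially all zero
    data_out = []
    for sector in buffers:
        data_out.append(sector)                 # path A, B or C to its data port
        fifo = bytes(a ^ b for a, b in zip(sector, fifo))   # sector XOR "X"
    return data_out, fifo                       # final XOR goes to the redundant port
```

  • For example, sectors `b"\x01\x02"`, `b"\x10\x20"` and `b"\xff\x00"` leave the data unchanged on their way to the data ports and produce the redundant sector `b"\xee\x22"`.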
  • a host bus rate of 450 MBPS for example, would support the full 150 MBPS for four SATA drives.
  • Disk read data from any SATA port, SATA P0 - SATA P3, can be steered to any of the DMA destinations A-D, as indicated conceptually by the multiplexers, e.g. mux 4110.
  • data from any of the SATA ports can be input to the XOR 4120, gated by the corresponding enable signal, e.g. X1_ENA.
  • Each SATA port has an EMPTY status flag output.
  • the POP signals of all active status ports are asserted simultaneously, but only when the EMPTY flags of the active ports are all false.
  • the XOR block 4130 can compute the XOR of any combination of the SATA inputs by qualifying the appropriate combinations of the AND gates.
  • the DMA channel data paths might be 32, 64, or 128 bits depending on the width of the data path to memory.
  • This block receives elements from the SATA ports and may receive elements from the XOR in place of the data from a failed drive.
  • FIG. 36 illustrates this read failed drive situation. It will pack elements from 1 to 4 sources to build words for DMA transfers. For wide striped applications, there may be more than one DMA channel. As shown in the "HOST INTERFACE DETAIL" drawing of FIG. 5A, the DMA channels may be running concurrently, but the Host bus will only transfer data for one at a time.
  • FIG. 43 illustrates one embodiment of an XOR accumulator circuitry for use in a data path switch in the read direction.
  • the accumulator is an alternative to providing additional buffering for each DMA that was required because the DMA channels do not transfer data synchronously.
  • the shared Host bus interface (see FIG. 5A) ensures that the Host transfers from different buffers will not be overlapping.
  • the FIFO 4310 in this illustration could be one sector in length.
  • This apparatus can be used in many ways. For example, it might be used to read a full stripe of a three data drive plus redundant drive array with the drive on SATA 2 failed, for example, using the following process:
  • one sector of data is transferred from SATA 3, enabled by P3_ENA, and XORed with the current contents of the FIFO 4310 enabled by X_ENA once again. The result is sent through DMA Channel 2 to Buffer 2 thereby completing the full stripe read.
  • all of the non-failed drives are transferred first followed by the parity drive data.
  • the corresponding blocks of data from the three SATA ports would be transferred to the XOR in the same order, but only the final result would be transferred through a DMA channel to a buffer.
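  • A degraded full stripe read can be sketched along the same lines. The sketch assumes the failed drive is a data drive, processes the surviving data ports first and the redundant port last as described above, and uses hypothetical function names.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length sectors."""
    return bytes(x ^ y for x, y in zip(a, b))

def read_stripe_degraded(sectors, failed, parity):
    """Return the data sectors in port order, rebuilding the failed one.

    sectors[p] is the sector read from port p; the entry for the failed
    port is never touched. Assumes failed != parity.
    """
    fifo = bytes(len(sectors[parity]))          # accumulator, initially zero
    result = {}
    survivors = [p for p in range(len(sectors)) if p not in (failed, parity)]
    for p in survivors + [parity]:              # surviving data first, parity last
        fifo = xor_bytes(fifo, sectors[p])
        if p != parity:
            result[p] = sectors[p]              # surviving data also goes to its buffer
    result[failed] = fifo                       # reconstructed sector
    return [result[p] for p in sorted(result)]
```

  • The reconstructed sector only becomes available after the last surviving port has been read, which is why it is the final transfer of the stripe.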
  • the stripe width may be any multiple of a single sector.
  • a read request may start on any sector, and its length may range from a single sector up to the capacity of the array.
  • a general method of synchronously reading data from a redundant disk drive array in accordance with the present invention proceeds as follows:
  • From the drive whose block contains the initial sector, data is read from the initial sector up to the end of the block, or the end of the read request, whichever comes first. If the initial sector happens to be on a failed drive, the corresponding range of sectors is read from the remaining drives, computing the XOR and transferring the result "on-the-fly" as described herein.
  • For additional whole data blocks requested, blocks are read from successive blocks across the stripe and from successive stripes.
  • the read accesses must occur concurrently on each of the drives.
  • If a block is required from a failed drive, the concurrency is not possible because data from all of the other drives are required to reconstruct the failed block.
  • a data read access is made from the drive on which the block including the final sector resides. Data is transferred from the start of the block or from the initial sector if it were in the same block, up to the final sector. If this block of data resided on a drive which had failed, the corresponding sector ranges of the blocks of the stripe on the other drives are read, the XOR is computed, and the result is returned.
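  • The way a read request breaks down into a leading partial block, whole blocks, and a trailing partial block can be sketched as below. The sketch assumes blocks are placed round-robin across the data drives in logical order and does not model the failed-drive substitution; the function and parameter names are hypothetical.

```python
def split_read(start, count, block_sectors, data_drives):
    """Split a read request into per-block accesses.

    Returns a list of (stripe, drive, first_sector_in_block, sector_count)
    tuples, one for each block the request touches.
    """
    accesses = []
    sector = start
    end = start + count
    while sector < end:
        block = sector // block_sectors           # logical block number
        offset = sector % block_sectors           # starting sector inside the block
        n = min(block_sectors - offset, end - sector)
        accesses.append((block // data_drives, block % data_drives, offset, n))
        sector += n
    return accesses
```

  • For a block of 8 sectors and 3 data drives, a 14-sector request starting at sector 5 yields a 3-sector access on drive 0, a whole block on drive 1, and a 3-sector access on drive 2, all in stripe 0.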
  • FIGS. 37A-37C illustrate a prior art methodology for a partial stripe update, also called a Read/Modify/Write or R/M/W operation.
  • Step 1 reading the block to be changed from a data drive, and reading the corresponding block from a parity (PAR) drive.
  • Both the data and the redundant data are stored in buffers 3702, 3704, respectively.
  • Step 2 FIG. 37B, a new XOR is formed by reading (and XOR-ing) buffers containing the old data 3702, the old parity 3704, and the new data 3706.
  • the XOR result is stored in buffer 3708.
  • Each buffer has a corresponding DMA channel as illustrated.
  • Step 3 the new data and the new parity data are stored on the respective drives.
  • Another known approach to this problem is to pre-read the blocks which are not being updated. At this point, all of the new and the unchanged blocks will be available in buffers which can be read for XOR computation.
  • the pre-read creates the same starting state as a full stripe write. Using this approach, each data drive is either read or written, and the redundant drive is written.
  • FIG. 38 is the step of reading the current redundancy and the current contents of the block which is to be updated.
  • the XOR of these two blocks is then computed. (The XOR is implemented in the array switch logic 460 as discussed.) The result of this computation is equivalent to the XOR of all of the blocks which are not being updated.
  • This XOR computation of the two blocks effectively "backs out" the effects of the current data in the block which is going to be updated.
  • This intermediate result is stored in a temporary buffer 3810.
  • the intermediate result is then XORed with the new data 3812 to produce an updated redundancy for the entire stripe.
  • the updated redundancy is written to the redundant (PAR) drive 3820 and the new data is written to the data drive 3822.
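  • The reason the two-block pre-read is sufficient can be written out as a short derivation. The symbols are introduced here for illustration only: D_i are the data blocks of the stripe, D_k is the block being updated, D_k' is its new contents, P is the current redundancy, T is the intermediate result, and P' is the updated redundancy.

```latex
\begin{align*}
P  &= D_0 \oplus D_1 \oplus \cdots \oplus D_{n-1} \\
T  &= P \oplus D_k = \bigoplus_{i \ne k} D_i
      && \text{(since } D_k \oplus D_k = 0\text{)} \\
P' &= T \oplus D_k' = \Bigl(\bigoplus_{i \ne k} D_i\Bigr) \oplus D_k'
\end{align*}
```

  • The second line is the "backing out" step: XORing the current block with the current redundancy cancels that block's contribution, leaving the XOR of the blocks that are not being updated.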
  • Redundant Writes - Synchronous Data Transfers: The full stripe write operation can also be performed using a synchronous redundant data transfer previously described for a disk read operation, as follows:
  • Disk write commands are issued to each of the drives of the array.
  • the DMA engine fetches a first element from each of the buffers, computes the XOR of the elements and then writes the first elements to each of the data drives and the result of the XOR computation to the redundant drive, using a common DIOW for the parallel ATA/ATAPI interface.
  • the total amount of data transferred from the buffer is the 12K of new data which is written to the drives. The additional 12K of buffer reads and 4K of buffer writes are eliminated.
  • a partial stripe update can be handled as a full stripe update by first pre-reading the contents of the blocks which are not changing.
  • the described approach of pre-reading only the redundant data and the block to be updated can use synchronous redundant data transfers to advantage, as follows.
  • 1. Read commands are first issued to the redundant drive and the drive to be updated.
  • 2. When both drives are ready to transfer data, first elements are read from each of the drives, the XOR of these elements is computed, and the result is stored in a buffer. This process is repeated, element by element, until the entire blocks have been read from the drives and the resulting block has been stored in a buffer. This buffer now holds the XOR of all of the blocks not being updated.
  • 3. A write command is issued to the redundant drive and the data drive to be updated.
  • 4. When both drives are ready to accept data, a first data element is fetched from the XOR buffer and a first data element is fetched from the update buffer. The XOR of the two elements is computed. Data is then transferred to the two drives synchronously using a common DIOW strobe. The data drive receives the element from the update buffer unaltered, while the redundant drive receives the computed XOR of the two elements. This process is repeated, element by element, completing the partial stripe update.
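  • The two-phase update can be sketched element by element. The sketch models each block as a list of integer elements (for example 16-bit words) and uses a hypothetical function name; the lockstep reads and writes stand in for the synchronous transfers.

```python
def partial_stripe_update(old_data, old_parity, new_data):
    """Model the pre-read and write phases of a partial stripe update.

    Returns (data_written, parity_written) as lists of elements.
    """
    # Phase 1: read both drives in lockstep and buffer the XOR of each pair.
    xor_buffer = [d ^ p for d, p in zip(old_data, old_parity)]

    # Phase 2: write both drives in lockstep, one element per common strobe.
    data_written, parity_written = [], []
    for x, n in zip(xor_buffer, new_data):
        data_written.append(n)          # data drive gets the update unaltered
        parity_written.append(x ^ n)    # redundant drive gets the XOR
    return data_written, parity_written
```

  • Only the drive being updated and the redundant drive are accessed, and the only buffer traffic beyond the new data itself is the single intermediate XOR block.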
  • FIG. 5 illustrates an array with data striped over three data drives plus a redundant drive.
  • a logical order is assigned to the three drives and a stripe width is selected.
  • the sixteen-bit width of the data path is a useful stripe width.
  • the first word (0) of Sector 0 is stored on the first drive (Data 0 in FIG. 5)
  • the second word (1) is stored on the second drive (Data 1)
  • the third word (2) on the third drive (Data 2).
  • This process is then repeated, storing the fourth data word (3) in the second location of the first drive, the fifth (4) in the second location of the second drive and the sixth (word number 5) in the second location of the third drive.
  • This process is repeated word after word and stripe after stripe to the end of the disk.
  • the parity information is stored on the parity drive. For example, in Sector 0, stripe 2, the parity drive word is shown as 6^7^8 meaning the XOR of words 6, 7 and 8 stored on the data drives.
  • an entire stripe consists of three words, one from each data drive. Reading user data will require an access from three drives even without a drive failure. Since entire stripes are consumed, the aggregate bandwidth of the data drives in the array is always achieved.
  • the synchronously accessed word-wide elements are assembled into blocks of the buffer memory width, maintaining the striping order, and stored in the buffer.
  • end conditions will require that a block of three sectors be read, one sector from each drive, and only one or two of these sectors will be returned to the host.
  • the drives always require that entire sectors be transferred. The one or two sectors that were not requested by the host are discarded.
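  • The word-wide striping layout can be illustrated with a short sketch. It assumes three data drives and integer words, pads a short final stripe with zeros purely for the sake of the example, and uses hypothetical function names.

```python
def stripe_words(words, data_drives=3):
    """Distribute words round-robin and compute one parity word per stripe."""
    drives = [[] for _ in range(data_drives)]
    parity = []
    for i in range(0, len(words), data_drives):
        stripe = words[i:i + data_drives]
        stripe = stripe + [0] * (data_drives - len(stripe))   # zero padding (assumption)
        p = 0
        for d, w in enumerate(stripe):
            drives[d].append(w)     # word goes to the next location on drive d
            p ^= w
        parity.append(p)            # XOR of the stripe goes to the parity drive
    return drives, parity

def locate(word_index, data_drives=3):
    """Return (drive, location) for a word number."""
    return word_index % data_drives, word_index // data_drives
```

  • With the words numbered 0 through 8, drive 0 holds words 0, 3 and 6, and word 4 lands in the second location of the second drive, matching the layout described above. The third parity word is the XOR of words 6, 7 and 8.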

PCT/US2005/008647 2004-03-12 2005-03-14 Disk controller methods and apparatus with improved striping redundancy operations and interfaces WO2005089339A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US55359404P 2004-03-12 2004-03-12
US60/553,594 2004-03-12

Publications (2)

Publication Number Publication Date
WO2005089339A2 true WO2005089339A2 (en) 2005-09-29
WO2005089339A3 WO2005089339A3 (en) 2009-04-30

Family

ID=34994276

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2005/008647 WO2005089339A2 (en) 2004-03-12 2005-03-14 Disk controller methods and apparatus with improved striping redundancy operations and interfaces

Country Status (2)

Country Link
TW (1) TWI386795B (zh)
WO (1) WO2005089339A2 (zh)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7496785B2 (en) * 2006-03-21 2009-02-24 International Business Machines Corporation Enclosure-based raid parity assist
US9207876B2 (en) 2007-04-19 2015-12-08 Microsoft Technology Licensing, Llc Remove-on-delete technologies for solid state drive optimization
TWI472920B (zh) * 2011-09-01 2015-02-11 A system and method for improving the read and write speed of a hybrid storage unit
JP2022095257A (ja) * 2020-12-16 2022-06-28 キオクシア株式会社 メモリシステム
JP7516300B2 (ja) * 2021-03-17 2024-07-16 キオクシア株式会社 メモリシステム
CN117251115B (zh) * 2023-11-14 2024-02-09 苏州元脑智能科技有限公司 磁盘阵列的通道管理方法、系统、设备及介质

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6105146A (en) * 1996-12-31 2000-08-15 Compaq Computer Corp. PCI hot spare capability for failed components
US6151641A (en) * 1997-09-30 2000-11-21 Lsi Logic Corporation DMA controller of a RAID storage controller with integrated XOR parity computation capability adapted to compute parity in parallel with the transfer of data segments
US6237052B1 (en) * 1996-05-03 2001-05-22 Netcell Corporation On-the-fly redundancy operation for forming redundant drive data and reconstructing missing data as data transferred between buffer memory and disk drives during write and read operation respectively

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5483641A (en) * 1991-12-17 1996-01-09 Dell Usa, L.P. System for scheduling readahead operations if new request is within a proximity of N last read requests wherein N is dependent on independent activities

Also Published As

Publication number Publication date
WO2005089339A3 (en) 2009-04-30
TWI386795B (zh) 2013-02-21
TW200535609A (en) 2005-11-01

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase