REDUNDANT ARRAY OF SOLID-STATE STORAGE DEVICE
MODULES
Technical Field
[ 001 ] The present invention broadly relates to data storage systems and solid-state storage devices used in connection with computer systems or networks, and more specifically to a redundant array of solid-state storage device modules providing such a data storage system.
Background Art
[ 002 ] Typically, a digital data storage system involves one or more storage or memory devices which are connected to a central processing unit
(CPU). The general function of the data storage units is to store data or programs which the CPU utilises in accordance with instruction sets running on the CPU. Conventional data storage systems include hard disk drives. Conventional hard disk drives have relatively slow access times for stored data compared to solid-state devices such as dynamic random access memory (DRAM) or static random access memory
(SRAM). Hard disk drives using magnetic storage means are the principal means available today for long term or non-volatile storage of data for use in computer systems. While able to readily store data in a non-volatile manner, such hard disk drives have significant disadvantages, for example, low data transfer rates and relatively long disk access times and latencies.
[ 003 ] Solid-state memory devices offer improved data transfer rates and access times. For example, DRAM is commonly used as the main memory in a computer system, but it is generally considered unsuitable for non-volatile memory applications due to continuous power requirements. Other solid-state memory types such as EEPROM and flash memory do not require continuous power to maintain the integrity of the stored data.
[ 004 ] There is a continuing need to provide data storage systems which have fast data transfer rates and access times whilst simultaneously ensuring that data is not lost or is recoverable. Computer applications and user requirements continue to demand faster data retrieval and systems that guarantee that data will not be lost.
[ 005 ] The reliability of magnetic disk drive devices has been dramatically improved by the introduction of RAID (Redundant Array of Inexpensive Disks) systems. Magnetic disk drives organised in this way create data redundancy and can improve data throughput. Fundamental to RAID is striping, a method of concatenating multiple storage devices into one logical storage unit. Striping involves partitioning each disk drive's storage space into stripes, which may be as small as one sector (512 bytes) or as large as several megabytes. These stripes are then interleaved on a round-robin basis, so that the combined space is composed alternately of stripes from each disk drive. In effect, the storage space of the disk drives is shuffled like a deck of cards. The type of application environment, whether input/output or data intensive, determines whether large or small stripes should be used. Various types of RAID are used and are generally well known. These types are briefly summarised below.
[ 006 ] RAID-0 is not redundant, hence does not truly fit the RAID acronym. In level 0, data is split across storage devices, resulting in a higher data throughput. Since no redundant information is stored, performance is very good, but the failure of any storage device in the array results in data loss. This level is commonly referred to as striping. RAID-1 provides redundancy by writing the same data to two or more storage devices. If either storage device fails, no data is lost. This level of RAID is commonly referred to as mirroring. RAID-2 uses Hamming error correction codes and is intended for use with storage devices that do not have built-in error detection. RAID-3 stripes data at a byte level across several drives, with parity stored on one storage device; the parity information allows recovery from the failure of any single storage device. RAID-4 is the same as level 3, but stripes data at a block level across several storage devices, with parity stored on one storage device. The parity information allows recovery from the failure of any single storage device. RAID-5 is the same as level 4, but no dedicated parity storage device exists. Instead, parity is distributed among the data storage devices.
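By way of a non-limiting illustration of the striping and parity principles summarised above, the following sketch (expressed in Python, with arbitrary example values for the stripe size, device count and stripe contents that do not form part of this specification) shows the round-robin mapping of a logical offset to a device and the XOR parity from which a single failed stripe can be rebuilt:

    # Illustrative sketch only: round-robin striping and XOR parity.
    # Stripe size, device count and stripe contents are example values.
    from functools import reduce

    STRIPE_BYTES = 512          # a stripe may be as small as one 512 byte sector
    NUM_DEVICES = 4             # example number of drives in one logical unit

    def map_stripe(logical_offset):
        """Map a byte offset on the logical unit to (device index, device offset)."""
        stripe_number = logical_offset // STRIPE_BYTES
        device_index = stripe_number % NUM_DEVICES          # round-robin interleaving
        device_stripe = stripe_number // NUM_DEVICES         # stripe row on that device
        return device_index, device_stripe * STRIPE_BYTES + logical_offset % STRIPE_BYTES

    def xor_parity(blocks):
        """Byte-wise XOR of equal-length blocks (RAID-3/4/5 parity)."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    data = [b"AAAA", b"BBBB", b"CCCC"]        # example data stripes on three devices
    parity = xor_parity(data)                 # stored on the parity device
    # Rebuild stripe 1 after its device fails, from the survivors and the parity:
    assert xor_parity([data[0], data[2], parity]) == data[1]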
[ 007 ] Other data storage systems have been described in various patent specifications. The most relevant of these are now discussed.
[ 008 ] EP 0869436 discloses a data storage system having both rotating magnetic disks and a solid-state disk. The system stores data in a redundant array of rotating disks, while check data (e.g. parity data) is stored on the solid-state disk. The solid-state disk is used to prevent bottlenecks when reading and writing the check data.
[ 009 ] US 2002/0010875 discloses a data storage system utilising volatile solid-state memory devices (SDRAM) configured in a RAID-like structure. The structure disclosed is primarily intended to facilitate the hot-plugging of new or replacement memory cartridges.
[ 010 ] US 2001/0018728 discloses a data storage system having at least one hard disk and at least one solid-state disk. In a particular embodiment the system includes a host computer having access to a solid-state storage area and a hard disk. Data is transferred between the two types of storage such that the most frequently accessed data is stored in the solid-state storage area. The solid-state memory devices are described as being one of a variety of different types including volatile DRAM. The volatility of this type of memory is addressed by providing each solid-state disk with a battery back-up capability. The two or more solid-state disks are used in a cache-like capacity to speed up access to the data primarily stored on the hard disks.
[ 011 ] US 5499337 and US 6289471 disclose a data storage system where the check or parity data needed for a RAID-4 type system is stored on solid-state storage devices rather than hard disks. The increased speed of access of the solid-state storage device removes the bottleneck which would otherwise be experienced in writing new parity data to the hard disks.
[ 012 ] US 5680579 discloses a data storage system including a number of solid-state devices arranged according to a RAID-type configuration. The description identifies the solid-state memory as being of the flash type, and as such, it is non-volatile.
[ 013 ] It is not presently known, and the aforementioned prior art does not disclose, a data storage system in which data is primarily stored in solid-state memory in a redundant array and where non-volatile memory is additionally provided as back-up memory to the solid-state memory. Such a data storage system would offer the benefit of the speed and robustness of solid-state devices and would ensure that data is not permanently lost, by storing data in a redundant array and also allowing data to be stored on non-volatile memory, such as for example a magnetic hard disk drive, for back-up purposes. If the hard disk drive is only used to back up data from a redundant array of solid-state devices, then in normal operation the hard disk drive should not degrade the overall system performance. This identifies a need for a new type of system or method for digital data storage which overcomes or at least ameliorates the problems inherent in the prior art.
Disclosure Of Invention
[ 014 ] Broadly, the present invention provides a system or device incorporating a redundant array of solid-state storage devices which uses relatively high-speed volatile solid-state memory means, as opposed to relatively slow non-volatile hard disks, and can be used as a redundant storage array system with non-volatile hard disk back-up capability. This creates data redundancy and improves data throughput.
[ 015 ] In a broad form the present invention provides a data storage system including: a plurality of volatile solid-state storage devices arranged in a redundant array; one or more non-volatile back-up storage devices, the non-volatile back-up storage devices having data access and data transfer speeds less than the volatile solid-state storage devices; a system controller to control data transfers between the volatile solid-state storage devices and the one or more non-volatile back-up storage devices; and, a data input/output interface adapted to transfer data between the volatile solid-state storage devices and a host computer system or network.
[ 016 ] According to further aspects of an embodiment of the invention, the system additionally includes a user interface for user control and monitoring of the system, and/or one or more uninterrupted power supply modules to supply power to the volatile solid-state storage devices. Preferably, the one or more uninterrupted power supply modules are arranged in a redundant array. Also preferably, the one or more non-volatile back-up storage devices are arranged in a redundant array.
[ 017 ] In a further broad form the present invention provides a device for data storage, the device including: a plurality of data storage modules, each data storage module including at least one volatile memory module and at least one non-volatile memory module; at least one controller module; and, a data input/output module adapted to transfer data with a computer system or a network; whereby, at least some of the volatile memory modules of the plurality of data storage modules are arranged in a redundant array, with the volatile memory modules providing the primary means of data storage, and, the non-volatile memory modules providing back-up data storage of data in the volatile memory modules.
[ 018 ] In a particular embodiment, the redundant array is a RAID arrangement, or a combination of different RAID type arrangements. According to another aspect, the device includes one or more controller modules providing control of the volatile memory modules and one or more
controller modules providing control of the non-volatile memory modules. Preferably, the volatile memory modules and the non-volatile memory modules are arranged in separate RAID arrangements. According to another aspect, the non-volatile memory modules become the primary data storage means in the event of failure of the volatile memory modules.
[ 019 ] In a particular embodiment, in a single data storage module, data from the at least one volatile memory module is backed-up on the at least one non-volatile memory module. In a further particular embodiment, one or more of the data storage modules are specially reserved data storage modules. In still a further particular embodiment, one or more of the volatile memory modules within a data storage module are specially reserved volatile memory modules.
[ 020 ] In yet another particular embodiment, data is automatically transferred to a non-volatile memory module from a volatile memory module when power to the volatile memory module is interrupted or terminated, or data is transferred to the non-volatile memory modules from the volatile memory modules as a result of a user request received via the user interface. Preferably, data from any volatile memory module is able to be backed-up on any non-volatile memory module.
[ 021 ] In a further broad form the present invention provides a data storage component, for use as a data storage component in a data storage system or device, the data storage component including: a plurality of volatile memory modules; at least one non-volatile memory module; and, at least one connector port adapted to facilitate communication between the data storage component and a separate controller module of the data storage system or device; whereby, the data storage component is adapted to form part of a redundant array in the data storage system or device.
[ 022 ] In a particular form, the data storage component includes one or more ports adapted to facilitate communication between the data storage component and one or more separate controller modules of the data storage system or device.
[ 023 ] In another particular form, the data storage system or device is a rack mountable system or device and can accommodate a plurality of data storage components that are readily insertable/removable.
[ 024 ] In a further broad form the present invention provides a method of storing data in a redundant arrangement using a data storage device, the method including the steps of: transferring data to the data storage device via a data input/output module of the data storage device; storing the data in a plurality of data storage modules of the data storage device, each data storage module including at least one volatile memory module, and at least one non-volatile memory module; whereby, at least some of the volatile memory modules of the plurality of data storage modules are arranged in a redundant array, with the volatile memory modules providing the primary means of data storage, and, the non-volatile memory modules providing back-up data storage of data in the volatile memory modules.
[ 025 ] Preferably, the system (or device) for digital data storage is a rack mountable unit and can accommodate a plurality of solid-state storage device modules. Also preferably, the solid-state storage device modules form a redundant array when in use in the system for digital data storage. According to a further embodiment, the solid-state storage device module also includes a second connector port to a second separate controller module. In still a further possible embodiment, the solid-state storage device module also includes a fan unit.
[ 026 ] In a further broad embodiment of the invention there is provided a method for use in a system for digital data storage, the method providing data redundancy.
[ 027 ] Accordingly, the present invention seeks to provide these and other features or advantages by providing a system and/or a device with a redundant array of solid-state storage device modules which forms a system/device for digital data storage. The present invention also provides a method for creating redundancy in data and offering non-volatile back-up of data.
Brief Description Of The Figures
[ 028 ] The present invention should become apparent from the following description, which is given by way of example only, of a preferred but non-limiting embodiment thereof, described in connection with the accompanying figures, wherein:
Figure 1 broadly illustrates the main functional blocks of an embodiment of the present invention. Figure 2 broadly illustrates the mechanical construction of the overall system/device according to a particular embodiment.
Figure 3 provides an illustration of the functional interconnection between solid-state storage device modules.
Figure 4 illustrates a particular type of loop topology. Figure 5 illustrates a further particular type of switched topology.
Figure 6 illustrates a first version of controller module layout. Figure 7 illustrates a second version of controller module layout. Figure 8 shows a functional diagram of a specific embodiment of the present invention using a first back end version. Figure 9 shows a functional diagram of a specific embodiment of the present invention using a second back end version.
Figure 10 illustrates the broad mechanical construction of a solid-state storage device module according to a particular embodiment of the present invention. Figure 11 illustrates a solid-state storage device module block diagram according to a particular embodiment.
Figure 12 broadly illustrates how the solid-state storage device module controller is capable of running multi-ported into multiple LUNs.
Figure 13 illustrates a schematic of the solid-state storage device unit.
Figures 14-16 show schematics illustrating how the solid-state storage device unit can be configured in various modes.
Figures 17-24 illustrate schematics of further possible modes of the solid-state storage device unit.
Figure 25 illustrates a broad schematic of the power supply module according to a specific embodiment of the present invention.
Detailed Description Of Various Embodiments
Overview
[ 029 ] A redundant array of solid-state storage devices is embodied as a system as illustrated in figure 1. The redundant array of solid-state storage devices (RASSD) system 10 may incorporate any of the known disk-connecting topologies which create data redundancy and/or improve data throughput. In addition to the redundant data storage topology, the RASSD system 10 features a set of components and operational algorithms that enable use of volatile solid-state data storage media 11 and in particular RAM devices. Additional components can include: one or more redundant uninterrupted power supply modules (UPS) 12; one or more non-volatile back-up storage devices 13 that have slower data access and transfer speeds than the volatile solid-state storage devices 11, but that can retain data without power; one or more system controller modules 14 that monitor and control system components; one or more high speed system input/output modules 15 used for external data transfer connections 16 with computer systems or networks; and a user interface 18 for system control and monitoring.
[ 030 ] Any of the presently known RAID types or their combinations can be used in the RASSD system 10 for the purposes of creating data redundancy or increasing data throughput. In addition, the RASSD system 10 can include a reserved volatile storage device or devices 17. The reserved volatile storage devices 17 would stay idle for most of the time during operation of the system 10, until such time as one or more active storage devices 11 fail. The reserved devices 17, if in place, can then be used to recreate failed storage devices without the need to physically replace them. The presence of reserved volatile storage devices 17 allows for fully automatic data recovery routines to be implemented. Reserved non-volatile storage devices 19 are used in the same fashion as the reserved volatile storage devices 17.
[ 031 ] To protect data kept in the volatile storage devices 11 in situations when mains power is lost or temporarily interrupted, the RASSD system 10 utilises one or more UPS modules 12, which provide power from a secondary source (for example a rechargeable battery) until all data is backed-up onto a non-volatile storage device 13, or until the mains power is restored. If there is more than one UPS module 12, they can be connected in a way that provides power redundancy.
[ 032 ] In situations where mains power is lost, the RASSD system 10 can start backing up the data from the volatile solid-state storage devices 11 onto the non-volatile back-up storage devices 13. There may be more than one back-up non-volatile storage device 13, creating redundancy of the back-up media 13. After the mains power is restored, the data that was previously backed-up onto the non-volatile storage devices 13 can be transferred back to the volatile solid-state storage devices 11. Generally, in all other periods under normal circumstances, the back-up devices 13 would stay idle or be switched off, unless storage space not allocated to back-up functions was being used to provide an additional mechanical redundant array.
[ 033 ] The RASSD system 10 incorporates the controller module 14 which controls functional system components. The system controller module 14 monitors the status of the components in the system 10, receives signals and
messages from the components and from the external interfaces, and thereby controls the whole system 10 based on input information.
[ 034 ] The RASSD system 10 also includes a high speed system input/output module 15 which serves as the main interface between the system
10 and external devices, such as a computer system or network. Some of the types of interfaces that can be used for this purpose include SCSI, fibre channel, proprietary interfaces, etc.
[ 035 ] As is shown in figure 1, the RASSD system 10 incorporates an array of volatile solid-state storage devices (also referred to as volatile memory modules) 11 which are typically organised in such a way that data redundancy is created. One or more of the known RAID types or topologies, or combinations thereof, can be used for this purpose. In addition to the redundant array of volatile solid-state devices 11, any number of reserved volatile solid-state devices 17 can be provided for an automatic or manual data recovery mechanism. This process could be started immediately after any main redundant array 11 failure, thus avoiding the need to physically replace the failed devices. After the completion of such a recovery procedure, the previously reserved device or devices that now hold valid data images can assume the identity of the device or devices that failed.
[ 036 ] As a general description, on power-up, the RASSD system 10 can establish communications with the external computer system or network to which it is connected via the high speed system I/O interface module 15. The RASSD system 10 can check the contents of the non-volatile back-up storage devices 13 that should contain previously backed-up data, and prepare for copying the data onto the volatile array of solid-state devices 11. In the case that the contents of one of the back-up storage devices 13 are found to be invalid, the main system controller module 14 may attempt to restore the back-up storage device from another back-up storage device or give such a choice to the user or system operator. After completing this initial phase, the RASSD system 10 can restore information onto the volatile solid-state storage array 11 in one
uninterrupted effort, before enabling external I/O requests, or can enable external I/O requests and restore information gradually, as soon as the particular bit of data is requested by the externally connected computer system or network.
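A non-limiting sketch of the two restore strategies just described is given below; the block size, class and method names are assumptions made purely for illustration and do not form part of the specification:

    # Sketch of restore on power-up: either a full restore before external I/O is
    # enabled, or a gradual restore driven by host read requests.
    BLOCK_SIZE = 4096

    class VolatileArray:
        def __init__(self, backup_store, num_blocks):
            self.backup = backup_store              # block number -> backed-up data
            self.blocks = {}                        # contents of the volatile array
            self.restored = [False] * num_blocks    # per-block restore status

        def restore_all(self):
            """Strategy 1: restore everything in one uninterrupted effort."""
            for blk in range(len(self.restored)):
                self._restore(blk)

        def read(self, blk):
            """Strategy 2: restore a block lazily, as soon as it is requested."""
            if not self.restored[blk]:
                self._restore(blk)
            return self.blocks[blk]

        def _restore(self, blk):
            self.blocks[blk] = self.backup.get(blk, b"\x00" * BLOCK_SIZE)
            self.restored[blk] = True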
[ 037 ] In the event of mains power termination or failure, the RASSD system 10 would continue receiving power from the secondary source in the UPS module 12. The main system control module 14 can detect this situation and initiate a back-up procedure, during which all previously restored data blocks from the volatile array of solid-state devices 11 that were modified during the normal operation phase can be backed-up onto the non-volatile back-up storage devices 13 in corresponding locations. From this perspective, each data block on the volatile array of solid-state devices 11 has its own fixed position on one or more of the non-volatile back-up storage devices 13 within the system 10. Where there is more than one back-up storage device 13 and the same data block from the volatile solid-state storage devices 11 is stored in two (or more) different locations on two (or more) back-up storage devices 13, the back-up data is also redundant. During normal operation, the RASSD main system control module 14 can monitor the condition and status of all major components within the system 10 and provide relevant information to the user via the user interface 18.
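The back-up behaviour described above, in which only the blocks modified during normal operation are copied to their fixed locations on each back-up device, may be sketched as follows; the dirty-block set and the duplicate write to every back-up device are illustrative assumptions only:

    # Sketch of the back-up procedure on power failure.
    def backup_on_power_fail(volatile_blocks, dirty_blocks, backup_devices):
        """Copy modified blocks to their fixed locations on each back-up device.
        Writing the same block to two or more devices makes the back-up redundant."""
        for blk in sorted(dirty_blocks):
            for device in backup_devices:
                device[blk] = volatile_blocks[blk]   # same block number = same fixed location
        dirty_blocks.clear()

    # Example usage with two back-up devices (redundant back-up):
    volatile = {0: b"data0", 1: b"data1", 2: b"data2"}
    backups = [{}, {}]
    backup_on_power_fail(volatile, {1, 2}, backups)
    assert backups[0] == backups[1] == {1: b"data1", 2: b"data2"}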
II. Modules
[ 038 ] The following modules provide a more detailed outline of a particular embodiment of the present invention. The modules are intended to be merely illustrative and not limiting of the scope of the present invention.
IIA. Overview of Hardware
[ 039 ] Referring to figure 2, a device for data storage 20 is illustrated and includes a RAID system chassis designed to be rigidly mounted in a standard 19 inch rack, with front to rear airflow. It is constructed around a vertical mid-plane section 21 into which two RAID controllers 22 and 23 and
three power supply modules 24 can plug in from the rear. The "back end" consists of ten vertical solid-state storage device (SSD) modules 25 (i.e. data storage modules or data storage components) which plug in from the front of the device or unit 20, into the mid-plane 21. The SSD module 25 back ends are dual-ported, each port driven by a different controller 22 or 23. Each SSD module 25 contains a solid-state device (i.e. volatile memory module) and a magnetic disk drive (i.e. non-volatile memory module) and is presented as a "dual" LUN (Logical Unit Number - a unique identifier used on a SCSI bus to distinguish between devices that share the same bus) device, bound into two separate RAID sets (SSD RAIDset 26 and Disk RAIDset 27) by the RAID firmware, as illustrated in figure 3. Figure 3 shows an SSD module 25 as being associated with controller A (22) and controller B (23), which provide a disk RAIDset 27 and an SSD RAIDset 26. Each controller 22 or 23 contains two 2 Gbit FC "host ports" and is capable of driving the ten SSD modules 25 as fibre channel RAID back ends. The two 2 Gbit FC channels are used to drive the back ends. Depending on the type of controller used, the back end is organised either as a dual FC-AL loop (with five SSD modules per loop) or a higher performance switched star topology. These topologies are illustrated in figures 4 and 5.
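As a non-limiting illustration of the dual-LUN arrangement just described, the sketch below models ten dual-ported SSD modules, each contributing one member to an SSD RAIDset and one to a disk RAIDset; all identifiers are invented for illustration:

    # Toy model of the back end: ten dual-ported SSD modules, each presenting an
    # SSD LUN (volatile memory) and a disk LUN (magnetic disk), bound by the RAID
    # firmware into two separate RAIDsets.
    modules = [
        {"slot": i,
         "ssd_lun": f"ssd{i}",
         "disk_lun": f"disk{i}",
         "ports": ("controller_A", "controller_B")}    # each back end is dual-ported
        for i in range(10)
    ]

    ssd_raidset = [m["ssd_lun"] for m in modules]       # corresponds to SSD RAIDset 26
    disk_raidset = [m["disk_lun"] for m in modules]     # corresponds to Disk RAIDset 27
    print(ssd_raidset)
    print(disk_raidset)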
IIB. Controller
[ 040 ] Two controller variations are presented as various embodiments of modules of the present invention and are designated as back end versions 1 and 2. The major difference between the designs is in the way the back ends are handled. RAID controller version 1 (60) implements the dual FC-AL loop topology, whereby each loop drives five SSD units. RAID controller version 2 (70) supports a switched topology with a cross point switch. This topology eliminates the FC-AL arbitration, drives each one of the ten SSD modules 25 directly, and is expected to return better back end performance. General physical layout schematics of these controller versions 60 and 70 are illustrated in figures 6 and 7, respectively.
RAID Controller version 1
[ 041 ] The CPU is a 64 bit RISC architecture processor 61, running at
600 MHz internal clock rate. This processor supports 16 Kbyte primary (L1) caches and a very large 256 Kbyte secondary (L2) cache. The system interface is provided by the dual PCI bridge interface 62 and a memory controller supporting a standard, single 168 pin DIMM module (Dual In-line Memory Module), running at 133 MHz to provide approximately 1,000 Mbytes/sec of peak memory bandwidth. Reference is made to the schematic in figure 8.
[ 042 ] The controller 'host interface' 81 consists of two 2 Gbit FC channels, providing up to 400 Mbytes/sec of peak bandwidth. This is balanced by two 2 Gbit FC back end channels 82 providing the controller 62 with approximately 400 Mbytes/sec of peak end-to-end throughput from a dual loop
FC backend.
[ 043 ] It should be noted that the version 1 backend interface is a dual
FC-AL loop which splits the back plane into two segments, with five SSD modules per loop. The SSD module bypass (to provide FC loop continuity when these are unplugged) is maintained on the controller 62, enabling the design of the backplane to be passive. The average throughput for a single controller configuration (under R0 stripe, large transfers) is expected to be 300 Mbytes/sec or 150 Mbytes/sec per single host channel. Performance under RAID R3 or R5 (I/O per second) small transfers is not predictable as it involves the generation of parity and is dependent on the performance of the FC chips 63.
[ 044 ] The Parity Accelerator Module hardware 83 is optional and can support DMA (only) driven XOR functionality, expected to be useful only in R3 and under large transfers (special applications). In general, the generation of parity will be soft, i.e. under the control of controller software/firmware.
[ 045 ] The inter-controller "cache-mirroring" link is provided by a dual channel, 2 Gigabit FC interface. The two channels are point to point links, configured as target and initiator at each end of the link, to provide the
communication path required to synchronise (mirror) individual controller caches in Active-Active or Active-Passive dual-controller fail-over (high availability) configurations. It should be noted that the "high-availability" mirrored-cache dual controller configurations consume additional PCI bus and processor memory bandwidth, decreasing the overall system throughput. This decrease in system throughput is highly dependent on the profile of 'writes' from the host and cannot be accurately specified.
[ 046 ] The primary method of error detection during data transfers rests on the functionality of the hardware-assisted 'end to end' bus parity validation (i.e. from the FC interface chip), over the PCI bridge through to memory 84. Data transfers which generate parity errors can be acknowledged as 'bad' transfers to the host, in a manner identical to that found on magnetic disk drives. Host resident operating system drivers can 'retry' such failed operations (several times, including speed re-negotiation, etc.) and can abort the target device if the operation cannot be completed without errors.
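The host-side retry behaviour described above may be sketched as follows; the exception type, transfer callable and retry count are illustrative assumptions and are not prescribed by this specification:

    # Sketch of a host driver retrying a transfer acknowledged as 'bad' (parity error)
    # before aborting the target device.
    class ParityError(Exception):
        """Raised when a transfer is acknowledged as 'bad' due to a bus parity error."""

    def transfer_with_retries(do_transfer, max_retries=3):
        for attempt in range(max_retries):
            try:
                return do_transfer()                 # the actual FC/SCSI transfer
            except ParityError:
                continue                             # a real driver may also re-negotiate speed here
        raise RuntimeError("aborting target: operation could not be completed without errors")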
[ 047 ] Controller firmware can support serial 'console' communications, firmware downloads and can provide a simple character based GUI to be used for setting-up and run-time monitoring of the subsystem. An Ethernet interface can be supported with a plug-in ethernet module 64.
[ 048 ] In terms of architecture, controller version 2 (70) is identical to controller version 1 (60). The only difference is that the controller version 2 'backend' interface includes the crosspoint switch 71. This allows the two FC chips to be 'de-multiplexed' into ten directly connected FC backend channels, eliminating the need for the 'traditional' arbitrated (FC-AL) loop and the latency associated with the traffic passing through each SSD module on the loop. In addition, the Crosspoint Switch chip includes channel 'bypass' circuitry, enabling the backplane to be completely passive. This is illustrated in figure 9.
[ 049 ] The average throughput for a single controller configuration (under R0 stripe, large transfers) is expected to be better than 300 Mbytes/sec, i.e. 150 Mbytes/sec per host channel. All other performance related expectations are much the same as outlined in backend version 1.
IIC. SSD module
[ 050 ] Mechanical construction of the SSD Module 25 is preferably, but not necessarily, suitable for mounting in a 3U (5.25 inch) RAID System Chassis. The SSD Module 25 contains a controller and supports up to eight industry-standard DIMMs 100, plus one SCSI magnetic disk 101. The mechanical construction is outlined in figure 10.
[ 051 ] The architecture can support two PCI busses generated by the
PCI Bridge system controller. It includes a 64 bit RISC processor with 256 Kbytes of L2 cache. Reference is made to figure 11.
[ 052 ] The SSD memory is provided by up to eight vertical DIMM sockets. Each DIMM can provide a maximum of two Gigabytes of memory, for a total of 16 Gigabytes. All DIMMs should be of the 'registered' or 'buffered' type and the DIMM height is limited mechanically to suit the overall module dimensions. On power-up, the system firmware can perform simple diagnostics and automatically size the available memory.
[ 053 ] The SSD Module controller is capable of running multi-ported into multiple LUNs, as illustrated in figure 12. Two host channels are provided and the dual-ported firmware will 'point' both of the FC Host channels to each LUN. This dual-porting is 'coherent' internally, i.e. each host port may read/write independently and there is no corruption of data due to the 'simultaneous' or 'overlapped' writes from the above host ports to the same area of the available formatted storage.
[ 054 ] The SSD LUN represents the allocated capacity provided by the available DIMMs, to the maximum of 16 Gbyte (i.e. 2 Gbyte per DIMM),
minus the required run-time program and disk buffer memory storage requirement. This hardware/firmware design is based on bank-switching technology, whereby each host port is presented with the total view of the available DIMM storage for unrestricted DMA (direct memory access) data movement purposes.
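By way of a non-limiting arithmetic illustration of the SSD LUN capacity described above (the run-time program and disk buffer reservation shown is an assumed figure, not taken from this specification):

    # Illustrative capacity calculation for the SSD LUN.
    DIMM_COUNT = 8
    DIMM_CAPACITY = 2 * 2 ** 30                  # 2 Gbyte per DIMM, to a maximum of 16 Gbyte
    RUNTIME_RESERVATION = 256 * 2 ** 20          # assumed program + disk buffer requirement

    ssd_lun_capacity = DIMM_COUNT * DIMM_CAPACITY - RUNTIME_RESERVATION
    print(ssd_lun_capacity / 2 ** 30, "Gbyte presented as the SSD LUN (under these assumptions)")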
[ 055 ] The primary method of internal error detection during data transfers rests on the functionality of the hardware-assisted 'end to end' validation of parity, i.e. from the FC interface and over the PCI bridge. The SDRAM memory is provided with ECC protection, as available on the PCI Bridge. These are the internal methods of error detection and correction for the complete SSD Module.
[ 056 ] Data transfers which generate internal data bus parity errors or uncorrectable ECC errors will cause unrecoverable data errors on the SSD LUN. Such data transfers can be acknowledged as 'bad' transfers to the host, in a manner identical to that found on magnetic disk drives. Host resident operating system drivers can 'retry' such failed operations and can abort the target device (i.e. the particular LUN device) if the particular operation cannot be completed without errors.
[ 057 ] The Management LUN is a specialised small LUN, which is used for 'in-band' management and statistics by the associated RAID controllers. The SSD Shadow LUN serves as a magnetic disk backup for the SSD LUN and is exactly the same in size and is configured from a portion of the available magnetic storage. It is envisioned to be a 'read-only' LUN. The Magnetic Disk LUN is the allocated portion of the magnetic storage (minus the SSD Shadow LUNs), as available from the on-board SCSI disk drive. The design can provide two non-standard specialised host channel commands (e.g. the equivalent of the disk 'spin-down' and the 'spin-up' commands) to initiate an 'unconditional' backup of the SSD LUN data to the Shadow LUN and to move the Shadow LUN data to the SSD LUN.
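A non-limiting sketch of the two specialised host channel commands referred to above (the equivalents of the disk 'spin-down' and 'spin-up' commands) is given below; the command names and the representation of the LUNs as dictionaries are assumptions made for illustration:

    # Sketch of mapping the two specialised commands to unconditional copies
    # between the SSD LUN and the Shadow LUN.
    def handle_special_command(command, ssd_lun, shadow_lun):
        if command == "SPIN_DOWN_EQUIVALENT":        # unconditional backup of the SSD LUN
            shadow_lun.clear()
            shadow_lun.update(ssd_lun)               # block-for-block copy to the Shadow LUN
        elif command == "SPIN_UP_EQUIVALENT":        # move the Shadow LUN data back
            ssd_lun.clear()
            ssd_lun.update(shadow_lun)
        else:
            raise ValueError("unsupported command: " + command)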
[ 058 ] System firmware can be designed to perform a 'rapid' restore from power-up. In this scenario, the SSD is available as soon as the system is powered up (pending some delay incurred by diagnostics) and all writes are logged to the magnetic SSD Shadow LUN, until such time that the data in the SSD LUN is completely restored, i.e. at the point in time that the SSD LUN becomes a 'mirror' of the SSD Shadow LUN.
[ 059 ] The design preferably uses the CPU and the PCI Bridge/system controller running the SDRAM memory at 66 MHz. This provides 524 Mbytes/sec of peak memory bandwidth. Dual 64 bit PCI bus structure is implemented with the primary (host) bus running at 66 MHz (i.e. approx. 500
Mbyte/sec bandwidth) and the secondary bus is also set to 66 MHz.
[ 060 ] Each FC channel should support 1 or 2 Gigabits (approx. 100 or 200 Mbyte) of peak transfer rates and should result in 75 or 150 Mbytes/sec of average throughput, respectively, under large (e.g. 1 Mbyte) transfers. The PCI bus is expected to deliver approximately 400 Mbytes/sec of average throughput and hence it is capable of accommodating the two FC channels running at 2 Gbit speed.
[ 061 ] The available SDRAM memory is presented as the SSD LUN and the single SCSI magnetic disk is divided into the Shadow LUN (identical in size to the SSD LUN) and the Magnetic Disk LUN, which is the available remainder of the SCSI disk. The Management LUN is an in-band management and control tool through which the SSD Unit is managed via external (host-resident) software. The Shadow LUN is not visible to the external world under normal (non-diagnostic) operations and its functionality is limited to internal transfers for data backup and restore functions to and from the SSD LUN. It is controlled with a set of commands, passed over the FC ports or the serial console.
[ 062 ] The SSD Unit could be configured in three different configurations, as shown in figures 14 to 16. As an example, the
configurations are recognised and accessible for read and write operations under the Linux operating system. The present invention should not be considered to be limited to these configurations or this operating system.
[ 063 ] Any one or all of the three LUNs (i.e. the SSD LUN, the Shadow LUN or the Magnetic LUN) may be 'logically' attached to either one of the host ports.
[ 064 ] Either the SSD LUN or the Magnetic LUN may be 'logically' attached to either one of the host ports, for 'single-ported' configurations. The Shadow LUN may be deleted from the configuration but can remain available for the backup/restore functions of the SSD LUN.
[ 065 ] The SSD Unit may be configured to a 'dual-ported' operation whereby each port can access either the SSD LUN, the Shadow LUN or the Magnetic LUN. The shadow LUN may be deleted from the configuration but can remain available for the backup/restore functions of the SSD LUN.
[ 066 ] Either host is able to concurrently read/write to the SSD LUN and the Magnetic LUN and the writes are 'coherent' internally, i.e. there is no corruption of data due to the 'overlapped' reads or writes to the same location, from either host.
[ 067] In addition, any of the 3 LUNs mentioned above can be configured by the user as multiple LUNs.
Shadow LUN Configurations
[ 068 ] The two features outlined below describe operational configurations of the SSD Unit, involving the Shadow LUN. These are operational configurations, set up through the serial port and require a reboot of the SSD Module. The associated data transfer traffic between the SSD LUN and the Shadow LUN is internal to the SSD Module.
Auto-Restore on Power Up
[ 069 ] The firmware can provide an 'auto-restore on power-up' mode on the SSD Module. This involves power cycling of the SSD Module. The power-on cycle of the SSD Module can cause the Shadow LUN to be copied to the SSD LUN. Some form of busy/ready/done status can be provided. This demonstrates that a repetitive execution of the above procedure is reliable and that the data on the two LUNs remains equal after each operation of the command.
Auto-Backup on Power Failure
[ 070 ] An 'auto-backup on power-failure' mode can also be enabled on the SSD Module. This involves power cycling of the SSD Module. The power-down cycle of the SSD Module can be simulated via a manually generated Power Fail Interrupt, which causes the SSD LUN to be copied to the Shadow LUN. Some form of busy/ready/done status can be provided. This demonstrates that a repetitive execution of the above procedure is reliable and that the data on the two LUNs remains equal after each operation of the command.
Shadow LUN Backup/Restore Commands
[ 071 ] These commands direct the Shadow LUN to 'unconditionally' back up and to restore a 'static' (i.e. not available for read/write operations) SSD LUN. The associated data transfer traffic between the SSD LUN and the Shadow LUN will be internal to the SSD Module. The commands are issued via the particular host channel or the serial port and initiate an 'unconditional' and self-completing block for block copy of the SSD LUN to the Shadow LUN and vice-versa. During the operation, both LUNs become unavailable for any other operation. Some form of busy/ready/done status can be provided on both LUNs. This demonstrates that a repetitive execution of the above procedure is reliable and that the data on the two LUNs remains equal after each operation of the command.
SSD LUN Incremental Data Backup Mode
[ 072 ] In an incremental data backup mode, all writes from the host port (i.e. changes to the SSD LUN data) are continuously tracked (logged) by the firmware and are scheduled to be written out to the Shadow LUN as a background task, at a rate set by the 'backup rate' variable. This demonstrates the functionality of the 'incremental data backup' mode and the associated 'backup rate' variable. The 'incremental data backup' mode is enabled through a command via the serial port and the SSD Unit must be rebooted. The 'backup rate' variable is adjustable (serial port command) while the SSD LUN is operational.
[ 073 ] A 'write' can be used for the incremental data backup operation and the 'backup rate' variable can be adjusted to allow the data on the Shadow LUN to synchronise with the SSD LUN, i.e. at that point the data on the Shadow LUN should match the data on the SSD LUN.
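A non-limiting sketch of the 'incremental data backup' mode described above is given below; the class structure, the units of the 'backup rate' variable and the one-second drain interval are assumptions made for illustration only:

    # Sketch of logging host writes and draining them to the Shadow LUN
    # as a background task at the configured 'backup rate'.
    import collections
    import time

    class IncrementalBackup:
        def __init__(self, ssd_lun, shadow_lun, backup_rate_blocks_per_second):
            self.ssd = ssd_lun
            self.shadow = shadow_lun
            self.backup_rate = backup_rate_blocks_per_second   # adjustable while operational
            self.log = collections.deque()                     # logged (modified) block numbers

        def host_write(self, blk, data):
            self.ssd[blk] = data
            self.log.append(blk)                               # track the change for later backup

        def background_step(self):
            """Drain up to 'backup_rate' logged blocks to the Shadow LUN, then wait a second."""
            for _ in range(min(self.backup_rate, len(self.log))):
                blk = self.log.popleft()
                self.shadow[blk] = self.ssd[blk]
            time.sleep(1.0)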
The SSD LUN 'Quick-Availability' Configuration
[ 074 ] This is an operational mode, enabled on the SSD Module as a configuration procedure using the serial port; a reboot would be required. Some form of 'restore rate' priority and busy/done status can be provided. When configured, it enables the SSD LUN to become "increasingly more-available" over a period of time, while it is being restored from the Shadow LUN, following a 'power-up' condition of the SSD Module. This functionality is accomplished by directing all host 'writes' to the SSD LUN and all of the 'un-restored' reads to the Shadow LUN, while the data is being progressively restored from the Shadow LUN to the SSD LUN, in the background and at a certain preset rate. In this manner, the host data traffic can switch over entirely to the SSD LUN access (i.e. the high speed operational mode), when the above 'restore' process is complete.
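The routing of host reads and writes during the 'quick-availability' restore described above may be sketched as follows; the data structures and method names are illustrative assumptions and do not form part of the specification:

    # Sketch of 'quick-availability': writes go to the SSD LUN, reads of blocks not
    # yet restored are served from the Shadow LUN, and a background task restores
    # blocks at the preset rate.
    class QuickAvailability:
        def __init__(self, ssd_lun, shadow_lun, num_blocks):
            self.ssd = ssd_lun
            self.shadow = shadow_lun
            self.restored = [False] * num_blocks

        def host_write(self, blk, data):
            self.ssd[blk] = data
            self.restored[blk] = True            # a new host write supersedes the shadow copy

        def host_read(self, blk):
            return self.ssd[blk] if self.restored[blk] else self.shadow[blk]

        def restore_one(self, blk):
            """One step of the background restore from the Shadow LUN to the SSD LUN."""
            if not self.restored[blk]:
                self.ssd[blk] = self.shadow.get(blk, b"")
                self.restored[blk] = True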
Transaction Throughput (IO transactions/second)
[ 075 ] The IO per second figure cannot be quoted accurately as the design of the SSD Module does not control the HBA/OS latency nor the performance of the Qlogic FC chips. The most important parameter here is the 'latency' associated with the decode of the SCSI 'read' and 'write' command (i.e. the SCSI CDB block) and the resulting transfer of data by the target. In the case of a 'read' command, this is defined as the elapsed time between the placement of the command by the initiator (i.e. the HBA) on to the FC bus and the return of the first byte of requested data on to the FC bus.
This latency added to the actual transfer of data determines the transactional performance of the system, measured in IO transactions/second.
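By way of a non-limiting numeric illustration of the relationship just described (latency plus transfer time determines the transaction rate), and noting that the specification states the true figure cannot be quoted accurately, the latency and transfer size below are invented values:

    # Illustrative calculation only: IO/s = 1 / (latency + transfer time).
    latency_s = 100e-6                      # assumed command decode / first-byte latency
    transfer_bytes = 4096                   # assumed small transfer size
    throughput_bytes_per_s = 150e6          # 150 Mbytes/sec average per host channel (from the text)

    transfer_time_s = transfer_bytes / throughput_bytes_per_s
    io_per_second = 1.0 / (latency_s + transfer_time_s)
    print(round(io_per_second), "IO transactions/second under these assumptions")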
Basic RAID Enclosure Features
[ 076 ] When the RAID Enclosure is fully loaded, i.e. it contains ten
SSD Units, two RAID Controllers and three Power Supply modules, the following functionality should be achieved, with reference to figure 17.
- Simple power-on diagnostics and the ability of the RAID Controller to auto-size the available backends, i.e. the size and the number of available SSD LUNs and the Magnetic Disk LUNs, internal RAID Controller configuration plus other related system-level parameters.
- The ability to download the run-time firmware using the serial port and the host channel, with demonstration of the serial console command protocol.
- The ability to access the Management LUN.
- The ability to configure a varying number of SSD LUNs into a single RAID 0 (stripe) "SSD RAIDset".
- The action of creating the "SSD RAIDset" can automatically invoke the creation of the associated internal "Shadow RAIDset" (of equivalent size and type) from the available magnetic disk storage.
- The ability to bind the Magnetic Disk LUN space into a separate RAID 0 "Magnetic Disk RAIDset".
- The ability to configure the SSD RAIDset and the Magnetic Disk RAIDset to support RAID Levels 0, 3 and 5 across a varying number of SSD Units, with or without spares, to the maximum of nine SSD Units plus one spare.
- The SSD RAIDset and the Magnetic Disk RAIDset are 'driver transparent' 'magnetic disk' devices under the operating system. This involves read/write/verify operations to one or two RAIDsets under the operating system.
The Shadow RAIDset Functionality
[ 077 ] The SSD RAIDset can be configured as a RAID 3, in an (8 + 1 + Spare) configuration and mirrored (internally within the SSD Module) with the Shadow RAIDset. This configuration is set up through the serial port, followed by a system reboot. The 'auto-restore on power-up' cycle can cause the Shadow RAIDset to be copied to the SSD RAIDset (internally within the SSD modules). The SSD RAIDset can be in the 'spun-down' (busy) status until it becomes fully restored, at which point it should become 'ready'. The serial port can initiate an 'unconditional' and self-completing copy of the Shadow RAIDset to the SSD RAIDset, and vice-versa.
[ 078 ] The 'incremental data backup' mode can be enabled through a command via the serial port and the RAID enclosure is rebooted. In this mode, all writes from the host port (i.e. changes to the SSD RAIDset data) are continuously tracked (logged) by the firmware and scheduled to be written out to the Shadow RAIDset, as dictated by the 'backup rate' variable. The 'backup rate' variable is adjustable via the serial port command while the enclosure is operational.
[ 079 ] This requires a read/write/verify test program to be executed by the Linux host. The SSD RAIDset is configured to R0. A 'write' can be executed on the host to demonstrate the incremental data backup operation. The 'backup rate' variable can be set to 'low rate' (via the serial port), causing the content of the Shadow RAIDset to lag the content on the SSD RAIDset, and the incremental backup system can be demonstrated to be reliable. The 'backup rate' can be increased (via the serial port) to allow the data on the Shadow RAIDset to synchronise with the SSD RAIDset. At that point the write on the host can be terminated and the data on the two RAIDsets should be equal.
The SSD RAIDset 'Quick-Availability' Configuration
[ 080 ] The 'quick availability' of the SSD RAIDset is an operational mode, enabled on the RAID enclosure as a configuration using the serial port. The SSD RAIDset can be configured as a RAID 3 and mirrored with the Shadow RAIDset, in an 8 + 1 + Spare configuration. This configuration is set up through the serial port, followed by a system reboot. This mode is mutually exclusive with the 'auto-restore on power-up' mode, described previously.
[ 081] When configured, this mode enables the SSD RAIDset to become
"increasingly more-available" over a period of time, while it is being restored from the Shadow RAIDset, following a 'power-up' condition of the RAID enclosure. Speed of the restoration is controlled by the 'restore rate' variable, which may be adjusted (via the serial port) at run time.
Removable SSD Unit Test
[ 082 ] The system can be configured for 'quick availability' with
'incremental data backup' and 'auto-backup on power-failure'. The RAID Enclosure is configured as shown in figure 18.
Removable Disk Drive
[ 083 ] The RAID Enclosure can contain ten SSD Units and a single controller. The system can be configured for 'quick availability' with 'incremental data backup' and 'auto-backup on power-failure'. The RAID Enclosure can be configured as shown in figure 19.
[ 084 ] The SSD RAIDset and the Magnetic Disk RAIDset could be configured respectively in the following four RAID combinations: (a) a dual R3 configuration, (b) a dual R5 configuration, (c) an R3 and R5 configuration, and (d) an R5 and R3 configuration. The Shadow RAIDset can equal the SSD RAIDset in configuration and size. The host port is able to read/write to the SSD RAIDset and the Magnetic RAIDset.
[ 085 ] A disk drive in one of the SSD Units forming the RAIDsets can be hot-unplugged from the chassis to:
Demonstrate that the SSD RAIDset continues to be functional and that the transfer of data to and from the host is unaffected. However, the transfers may be slower due to the data reconstruction process involving the Magnetic and Shadow RAIDsets;
Demonstrate that the Magnetic RAIDset continues to be functional and that the transfer of data to and from the host is unaffected; and
Demonstrate that the controller can engage the disk drive located in the 'spare' SSD Unit (chassis 'spare' position) in a data 'reconstruct' process, the completion of which may take some time.
[ 086 ] The unplugged disk drive could be reinstated into the SSD Unit and it could be demonstrated that: The Controller can treat the re-established disk drive as a 'new spare'.
It should be noted that the 'new spare' will not become functional until the data reconstruction process involving the 'old spare' is completed; and
After the above data reconstruction is complete, it could be demonstrated that the controller may be directed to 're-order' the Shadow RAIDset and the Magnetic RAIDset to the original configuration and that a
'spare' can be re-established in the RAID chassis slot labelled 'spare'.
[ 087 ] While the system is operational with the disk drive removed and the Magnetic/Shadow RAIDsets undergoing a 'data reconstruct' process, it could be demonstrated:
That the 'quick availability' and the 'auto-backup on power-failure' functions involving the Shadow RAIDset perform correctly. This would require power cycling of the RAID Enclosure; and
That the 'incremental data backup' feature performs correctly while the Shadow RAIDset is missing a LUN and while the data is being reconstructed on to the spare drive. Various configurations of the system are possible.
Some of these configurations are illustrated in figures 20 to 24, and are briefly discussed.
Single Controller Configurations with Dual Hosts (figure 20)
[ 088 ] Both RAIDsets can be configured to RAID Level 3. The RAID Enclosure can be configured for 'quick availability' with 'incremental data backup' and 'auto-backup on power-failure' options.
Single Controller Configurations with Dual Hosts (figure 21)
[ 089 ] Both RAIDsets can be re-configured as 'dual ported' RAID Level 3 configurations, as shown in figure 21.
Dual Host, Dual Controller, Single-Ported Configuration (figure 22)
[ 090 ] The three available LUNs can be demonstrated to run "single-ported", as shown in figure 22.
[ 091 ] Both RAIDsets can be configured to RAID Level 5. It can be demonstrated that one host will be able to read/write to the SSD RAIDset, while the other host port is able to perform independent read/write operations to the Magnetic RAIDset.
Dual Host, Dual Controllers, Dual-Ported Operation (figure 23)
[ 092 ] The RAID Enclosure can contain ten SSD Units and two controllers. The system can be configured for 'quick availability' with 'incremental data backup' and 'auto-backup on power-failure'. The RAID Enclosure can be configured as shown in figure 23.
Dual Host, Dual Controller, Quad-Ported Configuration (figure 24)
[ 093 ] The RAID Enclosure can contain ten SSD Units and two controllers. The system can be configured for 'quick availability' with 'incremental data backup' and 'auto-backup on power-failure'. The RAID Enclosure can be configured as shown in figure 24.
[ 094 ] Three current-sharing power supply modules can hot-plug from the rear. Each power supply module preferably contains an AC inlet/filter module and four fan units and supports power factor correction (PFC), in addition to a battery system with sufficient power for the system to back up the ten SSD modules and to shut down the controllers in an orderly fashion. This is illustrated in figure 25.
[ 095 ] The design should tolerate the failure of one power supply module and each module should deliver 50% of the required load.
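The power redundancy requirement stated above may be checked with the following non-limiting sketch, which uses only the figures given in the preceding paragraphs:

    # Check that losing any one of the three current-sharing supplies still covers the load.
    MODULES_INSTALLED = 3
    MODULE_CAPACITY_FRACTION = 0.5          # each module delivers 50% of the required load

    remaining_capacity = (MODULES_INSTALLED - 1) * MODULE_CAPACITY_FRACTION
    assert remaining_capacity >= 1.0        # the design tolerates the failure of one module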
[ 096 ] Thus, there has been provided in accordance with the present invention, a redundant array of solid-state storage device modules and data storage system which satisfies the advantages set forth above.
[ 097 ] The invention may also be said broadly to consist in the parts, elements and features referred to or indicated in the specification of the application, individually or collectively, in any or all combinations of two or more of said parts, elements or features, and where specific integers are mentioned herein which have known equivalents in the art to which the invention relates, such known equivalents are deemed to be incorporated herein as if individually set forth.
[ 098 ] Although the preferred embodiment has been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein by one of ordinary skill in the art without departing from the scope of the present invention as herein described.