GB2446177A - Data storage system - Google Patents

Data storage system

Info

Publication number
GB2446177A
Authority
GB
United Kingdom
Prior art keywords
data storage
data
storage unit
installation according
installation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB0702690A
Other versions
GB0702690D0 (en)
Inventor
Katherine Bean
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB0702690A priority Critical patent/GB2446177A/en
Publication of GB0702690D0 publication Critical patent/GB0702690D0/en
Publication of GB2446177A publication Critical patent/GB2446177A/en
Withdrawn legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G06F3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0635 Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2002 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant
    • G06F11/2007 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media
    • G06F11/201 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where interconnections or communication control functionality are redundant using redundant communication media between storage system components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • G06F2003/0692
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data storage installation (12, Fig. 1) capable of expansion and fault tolerance is disclosed. The installation comprises a plurality of data storage units 14 interconnected by a data transport means 16, such as an Ethernet network. Each data storage unit comprises a controller 22 and one or more data storage devices 20, such as hard disc drives. The controller includes a memory in which is stored metadata that identifies data stored on the storage devices of all data storage units in the installation. Upon receipt of a request for data, the controller consults the memory to determine which data storage unit contains the requested data. The controller can then either fulfil the request for data or pass the request on to the data storage unit on which the data is stored. The installation can implement a storage area network (SAN) or network attached storage (NAS). The controller may write data to the storage devices in a format analogous to RAID 5, in which data is stored along with parity metadata.

Description

Data storage system
This invention relates to a data storage system. In particular, it relates to a system for efficient and reliable storage of data that is highly scalable and fault-tolerant.
Computer systems are being called upon to store ever greater amounts of data. This applies not only to newly-developed systems, but also to existing systems. In the latter case, it is highly advantageous to be able to expand capacity for storage of data in a manner that does not change the logical structure of the data. This allows data storage capacity to be increased without substantial alteration to the software that is accessing the data. Although ever-larger storage devices are becoming available, this addresses just a small minority of the issues arising from larger data storage requirements. It is inevitable that as the size of data storage devices increases, so does the consequence of loss of any such device.
To address some of these problems, various schemes have been proposed by which data can be stored on multiple physical devices, while the logical structure of the data is that of a single large volume. Use of multiple physical devices to mimic a single logical volume can, with suitable arrangement, provide data storage with greater performance or greater reliability than a single large physical device. So-called "redundant arrays of inexpensive (or independent) drives" (RAID arrays) have gained popularity. RAID arrays can have several configurations depending upon whether the aim is to optimise the array for speed or for reliability. The data may be duplicated on several devices for reliability, it may be spread over several devices for performance, or both, and measures such as parity may be used to provide a check on data integrity.
A problem with RAID arrays is that many configurations require all of the physical devices within the array to be identical, or to have particular properties, such as drive geometries, in common. Given the speed at which physical devices are being developed, devices may not be available to replace those in a RAID array that has been in use for some time. Moreover, every RAID array has only limited scalability, capacity and performance, so it cannot keep pace as a computer system expands.
An aim of this invention is to provide a data storage system that can support very large logical volumes which can be accessed with very high data transfer rates. Moreover, the aim is to provide a system in which a logical volume can grow significantly after its initial implementation.
Another aim of the invention is to improve data integrity. This has been achieved in several ways which the following description will make abundantly clear.
A further aim of the invention is to reduce the cost and environmental impact, such as power consumption, of implementing a data storage system that provides a logical volume of a given size.
With this aim in mind, from a first aspect the invention provides a data storage installation comprising a plurality of data storage units interconnected by a data transport means, in which each data storage unit comprises a controller and one or more data storage devices, each controller includes a memory in which is stored metadata that identifies data stored on the storage devices of all data storage units in the installation, whereby upon receipt of a request for data, the controller consults the memory to determine which data storage unit contains the requested data and either fulfils the request for data or passes the request on to the data storage unit on which the data is stored.
From the point of view of a host external to the data storage installation, the location of any piece of data is not apparent. Rather than appearing to be a large number of storage devices, the installation operates as a single storage unit.
In typical embodiments, each data storage unit includes multiple data storage devices, which may be hard disc drives (although in future embodiments these may be replaced with solid-state devices or other data storage devices yet to be developed). For example, the controller may be configured to write data to the data storage devices in a format that is analogous to RAID 3 or RAID 5, in which data is stored along with parity metadata, the latter being preferred because the parity metadata is distributed throughout the storage devices. This allows the data storage unit to continue to operate in the event of failure of one of the storage devices.
Installations embodying the invention may implement network attached storage (NAS) or a storage area network (SAN). The difference between these arrangements lies in the manner in which a request for a file is mapped onto sectors in the storage installation.
With a SAN, the host (or several hosts operating collaboratively) making a request for a file is responsible for mapping files onto sectors within a storage volume. Therefore, embodiments that implement a SAN supply to a host data that is addressed by an identifier of a block, typically a block number, and store data received from a host at a location identified by a block. Thus, an embodiment suitable for use with a SAN requires the data storage units to present blocks of data to a host.
In contrast, in a NAS system, it is the storage that does the mapping of files onto sectors. To this end, in a NAS environment the data storage units themselves maintain the file system and the corresponding mapping, so they can present as many logical volumes as have been configured while managing them. Therefore, those embodiments that implement a NAS system supply to a host data that is addressed by an identifier of a file, for example by file name. Thus, an embodiment of the invention for use in a NAS system presents files and folders. Multiple hosts could address the same data storage unit at once. In such embodiments, data belonging to a file may be held within one or more than one data storage unit. Most typically, each file is owned by exactly one data storage unit. That is, there is exactly one data storage unit that holds authoritative metadata relating to a file (although that data storage unit may cause the metadata to be mirrored or otherwise copied to reduce the likelihood of its loss). Metadata may be copied to other data storage units to increase speed of access. However, such copies are not authoritative, and may subsequently be changed by the data storage unit that stores the authoritative metadata.
Embodiments of the invention will now be described in detail, by way of example, and with reference to the accompanying drawings, in which: Figure 1 shows a data server being a simple embodiment of the invention; Figure 2 is a diagrammatic illustration of a data storage unit being a component of the data server of Figure 1; Figure 3 shows a data server that supports several hosts, being a further embodiment of the invention; Figure 4 is a diagrammatic illustration of a data storage unit that can implement mirroring of data; Figure 5 shows a data storage installation that incorporates data storage units as illustrated in Figure 4; and Figure 6 is an overview of an embodiment of the invention that uses mirroring.
With reference first to Figure 1, there is shown a simple embodiment of the invention. The embodiment provides a storage installation 12 that can be used by a host system 10 to store and retrieve data. In this embodiment, the storage installation 12 implements a block-addressed system. Thus, the file system is maintained by the host, which treats the storage installation as a single block I/O storage device.
As shown in Figure 2, the data storage installation 12 comprises a plurality (in this case, six) of data storage units 14. Each of the data storage units 14 comprises a controller 22 and five hard disc drives 20, each of which constitutes a data storage device. The controller 22 has an internal interface 24 to each of the hard drives 20. The internal interface can be a conventional hard disc interface, such as ATA, SATA, or SCSI. The controller 22 also has an external network interface, which, in this embodiment, is a high-speed Ethernet network interface. Each of the data storage units 14 is connected by its respective external interface to a common network 16 such that the units can exchange data with one another. The controller 22 also has a host interface 26, which may also be a high-speed Ethernet interface. The host interface 26 allows the host 10 to connect to and exchange data with one of the data storage units 14, using the iSCSI protocol, as set forth in RFC 3720 published by the Internet Engineering Task Force.
The protocol used on the common network is applied only within the data storage installation 12; it is not an external protocol, so it may be specific to embodiments of the invention rather than being in accordance with an accepted standard. Suitable protocols can be devised to optimise the performance of a particular implementation. Alternatively, an existing protocol, such as iSCSI, may be used on the common network 16.
Internally, the data storage unit can arrange data on its discs 20 in substantially any way that can support block addressing. One preferred arrangement is analogous to that used by RAID 5, in which data is secured by storing parity information distributed throughout the discs of the unit.
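By way of illustration only, the following Python sketch shows the principle of such a RAID 5-like arrangement: a stripe of data chunks protected by an XOR parity chunk whose position rotates from stripe to stripe, so that parity is distributed across the discs and the chunk of a failed disc can be rebuilt from the survivors. The names, the chunk size and the five-disc layout are assumptions made for the example; the sketch is not the controller's actual implementation.

```python
# Minimal sketch (assumptions: five discs, fixed 4 KiB chunks, illustrative
# names) of a RAID 5-like layout with rotating XOR parity.
from typing import List

NUM_DISCS = 5
CHUNK = 4096  # bytes per chunk (illustrative value)


def xor_chunks(chunks: List[bytes]) -> bytes:
    """XOR a set of equal-length chunks together (used for parity and rebuild)."""
    out = bytearray(CHUNK)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def place_stripe(stripe_no: int, data_chunks: List[bytes]) -> List[bytes]:
    """Return the chunks of one stripe in disc order, with the parity chunk
    placed on a different disc for each stripe (distributing the parity)."""
    assert len(data_chunks) == NUM_DISCS - 1
    parity_disc = stripe_no % NUM_DISCS
    stripe = list(data_chunks)
    stripe.insert(parity_disc, xor_chunks(data_chunks))
    return stripe


def rebuild_chunk(stripe: List[bytes], failed_disc: int) -> bytes:
    """Reconstruct the chunk held by a failed disc from the surviving chunks,
    which is what allows the unit to continue operating after a disc failure."""
    survivors = [c for i, c in enumerate(stripe) if i != failed_disc]
    return xor_chunks(survivors)
```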
A simple block-addressed implementation is configured to present on the host interface 26 the appearance of a volume. The volume is divided into a number of fixed-length blocks that can be addressed by a single number from 0 up to some maximum value b_max that defines the capacity of the volume. If the implementation includes n data storage units 14 and the data storage units have capacities c_0, c_1, ..., c_(n-1), then b_max = c_0 + c_1 + ... + c_(n-1), the sum of the individual capacities. Each data storage unit 14 advertises its storage capacity on the common network 16 and this information is received by all of the other data storage units 14. An algorithm is then applied to determine an unambiguous order for the storage units. This could be hard-coded or manually programmed, but greater flexibility can be achieved if the order can be determined automatically. For example, the order may be the ascending numerical order of the MAC addresses of the external interfaces (which information is, of course, also broadcast on the common network 16).
Assume now that the installation 12 is in an unconfigured state and has just been powered on.
Each data storage unit 14 broadcasts on the common network the number of blocks that it can store (and its order, if appropriate), and this information is received by each of the units.
The controller 22 of each unit records this information in an internal memory. Once each unit 14 has received this information from every other unit in the installation 12, it can then create a lookup table to identify the location of every block of data within the installation.
Thus, the contents of the lookup table will indicate that blocks 0 to c_0 - 1 are stored on storage unit 0, blocks c_0 to c_0 + c_1 - 1 are stored on storage unit 1, and so forth for all storage units 14 in the installation 12. (In general, storage unit i is responsible for blocks c_0 + ... + c_(i-1) to c_0 + ... + c_i - 1.) Now, consider the operation of the controller 22 of any one of the storage units 14 when it receives a request to read or write a block of data on its host interface 26. The controller 22 consults the lookup table to determine which of the data storage units 14 is responsible for storing that block. If the block is stored within the data storage unit 14 that received the request, then it will handle the request itself. Otherwise, it will place a request on the common network 16 that is directed to the particular data storage unit 14 that is responsible for storing the block.
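By way of illustration only, the following Python sketch shows how such a lookup table could be built from the broadcast capacities and how a controller might resolve which unit is responsible for a given block. The ordering rule (ascending MAC address), the field names and the helper names are assumptions made for the example rather than details taken from the description.

```python
# Minimal sketch (assumed names) of building the block-ownership lookup table
# from broadcast advertisements and resolving block -> storage unit.
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Advertisement:
    mac: str        # MAC address of the unit's external interface
    capacity: int   # number of blocks the unit can store


def build_boundaries(adverts: List[Advertisement]) -> List[int]:
    """Order the units unambiguously (here: ascending MAC address) and return
    cumulative block boundaries; boundaries[i] is one past unit i's last block."""
    ordered = sorted(adverts, key=lambda a: a.mac)
    boundaries, total = [], 0
    for unit in ordered:
        total += unit.capacity
        boundaries.append(total)
    return boundaries


def owner_of(block: int, boundaries: List[int]) -> int:
    """Return the index of the storage unit responsible for `block`."""
    if block < 0 or block >= boundaries[-1]:
        raise ValueError("block number is outside the volume")
    return bisect_right(boundaries, block)


# Example: capacities 100, 250 and 175 blocks give boundaries [100, 350, 525];
# block 99 belongs to unit 0, block 100 to unit 1 and block 400 to unit 2.
```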
Consider a data storage unit 14 that stores blocks b_low to b_high and that receives over the host interface a request to read block b, where b_low ≤ b ≤ b_high. The controller 22 retrieves block b - b_low from its disc array, and returns its contents in response to the request over the host interface. Likewise, if a request is received to write data to a block b where b_low ≤ b ≤ b_high, the controller can handle this directly within its local array of discs 20.
Now consider the behaviour of a data storage unit 14 (referred to as the "requesting storage unit") that stores blocks b_low to b_high and that receives over the host interface a request to read block b where b < b_low or b > b_high. The controller 22 constructs a request to be placed on the internal network 16. This request identifies the block b that is to be read and also identifies the number of the storage unit 14 that contains that block (referred to as the "target storage unit"). The request is placed on the internal network in accordance with the internal network protocol, and the controller waits for a response. The controller of the target storage unit receives the request and reads the requested block from its discs or elsewhere (for example, from cache). The controller then constructs a response packet in accordance with the internal network protocol that contains the identification of the requesting storage unit and the data that was read from the discs, and places that packet on the internal network 16. The response packet is received by the controller 22 of the requesting storage unit and is decoded. The data that is extracted from the packet is then relayed to the host over the host interface. From the point of view of the host, the fetching of data over the internal network 16 is entirely transparent, and it sees no difference between the case in which the data is on the data storage unit to which the request was made and the case in which it is on a remote data storage unit.
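A minimal sketch of this routing decision is given below. The Controller class, its collaborators (a local disc array object and an internal network object) and their method names are hypothetical; the sketch simply serves a block locally when it lies within the unit's own range, and otherwise forwards the request to the responsible unit and relays the response, which is what makes the remote case transparent to the host.

```python
# Minimal sketch (hypothetical class and method names) of the read-path
# routing performed by a controller when a request arrives on its host
# interface.
class Controller:
    def __init__(self, unit_index, boundaries, local_array, internal_net):
        self.unit_index = unit_index      # this unit's position in the agreed order
        self.boundaries = boundaries      # cumulative block boundaries for all units
        self.local_array = local_array    # assumed wrapper around the local discs
        self.internal_net = internal_net  # assumed wrapper around the common network

    def _local_range(self):
        low = 0 if self.unit_index == 0 else self.boundaries[self.unit_index - 1]
        return low, self.boundaries[self.unit_index] - 1

    def _owner_of(self, block):
        # Walk the cumulative boundaries to find the responsible unit.
        for unit, boundary in enumerate(self.boundaries):
            if block < boundary:
                return unit
        raise ValueError("block number is outside the volume")

    def read_block(self, block):
        low, high = self._local_range()
        if low <= block <= high:
            # The block is stored on this unit's own discs.
            return self.local_array.read(block - low)
        # Otherwise forward the request over the internal network to the
        # target storage unit and relay its response to the host unchanged.
        target = self._owner_of(block)
        return self.internal_net.request_read(target, block)
```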
The situation in which the request is to write a block of data is handled in a similar manner.
It is either handled by the data storage unit that receives the request, or that unit creates a packet and dispatches it over the internal network to the data storage unit that can handle it.
An alternative embodiment of the invention has a topology identical to that shown in Figures 1 and 2. However, it presents a filesystem interface to the host on the host interface 26. A typical embodiment will implement an industry-standard distributed filesystem, such as Network File System (NFS), Common Internet File System (CIFS), or the Andrew File System (AFS), to name but a few.
As with the embodiment described above, the combined capacity of the data storage units 14 is arranged into one large logical volume that can handle block-addressed read and write operations in response to requests made on the host interface 26. However, the operation of the controllers 22 is different. This will now be described.
The purpose of a filesystem is to organise data saved in a block device into files of arbitrary length that can be created, written and subsequently read. The files are identified by a symbolic identifier, such as a file name. Central to the operation of a filesystem is metadata, which describes where (in terms of blocks) file data is located on a block-addressed store. A network cannot itself store persistent data. Therefore, in the case of a network filesystem, filesystem operations must be translated into operations on an underlying physical filesystem, which, in turn, performs read and write operations on a block-addressed storage device.
The embodiment described above can serve as a block-addressed persistent data store. Each controller can store data in and retrieve data from an arbitrary block anywhere in the installation. The operations required to translate network file system operations into physical file system operations are well-understood. Provided that the controller 22 has sufficient processing power and memory, it is a routine matter to provide it with software that can interact with a host over the host interface 26 using a network file system protocol and translate operations on the network file system into physical file system operations. The manner in which the physical file system is implemented within an installation embodying the invention will now be described.
An alternative embodiment of the invention implements network attached storage (NAS).
In a NAS system, the data is stored and managed by the storage installation rather than by an external host. This means that the accessing hosts, be they servers or clients, do not maintain the file system, but access the data directly using a filename or a similar mechanism. The filesystem is maintained by the data storage units within the storage system.
In a similar way to the SAN case (and, indeed, to a conventional single disc drive), the purpose of the filesystem is to map logical disc files to physical sectors within the data storage units. This mapping is more complex in the NAS case, as compared with the SAN case, because there is now no central, unified place to maintain filesystem information. Any of the data storage units could be called upon by a host at any time to work with any file that is contained within the filesystem. This presents a two-fold problem. Firstly, how is the data originally written into the filesystem? Secondly, how does the filesystem maintain the link between files and physical sectors without maintaining a complete copy of the file system metadata within each data storage unit?
The manner in which the first of these issues is addressed depends upon the nature of the data that will be stored in the installation and the manner in which the data will be accessed. For example, there is a performance benefit from having very large files split over many data storage units 14 so that streaming, reading and writing of the data can occur in parallel. This is particularly beneficial in a situation where a file copy operation is performed. Considering the distributed nature of the file system, it is typical that a given file will be distributed over multiple data storage units 14. This is the arrangement that will now be described.
A host that has data to store sends a request to create a file to a local data storage unit 14, where it is processed by the controller 22. The data storage unit 14 that receives the request does not necessarily save the file to discs 20 local to that data storage unit 14; it may or may not send it over the internal network 16 to other data storage units within the installation. The ability to distribute the data ensures that there is never a restriction on the size of file or quantity of data that can be stored, provided there is capacity in the system. This information is passed within the installation as metadata by the data storage units 14.
At this point one data storage unit 14 has complete, authoritative file information about the file just written, and the remaining storage nodes only have information about the section of the file stored locally on their discs. Note that the data storage unit 14 that received the request from the host is not necessarily the data storage unit that maintains the authoritative data. Each data storage unit can now store the information about the section of the file it has stored within its local discs. This information might include the full filename, including path; the start and end sector numbers of the file; an identifier of the data storage unit that has the previous and the next section of the file (so creating a chain of data storage units that hold data for a file); and an identifier of the data storage unit that has the beginning of the file. With this information, it is possible for any data storage unit in the installation to look up the data it has stored on its local discs 20, and then to know whether it should search forwards or backwards along the chain for the data storage unit which has the required sectors. This is because all data storage units which hold part of the file have filename information. This data is distributed around the installation.
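The following sketch illustrates the kind of per-section record described above and the forwards/backwards decision it enables. The description lists the information held but not its encoding, so all field and method names here are assumptions.

```python
# Minimal sketch (assumed field names) of the metadata a data storage unit
# might keep for the section of a file held on its local discs.
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileSectionRecord:
    full_path: str             # full filename, including path
    start_sector: int          # first sector of the file held on this unit
    end_sector: int            # last sector of the file held on this unit
    prev_unit: Optional[int]   # unit holding the previous section (None if none)
    next_unit: Optional[int]   # unit holding the next section (None if none)
    first_unit: int            # unit holding the beginning of the file

    def next_hop(self, wanted_sector: int) -> Optional[int]:
        """Return the unit to ask next for `wanted_sector`, or None if the
        sector lies within the section stored locally."""
        if wanted_sector < self.start_sector:
            return self.prev_unit   # search backwards along the chain
        if wanted_sector > self.end_sector:
            return self.next_unit   # search forwards along the chain
        return None                 # the sector is stored on this unit
```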
In a case where a host delivers a file operation to a data storage unit that is part of the data chain for the file being sought, it becomes a simple matter of contacting either the up-stream or down-stream data storage unit for an in-file search, or the zero node for the start of the file. In the case where a host sends a request for a file operation to a data storage unit that does not have access to any of the metadata for the file, the situation is less straightforward.
The data storage unit requiring the data has no information to identify where the file's data is located.
This data can be obtained by using a hierarchical lookup system in which parts of the filesystem are maintained by data storage units, and all files within a given section are either known to the data storage unit that maintains it or that unit is aware of the data storage unit which maintains that specific information. An example will now be presented to clarify this.
The example uses part of a filesystem hierarchy that includes the following directories:
/usr/home/kate
/usr/home/sam
/usr/home/vic
For the sake of illustration, assume that all of the information for /usr is stored on a single data storage unit, but the information for the path /usr/home/kate is stored on a different data storage unit, with an internal metadata link made between the two so that any data stored in /usr/home/kate can be found from the higher level within the filesystem tree. This hierarchy can be expanded indefinitely, with the hierarchies adjusted dynamically over time to balance the load on the installation, or it can remain static until a manual rebalancing process is initiated.
This rebalancing is only necessary to ensure that performance does not deteriorate. For example, in a situation where millions of files are all within the same directory (for example, a directory that holds data for a mail server), it would be possible to use the linking mechanism to split the files of the directory over many data storage units, with only a reference held to the data storage unit maintaining the authoritative metadata.
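By way of illustration, a minimal sketch of such a hierarchical, linked lookup is shown below, using a simple longest-prefix delegation table. The class, its fields and the unit numbers are illustrative assumptions, not details from the description.

```python
# Minimal sketch (assumed structure) of the hierarchical lookup: a unit holds
# metadata for part of the tree and links for the sub-paths it has delegated.
from typing import Dict


class DirectorySection:
    def __init__(self):
        self.local_metadata: Dict[str, object] = {}  # paths maintained by this unit
        self.delegated: Dict[str, int] = {}          # delegated sub-path -> unit index

    def resolve(self, path: str):
        """Return local metadata for `path`, the index of the unit that
        maintains it, or None if this unit knows nothing about it."""
        # The longest matching delegated prefix wins.
        for prefix in sorted(self.delegated, key=len, reverse=True):
            if path == prefix or path.startswith(prefix + "/"):
                return self.delegated[prefix]
        return self.local_metadata.get(path)


# Example mirroring the /usr case above: the unit holding /usr delegates
# /usr/home/kate to (say) unit 3, so a lookup under that path returns 3,
# while /usr/home/sam is resolved from this unit's own metadata.
usr_section = DirectorySection()
usr_section.delegated["/usr/home/kate"] = 3
usr_section.local_metadata["/usr/home/sam"] = {"first_unit": 0}
assert usr_section.resolve("/usr/home/kate/report.txt") == 3
```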
There will always be at least one data storage unit 14 that has metadata about a particular file, and one data storage unit will, therefore, always be authoritative about that file. Many more data storage units might have partial or searching information about that file or part of the tree.
It is clear that if an alternative method of data storage was required, such as an SQL-based system, then that could easily be supported by the installation.
It will be seen that each of the data storage units 14 has identical functionality. Therefore, a host 10 can be connected to any of them and can achieve equal access to the storage, as shown in Figure 3.
The above-described systems can provide a functional storage installation. It has flexibility, in that it is possible to expand its storage capacity by adding additional data storage units 14 to the internal network 16. It also is reasonably reliable, because of the inherent reliability of the RAID units. However, it is possible to implement further embodiments that have substantially enhanced characteristics.
A first enhancement is to provide several host interfaces 26 that can operate in parallel, as shown in Figure 4. This can increase the throughput of requests between the host and the storage unit. This is advantageous because, in many cases, several requests will be handled by different data storage units, and these can be handled in parallel by the installation.
Substantial enhancement to both reliability and performance can be achieved by providing mirroring of data within the data storage units 114, as shown in Figure 5. In such embodiments, each data storage unit comprises all of the components of the data storage unit 14 of Figure 2 in duplicate, each half being capable of operating independently of the other. In addition, a direct communication link 30 is provided between the two controllers 22. Each data storage unit 114 has a total of four host interfaces 26. The installation has two independent internal networks 16a, 16b. Each data storage unit 114 has one external interface connected to a respective one of the internal networks 16a, 16b. Therefore, with the exception of the direct communication links, the installation of this embodiment has a topology identical to two installations as described with reference to Figures 1 and 3.
A more general overview of an embodiment of the invention that uses mirroring is shown in Figure 6. As can be seen, a very wide range of configurations can be selected by a system designer to optimise the installation. For example, one data storage unit 114 can connect to one or more hosts, and a host can connect to one or more data storage units 114. Figure 6 shows just a small range of the configurations that are possible.
In order to describe the operation of a mirrored embodiment of the invention, the installation as a whole will be described as having two "sides"; to differentiate them, the sides will be referred to as "light" and "dark", and labelled "L" and "D", respectively. These terms have been chosen because they have no implied technical or structural meaning, thereby emphasising the symmetrical nature of the mirrored system.
There are three principal benefits from the mirroring system: 1. an increase in bandwidth and throughput for any data storage unit; 2. additional redundancy of the data; and 3. additional access paths to data.
These will now be described further.
There is an inherent increase in the bandwidth to the data stored on any data storage unit, as there are now two (or more) data storage units which can share the workload between themselves.
This means that the light side of the pair might handle half of the load and the dark side the other half, enabling both to operate at half capacity. This is beneficial, as there may be other things happening within the system besides input/output (I/O) operations. Each side of the mirror is also capable of operating independently, yet with coordination, so that latency between requests can be reduced through cooperation.
There is a second copy of the complete data stored in the data storage unit, and this provides data redundancy. With this mechanism, the data system is protected against catastrophic failure; in the unlikely event of an entire data storage unit becoming damaged to the point of failure, there is still a second live copy of the data. This redundancy is important because, when data sets become extremely large, there is no practical way to perform a backup other than by copying the data to another mass storage (typically disc-based) system. The mirroring mechanism provides a live backup system while simultaneously benefiting from performance improvements.
In a large system embodying the invention there are now many paths to data. There are the direct paths which are normally used through the internal network, there are paths that can be accessed by one side of a data storage unit using its peers, and there are paths that can be accessed by jumping to the other side of the mirror. In the last case, a path can be created when a failure within the local network has rendered a particular path inaccessible, such that the most expedient route to the data is to cross to the mirrored partner and then use that system to route the data request. This is functionally equivalent to a data storage unit requesting its mirror partner to perform an operation on its behalf.
The complexity of calculating new paths among the myriad of connections is a challenge in itself, but the advantage is a system so resilient that it is difficult to assess the limits of that resilience accurately. With the mirror system in place there is no single or double point of failure. There are situations where an entire data storage unit must be disabled, but that would not constitute a single point of failure. This significantly improves the resilience of the embodiment. Data should always be available, to the point where a failure of the storage system would be accompanied by a more severe failure of the connected systems.
Embodiments of the system can implement "hot spare" drives in an efficient manner, as will now be described.
As is well known by those in the technical field, a data store that operates in a mode analogous to RAID 5 can continue to operate, albeit with significantly degraded performance, in the face of failure of one of its storage devices. The data and parity information in the other devices can be used to re-construct the missing data. It is common to provide a RAID 5 array with a "hot spare". A hot spare is a storage device (typically a hard disc drive) that is electrically connected, with its electronics powered, but with its physical discs powered down. In this state, communication with the drive is possible and it will respond to commands, but the life of the recording medium and other mechanical components is not being degraded. Upon failure of one storage device, the system uses the hot spare to replace the failed device, reconstructing the data stored on the failed device on the hot spare before integrating the hot spare into the array. Likewise, embodiments of the invention may implement a hot spare conventionally, with one spare disc per side of the storage unit. When a failure occurs, load from the degraded side of the mirror can be transferred to the non-degraded side.
The inherent flexibility of data storage installations embodying the invention allows hot spare capacity to be implemented more efficiently than in the case of conventional RAID arrays.
A hot spare must provide a capacity to store data equivalent to that of one data storage device. This is commonly achieved by providing one spare storage device per array.
However, this is quite wasteful since most of the time all of these drives are idle. In embodiments of this invention, each data storage unit can store data within its own data storage devices or within the data storage devices of other data storage units. Hitherto, this has been described with reference to file storage, but it can also be applied to data that is to be stored in a hot spare.
In one proposed embodiment, only some data storage units are provided with hot spare devices, which can be used to provide hot spare capacity for the local array in a conventional manner. These data storage units can also advertise the availability of their hot spare to other data storage units within the installation. When a controller within a data storage unit detects the failure of a data storage device and there is no local hot spare, it can use the hot spare capacity of another data storage unit to store the data that would otherwise be stored in a local hot spare. Once the defective data storage device has been replaced, its data can be re-constructed from the data in the remote data storage unit in the normal way. This will, of course, result in a lowering of performance of the array as compared with the case in which a local hot spare is provided. However, this effect is mitigated by the presence of a mirror within the data storage unit.
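By way of illustration only, the following sketch shows one way hot spare capacity could be advertised on the common network and claimed by a unit that has suffered a disc failure but has no local spare. The registry class, its methods and the block-count bookkeeping are assumptions made for the example.

```python
# Minimal sketch (assumed names) of advertising and claiming remote hot spare
# capacity across the installation.
from typing import Dict, Optional


class HotSpareRegistry:
    """Kept up to date from advertisements received on the common network."""

    def __init__(self):
        self.available: Dict[int, int] = {}  # unit index -> spare blocks advertised

    def record_advert(self, unit: int, spare_blocks: int) -> None:
        self.available[unit] = spare_blocks

    def claim(self, needed_blocks: int, requesting_unit: int) -> Optional[int]:
        """Pick a unit other than the requester with enough spare capacity to
        hold the data rebuilt from the failed device, and reserve it."""
        for unit, free in self.available.items():
            if unit != requesting_unit and free >= needed_blocks:
                self.available[unit] = free - needed_blocks
                return unit
        return None  # no remote hot spare capacity is currently advertised
```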
As a further development, some installations may include no data storage devices that are specifically assigned the task of a hot spare. Rather, one, some or all of the data storage units may reserve space within their data storage devices that can be used to store data that would conventionally be stored in a hot spare. The availability of such space may be advertised to other data storage units.

Claims (20)

  1. A data storage installation comprising a plurality of data storage units interconnected by a data transport means, in which each data storage unit comprises a controller and one or more data storage devices, each controller includes a memory in which is stored metadata that identifies data stored on the storage devices of all data storage units in the installation, whereby upon receipt of a request for data, the controller consults the memory to determine which data storage unit contains the requested data and either fulfils the request for data or passes the request on to the data storage unit on which the data is stored.
  2. A data storage installation according to claim 1 in which each data storage unit comprises multiple data storage devices.
  3. A data storage installation according to claim 2 in which the data storage devices are hard disc drives.
  4. A data storage installation according to claim 2 or claim 3 in which the controller is configured to write data to the data storage devices in a format that is analogous to RAID 5, in which data is stored along with parity metadata which is distributed throughout the storage devices.
  5. A data storage installation according to any preceding claim in which at least one data storage unit includes a spare data storage device that can be used to replace data storage capacity in the event that a data storage device within the installation should fail.
  6. A data storage installation according to claim 5 in which the spare data storage device can be used as a store for data resulting from a failure of a data storage device within the data storage unit in which it is contained or resulting from a failure of a data storage device within another data storage unit.
  7. A data storage installation according to any preceding claim in which at least one data storage unit reserves space within its data storage devices that can be used to replace data storage capacity in the event that a data storage device within the installation should fail.
  8. A data storage installation according to claim 7 in which the reserved data storage space can be used as a store for data resulting from a failure of a data storage device within the data storage unit in which it is contained or resulting from a failure of a data storage device within another data storage unit.
  9. A data storage installation according to any preceding claim that implements network attached storage (NAS).
  10. A data storage installation according to claim 9 operative to interact with a host using a distributed filesystem protocol.
  11. A data storage installation according to claim 10 in which the distributed filesystem protocol is one or more of: Network File System (NFS), Andrew File System (AFS) and Common Internet File System (CIFS).
  12. A data storage installation according to claim 9 or claim 10 in which data within a file can be stored on one or on more than one data storage unit.
  13. A data storage installation according to any one of claims 9 to 11 in which one data storage unit is responsible for authoritative metadata for a file.
  14. A data storage installation according to claim 13 in which the authoritative metadata is mirrored on a plurality of storage devices or storage units.
  15. A data storage installation according to claim 13 in which the metadata is cached on data storage units other than the data storage unit that is responsible for the metadata.
  16. A data storage installation according to any preceding claim that implements a storage area network (SAN).
  17. A data storage installation according to claim 16 operative to write data to a block or read data from a block in response to a request from a host, the block being identified by a block identifier.
  18. A data storage installation according to claim 16 or claim 17 operative to interact with a host using a transport layer network protocol.
  19. A data storage installation according to claim 18 in which the transport layer protocol is Internet SCSI (iSCSI).
  20. A data storage installation substantially as described herein with reference to the accompanying drawings.
GB0702690A 2007-02-03 2007-02-03 Data storage system Withdrawn GB2446177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0702690A GB2446177A (en) 2007-02-03 2007-02-03 Data storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB0702690A GB2446177A (en) 2007-02-03 2007-02-03 Data storage system

Publications (2)

Publication Number Publication Date
GB0702690D0 GB0702690D0 (en) 2007-03-21
GB2446177A true GB2446177A (en) 2008-08-06

Family

ID=37899172

Family Applications (1)

Application Number Title Priority Date Filing Date
GB0702690A Withdrawn GB2446177A (en) 2007-02-03 2007-02-03 Data storage system

Country Status (1)

Country Link
GB (1) GB2446177A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5668943A (en) * 1994-10-31 1997-09-16 International Business Machines Corporation Virtual shared disks with application transparent recovery
EP0769744A2 (en) * 1995-09-19 1997-04-23 International Business Machines Corporation System and method for sharing multiple storage arrays by multiple host computer systems
US6654831B1 (en) * 2000-03-07 2003-11-25 International Business Machine Corporation Using multiple controllers together to create data spans
US20030188045A1 (en) * 2000-04-13 2003-10-02 Jacobson Michael B. System and method for distributing storage controller tasks
US20020124134A1 (en) * 2000-12-28 2002-09-05 Emc Corporation Data storage system cluster architecture
EP1260904A2 (en) * 2001-05-23 2002-11-27 Hitachi, Ltd. Storage subsystem with cooperating disk controllers
US20030204683A1 (en) * 2002-04-30 2003-10-30 Hitachi, Ltd. Method, system, and storage controller for controlling shared memories

Also Published As

Publication number Publication date
GB0702690D0 (en) 2007-03-21

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)