EP2024809A2 - Filesystem-aware block storage system, apparatus, and method - Google Patents
Filesystem-aware block storage system, apparatus, and method
- Publication number
- EP2024809A2 (application EP07797330A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- storage
- filesystem
- host
- cluster
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/0643—Management of files
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0604—Improving or facilitating administration, e.g. storage management
- G06F3/0605—Improving or facilitating administration, e.g. storage management by facilitating the interaction with a user or administrator
- G06F3/0608—Saving storage space on storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0689—Disk arrays, e.g. RAID, JBOD
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
Definitions
- the present invention relates to digital data storage systems and methods, and more particularly to those providing fault-tolerant storage. It is known in the prior art to provide redundant disk storage in a pattern according to any one of various RAID (Redundant Array of Independent Disks) protocols. Typically, disk arrays using a RAID pattern are complex structures that require management by experienced information technologists. Moreover, in many array designs using a RAID pattern, if the disk drives in the array are of non-uniform capacities, the design may be unable to use any capacity on a drive that exceeds the capacity of the smallest drive in the array
- a spare storage device will be maintained in a ready state so that it can be used in the event another storage device fails
- a spare storage device is often referred to as a "hot spare"
- the hot spare is not used to store data during normal operation of the storage system.
- the failed storage device is logically replaced by the hot spare, and data is moved or otherwise recreated onto the hot spare
- the hot spare is brought offline so that it is ready to be used in the event of another failure
- Maintenance of a hot spare disk is generally complex, and so is generally handled by a skilled administrator. A hot spare disk also represents an added expense
- the storage system allocates a storage block for the data and updates its data structures to indicate that the storage block is in use. From that point on, the storage system considers the storage block to be in use, even if the host filesystem subsequently ceases to use its block
- the host filesystem generally uses a bitmap to track its used disk blocks. Shortly after volume creation, the bitmap will generally indicate that most blocks are free, typically by having all bits clear. As the filesystem is used, the host filesystem will allocate blocks solely through use of its free block bitmap
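- the sketch below (in C, not taken from the patent) shows one way such a free-block bitmap could be consulted; the bit ordering and the meaning of a set bit differ between host filesystems, so both are assumptions here

```c
#include <stdint.h>
#include <stdio.h>

/* Returns non-zero if block_no is marked used.  Assumptions: 1 bit per block,
 * bit set = block in use, LSB-first bit order within each byte. */
static int block_in_use(const uint8_t *bitmap, uint64_t block_no)
{
    return (bitmap[block_no / 8] >> (block_no % 8)) & 1;
}

int main(void)
{
    uint8_t bitmap[4] = { 0x0F, 0, 0, 0 };                   /* blocks 0-3 in use */
    printf("block 2 used: %d\n", block_in_use(bitmap, 2));   /* prints 1 */
    printf("block 9 used: %d\n", block_in_use(bitmap, 9));   /* prints 0 */
    return 0;
}
```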
- a method of storing data by a block-level storage system that stores data under control of a host filesystem
- the method involves locating host filesystem data structures stored for the host filesystem in the block-level storage system, analyzing the host filesystem data structures to identify a data type associated with the data to be stored, and storing the data using a storage scheme selected based on the data type, whereby data having different data types can be stored using different storage schemes selected based on the data types
- a block-level storage system that stores data under control of a host filesystem.
- the system comprises a block-level storage in which host filesystem data structures are stored for the host filesystem and a storage controller operably coupled to the block-level storage for locating the host filesystem data structures stored in the block-level storage, analyzing the host filesystem data structures to identify a data type associated with the data to be stored, and storing the data using a storage scheme selected based on the data type, whereby data having different data types can be stored using different storage schemes selected based on the data types
- the data may be stored using a storage layout and/or an encoding scheme selected based on the data type. For example, frequently accessed data may be stored so as to provide enhanced accessibility (e.g., in an uncompressed form and in sequential storage), while infrequently accessed data may be stored so as to provide enhanced storage efficiency (e.g., using data compression and/or non-sequential storage). Additionally or alternatively, the data may be compressed and/or encrypted depending on the data type
- the host filesystem data structures may be located by maintaining a partition table, parsing the partition table to locate an operating system partition, parsing the operating system partition to identify the operating system and locate operating system data structures, and parsing the operating system data structures to identify the host filesystem and locate the host filesystem data structures
- the operating system data structures may include a superblock, in which case parsing the operating system data structures may include parsing the superblock
- the host filesystem data structures may be parsed by making a working copy of a host filesystem data structure and parsing the working copy
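- as an illustration of the partition-table parsing step described above, the C sketch below walks a legacy MBR partition table found at LBA 0; the MBR offsets and signature are standard, but the patent does not mandate any particular partition table format, so treat this as an assumed example

```c
#include <stdint.h>

struct part {            /* decoded view of one 16-byte MBR entry */
    uint8_t  type;       /* partition type byte (e.g. 0x07 for NTFS) */
    uint32_t start_lba;  /* first sector of the partition */
    uint32_t sectors;    /* length in sectors */
};

static uint32_t le32(const uint8_t *p)   /* MBR fields are little-endian */
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode the four primary partition entries from sector 0 (LBA 0).
 * Returns -1 if the 0x55AA boot signature is missing. */
static int parse_mbr(const uint8_t sector0[512], struct part out[4])
{
    if (sector0[510] != 0x55 || sector0[511] != 0xAA)
        return -1;
    for (int i = 0; i < 4; i++) {
        const uint8_t *e = sector0 + 446 + 16 * i;   /* partition table at offset 446 */
        out[i].type      = e[4];
        out[i].start_lba = le32(e + 8);
        out[i].sectors   = le32(e + 12);
    }
    return 0;
}
```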
- Fig 1 is an illustration of an embodiment of the invention in which an object is parsed into a series of chunks for storage
- Fig 2 illustrates in the same embodiment how a pattern for fault-tolerant storage for a chunk may be dynamically changed as a result of the addition of more storage
- Fig 3 illustrates in a further embodiment of the invention the storage of chunks in differing fault-tolerant patterns on a storage system constructed using different sized storage devices
- Fig 4 illustrates another embodiment of the invention in which indicator states are used to warn of inefficient storage use and low levels of fault tolerance
- Fig 5 is a block diagram of functional modules used in the storage, retrieval and re-layout of data in accordance with an embodiment of the invention
- Fig 6 shows an example in which mirroring is used in an array containing more than two drives
- Fig 7 shows some exemplary zones using different layout schemes to store their data
- Fig 8 shows a lookup table for implementing sparse volumes
- Fig 9 shows status indicators for an exemplary array having available storage space and operating in a fault-tolerant manner, in accordance with an exemplary embodiment of the present invention
- Fig 10 shows status indicators for an exemplary array that does not have enough space to maintain redundant data storage and more space must be added, in accordance with an exemplary embodiment of the present invention
- Fig 11 shows status indicators for an exemplary array that would be unable to maintain redundant data in the event of a failure, in accordance with an exemplary embodiment of the present invention
- Fig 12 shows status indicators for an exemplary array in which a storage device has failed, in accordance with an exemplary embodiment of the present invention. Slots B, C, and D are populated with storage devices
- Fig 13 shows a module hierarchy representing the different software layers of an exemplary embodiment and how they relate to one another
- Fig 14 shows how a cluster access table is used to access data clusters in a Zone, in accordance with an exemplary embodiment of the present invention
- FIG 15 shows a journal table update in accordance with an exemplary embodiment of the present invention
- Fig 16 shows drive layout in accordance with an exemplary embodiment of the invention
- Fig 17 demonstrates the layout of Zone 0 and how other zones are referenced, in accordance with an exemplary embodiment of the invention
- Fig 18 demonstrates read error handling in accordance with an exemplary embodiment of the invention
- Fig 19 demonstrates write error handling in accordance with an exemplary embodiment of the invention
- Fig 20 is a logic flow diagram demonstrating backup of a bad Region by the Error Manager in accordance with an exemplary embodiment of the invention
- Fig 21 is a schematic block diagram showing the relevant components of a storage array in accordance with an exemplary embodiment of the present invention
- Fig 22 is a logic flow diagram showing exemplary logic for managing a virtual hot spare in accordance with an exemplary embodiment of the present invention
- Fig 23 is a logic flow diagram showing exemplary logic for determining a re-layout scenario for each possible disk failure, as in block 2102 of Fig 22, in accordance with an exemplary embodiment of the present invention
- Fig 24 is a logic flow diagram showing exemplary logic for invoking the virtual hot spare functionality in accordance with an exemplary embodiment of the present invention
- Fig 25 is a logic flow diagram showing exemplary logic for automatically reconfiguring the one or more remaining drives to restore fault tolerance for the data, as in block 2306 of Fig 24, in accordance with an exemplary embodiment of the present invention
- Fig 26 is a logic flow diagram showing exemplary logic for upgrading a storage device, in accordance with an exemplary embodiment of the present invention
- FIG 27 is a conceptual block diagram of a computer system in accordance with an exemplary embodiment of the present invention
- FIG 28 is a high-level logic flow diagram for the filesystem-aware storage controller, in accordance with an exemplary embodiment of the present invention
- FIG 29 is a logic flow diagram for locating the host filesystem data structures, in accordance with an exemplary embodiment of the present invention
- FIG 30 is a logic flow diagram for reclaiming unused storage space, in accordance with an exemplary embodiment of the present invention
- FIG 31 is a logic flow diagram for managing storage of the user data based on the data types, in accordance with an exemplary embodiment of the present invention
- FIG 32 is a schematic block diagram showing the relevant components of a scavenger, in accordance with an exemplary embodiment of the present invention
- FIG 33 is pseudo code for locating the host filesystem bitmaps, in accordance with an exemplary embodiment of the present invention
- FIG 34 is high-level pseudo code for the BBUM, in accordance with an exemplary embodiment of the present invention
- FIG 35 is high-level pseudo code for synchronous processing of an LBA 0 update creating a new partition, in accordance with an exemplary embodiment of the present invention
- FIG 36 is high-level pseudo code for synchronous processing of an LBA 0 update (re)formatting a partition, in accordance with an exemplary embodiment of the present invention
- FIG 37 is high-level pseudo code for synchronous processing of an LBA 0 update deleting a partition, in accordance with an exemplary embodiment of the present invention
- FIG 38 is high-level pseudo code for the asynchronous task, in accordance with an exemplary embodiment of the present invention
- a "chunk” of an object is an abstract slice of an object, made independently of any physical storage being used, and is typically a fixed number of contiguous bytes of the object
- a fault-tolerant "pattern" for data storage is the particular manner in which data is distributed redundantly over one or more storage devices, and may be, among other things, mirroring (e.g., in a manner analogous to RAID1), striping (e.g., in a manner analogous to RAID5), RAID6, dual parity, diagonal parity, Low Density Parity Check codes, turbo codes, or another redundancy scheme or combination of redundancy schemes
- a hash number for a given chunk is "unique" when the given chunk produces a hash number that generally will differ from the hash number for any other chunk, except when the other chunk has data content identical to the given chunk. That is, two chunks will generally have different hash numbers whenever their content is non-identical
- the term "unique" is used in this context to cover a hash number that is generated from
- a “Region” is a set of contiguous physical blocks on a storage medium (e.g., a hard drive)
- a “Zone” is composed of two or more Regions. The Regions that make up a Zone are generally not required to be contiguous. In an exemplary embodiment as described below, a Zone stores the equivalent of 1GB of data or control information
- a “Cluster” is the unit size within Zones and represents a unit of compression (discussed below)
- a Cluster is 4KB (i.e., eight 512-byte sectors) and essentially equates to a Chunk
- a "Redundant set” is a set of sectors/clusters that provides redundancy for a set of data
- a "first pair" and a “second pair” of storage devices may include a common storage device
- a “first plurality” and a “second plurality” of storage devices may include one or more common storage devices
- a “first arrangement” and a “second arrangement” or “different arrangement” of storage devices may include one or more common storage devices
- a filesystem-aware storage system analyzes host filesystem data structures in order to determine storage usage of the host filesystem
- the block storage device may parse the host filesystem data structures to determine such things as used blocks, unused blocks, and data types
- the block storage device manages the physical storage based on the storage usage of the host filesystem
- Such a filesystem-aware block storage device can make intelligent decisions regarding the physical storage of data
- the filesystem-aware block storage device can identify blocks that have been released by the host filesystem and reuse the released blocks in order to effectively extend the data storage capacity of the system.
- Such reuse of released blocks, which may be referred to hereinafter as “scavenging” or “garbage collection,” may be particularly useful in implementing virtual storage, where the host filesystem is configured with more storage than the actual physical storage capacity
- the filesystem-aware block storage device can also identify the data types of objects stored by the filesystem and store the objects using different storage schemes based on the data types (e.g., frequently accessed data can be stored uncompressed and in sequential blocks, while infrequently accessed data can be stored compressed and/or in non-sequential blocks; different encoding schemes such as data compression and encryption can be applied to different objects based on the data types)
- the filesystem-aware block storage device will generally support a predetermined set of host filesystems whose data structures it "understands"
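- the sketch below illustrates the idea of mapping an identified data type to a storage scheme; the data-type categories and the policy choices are assumptions for illustration, not the patent's own policy

```c
#include <stdbool.h>

enum data_type { DT_FREQUENT, DT_INFREQUENT, DT_SENSITIVE };   /* assumed categories */
enum layout    { LAYOUT_SEQUENTIAL, LAYOUT_NONSEQUENTIAL };

struct scheme { enum layout layout; bool compress; bool encrypt; };

/* Pick a storage scheme from the data type: accessibility for hot data,
 * storage efficiency (compression, non-sequential placement) for cold data,
 * encryption for data classified as sensitive. */
static struct scheme pick_scheme(enum data_type t)
{
    switch (t) {
    case DT_FREQUENT:   return (struct scheme){ LAYOUT_SEQUENTIAL,    false, false };
    case DT_INFREQUENT: return (struct scheme){ LAYOUT_NONSEQUENTIAL, true,  false };
    case DT_SENSITIVE:  return (struct scheme){ LAYOUT_NONSEQUENTIAL, true,  true  };
    }
    return (struct scheme){ LAYOUT_SEQUENTIAL, false, false };
}
```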
- Fig 1 is an illustration of an embodiment of the invention in which an object, in this example a file, is parsed into a series of chunks for storage. Initially the file 11 is passed into the storage software, where it is designated as an object 12 and allocated a unique object identification number, in this case #007. A new entry 131 is made into the object table 13 to represent the allocation for this new object
- the object is now parsed into "chunks" of data 121, 122, and 123, which are fixed-length segments of the object
- Each chunk is passed through a hashing algorithm, which returns a unique hash number for the chunk
- This algorithm can later be applied to a retrieved chunk and the result compared with the original hash to ensure the retrieved chunk is the same as that stored
- the hash numbers for each chunk are stored in the object table 13 in the entry row for the object 132 so that later the complete object can be retrieved by collection of the chunks
- the chunk hashes are now compared with existing entries in the chunk table 14. Any hash that matches an existing entry indicates that an identical chunk is already stored, so the new chunk need not be stored again
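- a minimal deduplication sketch follows, showing the flow of hashing a chunk, checking a chunk table, and either adding a reference or storing new data; the table structure and the placeholder FNV-1a hash are assumptions (the embodiment described later uses SHA-1 and on-disk tables)

```c
#include <stdint.h>
#include <string.h>

#define CHUNK_SIZE  4096
#define TABLE_SLOTS 1024

struct chunk_entry {
    uint64_t hash;
    uint32_t refcount;
    uint8_t  data[CHUNK_SIZE];        /* stands in for "location on disk" */
};

static struct chunk_entry chunk_table[TABLE_SLOTS];

static uint64_t fnv1a(const uint8_t *p, size_t n)   /* placeholder hash */
{
    uint64_t h = 0xcbf29ce484222325ULL;
    while (n--) { h ^= *p++; h *= 0x100000001b3ULL; }
    return h;
}

/* Store a chunk, or only bump a reference count if identical content exists. */
static int store_chunk(const uint8_t chunk[CHUNK_SIZE])
{
    uint64_t h = fnv1a(chunk, CHUNK_SIZE);
    struct chunk_entry *e = &chunk_table[h % TABLE_SLOTS];

    if (e->refcount && e->hash == h &&
        memcmp(e->data, chunk, CHUNK_SIZE) == 0) {
        e->refcount++;                /* duplicate chunk: reference it */
        return 0;
    }
    if (e->refcount)
        return -1;                    /* slot collision: not handled in this sketch */
    e->hash = h;
    e->refcount = 1;
    memcpy(e->data, chunk, CHUNK_SIZE);
    return 0;
}
```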
- Fig 2 illustrates in the same embodiment how a pattern for fault-tolerant storage for a chunk may be dynamically changed as a result of the addition of more storage
- Fig 2 shows how a chunk physically stored on the storage devices may be laid out in a new pattern once additional storage is added to the overall system
- the storage system comprises two storage devices 221 and 222 and the chunk data is physically mirrored onto the two storage devices at locations 2211 and 2221 to provide fault tolerance
- a third storage device 223 is added, and it becomes possible to store the chunk in a parity striped manner, a pattern which is more storage efficient than the mirrored pattern
- the chunk is laid out in this new pattern in three physical locations 2311, 2321, and 2331, taking a much lower proportion of the available storage
- the chunk table 21 is updated to show the new layout is in three locations 212 and also the new chunk physical locations 2311, 2321, and 2331 are recorded 213
- Fig 3 shows a mature storage system, in accordance with an embodiment of the present invention, which has been
- Fig 4 illustrates another embodiment of the invention in which indicator states are used to warn of inefficient storage use and low levels of fault tolerance
- all three storage devices 41, 42, and 43 have free space and the indicator light 44 is green to show data is being stored in an efficient and fault-tolerant manner
- in Fig 4 (b), the 40GB storage device 41 has become full, and thus new data can be stored only on the two storage devices 42 and 43 with remaining free space in a mirrored pattern 46
- the indicator light 44 has turned amber. In Fig 4 (c), only the 120GB storage device 43 has free space remaining, and so all new data can be stored only in a mirrored pattern on this one device 43
- the indicator light 44 turns red to indicate the addition of more storage is necessary
- an indicator is provided for each drive/slot in the array, for example, in the form of a three-color light (e.g., green, yellow, red)
- the lights are used to light the whole front of a disk carrier with a glowing effect
- the lights are controlled to indicate not only the overall status of the system, but also which drive/slot requires attention (if any)
- Each three-color light can be placed in at least four states, specifically off, green, yellow, red
- the light for a particular slot may be placed in the off state if the slot is empty and the system is operating with sufficient storage and redundancy so that no drive need be installed in the slot
- the light for a particular slot may be placed in the green state if the corresponding drive is sufficient and need not be replaced
- the light for a particular slot may be placed in the yellow state if system operation is degraded such that replacement of the corresponding drive with a larger drive is recommended
- the light for a particular slot may be placed in the red state if the corresponding drive must be installed or replaced Additional
- a single LCD display could be used to indicate system status and, if needed, a slot number that requires attention
- other types of indicators (e.g., a single status indicator for the system (e.g., green/yellow/red) along with either a slot indicator or a light for each slot) could be used
- Fig 5 is a block diagram of functional modules used in the storage, retrieval and re-layout of data in accordance with an embodiment of the invention, such as discussed above in connection with Figs 1-3
- the entry and exit points for communication are the object interface 511 for passing objects to the system for storage or retrieving objects, the block interface 512, which makes the storage system appear to be one large storage device, and the CIFS interface 513, which makes the storage system appear to be a Windows file system.
- the data is passed to the Chunk Parser 52, which performs the break up of the data into chunks and creates an initial entry into the object table 512 (as discussed above in connection with Fig 1)
- These chunks are then passed to the hash code generator 53, which creates the associated hash codes for each chunk and enters these into the object table so the chunks associated with each object are listed 512 (as discussed above in connection with Fig 1)
- the chunk hash numbers are compared with the entries in the chunk table 531 Where a match is found, the new chunk is discarded, as it will be identical to a chunk already stored in the storage system.
- the physical storage manager stores the chunk in the most efficient pattern possible on the available storage devices 571, 572, and 573 and makes a corresponding entry in the chunk table 531 to show where the physical storage of the chunk has occurred so that the contents of the chunk can be retrieved later 512 (as discussed above in connection with Fig 1)
- the retrieval of data in Fig 5 by the object 511, block 512 or CIFS 513 interface is performed by a request to the retrieval manager 56, which consults the object table 521 to determine which chunks comprise the object and then requests these chunks from the physical storage manager 54
- the physical storage manager 54 consults the chunk table 531 to determine where the requested chunks are stored and then retrieves them and passes the completed data (object) back to the retrieval manager 56, which returns the data to the requesting interface
- the fault tolerant manager (FTL) 55 constantly scans the chunk table to determine if chunks are stored in the most efficient manner possible. (This may change as storage devices 571, 572, and 573 are added and removed.) If a chunk is not stored in the most efficient manner possible, then the FTL will request the physical storage manager 54 to create a new layout pattern for the chunk and update the chunk table 531. This way all data continues to remain stored in the most efficient manner possible for the number of storage devices comprising the array (as discussed above in connection with Figs
- a Zone has the effect of hiding redundancy and disk re-layout from the actual data being stored on the disk. Zones allow additional layout methods to be added and changed without affecting the user of the zone
- a Zone stores a given and fixed amount of data (for example, 1GB)
- a zone may reside on a single disk or span across one or more drives
- the physical layout of a Zone provides redundancy in the form specified for that zone
- Fig 6 shows an example in which mirroring is used in an array containing more than two drives
- Fig 7 shows some example zones using different layout schemes to store their data
- the diagram assumes a zone stores 1GB of data. Note the following points: i) a zone that spans multiple drives does not necessarily use the same offset into the drive across the set; ii) a single drive mirror requires 2G of storage to store 1G of data; iii) a dual drive mirror requires 2G of storage to store 1G of data; iv) a 3 drive stripe requires 1.5G of storage to store 1G of data; v) a 4 drive stripe requires 1.33G of storage to store 1G of data; vi) zone A, zone B etc. are arbitrary zone names, and in a real implementation each zone would be identified by a unique number; vii) although implied by the diagram, zones are not necessarily contiguous on a disk (see Regions later); viii) there is no technical reason why mirroring is restricted to 2 drives; for example, in a 3 drive system 1 copy of the data could be stored on 1 drive and half of the mirrored data could be stored on each of the other 2 drives
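- the arithmetic behind points ii) through v) above can be checked with the short program below: a mirror stores every byte twice, and an N-drive parity stripe stores N/(N-1) bytes of raw storage per byte of user data

```c
#include <stdio.h>

int main(void)
{
    const double user_gb = 1.0;
    printf("mirror (single or dual drive): %.2f GB\n", user_gb * 2.0);     /* 2.00 */
    for (int n = 3; n <= 4; n++)                                           /* 1.50, 1.33 */
        printf("%d-drive stripe:               %.2f GB\n", n, user_gb * n / (n - 1));
    return 0;
}
```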
- Each disk is split into a set of equal-sized Regions
- the size of a Region is much smaller than a Zone and a Zone is constructed from one or more regions from one or more disks
- the size of a Region is typically a common factor of the different Zone sizes and the different number of disks supported by the array
- Regions are 1/12 the data size of a Zone
- the following table lists the number of Regions/Zone and the number of Regions/disk for various layouts, in accordance with an exemplary embodiment of the invention.
- Regions can be marked as used, free or bad. When a Zone is created, a set of free Regions from the appropriate disks is selected and logged in a table. These Regions can be in any arbitrary order and need not be contiguous on the disk. When data is written to or read from a Zone, the access is redirected to the appropriate Region
- stepwise expansion and contraction may be enforced. For example, if two drives are suddenly added, the expansion of a zone may go through an intermediate expansion as though one drive was added before a second expansion is performed to incorporate the second drive. Alternatively, expansion and contraction involving multiple drives may be handled atomically, without an intermediate step. Before any re-layout occurs, the required space must be available; this should be calculated before starting the re-layout to ensure that unnecessary re-layout does not occur
- Zone reconstruction occurs when a drive has been removed and there is enough space on the remaining drives for ideal zone re-layout or the drive has been replaced with a new drive of larger size
- the following describes the general process of dual drive mirror reconstruction in accordance with an exemplary embodiment of the invention: i) assume the single drive mirror has data 'A' and missing mirror 'B'; ii) allocate 12 regions 'C' on a drive other than that containing 'A'; iii) copy data 'A' to 'C'; iv) any writes made to data already copied must be mirrored to the appropriate place in 'C'; v) when the copy is complete, update zone table pointers to 'B' with pointers to 'C'
- four-drive reconstruction can only occur if the removed drive is replaced by another drive
- the reconstruction consists of allocating six regions on the new drive and reconstructing the missing data from the other three region sets
- When a drive is removed and there is no room for re-layout, the array will continue to operate in degraded mode until either the old drive is plugged back in or the drive is replaced with a new one. If a new one is plugged in, then the drive set should be rebuilt; in this case, data will be re-laid out. If the old disk is placed back into the array, it will no longer be part of the current disk set and will be regarded as a new disk. However, if a new disk is not placed in the array and the old one is put back in, the old one will still be recognized as being a member of the disk set, albeit an out-of-date member. In this case, any zones that have already been re-laid out will keep their new configuration and the regions on the old disk will be freed. Any zone that has not been re-laid out will still be pointing at the appropriate regions of the old disk. However, as some writes may have been performed to the degraded zones, these zones need to be refreshed. Rather than logging every write that has occurred, degraded regions
- the hash mechanism discussed above provides an additional mechanism for data corruption detection over that which is available under RAID
- a hash value is computed for the chunk and stored. Any time the chunk is read, a hash value for the retrieved chunk can be computed and compared with the stored hash value. If the hash values do not match (indicating a corrupted chunk), then the chunk data can be recovered from redundant data
- a regular scan of the disks will be performed to find and correct corrupted data as soon as possible. It will also, optionally, allow a check to be performed on reads from the array
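- the verify-on-read idea can be sketched as follows; the hash function is passed in as a parameter because the embodiment's own choice (SHA-1 over a cluster) is described elsewhere, and on a mismatch the caller would rebuild the chunk from its redundant copy

```c
#include <stdint.h>
#include <stddef.h>

typedef uint64_t (*hash_fn)(const uint8_t *data, size_t len);

/* Returns 0 if the chunk read back matches its stored hash, or -1 if
 * corruption is detected and the chunk should be recovered from redundancy. */
static int verify_chunk(const uint8_t *chunk, size_t len,
                        uint64_t stored_hash, hash_fn hash)
{
    return hash(chunk, len) == stored_hash ? 0 : -1;
}
```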
- the storage array consists of one or more drive slots. Each drive slot can either be empty or contain a hard disk drive. Each drive slot has a dedicated indicator capable of indicating four states: Off, OK, Degraded and Fail
- the states are interpreted generally as follows
- red/amber/green light emitting diodes are used as the indicators
- the LEDs are interpreted generally as follows
- Fig 9 shows an exemplary array having available storage space and operating in a fault-tolerant manner, in accordance with an exemplary embodiment of the present invention
- Slots B, C, and D are populated with storage devices, and there is sufficient storage space available to store additional data redundantly
- the indicators for slots B, C, and D are green (indicating that these storage devices are operating correctly, the array data is redundant, and the array has available disk space), and the indicator for slot A is off (indicating that no storage device needs to be populated in slot A)
- Fig 10 shows an exemplary array that does not have enough space to maintain redundant data storage and more space must be added, in accordance with an exemplary embodiment of the present invention
- Slots B, C, and D are populated with storage devices
- the storage devices in slots C and D are full
- the indicators for slots B, C, and D are green (indicating that these storage devices are operating correctly), and the indicator for slot A is red (indicating that the array does not have enough space to maintain redundant data storage and a storage device should be populated in slot A)
- Fig 11 shows an exemplary array that would be unable to maintain redundant data in the event of a failure, in accordance with an exemplary embodiment of the present invention. Slots A, B, C, and D are populated with storage devices
- the storage devices in slots C and D are full
- the indicators for slots A, B, and C are green (indicating that they are operating correctly), and the indicator for slot D is amber (indicating that the storage device in slot D should be replaced with a storage device having greater storage capacity)
- Fig 12 shows an exemplary array in which a storage device has failed, in accordance with an exemplary embodiment of the present invention. Slots B, C, and D are populated with storage devices. The storage device in slot C has failed
- the indicators for slots B and D are green (indicating that they are operating correctly), the indicator for slot C is red (indicating that the storage device in slot C should be replaced), and the indicator for slot A is off (indicating that no storage device needs to be populated in slot A)
- the software design is based on six software layers, which span the logical architecture from physically accessing the disks to communicating with the host computing system
- a file system resides on a host server, such as a Windows, Linux, or Apple server, and accesses the storage array as a USB or iSCSI device
- Physical disk requests arriving over the host interface are processed by the Host Request Manager (HRM)
- HRM Host Request Manager
- a Host I/O interface coordinates the presentation of a host USB or iSCSI interface to the host, and interfaces with the HRM
- the HRM coordinates data read/write requests from the host I/O interface, dispatches read and write requests, and co-ordinates the retiring of these requests back to the host as they are completed
- An overarching aim of the storage array is to ensure that once data is accepted by the system, it is stored in a reliable fashion, making use of the maximum amount of redundancy the system currently stores. As the array changes physical configuration, data is re-organized so as to maintain (and possibly maximize) redundancy
- simple hash based compression is used to reduce the amount of storage used
- Disks may be attached via various interfaces, such as ATA tunneled over a USB interface
- Sectors on the disks are organized into regions, zones, and clusters, each of which has a different logical role
- Regions represent a set of contiguous physical blocks on a disk. On a four drive system, each region is 1/12 GB in size and represents the minimal unit of redundancy. If a sector in a region is found to be physically damaged, the whole region will be abandoned
- Zones represent units of redundancy
- a zone will consist of a number of regions, possibly on different disks, to provide the appropriate amount of redundancy. Zones will provide 1GB of data capacity, but may require more regions in order to provide the redundancy: 1GB with no redundancy requires one set of 12 regions (1GB), a 1GB mirrored zone will require 2 sets of 1GB regions (24 regions), and a 1GB 3-disk striped zone will require 3 sets of 0.5GB regions (18 regions). Different zones will have different redundancy characteristics
- Clusters represent the basic unit of compression, and are the unit size within zones. They are currently 4KB (8 x 512-byte sectors) in size. Many clusters on a disk will likely contain the same data
- a cluster access table (CAT) is used to track the usage of clusters via a hashing function The CAT translates between logical host address and the location of the appropriate cluster in the zone
- the CAT table resides in its own zone. If it exceeds the size of the zone, an additional zone will be used, and a table will be used to map logical address to the zone for that part of the CAT. Alternatively, zones are pre-allocated to contain the CAT table
- In order to reduce host write latency and to ensure data reliability, a journal manager will record all write requests (either to disk, or to NVRAM). If the system is rebooted, journal entries will be committed on reboot. Disks may come and go, or regions may be retired if they are found to have corruption. In either of these situations, a layout manager will be able to re-organize regions within a zone in order to change its redundancy type, or change the regional composition of a zone (should a region be corrupted)
- a garbage collector (either located on the host or in firmware) will analyze the file system to determine which clusters have been freed, and remove them from the hash table
- the Zones Manager allocates/frees chunks of sectors called Zones and knows about SDM, DDM, SD3 etc. in order to deal with errors and error recovery
- Fig 13 shows a module hierarchy representing the different software layers and how they relate to one another. Software layering is preferably rigid in order to present clear APIs and delineation
- the Garbage Collector frees up clusters which are no longer used by the host file system. For example, when a file is deleted, the clusters that were used to contain the file are preferably freed
- the Journal Manager provides a form of journaling of writes so that pending writes are not lost in the case of a power failure or other error condition
- the Layout Manager provides run-time re-layout of the Zones vis-a-vis their Regions. This may occur as a result of disk insertion/removal or failure
- the Cluster Manager allocates clusters within the set of data Zones
- the Disk Utilization Daemon checks for free disk space on a periodic basis
- the Lock Table deals with read after write collision issues
- the Host Request Manager deals with the read/write requests from the Host and Garbage Collector Writes are passed to the Journal Manager, whereas Reads are processed via the Cluster Access Table (CAT) Management layer
- the system operates on a cluster of data at any time (e.g., 8 physical sectors), and this is the unit that is hashed
- the SHA1 algorithm is used to generate a 160-bit hash. This has a number of benefits, including good uniqueness, and being supported on-chip in a number of processors. All 160 bits will be stored in the hash record, but only the least significant 16 bits will be used as an index into a hash table; other instances matching the lowest 16 bits will be chained via a linked list
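- the indexing scheme described above might look like the following sketch, where the full 160-bit digest is kept in the record and only 16 bits select the table row; which end of the digest counts as "least significant" is an assumption here

```c
#include <stdint.h>

#define HASH_ROWS 65536              /* 64K-row table indexed by 16 bits */

struct hash_record {
    uint8_t  full_digest[20];        /* complete 160-bit SHA-1 hash */
    uint32_t ref_count;
    struct hash_record *next;        /* chain for rows shared by many digests */
};

static uint16_t hash_row(const uint8_t digest[20])
{
    /* assume the final two digest bytes are the least significant ones */
    return (uint16_t)(digest[18] | (digest[19] << 8));
}
```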
- hash analysis is not permitted to happen when writing a cluster to disk. Instead, hash analysis will occur as a background activity by the hash manager
- the data being written will need to be merged with the existing data stored in the cluster
- LSA logical sector address
- the CAT entry for the cluster is located
- the hash key, zone and cluster offset information is obtained from this record, which can then be used to search the hash table to find a match; this is the cluster. It might well be necessary to doubly hash the hash table, once via the SHA1 digest and then by the zone/cluster offset, to improve the speed of lookup of the correct hash entry
- if the hash record has already been used, the reference count is decremented. If the reference count is now zero, and there is no snapshot referenced by the hash entry, the hash entry and cluster can be freed back to their respective free lists
- the original cluster data is now merged with the update section of the cluster, and the data is re-hashed
- a new cluster is taken off the free-list, the merged data is written to the cluster, a new entry is added to the hash table, and the entry in the CAT table is updated to point to the new cluster
- the entry is also added to an internal queue to be processed by a background task. This task will compare the newly added cluster and hash entry with other hash entries that match the hash table row address, and will combine records if they are duplicates, freeing up hash entries and CAT table entries as appropriate. This ensures that write latency is not burdened by this activity. If a failure (e.g., a loss of power) occurs during this processing, the various tables can be deleted, with a resulting loss of data
- the tables should be managed in such a way that the final commit is atomic or the journal entry can be re-run if it did not complete fully
- the following is pseudocode for the write logic:
- originalCluster = catMgr.readCluster(catEntry)            // read the cluster currently referenced by the CAT entry
- originalHash = hashMgr.calcHash(originalCluster)          // hash of the existing cluster contents
- hashRecord = hashMgr.Lookup(originalHash, zone, offset)   // find the existing hash record for that cluster
- hashRecord.RefCount--; hashMgr.Update(hashRecord)         // drop the reference to the old contents
- mergedCluster = mergeCluster(originalCluster, newCluster) // merge the host update into the existing data
- newHash = hashMgr.calcHash(mergedCluster)                 // hash of the merged cluster
- newCluster = clusterMgr.AllocateCluster(zone, offset); clusterMgr.write(newCluster, mergedCluster); zoneMgr.write(newCluster, mergedCluster)  // allocate a cluster and write the merged data out
- Read requests are also processed one cluster (as opposed to "sector") at a time. Read requests do not go through the hash-related processing outlined above. Instead, the host logical sector address is used to reference the CAT and obtain a Zone number and cluster offset into the Zone. Read requests should look up the CAT table entry in the CAT Cache, and must be delayed if the write-in-progress bit is set. Other reads/writes may proceed unimpeded
- when a cluster is read, it will be hashed, and the hash compared with the SHA1 hash value stored in the hash record. This will require using the hash, zone and cluster offset as a search key into the hash table
- Clusters are allocated to use as few Zones as possible. This is because Zones correspond directly to disk drive usage. For every Zone, there are two or more Regions on the hard drive array. By minimizing the number of Zones, the number of physical Regions is minimized and hence the consumption of space on the hard drive array is reduced
- the Cluster Manager allocates clusters from the set of Data Zones
- a linked list is used to keep track of free clusters in a Zone
- the free cluster information is stored as a bit map (32KB per Zone) on disk
- the linked list is constructed dynamically from the bitmap. Initially, a linked list of a certain number of free clusters is created in memory. When clusters are allocated, the list shrinks. At a predetermined low-water mark, new linked list nodes representing free clusters are extracted from the bitmap on disk. In this way, the bitmap does not need to be parsed in order to find a free cluster for allocation
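- a compact sketch of the low-water-mark refill idea follows; the sizes, thresholds and the in-memory array standing in for the linked list are all illustrative assumptions

```c
#include <stdint.h>
#include <stddef.h>

#define CLUSTERS       1024
#define LOW_WATER_MARK 8
#define REFILL_BATCH   32

static uint8_t  bitmap[CLUSTERS / 8];   /* assumed encoding: bit set = cluster in use */
static uint32_t free_list[CLUSTERS];    /* stands in for the in-memory linked list */
static size_t   free_count;
static uint32_t scan_pos;               /* next bitmap position to examine */

static void refill_from_bitmap(size_t want)
{
    while (want && scan_pos < CLUSTERS) {
        if (!((bitmap[scan_pos / 8] >> (scan_pos % 8)) & 1)) {
            bitmap[scan_pos / 8] |= (uint8_t)(1u << (scan_pos % 8));  /* reserve it */
            free_list[free_count++] = scan_pos;
            want--;
        }
        scan_pos++;
    }
}

static int alloc_cluster(uint32_t *out)
{
    if (free_count < LOW_WATER_MARK)
        refill_from_bitmap(REFILL_BATCH);   /* done as a background task in reality */
    if (free_count == 0)
        return -1;                          /* no free clusters available */
    *out = free_list[--free_count];
    return 0;
}
```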
- the hash table is a 64K table of records (indexed by the lower 16 bits of the hash) and has the following format
- a cluster of all zeros may be fairly common, so the all-zeros case may be treated as a special case, for example, such that it can never be deleted (so wrapping the count would not be a problem)
- a linked list of free hash records is used when multiple hashes have the same least significant hash bits, or when two hash entries point to different data clusters. In either case, a free hash record will be taken from the list and linked via the pNextHash pointer
- the hash manager will tidy up entries added to the hash table, and will combine identical clusters on the disk. As new hash records are added to the hash table, a message will be posted to the hash manager; this will be done automatically by the hash manager. As a background activity, the hash manager will process entries on its queue. It will compare the full hash value to see if it matches any existing hash records, and if it does, it will also compare the complete cluster data. If the clusters match, the new hash record can be discarded back to the free queue, the hash record count will be incremented, and the duplicate cluster will be returned to the cluster free queue. The hash manager must take care to propagate the snapshot bit forward when combining records
- a Cluster Access Table contains indirect pointers. The pointers point to data clusters (with 0 being the first data cluster) within Zones
- One CAT entry references a single data cluster (tentatively 4KB in size). CATs are used (in conjunction with hashing) in order to reduce the disk usage requirements when there is a lot of repetitive data
- a single CAT always represents a contiguous block of storage. CATs are contained within non-data Zones
- Each CAT entry is 48 bits. The following table shows how each entry is laid out (assuming each data Zone contains 1GB of data)
- the CAT table for a 2TB array is currently ~4GB in size
- Redundant data is referenced by more than one entry in the CAT. Two logical clusters contain the same data, so their CAT entries are pointed to the same physical cluster
- the Hash Key entry contains the 16-bit extract of the 160-bit SHA1 hash value of the entire cluster. This entry is used to update the hash table during a write operation. There are enough bits in each entry in the CAT to reference 16TB of data
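- since the layout table itself is not reproduced here, the packing sketch below uses assumed field widths (14-bit zone number, 18-bit cluster offset, 16-bit hash key, which total 48 bits and are consistent with 16TB of 1GB zones and 4KB clusters) purely to show the mechanics of a 48-bit CAT entry

```c
#include <stdint.h>

struct cat_fields {
    uint16_t zone;        /* data Zone number       (assumed 14 bits) */
    uint32_t offset;      /* cluster offset in Zone (assumed 18 bits) */
    uint16_t hash_key;    /* low 16 bits of the SHA-1 digest */
};

static uint64_t cat_pack(struct cat_fields f)
{
    return ((uint64_t)(f.zone   & 0x3FFF))         |
           ((uint64_t)(f.offset & 0x3FFFF) << 14)  |
           ((uint64_t)f.hash_key           << 32);
}

static struct cat_fields cat_unpack(uint64_t e)
{
    struct cat_fields f;
    f.zone     = (uint16_t)(e & 0x3FFF);
    f.offset   = (uint32_t)((e >> 14) & 0x3FFFF);
    f.hash_key = (uint16_t)((e >> 32) & 0xFFFF);
    return f;
}
```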
- a Host Logical Sector Translation Table is used to translate a Host Logical Sector Address into a Zone number
- the portion of the CAT that corresponds to the Host Logical Sector Address will reside in this zone. Note that each CAT entry represents a cluster size of 4096 bytes; this is eight 512-byte sectors
- the following shows a representation of the host logical sector translation table
- Zones can be pre-allocated to hold the entire CAT. Alternatively, Zones can be allocated for the CAT as more entries to the CAT are required. Since the CAT maps the 2TB virtual disk to the host sector address space, it is likely that a large part of the CAT will be referenced during hard disk partitioning or formatting by the host. Because of this, the Zones may be pre-allocated
- the CAT is a large 1GB/zone table
- the working set of clusters being used will be a sparse set from this large table
- active entries may be cached in processor memory rather than always reading them from the disk
- the cache needs to be at least as large as the maximum number of outstanding write requests
- Entries in the cache will be a cluster in size (i.e., 4K). There is a need to know whether there is a write-in-progress operation on a cluster. This indication can be stored as a flag in the cache entry for the cluster
- the following table shows the format of a CAT cache entry
- the write-in-progress flag in the cache entry has two implications. First, it indicates that a write is in progress, and any reads (or additional writes) on this cluster must be held off until the write has completed. Secondly, this entry in the cache must not be flushed while the bit is set; this is partly to protect the state of the bit, and also to reflect the fact that this cluster is currently in use. In addition, this means that the size of the cache must be at least as large as the number of outstanding write operations
- One advantage of storing the write-in-progress indicator in the cache entry for the cluster is that it reflects the fact that the operation is current, saves having another table, and saves an additional hash-based lookup or table walk to check this bit
- the cache can be a write-delayed cache. It is only necessary to write a cache entry back to disk when the write operation has completed, although it might be beneficial to have it written back earlier. A hash function or other mechanism could be used to increase the number of outstanding write entries that can be hashed
- the table size would be 4096 + 96 (4192 bytes). Assuming it is necessary to have a cache size of 250 entries, the cache would occupy approximately 1MB. It is possible to calculate whether the first and last entry is incomplete or not by appropriate masking of the logical CAT entry address. The caching lookup routine should do this prior to loading an entry and should load the required CAT cluster
- When the host sends a sector (or cluster) read request, it sends over the logical sector address
- the logical sector address is used as an offset into the CAT in order to obtain the offset of the cluster in the Zone that contains the actual data that is requested by the host
- the result is a Zone number and an offset into that Zone. That information is passed to the Layer 2 software, which then extracts the raw cluster(s) from the drive(s)
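- the read-path translation just described can be sketched as follows; the CAT lookup is stubbed out, and the structure and helper names are assumptions

```c
#include <stdint.h>

#define SECTORS_PER_CLUSTER 8          /* 4096-byte cluster / 512-byte sectors */

struct location {
    uint16_t zone;
    uint32_t cluster_offset;
    uint8_t  sector_in_cluster;
};

/* Stub standing in for the real CAT lookup (which would read the CAT zone). */
static void cat_lookup(uint64_t cat_index, uint16_t *zone, uint32_t *cluster_offset)
{
    *zone = 0;
    *cluster_offset = (uint32_t)cat_index;    /* placeholder mapping */
}

/* Host logical sector address -> CAT index -> (Zone, cluster offset). */
static struct location translate_lsa(uint64_t host_lsa)
{
    struct location loc;
    uint64_t cat_index = host_lsa / SECTORS_PER_CLUSTER;
    loc.sector_in_cluster = (uint8_t)(host_lsa % SECTORS_PER_CLUSTER);
    cat_lookup(cat_index, &loc.zone, &loc.cluster_offset);
    return loc;
}
```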
- the journal manager is a bi-level write journaling system.
- An aim of the system is to ensure that write requests can be accepted from the host and quickly indicate back to the host that the data has been accepted while ensuring its integrity
- the system needs to ensure that there will be no corruption or loss of any block level data or system metadata (e.g., CAT and Hash table entries) in the event of a system reset during any disk write
- the J1 journal manager caches all write requests from the hosts to disk as quickly as possible. Once the write has successfully completed (i.e., the data has been accepted by the array), the host can be signaled to indicate that the operation has completed
- the journal entry allows recovery of write requests when recovering from a failure. Journal records consist of the data to be written to disk, and the meta-data associated with the write transaction
- a journal record will be written to a journal queue on a non-mirrored zone
- Each record will be a sector in size, and aligned to a sector boundary, in order to reduce the risk that a failure during a journal write would corrupt a previous journal entry. Journal entries will contain a unique, incrementing sequence count at the end of the record so that the end of a queue can easily be identified
- Journal write operations will happen synchronously within a host queue processing thread
- Journal writes must be ordered as they are written to disk, so only one thread may write to the journal at any time
- the address of the journal entry in the J1 table can be used as a unique identifier so that the J1 journal entry can be correlated with entries in the J2 journal
- journal record will be aligned to a sector boundary
- a journal record might contain an array of zone/offset/size tuples
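- a possible shape of such a sector-sized J1 record is sketched below; the field names, the tuple count and the exact layout are assumptions, with only the one-sector sizing and the trailing sequence count taken from the description above

```c
#include <stdint.h>

#define SECTOR_SIZE 512
#define MAX_EXTENTS 8                          /* assumed tuple count per record */

struct extent { uint32_t zone, offset, size; };   /* zone/offset/size tuple */

struct j1_record {
    uint64_t      host_lsa;                    /* host address being written */
    struct extent extents[MAX_EXTENTS];        /* where the data lands */
    uint32_t      extent_count;
    uint8_t       pad[SECTOR_SIZE - sizeof(uint64_t)
                      - MAX_EXTENTS * sizeof(struct extent)
                      - sizeof(uint32_t) - sizeof(uint64_t)];
    uint64_t      sequence;                    /* incrementing count at the end */
};

/* The record must be exactly one sector, so that a torn write cannot damage
 * the previous journal entry. */
_Static_assert(sizeof(struct j1_record) == SECTOR_SIZE, "J1 record must be one sector");
```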
- FIG 15 shows a journal table update in accordance with an exemplary embodiment of the present invention. Specifically, when a host write request is received, the journal table is updated, one or more clusters are allocated, and data is written to the cluster(s)
- the J2 journal exists logically at layer 3. It is used to journal meta-data updates that would involve writes through the zone manager. When playback of a journal entry occurs, it will use zone manager methods
- the journal itself can be stored in a specialized region Given the short lifespan of journal entries, they will not be mirrored
- A simple approach for the J2 journal is to contain a single record. As soon as the record is committed to disk, it is replayed, updating the structures on disk. It is possible to have multiple J2 records and to have a background task committing updated records on disk. In this case, close attention will need to be paid to the interaction between the journal and any caching algorithms associated with the various data structures
- journal entries should be committed as soon as they have been submitted
- the J2 journal is analyzed, and any records will be replayed. If a journal entry is correlated with a J1 journal entry, the J1 entry will be marked as completed and can be removed. Once all J2 journal entries have been completed, the meta-data is in a reliable state and any remaining J1 journal entries can be processed
- the J2 journal record includes the following information
- This scheme could operate similarly to the J1 journal scheme, for example, with a sequence number to identify the end of a J2 journal entry and placing J2 journal entries on sector boundaries
- the J2 journal could store the whole sector that was written so that the sector could be re-written from this information if necessary
- a CRC calculated for each modified sector could be stored in the J2 record and compared with a CRC computed from the sector on disk (e.g., by the zone manager) in order to determine whether a replay of the write operation is required
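- the CRC comparison could look like the sketch below; the patent does not name a particular CRC, so the use of the IEEE CRC-32 polynomial here is an assumption

```c
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC-32 (IEEE 802.3 polynomial, reflected form). */
static uint32_t crc32_ieee(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (uint32_t)-(int32_t)(crc & 1));
    }
    return ~crc;
}

/* Replay decision on recovery: skip the write if the sector on disk already
 * matches the CRC recorded in the J2 journal entry. */
static int needs_replay(uint32_t journaled_crc, const uint8_t sector_on_disk[512])
{
    return crc32_ieee(sector_on_disk, 512) != journaled_crc;
}
```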
- the different journals can be stored in different locations, so there will be an interface layer provided to write journal records to backing store
- the location should be non-volatile. Two candidates are hard disk and NVRAM. If the J1 journal is stored to hard disk, it will be stored in a J1 journal non-mirrored zone
- the J1 journal is a candidate for storing in NVRAM
- the J2 journal should be stored on disk, although it can be stored in a specialized region (i.e., not redundant, as it has a short lifespan)
- An advantage of storing the J2 journal on disk is that, if there is a system reset during an internal data structure update, the data structures can be returned to a consistent state (even if the unit is left un-powered for a long period of time)
- the Zones Manager (ZM) allocates Zones that are needed by higher level software. Requests to the ZM include: a) Allocate Zone; b) De-allocate/Free Zone; c) Control data read/write pass through to L1 (?); d) Read/Write cluster in a Zone
- the ZM manages the redundancy mechanisms (as a function of the number of drives and their relative sizes) and handles mirroring, striping, and other redundancy schemes for data reads/writes
- When the ZM needs to allocate a Zone, it will request an allocation of 2 or more sets of Regions. For example, a Zone may be allocated for 1GB of data
- the Regions that make up this Zone will be able to contain 1GB of data including redundancy data
- the Zone will be made up of 2 sets of Regions of 1GB each
- a 3-disk striping mechanism utilizes 3 sets of Regions of 1/2 GB each
- the ZM uses the ZR translation table (6) to find out the location (drive number and start Region number) of each set of Regions that makes up the Zone. Assuming a 1/12GB Region size, a maximum of 24 Regions will be needed; 24 Regions make up 2 x 1GB Zones. So the ZR translation table contains 24 columns that provide drive/region data
- the ZM works generally as follows: a) in the case of SDM (single drive mirroring), 24 columns are used. The drive numbers are the same in all columns. Each entry corresponds to a physical Region on a physical drive that makes up the Zone. The first 12 entries point to Regions that contain one copy of the data, and the last 12 entries point to the Regions containing the second copy of the data. b) DDM (dual drive mirroring) is the same as SDM except that the drive number in the first 12 entries is different from that in the last 12 entries. c) for striping, three or more columns may be used. For example, if striping is used across three drives, six Regions may be needed from three different drives (i.e., 18 entries are used), with the first six entries containing the same drive number, the next six entries containing another drive number, and the following six entries containing a third drive number; the unused entries are zeroed
- the following table shows a representation of the zone region translation table
- When a read/write request comes in, the ZM is provided with the Zone number and an offset into that Zone. The ZM looks in the ZR translation table to figure out the redundancy mechanism for that Zone and uses the offset to calculate which Drive/Region contains the sector that must be read/written. The Drive/Region information is then provided to the L1 layer to do the actual read/write. An additional possible entry in the Usage column is "Free"; "Free" indicates that the Zone is defined but currently not used
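- the redirection described above is sketched below for a single copy of the data; the structures are simplified assumptions, and a real implementation would also apply the Zone's redundancy mechanism (mirror or stripe) when choosing targets

```c
#include <stdint.h>

#define SECTORS_PER_REGION 174763ULL   /* 1/12 GB Regions, as stated in the text */
#define REGIONS_PER_SET    12          /* 12 Regions hold one 1GB copy of a Zone */

struct region_ref { uint8_t drive; uint32_t region; };   /* one ZR table cell */

struct zr_row {                        /* simplified single-copy view of a ZR row */
    struct region_ref copy0[REGIONS_PER_SET];
};

struct physical { uint8_t drive; uint32_t region; uint64_t sector_in_region; };

/* Redirect a sector offset within a Zone to the Drive/Region that holds it. */
static struct physical zone_sector_to_physical(const struct zr_row *row,
                                               uint64_t zone_sector)
{
    struct physical p;
    uint32_t idx = (uint32_t)(zone_sector / SECTORS_PER_REGION);  /* which Region */
    p.drive            = row->copy0[idx].drive;
    p.region           = row->copy0[idx].region;
    p.sector_in_region = zone_sector % SECTORS_PER_REGION;
    return p;
}
```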
- the cluster manager allocates and de-allocates clusters within the set of data Zones
- the Layout Manager provides run-time re-layout of the Zones vis-a-vis their Regions. This may occur as a result of disk insertion/removal or failure
- the Layer 1 (L1) software knows about physical drives and physical sectors. Among other things, the L1 software allocates Regions from physical drives for use by the Zones Manager. In this exemplary embodiment, each Region has a size of 1/12GB (i.e., 174763 sectors) for a four-drive array system. A system with a larger maximum number of drives (8, 12 or 16) will have a different Region size
- This Region scheme allows us to provide better utilization of disk space when Zones get moved around or reconfigured, e.g., from mirroring to striping
- the L1 software keeps track of available space on the physical drives with a bitmap of Regions. Each drive has one bitmap. Each Region is represented by two bits in the bitmap in order to track if the Region is free, used, or bad
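- a two-bits-per-Region bitmap can be manipulated as in the sketch below; the particular two-bit encoding of the free/used/bad states is an assumption

```c
#include <stdint.h>

enum region_state { REGION_FREE = 0, REGION_USED = 1, REGION_BAD = 2 };

/* Four Regions per byte, two bits each. */
static enum region_state region_get(const uint8_t *bitmap, uint32_t region)
{
    return (enum region_state)((bitmap[region / 4] >> ((region % 4) * 2)) & 0x3);
}

static void region_set(uint8_t *bitmap, uint32_t region, enum region_state s)
{
    uint32_t byte = region / 4, shift = (region % 4) * 2;
    bitmap[byte] = (uint8_t)((bitmap[byte] & ~(0x3 << shift)) | ((s & 0x3) << shift));
}
```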
- ZM L2 software
- Requests to L1 include: a) Data read/write (to a cluster within a group of Regions); b) Control data read/write (tables, data structures, DIC etc.); c) Allocate physical space for a Region (actual physical sectors within 1 drive); d) De-allocate Region; e) Raw read/write to physical clusters within a physical drive; f) Copy data from one Region to another; g) Mark region as bad
- the free region bitmap may be large, and therefore searches to find a free entry in the bitmap (worst case is that no entries are free) may be slow
- part of the bitmap can be preloaded into memory, and a linked list of free regions can be stored in memory There is a list for each active zone If a low water mark on the list is reached, more free entries can be read from the disk as a background activity
- the Disk Manager operates at layer 0 As shown in the following table, there are two sub-layers, specifically an abstraction layer and the device drivers that communicate with the physical storage array
- the Device Drivers layer may also contain several layers For example, for a storage array using USB drives, there is an ATA or SCSI stack on top of the USB transport layer
- the abstraction layer provides basic read/write functions that are independent of the kinds of drives used in the storage array
- One or more disk access queues may be used to queue disk access requests Disk access rates will be one of the key performance bottlenecks in our system We will want to ensure that the disk interface is kept as busy as possible at all times so as to reduce general system latency and improve performance Requests to the disk interface should have an asynchronous interface, with a callback handler to complete the operation when the disk operation has finished Completion of one disk request will automatically initiate the next request on the queue
- Layer 1 will reference drives as logical drive numbers
- Layer 0 will translate logical drive numbers to physical drive references (e g , /dev/sda or file device number as a result of an open() call)
- each disk is divided into Regions of fixed size
- each Region has a size of 1/12GB (i e , 174763 sectors) for a four-drive array system.
- a system with a larger maximum number of drives (8, 12 or 16) will have a different Region size
- Region numbers 0 and 1 are reserved for use by the Regions Manager and are not used for allocation
- Region number 1 is a mirror of Region number 0
- internal data used by the Regions Manager for a given hard disk is stored in Region numbers 0 and 1 of this hard disk This information is not duplicated (or mirrored) to other drives If there are errors in either Region 0 or 1, other Regions can be allocated to hold the data
- Disk Information Structure points to these Regions
- Each disk will contain a DIS that identifies the disk, the disk set to which it belongs, and the layout information for the disk
- the first sector on the hard disk is reserved
- the DIS is stored in the first non-bad cluster after the first sector
- the DIS is contained in 1KB worth of data
- There are two copies of the DIS The copies of the DIS will be stored on the disk to which it belongs
- every disk in the system will contain a copy of all the DISs of the disks in the system.
- the following table shows the DIS format
- Regions Manager stores its internal data in a regions information structure
- the following table shows the regions information structure format
- the zones information structure provides information on where the Zones Manager can find the Zones Table
- the following shows the zones information structure format
- the Zones Table Region contains the Zone tables and other tables used by the high level managers. These will be protected using mirroring.
- the following table shows the zones table node format
- The linked list of Zones Table Nodes is placed after the ZIS in the following manner
- Fig 16 shows the drive layout in accordance with an exemplary embodiment of the invention
- the first two regions are copies of one another
- a third (optional) Zones Table Region contains the Zone Tables
- only two of the drives contain a ZTR
- two Regions are used to hold the two (mirrored) copies of the ZTR
- the DIS contains information on the location of the RIS and the ZIS Note that the first copy of the RIS does not have to be in Region 0 (e g , could be located in a different Region if Region 0 contains bad sectors)
- the Zones Manager needs to load the Zones Tables on system start up To do that, it extracts the Region number and offset from the DISs This will point to the start of the ZIS
- Certain modules (e.g., the CAT Manager) store their control structures and data tables in Zones. All control structures for modules in Layer 3 and higher are referenced from structures that are stored in Zone 0. This means, for example, that the actual CAT (Cluster Allocation Tables) locations are referenced from the data structures stored in Zone 0.
- the following table shows the zone 0 information table format
- the CAT linked list is a linked list of nodes describing the Zones that contain the CAT
- the following table shows the CAT Linked List node format
- the hash table linked list is a linked list of nodes that describe the Zones which hold the Hash Table
- the following table shows the Hash Table Linked List node format
- Fig 17 demonstrates the layout of Zone 0 and how other zones are referenced, in accordance with an exemplary embodiment of the invention
- a Redundant set is a set of sectors/clusters that provides redundancy for a set of data. Backing up a Region involves copying the contents of a Region to another Region.
- In the case of a data read error, the lower level software (Disk Manager or Device Driver) retries the read request two additional times after an initial failed attempt. The failure status is passed back up to the Zones Manager. The Zones Manager then attempts to reconstruct the data that is requested (by the read) from the redundant clusters in the disk array.
- the redundant data can be either a mirrored cluster (for SDM, DDM) or a set of clusters including parity (for a striped implementation)
- the reconstructed data is then passed up back to the host If the ZM is unable to reconstruct the data, then a read error is passed up back to the host
- the Zones Manager sends an Error Notification Packet to the Error Manager Fig 18 demonstrates read error handling in accordance with an exemplary embodiment of the invention
- Fig 19 demonstrates write error handling in accordance with an exemplary embodiment of the invention
- the Error Manager (EM) disables access to the clusters corresponding to this part of the Zone. It then updates the Zones Table to point to the newly allocated Regions. Subsequently, accesses to the clusters are re-enabled.
- This exemplary embodiment is designed to support eight snapshots (which allows use of one byte to indicate whether hash/cluster entries are used by a particular snapshot instance). There are two tables involved with snapshots:
- 1. a per-snapshot CAT table, which needs to exist to capture the relationship between logical sector addresses and the cluster on the disk that contains the data for that LSA. Ultimately, the per-snapshot CAT must be a copy of the CAT at the moment the snapshot was taken.
- 2. the system hash table, which maps between hash values and a data cluster
- the hash function returns the same results regardless of which snapshot instance is being used, and as a result is common across all snapshots. Consequently, this table must understand whether a unique cluster is being used by any snapshots. A hash cluster entry cannot be freed, or replaced with new data, unless there are no snapshots using the hash entry.
- the hash table can be walked as a background activity
- a second CAT zone could be written whenever the main CAT is being updated. These updates could be queued, and the shadow CAT could be updated as another task. In order to snapshot, the shadow CAT becomes the snapshot CAT.
- a background process can then be kicked off to copy this snapshot table to a new zone, which becomes the new snapshot CAT
- a queue could be used so that the shadow CAT queue is not processed until the copy of the CAT has completed. If a failure were to occur before updating the shadow CAT (in which case entries in the queue may be lost), re-shadowing from the primary CAT table could be performed before the array is brought online.
- FIG 27 is a conceptual block diagram of a computer system 2700 in accordance with an exemplary embodiment of the present invention
- the computer system 2700 includes, among other things, a host computer 2710 and a storage system 2720
- the host computer 2710 includes, among other things, a host operating system (OS) 2712 and a host filesystem 2711
- the storage system 2720 includes, among other things, a filesystem-aware storage controller 2721 and storage 2722 (e.g., an array including one or more populated disk drives). Storage 2722 is used to store, among other things, the user data 2724 and the host filesystem data structures 2723.
- the filesystem-aware storage controller 2721 generally needs to have a sufficient understanding of the inner workings of the host filesystem(s) in order to locate and analyze the host filesystem data structures
- different filesystems have different data structures and operate in different ways, and these differences can affect design/implementation choices
- the filesystem-aware storage controller 2721 locates host filesystem data structures 2723 in storage 2722 and analyzes the host filesystem data structures 2723 to determine storage usage of the host filesystem
- the filesystem-aware storage controller 2721 can then manage the user data storage 2724 based on such storage usage
- FIG 28 is a high-level logic flow diagram for the filesystem-aware storage controller 2721, in accordance with an exemplary embodiment of the present invention
- the filesystem-aware storage controller 2721 locates host filesystem data structures, analyzes the host filesystem data structures to determine host filesystem storage usage, and, in block 2806, manages user data storage based on the host filesystem storage usage
- FIG 29 is a logic flow diagram for locating the host filesystem data structures
- the filesystem-aware storage controller 2721 locates its partition table in the storage controller data structures 2726. In block 2904, the filesystem-aware storage controller 2721 parses the partition table to locate the OS partition containing the host OS data structures 2725. In block 2906, the filesystem-aware storage controller 2721 parses the host OS data structures 2725 to identify the host filesystem 2711 and locate the host filesystem data structures 2723.
- the filesystem-aware storage controller 2721 may use the host filesystem data structures 2723 for such things as identifying storage blocks no longer being used by the host filesystem 2711 and identifying the types of data stored by the host filesystem 2711. The filesystem-aware storage controller 2721 could then dynamically reclaim storage space no longer being used by the host filesystem 2711 and/or manage storage of the user data 2724 based on data types (e.g., store frequently accessed data uncompressed and in sequential blocks to facilitate access, store infrequently accessed data compressed and/or in non-sequential blocks, and apply different encoding schemes based on data types, to name but a few)
- FIG 30 is a logic flow diagram for reclaiming unused storage space, in accordance with an exemplary embodiment of the present invention
- In block 3002, the filesystem-aware storage controller 2721 identifies blocks that are marked as being unused by the host filesystem 2711. In block 3004, the filesystem-aware storage controller 2721 identifies any blocks that are marked as unused by the host filesystem 2711 but are marked as used by the filesystem-aware storage controller 2721. In the next block, the filesystem-aware storage controller 2721 reclaims any blocks that are marked as used by the filesystem-aware storage controller 2721 but are no longer being used by the host filesystem 2711 and makes the reclaimed storage space available for additional storage.
- FIG 31 is a logic flow diagram for managing storage of the user data 2724 based on data types, in accordance with an exemplary embodiment of the present invention
- the filesystem-aware storage controller 2721 identifies the data type associated with particular user data 2724. In block 3104, the filesystem-aware storage controller 2721 optionally stores the particular user data 2724 using a storage layout selected based on the data type. In block 3106, the filesystem-aware storage controller 2721 optionally encodes the particular user data 2724 using an encoding scheme (e.g., data compression and/or encryption) selected based on the data type. In this way, the filesystem-aware storage controller 2721 can store different types of data using different layouts and/or encoding schemes that are tailored to the data type.
- the garbage collector may be used to free up clusters which are no longer used by the host file system (e g , when a file is deleted)
- garbage collection works by finding free blocks, computing their host LSAs, and locating their CAT entries based on the LSAs. If there is no CAT entry for a particular LSA, then the cluster is already free. If, however, the CAT entry is located, the reference count is decremented, and the cluster is freed if the count hits zero.
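- A minimal sketch of the reclamation check just described is shown below; the dict-based stand-in for the CAT and the helper names are illustrative assumptions, not the patent's on-disk format.

```python
# Hypothetical sketch: release a cluster once no CAT reference remains.
def garbage_collect(free_lsas, cat, free_cluster):
    """For each host LSA reported free, release its cluster when unreferenced.

    cat: dict mapping host LSA -> {"cluster": id, "refcount": n} (illustrative).
    """
    for lsa in free_lsas:
        entry = cat.get(lsa)
        if entry is None:
            continue                        # no CAT entry: cluster already free
        entry["refcount"] -= 1
        if entry["refcount"] == 0:
            free_cluster(entry["cluster"])  # return the cluster to the free pool
            del cat[lsa]
```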
- One concern is that it may be difficult for the garbage collector to distinguish a block that the host filesystem has in use from one that it has previously used and at some point marked free
- the storage system allocates a cluster for the data as well as a CAT entry to describe it. From that point on, the cluster will generally appear to be in use, even if the host filesystem subsequently ceases to use its block.
- For example, certain host filesystems use a bitmap to track their used disk blocks. Initially, the bitmap will indicate all blocks are free, for example, by having all bits clear. As the filesystem is used, the host filesystem will allocate blocks through use of its free block bitmap. The storage system will associate physical storage with these filesystem allocations.
- If the host filesystem continues to allocate what are, from the storage system's point of view, new (i.e., previously unused) blocks, then the storage system will quickly run out of free clusters, subject to whatever space can be reclaimed via compression. For example, assuming a filesystem block is 4k, if the host allocates filesystem blocks 100 through 500, subsequently frees blocks 300 through 500, and then allocates blocks 1000 onward, the clusters backing the freed blocks 300 through 500 will remain allocated in the storage system unless they are reclaimed.
- the storage system may detect the release of host filesystem disk resources by accessing the host filesystem layout, parsing its free block bitmaps, and using that information to identify clusters that are no longer being used by the filesystem.
- In order for the storage system to be able to identify unused clusters in this way, the storage system must be able to locate and understand the free block bitmaps of the filesystem. Thus, the storage system will generally support a predetermined set of filesystems for which it "understands" the inner workings sufficiently to locate and utilize the free block bitmaps. For unsupported filesystems, the storage system would likely be unable to perform garbage collection and should therefore only advertise the real physical size of the array in order to avoid being overcommitted.
- the filesystem' s superblock (or an equivalent structure) needs to be located
- the partition table will be parsed in an attempt to locate the OS partition
- the OS partition will be parsed in an attempt to locate the superblock and thereby identify the filesystem type
- the layout can be parsed to find the free block bitmaps
- historical data of the host filesystem bitmap can be kept, for example, by making a copy of the free block bitmap that can be stored in a private, non-redundant zone and performing searches using the copy. Given the size of the bitmap, information may be kept for a relatively small number of clusters at a time rather than for the whole bitmap.
- the current free block bitmap can be compared, cluster-by-cluster, with the historical copy. Any bitmap entries transitioning from allocated to free can be identified, allowing the scavenging operation to be accurately directed to clusters that are good candidates for reclamation.
- the historical copy can be replaced with the current copy to maintain a rolling history of bitmap operations. Over time the copy of the free block bitmap will become a patchwork of temporally disjoint clusters, but since the current copy will always be used to locate free entries, this does not cause any problems. Under certain conditions, there could be a race condition regarding the free block bitmap.
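- The allocated-to-free transition scan can be sketched as below, assuming the two bitmap snapshots are equal-length byte strings in which bit i set means "block i allocated"; the helper name is illustrative only.

```python
# Hypothetical sketch: find blocks whose bit went from allocated (1) to free (0).
def freed_since(historical, current):
    """Yield block numbers that are good candidates for reclamation."""
    for byte_index, (old, new) in enumerate(zip(historical, current)):
        transitioned = old & ~new            # bits set before but clear now
        for bit in range(8):
            if transitioned & (1 << bit):
                yield byte_index * 8 + bit

candidates = list(freed_since(b"\xff\x0f", b"\xf0\x0f"))  # -> blocks 0, 1, 2, 3
```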
- garbage collection can be a fairly expensive operation, and since even lightweight scavenging will consume back-end I/O bandwidth, garbage collection should not be overused
- the garbage collector should be able to run in several modes ranging from a light background lazy scavenge to an aggressive heavyweight or even high priority scavenge
- the garbage collector could be run lightly when 30% of space is used or once per week at a minimum, run slightly more heavily when 50% of space is used, and run at a full high-priority scavenge when 90% or more of disk space is used
- the aggressiveness of the garbage collector could be controlled by limiting it to a target number of clusters to reclaim and perhaps a maximum permissible I/O count for each collection run
- the garbage collector could be configured to reclaim 1GB using no more than 10,000 I/Os. Failure to achieve the reclaim request could be used as feedback to the collector to operate more aggressively next time it is run. There may also be a "reclaim everything" mode that gives the garbage collector permission to parse the entire free block bitmap and reclaim every cluster that is no longer in use.
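- The usage-driven scheduling described above can be sketched as a simple policy function; the thresholds mirror the figures in the text, while the heavier-pass caps and the function itself are illustrative assumptions.

```python
# Hypothetical sketch: decide whether and how aggressively to run a collection pass.
def scavenge_policy(fraction_used):
    """Return (run, reclaim_target_bytes, io_budget) for one collection run."""
    GB = 1 << 30
    if fraction_used >= 0.90:
        return True, None, None          # full high-priority scavenge, no caps
    if fraction_used >= 0.50:
        return True, 2 * GB, 20000       # heavier pass (caps are illustrative)
    if fraction_used >= 0.30:
        return True, 1 * GB, 10000       # light pass: ~1GB in at most 10,000 I/Os
    return False, 0, 0                   # otherwise rely on the weekly minimum run
```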
- FIG 21 is a schematic block diagram showing the relevant components of a storage array in accordance with an exemplary embodiment of the present invention
- the storage array includes a chassis 2502 over which a storage manager 2504 communicates with a plurality of storage devices 2508-1 through 2508-N, which are coupled to the chassis respectively through a plurality of slots 2506-1 through 2506-N. Each slot 2506-1 through 2506-N may be associated with one or more indicators 2507-1 through 2507-N.
- the storage manager 2504 typically includes various hardware and software components for implementing the functionality described above. Hardware components typically include a memory for storing such things as program code, data structures, and data, as well as a microprocessor for executing the program code.
- journaling filesystems do not normally guarantee to preserve all of the user data for transactions that have already happened or guarantee to recover all of the metadata for such transactions, but generally only guarantee the ability to recover to a consistent state
- journaling filesystems often deploy some degree of asynchronicity between user data writes and metadata writes
- it is common for the metadata writes to disk to be performed lazily, such that there is a delay between a user data update and a corresponding metadata update
- Journal writes may also be performed lazily in some filesystems (such as NTFS according to Ed 4 of Microsoft Windows Internals)
- lazy metadata writes may be performed by play-out of the journal in a transaction-by-transaction manner, and that has considerable potential to push the metadata temporarily into states that are inconsistent with the user data already on-disk. An example of this would be a bitmap update showing a de-allocation after the host has already re-allocated the block and written new data to it.
- the scavenger could operate in a purely asynchronous manner
- the scavenger may be a purely asynchronous task that periodically scans the bitmap, either in whole or part, and compares the bitmap with information contained in the CAT to determine whether any of the storage array clusters can be reclaimed. Before checking a bitmap, the system may also check those blocks that contain the location of the bitmap in order to determine whether the bitmap has moved.
- One advantage of a purely asynchronous scavenger is that there is essentially no direct impact on processor overhead within the main data path, although it may involve substantial asynchronous disk I/O (e.g., for a 2TB volume logically divided into 4k clusters and having a 64MB bitmap, reading the whole bitmap would involve reading 64+MB of disk data every time the scavenger runs) and therefore may impact overall system performance depending on how often the scavenger runs. Therefore, the scavenger frequency may be varied depending on the amount of available storage space and the system load.
- the scavenger could operate in a partly synchronous, partly asynchronous manner
- the scavenger could monitor changes to the bitmap as they occur, for example, by adding some additional checks to the main write handling path
- the scavenger could construct a table at boot time that includes the LBA range(s) of interest (hereinafter referred to as the Bitmap Locator Table or BLT)
- before any partitions have been identified, the BLT would generally include only LBA 0
- the BLT would generally include LBA 0, the LBA(s) of every partition boot sector, the LBA(s) containing the bitmap metadata, and the LBA range(s) containing the bitmap data itself
- the main write handling path typically calls the scavenger with details of the write being handled, in which case the call would generally internally cross- reference the LBA(s) of the write request with the BLT with a view to identifying those writes which overlap with the LBA range(s) of interest
- the scavenger would then need to parse those writes, which could be mostly done with an asynchronous task (in which case key details would generally need to be stored for the asynchronous task, as discussed below), but with critical writes parsed inline (e g , if an update is potentially indicative of a relocated bitmap, that write could be parsed inline so that the BLT may be updated before any further writes are cross-referenced)
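- The write-path cross-reference against the BLT can be sketched as follows, assuming the BLT is held as a list of end-exclusive (start, end, immediate) LBA ranges; the names and callbacks are illustrative assumptions, not the patent's interfaces.

```python
# Hypothetical sketch: cross-reference an incoming host write against the BLT.
def on_host_write(lba, length, blt, parse_inline, mark_for_async):
    """Called from the main write path with the LBA range of a host write."""
    for start, end, immediate in blt:
        # does the write [lba, lba + length) overlap this range of interest?
        if lba < end and (lba + length) > start:
            if immediate:
                parse_inline(lba, length)    # e.g., partition table or bitmap record
            else:
                mark_for_async(lba, length)  # e.g., set the matching BBUB bit(s)
```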
- the frequency of the asynchronous task could be varied depending on the amount of available storage space and/or the system load
- Storage for the asynchronous task could be in the form of a queue
- a simple queue would allow queuing of multiple requests for the same block, which could occur because the semantics of a write cache makes it likely that a number of requests would point to the same data block in the cache (i e , the most recent data) and is inefficient because there is generally no reason to hold multiple requests representing the same LBA This could be alleviated by checking through the queue and removing earlier requests for the same block
- the queue should be provisioned with the expectation that it will reach its maximum size during periods of intense activity (which might be sustained for days) in which the asynchronous task is suppressed
- the maximum theoretical size of the queue is a product of the size of the LBA and the number of LBAs within the bitmap, which could result in a very large queue size (e.g., a 2TB volume has a 64MB bitmap)
- Alternate storage for the asynchronous task could be in the form of a bitmap of the bitmap (referred to hereinafter as the "Bitmap Block Updates Bitmap" or "BBUB”), with each bit representing one block of the real bitmap
- the size of the BBUB is essentially fixed, without regard to the frequency of the asynchronous task, and generally occupies less space than a queue (e g , the BBUB would occupy 16KB of memory for a 2TB volume, or 128KB for a 16TB volume)
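- The BBUB sizing can be reproduced with the sketch below, assuming one bit of the real host bitmap per 4K cluster and one BBUB bit per 512-byte block of that bitmap (which yields the 16KB-per-2TB figure above); the helper names are illustrative.

```python
# Hypothetical sketch: size the BBUB and mark a real-bitmap block as updated.
BITMAP_BLOCK = 512                                   # bytes per real-bitmap block

def bbub_size_bytes(volume_bytes, cluster_bytes=4096):
    bitmap_bytes = volume_bytes // cluster_bytes // 8           # 1 bit per cluster
    bitmap_blocks = (bitmap_bytes + BITMAP_BLOCK - 1) // BITMAP_BLOCK
    return (bitmap_blocks + 7) // 8                             # 1 BBUB bit per block

assert bbub_size_bytes(2 * 2**40) == 16 * 1024       # 16KB for a 2TB volume

def mark_bitmap_block_dirty(bbub, bitmap_block_index):
    bbub[bitmap_block_index // 8] |= 1 << (bitmap_block_index % 8)
```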
- the storage system can easily adjust the mapping of the bits in the BBUB, but will generally need to take care not to map pending requests to the new location before the host has copied the data across (in fact, it may be possible to zero out the bitmap on the assumption that the host filesystem will rewrite every LBA anyway)
- the scavenger could operate in a purely synchronous manner In this embodiment, the scavenger would process writes as they occur
- a purely synchronous embodiment avoids the complexities associated with operation of an asynchronous task and its associated storage, although it interjects overhead on the processor during the critical time of handling a write from the host, and additional logic and state information might be required to compensate for asynchronous metadata updates
- the storage system may keep some history of cluster accesses it performs (e g , whether or not it has recently accessed the user data in a cluster) and only reclaim a cluster if the cluster has been quiescent over some previous time interval to ensure that no metadata updates are pending for that cluster
- the storage system might require a cluster to be quiescent for at least one minute before performing any reclamation of the cluster (generally speaking, increasing the quiescent time reduces the risk of inappropriate reclamation but increases latency in reacting to data deletion, so there is a trade-off here)
- the storage system could track only cluster writes, although the storage system could additionally track cluster reads for thoroughness in assessing cluster activity, albeit at the expense of additional disk I/O
- the quiescent time could be a fixed value or could be different for different filesystems
- Cluster accesses could be tracked, for example, by writing a scavenger cycle number to the CAT as an indicator of access time relative to the scavenger runs
- Cluster accesses could alternatively be tracked by writing bits to the filesystem's bitmap prior to writing the data Any such modification of the filesystem's metadata would have to be coordinated carefully, though, in order to avoid any adverse interactions with filesystem operation
- Cluster access could alternatively be tracked using a bit for each cluster, block, or chunk (of whatever size)
- the bit would generally be set when that entity is accessed and might be reset when the scavenger completes its next run or when the scavenger next tries to reclaim the cluster
- the scavenger generally would only reclaim the cluster if this bit was already reset when trying to perform the reclamation, which would itself be driven by the corresponding bit in the real host filesystem bitmap being clear
- These bits could be kept together as a simple bitmap or could be added to the CAT as a distributed bitmap (requiring an additional one bit per CAT record)
- the simple bitmap approach may require an additional read-modify-write on most data write operations, potentially causing a decrease in performance of the main data path unless the bitmap is cached in memory (the bitmap could be cached in volatile memory, which could be problematic if the bitmap is lost due to an unexpected outage, or in non-volatile memory, which might necessitate a smaller bitmap)
- every scavenger run may be associated with a one-byte identifier value, which could be implemented as a global counter in NVRAM that increments each time the scavenger wakes such that the identifier for a scavenger run will be the post-increment value of the counter
- the CAT manager could use the current value of the global counter whenever it services an update to a cluster and could store a copy of that value in the corresponding CAT record. Such an implementation would require modification of the CAT manager logic.
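- The resulting quiescence test can be sketched as below, assuming each CAT record carries the one-byte run counter stored at its last update; the two-run margin and the wrapping arithmetic are illustrative assumptions.

```python
# Hypothetical sketch: reclaim a cluster only if it has been quiet for a while.
COUNTER_MODULUS = 256                      # one-byte counter wraps at 256

def quiescent(current_run, last_update_run, min_runs=2):
    """True if at least min_runs scavenger runs have passed since the update."""
    age = (current_run - last_update_run) % COUNTER_MODULUS
    return age >= min_runs

def maybe_reclaim(cat_record, current_run, reclaim):
    if quiescent(current_run, cat_record["last_update_run"]):
        reclaim(cat_record)                # safe: no recent host access recorded
```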
- Cluster access could alternatively be tracked by keeping a short history of cluster updates in a wrapping list
- the scavenger could then search the list to verify that any cluster it was about to free had not recently been accessed by the host
- the size of the list would generally be implementation-specific. However long it was, the storage system would generally have to ensure that it could run the asynchronous task before the list could get full, and that would compromise the ability to postpone the task until a quiet period.
- In a storage system supporting scavenging, it might be desirable to identify and track premature reclamations, particularly reads that fail because of premature reclamation (i.e., an attempt to read from a cluster that has been freed by the scavenger), but also writes to unallocated clusters (which will generally just result in allocation and should therefore be harmless)
- the scavenger might check whether the allocated bits actually correspond to allocated clusters; allocate clusters, or at least CAT records, where they do not; set a bit in each such CAT record indicating that the allocation was forced by the scavenger (a bit that would be reset by a data write to the cluster); and check the bit again on the next scavenger run
- scavengers described above are exemplary only and do not limit the present invention to any particular design or implementation
- Each scavenger type has certain relative advantages and disadvantages that may make it particularly suitable or unsuitable for a particular implementation
- particular implementations could support more than one of the scavenger types and dynamically switch between them as needed, for example, based on such things as the host filesystem, the amount of available storage space, and the system load
- a partly synchronous, partly asynchronous scavenger, using a BBUB to store information for the asynchronous task, and using a byte-sized scavenger run counter (as a timestamp of sorts) within the CAT to track cluster accesses, is contemplated for a particular implementation
- a separate monitor in addition to, or in lieu of, a scavenger could be used to keep track of how many clusters are being used by the host filesystem (for example, a scavenger might be omitted if the host filesystem is known to reliably reuse de-allocated blocks in preference to using new blocks so that reclamation is not needed and monitoring would be sufficient, a monitor might be omitted as duplicative in systems that implement a scavenger)
- the monitor only needs to determine how many bits are set in the bitmap and does not need to know precisely which bits are set and which bits are clear
- the monitor may not need a precise bit count, but may only need to determine whether the number of set bits is more or less than certain threshold values or whether the number is more or less than a previous value for the same region Therefore, the monitor may not need to parse the whole bitmap
- the monitor function could be implemented in whole or in part using an asynchronous task, which could periodically compare the new data with the previous data
- FIG 32 is a schematic block diagram showing the relevant components of a scavenger 3210, in accordance with an exemplary embodiment of the present invention
- the scavenger 3210 includes a Bitmap Block Updates Monitor (BBUM) 3211, Bitmap Locator Tables (BLTs) 3212, Bitmap Block Updates Bitmaps (BBUBs) 3213, an asynchronous task 3214, and De-allocated Space Tables (DSTs) 3215
- the scavenger 3210 includes a BLT 3212 for each LUN
- Each BLT 3212 contains a series of records that include a partition identifier, an LBA range, an indication of the role of the LBA range, and a flag indicating whether or not that LBA range should be parsed synchronously or asynchronously
- Each BLT has an entry for LBA 0, which is partition independent
- the BLTs are generally required to provide rapid LBA-based lookup for incoming writes on that LUN (without checking which partition they belong to first) and to provide relatively rapid partition based lookup
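- One way to picture a BLT record and its two lookups is sketched below; the field names, the end-exclusive ranges, and the plain-list backing store are illustrative assumptions (a real implementation might use an interval tree for the LBA lookup).

```python
# Hypothetical sketch of one BLT record plus LBA-based and partition-based lookup.
from dataclasses import dataclass

@dataclass
class BltRecord:
    partition_id: int        # 0 could stand for partition-independent entries (e.g., LBA 0)
    start_lba: int
    end_lba: int             # exclusive
    role: str                # e.g., "partition_table", "boot_sector", "bitmap_record", "bitmap"
    parse_synchronously: bool

def lookup_by_lba(blt, lba):
    return [r for r in blt if r.start_lba <= lba < r.end_lba]

def lookup_by_partition(blt, partition_id):
    return [r for r in blt if r.partition_id == partition_id]
```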
- the scavenger 3210 includes a BBUB 3213 for each partition supported by the storage system. Each BBUB 3213 is sized appropriately for the size of the filesystem bitmap to which it pertains. Each BBUB 3213 is associated with a counter reflecting how many bits are set in the bitmap. The BBUBs 3213 also include some mapping information showing how each BBUB pertains to its corresponding filesystem bitmap.
- the scavenger 3210 includes a DST 3215 for each LUN Each DST 3215 includes one LBA range per record Each LBA range present in the table is part of a deleted or truncated partition that needs to be reclaimed from the CAT
- the BBUM 3211 may update the DSTs 3215, for example, when it identifies an unused storage area for reclamation during synchronous processing (in which case the BBUM 3211 adds an LBA range to the DSTs 3215)
- the asynchronous task 3214 may update the DSTs 3215, for example, when it identifies an unused storage area for reclamation during asynchronous processing (in which case it adds an LBA range to the DSTs 3215)
- the asynchronous task 3214 uses the DSTs 3215 to reclaim unused storage space asynchronously
- the DSTs 3215 may be stored persistently in a way that is resilient to unclean shutdown, or else additional logic may be provided to recover from any loss of the DSTs 3215
- the BBUBs 3213 are too large to be stored in NVRAM and therefore may be stored in volatile memory or on disk, while the DSTs 3215 may be stored in non-volatile memory, volatile memory, or on disk. If the DSTs 3215 and BBUBs 3213 are completely volatile, then the scavenger 3210 generally must be capable of recovering from a loss of the DSTs 3215 and BBUBs 3213 (e.g., due to an unexpected shutdown). Recovery might be accomplished, for example, by scanning through the entire CAT and comparing it with current partition and cluster bitmap information to see whether each cluster is mapped to a known partition and whether it is allocated in the cluster bitmap of the corresponding filesystem. Another possibility is to store the DSTs 3215 in NVRAM and leave the BBUBs 3213 in volatile memory so that state information for disk space outside of volumes would be preserved across reboots (potentially preventing the need for the full recovery scan described above).
- the scavenger 3210 does not have much to do until a disk pack is loaded, although, in an exemplary embodiment, it is contemplated that the scavenger 3210 will be initialized by the system manager after initializing the modules that the scavenger depends upon, such as CAT Manager or the Cache Manager (for reading from the DiskPack) and the NVRAM Manager (for incrementing a counter) Alternatively, the scavenger could be initialized lazily, e g , after a DiskPack is loaded Since the scavenger could begin reading from the DiskPack almost immediately, the scavenger should not be instructed to load the DiskPack (i e , LoadDiskPack) until the other components are ready and have loaded the same DiskPack themselves
- the BBUM 3211 looks for the NTFS Partition Table at LBA 0
- the NTFS Partition Table is a 64-byte data structure located in the same LBA as the Master Boot Record, namely LBA 0, and contains information about NTFS primary partitions
- Each Partition Table entry is 16 bytes long, making a maximum of four entries available
- Each entry starts at a predetermined offset from the beginning of the sector and has a predetermined structure
- the partition record includes a system identifier that enables the storage system to determine whether the partition type is NTFS or not. It has been found that the Partition Table position and layout is generally somewhat independent of the operating system that writes it, with the same partition table structure serving a range of filesystem formats, not just NTFS, and not just Microsoft formats (HFS+ and other filesystems may use a different structure to locate their partitions).
- the BBUM 3211 reads the Partition Table from LBA 0 and then, for each NTFS partition identified in the Partition Table, reads the boot sector of the partition (the first sector of the partition), and in particular the extended BIOS partition block, which is a structure proprietary to NTFS partitions that will provide the location of the Master File Table (MFT)
- the BBUM 3211 then reads the resident $bitmap record of the MFT to get the file attributes, in particular the location(s) and length(s) of the actual bitmap data
- the BBUM 3211 also programs the BLTs 3212 with the boot sector LBA of each partition, the LBA(s) of the bitmap record(s), and the LBAs of the actual bitmaps Boot sector LBAs and bitmap record LBAs will also be flagged as locations whose updates always require immediate parsing
- the actual bitmap generally does not need immediate parsing and will be flagged accordingly If no partition table is found at LBA 0, then no additional locations are added to the BLTs 3212
- FIG 33 is pseudo code for locating the host filesystem bitmaps, in accordance with an exemplary embodiment of the present invention
- the filesystem-aware storage controller 2721 first looks for the partition table at LBA 0. Assuming the partition table is found, the filesystem-aware storage controller 2721 reads the partition table to identify partitions. Then, for each partition, the filesystem-aware storage controller 2721 reads the boot sector of the partition to find the MFT and reads the resident $bitmap record of the MFT to get file attributes, such as the location(s) and length(s) of the actual bitmaps.
- the filesystem-aware storage controller 2721 programs the BLTs with the boot sector LBA of each partition, the LBA(s) of the bitmap record(s), and the LBA(s) of the actual bitmap(s), and flags the boot sector LBA(s) and the bitmap record LBA(s) to require immediate parsing and flags the actual bitmap(s) to not require immediate parsing. If the filesystem-aware storage controller 2721 does not find a partition table at LBA 0, then no additional locations are added to the BLTs.
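- A simplified sketch of the first step above is shown below, assuming a classic MBR layout with the 64-byte partition table at offset 0x1BE of LBA 0 and 0x07 as the NTFS system identifier; reading the boot sector and the $bitmap record are further steps not shown, and extended partitions are ignored.

```python
# Hypothetical sketch: extract the starting LBA of each NTFS-typed primary partition.
import struct

def ntfs_partition_starts(lba0_bytes):
    """Return the starting LBA of each primary partition whose type is NTFS (0x07)."""
    starts = []
    for i in range(4):                                    # at most four 16-byte entries
        entry = lba0_bytes[0x1BE + 16 * i : 0x1BE + 16 * (i + 1)]
        system_id = entry[4]                              # partition type byte
        start_lba = struct.unpack_from("<I", entry, 8)[0] # little-endian starting LBA
        if system_id == 0x07 and start_lba != 0:
            starts.append(start_lba)
    return starts
```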
- the first LBA of the new partition will be added to the BLT 3212 and flagged as a location whose updates always require immediate parsing
- the DST 3215 will be purged of any LBA ranges that fall within the new partition, in anticipation of there soon being a bitmap with a series of updates that will drive cluster reclamation
- One concern is that, if the partition were ever to be written out ahead of the partition table update, then this information is potentially being written to blocks in the DST 3215, and could be reclaimed incorrectly by the scavenger thread. This could be alleviated, for example, by checking every write received for coincidence with the ranges in the DST 3215 and removing any written-to block from the DST 3215.
- the BBUM 3211 will immediately re-examine the LUN at the location of the partition boot sector, as the identifier change tends to occur after the partition boot sector has been written This is really just part of partition addition
- the BLT will be flushed of records pertaining to the deleted partition, the BBUB 3213 for that partition will be deleted, and the LBA range will be added to the DST 3215 for asynchronous reclamation of clusters. If an existing partition is being relocated, the existing boot sector record in the BLT 3212 will be updated with the new boot sector LBA to monitor. There is potential for the LUN to be immediately re-examined at the new location in case it has already been written, but this is not generally done.
- the DST 3215 will be purged of any LBA ranges that fall within the new partition. There is potential for the LUN to be immediately re-examined at the location of the partition boot sector in case the new boot sector has already been written, but this is not generally done
- Any write found to be addressed to the first LBA of a partition will be parsed immediately (synchronously), as per the flag instructing that action
- the starting LBA of the bitmap record will be determined and added to the BLT 3212, and flagged as a location whose updates always require immediate parsing
- FIG 34 is high-level pseudo code for the BBUM 3211, in accordance with an exemplary embodiment of the present invention
- When the BBUM 3211 receives a client request, it gets the LUN from the ClientRequest and finds the right BLT based on the LUN
- the BBUM 3211 gets the LBA from the ClientRequest, looks for this LBA in the BLT, and checks the "immediate action" field to see if immediate action is required for this LBA. If immediate action is required, then the BBUM 3211 processes the client request synchronously. If, however, immediate action is not required, then the BBUM 3211 sets the BBUB bit corresponding to the LBA for asynchronous processing.
- FIG 35 is high-level pseudo code for synchronous processing of an LBA 0 update creating a new partition, in accordance with an exemplary embodiment of the present invention. Specifically, if immediate action is required and the block is the partition table, then the BBUM 3211 compares partitions in the new data with partitions in the BLT. If a new partition is being added, then the BBUM 3211 gets the start and end of the partition from the new data, checks the DSTs 3215 for any overlapping LBA ranges and removes them, adds the start of partition to the BLT, and flags the entry for immediate action.
- FIG 36 is high-level pseudo code for synchronous processing of an LBA 0 update
- the BBUM 3211 gets the start of the MFT from the new data and calculates the location of the bitmap record. If there is already an identical bitmap record entry in the BLT for this partition, then nothing is required. If, however, the bitmap record is at a different location from the BLT version, then the BBUM 3211 updates the BLT and reads the new location from the disk. If that location does not look like a bitmap record (i.e., it does not have a $bitmap string), then nothing is required. If, however, the location does look like a bitmap record, then the BBUM 3211 gets the new bitmap location(s) and compares them with the BLT. If the new bitmap location(s) are identical, then nothing is required. If the new bitmaps are at a different location, then the BBUM 3211 sets all BBUB bits, updates the BBUB mappings, and moves the bitmap entries in the BLT to the new location(s).
- FIG 37 is high-level pseudo code for synchronous processing of an LBA 0 update deleting a partition, in accordance with an exemplary embodiment of the present invention Specifically, if immediate action is required and the block is a partition table, then the BBUM 3211 compares partitions in new data with partitions in BLT If a partition is being deleted, then the BBUM 3211 deletes the BBUB, deletes the boot sector from the BLT, deletes the Bitmap record from the BLT, deletes the Bitmap ranges from the BLT, and adds the partition range to the DST
- FIG 38 is high-level pseudo code for the asynchronous task 3214, in accordance with an exemplary embodiment of the present invention
- the asynchronous task 3214 parses the BBUB and then, for each bit set in the BBUB, the asynchronous task 3214 checks whether the corresponding cluster is marked unused by the host filesystem. If the cluster is marked unused by the host filesystem, then the asynchronous task 3214 checks whether the cluster is marked used by the storage controller. If the cluster is marked used by the storage controller, then the asynchronous task 3214 adds the LBA range to the DST.
- the asynchronous task 3214 also reclaims the storage space for each LBA range in the DST
- After receiving a boot sector update, it is generally not sufficient to wait for the write of the bitmap record (it is generally not known what order an NTFS format occurs in, and it could change in a minor patch anyway), since the bitmap record may already have been written to disk. If the bitmap record is written before the extended BPB, the BBUM 3211 will not catch it because the location is not present in the BLT 3212; an exception to this is when the location of the bitmap record has not changed. The exception notwithstanding, the BBUM 3211 generally has to immediately read the bitmap record location from the disk at this point to see if the bitmap record is present, and it generally needs to be able to distinguish random noise from an initialized bitmap record (checking for the $bitmap Unicode string is a possibility). If it has not been written, it can wait for the write. If it is already on disk, it generally must be parsed immediately. Parsing generally requires that the record be decoded for the location(s) of the bitmap, and those locations are added to the BLT 3212.
- the bitmap would likely be the same size and occupy the same place, so there generally would be no change to the BLT 3212 or BBUB 3213
- the new bitmap would presumably be rewritten with most blocks being all zero, so the asynchronous task 3214 should be able to get on with processing them in order to reclaim the unallocated clusters from the CAT
- the volume serial number of the boot sector could be checked to determine whether the update was the result of a reformat
- the bitmap record could also be updated at any time, for reasons independent of the boot sector
- the scavenger 3210 may have to be able to cope with the bitmap moving or changing size on the fly. It is not clear whether the bitmap could ever change in size without creating a different sized partition, but future versions of NTFS may support this for whatever reason. In this situation, the new location(s) of the bitmap generally must be programmed into the BLT 3212, with the old entries removed and the new ones added.
- the BBUB 3213 has to be enlarged or contracted accordingly Any LBA ranges freed up by a contraction can be added to the DST 3215, although strictly they still map to the partition
- Another concern is that, if the time of last update field of the bitmap record is frequently modified to reflect ongoing modification of the bitmap, the result could be a substantial amount of inline parsing
- In an exemplary embodiment, the asynchronous task is a dedicated scavenger task 3214 that nominally wakes up once a minute, collects some work from the BBUB 3213, and executes it by paging-in bitmap blocks through the cache and comparing the bits with the CAT
- the BBUB 3213 will be logically segmented (1K segment sizes), with a counter for each segment showing the number of updates for that segment, and a global counter that reflects the highest value held by any counter; these counters will be incremented by the work producer (the BBUM 3211) and decremented by the work consumer (the scavenger task 3214)
- the scavenger task 3214, on waking, will check the global counter and decide whether the value therein is high enough to justify paging-in the bitmap. If it is, then the task 3214 will determine which segment that value corresponds to (e.g., by iterating through the counter array) and then begin iterating through the bits of the appropriate BBUB segment. When it finds a set bit, it processes the corresponding bitmap block as described above.
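- The segmented work accounting can be sketched as below, assuming 1K-byte BBUB segments; the wake-up threshold is an illustrative assumption, and the consumer-side decrement and global-counter recomputation are omitted for brevity.

```python
# Hypothetical sketch of per-segment update counters feeding the scavenger task.
SEGMENT_BYTES = 1024

class SegmentedBbub:
    def __init__(self, bbub_bytes):
        segments = (bbub_bytes + SEGMENT_BYTES - 1) // SEGMENT_BYTES
        self.counters = [0] * segments
        self.global_max = 0

    def produce(self, bitmap_block_index):       # called by the BBUM on each update
        seg = bitmap_block_index // (SEGMENT_BYTES * 8)   # 8 BBUB bits per byte
        self.counters[seg] += 1
        self.global_max = max(self.global_max, self.counters[seg])

    def pick_segment(self, threshold=32):        # called by the scavenger on waking
        if self.global_max < threshold:
            return None                          # not enough work to justify paging-in
        return self.counters.index(self.global_max)
```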
- task priority is generally fixed at compile time and therefore is generally not changed during system operation.
- In an exemplary embodiment, the storage system implements clusters of size 4K. If the host filesystem uses a different cluster (block) size, a bit in the filesystem bitmap would not correlate neatly with a cluster in the storage system.
- If the filesystem cluster size is less than 4K, then multiple bits of the bitmap will generally have to be clear to bother cross-referencing with the CAT
- If the filesystem cluster size is greater than 4K, then one clear bit of the bitmap will generally require multiple lookups into the CAT, one for each 4K
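- The block-size adaptation can be sketched as below, assuming the storage system's 4K cluster and a helper that maps one filesystem bitmap bit to the 4K clusters it covers; the names and the caller-side handling of the sub-4K case are illustrative assumptions.

```python
# Hypothetical sketch: map one free filesystem block (one bitmap bit) to 4K clusters.
STORAGE_CLUSTER = 4096

def clusters_for_free_bit(bit_index, fs_block_size):
    start_byte = bit_index * fs_block_size
    end_byte = start_byte + fs_block_size
    if fs_block_size >= STORAGE_CLUSTER:
        # one clear bit implies several CAT lookups, one per 4K cluster it spans
        return list(range(start_byte // STORAGE_CLUSTER, end_byte // STORAGE_CLUSTER))
    # filesystem block smaller than a cluster: the caller must also confirm that
    # the neighbouring bits making up the rest of this 4K cluster are clear
    return [start_byte // STORAGE_CLUSTER]
```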
- Another concern is how to handle situations in which the scavenger encounters a cluster that is too young to reclaim In such situations, the scavenger could leave the bit set in the BBUB, thereby requiring one or more subsequent scans to parse through the whole 512 bits again (e g , the next scan might go through the 512 bits only to find that the cluster is still too young to reclaim)
- the scavenger and BBUB will both read from disk through the CAT Manager Cluster reclamation will be performed through a special API provided by the CAT Manager
- a hot spare storage device will be maintained in a ready state so that it can be brought online quickly in the event another storage device fails
- a virtual hot spare is created from unused storage capacity across a plurality of storage devices Unlike a physical hot spare, this unused storage capacity is available if and when a storage device fails for storage of data recovered from the remaining storage device(s)
- the virtual hot spare feature requires that enough space be available on the array to ensure that data can be re-laid out redundantly in the event of a disk failure
- the storage system typically determines the amount of unused storage capacity that would be required for implementation of a virtual hot spare (e g , based on the number of storage devices, the capacities of the various storage devices, the amount of data stored, and the manner in which the data is stored) and generates a signal if additional storage capacity is needed for a virtual hot spare (e g , using green/yellow/red lights to indicate status and slot, substantially as described above)
- a record is kept of how many regions are required to re-lay out that zone on a per-disk basis
- the following table demonstrates a virtual hot spare with four drives used
- virtual hot spare is not available on an array with only 1 or 2 drives
- the array determines a re-layout scenario for each possible disk failure and ensures that enough space is available on each drive for each scenario
- the information generated can be fed back into the re-layout engine and the zone manager so that the data can be correctly balanced between the data storage and the hot spare feature
- the hot spare feature requires enough spare working space regions on top of those calculated from the zone layout data so that re-layout can occur
- Fig 22 is a logic flow diagram showing exemplary logic for managing a virtual hot spare in accordance with an exemplary embodiment of the present invention
- the logic determines a re-layout scenario for each possible disk failure
- the logic determines the amount of space needed on each drive for re-layout of data redundantly in a worst case scenario
- the logic determines the amount of spare working space regions needed for re-layout of data redundantly in a worst case scenario
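- The worst-case spare-space check of Fig 22 can be sketched as follows, assuming a per-zone analysis has already produced a table of regions needed on each surviving disk for each possible failure; the dict layout and names are illustrative assumptions.

```python
# Hypothetical sketch: how many extra regions each disk needs for a virtual hot spare.
def hot_spare_shortfall(regions_needed, free_regions):
    """Return {disk: extra regions required} for the worst single-disk failure.

    regions_needed: {failed_disk: {surviving_disk: regions_required}}
    free_regions:   {disk: regions currently free}
    """
    shortfall = {}
    for failed, per_disk in regions_needed.items():
        for disk, needed in per_disk.items():
            missing = needed - free_regions.get(disk, 0)
            if missing > 0:
                shortfall[disk] = max(shortfall.get(disk, 0), missing)
    return shortfall        # an empty dict means the virtual hot spare is covered
```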
- Fig 23 is a logic flow diagram showing exemplary logic for determining a re- layout scenario for each possible disk failure, as in block 2102 of Fig 22, in accordance with an exemplary embodiment of the present invention
- the logic allocates a zone
- the logic determines how many regions are required to re-layout that zone on a per-disk basis
- the logic iteration terminates in block 2299
- Fig 24 is a logic flow diagram showing exemplary logic for invoking the virtual hot spare functionality in accordance with an exemplary embodiment of the present invention
- the logic maintains a sufficient amount of available storage to permit re-layout of data redundantly in the event of a worst case scenario
- the logic automatically reconfigures the one or more remaining drives to restore fault tolerance for the data, in block 2306
- the logic iteration terminates in block 2399
- Fig 25 is a logic flow diagram showing exemplary logic for automatically reconfiguring the one or more remaining drives to restore fault tolerance for the data, as in block 2306 of Fig 24, in accordance with an exemplary embodiment of the present invention
- the logic may convert a first striped pattern across four or more storage devices to a second striped pattern across three or more remaining storage devices
- the logic may convert a striped pattern across three storage devices to a mirrored pattern across two remaining storage devices
- the logic may convert patterns in other ways in order to re-layout the data redundantly following loss of a drive
- the logic iteration terminates in block 2499
- the storage manager 2504 typically includes appropriate components and logic for implementing the virtual hot spare functionality as described above
- the logic described above for handling dynamic expansion and contraction of storage can be extended to provide a dynamically upgradeable storage system in which storage devices can be replaced with larger storage devices as needed, and existing data is automatically reconfigured across the storage devices in such a way that redundancy is maintained or enhanced and the additional storage space provided by the larger storage devices will be included in the pool of available storage space across the plurality of storage devices
- the additional storage space can be used to improve redundancy for already stored data as well as to store additional data. Whenever more storage space is needed, an existing storage device can be replaced with a larger storage device.
- Fig 26 is a logic flow diagram showing exemplary logic for upgrading a storage device, in accordance with an exemplary embodiment of the present invention.
- the logic stores data on a first storage device in a manner that the data stored thereon appears redundantly on other storage devices
- the logic detects replacement of the first storage device with a replacement device having greater storage capacity than the first storage device
- the logic automatically reproduces the data that was stored on the first storage device onto the replacement device, using the data stored redundantly on the other storage devices
- the logic makes the additional storage space on the replacement device available for storing new data redundantly
- the logic may store new data redundantly within the additional storage space on the replacement device if no other device has a sufficient amount of available storage capacity to provide redundancy for the new data
- the storage manager 2504 typically includes appropriate components and logic for implementing the dynamic upgrade functionality as described above
- Embodiments of the present invention may be employed to provide storage capacity to a host computer, e g , using a peripheral connect protocol in the manner described in my United States Provisional Patent Application No 60/625,495, which was filed on November 5, 2004 in the name of Geoffrey S Barrall, and is hereby incorporated herein by reference in its entirety
- the hash function typically includes a mechanism for confirming uniqueness. For example, in an exemplary embodiment of the invention as described above, if the hash value for one chunk is different than the hash value of another chunk, then the contents of those chunks are considered to be non-identical. If, however, the hash value for one chunk is the same as the hash value of another chunk, then the hash function might compare the contents of the two chunks or utilize some other mechanism (e.g., a different hash function) to determine whether the contents are identical or non-identical.
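- The uniqueness confirmation can be sketched as below, assuming SHA-1 style digests over fixed-size chunks; the choice of SHA-1 and the byte-for-byte fallback are illustrative assumptions, not the patent's specified mechanism.

```python
# Hypothetical sketch: different digests imply different content; equal digests
# are confirmed by comparing the chunk contents directly.
import hashlib

def chunks_identical(chunk_a, chunk_b):
    if hashlib.sha1(chunk_a).digest() != hashlib.sha1(chunk_b).digest():
        return False                      # different digests: definitely different
    return chunk_a == chunk_b             # same digest: confirm byte-for-byte
```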
- logic flow diagrams are used herein to demonstrate various aspects of the invention, and should not be construed to limit the present invention to any particular logic flow or logic implementation
- the described logic may be partitioned into different logic blocks (e g , programs, modules, functions, or subroutines) without changing the overall results or otherwise departing from the true scope of the invention
- logic elements may be added, modified, omitted, performed in a different order, or implemented using different logic constructs (e g , logic gates, looping primitives, conditional logic, and other logic constructs) without changing the overall results or otherwise departing from the true scope of the invention
- the present invention may be embodied in many different forms, including, but in no way limited to, computer program logic for use with a processor (e.g., a microprocessor, microcontroller, digital signal processor, or general purpose computer), programmable logic for use with a programmable logic device (e.g., a Field Programmable Gate Array (FPGA)), or an Application Specific Integrated Circuit (ASIC), among other forms
- Source code may include a series of computer program instructions implemented in any of various programming languages (e g , an object code, an assembly language, or a high-level language such as Fortran, C, C++, JAVA, or HTML) for use with various operating systems or operating environments
- the source code may define and use various data structures and communication messages
- the source code may be in a computer executable form (e g , via an interpreter), or the source code may be converted (e g , via a translator, assembler, or compiler) into a computer executable form.
- the computer program may be fixed in any form (e.g., source code form, computer executable form, or an intermediate form) either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), a PC card (e.g., PCMCIA card), or other memory device
- the computer program may be fixed in any form in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies
- the computer program may be distributed in any form as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web)
- Hardware logic (including programmable logic for use with a programmable logic device) implementing all or part of the functionality previously described herein may be designed using traditional manual methods, or may be designed, captured, simulated, or documented electronically using various tools, such as Computer Aided Design (CAD), a hardware description language (e.g., VHDL or AHDL), or a PLD programming language (e.g., PALASM, ABEL, or CUPL)
- Programmable logic may be fixed either permanently or transitorily in a tangible storage medium, such as a semiconductor memory device (e.g., a RAM, ROM, PROM, EEPROM, or Flash-Programmable RAM), a magnetic memory device (e.g., a diskette or fixed disk), an optical memory device (e.g., a CD-ROM), or other memory device
- the programmable logic may be fixed in a signal that is transmittable to a computer using any of various communication technologies, including, but in no way limited to, analog technologies, digital technologies, optical technologies, wireless technologies (e.g., Bluetooth), networking technologies, and internetworking technologies
- the programmable logic may be distributed as a removable storage medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), preloaded with a computer system (e.g., on system ROM or fixed disk), or distributed from a server or electronic bulletin board over the communication system (e.g., the Internet or World Wide Web)
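The fallback placement rule referenced earlier in this list (prefer a second device for the redundant copy; use the replacement device's additional space only when no other device has room) can be illustrated with a short sketch. This is a minimal, hypothetical Python illustration, not the storage manager's actual logic; the `Device` class, its field names, and the simple mirrored (two-copy) layout are assumptions made for clarity.

```python
# Hypothetical sketch of the fallback placement rule; not the actual
# storage-manager implementation.  Assumes a simple mirrored (two-copy) layout.

class Device:
    def __init__(self, name, free_bytes, is_replacement=False, extra_bytes=0):
        self.name = name
        self.free_bytes = free_bytes          # ordinary free capacity
        self.is_replacement = is_replacement  # newly inserted, larger device
        self.extra_bytes = extra_bytes        # capacity beyond the device it replaced


def choose_redundant_location(devices, chunk_size):
    """Pick two storage locations for the copies of a new chunk.

    Preferred: two different devices.  Fallback, per the description above:
    if no other device has enough free space for the redundant copy, place
    the data redundantly within the additional space on the replacement device.
    """
    primary = max(devices, key=lambda d: d.free_bytes)
    others = [d for d in devices if d is not primary and d.free_bytes >= chunk_size]
    if others:
        return primary, max(others, key=lambda d: d.free_bytes)

    for d in devices:
        if d.is_replacement and d.extra_bytes >= 2 * chunk_size:
            # Both copies land in the replacement device's extra region;
            # this protects against localized block errors until another
            # device with spare capacity becomes available.
            return d, d

    raise RuntimeError("insufficient capacity for redundant placement")
```

For example, if every other device is full but a replacement device still has spare capacity beyond the size of the device it replaced, the function returns that replacement device for both copies, matching the fallback condition in the bullet above.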
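The hash-based uniqueness check can likewise be sketched briefly. The function below only illustrates the two-step test described above (a hash mismatch proves the chunks differ; a hash match is confirmed by comparing contents or by a second, independent hash); the choice of SHA-1 and MD5 is an assumption for the sketch, not something specified by the description.

```python
import hashlib


def chunks_identical(chunk_a: bytes, chunk_b: bytes, confirm_with_bytes: bool = True) -> bool:
    """Decide whether two chunks hold identical content.

    Differing primary hashes prove the chunks are non-identical.  Matching
    hashes are only a hint, so the match is confirmed either by a direct
    byte-for-byte comparison or by a second, independent hash function.
    """
    if hashlib.sha1(chunk_a).digest() != hashlib.sha1(chunk_b).digest():
        return False  # hashes differ, so contents cannot be identical

    if confirm_with_bytes:
        # Confirmation mechanism 1: compare the chunk contents directly.
        return chunk_a == chunk_b

    # Confirmation mechanism 2: consult a different hash function, trading
    # absolute certainty for speed when raw contents are expensive to fetch.
    return hashlib.md5(chunk_a).digest() == hashlib.md5(chunk_b).digest()
```

In a deduplicating store this confirmation would typically run only when a new chunk's hash matches the hash of a chunk already recorded in the chunk database.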
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP11171934.0A EP2372520B1 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US79712706P | 2006-05-03 | 2006-05-03 | |
PCT/US2007/068139 WO2007128005A2 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP11171934.0A Division EP2372520B1 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2024809A2 true EP2024809A2 (en) | 2009-02-18 |
Family
ID=38610547
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP11171934.0A Not-in-force EP2372520B1 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
EP07797330A Ceased EP2024809A2 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP11171934.0A Not-in-force EP2372520B1 (en) | 2006-05-03 | 2007-05-03 | Filesystem-aware block storage system, apparatus, and method |
Country Status (7)
Country | Link |
---|---|
EP (2) | EP2372520B1 (en) |
JP (1) | JP4954277B2 (en) |
KR (1) | KR101362561B1 (en) |
CN (1) | CN101501623B (en) |
AU (1) | AU2007244671B9 (en) |
CA (1) | CA2651757A1 (en) |
WO (1) | WO2007128005A2 (en) |
Families Citing this family (28)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8386537B2 (en) * | 2009-12-15 | 2013-02-26 | Intel Corporation | Method for trimming data on non-volatile flash media |
KR101656102B1 (en) | 2010-01-21 | 2016-09-23 | 삼성전자주식회사 | Apparatus and method for generating/providing contents file |
CN102622184A (en) * | 2011-01-27 | 2012-08-01 | 北京东方广视科技股份有限公司 | Data storage system and method |
CN102270161B (en) * | 2011-06-09 | 2013-03-20 | 华中科技大学 | Methods for storing, reading and recovering erasure code-based multistage fault-tolerant data |
KR102147359B1 (en) | 2012-06-29 | 2020-08-24 | 삼성전자 주식회사 | Method for managing non-volatile memory device, and non-volatile memory device |
US20140129526A1 (en) | 2012-11-06 | 2014-05-08 | International Business Machines Corporation | Verifying data structure consistency across computing environments |
GB2527529B (en) * | 2014-06-24 | 2021-07-14 | Advanced Risc Mach Ltd | A device controller and method for performing a plurality of write transactions atomically within a non-volatile data storage device |
KR101744685B1 (en) * | 2015-12-31 | 2017-06-09 | 한양대학교 산학협력단 | Protection method and apparatus for metadata of file |
US10126962B2 (en) * | 2016-04-22 | 2018-11-13 | Microsoft Technology Licensing, Llc | Adapted block translation table (BTT) |
CN108062200B (en) * | 2016-11-08 | 2019-12-20 | 杭州海康威视数字技术股份有限公司 | Disk data reading and writing method and device |
US11301433B2 (en) | 2017-11-13 | 2022-04-12 | Weka.IO Ltd. | Metadata journal in a distributed storage system |
CN107885492B (en) * | 2017-11-14 | 2021-03-12 | 中国银行股份有限公司 | Method and device for dynamically generating data structure in host |
CN110019097B (en) * | 2017-12-29 | 2021-09-28 | 中国移动通信集团四川有限公司 | Virtual logic copy management method, device, equipment and medium |
KR102090374B1 (en) * | 2018-01-29 | 2020-03-17 | 엄희정 | The Method and Apparatus for File System Level Encryption Using GPU |
CN108829345B (en) * | 2018-05-25 | 2020-02-21 | 华为技术有限公司 | Data processing method of log file and terminal equipment |
KR102697883B1 (en) * | 2018-09-27 | 2024-08-22 | 삼성전자주식회사 | Method of operating storage device, storage device performing the same and storage system including the same |
TWI682296B (en) * | 2018-12-06 | 2020-01-11 | 啓碁科技股份有限公司 | Image file packaging method and image file packaging system |
CN109783398B (en) * | 2019-01-18 | 2020-09-15 | 上海海事大学 | Performance optimization method for FTL (fiber to the Home) solid state disk based on relevant perception page level |
US10809927B1 (en) * | 2019-04-30 | 2020-10-20 | Microsoft Technology Licensing, Llc | Online conversion of storage layout |
CN110532262B (en) * | 2019-07-30 | 2021-02-05 | 北京三快在线科技有限公司 | Automatic data storage rule recommendation method, device and equipment and readable storage medium |
US11347698B2 (en) * | 2019-10-04 | 2022-05-31 | Target Brands, Inc. | Garbage collection for hash-based data structures |
CN110750495A (en) * | 2019-10-14 | 2020-02-04 | Oppo(重庆)智能科技有限公司 | File management method, file management device, storage medium and terminal |
CN112804071B (en) * | 2019-11-13 | 2024-09-06 | 南京中兴新软件有限责任公司 | Online upgrade method, upgrade file providing method, device and storage medium |
KR20210108749A (en) | 2020-02-26 | 2021-09-03 | 삼성전자주식회사 | Accelerator, method for operating the same and accelerator system including the same |
CN113535942B (en) * | 2021-07-21 | 2022-08-19 | 北京海泰方圆科技股份有限公司 | Text abstract generating method, device, equipment and medium |
CN113934691B (en) * | 2021-12-08 | 2022-05-17 | 荣耀终端有限公司 | Method for accessing file, electronic device and readable storage medium |
CN114691698B (en) * | 2022-04-24 | 2022-11-08 | 山西中汇数智科技有限公司 | Data processing system and method for computer system |
US12131046B1 (en) * | 2023-04-28 | 2024-10-29 | Netapp, Inc. | Clone volume split of clone volume from parent volume with data tiered to object store |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0695955A (en) * | 1992-09-09 | 1994-04-08 | Ricoh Co Ltd | Flash file system |
WO2001024010A1 (en) * | 1999-09-29 | 2001-04-05 | Hitachi, Ltd. | Method of file sharing and storage system |
US6606651B1 (en) * | 2000-05-03 | 2003-08-12 | Datacore Software Corporation | Apparatus and method for providing direct local access to file level data in client disk images within storage area networks |
US20020161982A1 (en) * | 2001-04-30 | 2002-10-31 | Erik Riedel | System and method for implementing a storage area network system protocol |
US20040078641A1 (en) * | 2002-09-23 | 2004-04-22 | Hewlett-Packard Company | Operating system-independent file restore from disk image |
JP4322031B2 (en) * | 2003-03-27 | 2009-08-26 | 株式会社日立製作所 | Storage device |
JP2005122439A (en) * | 2003-10-16 | 2005-05-12 | Sharp Corp | Device equipment and format conversion method for recording device of device equipment |
US7523140B2 (en) | 2004-03-01 | 2009-04-21 | Sandisk Il Ltd. | File system that manages files according to content |
US7603532B2 (en) * | 2004-10-15 | 2009-10-13 | Netapp, Inc. | System and method for reclaiming unused space from a thinly provisioned data container |
- 2007
- 2007-05-03 CA CA002651757A patent/CA2651757A1/en not_active Abandoned
- 2007-05-03 JP JP2009510073A patent/JP4954277B2/en not_active Expired - Fee Related
- 2007-05-03 WO PCT/US2007/068139 patent/WO2007128005A2/en active Application Filing
- 2007-05-03 EP EP11171934.0A patent/EP2372520B1/en not_active Not-in-force
- 2007-05-03 EP EP07797330A patent/EP2024809A2/en not_active Ceased
- 2007-05-03 CN CN2007800252087A patent/CN101501623B/en not_active Expired - Fee Related
- 2007-05-03 KR KR1020087029601A patent/KR101362561B1/en not_active IP Right Cessation
- 2007-05-03 AU AU2007244671A patent/AU2007244671B9/en not_active Ceased
Non-Patent Citations (1)
Title |
---|
See references of WO2007128005A2 * |
Also Published As
Publication number | Publication date |
---|---|
CN101501623B (en) | 2013-03-06 |
AU2007244671B2 (en) | 2012-12-13 |
KR20090009300A (en) | 2009-01-22 |
AU2007244671A2 (en) | 2009-01-08 |
AU2007244671A1 (en) | 2007-11-08 |
CA2651757A1 (en) | 2007-11-08 |
EP2372520B1 (en) | 2014-03-19 |
WO2007128005A2 (en) | 2007-11-08 |
JP4954277B2 (en) | 2012-06-13 |
EP2372520A1 (en) | 2011-10-05 |
JP2009536414A (en) | 2009-10-08 |
AU2007244671B9 (en) | 2013-01-31 |
KR101362561B1 (en) | 2014-02-13 |
CN101501623A (en) | 2009-08-05 |
WO2007128005A3 (en) | 2008-01-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
AU2007244671B2 (en) | | Filesystem-aware block storage system, apparatus, and method |
US7873782B2 (en) | | Filesystem-aware block storage system, apparatus, and method |
AU2005304792B2 (en) | | Storage system condition indicator and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20081128 |
|
AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: BARRALL, GEOFFREY, S. Inventor name: CLARKSON, NEIL, A. Inventor name: TERRY, JULIAN, M. |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: DATA ROBOTICS, INC. |
|
DAX | Request for extension of the european patent (deleted) | ||
17Q | First examination report despatched |
Effective date: 20090903 |
|
REG | Reference to a national code |
Ref country code: DE Ref legal event code: R003 |
|
DAC | Divisional application: reference to earlier application (deleted) | ||
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED |
|
18R | Application refused |
Effective date: 20110317 |