WO2015161140A1 - System and method for fault-tolerant block data storage - Google Patents

System and method for fault-tolerant block data storage

Info

Publication number
WO2015161140A1
WO2015161140A1 · PCT/US2015/026267
Authority
WO
WIPO (PCT)
Prior art keywords
blocks
block
data
primary
erasure
Prior art date
Application number
PCT/US2015/026267
Other languages
French (fr)
Inventor
Sharath CHANDRASHEKHARA
Madhusudhan Ramesh KUMAR
Vipin CHAUDHARY
Original Assignee
The Research Foundation For The State University Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by The Research Foundation For The State University Of New York filed Critical The Research Foundation For The State University Of New York
Publication of WO2015161140A1 publication Critical patent/WO2015161140A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2056Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant by mirroring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2211/00Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
    • G06F2211/10Indexing scheme relating to G06F11/10
    • G06F2211/1002Indexing scheme relating to G06F11/1076
    • G06F2211/1028Distributed, i.e. distributed RAID systems with parity

Definitions

  • RAID has limited abilities to handle multiple disk failures. When one of the disks in the RAID array fails, the data has to be reconstructed. Typically in a RAID system after a disk fails it is highly critical to replace the disk and start the rebuild immediately. These reconstruction times are often so high that the possibility of a second disk failure becomes significant. This scenario is extremely likely in today's world of Terabyte disks as Robin Harris and many others have pointed out in their articles.
  • Erasure coding has emerged as a strong alternative to RAID in providing reliable data storage.
  • With some extra redundancy, erasure coding provides the flexibility to schedule a reconstruction when the system is light on load rather than doing it immediately. Erasure coding also has a much lower storage footprint than replication. Many implementations with such a scheme are already available for object-based stores in the cloud, but these object stores are typically slower than block stores and are not suitable for many applications.
  • the presently disclosed system and method aims to provide an extremely flexible, reliable, and distributed block store.
  • In an exemplary embodiment, called CIDER, Reed-Solomon erasure codes were used to provide fault tolerance.
  • Systems according to the present disclosure take a new approach to reduce storage overhead by offering a variable degree of fault tolerance which can be set by the user at a granularity of a single block. This is achieved by the use of a thin block translation layer and a block level metadata system.
  • CIDER provides a reliable data store with minimal storage overhead by uniquely allowing varying reliability for data with different requirements.
  • CIDER allows the reliability requirements of the same data to change over time, e.g., system logs from last week require higher reliability than those saved a couple of years ago.
  • CIDER is block based and thus has high performance.
  • We have implemented CIDER and preliminary results show that it is very efficient and practical.
  • the overheads of CIDER are negligible and the performance is better than raw NBD by a factor of two, even for small systems.
  • CIDER can be used by any file system or even as a raw device. We believe that the potential storage savings that CIDER offers make it a suitable candidate for cloud and archival storage.
  • Figure 1 is an illustration of a storage stack
  • Figure 2A is a diagram showing a RAID 0 configuration
  • Figure 2B is a diagram showing a RAID 1 configuration
  • Figure 3 is a diagram showing a RAID 5 configuration
  • Figure 4 is a diagram showing a RAID 6 configuration
  • Figure 5 is a diagram showing the high level architecture of CIDER
  • Figure 6 is a diagram showing the CIDER software components
  • Figure 7 is a diagram showing the physical block distribution for an exemplary variable k/m model according to an embodiment of the present disclosure
  • Figure 8 is a diagram showing the physical block distribution for an exemplary constant k/m model according to an embodiment of the present disclosure
  • Figure 9 is a table showing the configuration of a test system
  • Figure 10 is a flowchart of a write operation according to an embodiment of the present disclosure.
  • Figure 11 is a flowchart of a read operation according to an embodiment of the present disclosure.
  • Figure 12 is a flowchart showing metadata caching according to an embodiment of the present disclosure
  • Figure 13 is a diagram showing a network block device server-client
  • Figure 14 is a diagram showing another representation of the architecture according to an embodiment of the present disclosure.
  • Figure 15 is a diagram showing physical blocks spread to primary and secondary devices;
  • Figure 16 is a diagram showing a technique for spreading secondary blocks across devices
  • Figure 17 is a diagram showing primary and secondary blocks in a variable model
  • Figure 18 is a graph showing library encoding performance
  • Figure 19 is a graph showing library decoding performance
  • Erasure codes are used to create a distributed and reliable block store that has a high degree of flexibility in which the degree of fault tolerance can be set on a per block basis. Such fine control over setting the redundancy results in reduction of storage footprint if the redundancy levels are carefully chosen by the applications. Such a system has a great deal of potential in areas like archival storage systems and database stores which require block level access. The benefits from the disclosed system far outweigh the small performance penalty paid for the calculation of the erasure codes.
  • the present disclosure may be embodied as a method 100 for electronically storing block-level data.
  • the method 100 comprises the step of receiving 103 a request to write data.
  • the data comprises a plurality of data blocks.
  • the request can be received 103 from a file system or any client software.
  • One or more coding indicators is received, where each data block of the plurality of data blocks is associated with a coding indicator.
  • each data block may be associated with the same coding indicator as the other data blocks.
  • one or more of the data blocks may be associated with a coding indicator which is different from the coding indicator of other data blocks.
  • the coding indicator(s) are received as part of the received 103 request to write data ("in-band").
  • the coding indicator(s) are received 106 separate from the data (“out-of-band”) as further described below.
  • Each coding indicator represents a value, k, of a number of primary blocks to be written for an associated data block. For example, where k is 4, a data block will be split into, and written as, four primary blocks.
  • the coding indicator further represents a value, m, which is a sum of the number of primary blocks (k) and the number of erasure-coded blocks calculated for a corresponding set of primary blocks. For redundancy, a number (m - k) of erasure-coded blocks can be calculated from the primary blocks such that if one or more of the primary blocks cannot be read, the data block can be reconstructed using the remaining primary blocks and one or more of the erasure-coded blocks.
  • the method 100 comprises writing 109 a data block of the plurality of data blocks.
  • the data block is written 109 as a set of primary blocks, where k is the value according to the coding indicator associated with the data block.
  • Each primary block of the set of primary blocks is written 109 to a separate storage device.
  • a value of each of m - k erasure-coded blocks is calculated 112 based on the set of primary blocks.
  • the calculated 112 erasure-coded blocks are written 115 to separate storage devices which are different than the storage devices on which the primary blocks were written 109. In this way, each primary block and each erasure-coded block is written 109, 115 to a separate storage device.
  • Encoding at the block-level allows the present disclosure to advantageously provide functionality not previously obtainable. For example, in a virtualized environment, a storage device is modeled as a large file, called a VDisk.
  • a virtual machine may include a 256 GB hard drive which is actually stored as a large 256 GB file.
  • a host system using a redundant storage scheme which sets redundancy on the file level will only be able to set a redundancy parameter for the entire VDisk.
  • the typical result is that a great deal of space is wasted due to overly-redundant storage of unimportant files contained within the VDisk.
  • the level of redundancy can be set at the block level, thereby allowing for customized redundancy for each file contained within the VDisk (and even more granular).
  • the VDisk example is one useful and important example of block-level coding. Other uses will be apparent to one having skill in the art in light of the present disclosure.
  • a metadata entry is recorded 118 as further described below.
  • the recorded 118 metadata entry is associated with the written 109 data block.
  • the metadata entry comprises the coding indicator for the data block.
  • the set of storage devices corresponding to the set of primary blocks for the written 109 data block is identified 121, and the metadata entry further comprises the identity of the set of storage devices.
  • Steps of the method 100 are repeated such that each data block is written 109 according to its respective coding indicator and the corresponding erasure-coded blocks are calculated 112 and written 115 to storage devices as described above. In this way, each data block of the plurality of data blocks is processed and the received 103 request to write data is fulfilled.
  • the method 100 may further comprise receiving 124 a request to retrieve the data.
  • a metadata entry associated with the data block is received and the corresponding coding indicator is determined.
  • the identity of the set of storage devices corresponding with the data block is determined, and k primary blocks are read from the identified set of k storage devices, wherein the value of k is selected according to the coding indicator associated with the data block. If one or more of the primary blocks is inaccessible, the method 100 comprises the step of reading the corresponding erasure-coded blocks and reconstructing one or more of the primary blocks corresponding to the inaccessible devices.
  • the data block is assembled from the read k primary blocks. The steps of querying the metadata entry, reading primary blocks, and assembling the data block are repeated until the data is retrieved.
  • the method 100 may further comprise reconstructing the primary blocks and/or erasure-coded blocks written to the lost storage device according to the respective sets of k primary blocks and a corresponding erasure-coded block stored on the remaining storage devices, wherein values of k may differ for each reconstructed primary block and writing the reconstructed primary blocks and/or erasure-coded blocks to a replacement storage device.
  • the present disclosure may be embodied in software, for example, as a software application that performs the disclosed methods on, for example, the same computer system as the host.
  • the present disclosure may be embodied as a storage controller for storing and retrieving blocks of data in a block-level storage system.
  • a storage controller may comprise a controller (e.g., a microcontroller, etc.) configured to be in electronic communication with a plurality of storage devices, wherein the controller is configured to perform any of the disclosed methods.
  • CIDER is a recursive acronym for "CIDER is Distributed and Extended RAID.”
  • NBD Network Block Device
  • CIDER itself runs in the user space, making use of the block device in userspace (“BUSE”) library, which exposes a filesystem in userspace (“FUSE”)-like interface to a block device.
  • BUSE block device in userspace
  • FUSE filesystem in userspace
  • Erasure coding is a technique of adding error correction codes to data which enables reconstruction of the data when a part of the original data is lost.
  • 'k' is the minimum number of blocks required to reconstruct the data and 'm' is the total number of blocks (data blocks + erasure-coded blocks) for the set of blocks. We can interpret this as expanding 'k' units of data to 'm' units, where (m - k) units of encoded data are added to 'k' units of original data, enabling reconstruction of the data from any 'k' of the 'm' units.
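  • For a sense of the storage trade-off this implies, the overhead of a (k, m) erasure code can be compared with replication directly (a worked example; the particular values of k and m are illustrative):

        \[
          \text{overhead}(k, m) = \frac{m - k}{k}, \qquad
          k = 4,\ m = 7 \;\Rightarrow\; \frac{3}{4} = 75\%\ \text{overhead while tolerating any 3 lost blocks,}
        \]

    whereas 3-way replication tolerates only 2 lost copies at 200% overhead.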
  • Network Block Device is a Linux component which gives a block storage device interface to a remote file or raw device.
  • the server which runs on a remote machine, exports a file or a raw device and listens for incoming requests on a configurable port (see, Figure 13).
  • the client runs on a local machine and connects to the NBD server and exposes a block device interface through /dev/nbd[0- 16] locally. Once the connection is established, the remote NBD can be accessed as if it is present locally.
  • the server runs in the user space and redirects the IO requests to the appropriate storage component.
  • the client runs in the kernel space and transfers the IO requests to the remote machine to which the NBD is connected to.
  • CIDER combines the principles of NBD and erasure coding to offer a better system in terms of flexibility, reliability and storage overhead.
  • BUSE is an implementation of the NBD server which enables developers to easily implement a Virtual Block Storage Device in user space by providing an interface similar to that of FUSE. This is achieved by creating an NBD client as a loopback device which redirects all IO calls to the NBD server running on localhost. These requests are intercepted by the NBD server and the developer can implement how these commands are interpreted.
  • the NBD protocol has five types of requests: read (NBD_CMD_READ), write (NBD_CMD_WRITE), disconnect, flush, and trim, which CIDER handles through callbacks such as reedsolo_read, reedsolo_write, and reedsolo_trim.
  • a storage stack is a layered system, typically including storage media such as hard disk drives (“HDD”), solid state drives (“SSD”), etc. at the bottom-most layer, and having file-systems at the top-most layer with multiple logical block device layers in between.
  • An exemplary storage stack is depicted in Figure 1.
  • a file system is a piece of software which maintains the logical structure in the way the files are stored on a computer. It gives an interface to create, access, modify, and delete files on computers. Apart from basic access, a file system also implements functionality such as:
  • a block is the smallest logical unit of data in a storage system. It is typically a small multiple of the sector size (for example, 1024 or 4096 bytes in the systems described here).
  • a block device is a logical abstraction of an underlying storage system.
  • the block device deals with data in units of blocks.
  • the file system is built on top of the block device.
  • the physically smallest unit of data that can be written to or read from the hardware is referred to as a sector.
  • a sector is typically 512 bytes for traditional hard disks.
  • Striping is a technique where sequential data is logically split into many smaller fragments and stored separately in multiple storage media, or sometimes within the same storage medium but in different partitions.
  • RAID Redundant Array of Inexpensive Disks
  • RAID is a logical combination of storage media (HDD, SSD, etc.) that is exposed as a logical block device by the operating system.
  • RAID can be implemented at the hardware, firmware, and software levels. Different forms of RAID include RAID - 0 / 1 / 5 / 6 and combinations like RAID - 10 / 01 / 50 / 60 (see, e.g., Figs. 2A-4).
  • An extended file attribute is a file system feature that enables users to associate computer files with metadata not interpreted by the filesystem.
  • the extended attributes are metadata, associated with files, which are free to be interpreted at any desired level. Typically, these are used to store information of a file related to encryption, digital signatures, etc., and are interpreted and used at the application layer. All major file systems (ext2, ext3, ext4, JFS, ReiserFS, XFS, Btrfs on Linux; NTFS on Windows; UFS2 and ZFS on Unix; HFS on Apple OS X) support extended attributes.
  • Out-of-band communication is a technique where a dedicated communication channel is used to carry the control commands. This typically involves two channels: one specifically for control data, and another for the data itself.
  • in-band communication is a technique where only one communication channel exists for both control commands and the data.
  • a process can exist in either of two contexts—the user mode or kernel mode.
  • In kernel mode, the process has permission to access the underlying hardware resources. All programs other than the operating system are started in user mode. To utilize these resources, a user space program has to switch to kernel mode by making a "system call."
  • cache is an in-memory structure that is maintained to make access to physical data blocks much faster.
  • Block Number: Every data block is addressed by a block number (e.g., 64 bits). In CIDER, these block numbers are all virtual block numbers, derived from the block offset and length specified by the file system in the upper layer.
  • pages: In a storage medium, groups of data blocks are collectively called "pages." In cache operations, pages are loaded and offloaded based on a fair page replacement policy. [0058] Page Number & Page Offset
  • a block number includes a page number and page offset.
  • the first 48 bits of a block number represent the page number and the remaining 16 bits represent the page offset.
  • the metadata cache includes a plurality of pages that contain a plurality of page table entries. Each page table entry represents the metadata associated for a physical data block.
  • Embodiments of the CIDER system architecture are depicted at a high level in Figure 5.
  • the system comprises the following components:
  • Storage Nodes: These are remote machines or storage controllers where the data will be physically stored. These are connected to a master node through a high-speed interconnect, such as, for example, InfiniBand, Fibre Channel, or Ethernet. A single machine can also host multiple devices, each through a different port.
  • the master node runs the CIDER virtual block device and is connected to the remote storage nodes which are presented as block devices.
  • the virtual block device is a virtual interface running on the master node and visible to the user or file system.
  • the Metadata server holds CIDER's block-level metadata.
  • a metadata server is typically a highly fault-tolerant, fast server.
  • BUSE: The CIDER system creates the virtual block device (e.g., /dev/nbd0) using BUSE.
  • Consider an example in which eight storage nodes each export a block device through NBD; the master node registers each of these nodes using the NBD client and sees the pool of NBDs as /dev/nbd[1-8].
  • the master node also creates a virtual block device /dev/nbd0 to include all the registered NBDs, whose usable size will be 5 TB. /dev/nbd0 will be the only interface to the user/filesystem.
  • when write requests are made to the block device (typically through the file system), the data is striped in units of the block size (1024 bytes) and written to 5 NBDs.
  • RS codes are calculated and written to 3 NBDs. Typically, all NBDs will be on different physical nodes.
  • CIDER is implemented in C on the GNU/Linux platform and has various software components. Before detailing each of these components, it is helpful to define the concepts of the virtual block and physical block in the context of CIDER. Since CIDER itself is a virtual device which does not physically store any data, the data block numbers of CIDER are referred to herein as "virtual block numbers.” Each such virtual block is striped into k segments and after erasure coding to m segments, it is written to m devices. The block numbers of the set of m devices are referred to herein as "physical block numbers.”
  • Figure 6 shows a block diagram of various components within CIDER.
  • Constant k/m Some embodiments of the present disclosure use constant k and m values.
  • the constant k and m values may be set while creating (i.e., instantiating) the virtual device. In one embodiment, every write to the virtual device, will add the same amount of error correction codes and every block has the same amount of fault tolerance.
  • a minimal storage unit of 1024 bytes may be selected. The minimal storage unit may be selected based on a sector size of a hard disk used in the system. Setting the block size of the virtual device to the same size as the HDD sector size allows the flexibility to choose any value for k and m. For example, k and m can be chosen to be 100 and 120, respectively. This means that 20 blocks of 1024 bytes each will be added to 100 blocks of 1024 bytes, thus enabling the data to be reconstructed even if up to 20 of the 120 blocks are lost.
  • An alternate option is to make the block size of the virtual device k times the storage device sector size. This enables striping a virtual block into k sectors and adding (m - k) sectors, thereby writing the virtual block to m storage devices. Such a selection may be less flexible but could offer higher performance.
  • the first k may be designated to be primary devices and the next (m - k) as secondary devices. This is illustrated in Figure 15.
  • every virtual block may be mapped to a primary device through modulo arithmetic.
  • a set of virtual blocks, each mapping to a different primary device, may form a group on which the Reed-Solomon codes may be calculated.
  • the (m - k) encoded blocks are then written to the secondary devices.
  • This approach may create problematic hot spots wherein the secondary devices are worn out faster than their primary counterparts.
  • One potential solution to this problem is to spread out the blocks such that the secondary blocks are changed in cyclic order. This is shown in Figure 16 for a 4 + 3 scheme.
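  • One way to realize the modulo mapping and the cyclic rotation of secondary blocks described above (cf. Figure 16) is sketched below in C. The helper names and the exact rotation rule are illustrative assumptions, not the CIDER implementation:

        #include <stdint.h>

        /* Constant k/m model with m devices in total. Block i of a coding
         * window (i = 0..k-1 for data, i = k..m-1 for parity) is placed on
         * device (i + window) % m, so the devices holding parity rotate from
         * one window to the next and no device holds two blocks of a window. */
        static uint64_t window_of(uint64_t vblock, unsigned k)
        {
            return vblock / k;          /* k consecutive virtual blocks form a window */
        }

        static unsigned device_for(uint64_t window, unsigned i, unsigned m)
        {
            return (unsigned)((i + window) % m);
        }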
  • Header window: blocks 11 - 12;
  • Center windows: blocks 13 - 20, consisting of 2 windows, each of 4 blocks;
  • the following method may be the callback which is invoked when an NBD_CMD_READ request is received: static int reedsolo_read
  • buf is a buffer that will contain the read data
  • offset is an offset in bytes from where data has to be read
  • len is a length of data to be read, in bytes.
  • the following method may be the callback which is invoked when an NBD_CMD_WRITE request is received: static int reedsolo_write (
  • buf is a buffer which contains the data that has to be written
  • offset is an offset in bytes from where data has to be written
  • len is the length of data to write, in bytes.
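  • A plausible shape for the two callbacks named above, modeled on the FUSE-like interface that BUSE exposes, is sketched below. The parameter types and ordering are assumptions, and the bodies are stubs standing in for the striping and encoding logic described in this disclosure:

        #include <stdint.h>

        /* Invoked for NBD_CMD_READ: fill buf with len bytes starting at byte
         * offset of the virtual device, reconstructing missing primary blocks
         * from secondary (erasure-coded) blocks if a device is unavailable. */
        static int reedsolo_read(void *buf, uint32_t len, uint64_t offset, void *userdata)
        {
            (void)buf; (void)len; (void)offset; (void)userdata;
            /* ... translate (offset, len) to virtual blocks, read k blocks per
             *     window, decode if a primary device has failed, copy to buf ... */
            return 0;   /* 0 on success, non-zero on error */
        }

        /* Invoked for NBD_CMD_WRITE: stripe buf into k primary blocks per
         * window, compute the (m - k) parity blocks, and write all m pieces. */
        static int reedsolo_write(const void *buf, uint32_t len, uint64_t offset, void *userdata)
        {
            (void)buf; (void)len; (void)offset; (void)userdata;
            return 0;
        }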
  • the Reed-Solomon coded blocks may be calculated once per window; thus, it may be advantageous to select a window of k blocks so that writes align with the window boundaries;
  • the additional blocks which are required to calculate the codes may be read from the respective devices, for example, by the encode operation which is explained below;
  • the device may be marked as failed, either permanently or temporarily.
  • the parity blocks may be generated and written to the secondary devices. If the write does not involve all of the k devices, or the write concerned is that of the header or the trailer window, additional reads may be required to recalculate the Reed-Solomon codes. Further, if there is another device failure during the process of recalculation, the failed device has to be reconstructed first to recalculate the new parity. This can be achieved by performing a two-phase commit, wherein the old values are read into a temporary buffer before writing the new values; these old values can be used to calculate the new parity information. [0076] A reconstruction operation may be required during reads if one or more primary devices have failed. This is done by reading k blocks from the respective devices, which may include secondary devices, and reconstructing the missing primary block using, for example, the zfec APIs.
  • Reads and writes to devices may be issued one block at a time.
  • the requested blocks may be bundled in one command. This may require maintaining a cache and scheduling the IOs to remote devices.
  • Some embodiments use a variable k and m.
  • the values of k and m may be set on a per block basis.
  • One example of how blocks may be stored is shown in Figure 17.
  • the caller of the block level driver may specify the value of k and m with each write request.
  • an out-of-band inter-process communication (IPC) mechanism may be implemented using named pipes.
  • the caller may pass a message through a named pipe in a well-defined format.
  • One such format is a message header structure, struct Cider_write_hdr, sketched below.
  • An 8-byte magic number may be used to verify the integrity of the message.
  • the block device may use the previously used values of k and m. During initialization these values may be set based on a pre-defined policy.
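  • The body of struct Cider_write_hdr is not reproduced in this text. A minimal sketch that is consistent with what is stated here (an 8-byte magic number plus the per-write k and m values), with the field layout being an assumption, is:

        #include <stdint.h>

        /* Out-of-band message written by the caller to the named pipe to set
         * the erasure-coding parameters for subsequent writes (layout assumed). */
        struct Cider_write_hdr {
            uint64_t magic;   /* 8-byte magic number used to validate the message */
            uint32_t k;       /* number of primary (data) blocks per virtual block */
            uint32_t m;       /* total blocks (primary + erasure-coded) per virtual block */
        };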
  • m may be variable and k may be constant.
  • a virtual device may be created with a block size of k times the sector size of the underlying network block devices. For example, if the block size of the 8 network block devices participating in the system is 1024 bytes and k is fixed at 4, then the virtual device will have a block size of 4096 bytes. Each block is split into 4 sectors of 1024 bytes each and distributed to 4 nodes. In addition, (m - k) parity sectors may be created and written to the (m - k) disks. In one embodiment, none of the sectors go to the same device.
  • Embodiments with variable k and m values may have metadata associated with each block.
  • the metadata may need to be stored in a persistent storage.
  • Various types of metadata and metadata storage may be used.
  • the entire metadata may be stored to a disk and the entire metadata may be loaded from the disk.
  • all the metadata may be in memory and when the device is disconnected, the metadata may be flushed to the disk.
  • the metadata may be read from the disk and loaded to the memory.
  • the metadata comprises two components, a block map and a bit map.
  • the superblock may contain information about the number of devices, size of the block device, as well as other high-level data.
  • One structure of the superblock is shown below.
  • the superblock may also contain a handle to a list of block headers.
  • the block headers may contain the values of k and m used for that particular block and the device block numbers on all of the devices to which the block was written.
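  • The superblock and block-header structures themselves are not reproduced in this text. A sketch of the kind of fields they would need, based only on what is stated here (device count and size, and per-block k, m, and device block numbers), is shown below; all names, widths, and the device limit are assumptions:

        #include <stdint.h>

        #define CIDER_MAX_DEVICES 16        /* illustrative upper bound, not from the source */

        /* High-level information about the virtual block device. */
        struct cider_superblock {
            uint32_t num_devices;           /* number of storage devices (NBDs) */
            uint64_t device_size_blocks;    /* size of each device, in blocks */
            uint64_t total_blocks;          /* size of the virtual block device */
            uint64_t block_headers_start;   /* handle to the list of block headers */
        };

        /* Per-virtual-block metadata ("block header"). */
        struct cider_block_header {
            uint8_t  in_use;                                /* block status */
            uint8_t  k;                                     /* primary blocks for this virtual block */
            uint8_t  m;                                     /* primary + erasure-coded blocks */
            uint8_t  start_device;                          /* first device of the stripe */
            uint32_t device_block[CIDER_MAX_DEVICES];       /* block number used on each device */
        };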
  • the size of the entire block map may be approximately 1.6% of the block device size.
  • a bitmap of the blocks in use may be maintained for each NBD.
  • the corresponding bitmap may be checked and the first free block may be allocated.
  • the size of the bitmap would be about 0.2% of the block device size. Therefore, the disk wastage due to the metadata would be about 1.8% of the total block device size.
  • the block header may be checked to see if that block was previously allocated
  • the device with the minimum number of blocks may be chosen as the starting device. This may be done to ensure uniform distribution of the blocks over the plurality of devices. Once the start device is chosen, the set of devices serving this particular write would be from the start device to the next m devices, wrapping around after the last device is reached.
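  • A small C sketch of this allocation policy follows (names assumed; the per-device usage counts would be derived from the bitmaps described above):

        #include <stdint.h>

        /* Pick the least-used device as the start of the stripe so that blocks
         * stay evenly distributed, then take the next m devices, wrapping. */
        static unsigned choose_start_device(const uint64_t *used_blocks, unsigned num_devices)
        {
            unsigned best = 0;
            for (unsigned d = 1; d < num_devices; d++)
                if (used_blocks[d] < used_blocks[best])
                    best = d;
            return best;
        }

        static unsigned stripe_device(unsigned start_device, unsigned i, unsigned num_devices)
        {
            return (start_device + i) % num_devices;    /* i = 0 .. m-1 */
        }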
  • Some embodiments may allow for variable values of k and m, in other words, changing the degree of redundancy for each file, by utilizing the Extended Attributes of the filesystem.
  • the filesystem may need to be modified, in particular, the calls which set extended attributes.
  • the block driver may be informed about the k and m for the set of blocks associated with that file and the block driver will write the data accordingly.
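  • One way an application could express such a per-file policy is through the Linux setxattr(2) call; the attribute name, the value encoding, and the file path below are purely illustrative assumptions about how a modified file system might forward k and m to the block driver:

        #include <stdio.h>
        #include <sys/xattr.h>

        int main(void)
        {
            /* Request k = 4, m = 7 for all blocks belonging to this file.
             * "user.cider.km" and the "k:m" encoding are hypothetical. */
            const char value[] = "4:7";
            if (setxattr("/mnt/cider/important.db", "user.cider.km",
                         value, sizeof value - 1, 0) != 0) {
                perror("setxattr");
                return 1;
            }
            return 0;
        }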
  • Metadata size may grow linearly with the size of the block device driver. For larger devices, it may become infeasible to maintain the entire metadata in memory.
  • a caching mechanism may be used to access metadata. Since the access of the blocks is generally sequential, cached metadata may result in performance gains. For example, a Least Recently Used (LRU) cache based on a clock algorithm may be used for caching the block headers.
  • LRU Least Recently Used
  • For the bitmap, as the number of blocks increases, searching a bitmap for available free blocks becomes increasingly costly. To make this scalable, a multilevel bitmap may be used in place of a single-level bitmap. Also, a write-through cache may be implemented for all the metadata in place of the store-and-flush methods discussed previously.
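  • An illustrative two-level version of such a bitmap is sketched below (not the CIDER code): a summary bitmap records which 64-bit words of the full bitmap still contain a free block, so the search can skip fully allocated regions:

        #include <stdint.h>

        /* Return the index of a free block, or -1 if the device is full.
         * summary has one bit per 64-bit word of bitmap; a set summary bit
         * means that word still contains at least one free (zero) block bit. */
        static long find_free_block(const uint64_t *summary, const uint64_t *bitmap,
                                    unsigned long num_words)
        {
            for (unsigned long w = 0; w < num_words; w++) {
                if (!(summary[w / 64] & (1ULL << (w % 64))))
                    continue;                               /* word w has no free blocks */
                for (unsigned b = 0; b < 64; b++)
                    if (!(bitmap[w] & (1ULL << b)))
                        return (long)(w * 64 + b);          /* first free block found */
            }
            return -1;
        }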
  • the caller of the block store specifies the value of k and m with each write request. For this, an out-of-band IPC was used with named pipes. The caller passes a message through this pipe in a format that CIDER defines. When no message is available in the pipe, the block device uses the previously used values of k and m. During initialization, these values are set according to a predefined policy. Though any arbitrary value for k is supported, during testing, k was kept constant and only the value for m was varied. One of the advantages of keeping k constant is that all the incoming write requests can be easily divided into k sectors and the extra parity information added into secondary sectors. [0093] IO engine
  • All the IO requests to the CIDER block device are handled by this component. It is responsible for striping the data and writing them to remote nodes, reading from remote nodes and rearranging them, and reconstructing the data blocks if primary nodes are not reachable.
  • the caller sets the values of k and m as described in the previous section. Using these values, the virtual block number is converted to a set of physical block numbers (more in the next section). Each block is striped, and primary data along with encoded secondary data is written to remote nodes.
  • the virtual block requested is converted to the set of physical blocks with the help of a "block translation layer" which is explained below. Reads are issued to the corresponding devices at the corresponding offsets. If there is failure of one of the devices, the data is constructed on the fly by reading the secondary blocks.
  • the block translation layer has the job of converting any given virtual block number to a set of physical block numbers. For writes, the new physical blocks have to be allocated according to the specified k and m values and for reads, these values have to be retrieved. For this purpose, it uses the following two sub-components:
  • BlockMap A large array having the fields required for translation of a virtual block to physical block(s). Each virtual block in use will have an entry in the BlockMap. The size of the block map increases with the size of the system and needs to be persistent. For this reason, the BlockMap is maintained in a persistent store, either on the master node itself or on a dedicated metadata server. In the CIDER prototype, which uses 8 nodes, each block is associated with a metadata of 68 bytes. For a block size of 4k, this would amount to an overhead of about 1.6%
  • Block allocator - (Bitmap)
  • the block translation layer has to manage the physical block allocations and deletions. Whenever there is a write request for a new virtual block, at least one unused physical block is assigned to this virtual block. To do this, a bitmap is maintained representing the physical blocks for each device. With the increase in the number of blocks, searching a bitmap for an available free block becomes increasingly costly. This is a common problem with file systems and two popular approaches to resolving the problem are: maintaining a list of free blocks and maintaining a Btree structure. In CIDER a multi-level bitmap was implemented which has a reasonably fast performance. Given a block size of 4k, the size of all the bitmaps would be about 0.2% of the system capacity. Therefore, the storage overhead due to the metadata would be about 1.8-2.0% of the system's capacity. [0097] Metadata Cache
  • the size of the metadata grows linearly with the capacity of the system. For larger devices, it becomes infeasible to maintain the entire metadata in memory. On the other hand, accessing the metadata from disk would be a very costly operation. Because the access of the blocks is generally sequential, the metadata was cached to get significant performance gains. A simple write-back cache was used with the least recently used ("LRU") replacement policy for BlockMap metadata. The size of the cache allocated will have an impact on the performance of a system. Advantageously, all the metadata can be brought into memory for best performance.
  • LRU least recently used
  • Encoder/Decoder [0100] In the exemplary system, the Zfec-Tahoe library was used for Erasure Coding.
  • the disk recovery framework provides a mechanism to reconstruct an entire disk when there is a permanent failure of a node.
  • the reconstruction operation will be triggered.
  • the administrator can schedule the reconstructor to start when the system is light on load.
  • relevant writes will be forwarded to the new disk.
  • reads will continue to be served by block level reconstruction to avoid any inconsistencies until the entire process is complete.
  • the device is marked as fully available and all relevant IO can be served by the new disk.
  • Embodiments of the presently disclosed system accommodate varying the values of k and m.
  • the degree of the redundancy can be set for each file of a file system, by utilizing the Extended Attributes of the file system.
  • the file system can be modified with calls for setting extended attributes.
  • CIDER is informed about the k and m for the set of blocks associated with that file and CIDER writes the block data accordingly.
  • Operational details of CIDER
  • a network of 7 storage nodes and 1 master node where each storage node exposes a 1 TB block device through NBD.
  • the master node registers each storage node using the NBD client and would see the pool of NBDs as /dev/nbd[1-7].
  • the master node also creates the CIDER block device which will be accessible at /dev/nbd0 and would have an aggregate size of 4 TB.
  • the block size of the CIDER system in this example is 4 KB and the block size of the individual NBDs is 1 KB.
  • the flow chart of Fig. 10 depicts the steps of a write operation.
  • the system attempts to read the requested k and m values from the out-of-band channel described above. If there is no specific value set, the previously used values of k and m are used. Even if the system which uses CIDER never specifies new k and m values, the default values will be used throughout the operation.
  • Types of write operations include: (1) a new write to a new set of blocks; and (2) a write to a set of blocks that was previously written (for example, with a changed value of m).
  • the second case can be divided into two types:
  • In the first case, new physical blocks are allocated to hold the additional redundancy; the original data is read, the parity blocks are recalculated, and the additional parity is written to the newly allocated physical blocks.
  • the new physical blocks are allocated on devices which do not contain any of the other physical blocks for the virtual block in question.
  • the second case is simpler: it involves freeing up erasure-coded blocks and updating the metadata as per the new value of m.
  • While allocating new physical blocks on devices, the devices can be chosen such that the physical blocks are evenly spread across the storage nodes. That is, at any given time, all the devices will have approximately the same number of used physical blocks.
  • the write is failed.
  • the write operation ends and returns a success code.
  • the flowchart of Fig. 11 depicts a read operation on the CIDER system.
  • the CIDER block translation layer is more involved during read operations.
  • all reads in the exemplary system are in multiples of 4k.
  • the virtual block number is translated to a set of physical block numbers and device pairs.
  • if the block number is not in use, the read request is simply ignored.
  • On a typical read only four primary devices of that block are read. If any of the primary devices are down, the secondary devices of that block are read to reconstruct the data.
  • if the number of failures is more than (m - k), the read is failed. Once all the blocks are read, the read operation is completed.
  • In CIDER, an application cache is maintained, where the application cache is an in-memory representation of a subset of the system metadata. As defined above, CIDER maintains metadata for every data block with the following information:
  • Block status: Indicates whether this data block is in use or not;
  • Starting device: Indicates the starting storage device id from which this data block has been striped;
  • Device block numbers: Indicates the block numbers at which the striped data is present on the respective storage device(s).
  • In CIDER, a cold start mechanism was employed for the cache. This means that when the system starts up the cache is empty, and as the system is used over time the cache gets populated according to the access patterns.
  • the structure of the cache in CIDER includes N number of fixed size pages, which include M page table entries, where each page table entry represents the metadata information for one physical block.
  • the virtual block number is derived from the block offset provided by the file system. From the virtual block number, the page number and page offset is determined (see, e.g., Figure 12). The cache is searched for this page, if this page is present then the corresponding physical blocks are returned. Otherwise, a page is loaded from the metadata store onto the cache. In case the cache is full, a page is evicted from the cache and the new page is loaded onto the cache. Once loaded onto the cache, the actual physical block is sent to the routine that requested it.
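  • The lookup path just described can be sketched in C as follows; the structure layouts and the cache helper routines are assumptions made only so that the flow reads on its own, not the CIDER sources:

        #include <stdint.h>

        #define ENTRIES_PER_PAGE 65536u     /* addressed by the 16-bit page offset */

        struct pt_entry   { uint8_t in_use, k, m, start_device; };   /* per-block metadata (assumed) */
        struct cache_page { uint64_t page_no; struct pt_entry entries[ENTRIES_PER_PAGE]; };

        /* Cache primitives (e.g., clock/LRU) provided elsewhere; declared here
         * only so that the lookup flow below is self-explanatory. */
        struct meta_cache;
        struct cache_page *cache_find(struct meta_cache *c, uint64_t page_no);
        int                cache_is_full(struct meta_cache *c);
        void               cache_evict(struct meta_cache *c);
        struct cache_page *cache_load(struct meta_cache *c, uint64_t page_no);

        /* Translate a virtual block number into its cached metadata entry,
         * loading (and, if necessary, evicting) a metadata page on a miss. */
        struct pt_entry *lookup_metadata(struct meta_cache *cache, uint64_t vblock)
        {
            uint64_t page_no = vblock >> 16;                  /* upper 48 bits: page number */
            uint16_t offset  = (uint16_t)(vblock & 0xFFFFu);  /* lower 16 bits: page offset */

            struct cache_page *page = cache_find(cache, page_no);
            if (page == NULL) {                               /* cache miss */
                if (cache_is_full(cache))
                    cache_evict(cache);                       /* make room for the new page */
                page = cache_load(cache, page_no);            /* fetch page from metadata store */
            }
            return &page->entries[offset];
        }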
  • Reconstruction When a storage node fails permanently, a reconstruction operation can be triggered. As discussed above, an advantage of the reconstruction is that one can choose to reconstruct only data blocks which are in use. If desired, a larger-scale reconstruction can be triggered only when the system is light on load. In one embodiment, the load on the system can be measured by a counter which ticks every time a write or a read operation is requested. Other techniques for determining load will be apparent in light of the present disclosure.
  • the client can trigger the reconstruction by passing the reconstruct request with the details about the old and the new device.
  • the process of reconstruction involves iterating through all the virtual blocks and writing the new data.
  • Each virtual block reconstruction might involve one of the below two scenarios:
  • the failed device contains the secondary block.
  • data is read from k primary blocks and parity blocks are recalculated and the appropriate parity block is written to the new (i.e., replacement) device.
  • the failed device contains a primary block.
  • k blocks are read and the primary block is first reconstructed using the secondary blocks and is written back to the new device.
  • the process can lazily be done in the background when the system is light on load.
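  • The two reconstruction cases above can be expressed compactly as in the sketch below; the erasure-coding helpers are hypothetical wrappers (their signatures are assumptions, not the zfec API):

        #include <stddef.h>
        #include <stdint.h>

        /* Hypothetical erasure-coding wrappers. */
        void rs_encode_parity(const uint8_t *const *primary_k, uint8_t *out_parity,
                              unsigned parity_index, unsigned k, unsigned m, size_t blk_size);
        void rs_decode_block(const uint8_t *const *any_k, const unsigned *their_indices,
                             uint8_t *out_primary, unsigned missing_index,
                             unsigned k, unsigned m, size_t blk_size);

        /* Rebuild the piece a failed device held for one virtual block and
         * return it in out, ready to be written to the replacement device.
         * role is the position (0..m-1) the failed device had in the stripe. */
        void rebuild_piece(const uint8_t *const *surviving_k, const unsigned *their_indices,
                           uint8_t *out, unsigned role, unsigned k, unsigned m, size_t blk_size)
        {
            if (role >= k)
                /* Failed device held a secondary block: re-encode that parity
                 * block from the k surviving primary blocks. */
                rs_encode_parity(surviving_k, out, role - k, k, m, blk_size);
            else
                /* Failed device held a primary block: decode it from any k
                 * surviving blocks (primary and/or secondary). */
                rs_decode_block(surviving_k, their_indices, out, role, k, m, blk_size);
        }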
  • the host machine is the master node running the CIDER system;
  • the VM exposes 8 virtual disks through the NBD client; these are connected on the host as /dev/nbd[1-7];
  • Hardware used: an Intel i5 processor with a 1 TB HDD running at 5400 RPM and 8 GB of RAM, running Ubuntu 12.10;
  • Figure 20 summarizes the test results. When compared to raw reads, we observe that the performance degrades by about 40% for no disk failures and by about 80% for 3 disk failures. We expect the read times to improve when deployed on the multi-node cluster mainly due to parallel reads. At normal operation, the reads do not involve any overhead of reconstruction and hence on faster networks, should outperform single disks and times should even approach RAID devices.
  • Figure 21 summarizes the test results.
  • Write involves a higher overhead than read as there is the encoding overhead for every write.
  • our system is between 2 and 4 times slower.
  • Figure 22 shows the read times for the 4+3 scheme for 0, 1, 2 and 3 disk failures. As in the constant scheme, these times are compared with the read times of a raw device. We also see that these times are very close to the read times of the constant model. We cannot observe the overhead, mainly due to the small amount of data involved in the tests. However, we expect a small degradation in performance due to accessing the metadata on large systems.
  • Figure 23 shows the write times for the 4+3 scheme for 0, 1, 2 and 3 disk failures. As with writes in the constant scheme, we see a decrease in the write times as the number of disk failures increases, since there are fewer disks to write to. For writes, too, the values are very similar to the constant scheme, as the overhead due to metadata is hidden.
  • Figure 24 shows the write times for various values of redundancy. We can see that the times increase with increasing m when k is fixed at 4. Similar to the previous cases, we expect this difference to reduce if the nodes are physically separate disks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides methods and systems for fault-tolerant storage of data, where the level of fault tolerance is selectable at the block level. A method includes receiving a request to write data, the data comprising a plurality of data blocks. Each data block is written to a number of storage devices as a set of k primary blocks. A number (represented by m - k) of erasure-coded blocks are calculated and written to separate storage devices. The values of m and/or k can be varied for each data block of the data.

Description

SYSTEM AND METHOD FOR FAULT-TOLERANT BLOCK DATA STORAGE
Cross-Reference to Related Applications
[0001] This application claims priority to U.S. Provisional Application No. 61/980,562, filed on April 16, 2014, now pending, the disclosure of which is incorporated herein by reference.
Background of the Disclosure
[0002] Storing data reliably with minimal storage overhead is the principal goal of a typical data storage system. While reliability cannot be compromised, keeping the storage overhead minimal is desirable for practical systems dealing with large amounts of data as in today's petabyte scale data centers. Traditional solutions to reliable storage involve using parity and/or replication based RAID like systems or object stores. These remain largely inflexible in the sense that the level of redundancy cannot be easily changed.
[0003] Reliably storing data has been one of the most important goals ever since the advent of the computer revolution. Storage, as we know it today, has come a long way since then. 1-2 TB disks are very common today and disks with a capacity of 6 TB have arrived in the market. One of the breakthroughs in reliably storing data came from the introduction of storage devices with RAID. These devices not only tolerate single disk failures (two disk failures in the case of RAID-6/DP) but are also faster than regular SATA disks due to parallel IOs. RAID devices have enjoyed widespread use in the industry and have become synonymous with reliable storage. However, RAID devices implemented in hardware tend to be expensive and offer limited flexibility. The reduction in the prices of reliable storage devices has not kept up with the increase in the amount of data. Secondly, RAID has limited ability to handle multiple disk failures. When one of the disks in a RAID array fails, the data has to be reconstructed. Typically, in a RAID system, after a disk fails it is highly critical to replace the disk and start the rebuild immediately. These reconstruction times are often so high that the possibility of a second disk failure becomes significant. This scenario is extremely likely in today's world of terabyte disks, as Robin Harris and many others have pointed out in their articles.
[0004] Another issue is the limited flexibility RAID offers. Once set up, it is not easy to switch between different RAID levels. Lastly, RAID requires replication and/or parity to be done locally, which is a severe limitation in today's ubiquitous wide-area distributed systems. There exist many implementations of RAID at the software level, collectively called soft RAID. While these systems offer better flexibility, they suffer from most of the limitations associated with hardware RAID, including high reconstruction times. Full replication of data across multiple nodes is one popular alternative that has emerged in cloud storage. However, the footprint of such storage is extremely high. Erasure coding has emerged as a strong alternative to RAID in providing reliable data storage. With some extra redundancy, erasure coding provides the flexibility to schedule a reconstruction when the system is light on load rather than doing it immediately. Erasure coding also has a much lower storage footprint than replication. Many implementations with such a scheme are already available for object-based stores in the cloud, but these object stores are typically slower than block stores and are not suitable for many applications.
[0005] Another significant problem with the available storage solutions is a globally fixed redundancy for the data. Data by nature is heterogeneous, and different types of data require different degrees of reliability. For instance, based on legal requirements, financial data would mandate a high degree of reliability for the first seven years while medical records would be expected to last forever. Additionally, the data might also change its reliability need over time. Thus, a system that allows changes to the degree of fault tolerance at a finer level is highly desirable.
Brief Summary of the Disclosure [0006] The presently disclosed system and method aims to provide an extremely flexible, reliable, and distributed block store. In an exemplary embodiment, called "CIDER," Reed-Solomon erasure codes were used to provide fault tolerance. Systems according to the present disclosure take a new approach to reduce storage overhead by offering a variable degree of fault tolerance which can be set by the user at a granularity of a single block. This is achieved by the use of a thin block translation layer and a block-level metadata system. When storing
heterogeneous data requiring various amounts of redundancy, this technique can lead to significant storage savings.
[0007] CIDER provides a reliable data store with minimal storage overhead by uniquely allowing varying reliability for data with different requirements. One might prefer moderate reliability for system logs that are over a year old but extremely high reliability for personal photos. Moreover, CIDER allows the reliability requirements of the same data to change over time, e.g., system logs from last week require higher reliability than those saved a couple of years ago. Unlike object stores, CIDER is block based and thus has high performance. We have implemented CIDER and preliminary results show that it is very efficient and practical. The overheads of CIDER are negligible and the performance is better than raw NBD by a factor of two, even for small systems. Faster interconnects like InfiniBand and faster block device interfaces instead of NBD will further improve the performance of our system. Finally, as a low-level implementation, CIDER can be used by any file system or even as a raw device. We believe that the potential storage savings that CIDER offers make it a suitable candidate for cloud and archival storage. Description of the Drawings
[0008] For a fuller understanding of the nature and objects of the invention, reference should be made to the following detailed description taken in conjunction with the
accompanying drawings, in which:
Figure 1 is an illustration of a storage stack;
Figure 2A is a diagram showing a RAID 0 configuration;
Figure 2B is a diagram showing a RAID 1 configuration;
Figure 3 is a diagram showing a RAID 5 configuration;
Figure 4 is a diagram showing a RAID 6 configuration;
Figure 5 is a diagram showing the high level architecture of CIDER;
Figure 6 is a diagram showing the CIDER software components;
Figure 7 is a diagram showing the physical block distribution for an exemplary variable k/m model according to an embodiment of the present disclosure; Figure 8 is a diagram showing the physical block distribution for an exemplary constant k/m model according to an embodiment of the present disclosure; Figure 9 is a table showing the configuration of a test system;
Figure 10 is a flowchart of a write operation according to an embodiment of the present disclosure;
Figure 11 is a flowchart of a read operation according to an embodiment of the present disclosure;
Figure 12 is a flowchart showing metadata caching according to an embodiment of the present disclosure; Figure 13 is a diagram showing a network block device server-client;
Figure 14 is a diagram showing another representation of the architecture according to an embodiment of the present disclosure;
Figure 15 is a diagram showing physical blocks spread to primary and secondary devices; Figure 16 is a diagram showing a technique for spreading secondary blocks across
devices;
Figure 17 is a diagram showing primary and secondary blocks in a variable model;
Figure 18 is a graph showing library encoding performance;
Figure 19 is a graph showing library decoding performance;
Figure 20 is a graph showing read times for k = 4 and m = 3 in a constant k/m model;
Figure 21 is a graph showing write times for k = 4 and m = 3 in a constant k/m model;
Figure 22 is a graph showing read times for k = 4 and m = 3 in a variable k/m model;
Figure 23 is a graph showing write times for k = 4 and m = 3 in a variable k/m model; and
Figure 24 is a graph showing write times for k = 4 and variable m in a variable k/m
model.
Detailed Description of the Disclosure
[0009] Erasure codes are used to create a distributed and reliable block store that has a high degree of flexibility in which the degree of fault tolerance can be set on a per block basis. Such fine control over setting the redundancy results in reduction of storage footprint if the redundancy levels are carefully chosen by the applications. Such a system has a great deal of potential in areas like archival storage systems and database stores which require block level access. The benefits from the disclosed system far outweigh the small performance penalty paid for the calculation of the erasure codes.
[0010] The following is a general description of a system and apparatus of the present disclosure, further details are provided below under the section heading "CIDER" which describes an exemplary implementation of the disclosure. The CIDER implementation is intended to be illustrative and, as such, should not be interpreted as limiting.
[0011] The present disclosure may be embodied as a method 100 for electronically storing block-level data. The method 100 comprises the step of receiving 103 a request to write data. The data comprises a plurality of data blocks. The request can be received 103 from a file system or any client software. [0012] One or more coding indicators is received, where each data block of the plurality of data blocks is associated with a coding indicator. As such, each data block may be associated with the same coding indicator as the other data blocks. In other cases, one or more of the data blocks may be associated with a coding indicator which is different from the coding indicator of other data blocks. In some embodiments, the coding indicator(s) are received as part of the received 103 request to write data ("in-band"). In other embodiments, the coding indicator(s) are received 106 separate from the data ("out-of-band") as further described below.
[0013] Each coding indicator represents a value, k, of a number of primary blocks to be written for an associated data block. For example, where k is 4, a data block will be split into, and written as, four primary blocks. The coding indicator further represents a value, m, which is a sum of the number of primary blocks (k) and the number of erasure-coded blocks calculated for a corresponding set of primary blocks. For redundancy, a number (m - k) of erasure-coded blocks can be calculated from the primary blocks such that if one or more of the primary blocks cannot be read, the data block can be reconstructed using the remaining primary blocks and one or more of the erasure-coded blocks.
[0014] The method 100 comprises writing 109 a data block of the plurality of data blocks. The data block is written 109 as a set of primary blocks, where k is the value according to the coding indicator associated with the data block. Each primary block of the set of primary blocks is written 109 to a separate storage device. [0015] A value of each of m - k erasure-coded blocks is calculated 112 based on the set of primary blocks. The value of m (and therefore, m - k) is selected according to the coding indicator associated with the data block. For example, where the data block being processed is associated with a coding indicator where k = 4 and m = 7, then 3 erasure-coded blocks are calculated 112 based on the 4 primary blocks. The calculated 112 erasure-coded blocks are written 115 to separate storage devices which are different than the storage devices on which the primary blocks were written 109. In this way, each primary block and each erasure-coded block is written 109, 115 to a separate storage device.
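As a concrete, deliberately simplified illustration of steps 109-115, the C sketch below splits one data block into k primary pieces, derives the m - k erasure-coded pieces with a hypothetical encoder, and writes each piece to a different device. The helper routines, buffer limits, and block size are assumptions made for illustration; this is not the implementation described in the CIDER section.

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK_SIZE 4096u     /* virtual block size assumed for the example */
    #define MAX_M      16u       /* illustrative upper bound on m */

    /* Hypothetical helpers: an erasure encoder and a per-device block writer. */
    void erasure_encode(const uint8_t *const *primaries, uint8_t *const *parity,
                        unsigned k, unsigned m, size_t piece_size);
    int  device_write(unsigned device_id, uint64_t device_block,
                      const uint8_t *piece, size_t piece_size);

    /* Write one data block as k primary pieces plus (m - k) erasure-coded
     * pieces, each piece going to a different storage device (steps 109-115). */
    int write_data_block(const uint8_t data[BLOCK_SIZE], unsigned k, unsigned m,
                         const unsigned *devices, const uint64_t *device_blocks)
    {
        size_t piece = BLOCK_SIZE / k;
        const uint8_t *primaries[MAX_M];
        uint8_t parity_buf[MAX_M][BLOCK_SIZE];
        uint8_t *parity[MAX_M];

        for (unsigned i = 0; i < k; i++)
            primaries[i] = data + i * piece;               /* stripe the data block */
        for (unsigned j = 0; j < m - k; j++)
            parity[j] = parity_buf[j];

        erasure_encode(primaries, parity, k, m, piece);    /* compute m - k parity pieces */

        for (unsigned i = 0; i < k; i++)                   /* primaries to k devices */
            if (device_write(devices[i], device_blocks[i], primaries[i], piece) != 0)
                return -1;
        for (unsigned j = 0; j < m - k; j++)               /* parity to the remaining devices */
            if (device_write(devices[k + j], device_blocks[k + j], parity[j], piece) != 0)
                return -1;
        return 0;
    }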
[0016] It should be noted that because each data block is written as primary blocks and corresponding erasure-coded blocks according to the coding indicator for that data block, then each data block may be written with different encoding schemes. For example, a first data block may be written with k = 4 and m = 7, such that the data block can be retrieved even if 3 primary blocks are inaccessible, and a subsequent (and presumably less important) data block can be written with k = 4 and m = 5, such that the data block can only survive the loss of one primary block. Encoding at the block-level allows the present disclosure to advantageously provide functionality not previously obtainable. For example, in a virtualized environment, a storage device is modeled as a large file, called a VDisk. For instance, a virtual machine may include a 256 GB hard drive which is actually stored as a large 256 GB file. A host system using a redundant storage scheme which sets redundancy on the file level will only be able to set a redundancy parameter for the entire VDisk. The typical result is that a great deal of space is wasted due to overly-redundant storage of unimportant files contained within the VDisk. With the systems and methods of the present disclosure, the level of redundancy can be set at the block level, thereby allowing for customized redundancy for each file contained within the VDisk (and even more granular). The VDisk example is one useful, but very important example of block-level coding. Other uses will be apparent to one having skill in the art in light of the present disclosure. [0017] A metadata entry is recorded 118 as further described below. The recorded 118 metadata entry is associated with the written 109 data block. The metadata entry comprises the coding indicator for the data block. In another embodiment, the set of storage devices corresponding to the set of primary blocks for the written 109 data block is identified 121, and the meta data entry further comprises the identity of the set of storage devices. [0018] Steps of the method 100 are repeated such that each data block is written 109 according to its respective coding indicator and the corresponding erasure-coded blocks are calculated 112 and written 115 to storage devices as described above. In this way, each data block of the plurality of data blocks is processed and the received 103 request to write data is fulfilled. [0019] The method 100 may further comprise receiving 124 a request to retrieve the data.
A metadata entry associated with the data block is queried and the corresponding coding indicator is determined. The identity of the set of storage devices corresponding with the data block is determined, and k primary blocks are read from the identified set of k storage devices, wherein the value of k is selected according to the coding indicator associated with the data block. If one or more of the primary blocks is inaccessible, the method 100 comprises the step of reading the corresponding erasure-coded blocks and reconstructing one or more of the primary blocks corresponding to the inaccessible devices. The data block is assembled from the k primary blocks that were read. The steps of querying the metadata entry, reading primary blocks, and assembling the data block are repeated until the data is retrieved.
[0020] When a loss of a storage device of the plurality of storage devices is detected, the method 100 may further comprise reconstructing the primary blocks and/or erasure-coded blocks written to the lost storage device according to the respective sets of k primary blocks and a corresponding erasure-coded block stored on the remaining storage devices, wherein values of k may differ for each reconstructed primary block, and writing the reconstructed primary blocks and/or erasure-coded blocks to a replacement storage device.

[0021] The present disclosure may be embodied in software, for example, as a software application that performs the disclosed methods on, for example, the same computer system as the host. In other embodiments, the present disclosure may be embodied as a storage controller for storing and retrieving blocks of data in a block-level storage system. Such a storage controller may comprise a controller (e.g., a microcontroller, etc.) configured to be in electronic communication with a plurality of storage devices, wherein the controller is configured to perform any of the disclosed methods.
CIDER
[0022] The present disclosure is further described by way of an exemplary system, which was built to demonstrate these concepts and to prove the feasibility of such a system. The exemplary system, called CIDER, built according to an embodiment of the present disclosure, was implemented on a Linux platform. CIDER is a recursive acronym for "CIDER is Distributed and Extended RAID." In CIDER, erasure coding is performed at the block level, and data blocks are striped across a number of remote disks using the Network Block Device ("NBD") protocol (further described below). CIDER itself runs in the user space, making use of the block device in userspace ("BUSE") library, which exposes a filesystem in userspace ("FUSE")-like interface to a block device. Thus, by taking advantage of NBD and BUSE, a low-cost RAID replacement is provided that has the potential for better reliability and flexibility than software RAID, with some penalty paid in write overhead to compute the Reed-Solomon codes.
[0023] Erasure Coding

[0024] Erasure coding is a technique of adding error correction codes to data which enables reconstruction of the data when a part of the original data is lost. Throughout the present disclosure, the terms 'k' and 'm' are used, where 'k' is the minimum number of blocks required to reconstruct data and 'm' is the total number of blocks (data blocks + erasure-coded blocks) for the set of blocks. We can interpret this as expanding 'k' units of data to 'm' units, where (m - k) units of encoded data are added to 'k' units of original data, enabling reconstruction of the data with a combination of any 'k' units. In CIDER we use the well-known Reed-Solomon encoding libraries, although other encoding schemes are compatible with erasure coding and are within the scope of the present disclosure.

[0025] Network Block Device
[0026] Network Block Device (NBD) is a Linux component which gives a block storage device interface to a remote file or raw device. The server, which runs on a remote machine, exports a file or a raw device and listens for incoming requests on a configurable port (see, Figure 13). The client runs on a local machine, connects to the NBD server, and exposes a block device interface locally through /dev/nbd[0-16]. Once the connection is established, the remote NBD can be accessed as if it were present locally. The server runs in the user space and redirects the IO requests to the appropriate storage component. The client runs in the kernel space and transfers the IO requests to the remote machine to which the NBD is connected. CIDER combines the principles of NBD and erasure coding to offer a better system in terms of flexibility, reliability and storage overhead.
[0027] Loopback NBD Server/Client BUSE
[0028] BUSE is an implementation of the NBD server which enables developers to easily implement a Virtual Block Storage Device in user space by providing an interface similar to that of FUSE. This is achieved by creating an NBD client as a loopback device which redirects all IO calls to the NBD server running on localhost. These requests are intercepted by the NBD server and the developer can implement how these commands are interpreted.
[0029] The NBD protocol has five types of requests:
1) NBD_CMD_READ
2) NBD_CMD_WRITE
3) NBD_CMD_DISC (disconnect)
4) NBD_CMD_FLUSH
5) NBD_CMD_TRIM
[0030] BUSE will invoke the respective callback functions as registered by the user. These user defined callback functions can be set by populating the struct below and passing it to the BUSE server during the initialization.

static struct buse_operations aop = {
    .read  = reedsolo_read,
    .write = reedsolo_write,
    .disc  = reedsolo_disc,
    .flush = reedsolo_flush,
    .trim  = reedsolo_trim,
    .size  = SIZE,
};

[0031] Therefore, by implementing these functions as explained below, a virtual block device is created which does erasure-coded storage over the network.
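The following is a minimal sketch of how such a virtual block device might be brought up; it assumes it is compiled in the same translation unit as the aop table above, uses the buse_main() entry point provided by the BUSE library, and treats /dev/nbd0 as the loopback device. Error handling is omitted.

#include <stddef.h>
#include "buse.h"

/* Sketch only: assumes the aop table shown above is defined in this file. */
int main(int argc, char *argv[])
{
    const char *dev = (argc > 1) ? argv[1] : "/dev/nbd0";   /* assumed default */

    /* buse_main() blocks, servicing NBD requests arriving on the loopback
     * device and dispatching them to the reedsolo_* callbacks in aop. */
    return buse_main(dev, &aop, NULL);
}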
[0032] Storage Stack
[0033] A storage stack is a layered system, typically including storage media such as hard disk drives ("HDD"), solid state drives ("SSD"), etc. at the bottom-most layer, and having file-systems at the top-most layer with multiple logical block device layers in between. An exemplary storage stack is depicted in Figure 1.
[0034] File system
[0035] A file system is a piece of software which maintains the logical structure of how files are stored on a computer. It gives an interface to create, access, modify, and delete files on computers. Apart from basic access, a file system also implements functionality such as:
• maintaining information like owner, permissions, size, creation timestamp, last modified timestamp etc., for each file;
• organizing files into a hierarchical structure (directories and subdirectories); and
• providing algorithms to perform caching in order to optimize read and write operations.

[0036] Data Block
[0037] A block is the smallest logical unit of data in a storage system. It is typically
4 kilobytes (4096 bytes) on modern systems. On some systems it can be as large as 16 KB or larger.
[0038] Block Device
[0039] A block device is a logical abstraction of an underlying storage system. The block device deals with data in units of blocks. Typically, the file system is built on top of the block device. The physically smallest unit of data that can be written to or read from the hardware is referred to as a sector. A sector is typically 512 bytes for traditional hard disks.
[0040] Data Striping
[0041] Striping is a technique where sequential data is logically split into many smaller fragments and stored separately in multiple storage media, or sometimes within the same storage medium but in different partitions.

[0042] Redundant Array of Inexpensive Disks ("RAID")
[0043] RAID is a logical combination of storage media (HDD, SSD, etc.) that is exposed as a logical block device by the operating system. RAID can be implemented at the hardware, firmware, and software levels. Different forms of RAID include RAID - 0 / 1 / 5 / 6 and combinations like RAID - 10 / 01 / 50 / 60 (see, e.g., Figs. 2A-4). [0044] Extended file attributes
[0045] An extended file attribute is a file system feature that enables users to associate computer files with metadata not interpreted by the filesystem. In other words, the extended attributes are metadata, associated with files, which are free to be interpreted at any desired level. Typically, these are used to store information of a file related to encryption, digital signatures, etc., and are interpreted and used at the application layer. All major file systems (ext2, ext3, ext4, JFS, ReiserFS, XFS, Btrfs on Linux; NTFS on Windows; UFS2 and ZFS on Unix; HFS on Apple OS X) support extended file attributes.

[0046] In-band and Out-of-band Communication
[0047] Out-of-band communication is a technique where a dedicated communication channel is used to carry the control commands. This typically involves two channels— one specifically for control data, and another for the data itself. On the other hand, in-band communication is a technique where only one communication channel exists for both control commands and the data.
[0048] User & Kernel Level
[0049] Typically a process can exist in either of two contexts— the user mode or kernel mode. In the kernel mode the process has permissions to access the underlying hardware resources. All programs other than the operating system are started in user mode. To utilize these resources the user space program has to switch to the kernel mode by making a "system call."
[0050] Cache
[0051] In CIDER, cache is an in-memory structure that is maintained to make access to physical data blocks much faster. [0052] Cache Hit & Miss
[0053] In CIDER, if a requested page is present in cache, it is termed as a cache "hit."
Otherwise if a requested page is not present in the cache, then this condition is called a cache "miss."
[0054] Block Number

[0055] Every data block is addressed by a block number (e.g., 64 bits). In CIDER, these block numbers are all virtual block numbers that are derived from the block offset and length which are specified by the file system from the upper layer.
[0056] Page
[0057] In a storage media, groups of data blocks are collectively called "pages." In cache operations, pages are loaded and offloaded based on a fair page replacement policy that is implemented. [0058] Page Number & Page Offset
[0059] In a storage system, a block number includes a page number and page offset. For example, in CIDER for the purpose of caching, the first 48 bits of a block number represent the page number and the remaining 16 bits represent the page offset.
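As a concrete illustration of this split, a 64-bit block number can be decomposed with a shift and a mask; the helper names below are hypothetical, and only the 48-bit/16-bit split is taken from the description.

#include <stdint.h>

#define PAGE_OFFSET_BITS 16
#define PAGE_OFFSET_MASK ((1ULL << PAGE_OFFSET_BITS) - 1)

/* Hypothetical helpers: the upper 48 bits are the page number,
 * the lower 16 bits are the page offset. */
static inline uint64_t page_number(uint64_t block_number)
{
    return block_number >> PAGE_OFFSET_BITS;
}

static inline uint16_t page_offset(uint64_t block_number)
{
    return (uint16_t)(block_number & PAGE_OFFSET_MASK);
}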
[0060] Page Table Entry
[0061] In CIDER, the metadata cache includes a plurality of pages that contain a plurality of page table entries. Each page table entry represents the metadata associated with a physical data block.
CIDER High-Level System Architecture
[0062] Embodiments of the CIDER system architecture are depicted at a high level in
Figs. 5 and 14. The system comprises the following components:
• Storage Nodes - These are remote machines or storage controllers where the data will be physically stored. These are connected to a master node through a high-speed interconnect, such as, for example, InfiniBand, Fibre Channel, or Ethernet. A single machine can also host multiple devices, each through a different port.
• Master Node - The master node runs the CIDER virtual block device and is connected to the remote storage nodes which are presented as block devices. The virtual block device is a virtual interface running on the master node and visible to the user or file system. The master node, on which the virtual block device runs, is connected to each of the remote network block devices through the NBD-client interface. These devices will be available as local block devices through the /dev/nbd[1-N] interface. All the IO requests to the virtual block device are handled by the CIDER system and are forwarded to network block devices for actual IOs.
• CIDER Metadata Server - The Metadata server holds CIDER's block-level metadata. This can run on the master node or on a dedicated server. A metadata server is typically a highly fault-tolerant, fast server.
• BUSE - The CIDER system creates the virtual block device (e.g., /dev/nbd0) using BUSE.
[0063] Example: Consider a network of 8 nodes, each exposing a 1 TB file/raw device as a NBD and an erasure coding scheme of 5 + 3, where 3 is the maximum number of failures the system can tolerate.
1) The master node registers each of these nodes using the NBD client and would see the pool of NBDs as /dev/nbd[1-8]. The master node also creates a virtual block device /dev/nbd0 to include all the registered NBDs, whose size will be 5 TB. /dev/nbd0 will be the only interface to the user/filesystem.
2) When write requests are made to the block device (typically through the file system) the data is striped in units of block size (1024 bytes) and written to 5 NBDs.
3) RS codes are calculated and written to 3 NBDs. Typically, all NBDs will be on different physical nodes.
4) When a read request is made, data is read from the appropriate devices and, in case of failures, data is reconstructed using the encoded blocks before being returned to the file system.
[0064] CIDER is implemented in C on the GNU/Linux platform and has various software components. Before detailing each of these components, it is helpful to define the concepts of the virtual block and physical block in the context of CIDER. Since CIDER itself is a virtual device which does not physically store any data, the data block numbers of CIDER are referred to herein as "virtual block numbers." Each such virtual block is striped into k segments and after erasure coding to m segments, it is written to m devices. The block numbers of the set of m devices are referred to herein as "physical block numbers." Figure 6 shows a block diagram of various components within CIDER.
[0065] Constant k/m

[0066] Some embodiments of the present disclosure use constant k and m values. The constant k and m values may be set while creating (i.e., instantiating) the virtual device. In one embodiment, every write to the virtual device will add the same amount of error correction codes and every block has the same amount of fault tolerance. In one exemplary implementation, a minimal storage unit of 1024 bytes may be selected. The minimal storage unit may be selected based on a sector size of a hard disk used in the system. Setting the block size of the virtual device to the same size as the HDD sector size allows for flexibility to choose any value for k and m. For example, k and m can be chosen to be 100 and 120 respectively. This means that 20 blocks of 1024 bytes each will be added to 100 blocks of 1024 bytes, thus enabling
reconstruction of the original 100 blocks with any available 100 of the total 120 blocks.
[0067] An alternate option is to make the block size of the virtual device k times the storage device sector size. This enables striping a virtual block into k sectors and adding (m - k) sectors, thereby writing the virtual block to m storage devices. Such a selection may be less flexible but could offer higher performance. In one embodiment, amongst m available physical devices, the first k may be designated to be primary devices and the next (m - k) as secondary devices. This is illustrated in Figure 15.
[0068] In some embodiments, every virtual block may be mapped to a primary device through modulo arithmetic. For example, the owner of the virtual block may be calculated as the virtual block number modulo k. Every primary device owns 1/k of the virtual blocks and it is easy to calculate the physical block number just by dividing the virtual block number by k.

owner = virtual_block_number % k
device_block_number = virtual_block_number / k
[0069] A set of virtual blocks, each mapping to a different primary device, may form a group on which the Reed-Solomon codes may be calculated. The (m - k) encoded blocks are then written to the secondary devices. However, this approach may create problematic hot spots wherein the secondary devices are worn out faster than their primary counterparts. One potential solution to this problem is to spread out the blocks such that the secondary blocks are changed in cyclic order. This is shown in Figure 16 for a 4 + 3 scheme.
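A sketch of this mapping is shown below. The owner and device block number follow the modulo arithmetic above; the rotation formula for choosing the secondary (parity) device is only one possible way to realize the cyclic spreading and is an assumption, not the exact layout used by CIDER.

#include <stdint.h>

/* Constant-k/m placement sketch (assumes m > k).  The cyclic rotation of the
 * first parity device per group of k virtual blocks is an illustrative
 * assumption intended to spread wear over the (m - k) secondary devices. */
static void map_virtual_block(uint64_t vblock, unsigned k, unsigned m,
                              unsigned *owner, uint64_t *dev_block,
                              unsigned *first_parity_dev)
{
    *owner     = (unsigned)(vblock % k);   /* primary device owning the block */
    *dev_block = vblock / k;               /* block number on that device     */

    /* Rotate the secondary device that receives the first parity block for
     * each group of k virtual blocks. */
    *first_parity_dev = k + (unsigned)((vblock / k) % (m - k));
}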
[0070] IOs may be performed in parallel by making use of an asynchronous IO framework. For bulk IOs involving more than k blocks, a performance gain may be achieved due to the parallel IOs. Reed-Solomon codes may be calculated for every window of k blocks. A write operation is successful if at least k nodes are accessible; in other words, a quorum size would be k. The IOs may be performed in a window size of k blocks to improve performance. If the reads are not aligned to k block boundaries, a header and trailer window may be added and the IO may be performed in multiples of k blocks in central windows. Such a split may be advantageous. For example, consider k = 4 and an IO request for 12 blocks from block 11 to 22. This IO could be split into 3 parts (a sketch of this windowing computation follows the list):
1) Header window: blocks 11 - 12;
2) Center windows: blocks 13 - 20, consisting of 2 windows, each of 4 blocks; and
3) Trailer window: blocks 21 - 22.
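The windowing described above can be computed as sketched below; the function reproduces the k = 4, blocks 11 to 22 example and assumes 1-indexed block numbers, which is how the example in the text is written.

#include <stdio.h>

/* Sketch: split an IO over 1-indexed blocks [first, last] into a header
 * window, full center windows of k blocks, and a trailer window. */
static void split_windows(unsigned long first, unsigned long last, unsigned long k)
{
    unsigned long center_start = ((first - 1 + k - 1) / k) * k + 1; /* next group start */
    unsigned long center_end   = (last / k) * k;                    /* last group end   */

    if (center_start > center_end) {
        /* The request fits inside a single k-block group. */
        printf("unaligned window: %lu - %lu\n", first, last);
        return;
    }
    if (first < center_start)
        printf("header : %lu - %lu\n", first, center_start - 1);
    printf("center : %lu - %lu (%lu window(s) of %lu blocks)\n",
           center_start, center_end, (center_end - center_start + 1) / k, k);
    if (last > center_end)
        printf("trailer: %lu - %lu\n", center_end + 1, last);
}

int main(void)
{
    split_windows(11, 22, 4);   /* header 11-12, center 13-20, trailer 21-22 */
    return 0;
}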
[0071] The following method may be the callback which is invoked when a NBD_CMD_READ request is received:

static int reedsolo_read(void *buf, u_int32_t len,
                         u_int64_t offset, void *userdata);

wherein buf is a buffer that will contain the read data, offset is an offset in bytes from where data has to be read, and len is a length of data to be read, in bytes.
[0072] An exemplary algorithm for a read operation is as follows:
1) Calculate device block number and the device which owns it for the starting virtual block;
2) For bulk reads, split the blocks into 3 parts consisting of a header, the central blocks and a trailer;
3) Issue parallel reads to respective devices in a window of size k, as explained before;
4) If all the primary blocks are available, populate the buffer with the data read and return;
5) If any of the primary devices are unavailable, reconstruct the missing blocks by reading the secondary blocks; and
6) If the number of failures is more than (m - k), reconstruction is not possible; reject the read request. (A sketch of this decision logic follows the list.)
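The decision between steps 4, 5 and 6 can be summarized as in the sketch below; the function and type names are hypothetical, and the actual device IO and Reed-Solomon decode are outside its scope.

#include <stdio.h>

/* Sketch of the read-path decision for one k-block window.  primary_ok[i]
 * records whether primary block i was readable. */
typedef enum { READ_OK, READ_RECONSTRUCT, READ_FAIL } read_action_t;

static read_action_t classify_read(const int *primary_ok, unsigned k, unsigned m)
{
    unsigned failures = 0;
    for (unsigned i = 0; i < k; i++)
        if (!primary_ok[i])
            failures++;

    if (failures == 0)
        return READ_OK;            /* step 4: return the primary data        */
    if (failures <= m - k)
        return READ_RECONSTRUCT;   /* step 5: decode from secondary blocks   */
    return READ_FAIL;              /* step 6: too many failures, reject read */
}

int main(void)
{
    int ok[4] = { 1, 0, 1, 1 };                   /* one primary unreadable */
    printf("%d\n", classify_read(ok, 4, 7));      /* prints 1 (RECONSTRUCT) */
    return 0;
}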
[0073] The following method may be the callback which is invoked when a NBD_CMD_WRITE request is received:

static int reedsolo_write(void *buf, u_int32_t len,
                          u_int64_t offset, void *userdata);

wherein buf is a buffer which contains the data that has to be written, offset is an offset in bytes from where data has to be written, and len is the length of data to write, in bytes.
[0074] An exemplary algorithm for a write operation is as follows:
1) Calculate device block number and the device which owns it for the starting virtual block;
2) For bulk writes, split the blocks into three parts consisting of a header, the central blocks and a trailer;
3) Issue parallel writes to the respective device blocks in a window of size k, as explained before;
4) The Reed-Solomon coded blocks may be calculated once per window, thus it may be advantageous to select a window size of k to align the writes with k blocks;
5) For the header and trailer windows, the additional blocks which are required to calculate the codes may be read from the respective devices, for example, by the encode operation which is explained below; and
6) If any of the writes to the primary devices fail, the device may be marked as failed, either permanently or temporarily.
[0075] When the writes to all the primary devices have succeeded, the parity blocks may be generated and written to the secondary devices. If the write does not involve all of the k devices, or the write concerned is that of the header or the trailer window, additional reads may be required to recalculate the Reed-Solomon codes. Further, if there is another device failure during the process of recalculation, the failed device has to be reconstructed first to recalculate the new parity. This condition can be handled by performing a two-phase commit, wherein the old values are read into a temporary buffer before writing the new values. This can be used to calculate the new parity information.

[0076] A reconstruction operation may be required during reads if one or more primary devices have failed. This is done by reading k blocks from the respective devices, which may include secondary devices, and reconstructing the missing primary block using, for example, the zfec APIs.
[0077] Reads and writes to devices may be issued one block at a time. When handling large IOs, rather than issuing multiple IOs to devices, the requested blocks may be bundled in one command. This may require maintaining a cache and scheduling the IOs to remote devices.
[0078] Some embodiments use a variable k and m. In these embodiments, the values of k and m may be set on a per block basis. One example of how blocks may be stored is shown in Figure 17.
[0079] The caller of the block level driver may specify the value of k and m with each write request. In one embodiment, an out-of-band inter-process communication (IPC) channel may be implemented using named pipes. The caller may pass a message through a named pipe in a well-defined format. One such format is indicated below:

struct Cider_write_hdr {
    uint8_t magic_num[8];
    uint8_t m;
    uint8_t k;
};

[0080] An 8-byte magic number may be used to verify the integrity of the message.
When no message is available, the block device may use the previously used values of k and m. During initialization these values may be set based on a pre-defined policy. In one embodiment, m may be variable and k may be constant. One of the advantages of keeping k constant is that incoming write requests can be divided into k sectors while the extra parity information may be added to secondary sectors.
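From the caller's side, setting k and m for the next write could look like the sketch below. The pipe path and magic value are assumptions; only the header layout mirrors the struct Cider_write_hdr shown above.

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CIDER_CTL_PIPE "/tmp/cider_ctl"   /* hypothetical named pipe path */

struct Cider_write_hdr {
    uint8_t magic_num[8];
    uint8_t m;
    uint8_t k;
};

/* Sketch: tell the block device which k and m to use for subsequent writes. */
static int cider_set_redundancy(uint8_t k, uint8_t m)
{
    struct Cider_write_hdr hdr;
    int fd = open(CIDER_CTL_PIPE, O_WRONLY);
    if (fd < 0)
        return -1;

    memcpy(hdr.magic_num, "CIDERHDR", 8);   /* assumed 8-byte magic value */
    hdr.m = m;
    hdr.k = k;

    ssize_t n = write(fd, &hdr, sizeof hdr);
    close(fd);
    return (n == (ssize_t)sizeof hdr) ? 0 : -1;
}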
[0081] In one exemplary embodiment, a virtual device may be created with a block size of (k * sector size of the other network devices). For example, if the block size of the 8 network block devices participating in the system is 1024 bytes and k is fixed at 4, then the virtual device will have a block size of 4096 bytes. Each block is split into 4 sectors of 1024 bytes each and distributed to 4 nodes. In addition, (m - k) parity sectors may be created and written to the (m - k) disks. In one embodiment, none of the sectors go to the same device.
[0082] Embodiments with variable k and m values may have metadata associated with each block. The metadata may need to be stored in a persistent storage. Various types of metadata and metadata storage may be used. In one embodiment, the entire metadata may be stored to a disk and the entire metadata may be loaded from the disk. During operation, all the metadata may be in memory and when the device is disconnected, the metadata may be flushed to the disk.

[0083] Similarly, during startup, the metadata may be read from the disk and loaded to the memory. In one embodiment, the metadata comprises two components, a block map and a bit map.
[0084] At the highest level of abstraction, a superblock is maintained. The superblock may contain information about the number of devices, size of the block device, as well as other high-level data. One structure of the superblock is shown below. The superblock may also contain a handle to a list of block headers. The block headers may contain the values of k and m used for that particular block and the device block numbers of all the devices to which the block was written. The size of the entire block map may be approximately 1.6% of the block device size.

struct superblock {
    struct blk_hdr *blks;
    uint64_t n_blks;
    uint64_t blks_in_use;
    int8_t n_devs;
    struct bitmap **bm;
    /* Other private members */
};
[0085] When a write request is made for a particular virtual block for the first time, depending on the values of k and m, physical sectors on NBDs have to be allocated to the virtual block. To do this, a bitmap of the blocks in use may be maintained for each NBD. For each NBD, the corresponding bitmap may be checked and the first free block may be allocated. In one embodiment, the size of the bitmap would be about 0.2% of the block device size. Therefore, the disk wastage due to the metadata would be about 1.8% of the total block device size.
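One straightforward realization of the per-device bitmap search is a linear scan over 64-bit words, as sketched below (a single-level bitmap; the multi-level variant discussed later avoids scanning the whole map).

#include <stdint.h>

/* Sketch: find and claim the first free block in a single-level bitmap.
 * A set bit means "in use".  Returns the block index, or -1 if full. */
static int64_t bitmap_alloc_first_free(uint64_t *bm, uint64_t n_blocks)
{
    for (uint64_t w = 0; w * 64 < n_blocks; w++) {
        if (bm[w] == UINT64_MAX)
            continue;                          /* all 64 blocks in use        */
        for (unsigned b = 0; b < 64; b++) {
            uint64_t idx = w * 64 + b;
            if (idx >= n_blocks)
                break;
            if (!(bm[w] & (1ULL << b))) {
                bm[w] |= (1ULL << b);          /* mark the block as allocated */
                return (int64_t)idx;
            }
        }
    }
    return -1;
}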
[0086] The IO operations are similar to those explained above in the constant k and m section. However, the additional steps required are explained in the following exemplary algorithm:
1) When a write request is made to a particular block, the block header may be checked to see if that block was previously allocated;
2) If the block was previously allocated, the associated sectors and the values of k and m may be retrieved and parallel writes may be issued to the primary disks;
3) The requested number of encoded blocks are created and written to the secondary blocks;
4) If the virtual block was not previously created, the device with the minimum number of blocks may be chosen as the starting device. This may be done to ensure uniform distribution of the blocks over the plurality of devices. Once the start device is chosen, the set of devices serving this particular write would be from the start device to the next m devices, wrapping around after the last device is reached.
[0087] Some embodiments may allow for variable values of k and m, in other words, change the degree of the redundancy for each file, by utilizing the Extended Attributes of the filesystem. To achieve this, the filesystem may need to be modified, in particular, the calls which set extended attributes. When a "redundancy degree" extended attribute for a file is set, the block driver may be informed about the k and m for the set of blocks associated with that file and the block driver will write the data accordingly.

[0088] Metadata size may grow linearly with the size of the block device driver. For larger devices, it may become infeasible to maintain the entire metadata in memory. In these circumstances, a caching mechanism may be used to access metadata. Since the access of the blocks is generally sequential, cached metadata may result in performance gains. For example, a Least Recently Used (LRU) cache based on a clock algorithm may be used for caching the block headers.
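The per-file interface described in paragraph [0087] could be exercised with the standard Linux extended-attribute call, as in the sketch below; the attribute name user.cider.redundancy and its "k+parity" string format are illustrative assumptions.

#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>

/* Sketch: tag a file with a hypothetical "redundancy degree" extended
 * attribute.  Only the idea of carrying k and m through extended attributes
 * comes from the description; the name and format are assumptions. */
static int set_redundancy_degree(const char *path, int k, int m)
{
    char value[16];
    snprintf(value, sizeof value, "%d+%d", k, m - k);   /* e.g. "4+3" */
    return setxattr(path, "user.cider.redundancy", value, strlen(value), 0);
}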
[0089] For the bitmap, with the increase in the number of blocks, searching a bitmap for available free blocks becomes increasingly costly. To make this scalable, a multilevel bitmap may be used in place of a single level bitmap. Also, a write-through cache may be implemented for all the metadata in place of the store and flush methods discussed previously.
[0090] Additional Description of Variable k/m Model
[0091] In the variable k and m model on which CIDER is based, the values of k and m are set on a per block basis. An example of the way blocks are stored is shown in Figure 7. Primary blocks are named 1P, 2P, 3P, etc., and secondary (erasure-coded) blocks are named 1S, 2S, 3S, etc. From Figure 7, it can be seen that the first virtual block is written with a policy of 4 + 3 and starts from device 0. The second virtual block is written with a policy of 4 + 2 and hence occupies only 2 secondary blocks. The third virtual block starts with device 6 and is written with a policy of 4 + 1. Finally, the fourth virtual block starts at device 5 and is written with a policy of 4 + 0. We can see that the last physical block in devices 1 to 7 is not used.

[0092] The caller of the block store specifies the value of k and m with each write request. For this, an out-of-band IPC was used with named pipes. The caller passes a message through this pipe in a format that CIDER defines. When no message is available in the pipe, the block device uses the previously used values of k and m. During initialization, these values are set to a predefined policy. Though any arbitrary value for k is supported, during testing, k was kept constant and only the value for m was varied. One of the advantages of keeping k constant is that all the incoming write requests can be easily divided into k sectors and the extra parity information added into secondary sectors.

[0093] IO engine
[0094] All the IO requests to the CIDER block device are handled by this component. It is responsible for striping the data and writing them to remote nodes, reading from remote nodes and rearranging them, and reconstructing the data blocks if primary nodes are not reachable. On each write request, the caller sets the values of k and m as described in the previous section. Using these values, the virtual block number is converted to a set of physical block numbers (more in the next section). Each block is striped, and primary data along with encoded secondary data is written to remote nodes. For reads, the virtual block requested is converted to the set of physical blocks with the help of a "block translation layer" which is explained below. Reads are issued to the corresponding devices at the corresponding offsets. If there is a failure of one of the devices, the data is reconstructed on the fly by reading the secondary blocks.
[0095] CIDER Block Translation Layer
[0096] The block translation layer has the job of converting any given virtual block number to a set of physical block numbers. For writes, the new physical blocks have to be allocated according to the specified k and m values and for reads, these values have to be retrieved. For this purpose, it uses the following two sub-components:
• Block Metadata (BlockMap) - A large array having the fields required for translation of a virtual block to physical block(s). Each virtual block in use will have an entry in the BlockMap. The size of the block map increases with the size of the system and needs to be persistent. For this reason, the BlockMap is maintained in a persistent store, either on the master node itself or on a dedicated metadata server. In the CIDER prototype, which uses 8 nodes, each block is associated with a metadata of 68 bytes. For a block size of 4k, this would amount to an overhead of about 1.6% (one possible entry layout is sketched after this list).
• Block allocator (Bitmap) - The block translation layer has to manage the physical block allocations and deletions. Whenever there is a write request for a new virtual block, at least one unused physical block is assigned to this virtual block. To do this, a bitmap is maintained representing the physical blocks for each device. With the increase in the number of blocks, searching a bitmap for an available free block becomes increasingly costly. This is a common problem with file systems and two popular approaches to resolving the problem are: maintaining a list of free blocks and maintaining a B-tree structure. In CIDER a multi-level bitmap was implemented which has a reasonably fast performance. Given a block size of 4k, the size of all the bitmaps would be about 0.2% of the system capacity. Therefore, the storage overhead due to the metadata would be about 1.8-2.0% of the system's capacity.
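One possible layout for a BlockMap entry in the 8-node prototype is sketched below; the field names and packing are assumptions, chosen only to be consistent with the 68 bytes of per-block metadata and the fields (status, k, m, starting device, device block numbers) listed elsewhere in this disclosure.

#include <stdint.h>

/* Hypothetical BlockMap entry for the 8-node prototype:
 * 1 + 1 + 1 + 1 + 8*8 = 68 bytes per virtual block. */
struct blk_hdr {
    uint8_t  in_use;         /* block status                                   */
    uint8_t  k;              /* number of primary blocks                       */
    uint8_t  m;              /* primary + erasure-coded blocks                 */
    uint8_t  start_dev;      /* device at which striping starts                */
    uint64_t dev_block[8];   /* physical block number on each of the 8 devices */
} __attribute__((packed));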
[0097] Metadata Cache

[0098] As noted above, the size of the metadata grows linearly with the capacity of the system. For larger devices, it becomes infeasible to maintain the entire metadata in memory. On the other hand, accessing the metadata from disk would be a very costly operation. Because the access of the blocks is generally sequential, the metadata was cached to get significant performance gains. A simple write-back cache was used with the least recently used ("LRU") replacement policy for BlockMap metadata. The size of the cache allocated will have an impact on the performance of a system. Advantageously, all the metadata can be brought into memory for best performance.
[0099] Encoder/Decoder [0100] In the exemplary system, the Zfec-Tahoe library was used for Erasure Coding.
The Zfec library for erasure coding has been in development since 2007; but its roots have been around for over a decade. Zfec is built on top of a Reed-Solomon coding library developed for reliable multicast by Rizzo. That library was based on previous work by Karn et al, and has seen wide use and tuning. Zfec is based on Vandermonde matrices when w = 8. The library is programmable, portable and actively supported by the author. When the writes to all the primary devices have succeeded, the parity blocks are generated and written to the secondary devices. During reads, this component is invoked only when there is a failure in one of the primary disks.
[0101] Disk Defragmenter
[0102] With usage of the system, performance degrades due to sparse distribution of the blocks. This is especially true if the system is initially used with high redundancy (i.e., more parity blocks) and later changed to a lower redundancy model (i.e., fewer parity blocks). For example, if the system has written the first 1000 blocks with a scheme of 4+4, and later the first 100 blocks are converted to a 4+0 scheme, a large number of blocks are freed. Many such changes would create "holes" in the storage system leading to degraded performance. To avoid this, a defragmentation module was used to run periodically through the list of blocks and rearrange them in a contiguous fashion.
[0103] Reconstructor
[0104] The disk recovery framework provides a mechanism to reconstruct an entire disk when there is a permanent failure of a node. On receiving a control message from the user, the reconstruction operation will be triggered. The administrator can schedule the reconstructor to start when the system is light on load. When the disk is under reconstruction, relevant writes will be forwarded to the new disk. However, reads will continue to be served by block level reconstruction to avoid any inconsistencies until the entire process is complete. Once the reconstruction is complete, the device is marked as fully available and all relevant IO can be served by the new disk.
[0105] Interface with the file system - Extended File Attributes
[0106] Embodiments of the presently disclosed system accommodate varying the values of k and m. In some embodiments, the degree of the redundancy can be set for each file of a file system, by utilizing the Extended Attributes of the file system. To achieve this, the file system can be modified with calls for setting extended attributes. When the "redundancy degree" extended attribute for a file is set, CIDER is informed about the k and m for the set of blocks associated with that file and CIDER writes the block data accordingly.
Operational details of CIDER

[0107] As an example configuration, a network of 7 storage nodes and 1 master node is provided where each storage node exposes a 1 TB block device through NBD. In the example, the default erasure coding scheme is 4 + 3 (corresponding to k = 4 and m = 7), where 3 is the maximum number of failures the system can tolerate. The master node registers each storage node using the NBD client and would see the pool of NBDs as /dev/nbd[1-7]. The master node also creates the CIDER block device which will be accessible at /dev/nbd0 and would have an aggregate size of 4 TB. The block size of the CIDER system in this example is 4 KB and the block size of the individual NBDs is 1 KB.

[0108] CIDER Initialization:
[0109] When a system with the above exemplary configuration boots up, it may learn about the system parameters from command line arguments. The system checks if the metadata store (on the local system or on the remote server) has valid content. If not, the system assumes it is a virgin system and starts the initialization of the metadata. It marks all the blocks as unused and stores a superblock containing the system parameters at the beginning of the metadata. It also initializes the bitmaps of each storage node, marking all the blocks as free.
[0110] Write Operations
[0111] The flow chart of Fig. 10 depicts the steps of a write operation. When a write request is received by CIDER, the system attempts to read the requested k and m values from the out-of-band channel described above. If there is no specific value set, the previously used values of k and m are used. Even if the system which uses CIDER never specifies new k and m values, the default values will be used throughout the operation.
[0112] Types of write operations include:
• a new write to a new set of blocks;
• a modify operation on a set of old blocks with the previously used k and m;
• a change in the value of m for a set of blocks; and
• a modify operation involving a change in the value of k
[0113] The last 3 cases fall under the modify category. The second of these cases (a change in the value of m) can be divided into two types:
• a request to increase redundancy; or
• a request to decrease redundancy.
[0114] In the first case, new physical blocks are allocated to hold the additional redundancy, the original data is read, the parity blocks are recalculated, and the additional parity is written to the newly allocated physical blocks. For safety, the new physical blocks are allocated on devices which do not contain any of the other physical blocks for the virtual block in question.

[0115] The second case is simpler: it involves freeing up erasure-coded blocks and updating the metadata as per the new value of m.
[0116] All other writes follow the steps in the flow chart. In the exemplary system with the configuration described above, all write requests to CIDER were in multiples of the virtual block size of 4k. Each 4k data block is striped in units of fragment size of 1024 bytes and written to four NBDs as primary blocks. Reed-Solomon codes are calculated and written to (m - k) NBDs as secondary blocks (i.e., erasure-coded blocks).
[0117] While allocating new physical blocks on devices, the devices can be chosen such that the physical blocks are evenly spread across the storage nodes. That is, at any given time, all the devices will have approximately the same amount of used physical blocks. During a write operation, if more than (m - k) devices fail, the write is failed. When all the virtual blocks are written successfully, the write operation ends and returns a success code.
[0118] Read Operations
[0119] The flowchart of Fig. 11 depicts a read operation on the CIDER system. As explained previously, the CIDER block translation layer is more involved during read operations. Like writes, all reads in the exemplary system are in multiples of 4k. Once the block number is derived from the block offset provided by the file system, the virtual block number is translated to a set of physical block numbers and device pairs. As shown in the flowchart, if the block number is not used, we simply ignore the read request. On a typical read, only the four primary devices of that block are read. If any of the primary devices are down, the secondary devices of that block are read to reconstruct the data. As depicted in the flowchart, if the number of failures is more than (m - k), the read is failed. Once all the blocks are read, the read operation is completed.
[0120] When a read request is made, primary blocks are read in parallel from the four NBDs and rearranged by CIDER. In case of failures, data is reconstructed by reading one or more of the corresponding erasure-coded blocks. [0121] Cache Operation
[0122] In CIDER, an application cache is maintained, where the application cache is an in-memory representation of a subset of the system metadata. As defined above, CIDER maintains metadata for every data block with the following information:
• Block status - Indicates whether this data block is in use or not;
• k, m - Indicates the degree of redundancy for this data block;
• Starting device - Indicates the starting storage device id from which this data block has been striped; and
• Device block numbers - Indicates the block numbers at which the striped data is present on the respective storage device(s).
[0123] In CIDER, a cold start mechanism was employed for the cache. This means that when the system starts up the cache is empty, and as the system is used over time the cache gets populated according to the access patterns. The structure of the cache in CIDER includes N fixed-size pages, each of which includes M page table entries, where each page table entry represents the metadata information for one physical block. By having a subset of the metadata in the cache, the performance of the system can be improved significantly.
[0124] When a block is requested, the virtual block number is derived from the block offset provided by the file system. From the virtual block number, the page number and page offset are determined (see, e.g., Figure 12). The cache is searched for this page; if the page is present, then the corresponding physical blocks are returned. Otherwise, a page is loaded from the metadata store onto the cache. In case the cache is full, a page is evicted from the cache and the new page is loaded onto the cache. Once loaded onto the cache, the actual physical block is sent to the routine that requested it.
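The lookup path can be summarized as in the sketch below; the structure, the round-robin victim selection, and the load_page() callback are simplifications and assumptions (CIDER itself uses an LRU-style replacement policy).

#include <stdint.h>

#define PAGE_OFFSET_BITS 16
#define CACHE_PAGES      4           /* N, kept tiny for illustration */

/* Hypothetical in-memory metadata cache of N pages; a real implementation
 * also stores the M page table entries belonging to each cached page. */
struct meta_cache {
    uint64_t page_no[CACHE_PAGES];
    int      valid[CACHE_PAGES];
    unsigned victim;                 /* next slot to evict (round-robin here) */
};

/* Return the cache slot holding the page for block_no, loading (and if
 * necessary evicting) a page on a miss.  load_page() stands in for reading
 * the page from the metadata store. */
static unsigned cache_lookup(struct meta_cache *c, uint64_t block_no,
                             void (*load_page)(uint64_t page_no))
{
    uint64_t page_no = block_no >> PAGE_OFFSET_BITS;

    for (unsigned i = 0; i < CACHE_PAGES; i++)
        if (c->valid[i] && c->page_no[i] == page_no)
            return i;                               /* cache hit                */

    unsigned slot = c->victim;                      /* cache miss: evict + load */
    c->victim = (c->victim + 1) % CACHE_PAGES;
    load_page(page_no);
    c->page_no[slot] = page_no;
    c->valid[slot]   = 1;
    return slot;
}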
[0125] Reconstruction [0126] When a storage node fails permanently, a reconstruction operation can be triggered. As discussed above, an advantage of the reconstruction is that one can choose to reconstruct only data blocks which are in use. If desired, a larger-scale reconstruction can be triggered only when the system is light on load. In one embodiment, the load on the system can be measured by a counter which ticks every time a write or a read operation is requested. Other techniques for determining load will be apparent in light of the present disclosure.
[0127] In an embodiment, the client can trigger the reconstruction by passing the reconstruct request with the details about the old and the new device. The process of reconstruction involves iterating through all the virtual blocks and writing the new data. Each virtual block reconstruction might involve one of the below two scenarios:
• The failed device contains the secondary block. In this case, data is read from k primary blocks and parity blocks are recalculated and the appropriate parity block is written to the new (i.e., replacement) device.
• The failed device contains a primary block. In this case, k blocks are read and the primary block is first reconstructed using the secondary blocks and is written back to the new device.
[0128] The process can be done lazily in the background when the system is light on load.
PERFORMANCE
[0129] We have measured the read and write times on a cold system by eliminating the cache effects as much as possible. In all our experiments, we have used the below environment as our test bed:
• The host machine is the master node running the Cider system;
• The VM exposes 8 virtual disks through the NBD client. These are connected on the host as /dev/nbd[1-7];
• Hardware Used: Intel i5 processor with a 1 TB HDD running at 5400 RPM and 8 GB of RAM, running Ubuntu 12.10;
• Network overhead which would be present in real clusters would be absent in our test bed; and
• The environment also conceals the performance gains due to parallel IO, as there is only 1 physical disk. A. Erasure codec library
[0130] The Zfec library for erasure coding has been in development since 2007, but its roots have been around for over a decade. Zfec is built on top of a RS coding library developed for reliable multicast by Rizzo. That library was based on previous work by Karn et al[15], and has seen wide use and tuning. Zfec is based on Vandermonde matrices when w = 8. The library is programmable, portable and actively supported by the author. A study of the performance of Zfec compared with other open source libraries can be found in the paper written by Plank[5]. The figures below show the performance of the zfec command line tool on our local test bed. Figure 18 shows the encoding performance for various k, m schemes. We can see that the time increases with the increase in the number of parity/secondary blocks. As these times include the disk IO times as well, for a fair comparison we have included a 4+0 scheme, which is nothing but a 4-way split of the data. Figure 19 shows the reconstruction times with various disk failures. Again, with no disk failures, this is equivalent to the concatenation of the split files. Here too, we can see that the times increase with the failure rate. Comparing the encoding and decoding times, we can observe that encoding is a costlier operation compared to decoding.
B. Constant k and m
[0131] We studied the read and write performance for the constant k and m model by fixing the values of k and m to 4 and 7 respectively. In other words, we have 4 primary disks and 3 secondary disks and the system can tolerate up to 3 disk failures. We compare the IO performance of such a system with the regular Linux raw block device using the dd interface.
[0132] 1) Read Performance: We tested the erasure coding for the following scenarios with a test file of size 1 GB.
1) Read times for raw data access on ext4 file system using the dd interface
2) Read times on Cider with no failures
3) Read times on Cider with 1, 2 and 3 disk failures
[0133] Figure 20 summarizes the test results. When compared to raw reads, we observe that the performance degrades by about 40% for no disk failures and by about 80% for 3 disk failures. We expect the read times to improve when deployed on the multi-node cluster, mainly due to parallel reads. In normal operation, the reads do not involve any overhead of reconstruction and hence, on faster networks, should outperform single disks, and times should even approach those of RAID devices.
[0134] 2) Write Performance: We tested the erasure coding for the following scenarios with a test file of size 1 GB.
1) Write times for raw data access on ext4 file system using the dd interface
2) Write times on Cider with no failures
3) Write times on Cider with 1, 2 and 3 disk failures
[0135] Figure 21 summarizes the test results. Write involves a higher overhead than read as there is the encoding overhead for every write. When compared to raw writes with no EC, our system is between 2 and 4 times slower. We also observe that the write times decrease when the disks fail. This is because with each additional disk failing, we have less IO to perform. As with reads, we expect this to improve on multi-node clusters due to parallel IO. For the same reason, we expect the write times to become independent of the number of disk failures.
[0136] However, if the system parameters, viz., the values of k and m and the block sizes of the virtual block devices, are not chosen wisely, we can see a performance degradation, especially for smaller writes. As an example, let the block size of the virtual device and the actual device be chosen as 1 KB and the values of k and (m - k) be chosen as 5 and 3 respectively. In this case, if a set of arbitrary writes of 1 KB are issued, every write involves reading 4 blocks from other devices to recalculate the parity blocks. Performance degrades even further if there are any disk failures. The failed blocks have to be reconstructed using the old secondary blocks first, and then, using the reconstructed block, the new secondary blocks have to be computed. As explained in the architecture, this problem can be avoided if the block size of the virtual device is k times the block size of the network devices.
C. Variable k and m

[0137] We repeated the same set of experiments with our system which allows a variable k and m model. We have set the values of k and m to 4 and 7 respectively unless mentioned otherwise. In this model, there is an amount of overhead for mapping each virtual block to physical blocks. For writes, there is additional overhead for allocating the first free block in each device. However, in our current implementation, all the metadata required for such a mapping would be present in the memory and this operation is not very costly. However, we expect the performance to degrade marginally when the metadata is present on the disk and is accessed through a write-through cache.
[0138] 1) Read Performance: Figure 22 shows the read times for the 4+3 scheme for 0, 1, 2 and 3 disk failures. Like in the constant scheme, these times are compared with the read times of a raw device. We also see that these times are very close to the read times of the constant model. We cannot observe the overhead, mainly due to the small amount of data involved in the tests. However, we expect a small degradation in performance due to accessing the metadata on large systems.
[0139] 2) Write Performance: Figure 23 shows the write times for the 4+3 scheme for 0, 1, 2 and 3 disk failures. Like writes in the constant scheme, we see a decrease in the write times with an increase in the disk failures, as we will have fewer disks to write to. In the case of writes too, the values are very similar to the constant scheme as the overhead due to metadata is hidden.
[0140] Figure 24 shows the write times for various values of redundancy. We can see the times increase with an increase in m when k is fixed at 4. Similar to previous cases, we expect this difference to reduce if the nodes are physically separate disks.
[0141] Although the present disclosure has been described with respect to one or more particular embodiments, it will be understood that other embodiments of the present disclosure may be made without departing from the spirit and scope of the present disclosure. Hence, the present disclosure is deemed limited only by the appended claims and the reasonable interpretation thereof.

Claims

What is claimed is:
1. A method for electronically storing block-level data, comprising:
receiving, from a file system or client software, a request to write data, the data comprising a plurality of data blocks;
receiving one or more coding indicators, wherein each data block of the plurality of data blocks is associated with a coding indicator, wherein each coding indicator represents a value, k, of a number of primary blocks to be written for a data block and a value, m, of a sum of k plus a number of erasure-coded blocks for the set of primary blocks;
writing a data block of the plurality of data blocks as a set of primary blocks such that k is selected according to the coding indicator associated with the data block, each primary block being written to a separate storage device;
calculating a value of each of m - k erasure-coded blocks based on the set of primary blocks according to the coding indicator associated with the data block;
writing each of the calculated erasure-coded blocks to a separate storage device, and wherein the erasure-coded blocks are not written to the same storage devices as the set of k primary blocks; and
recording a metadata entry associated with the data block and comprising the coding indicator used for the data block.
2. The method of claim 1, wherein the steps of writing a data block, calculating m - k erasure-coded blocks based on the set of k primary blocks, and writing each of the calculated erasure-coded blocks, are repeated until each data block of the plurality of data blocks has been written.
3. The method of claim 2, wherein at least one of the data blocks has a coding indicator for values of k and/or m which are different than the values of k and m for other data blocks.
4. The method of claim 1, wherein the one or more coding indicators are received as part of the request to write data.
5. The method of claim 1, further comprising:
identifying a set of storage devices corresponding to the set of primary blocks for a data block; and
wherein the metadata entry comprises the identity of the set of storage devices.
6. The method of claim 5, further comprising:
receiving a request to retrieve the data;
querying the metadata entry associated with a data block to determine the corresponding coding indicator and the identity of the set of storage devices corresponding with the data block;
reading k primary blocks from the identified set of k storage devices, wherein the value of k is selected according to the coding indicator associated with the data block;
optionally reading the corresponding erasure-coded blocks and reconstructing one or more of the primary blocks when one or more of the k primary blocks is inaccessible;
assembling the data block from the read k primary blocks; and
repeating the steps of querying the metadata entry, reading primary blocks, and assembling the data block until the data is retrieved.
7. The method of claim 1, wherein a loss of a storage device of the plurality of storage devices is detected and further comprising:
reconstructing the primary blocks and/or erasure-coded blocks written to the lost storage device according to the respective sets of primary blocks and a corresponding erasure-coded block stored on the remaining storage devices, wherein values of k may differ for each reconstructed primary block; and
writing the reconstructed primary blocks and/or erasure-coded blocks to a replacement
storage device.
8. The method of claim 7, wherein reconstructing the primary blocks and/or erasure-coded blocks is performed during periods of low activity.
9. The method of claim 7, wherein reconstructing the primary blocks and/or erasure-coded blocks is performed only on demand.
10. The method of claim 1, further comprising defragmenting the primary blocks and/or erasure- coded blocks of the storage devices such that the blocks are contiguous.
11. The method of claim 1, wherein the coding indicator is determined by an extended attribute of the file.
12. A storage controller for storing and retrieving data in a block-level storage system, the storage controller comprising:
a controller configured to be in electronic communication with a plurality of storage devices, wherein the controller is configured to perform any of the methods of claims 1-11.
PCT/US2015/026267 2014-04-16 2015-04-16 System and method for fault-tolerant block data storage WO2015161140A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201461980562P 2014-04-16 2014-04-16
US61/980,562 2014-04-16

Publications (1)

Publication Number Publication Date
WO2015161140A1 true WO2015161140A1 (en) 2015-10-22

Family

ID=54324591

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2015/026267 WO2015161140A1 (en) 2014-04-16 2015-04-16 System and method for fault-tolerant block data storage

Country Status (1)

Country Link
WO (1) WO2015161140A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11416180B2 (en) 2020-11-05 2022-08-16 International Business Machines Corporation Temporary data storage in data node of distributed file system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070245083A1 (en) * 2006-04-04 2007-10-18 Margolus Norman H Erasure Coding Technique For Scalable And Fault Tolerant Storage System
US20080115017A1 (en) * 2006-10-31 2008-05-15 Jacobson Michael B Detection and correction of block-level data corruption in fault-tolerant data-storage systems
US20100218037A1 (en) * 2008-09-16 2010-08-26 File System Labs Llc Matrix-based Error Correction and Erasure Code Methods and Apparatus and Applications Thereof
US20120017140A1 (en) * 2010-07-15 2012-01-19 John Johnson Wylie Non-mds erasure codes for storage systems
US20130173996A1 (en) * 2011-12-30 2013-07-04 Michael H. Anderson Accelerated erasure coding system and method
US20130204849A1 (en) * 2010-10-01 2013-08-08 Peter Chacko Distributed virtual storage cloud architecture and a method thereof


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15780096

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15780096

Country of ref document: EP

Kind code of ref document: A1