US20080270704A1 - Cache arrangement for improving raid i/o operations - Google Patents
Cache arrangement for improving RAID I/O operations
- Publication number
- US20080270704A1 (application US11/741,826)
- Authority
- US
- United States
- Prior art keywords
- parity
- cache
- data
- data blocks
- storage nodes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0813—Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0866—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
- G06F12/0871—Allocation or management of cache space
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
- H04L67/5682—Policies or rules for updating, deleting or replacing the stored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/1028—Distributed, i.e. distributed RAID systems with parity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/26—Using a specific storage system architecture
- G06F2212/261—Storage comprising a plurality of storage devices
- G06F2212/262—Storage comprising a plurality of storage devices configured as RAID
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2212/00—Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
- G06F2212/28—Using a specific disk cache architecture
- G06F2212/283—Plural cache memories
- G06F2212/284—Plural cache memories being distributed
Definitions
- Network-RAID Redundant array of independent disks
- Distributed storage systems using a network-RAID protocol can process, or coordinate, a network-RAID-protocol I/O request (I/O request) locally at a client node or the request can be forwarded to a storage server or a coordination server for processing. For example, one client node may locally write data to a particular data location, while another client node may choose to forward a read or a write request for the same data location to a shared, or coordination, server.
- the embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method for cache management within a distributed data storage system begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes.
- the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions.
- the cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object.
- the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field.
- caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field.
- the storage nodes comprise more than one parity storage node, the data blocks are cached in any of the parity storage nodes.
- the method further includes updating the data object. Specifically, a write request is annotated with information regarding changed data blocks within the data object; and, the write request is only sent to the parity storage nodes.
- the sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation. Subsequently, the data blocks and parity data block are read from the storage nodes.
- An apparatus for cache management within a distributed data storage system comprises a partitioner to partition a data object into a plurality of data blocks.
- An analysis engine is operatively connected to the partitioner, wherein the analysis engine creates one or more parity data blocks from the data object.
- a controller is operatively connected to the analysis engine, wherein the controller stores the data blocks and the parity data blocks within storage nodes.
- the controller also caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions.
- the cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object.
- the controller only caches data blocks in parity storage nodes, wherein the parity storage nodes have a parity storage field.
- the controller avoids caching data blocks within storage nodes lacking the parity storage field.
- the controller caches the data blocks in any of the parity storage nodes.
- the controller annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes.
- the controller simultaneously performs an invalidation operation and a write operation.
- the apparatus further includes a reader operatively connected to the controller, wherein the reader reads the data blocks and the parity data blocks from the storage nodes.
- the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data.
- Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
- FIG. 1 is a table illustrating benefits of caching while executing write and reconstruct read operations
- FIG. 2 is a table illustrating an enumeration of the type of plans generated by the embodiments of the invention
- FIGS. 3A and 3B are diagrams illustrating two variants of I/O update topology for distributed RAID that keep data in sync;
- FIGS. 4A, 4B, 4C, and 4D are diagrams illustrating four ways to prime the cache at the parity nodes to improve RAID I/O operations in distributed RAID storage systems;
- FIG. 5 is a diagram illustrating a system for a cache arrangement for improving RAID I/O operations
- FIG. 6 is a diagram illustrating a data object stripe
- FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations
- FIG. 8 is a diagram illustrating an apparatus for cache management within a distributed data storage system.
- FIG. 9 is a flow diagram illustrating a method for cache management within a distributed data storage system.
- FIG. 1 illustrates a table showing the benefits. Specifically, an example of the savings achieved with the embodiments of the invention is shown when the underlying distributed RAID layout is RAID5 over 4 nodes. The savings come from exploiting the cache state at various nodes of a distributed RAID system. Pages for a given stripe could be in the read cache at one or more parity node(s), data nodes, and/or client nodes. Embodiments herein can deliver such savings when the working set exceeds the total cache size of a single client node. Brick systems may have more (aggregate) cache space fronting the drives than comparable RAID controllers. Phrased another way, for the same system cost, more aggregate cache can be included in a brick system than in a monolithic system.
- to be used effectively, a dispersed cache requires a cache coherence scheme, which comprises two parts.
- the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
- each brick stores a stripe of data for which it is the target node (TN).
- TN target node
- PN parity node
- Client nodes (CN) are also provided. From the perspective of any dirty data page, the multiple nodes in the system are categorized as described below. CN is the client node that initiates the flush of this dirty page; and, TN is the target node to which the dirty data page should be written.
- {PN} is the parity node that hosts the parity page that depends on the dirty data page. There can be multiple parities depending on the layout, which is indicated by the curly brackets.
- {DN} is the dependent node that hosts the dependent data (dD) contributing to the calculation of the same parity as the dirty page.
- the XOR calculations for new parity can be performed at any one or combination of these nodes.
- each of the above nodes can have one of two plans: parity compute (PC) or parity increment (PI).
- PC parity compute
- PI parity increment
- FIG. 2 illustrates a table, which enumerates all possible I/O plans possible amongst these nodes for a given dirty page.
- the overarching notation is that a write changes D old to D new which requires updating the relevant parity page from P old to P new .
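The two plan types can be illustrated with a minimal sketch, assuming a single-parity XOR code (as in RAID5) and the usual reading of the two terms: parity compute rebuilds the parity from the new data page plus all dependent data pages, while parity increment applies the difference between D old and D new to P old. The function names and sample pages below are illustrative and do not come from the patent.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


def parity_compute(d_new: bytes, dependent: list[bytes]) -> bytes:
    """PC: recompute the parity from the new page and all dependent data pages."""
    parity = d_new
    for page in dependent:
        parity = xor(parity, page)
    return parity


def parity_increment(p_old: bytes, d_old: bytes, d_new: bytes) -> bytes:
    """PI: apply the change (delta = D_old XOR D_new) to the old parity."""
    delta = xor(d_old, d_new)
    return xor(p_old, delta)


# One stripe with three data pages; the first page is the dirty one.
d_old = b"\x01\x02"
dependent = [b"\x10\x20", b"\x0f\xf0"]
p_old = parity_compute(d_old, dependent)

d_new = b"\xaa\x55"
p_new_pc = parity_compute(d_new, dependent)
p_new_pi = parity_increment(p_old, d_old, d_new)
```

Both plans yield the same P new. They differ in which pages must be available: PC needs the dependent data pages, while PI needs only D old and P old, which is why the cache contents at each node affect which plan is cheaper.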
- a method is presented to derive the best local I/O plan and the communication protocol to allow different nodes to reach an agreement on the final I/O plan.
- Data pages can be cached only at parity nodes that depend on it.
- the invalidation can be piggybacked on the update operation that writes the new parity page (to PN).
- PN is guaranteed to get an update operation due to how redundancy is maintained i.e., erasure coding.
- data pages are cached at the parity node(s)
- the new data is always in the parity nodes. This can be checked during read to that data by any CN.
- the unchanged data, which is not in the parity nodes, are not invalidated.
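The piggybacked invalidation described above might look like the following sketch. It assumes that a parity node keeps its cached data pages in a dictionary and that the write request carries a set of changed page indices as its annotation. The `ParityWrite` and `ParityNode` types are hypothetical illustrations, not structures defined by the patent.

```python
from dataclasses import dataclass, field


@dataclass
class ParityWrite:
    """Hypothetical parity update request, annotated with the changed pages."""
    stripe: int
    new_parity: bytes
    changed: set[int]  # annotation: indices of data pages that changed


@dataclass
class ParityNode:
    """Hypothetical parity node holding parity pages and cached data pages."""
    parity: dict[int, bytes] = field(default_factory=dict)
    # (stripe, page index) -> cached data page
    cache: dict[tuple[int, int], bytes] = field(default_factory=dict)

    def apply(self, req: ParityWrite) -> None:
        # The parity update is required by the erasure code in any case.
        self.parity[req.stripe] = req.new_parity
        # Invalidation rides along: drop only the pages named in the annotation.
        for idx in req.changed:
            self.cache.pop((req.stripe, idx), None)


pn = ParityNode()
pn.cache[(0, 1)] = b"old page 1"
pn.cache[(0, 2)] = b"old page 2"
pn.apply(ParityWrite(stripe=0, new_parity=b"P-new", changed={1}))
```

After the call, the stale copy of page 1 is gone while page 2, which did not change, stays cached. No separate coherence message is sent. A variant could carry the new data page in the request and replace the cached entry instead of dropping it.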
- FIGS. 3A and 3B illustrate two such I/O plans (each employing the parity increment with Δ). Specifically, in FIG. 3A, CN writes new data to the target node, computes Δ, and ships it to the affected parity nodes to be applied. In FIG. 3B, CN writes new data to the parity node with old data. This parity node computes Δ and ships it to the target and other parity nodes to be applied.
- In FIGS. 4A, 4B, 4C, and 4D, four alternatives are provided to describe how the parity node(s) gather data pages from client or target nodes.
- the target node (in response to a client read) ships the data to one or more parity nodes.
- the client demotes a clean page it would have discarded to one or more parity nodes.
- the target demotes the page to the parity node.
- the parity node asynchronously reads pages from the target node.
- TN does not read cached data pages except during system transience (writing, buffering). This makes TN's cache exclusive.
- the first rule is not applicable to parity pages, which can be cached during transience.
- embodiments herein can use one round of messages to gather all candidate I/O plan costs from all t PN's and compare with the local plans available to CN and pick the best plan.
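The single round of plan gathering can be sketched as follows, assuming each parity node replies with its own candidate plans and that a plan's cost is reduced to one number (here, a disk I/O count). The `Plan` type, its fields, and the sample costs are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    node: str      # node that would drive the I/O
    kind: str      # "PC" (parity compute) or "PI" (parity increment)
    disk_ios: int  # hypothetical cost metric


def pick_best_plan(local_plans: list[Plan],
                   replies: dict[str, list[Plan]]) -> Plan:
    """Merge the client's local plans with the plans returned by every
    parity node in one message round, then take the cheapest."""
    candidates = list(local_plans)
    for plans in replies.values():
        candidates.extend(plans)
    return min(candidates, key=lambda p: p.disk_ios)


local = [Plan("CN", "PC", 3), Plan("CN", "PI", 2)]
replies = {
    "PN1": [Plan("PN1", "PI", 1)],  # PN1 already caches the old data page
    "PN2": [Plan("PN2", "PC", 4)],
}
best = pick_best_plan(local, replies)
```

Here the plan driven from PN1 wins because its cache removes a disk read. A real cost function would likely weigh several metrics rather than a single count.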
- degraded/critical mode reconstructed pages are held at the parity node longer (until rebuild completes or cache pressure builds sufficiently) for possible reuse by another client.
- read/write performance can be improved.
- the embodiments herein can be applied to distributed (clustered) storage systems.
- the embodiments of the invention have the ability to provide read cache unification and to improve RAID I/O operations.
- the embodiments of the invention provide a distributed cache management scheme for a storage system that uses erasure coded distributed RAID and has partitioned cache (where the total sum can be fairly substantial). This speeds up RAID reads and writes by leveraging cached data, where possible. Moreover, this unifies the cache, which maximizes cache effectiveness. There is no duplication of cached data.
- the cache management scheme is lightweight; no (additional) messaging for cache coherence or a data directory is needed.
- the management scheme is also opportunistic; any steps can be skipped under a heavy load without affecting correctness.
- FIG. 5 is a diagram illustrating a system for such a cache arrangement scheme.
- the initiator for read or write operations to the dRAID volume can be at a client node 510 A or 510 B (direct access) or a storage node 520 A or 520 B (gateway). Meta-data 530 is available to the initiator via a network 540 .
- the storage nodes 520 A/ 520 B could have a write and read cache or a read cache only (cache 522 A/ 522 B).
- a dRAIDed stripe is spread across the storage nodes 520 A/ 520 B, wherein the system assumes uniformly spread storage.
- FIG. 6 is a diagram illustrating a data object stripe within five storage nodes (SN 1 , SN 2 , SN 3 , SN 4 , and SN 5 ).
- the data object stripe includes a first data block (D 1 ), a second data block (D 2 ), a third data block (D 3 ), a fourth data block (D 4 ), and a parity block (P).
- the role of a storage node for a data block can be a client node (CN), parity node (PN), or target node (TN). Each storage node can play multiple roles for different blocks.
- SN 3 is the target node for D 3
- SN 5 is the parity node
- any of the storage nodes can be a client node.
- Embodiments of the invention provide the following cache rules.
- Fifth, a client or storage node can locally decide to evict (clean) pages. This provides for loosely coupled caching.
- Consequences of the cache rules provide that data pages from multiple clients get “percolated” into caches in storage nodes, which is advantageous for shared workloads without clients even cooperating. This is irrelevant for totally random workloads, which are no worse than before. Moreover, caches at storage nodes are aligned in a “RAID-friendly” way. All data used to compute a parity block is localized. Further, due to the nature of erasure code updates, cache coherence is free. Parity node(s) have to be written to for write completion. Annotation helps identify which blocks have changed.
- FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations.
- FIG. 7A illustrates storage node 1 (SN 1 ), which includes data blocks 1 , 6 , and 11 .
- Storage node 2 (SN 2 ) includes data blocks 2 , 7 , and 12 ; and, storage node 3 (SN 3 ) has data blocks 3 and 8 , and parity block 3 (P 3 ).
- storage node 4 (SN 4 ) includes data blocks 4 and 9 , and parity block 2 (P 2 ); and, storage node 5 (SN 5 ) has data blocks 5 and 10 , and parity block 1 (P 1 ).
- data blocks are only cached in storage nodes having parity blocks (i.e., SN 3 , SN 4 , and SN 5 ).
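The layout just described can be written down directly to show which nodes are eligible to cache data blocks. This is a minimal sketch: the block-to-node mapping is transcribed from the text above, while the helper names are hypothetical. It does not model which data blocks each parity block covers, since the text does not state that.

```python
# Block placement as described for FIG. 7A.
LAYOUT = {
    "SN1": ["D1", "D6", "D11"],
    "SN2": ["D2", "D7", "D12"],
    "SN3": ["D3", "D8", "P3"],
    "SN4": ["D4", "D9", "P2"],
    "SN5": ["D5", "D10", "P1"],
}


def parity_nodes(layout: dict[str, list[str]]) -> list[str]:
    """Storage nodes that hold at least one parity block."""
    return sorted(
        node for node, blocks in layout.items()
        if any(block.startswith("P") for block in blocks)
    )


def may_cache_data(node: str, layout: dict[str, list[str]]) -> bool:
    """Data blocks are cached only on nodes that store parity."""
    return node in parity_nodes(layout)


eligible = parity_nodes(LAYOUT)
```

With this layout, `eligible` contains SN3, SN4, and SN5, so a lookup for a cached data block never has to consult SN1 or SN2.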
- Reads and writes include an extra messaging phase to query the cache state at parity node(s).
- The client costs the possible read/update plans using metrics such as disk I/Os and memory bandwidth. The client then chooses the best plan and drives the I/O.
- Read plan choices include finding the cheapest reconstruction plan in three steps: inverting the matrix; masking cached pages; and, cost planning. Possible locations include the client node and parity node(s).
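The effect of masking cached pages on a reconstruct read can be sketched for the simplest case, a single-parity XOR stripe, where the "inverted matrix" reduces to XOR-ing all surviving pages. The function below is an illustrative assumption, not the patent's planner: it rebuilds the missing page and counts how many pages had to come from disk rather than from the cache.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


def reconstruct(missing: str, stripe: list[str],
                cache: dict[str, bytes],
                disk: dict[str, bytes]) -> tuple[bytes, int]:
    """Rebuild `missing` from the other pages of the stripe.
    Returns the rebuilt page and the number of disk reads needed."""
    result = None
    disk_reads = 0
    for name in stripe:
        if name == missing:
            continue
        if name in cache:          # cached page: no disk I/O needed
            page = cache[name]
        else:
            page = disk[name]
            disk_reads += 1
        result = page if result is None else xor(result, page)
    return result, disk_reads


stripe = ["D1", "D2", "D3", "D4", "P"]
# D3 is unavailable (degraded mode); P is the XOR of D1..D4.
disk = {"D1": b"\x01", "D2": b"\x02", "D4": b"\x08", "P": b"\x0f"}
cache = {"D1": b"\x01", "D2": b"\x02"}  # pages cached at the parity node

page, reads_with_cache = reconstruct("D3", stripe, cache, disk)
_, reads_without_cache = reconstruct("D3", stripe, {}, disk)
```

The disk read count is one possible input to the cost planning step: evaluating it against the cache contents of the client node and of each parity node gives a per-location cost to compare.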
- the embodiments herein are applicable to a class of problems that requires coordination of a distributed cache resource and updates to a set of data blocks that require updates to some common (dependent) block(s).
- Such systems could include distributed databases and cluster file systems.
- the embodiments of the invention provide a distributed cache arrangement for a storage system that speeds up RAID operations where workload is conducive.
- the working set is larger than any single client cache but it fits in the collective cache.
- a shared data set exists between the clients but the data set is time shifted.
- the cache arrangement adjusts automatically to workloads from clients. If there is a shared workload, then there is a benefit; otherwise, the cache arrangement exploits collective cache space.
- an apparatus 800 for cache management within a distributed data storage system is illustrated. More specifically, a partitioner 810 is provided to partition a data object into a plurality of data blocks. An analysis engine 820 is operatively connected to the partitioner 810 , wherein the analysis engine 820 creates one or more parity data blocks from the data object.
- the data object stripe includes a first data block (D 1 ), a second data block (D 2 ), a third data block (D 3 ), a fourth data block (D 4 ), and a parity block (P).
- a controller 830 is operatively connected to the analysis engine 820 , wherein the controller 830 stores the data blocks and the parity data block within storage nodes. For example, as illustrated in FIGS. 7A and 7B , the data blocks 1 - 12 and the parity data blocks P 1 -P 3 are stored within the storage nodes SN 1 -SN 5 .
- the controller 830 also caches the data blocks within a partitioned cache, wherein the partitioned cache includes cache partitions.
- the cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object (e.g., volume, LUN, file system). More specifically, each cache partition is located within a storage node.
- the controller 830 only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (a field within a storage node where parity data block(s) can be stored).
- the controller 830 avoids caching data blocks within storage nodes lacking the parity storage field. For example, as illustrated in FIGS.
- data blocks 1 - 12 are only cached within the storage nodes having stored parity data blocks.
- parity data blocks P 1 , P 2 , and P 3 are stored in storage nodes SN 5 , SN 4 , and SN 3 , respectively.
- the controller 830 caches the data blocks in any of the parity storage nodes. Moreover, the controller 830 annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller 830 simultaneously performs an invalidation operation and a write operation. Additionally, a reader 840 is operatively connected to the controller 830 , wherein the reader 840 reads the data blocks and the parity data block from the storage nodes.
- Referring to FIG. 9, a method 900 for cache management within a distributed data storage system is illustrated. More specifically, the method 900 begins in item 910 by partitioning a data object into data blocks. Next, in item 920, one or more parity data blocks are created from the data object. As described above, FIG. 6 illustrates a data object stripe having a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Following this, in item 930, the data blocks and the parity data block are stored within storage nodes. As described above, the role of a storage node for a data block can be a client node (CN), a parity node (PN), or a target node (TN). Each storage node can play multiple roles for different blocks.
- CN client node
- PN parity node
- TN target node
- the data blocks are also cached within a partitioned cache, wherein the partitioned cache includes cache partitions.
- the cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object.
- the storage nodes could have a write and read cache or a read cache only.
- the caching within the partitioned cache only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (item 942 ).
- caching the data blocks within storage nodes lacking the parity storage field is avoided (item 944 ). Accordingly, as described above, a separate cache directory is not required because the cached data blocks are only in the parity storage nodes.
- FIGS. 4A , 4 B, 4 C, and 4 D illustrate four alternatives to describe how the parity node(s) gather data pages from client or target nodes.
- the target node (in response to a client read) ships the data to one or more parity nodes.
- the client demotes a clean page it would have discarded to one or more parity nodes.
- the target demotes the page to the parity node.
- the parity node asynchronously reads pages from the target node.
- the method 900 also includes, in item 950 , updating the data object. This includes annotating a write request with information regarding changed data blocks within the data object (item 952 ) and sending the write request only to the parity storage nodes (item 954 ). The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation (item 956 ). Thus, as described above, cache invalidation is piggybacked onto regular operations. Due to the nature of erasure code updating, cache coherence is free because parity node(s) have to be written to for a write completion. Annotation helps identify which blocks have changed. Subsequently, in item 960 , the data blocks and parity data block are read from the storage nodes. The method 900 can check the cache at the parity storage nodes before reading the data block from the target storage nodes.
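The read step at the end of the method, where the cache at the parity storage nodes is checked before the target storage node is read, might be sketched as below. The dictionaries standing in for the parity node caches and the target node's store, and the `read_block` helper itself, are hypothetical.

```python
def read_block(block_id: str,
               parity_caches: dict[str, dict[str, bytes]],
               target_store: dict[str, bytes]) -> tuple[bytes, str]:
    """Return (data, source). Parity node caches are consulted first;
    the target node is read only on a miss."""
    for node, cache in parity_caches.items():
        if block_id in cache:
            return cache[block_id], node
    return target_store[block_id], "target"


parity_caches = {"SN5": {"D3": b"cached copy"}}
target_store = {"D3": b"on-disk copy", "D4": b"d4"}

hit = read_block("D3", parity_caches, target_store)
miss = read_block("D4", parity_caches, target_store)
```

Because data blocks are cached only at parity storage nodes, the set of places to check is known from the layout alone, which is why no separate cache directory is needed.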
Abstract
The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes. Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field.
Description
- 1. Field of the Invention
- The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations.
- 2. Description of the Related Art
- It is often necessary in a distributed storage system to read or write data redundantly that has been striped on more than one storage server (or target). Such a system configuration is referred to as a “network-RAID” (redundant array of independent disks) because the function of a RAID controller is performed by the network protocol of the distributed storage system by coordinating I/O (input/output) operations that are processed at multiple places concurrently in order to ensure correct system behavior, both atomically and serially. Distributed storage systems using a network-RAID protocol can process, or coordinate, a network-RAID-protocol I/O request (I/O request) locally at a client node or the request can be forwarded to a storage server or a coordination server for processing. For example, one client node may locally write data to a particular data location, while another client node may choose to forward a read or a write request for the same data location to a shared, or coordination, server.
- The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method for cache management within a distributed data storage system begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes.
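The first two steps, partitioning a data object and creating a parity block, can be illustrated with a short sketch. It assumes fixed-size blocks, zero padding of the final block, and a single XOR parity block. None of these choices is specified by the text, and the function names are illustrative.

```python
from functools import reduce


def partition(data: bytes, block_size: int) -> list[bytes]:
    """Split a data object into fixed-size blocks, zero-padding the last one."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1].ljust(block_size, b"\x00")
    return blocks


def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(blocks: list[bytes]) -> bytes:
    """Single parity block: the XOR of all given blocks (non-empty list)."""
    return reduce(xor_blocks, blocks)


blocks = partition(b"ABCDEFGHIJKLMNOP", 4)  # four data blocks
parity = make_parity(blocks)

# Any one block can be rebuilt from the remaining blocks plus the parity.
rebuilt = make_parity([blocks[0], blocks[1], blocks[3], parity])
```

In the example, `rebuilt` equals the third data block, which is the property that lets a parity node serve reconstruct reads from pages it already holds.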
- Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes comprise more than one parity storage node, the data blocks are cached in any of the parity storage nodes.
- The method further includes updating the data object. Specifically, a write request is annotated with information regarding changed data blocks within the data object; and, the write request is only sent to the parity storage nodes. The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation. Subsequently, the data blocks and parity data block are read from the storage nodes.
- An apparatus for cache management within a distributed data storage system is also provided. More specifically, the apparatus comprises a partitioner to partition a data object into a plurality of data blocks. An analysis engine is operatively connected to the partitioner, wherein the analysis engine creates one or more parity data blocks from the data object. Moreover, a controller is operatively connected to the analysis engine, wherein the controller stores the data blocks and the parity data blocks within storage nodes.
- The controller also caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. When caching within the partitioned cache, the controller only caches data blocks in parity storage nodes, wherein the parity storage nodes have a parity storage field. Thus, when caching, the controller avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes have more than one parity storage node, the controller caches the data blocks in any of the parity storage nodes.
- Additionally, the controller annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller simultaneously performs an invalidation operation and a write operation. The apparatus further includes a reader operatively connected to the controller, wherein the reader reads the data blocks and the parity data blocks from the storage nodes.
- Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
- The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
- FIG. 1 is a table illustrating benefits of caching while executing write and reconstruct read operations;
- FIG. 2 is a table illustrating an enumeration of the type of plans generated by the embodiments of the invention;
- FIGS. 3A and 3B are diagrams illustrating two variants of I/O update topology for distributed RAID that keep data in sync;
- FIGS. 4A, 4B, 4C, and 4D are diagrams illustrating four ways to prime the cache at the parity nodes to improve RAID I/O operations in distributed RAID storage systems;
- FIG. 5 is a diagram illustrating a system for a cache arrangement for improving RAID I/O operations;
- FIG. 6 is a diagram illustrating a data object stripe;
- FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations;
- FIG. 8 is a diagram illustrating an apparatus for cache management within a distributed data storage system; and
- FIG. 9 is a flow diagram illustrating a method for cache management within a distributed data storage system.
- The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
- The embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
- Erasure coded data benefits the most from caching while executing write and reconstruct read operations. FIG. 1 illustrates a table showing the benefits. Specifically, an example of savings with the embodiments of the invention is shown when the underlying distributed RAID layout is RAID5 over 4 nodes. The savings come from exploiting the cache state at various nodes of a distributed RAID system. Pages for a given stripe could be in the read cache at one or more parity node(s), data nodes and/or client nodes. Embodiments herein can deliver such savings when the working set exceeds the total cache size of a single client node. Brick systems may have more (aggregate) cache space fronting the drives than comparable RAID controllers. Phrased another way, for the same cost of the system, more aggregate cache can be included in a brick system than in a monolithic system.
- To be used effectively, a dispersed cache requires some cache coherence scheme, which comprises two parts. First, a scalable cache directory needs to map pages to nodes. Second, an invalidation (or coherence) protocol is needed to ensure correctness. With erasure codes, read/write performance of data in degraded/critical mode is significantly slower than under fault-free mode. If at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
- Considering data laid out in some erasure code layout (e.g., RAID5), for each data stripe, a subset of the bricks takes on different roles. Each brick stores a stripe of data for which it is the target node (TN). For each stripe, there will be at least t pages to store parity for a t-fault tolerant code. Each parity page is stored on a different parity node (PN). Client nodes (CN) are also provided. From the perspective of any dirty data page, the multiple nodes in the system are categorized as described below. CN is the client node that initiates the flush of this dirty page, and TN is the target node to which the dirty data page should be written. {PN} is the parity node that hosts the parity page that depends on the dirty data page. There can be multiple parities depending on the layout, which is indicated by the curly braces. {DN} is the dependent node that hosts the dependent data (dD) contributing to the calculation of the same parity as the dirty page.
- The XOR calculations for new parity can be performed at any one or combination of these nodes. Locally, each of the above nodes can have one of two plans: parity compute (PC) or parity increment (PI). Additionally, two issues need to be addressed. The first issue is how each kind of node derives its own best I/O plan. The second issue is how different nodes interact with each other to reach agreement on the final I/O plan.
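- As a minimal sketch of the two local plans, the following Python fragment assumes a single XOR parity (RAID5-style) over three data pages. The function names `xor_blocks`, `parity_compute`, and `parity_increment` and the 2-byte page contents are hypothetical and chosen only for illustration; the point is that both plans arrive at the same new parity while reading different inputs.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Bytewise XOR of one or more equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_compute(new_data: bytes, *dependent_data: bytes) -> bytes:
    """PC plan: rebuild parity from the new page plus all dependent data (dD)."""
    return xor_blocks(new_data, *dependent_data)


def parity_increment(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """PI plan: fold the change (old XOR new) into the old parity."""
    return xor_blocks(old_parity, old_data, new_data)


# Hypothetical 2-byte pages: one dirty page and two dependent pages (assumed values).
d_old, d_new = b"\x0f\xf0", b"\xaa\x55"
dependents = (b"\x01\x02", b"\x10\x20")
p_old = xor_blocks(d_old, *dependents)

p_new_pc = parity_compute(d_new, *dependents)      # needs dD, not the old page
p_new_pi = parity_increment(p_old, d_old, d_new)   # needs old page and old parity
```

Which plan is cheaper at a given node then depends on which of these inputs that node already holds in its cache.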
- FIG. 2 illustrates a table that enumerates all I/O plans possible amongst these nodes for a given dirty page. The overarching notation is that a write changes Dold to Dnew, which requires updating the relevant parity page from Pold to Pnew. In some schemes, a partial parity is used: Δ = Dnew XOR Dold. Next, a method is presented to derive the best local I/O plan and the communication protocol to allow different nodes to reach an agreement on the final I/O plan.
- Data pages can be cached only at parity nodes that depend on them. When an update to the data page occurs (at CN), the invalidation can be piggybacked on the update operation to the new parity page (to PN). PN is guaranteed to get an update operation because of how redundancy is maintained, i.e., erasure coding. In other words, if data pages are cached at the parity node(s), the new data is always in the parity nodes. This can be checked during a read of that data by any CN. The unchanged data, which is not in the parity nodes, is not invalidated.
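- The piggybacked invalidation described above can be sketched as follows. This is an illustration under stated assumptions, not the patented protocol: the message type `ParityUpdate`, the class `ParityNodeCache`, and the block identifiers are hypothetical names. The parity update that a parity node must receive anyway carries an annotation listing the changed blocks, and the node drops only those cached pages.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ParityUpdate:
    """Hypothetical parity-update message annotated with the changed blocks."""
    stripe: int
    new_parity: bytes
    changed_blocks: frozenset


@dataclass
class ParityNodeCache:
    """Hypothetical parity node state: parity pages plus cached data pages."""
    parity: dict = field(default_factory=dict)   # stripe -> parity page
    pages: dict = field(default_factory=dict)    # (stripe, block id) -> data page

    def handle(self, msg: ParityUpdate) -> None:
        # The parity write is required for redundancy in any case.
        self.parity[msg.stripe] = msg.new_parity
        # Invalidation rides along: drop only the pages named in the annotation.
        for block in msg.changed_blocks:
            self.pages.pop((msg.stripe, block), None)


node = ParityNodeCache()
node.pages[(0, "D1")] = b"old-1"
node.pages[(0, "D2")] = b"old-2"
node.handle(ParityUpdate(stripe=0, new_parity=b"p-new", changed_blocks=frozenset({"D1"})))
```

After the update, the stale copy of D1 is gone while the unchanged page D2 stays cached, and no separate invalidation message was sent.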
- Beyond just invalidation, by employing certain client write I/O plans, this cache at the parity node(s) can be kept in sync without any extra messaging. FIGS. 3A and 3B illustrate two such I/O plans (each employing the parity increment with Δ). Specifically, in FIG. 3A, CN writes new data to the target node, computes Δ, and ships it to the affected parity nodes to be applied. In FIG. 3B, CN writes new data to the parity node with old data. This parity node computes Δ and ships it to the target and other parity nodes to be applied.
- As illustrated in FIGS. 4A, 4B, 4C, and 4D, four alternatives are provided to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.
- If both TN and one or more PNs cache a data block, the effective cache size is reduced. This leads to greater cache pressure on (global pool) cache pages. To avoid this, three rules for caching data are provided. First, TN does not read cached data pages except during system transience (writing, buffering). This makes TN's cache exclusive. Second, when the erasure code allows for multiple PNs, then any one can be chosen (e.g., randomly). Third, the first rule is not applicable to parity pages, which can be cached during transience.
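- One way to read the FIG. 3A plan is sketched below. It assumes a single XOR parity and assumes that a parity node holding the old data page can apply Δ to that cached page as well as to the parity page; the text states only that the cache can be kept in sync without extra messaging. The class `ParityNode` and its method names are hypothetical.

```python
def xor(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally sized pages."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityNode:
    """Hypothetical parity node holding one parity page and cached data pages."""

    def __init__(self, parity: bytes):
        self.parity = parity
        self.cached_data = {}  # block id -> cached data page

    def apply_delta(self, block_id: str, delta: bytes) -> None:
        # Pnew = Pold XOR delta
        self.parity = xor(self.parity, delta)
        # Assumption: a cached old page is brought up to date with the same delta,
        # since Dnew = Dold XOR delta.
        if block_id in self.cached_data:
            self.cached_data[block_id] = xor(self.cached_data[block_id], delta)


# Hypothetical 1-byte pages: the dirty page D1 and one other page in the stripe.
d_old, d_new, other = b"\x0f", b"\xa5", b"\x33"
pn = ParityNode(parity=xor(d_old, other))
pn.cached_data["D1"] = d_old

delta = xor(d_new, d_old)   # computed at the client node (CN)
pn.apply_delta("D1", delta) # shipped once, updates parity and cached copy
```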
- With this caching scheme in place, embodiments herein can use one round of messages to gather all candidate I/O plan costs from all t PN's and compare with the local plans available to CN and pick the best plan. In degraded/critical mode, reconstructed pages are held at the parity node longer (until rebuild completes or cache pressure builds sufficiently) for possible reuse by another client. As discussed above, if at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
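- The single round of plan gathering can be sketched as a simple minimum over candidate costs. The `Plan` record, the use of disk I/O count as the only cost metric, and all cost values below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical candidate I/O plan reported by a node."""
    node: str      # node that would execute the plan, e.g. "CN" or "PN1"
    kind: str      # "PC" (parity compute) or "PI" (parity increment)
    disk_ios: int  # assumed cost metric


def pick_best_plan(local_plans, remote_plans):
    """Compare CN's local plans with the plans returned by the parity nodes."""
    return min([*local_plans, *remote_plans], key=lambda plan: plan.disk_ios)


# Assumed costs: one reply per parity node, gathered in a single message round.
local = [Plan("CN", "PC", 3), Plan("CN", "PI", 2)]
remote = [Plan("PN1", "PI", 1), Plan("PN2", "PC", 2)]
best = pick_best_plan(local, remote)
```

Here the plan at PN1 wins because its cache state lets it finish with the fewest disk I/Os.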
- Thus, while cache invalidation is piggybacked on write operations, priming caches at the parity nodes takes some extra work. Moreover, read operations will need two phases, including a first phase to exchange plans. Write operations may require three phases, including a first phase to exchange plans (but here there is an opportunity to piggyback). Further, the impact that location-shifting the cache will have on local I/O optimizations (such as prefetching) is unknown.
- The embodiments herein can be applied to distributed (clustered) storage systems. For such systems, the embodiments of the invention have the ability to provide read cache unification and to improve RAID I/O operations.
- Furthermore, the embodiments of the invention provide a distributed cache management scheme for a storage system that uses erasure coded distributed RAID and has partitioned cache (where the total sum can be fairly substantial). This speeds up RAID reads and writes by leveraging cached data, where possible. Moreover, this unifies the cache, which maximizes cache effectiveness. There is no duplication of cached data. The cache management scheme is lightweight; no (additional) messaging for cache coherence or a data directory is needed. The management scheme is also opportunistic; any steps can be skipped under a heavy load without affecting correctness.
- FIG. 5 is a diagram illustrating a system for such a cache arrangement scheme. The initiator for read or write operations to the dRAID volume can be at a client node or a storage node. Data 530 is available to the initiator via a network 540. The storage nodes 520A/520B could have a write and read cache or a read cache only (cache 522A/522B). A dRAIDed stripe is spread across the storage nodes 520A/520B, wherein the system assumes uniformly spread storage.
- FIG. 6 is a diagram illustrating a data object stripe within five storage nodes (SN1, SN2, SN3, SN4, and SN5). The data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). The role of a storage node for a data block can be a client node (CN), parity node (PN), or target node (TN). Each storage node can play multiple roles for different blocks. Thus, SN3 is the target node for D3; SN5 is the parity node; and any of the storage nodes can be a client node.
- Embodiments of the invention provide the following cache rules. First, each write request from a client is annotated with information about changed blocks within a stripe. Thus, cache invalidation is piggybacked onto regular operations. Second, data blocks can be cached only at parity node(s). Multiple candidates exist for higher distance codes, and no separate cache directory is needed. Third, data blocks are not cached at the target node, except by the operating system as staging during read/write operations. The “home” location of data is shifted from a target node to a parity node. Fourth, clients “demote” victim data pages to parity node(s). In case of a higher distance code, a lexicographical parity node is chosen. Such a parity node primes caches in storage nodes opportunistically from clients. Fifth, a client or storage node can locally decide to evict (clean) pages. This provides for loosely coupled caching.
- As a consequence of the cache rules, data pages from multiple clients get “percolated” into caches in storage nodes, which is advantageous for shared workloads even without the clients cooperating. This is irrelevant for totally random workloads, which are no worse than before. Moreover, caches at storage nodes are aligned in a “RAID-friendly” way: all data used to compute a parity block is localized. Further, due to the nature of erasure code updates, cache coherence is free, because parity node(s) have to be written to for write completion. Annotation helps identify which blocks have changed.
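- Because cached data lives only at parity nodes, the cache "home" of a block can be computed from the layout rather than looked up in a directory. The sketch below assumes a rotating parity placement in which the first stripe's parity sits on the last node and moves one node to the left per stripe, and it interprets the "lexicographical" choice as the lexicographically smallest node name; both are assumptions, as are the function names.

```python
def parity_nodes(stripe: int, nodes: list, t: int = 1) -> list:
    """Return the t parity nodes of a stripe under an assumed rotating layout."""
    n = len(nodes)
    first = (n - 1 - stripe) % n          # stripe 0 -> last node, then rotate left
    return [nodes[(first - k) % n] for k in range(t)]


def cache_home(stripe: int, nodes: list, t: int = 1) -> str:
    """Pick one parity node as the cache home without consulting a directory."""
    # Assumption: 'lexicographical' means the smallest node name among candidates.
    return min(parity_nodes(stripe, nodes, t))


NODES = ["SN1", "SN2", "SN3", "SN4", "SN5"]
homes = [cache_home(stripe, NODES) for stripe in range(3)]
```

Any client can evaluate `cache_home` locally, so every demotion and every cache probe for a stripe goes to the same node without extra coordination.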
- FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations. FIG. 7A illustrates storage node 1 (SN1), which includes data blocks 1, 6, and 11. Storage node 2 (SN2) includes data blocks 2, 7, and 12; storage nodes SN3, SN4, and SN5 hold the remaining data blocks along with the parity blocks. In FIG. 7B, data blocks are only cached in storage nodes having parity blocks (i.e., SN3, SN4, and SN5).
- Reads and writes include an extra messaging phase to query the cache state at parity node(s). The client costs the various possible read/update plans using metrics such as disk I/Os and memory bandwidth. The client then chooses the best plan and drives the I/O.
- Read plan choices include finding the cheapest reconstruction plan in three steps: inverting the matrix; masking cached pages; and, cost planning. Possible locations include the client node and parity node(s).
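- For the single-parity case, the masking and cost-planning steps can be sketched as set arithmetic. The matrix inversion step is trivial for XOR parity and is omitted here; the cost metric (one disk read per uncached page) and all names are assumptions made for illustration.

```python
def reconstruct_read_cost(stripe_pages, lost_page, cached_pages) -> int:
    """Disk reads needed to rebuild lost_page, after masking out cached pages."""
    needed = set(stripe_pages) - {lost_page}   # XOR parity: every surviving page
    return len(needed - set(cached_pages))     # cached pages cost no disk read


def cheapest_location(stripe_pages, lost_page, cache_by_node) -> str:
    """Choose the node (client or parity node) with the lowest reconstruction cost."""
    return min(
        cache_by_node,
        key=lambda node: reconstruct_read_cost(stripe_pages, lost_page, cache_by_node[node]),
    )


# Assumed cache contents at the two candidate locations.
stripe = ["D1", "D2", "D3", "D4", "P"]
caches = {"CN": {"D1"}, "PN": {"D1", "D2", "D4"}}
cost_cn = reconstruct_read_cost(stripe, "D3", caches["CN"])
cost_pn = reconstruct_read_cost(stripe, "D3", caches["PN"])
where = cheapest_location(stripe, "D3", caches)
```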
- Beyond distributed RAID, the embodiments herein are applicable to a class of problems that requires coordination of a distributed cache resource and updates to a set of data blocks that require updates to some common (dependent) block(s). Such systems could include distributed databases and cluster file systems.
- Thus, the embodiments of the invention provide a distributed cache arrangement for a storage system that speeds up RAID operations where workload is conducive. The working set is larger than any single client cache but it fits in the collective cache. A shared data set exists between the clients but the data set is time shifted. Moreover, the cache arrangement adjusts automatically to workloads from clients. If there is a shared workload, then there is a benefit; otherwise, the cache arrangement exploits collective cache space.
- Referring to FIG. 8, an apparatus 800 for cache management within a distributed data storage system is illustrated. More specifically, a partitioner 810 is provided to partition a data object into a plurality of data blocks. An analysis engine 820 is operatively connected to the partitioner 810, wherein the analysis engine 820 creates one or more parity data blocks from the data object. For example, as illustrated in FIG. 6, the data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Furthermore, a controller 830 is operatively connected to the analysis engine 820, wherein the controller 830 stores the data blocks and the parity data block within storage nodes. For example, as illustrated in FIGS. 7A and 7B, the data blocks 1-12 and the parity data blocks P1-P3 are stored within the storage nodes SN1-SN5.
- The controller 830 also caches the data blocks within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object (e.g., volume, LUN, file system). More specifically, each cache partition is located within a storage node. When caching within the partitioned cache, the controller 830 only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (a field within a storage node where parity data block(s) can be stored). Thus, the controller 830 avoids caching data blocks within storage nodes lacking the parity storage field. For example, as illustrated in FIGS. 7A and 7B, data blocks 1-12 are only cached within the storage nodes having stored parity data blocks. In this example, parity data blocks P1, P2, and P3 are stored in storage nodes SN5, SN4, and SN3, respectively.
- When caching within the partitioned cache, and when the storage nodes comprise more than one parity storage node, the controller 830 caches the data blocks in any of the parity storage nodes. Moreover, the controller 830 annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller 830 simultaneously performs an invalidation operation and a write operation. Additionally, a reader 840 is operatively connected to the controller 830, wherein the reader 840 reads the data blocks and the parity data block from the storage nodes.
- Referring to FIG. 9, a method 900 for cache management within a distributed data storage system is illustrated. More specifically, the method 900 begins in item 910 by partitioning a data object into data blocks. Next, in item 920, one or more parity data blocks are created from the data object. As described above, FIG. 6 illustrates a data object stripe having a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Following this, in item 930, the data blocks and the parity data block are stored within storage nodes. As described above, the role of a storage node for a data block can be a client node (CN), a parity node (PN), or a target node (TN). Each storage node can play multiple roles for different blocks.
- In item 940, the data blocks are also cached within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. As described above, the storage nodes could have a write and read cache or a read cache only. Moreover, the caching within the partitioned cache only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (item 942). Thus, caching the data blocks within storage nodes lacking the parity storage field is avoided (item 944). Accordingly, as described above, a separate cache directory is not required because the cached data blocks are only in the parity storage nodes.
- When caching the data blocks within the partitioned cache, and when the storage nodes have more than one parity storage node, the data blocks are cached in any of the parity storage nodes (item 946). As described above, FIGS. 4A, 4B, 4C, and 4D illustrate four alternatives to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.
- The method 900 also includes, in item 950, updating the data object. This includes annotating a write request with information regarding changed data blocks within the data object (item 952) and sending the write request only to the parity storage nodes (item 954). The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation (item 956). Thus, as described above, cache invalidation is piggybacked onto regular operations. Due to the nature of erasure code updating, cache coherence is free because parity node(s) have to be written to for a write completion. Annotation helps identify which blocks have changed. Subsequently, in item 960, the data blocks and parity data block are read from the storage nodes. The method 900 can check the cache at the parity storage nodes before reading the data block from the target storage nodes.
- Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
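- The read step of item 960, in which the parity storage node caches are checked before the target storage node is read, can be sketched as follows. The `StorageNode` class, the in-memory dictionaries standing in for disk and cache, and the block contents are hypothetical; the node roles follow the FIG. 6 example (SN3 as target for D3, SN5 as parity node).

```python
class StorageNode:
    """Hypothetical storage node with a disk, a cache partition, and a read counter."""

    def __init__(self, name: str):
        self.name = name
        self.disk = {}       # block id -> data stored on this node
        self.cache = {}      # block id -> cached data (used at parity nodes only)
        self.disk_reads = 0

    def read_from_disk(self, block_id: str) -> bytes:
        self.disk_reads += 1
        return self.disk[block_id]


def read_block(block_id: str, target: StorageNode, parity_nodes: list) -> bytes:
    """Probe the parity node caches first; fall back to the target node's disk."""
    for node in parity_nodes:
        if block_id in node.cache:
            return node.cache[block_id]
    return target.read_from_disk(block_id)


sn3, sn5 = StorageNode("SN3"), StorageNode("SN5")
sn3.disk["D3"] = b"three"
sn5.cache["D3"] = b"three"              # page previously demoted to the parity node

hit = read_block("D3", sn3, [sn5])      # served from SN5's cache
reads_after_hit = sn3.disk_reads

sn5.cache.clear()                       # local eviction at the parity node
miss = read_block("D3", sn3, [sn5])     # falls back to the target node's disk
```

Eviction at the parity node affects only performance, since the target node still holds the authoritative copy.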
- The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments of the invention have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.
Claims (3)
1-6. (canceled)
7. A method for cache management within a distributed data storage system, said method comprising:
partitioning a data object into a plurality of data blocks;
creating at least one parity data block from said data object;
storing said data blocks and said parity data block within storage nodes;
caching said data blocks within a partitioned cache, wherein said partitioned cache comprises a plurality of cache partitions, wherein said cache partitions are located within said storage nodes,
wherein said caching within said partitioned cache only caches said data blocks in parity storage nodes, wherein said parity storage nodes comprise a parity storage field;
updating said data object, said updating comprising
annotating a write request with information regarding changed data blocks within said data object, and
sending said write request only to said parity storage nodes; and
reading said data blocks and said parity data block from said storage nodes;
wherein said caching within said partitioned cache comprises avoiding caching said data blocks within storage nodes lacking said parity storage field,
wherein said sending of said write request only to said parity storage nodes comprises simultaneously performing an invalidation operation and a write operation, and
wherein said caching of said data blocks within said partitioned cache comprises, when said storage nodes comprise more than one of said parity storage nodes, caching said data blocks in any of said parity storage nodes.
8-20. (canceled)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/741,826 US20080270704A1 (en) | 2007-04-30 | 2007-04-30 | Cache arrangement for improving raid i/o operations |
US12/059,067 US7979641B2 (en) | 2007-04-30 | 2008-03-31 | Cache arrangement for improving raid I/O operations |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US11/741,826 US20080270704A1 (en) | 2007-04-30 | 2007-04-30 | Cache arrangement for improving raid i/o operations |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US12/059,067 Continuation US7979641B2 (en) | 2007-04-30 | 2008-03-31 | Cache arrangement for improving raid I/O operations |
Publications (1)
Publication Number | Publication Date |
---|---|
US20080270704A1 (en) | 2008-10-30 |
Family
ID=39888394
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US11/741,826 Abandoned US20080270704A1 (en) | 2007-04-30 | 2007-04-30 | Cache arrangement for improving raid i/o operations |
US12/059,067 Expired - Fee Related US7979641B2 (en) | 2007-04-30 | 2008-03-31 | Cache arrangement for improving raid I/O operations |
Country Status (1)
Country | Link |
---|---|
US (2) | US20080270704A1 (en) |
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100077101A1 (en) * | 2008-09-12 | 2010-03-25 | Institute Of Acoustics, Chinese Academy Of Sciences | Storage network structure based on the Peterson graph and data read-write method thereof |
US20100262771A1 (en) * | 2009-04-13 | 2010-10-14 | Takehiko Kurashige | Data storage system and cache data-consistency assurance method |
WO2012100037A1 (en) * | 2011-01-20 | 2012-07-26 | Google Inc. | Storing data on storage nodes |
US8255739B1 (en) * | 2008-06-30 | 2012-08-28 | American Megatrends, Inc. | Achieving data consistency in a node failover with a degraded RAID array |
US8498967B1 (en) | 2007-01-30 | 2013-07-30 | American Megatrends, Inc. | Two-node high availability cluster storage solution using an intelligent initiator to avoid split brain syndrome |
US8533343B1 (en) | 2011-01-13 | 2013-09-10 | Google Inc. | Virtual network pairs |
US8595455B2 (en) | 2007-01-30 | 2013-11-26 | American Megatrends, Inc. | Maintaining data consistency in mirrored cluster storage systems using bitmap write-intent logging |
US8677449B1 (en) | 2012-03-19 | 2014-03-18 | Google Inc. | Exposing data to virtual machines |
US8800009B1 (en) | 2011-12-30 | 2014-08-05 | Google Inc. | Virtual machine service access |
US8812586B1 (en) | 2011-02-15 | 2014-08-19 | Google Inc. | Correlating status information generated in a computer network |
US8874888B1 (en) | 2011-01-13 | 2014-10-28 | Google Inc. | Managed boot in a cloud system |
US20150032725A1 (en) * | 2013-07-25 | 2015-01-29 | Facebook, Inc. | Systems and methods for efficient data ingestion and query processing |
US8958293B1 (en) | 2011-12-06 | 2015-02-17 | Google Inc. | Transparent load-balancing for cloud computing services |
US8966198B1 (en) | 2011-09-01 | 2015-02-24 | Google Inc. | Providing snapshots of virtual storage devices |
US8972478B1 (en) * | 2012-05-23 | 2015-03-03 | Netapp, Inc. | Using append only log format in data storage cluster with distributed zones for determining parity of reliability groups |
WO2015031378A1 (en) * | 2013-08-29 | 2015-03-05 | Oracle International Corporation | System and method for supporting partition level journaling for synchronizing data in a distributed data grid |
US8983860B1 (en) | 2012-01-30 | 2015-03-17 | Google Inc. | Advertising auction system |
US8996887B2 (en) | 2012-02-24 | 2015-03-31 | Google Inc. | Log structured volume encryption for virtual machines |
US9063818B1 (en) | 2011-03-16 | 2015-06-23 | Google Inc. | Automated software updating based on prior activity |
US9069616B2 (en) | 2011-09-23 | 2015-06-30 | Google Inc. | Bandwidth throttling of virtual disks |
US9069806B2 (en) | 2012-03-27 | 2015-06-30 | Google Inc. | Virtual block devices |
US9075979B1 (en) | 2011-08-11 | 2015-07-07 | Google Inc. | Authentication based on proximity to mobile device |
US9135037B1 (en) | 2011-01-13 | 2015-09-15 | Google Inc. | Virtual network protocol |
US9231933B1 (en) | 2011-03-16 | 2016-01-05 | Google Inc. | Providing application programs with access to secured resources |
US9237087B1 (en) | 2011-03-16 | 2016-01-12 | Google Inc. | Virtual machine name resolution |
WO2016057537A1 (en) * | 2014-10-09 | 2016-04-14 | Netapp, Inc. | Methods and systems for cache management in storage systems |
US9430255B1 (en) | 2013-03-15 | 2016-08-30 | Google Inc. | Updating virtual machine generated metadata to a distribution service for sharing and backup |
US9557978B2 (en) | 2011-03-16 | 2017-01-31 | Google Inc. | Selection of ranked configurations |
US9619662B1 (en) | 2011-01-13 | 2017-04-11 | Google Inc. | Virtual network pairs |
US9672052B1 (en) | 2012-02-16 | 2017-06-06 | Google Inc. | Secure inter-process communication |
US9817713B2 (en) | 2016-02-04 | 2017-11-14 | International Business Machines Corporation | Distributed cache system utilizing multiple erasure codes |
US10185639B1 (en) | 2015-05-08 | 2019-01-22 | American Megatrends, Inc. | Systems and methods for performing failover in storage system with dual storage controllers |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9880970B2 (en) * | 2007-10-03 | 2018-01-30 | William L. Bain | Method for implementing highly available data parallel operations on a computational grid |
US8037391B1 (en) * | 2009-05-22 | 2011-10-11 | Nvidia Corporation | Raid-6 computation system and method |
US8296515B1 (en) | 2009-05-22 | 2012-10-23 | Nvidia Corporation | RAID-6 computation system and method |
US20130290636A1 (en) * | 2012-04-30 | 2013-10-31 | Qiming Chen | Managing memory |
US10210167B1 (en) * | 2012-05-07 | 2019-02-19 | Amazon Technologies, Inc. | Multi-level page caching for distributed object store |
US9811530B1 (en) * | 2013-06-29 | 2017-11-07 | EMC IP Holding Company LLC | Cluster file system with metadata server for storage of parallel log structured file system metadata for a shared file |
US10116336B2 (en) * | 2014-06-13 | 2018-10-30 | Sandisk Technologies Llc | Error correcting code adjustment for a data storage device |
US10547681B2 (en) * | 2016-06-30 | 2020-01-28 | Purdue Research Foundation | Functional caching in erasure coded storage |
US10459807B2 (en) * | 2017-05-23 | 2019-10-29 | International Business Machines Corporation | Determining modified portions of a RAID storage array |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6049851A (en) * | 1994-02-14 | 2000-04-11 | Hewlett-Packard Company | Method and apparatus for checking cache coherency in a computer architecture |
US6148368A (en) * | 1997-07-31 | 2000-11-14 | Lsi Logic Corporation | Method for accelerating disk array write operations using segmented cache memory and data logging |
US20030033572A1 (en) * | 2001-08-09 | 2003-02-13 | Walton John K. | Memory system and method of using same |
US6523087B2 (en) * | 2001-03-06 | 2003-02-18 | Chaparral Network Storage, Inc. | Utilizing parity caching and parity logging while closing the RAID5 write hole |
US6594698B1 (en) * | 1998-09-25 | 2003-07-15 | Ncr Corporation | Protocol for dynamic binding of shared resources |
US6651140B1 (en) * | 2000-09-01 | 2003-11-18 | Sun Microsystems, Inc. | Caching pattern and method for caching in an object-oriented programming environment |
US6963959B2 (en) * | 2002-10-31 | 2005-11-08 | International Business Machines Corporation | Storage system and method for reorganizing data to improve prefetch effectiveness and reduce seek distance |
US6970987B1 (en) * | 2003-01-27 | 2005-11-29 | Hewlett-Packard Development Company, L.P. | Method for storing data in a geographically-diverse data-storing system providing cross-site redundancy |
US7457980B2 (en) * | 2004-08-13 | 2008-11-25 | Ken Qing Yang | Data replication method over a limited bandwidth network by mirroring parities |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6725392B1 (en) * | 1999-03-03 | 2004-04-20 | Adaptec, Inc. | Controller fault recovery system for a distributed file system |
US6718434B2 (en) | 2001-05-31 | 2004-04-06 | Hewlett-Packard Development Company, L.P. | Method and apparatus for assigning raid levels |
US6782450B2 (en) | 2001-12-06 | 2004-08-24 | Raidcore, Inc. | File mode RAID subsystem |
US7107403B2 (en) | 2003-09-30 | 2006-09-12 | International Business Machines Corporation | System and method for dynamically allocating cache space among different workload classes that can have different quality of service (QoS) requirements where the system and method may maintain a history of recently evicted pages for each class and may determine a future cache size for the class based on the history and the QoS requirements |
US7313749B2 (en) | 2004-06-29 | 2007-12-25 | Hewlett-Packard Development Company, L.P. | System and method for applying error correction code (ECC) erasure mode and clearing recorded information from a page deallocation table |
Cited By (53)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8595455B2 (en) | 2007-01-30 | 2013-11-26 | American Megatrends, Inc. | Maintaining data consistency in mirrored cluster storage systems using bitmap write-intent logging |
US8498967B1 (en) | 2007-01-30 | 2013-07-30 | American Megatrends, Inc. | Two-node high availability cluster storage solution using an intelligent initiator to avoid split brain syndrome |
US8255739B1 (en) * | 2008-06-30 | 2012-08-28 | American Megatrends, Inc. | Achieving data consistency in a node failover with a degraded RAID array |
US8667322B1 (en) | 2008-06-30 | 2014-03-04 | American Megatrends, Inc. | Achieving data consistency in a node failover with a degraded raid array |
US8219707B2 (en) * | 2008-09-12 | 2012-07-10 | Institute Of Acoustics, Chinese Academy Of Sciences | Storage network structure based on the Peterson graph and data read-write method thereof |
US20100077101A1 (en) * | 2008-09-12 | 2010-03-25 | Institute Of Acoustics, Chinese Academy Of Sciences | Storage network structure based on the Peterson graph and data read-write method thereof |
US20100262771A1 (en) * | 2009-04-13 | 2010-10-14 | Takehiko Kurashige | Data storage system and cache data-consistency assurance method |
US8108605B2 (en) * | 2009-04-13 | 2012-01-31 | Kabushiki Kaisha Toshiba | Data storage system and cache data—consistency assurance method |
US9740516B1 (en) | 2011-01-13 | 2017-08-22 | Google Inc. | Virtual network protocol |
US8533343B1 (en) | 2011-01-13 | 2013-09-10 | Google Inc. | Virtual network pairs |
US9135037B1 (en) | 2011-01-13 | 2015-09-15 | Google Inc. | Virtual network protocol |
US8874888B1 (en) | 2011-01-13 | 2014-10-28 | Google Inc. | Managed boot in a cloud system |
US9619662B1 (en) | 2011-01-13 | 2017-04-11 | Google Inc. | Virtual network pairs |
US8745329B2 (en) | 2011-01-20 | 2014-06-03 | Google Inc. | Storing data across a plurality of storage nodes |
US9250830B2 (en) | 2011-01-20 | 2016-02-02 | Google Inc. | Storing data across a plurality of storage nodes |
WO2012100037A1 (en) * | 2011-01-20 | 2012-07-26 | Google Inc. | Storing data on storage nodes |
US9794144B1 (en) | 2011-02-15 | 2017-10-17 | Google Inc. | Correlating status information generated in a computer network |
US8812586B1 (en) | 2011-02-15 | 2014-08-19 | Google Inc. | Correlating status information generated in a computer network |
US9231933B1 (en) | 2011-03-16 | 2016-01-05 | Google Inc. | Providing application programs with access to secured resources |
US9237087B1 (en) | 2011-03-16 | 2016-01-12 | Google Inc. | Virtual machine name resolution |
US11237810B2 (en) | 2011-03-16 | 2022-02-01 | Google Llc | Cloud-based deployment using templates |
US10241770B2 (en) | 2011-03-16 | 2019-03-26 | Google Llc | Cloud-based deployment using object-oriented classes |
US9063818B1 (en) | 2011-03-16 | 2015-06-23 | Google Inc. | Automated software updating based on prior activity |
US9557978B2 (en) | 2011-03-16 | 2017-01-31 | Google Inc. | Selection of ranked configurations |
US10212591B1 (en) | 2011-08-11 | 2019-02-19 | Google Llc | Authentication based on proximity to mobile device |
US9769662B1 (en) | 2011-08-11 | 2017-09-19 | Google Inc. | Authentication based on proximity to mobile device |
US9075979B1 (en) | 2011-08-11 | 2015-07-07 | Google Inc. | Authentication based on proximity to mobile device |
US8966198B1 (en) | 2011-09-01 | 2015-02-24 | Google Inc. | Providing snapshots of virtual storage devices |
US9251234B1 (en) | 2011-09-01 | 2016-02-02 | Google Inc. | Providing snapshots of virtual storage devices |
US9501233B2 (en) | 2011-09-01 | 2016-11-22 | Google Inc. | Providing snapshots of virtual storage devices |
US9069616B2 (en) | 2011-09-23 | 2015-06-30 | Google Inc. | Bandwidth throttling of virtual disks |
US8958293B1 (en) | 2011-12-06 | 2015-02-17 | Google Inc. | Transparent load-balancing for cloud computing services |
US8800009B1 (en) | 2011-12-30 | 2014-08-05 | Google Inc. | Virtual machine service access |
US8983860B1 (en) | 2012-01-30 | 2015-03-17 | Google Inc. | Advertising auction system |
US9672052B1 (en) | 2012-02-16 | 2017-06-06 | Google Inc. | Secure inter-process communication |
US8996887B2 (en) | 2012-02-24 | 2015-03-31 | Google Inc. | Log structured volume encryption for virtual machines |
US8677449B1 (en) | 2012-03-19 | 2014-03-18 | Google Inc. | Exposing data to virtual machines |
US9069806B2 (en) | 2012-03-27 | 2015-06-30 | Google Inc. | Virtual block devices |
US9720952B2 (en) | 2012-03-27 | 2017-08-01 | Google Inc. | Virtual block devices |
US8972478B1 (en) * | 2012-05-23 | 2015-03-03 | Netapp, Inc. | Using append only log format in data storage cluster with distributed zones for determining parity of reliability groups |
US9740403B2 (en) | 2012-05-23 | 2017-08-22 | Netapp, Inc. | Methods for managing storage in a data storage cluster with distributed zones based on parity values and devices thereof |
US9430255B1 (en) | 2013-03-15 | 2016-08-30 | Google Inc. | Updating virtual machine generated metadata to a distribution service for sharing and backup |
US20150032725A1 (en) * | 2013-07-25 | 2015-01-29 | Facebook, Inc. | Systems and methods for efficient data ingestion and query processing |
US9442967B2 (en) * | 2013-07-25 | 2016-09-13 | Facebook, Inc. | Systems and methods for efficient data ingestion and query processing |
US9652520B2 (en) | 2013-08-29 | 2017-05-16 | Oracle International Corporation | System and method for supporting parallel asynchronous synchronization between clusters in a distributed data grid |
US9659078B2 (en) | 2013-08-29 | 2017-05-23 | Oracle International Corporation | System and method for supporting failover during synchronization between clusters in a distributed data grid |
WO2015031378A1 (en) * | 2013-08-29 | 2015-03-05 | Oracle International Corporation | System and method for supporting partition level journaling for synchronizing data in a distributed data grid |
US9703853B2 (en) | 2013-08-29 | 2017-07-11 | Oracle International Corporation | System and method for supporting partition level journaling for synchronizing data in a distributed data grid |
US10423643B2 (en) | 2013-08-29 | 2019-09-24 | Oracle International Corporation | System and method for supporting resettable acknowledgements for synchronizing data in a distributed data grid |
US9753853B2 (en) | 2014-10-09 | 2017-09-05 | Netapp, Inc. | Methods and systems for cache management in storage systems |
WO2016057537A1 (en) * | 2014-10-09 | 2016-04-14 | Netapp, Inc. | Methods and systems for cache management in storage systems |
US10185639B1 (en) | 2015-05-08 | 2019-01-22 | American Megatrends, Inc. | Systems and methods for performing failover in storage system with dual storage controllers |
US9817713B2 (en) | 2016-02-04 | 2017-11-14 | International Business Machines Corporation | Distributed cache system utilizing multiple erasure codes |
Also Published As
Publication number | Publication date |
---|---|
US7979641B2 (en) | 2011-07-12 |
US20080270878A1 (en) | 2008-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US7979641B2 (en) | | Cache arrangement for improving raid I/O operations |
JP7077359B2 (en) | | Distributed storage system |
US10789020B2 (en) | | Recovering data within a unified storage element |
US11068389B2 (en) | | Data resiliency with heterogeneous storage |
US20230315346A1 (en) | | Utilizing Multiple Redundancy Schemes Within A Unified Storage Element |
US10365983B1 (en) | | Repairing raid systems at per-stripe granularity |
US6912669B2 (en) | | Method and apparatus for maintaining cache coherency in a storage system |
CN105960639B (en) | | Prioritization data reconstruct in distributed memory system |
CN110737541B (en) | | Method and system for distributing data in distributed storage system |
US7788244B2 (en) | | Method and system for copying a snapshot tree |
CN102937882B (en) | | To effective access with the memory device using bitmap |
CN102884502B (en) | | Managing write operations to an extent of tracks migrated between storage devices |
US10825477B2 (en) | | RAID storage system with logical data group priority |
US20150127975A1 (en) | | Distributed virtual array data storage system and method |
US10467527B1 (en) | | Method and apparatus for artificial intelligence acceleration |
CN1770114A (en) | | Copy operations in storage networks |
CN104395904A (en) | | Efficient data object storage and retrieval |
CN1679000A (en) | | Using file system information in raid data reconstruction and migration |
WO2011101482A1 (en) | | Read-other protocol for maintaining parity coherency in a write-back distributed redundancy data storage system |
CN1804810A (en) | | Method and system of redirection for storage access requests |
CN102841854A (en) | | Method and system for executing data reading based on dynamic hierarchical memory cache (hmc) awareness |
CN1770115A (en) | | Recovery operations in storage networks |
CN101147118A (en) | | Methods and apparatus for reconfiguring a storage system |
US20170277450A1 (en) | | Lockless parity management in a distributed data storage system |
US7725654B2 (en) | | Affecting a caching algorithm used by a cache of storage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | AS | Assignment | Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW Y; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNORS: HE, DINGSHAN; KENCHAMMANA-HOSEKOTE, DEEPAK R.; REEL/FRAME: 019226/0474; SIGNING DATES FROM 20070424 TO 20070425 |
 | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |