US20080270704A1 - Cache arrangement for improving raid i/o operations - Google Patents

Cache arrangement for improving raid i/o operations

Info

Publication number
US20080270704A1
Authority
US
United States
Prior art keywords
parity
cache
data
data blocks
storage nodes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US11/741,826
Inventor
Dingshan HE
Deepak R. Kenchammana-Hosekote
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Priority to US11/741,826
Assigned to INTERNATIONAL BUSINESS MACHINES CORPORATION. Assignment of assignors interest (see document for details). Assignors: HE, DINGSHAN; KENCHAMMANA-HOSEKOTE, DEEPAK R.
Priority to US12/059,067
Publication of US20080270704A1
Abandoned

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08 Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10 Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076 Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0813 Multiuser, multiprocessor or multiprocessing cache systems with a network or matrix configuration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0866 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches for peripheral storage systems, e.g. disk cache
    • G06F12/0871 Allocation or management of cache space
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5682 Policies or rules for updating, deleting or replacing the stored data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2211/00 Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
    • G06F2211/10 Indexing scheme relating to G06F11/10
    • G06F2211/1002 Indexing scheme relating to G06F11/1076
    • G06F2211/1028 Distributed, i.e. distributed RAID systems with parity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/26 Using a specific storage system architecture
    • G06F2212/261 Storage comprising a plurality of storage devices
    • G06F2212/262 Storage comprising a plurality of storage devices configured as RAID
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/28 Using a specific disk cache architecture
    • G06F2212/283 Plural cache memories
    • G06F2212/284 Plural cache memories being distributed

Abstract

The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes. Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field.

Description

    BACKGROUND
  • 1. Field of the Invention
  • The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations.
  • 2. Description of the Related Art
  • It is often necessary in a distributed storage system to redundantly read or write data that has been striped across more than one storage server (or target). Such a system configuration is referred to as a "network-RAID" (redundant array of independent disks) because the function of a RAID controller is performed by the network protocol of the distributed storage system, which coordinates I/O (input/output) operations that are processed at multiple places concurrently in order to ensure correct system behavior, both atomically and serially. Distributed storage systems using a network-RAID protocol can process, or coordinate, a network-RAID-protocol I/O request (I/O request) locally at a client node, or the request can be forwarded to a storage server or a coordination server for processing. For example, one client node may locally write data to a particular data location, while another client node may choose to forward a read or a write request for the same data location to a shared, or coordination, server.
  • SUMMARY
  • The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method for cache management within a distributed data storage system begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes.
  • Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes comprise more than one parity storage node, the data blocks are cached in any of the parity storage nodes.
  • The method further includes updating the data object. Specifically, a write request is annotated with information regarding changed data blocks within the data object; and, the write request is only sent to the parity storage nodes. The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation. Subsequently, the data blocks and parity data block are read from the storage nodes.
  • An apparatus for cache management within a distributed data storage system is also provided. More specifically, the apparatus comprises a partitioner to partition a data object into a plurality of data blocks. An analysis engine is operatively connected to the partitioner, wherein the analysis engine creates one or more parity data blocks from the data object. Moreover, a controller is operatively connected to the analysis engine, wherein the controller stores the data blocks and the parity data blocks within storage nodes.
  • The controller also caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. When caching within the partitioned cache, the controller only caches data blocks in parity storage nodes, wherein the parity storage nodes have a parity storage field. Thus, when caching, the controller avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes have more than one parity storage node, the controller caches the data blocks in any of the parity storage nodes.
  • Additionally, the controller annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller simultaneously performs an invalidation operation and a write operation. The apparatus further includes a reader operatively connected to the controller, wherein the reader reads the data blocks and the parity data blocks from the storage nodes.
  • Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:
  • FIG. 1 is a table illustrating benefits of caching while executing write and reconstruct read operations;
  • FIG. 2 is a table illustrating an enumeration of the type of plans generated by the embodiments of the invention;
  • FIGS. 3A and 3B are diagrams illustrating two variants of I/O update topology for distributed RAID that keep data in sync;
  • FIGS. 4A, 4B, 4C, and 4D are diagrams illustrating four ways to prime the cache at the parity nodes to improve RAID I/O operations in distributed RAID storage systems;
  • FIG. 5 is a diagram illustrating a system for a cache arrangement for improving RAID I/O operations;
  • FIG. 6 is a diagram illustrating a data object stripe;
  • FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations;
  • FIG. 8 is a diagram illustrating an apparatus for cache management within a distributed data storage system; and
  • FIG. 9 is a flow diagram illustrating a method for cache management within a distributed data storage system.
  • DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
  • The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.
  • The embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
  • Erasure coded data benefits the most from caching while executing write and reconstruct read operations. FIG. 1 illustrates a table showing the benefits. Specifically, an example of the savings with the embodiments of the invention is shown when the underlying distributed RAID layout is RAID5 over 4 nodes. The savings come from exploiting the cache state at various nodes of a distributed RAID system. Pages for a given stripe could be in the read cache at one or more parity node(s), data nodes, and/or client nodes. Embodiments herein can deliver such savings when the working set exceeds the total cache size of a single client node. Brick systems may have more (aggregate) cache space fronting the drives than comparable RAID controllers. Phrased another way, for the same system cost, more aggregate cache can be included in a brick system than in a monolithic system.
  • To make effective use of it, a dispersed cache requires some cache coherence scheme, which comprises two parts. First, a scalable cache directory needs to map pages to nodes. Second, an invalidation (or coherence) protocol is needed to ensure correctness. With erasure codes, read/write performance of data in degraded/critical mode is significantly slower than in fault-free mode. If at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
  • Considering data laid out in some erasure code layout (e.g., RAID5), for each data stripe, a subset of the bricks take on different roles. Each brick stores a stripe of data for which it is the target node (TN). For each stripe, there will be at least t pages to store parity for a t-fault-tolerant code. Each parity page is stored on a different parity node (PN). Client nodes (CN) are also provided. From the perspective of any dirty data page, the multiple nodes in the system are categorized as described below. CN is the client node that initiates the flush of this dirty page; and, TN is the target node to which the dirty data page should be written. {PN} is the parity node that hosts the parity page that depends on the dirty data page. There can be multiple parities depending on the layout, which is indicated by the curly brackets. {DN} is the dependent node that hosts the dependent data (dD) contributing to the calculation of the same parity as the dirty page.
  • The XOR calculations for new parity can be performed at any one or combination of these nodes. Locally, each of the above nodes can have one of two plans: parity compute (PC) or parity increment (PI). Additionally, two issues need to be addressed. The first issue is how each kind of node derives its own best I/O plan. The second issue is how the different nodes interact with each other to reach agreement on the final I/O plan.
  • FIG. 2 illustrates a table, which enumerates all I/O plans possible amongst these nodes for a given dirty page. The overarching notation is that a write changes D_old to D_new, which requires updating the relevant parity page from P_old to P_new. In some schemes, a partial parity is used, computed as Δ = D_new XOR D_old. Next, a method is presented to derive the best local I/O plan and the communication protocol to allow the different nodes to reach an agreement on the final I/O plan.
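  • As an illustration only (the patent itself contains no code), the following Python sketch shows the two local plans in this notation for a single page: parity compute (PC), which recomputes parity from the new page and all dependent data, and parity increment (PI), which applies Δ = D_new XOR D_old to the old parity. All function names are hypothetical.

```python
# Sketch of the PC and PI parity-update plans for one page.
# Illustrative only; these names are not from the patent.

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length pages."""
    return bytes(x ^ y for x, y in zip(a, b))

def parity_compute(d_new: bytes, dependent_pages: list[bytes]) -> bytes:
    """PC plan: recompute parity from the new page and all dependent data (dD)."""
    parity = d_new
    for page in dependent_pages:
        parity = xor(parity, page)
    return parity

def parity_increment(p_old: bytes, d_old: bytes, d_new: bytes) -> bytes:
    """PI plan: P_new = P_old XOR delta, where delta = D_new XOR D_old."""
    return xor(p_old, xor(d_new, d_old))

# Both plans yield the same new parity:
d1, d2, d3 = b"\x01" * 4, b"\x02" * 4, b"\x04" * 4
p_old = xor(xor(d1, d2), d3)
d1_new = b"\x09" * 4
assert parity_compute(d1_new, [d2, d3]) == parity_increment(p_old, d1, d1_new)
```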
  • Data pages can be cached only at parity nodes that depend on them. When an update to the data page occurs (at CN), the invalidation can be piggybacked on the update operation that carries the new parity page (to PN). PN is guaranteed to get an update operation due to how redundancy is maintained, i.e., erasure coding. In other words, if data pages are cached at the parity node(s), the new data is always in the parity nodes. This can be checked during a read of that data by any CN. The unchanged data, which is not in the parity nodes, is not invalidated.
  • Beyond just invalidation, by employing certain client write I/O plans, this cache at the parity node(s) can be kept in sync without any extra messaging. FIGS. 3A and 3B illustrate two such I/O plans (each employing the parity increment with Δ). Specifically, in FIG. 3A, CN writes new data to the target node, computes Δ, and ships it to the affected parity nodes to be applied. In FIG. 3B, CN writes new data to the parity node with old data. This parity node computes Δ and ships it to the target and other parity nodes to be applied.
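  • The FIG. 3A flow might be sketched as follows (a minimal synchronous model; the node classes and method names are invented for illustration): CN computes Δ, writes D_new to TN, and ships Δ to each affected PN, which applies it to its parity page and, as a side effect, keeps its cached copy of the data page in sync, so no separate invalidation message is needed.

```python
# Illustrative FIG. 3A update flow: parity increment driven by the client.
# A sketch only; the patent describes the protocol, not this code.

class ParityNode:
    def __init__(self, p_old: bytes):
        self.parity = p_old
        self.data_cache: dict[str, bytes] = {}  # cached data pages by page id

    def apply_delta(self, page_id: str, delta: bytes, d_new: bytes) -> None:
        # Applying delta updates the parity page; storing d_new keeps the
        # PN's data cache in sync, piggybacking the invalidation.
        self.parity = bytes(p ^ d for p, d in zip(self.parity, delta))
        self.data_cache[page_id] = d_new

class TargetNode:
    def __init__(self, d_old: bytes):
        self.data = d_old

def client_write(page_id: str, d_new: bytes, tn: TargetNode,
                 pns: list[ParityNode]) -> None:
    # For simplicity, the sketch reads D_old directly from the target node.
    delta = bytes(a ^ b for a, b in zip(d_new, tn.data))  # Δ = D_new XOR D_old
    tn.data = d_new                  # write the new data to the target node
    for pn in pns:                   # ship Δ to each affected parity node
        pn.apply_delta(page_id, delta, d_new)
```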
  • As illustrated in FIGS. 4A, 4B, 4C, and 4D, four alternatives are provided to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.
  • If both TN and one or more PNs cache a data block, the effective cache size is reduced. This leads to greater cache pressure on (global pool) cache pages. To avoid this, three rules for caching data are provided. First, TN does not read-cache data pages except during system transience (writing, buffering). This makes TN's cache exclusive. Second, when the erasure code allows for multiple PNs, any one can be chosen (e.g., randomly). Third, the first rule is not applicable to parity pages, which can be cached during transience.
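  • The three rules could be expressed as a placement check along these lines (a hypothetical sketch; random selection is just one way to choose "any one" of multiple PNs):

```python
import random

def cache_nodes_for(target_node: str, parity_nodes: list[str],
                    is_parity_page: bool) -> list[str]:
    """Nodes allowed to cache a page under the three rules (illustrative)."""
    if is_parity_page:
        # Rule 3: parity pages may be cached (during transience) at their PNs.
        return parity_nodes
    # Rule 1: the TN keeps no read cache of its own data (exclusive caching),
    # so target_node never appears in the result.
    # Rule 2: with multiple PNs, any single one may be chosen, e.g., randomly.
    return [random.choice(parity_nodes)]
```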
  • With this caching scheme in place, embodiments herein can use one round of messages to gather all candidate I/O plan costs from all t PNs, compare them with the local plans available to CN, and pick the best plan. In degraded/critical mode, reconstructed pages are held at the parity node longer (until the rebuild completes or cache pressure builds sufficiently) for possible reuse by another client. As discussed above, if at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
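  • That single round of cost gathering might look like the following sketch, where each PN reports the costs of its candidate plans and CN picks the global minimum (all structures are hypothetical):

```python
def pick_best_plan(local_plans: dict[str, int],
                   pn_plan_costs: dict[str, dict[str, int]]):
    """Return (node, plan) with the lowest cost. local_plans holds CN's own
    candidates; pn_plan_costs holds each PN's candidates, gathered in one
    message round. Costs might count disk I/Os, for example."""
    candidates = [("CN", plan, cost) for plan, cost in local_plans.items()]
    for pn, plans in pn_plan_costs.items():
        candidates += [(pn, plan, cost) for plan, cost in plans.items()]
    node, plan, _ = min(candidates, key=lambda c: c[2])
    return node, plan

# Example: CN's local parity-compute plan loses to PN1's increment plan.
assert pick_best_plan({"PC": 4}, {"PN1": {"PI": 2}, "PN2": {"PI": 3}}) == ("PN1", "PI")
```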
  • Thus, while cache invalidation is piggybacked on write operations, priming the caches at the parity nodes takes some extra work. Moreover, read operations will need two phases, including a first phase to exchange plans. Write operations may require three phases, including a first phase to exchange plans (though here there is an opportunity to piggyback). Further, because the cache is location-shifted, the impact on local I/O optimizations (such as prefetching) is unknown.
  • The embodiments herein can be applied to distributed (clustered) storage systems. For such systems, the embodiments of the invention can provide read cache unification and improve RAID I/O operations.
  • Furthermore, the embodiments of the invention provide a distributed cache management scheme for a storage system that uses erasure coded distributed RAID and has a partitioned cache (whose total size can be fairly substantial). This speeds up RAID reads and writes by leveraging cached data, where possible. Moreover, this unifies the cache, which maximizes cache effectiveness. There is no duplication of cached data. The cache management scheme is lightweight; no (additional) messaging for cache coherence or a data directory is needed. The management scheme is also opportunistic; any steps can be skipped under a heavy load without affecting correctness.
  • FIG. 5 is a diagram illustrating a system for such a cache arrangement scheme. The initiator for read or write operations to the dRAID volume can be at a client node 510A or 510B (direct access) or a storage node 520A or 520B (gateway). Meta-data 530 is available to the initiator via a network 540. The storage nodes 520A/520B could have a write and read cache or a read cache only (cache 522A/522B). A dRAIDed stripe is spread across the storage nodes 520A/520B, wherein the system assumes uniformly spread storage.
  • FIG. 6 is a diagram illustrating a data object stripe within five storage nodes (SN1, SN2, SN3, SN4, and SN5). The data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). The role of a storage node for a data block can be a client node (CN), parity node (PN), or target node (TN). Each storage node can play multiple roles for different blocks. Thus, SN3 is the target node for D3; SN5 is the parity node; and, any of the storage nodes can be a client node.
  • Embodiments of the invention provide the following cache rules. First, each write request from a client is annotated with information about the changed blocks within a stripe. Thus, cache invalidation is piggybacked onto regular operations. Second, data blocks can be cached only at parity node(s). Multiple candidates exist for higher-distance codes; and, no separate cache directory is needed. Third, data blocks are not cached at the target node, except by the operating system as staging during read/write operations. The "home" location of data is thus shifted from a target node to a parity node. Fourth, clients "demote" a victim data page to parity node(s). In the case of a higher-distance code, a parity node is chosen lexicographically. Such demotion primes the caches in the storage nodes opportunistically from the clients. Fifth, a client or storage node can locally decide to evict (clean) pages. This provides for loosely coupled caching.
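  • The first and fourth rules might be realized as follows (a sketch under stated assumptions: the message layout, the use of a block set as the annotation, and the lexicographic choice of PN are all illustrative, not defined by the patent):

```python
from dataclasses import dataclass, field

@dataclass
class WriteRequest:
    stripe_id: int
    changed_blocks: set[int]                       # rule 1: the annotation
    pages: dict[int, bytes] = field(default_factory=dict)

def invalidate_on_write(req: WriteRequest,
                        pn_cache: dict[tuple[int, int], bytes]) -> None:
    """Piggybacked invalidation: drop cached copies of the changed blocks."""
    for block in req.changed_blocks:
        pn_cache.pop((req.stripe_id, block), None)

def demotion_target(parity_nodes: list[str]) -> str:
    """Rule 4: one reading of "chosen lexicographically" is to demote the
    victim page to the lexicographically first parity node."""
    return min(parity_nodes)
```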
  • A consequence of the cache rules is that data pages from multiple clients get "percolated" into the caches in the storage nodes, which is advantageous for shared workloads, without the clients even cooperating. This is irrelevant for totally random workloads, which are no worse off than before. Moreover, the caches at the storage nodes are aligned in a "RAID-friendly" way: all data used to compute a parity block is localized. Further, due to the nature of erasure code updates, cache coherence is free. Parity node(s) have to be written to for write completion, and the annotation helps identify which blocks have changed.
  • FIGS. 7A and 7B are diagrams illustrating cache arrangement for improving RAID I/O operations. FIG. 7A illustrates storage node 1 (SN1), which includes data blocks 1, 6, and 11. Storage node 2 (SN2) includes data blocks 2, 7, and 12; and, storage node 3 (SN3) has data blocks 3 and 8, and parity block 3 (P3). Additionally, storage node 4 (SN4) includes data blocks 4 and 9, and parity block 2 (P2); and, storage node 5 (SN5) has data blocks 5 and 10, and parity block 1 (P1). Thus, as illustrated in FIG. 7B, data blocks are only cached in storage nodes having parity blocks (i.e., SN3, SN4, and SN5).
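  • The FIG. 7A placement is a rotated RAID-5 layout; a mapping consistent with the block numbers above is sketched below (the formula is inferred from the example, not stated in the patent):

```python
N_NODES = 5          # SN1..SN5
DATA_PER_STRIPE = 4  # RAID-5 over five nodes: four data blocks plus parity

def parity_node(stripe: int) -> int:
    """0-indexed node holding P<stripe>; parity rotates backwards
    (P1 on SN5, P2 on SN4, P3 on SN3, ...)."""
    return (N_NODES - stripe) % N_NODES

def data_node(block: int) -> int:
    """0-indexed node holding data block <block> (1-indexed, as in FIG. 7A)."""
    stripe = (block - 1) // DATA_PER_STRIPE + 1
    offset = (block - 1) % DATA_PER_STRIPE
    return (parity_node(stripe) + 1 + offset) % N_NODES

# Reproduces FIG. 7A: data blocks 1..12 land on SN1..SN5 in rotation.
assert [data_node(b) + 1 for b in range(1, 13)] == [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2]
```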
  • Reads and writes include an extra messaging phase to query the cache state at the parity node(s). The client costs the various possible read/update plans using metrics such as disk I/Os and memory bandwidth. The client then chooses the best plan and drives the I/O.
  • Read plan choices include finding the cheapest reconstruction plan in three steps: inverting the matrix, masking cached pages, and cost planning. Possible locations include the client node and the parity node(s).
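  • Those three steps might be sketched as follows, with "inverting the matrix" abstracted into a set of candidate reconstruction equations, each a set of pages whose XOR recovers the lost page (a hypothetical model, not the patent's algorithm):

```python
def cheapest_reconstruction(candidate_plans: list[set[str]],
                            cached_pages: set[str]):
    """Pick the reconstruction plan needing the fewest disk reads.
    Pages already cached are masked out; each remaining page costs one read."""
    def cost(plan: set[str]) -> int:
        return len(plan - cached_pages)  # mask cached pages, count disk reads
    best = min(candidate_plans, key=cost)
    return best, cost(best)

# Example: with D2 and P cached at a parity node, the parity-based plan wins.
plans = [{"D1", "D2", "D4"}, {"D1", "D2", "P"}]
assert cheapest_reconstruction(plans, {"D2", "P"}) == ({"D1", "D2", "P"}, 1)
```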
  • Beyond distributed RAID, the embodiments herein are applicable to a class of problems that requires coordination of a distributed cache resource and updates to a set of data blocks that require updates to some common (dependent) block(s). Such systems could include distributed databases and cluster file systems.
  • Thus, the embodiments of the invention provide a distributed cache arrangement for a storage system that speeds up RAID operations where the workload is conducive: the working set is larger than any single client cache but fits in the collective cache, and a shared data set exists between the clients but is time-shifted. Moreover, the cache arrangement adjusts automatically to the workloads from clients. If there is a shared workload, then there is a benefit; otherwise, the cache arrangement exploits the collective cache space.
  • Referring to FIG. 8, an apparatus 800 for cache management within a distributed data storage system is illustrated. More specifically, a partitioner 810 is provided to partition a data object into a plurality of data blocks. An analysis engine 820 is operatively connected to the partitioner 810, wherein the analysis engine 820 creates one or more parity data blocks from the data object. For example, as illustrated in FIG. 6, the data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Furthermore, a controller 830 is operatively connected to the analysis engine 820, wherein the controller 830 stores the data blocks and the parity data block within storage nodes. For example, as illustrated in FIGS. 7A and 7B, the data blocks 1-12 and the parity data blocks P1-P3 are stored within the storage nodes SN1-SN5.
  • The controller 830 also caches the data blocks within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object (e.g., volume, LUN, file system). More specifically, each cache partition is located within a storage node. When caching within the partitioned cache, the controller 830 only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (a field within a storage node where parity data block(s) can be stored). Thus, the controller 830 avoids caching data blocks within storage nodes lacking the parity storage field. For example, as illustrated in FIGS. 7A and 7B, data blocks 1-12 are only cached within the storage nodes having stored parity data blocks. In this example, parity data blocks P1, P2, and P3 are stored in storage nodes SN5, SN4, and SN3, respectively.
  • When caching within the partitioned cache, and when the storage nodes comprise more than one parity storage node, the controller 830 caches the data blocks in any of the parity storage nodes. Moreover, the controller 830 annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller 830 simultaneously performs an invalidation operation and a write operation. Additionally, a reader 840 is operatively connected to the controller 830, wherein the reader 840 reads the data blocks and the parity data block from the storage nodes.
  • Referring to FIG. 9, a method 900 for cache management within a distributed data storage system is illustrated. More specifically, the method 900 begins in item 910 by partitioning a data object into data blocks. Next, in item 920, one or more parity data blocks are created from the data object. As described above, FIG. 6 illustrates a data object stripe having a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Following this, in item 930, the data blocks and the parity data block are stored within storage nodes. As described above, the role of a storage node for a data block can be a client node (CN), a parity node (PN), or a target node (TN). Each storage node can play multiple roles for different blocks.
  • In item 940, the data blocks are also cached within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. As described above, the storage nodes could have a write and read cache or a read cache only. Moreover, the caching within the partitioned cache only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (item 942). Thus, caching the data blocks within storage nodes lacking the parity storage field is avoided (item 944). Accordingly, as described above, a separate cache directory is not required because the cached data blocks are only in the parity storage nodes.
  • When caching the data blocks within the partitioned cache, and when the storage nodes have more than one parity storage node, the data blocks are cached in any of the parity storage nodes (item 946). As described above, FIGS. 4A, 4B, 4C, and 4D illustrate four alternatives to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.
  • The method 900 also includes, in item 950, updating the data object. This includes annotating a write request with information regarding changed data blocks within the data object (item 952) and sending the write request only to the parity storage nodes (item 954). The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation (item 956). Thus, as described above, cache invalidation is piggybacked onto regular operations. Due to the nature of erasure code updating, cache coherence is free because parity node(s) have to be written to for a write completion. Annotation helps identify which blocks have changed. Subsequently, in item 960, the data blocks and parity data block are read from the storage nodes. The method 900 can check the cache at the parity storage nodes before reading the data block from the target storage nodes.
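  • That final check, consulting the parity node's cache before touching the target node, could look like this sketch (the node objects and their methods are invented; freshness follows because every write passes through the parity nodes):

```python
def read_block(stripe_id: int, block: int, parity_node, target_node) -> bytes:
    """Hypothetical read path for item 960: a cache hit at the parity node
    is guaranteed fresh, since writes always update the parity node."""
    page = parity_node.data_cache.get((stripe_id, block))
    if page is not None:
        return page  # served from the unified read cache at the parity node
    return target_node.read_from_disk(stripe_id, block)
```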
  • Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.
  • The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments of the invention have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.

Claims (3)

1-7. (canceled)
7. A method for cache management within a distributed data storage system, said method comprising:
partitioning a data object into a plurality of data blocks;
creating at least one parity data block from said data object;
storing said data blocks and said parity data block within storage nodes;
caching said data blocks within a partitioned cache, wherein said partitioned cache comprises a plurality of cache partitions, wherein said cache partitions are located within said storage nodes,
wherein said caching within said partitioned cache only caches said data blocks in parity storage nodes, wherein said parity storage nodes comprise a parity storage field;
updating said data object, said updating comprising
annotating a write request with information regarding changed data blocks within said data object, and
sending said write request only to said parity storage nodes; and
reading said data blocks and said parity data block from said storage nodes;
wherein said caching within said partitioned cache comprises avoiding caching said data blocks within storage nodes lacking said parity storage field,
wherein said sending of said write request only to said parity storage nodes comprises simultaneously performing an invalidation operation and a write operation, and
wherein said caching of said data blocks within said partitioned cache comprises, when said storage nodes comprise more than one of said parity storage nodes, caching said data blocks in any of said parity storage nodes.
8-20. (canceled)
US11/741,826 2007-04-30 2007-04-30 Cache arrangement for improving raid i/o operations Abandoned US20080270704A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US11/741,826 US20080270704A1 (en) 2007-04-30 2007-04-30 Cache arrangement for improving raid i/o operations
US12/059,067 US7979641B2 (en) 2007-04-30 2008-03-31 Cache arrangement for improving raid I/O operations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US11/741,826 US20080270704A1 (en) 2007-04-30 2007-04-30 Cache arrangement for improving raid i/o operations

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US12/059,067 Continuation US7979641B2 (en) 2007-04-30 2008-03-31 Cache arrangement for improving raid I/O operations

Publications (1)

Publication Number Publication Date
US20080270704A1 true US20080270704A1 (en) 2008-10-30

Family

ID=39888394

Family Applications (2)

Application Number Title Priority Date Filing Date
US11/741,826 Abandoned US20080270704A1 (en) 2007-04-30 2007-04-30 Cache arrangement for improving raid i/o operations
US12/059,067 Expired - Fee Related US7979641B2 (en) 2007-04-30 2008-03-31 Cache arrangement for improving raid I/O operations

Family Applications After (1)

Application Number Title Priority Date Filing Date
US12/059,067 Expired - Fee Related US7979641B2 (en) 2007-04-30 2008-03-31 Cache arrangement for improving raid I/O operations

Country Status (1)

Country Link
US (2) US20080270704A1 (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9880970B2 (en) * 2007-10-03 2018-01-30 William L. Bain Method for implementing highly available data parallel operations on a computational grid
US8037391B1 (en) * 2009-05-22 2011-10-11 Nvidia Corporation Raid-6 computation system and method
US8296515B1 (en) 2009-05-22 2012-10-23 Nvidia Corporation RAID-6 computation system and method
US20130290636A1 (en) * 2012-04-30 2013-10-31 Qiming Chen Managing memory
US10210167B1 (en) * 2012-05-07 2019-02-19 Amazon Technologies, Inc. Multi-level page caching for distributed object store
US9811530B1 (en) * 2013-06-29 2017-11-07 EMC IP Holding Company LLC Cluster file system with metadata server for storage of parallel log structured file system metadata for a shared file
US10116336B2 (en) * 2014-06-13 2018-10-30 Sandisk Technologies Llc Error correcting code adjustment for a data storage device
US10547681B2 (en) * 2016-06-30 2020-01-28 Purdue Research Foundation Functional caching in erasure coded storage
US10459807B2 (en) * 2017-05-23 2019-10-29 International Business Machines Corporation Determining modified portions of a RAID storage array

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6725392B1 (en) * 1999-03-03 2004-04-20 Adaptec, Inc. Controller fault recovery system for a distributed file system
US6718434B2 (en) 2001-05-31 2004-04-06 Hewlett-Packard Development Company, L.P. Method and apparatus for assigning raid levels
US6782450B2 (en) 2001-12-06 2004-08-24 Raidcore, Inc. File mode RAID subsystem
US7107403B2 (en) 2003-09-30 2006-09-12 International Business Machines Corporation System and method for dynamically allocating cache space among different workload classes that can have different quality of service (QoS) requirements where the system and method may maintain a history of recently evicted pages for each class and may determine a future cache size for the class based on the history and the QoS requirements
US7313749B2 (en) 2004-06-29 2007-12-25 Hewlett-Packard Development Company, L.P. System and method for applying error correction code (ECC) erasure mode and clearing recorded information from a page deallocation table

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6049851A (en) * 1994-02-14 2000-04-11 Hewlett-Packard Company Method and apparatus for checking cache coherency in a computer architecture
US6148368A (en) * 1997-07-31 2000-11-14 Lsi Logic Corporation Method for accelerating disk array write operations using segmented cache memory and data logging
US6594698B1 (en) * 1998-09-25 2003-07-15 Ncr Corporation Protocol for dynamic binding of shared resources
US6651140B1 (en) * 2000-09-01 2003-11-18 Sun Microsystems, Inc. Caching pattern and method for caching in an object-oriented programming environment
US6523087B2 (en) * 2001-03-06 2003-02-18 Chaparral Network Storage, Inc. Utilizing parity caching and parity logging while closing the RAID5 write hole
US20030033572A1 (en) * 2001-08-09 2003-02-13 Walton John K. Memory system and method of using same
US6963959B2 (en) * 2002-10-31 2005-11-08 International Business Machines Corporation Storage system and method for reorganizing data to improve prefetch effectiveness and reduce seek distance
US6970987B1 (en) * 2003-01-27 2005-11-29 Hewlett-Packard Development Company, L.P. Method for storing data in a geographically-diverse data-storing system providing cross-site redundancy
US7457980B2 (en) * 2004-08-13 2008-11-25 Ken Qing Yang Data replication method over a limited bandwidth network by mirroring parities

Cited By (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8595455B2 (en) 2007-01-30 2013-11-26 American Megatrends, Inc. Maintaining data consistency in mirrored cluster storage systems using bitmap write-intent logging
US8498967B1 (en) 2007-01-30 2013-07-30 American Megatrends, Inc. Two-node high availability cluster storage solution using an intelligent initiator to avoid split brain syndrome
US8255739B1 (en) * 2008-06-30 2012-08-28 American Megatrends, Inc. Achieving data consistency in a node failover with a degraded RAID array
US8667322B1 (en) 2008-06-30 2014-03-04 American Megatrends, Inc. Achieving data consistency in a node failover with a degraded raid array
US8219707B2 (en) * 2008-09-12 2012-07-10 Institute Of Acoustics, Chinese Academy Of Sciences Storage network structure based on the Peterson graph and data read-write method thereof
US20100077101A1 (en) * 2008-09-12 2010-03-25 Institute Of Acoustics, Chinese Academy Of Sciences Storage network structure based on the Peterson graph and data read-write method thereof
US20100262771A1 (en) * 2009-04-13 2010-10-14 Takehiko Kurashige Data storage system and cache data-consistency assurance method
US8108605B2 (en) * 2009-04-13 2012-01-31 Kabushiki Kaisha Toshiba Data storage system and cache data-consistency assurance method
US9740516B1 (en) 2011-01-13 2017-08-22 Google Inc. Virtual network protocol
US8533343B1 (en) 2011-01-13 2013-09-10 Google Inc. Virtual network pairs
US9135037B1 (en) 2011-01-13 2015-09-15 Google Inc. Virtual network protocol
US8874888B1 (en) 2011-01-13 2014-10-28 Google Inc. Managed boot in a cloud system
US9619662B1 (en) 2011-01-13 2017-04-11 Google Inc. Virtual network pairs
US8745329B2 (en) 2011-01-20 2014-06-03 Google Inc. Storing data across a plurality of storage nodes
US9250830B2 (en) 2011-01-20 2016-02-02 Google Inc. Storing data across a plurality of storage nodes
WO2012100037A1 (en) * 2011-01-20 2012-07-26 Google Inc. Storing data on storage nodes
US9794144B1 (en) 2011-02-15 2017-10-17 Google Inc. Correlating status information generated in a computer network
US8812586B1 (en) 2011-02-15 2014-08-19 Google Inc. Correlating status information generated in a computer network
US9231933B1 (en) 2011-03-16 2016-01-05 Google Inc. Providing application programs with access to secured resources
US9237087B1 (en) 2011-03-16 2016-01-12 Google Inc. Virtual machine name resolution
US11237810B2 (en) 2011-03-16 2022-02-01 Google Llc Cloud-based deployment using templates
US10241770B2 (en) 2011-03-16 2019-03-26 Google Llc Cloud-based deployment using object-oriented classes
US9063818B1 (en) 2011-03-16 2015-06-23 Google Inc. Automated software updating based on prior activity
US9557978B2 (en) 2011-03-16 2017-01-31 Google Inc. Selection of ranked configurations
US10212591B1 (en) 2011-08-11 2019-02-19 Google Llc Authentication based on proximity to mobile device
US9769662B1 (en) 2011-08-11 2017-09-19 Google Inc. Authentication based on proximity to mobile device
US9075979B1 (en) 2011-08-11 2015-07-07 Google Inc. Authentication based on proximity to mobile device
US8966198B1 (en) 2011-09-01 2015-02-24 Google Inc. Providing snapshots of virtual storage devices
US9251234B1 (en) 2011-09-01 2016-02-02 Google Inc. Providing snapshots of virtual storage devices
US9501233B2 (en) 2011-09-01 2016-11-22 Google Inc. Providing snapshots of virtual storage devices
US9069616B2 (en) 2011-09-23 2015-06-30 Google Inc. Bandwidth throttling of virtual disks
US8958293B1 (en) 2011-12-06 2015-02-17 Google Inc. Transparent load-balancing for cloud computing services
US8800009B1 (en) 2011-12-30 2014-08-05 Google Inc. Virtual machine service access
US8983860B1 (en) 2012-01-30 2015-03-17 Google Inc. Advertising auction system
US9672052B1 (en) 2012-02-16 2017-06-06 Google Inc. Secure inter-process communication
US8996887B2 (en) 2012-02-24 2015-03-31 Google Inc. Log structured volume encryption for virtual machines
US8677449B1 (en) 2012-03-19 2014-03-18 Google Inc. Exposing data to virtual machines
US9069806B2 (en) 2012-03-27 2015-06-30 Google Inc. Virtual block devices
US9720952B2 (en) 2012-03-27 2017-08-01 Google Inc. Virtual block devices
US8972478B1 (en) * 2012-05-23 2015-03-03 Netapp, Inc. Using append only log format in data storage cluster with distributed zones for determining parity of reliability groups
US9740403B2 (en) 2012-05-23 2017-08-22 Netapp, Inc. Methods for managing storage in a data storage cluster with distributed zones based on parity values and devices thereof
US9430255B1 (en) 2013-03-15 2016-08-30 Google Inc. Updating virtual machine generated metadata to a distribution service for sharing and backup
US20150032725A1 (en) * 2013-07-25 2015-01-29 Facebook, Inc. Systems and methods for efficient data ingestion and query processing
US9442967B2 (en) * 2013-07-25 2016-09-13 Facebook, Inc. Systems and methods for efficient data ingestion and query processing
US9652520B2 (en) 2013-08-29 2017-05-16 Oracle International Corporation System and method for supporting parallel asynchronous synchronization between clusters in a distributed data grid
US9659078B2 (en) 2013-08-29 2017-05-23 Oracle International Corporation System and method for supporting failover during synchronization between clusters in a distributed data grid
WO2015031378A1 (en) * 2013-08-29 2015-03-05 Oracle International Corporation System and method for supporting partition level journaling for synchronizing data in a distributed data grid
US9703853B2 (en) 2013-08-29 2017-07-11 Oracle International Corporation System and method for supporting partition level journaling for synchronizing data in a distributed data grid
US10423643B2 (en) 2013-08-29 2019-09-24 Oracle International Corporation System and method for supporting resettable acknowledgements for synchronizing data in a distributed data grid
US9753853B2 (en) 2014-10-09 2017-09-05 Netapp, Inc. Methods and systems for cache management in storage systems
WO2016057537A1 (en) * 2014-10-09 2016-04-14 Netapp, Inc. Methods and systems for cache management in storage systems
US10185639B1 (en) 2015-05-08 2019-01-22 American Megatrends, Inc. Systems and methods for performing failover in storage system with dual storage controllers
US9817713B2 (en) 2016-02-04 2017-11-14 International Business Machines Corporation Distributed cache system utilizing multiple erasure codes

Also Published As

Publication number Publication date
US7979641B2 (en) 2011-07-12
US20080270878A1 2008-10-30

Similar Documents

Publication Publication Date Title
US7979641B2 (en) Cache arrangement for improving raid I/O operations
JP7077359B2 (en) Distributed storage system
US10789020B2 (en) Recovering data within a unified storage element
US11068389B2 (en) Data resiliency with heterogeneous storage
US20230315346A1 (en) Utilizing Multiple Redundancy Schemes Within A Unified Storage Element
US10365983B1 (en) Repairing raid systems at per-stripe granularity
US6912669B2 (en) Method and apparatus for maintaining cache coherency in a storage system
CN105960639B (en) Prioritization data reconstruct in distributed memory system
CN110737541B (en) Method and system for distributing data in distributed storage system
US7788244B2 (en) Method and system for copying a snapshot tree
CN102937882B (en) To effective access with the memory device using bitmap
CN102884502B (en) Managing write operations to an extent of tracks migrated between storage devices
US10825477B2 (en) RAID storage system with logical data group priority
US20150127975A1 (en) Distributed virtual array data storage system and method
US10467527B1 (en) Method and apparatus for artificial intelligence acceleration
CN1770114A (en) Copy operations in storage networks
CN104395904A (en) Efficient data object storage and retrieval
CN1679000A (en) Using file system information in raid data reconstruction and migration
WO2011101482A1 (en) Read-other protocol for maintaining parity coherency in a write-back distributed redundancy data storage system
CN1804810A (en) Method and system of redirection for storage access requests
CN102841854A (en) Method and system for executing data reading based on dynamic hierarchical memory cache (hmc) awareness
CN1770115A (en) Recovery operations in storage networks
CN101147118A (en) Methods and apparatus for reconfiguring a storage system
US20170277450A1 (en) Lockless parity management in a distributed data storage system
US7725654B2 (en) Affecting a caching algorithm used by a cache of storage system

Legal Events

Date Code Title Description
AS Assignment

Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:HE, DINGSHAN;KENCHAMMANA-HOSEKOTE, DEEPAK R.;REEL/FRAME:019226/0474;SIGNING DATES FROM 20070424 TO 20070425

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION