WO2008144018A1 - Device and method for an efficient storage manager for copy-on-write snapshots - Google Patents

Device and method for an efficient storage manager for copy-on-write snapshots

Info

Publication number
WO2008144018A1
Authority
WO
WIPO (PCT)
Prior art keywords
snapshot
snapshots
pages
page
database
Prior art date
Application number
PCT/US2008/006366
Other languages
English (en)
Inventor
Liuba Shrira
Hao Xu
Original Assignee
Brandeis University
Priority date
Filing date
Publication date
Application filed by Brandeis University filed Critical Brandeis University
Publication of WO2008144018A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23Updating
    • G06F16/2308Concurrency control

Definitions

  • the present invention relates to snapshot storage management systems. More specifically, the present invention relates to snapshot storage management systems that provide applications with the ability to discriminate among snapshots efficiently.
  • a new generation of storage systems exploit decreasing storage costs and efficient versioning techniques to allow applications to take snapshots of past states and retain them for long durations. Snapshot analysis is becoming increasingly important.
  • an Intensive Care Unit (ICU) monitoring system may analyze the information on patients' past response to treatment.
  • Figure 2 shows an embodiment of a split copy-on-write
  • Figure 3 shows an embodiment of archiving split snapshots
  • Figure 4 shows an embodiment of rank-tree policy
  • Figure 5 shows an embodiment of eager ranked segregation
  • Figure 6 shows an embodiment of a diff-based representation of BITE
  • Figure 7 shows an embodiment of lazy segregation
  • Figure 8 shows an embodiment of reclamation in Hybrid
  • Figure 9 shows a graph of performance relative to Thor in an embodiment of Hybrid.
  • Figure 10 shows a graph of performance relative to Thor in an embodiment having multiple clients.
  • Thresher is a new snapshot storage management system, based on a novel snapshot technique that is the first to provide applications the ability to discriminate among snapshots efficiently.
  • An application may provide a discrimination policy that ranks snapshots. The policy may be specified when snapshots are taken, or later after snapshots have been created. Thresher may efficiently disentangle differently ranked snapshots, allowing valuable snapshots to be stored with faster access or to remain accessible for longer, while also allowing less valuable snapshots to be discarded. These operations may be executed in Thresher without creating disk fragmentation.
  • Thresher is based on two key innovations. First, a novel technique called "ranked segregation" may efficiently separate on disk the states of differently-ranked copy-on-write snapshots, enabling no- copy reclamation without fragmentation. Second, while most snapshot systems rely on a no-overwrite update approach, Thresher may rely on a novel update-in-place technique that provides an efficient way to transform snapshot representation as snapshots are created.
  • the ranked segregation technique may be efficiently composed with different snapshot representations to lower the storage management costs for several useful discrimination policies.
  • Thresher may segregate the normal and abnormal snapshots efficiently by composing ranked segregation with a compact diff-based representation to reduce the cost of copying.
  • Thresher may compose ranked segregation with a dual snapshot representation that is less compact but provides faster access.
  • a snapshot storage manager, like a garbage collector, must be designed with a concrete system in mind, and must perform well for different application workloads.
  • Thresher was prototyped in an experimental snapshot system that allows flexible control of workload parameters. Update density and overwriting were identified as the two key parameters that determine the performance of a snapshot storage manager. Measurements of the Thresher prototype indicate that the new techniques are efficient and scalable, imposing minimal (4%) performance penalty on common expected workloads.
  • Thresher is described herein in the context of a concrete system but it will be appreciated by one skilled in the art that the techniques described are more general and may be readily applied to other systems.
  • Thresher is implemented in SNAP, the snapshot system that provides snapshots for the Thor object storage system.
  • SNAP assumes that applications are structured as sequences of "transactions" accessing a storage system. It supports Back-In-Time Execution (BITE), a capability of a storage system where applications running general code may run against read-only snapshots in addition to the current state.
  • the snapshots may reflect transactionally consistent historical states.
  • An application may choose which snapshots it wants to access so that snapshots may reflect states meaningful to the application.
  • Applications may take snapshots at unlimited "resolution", e.g. after each transaction, without disrupting access to the current state.
  • Thresher may allow applications to discriminate among snapshots by incorporating a snapshot discrimination policy into the following three snapshot operations: a request to take a snapshot ("snapshot request", or "declaration") that provides a discrimination policy, or indicates lazy discrimination; a request to access a snapshot ("snapshot access"); and a request to specify a discrimination policy for a snapshot ("discrimination request").
  • the operations may have the following semantics.
  • an application may take a snapshot by asking for a snapshot "now".
  • This snapshot request may be serialized along with other transactions and other snapshots. That is, a snapshot may reflect all state-modifications by transactions serialized before this request, but does not reflect modifications by transactions serialized after.
  • a snapshot request may return a snapshot "name" that applications can use to refer to this snapshot later, e.g. to specify a discrimination policy for a snapshot. For simplicity, it is assumed that snapshots are assigned unique sequence numbers that correspond to the order in which they occur.
  • a snapshot access request may specify which snapshot an application wants to use for back-in-time execution.
  • the request may return a consistent set of object states, allowing the read-only transaction to run as if it were running against the current storage state.
  • a discrimination policy may rank snapshots.
  • a rank may be a numeric score assigned to a snapshot. Thresher may interpret the ranking to determine the relative lifetimes of snapshots and the relative snapshot access latency.
  • a snapshot storage management system needs to be efficient and not unduly slow-down the snapshot system.
  • the general approach to snapshot discrimination may be applicable to snapshot systems that separate snapshots from the current storage system state.
  • Such so-called “split" snapshot systems may rely on update-in-place storage and create snapshots by copying out the past states, unlike snapshot systems that rely on no-overwrite storage and do not separate snapshot and current states.
  • Split snapshots may be attractive in long-lived systems because they allow creation of high-frequency snapshots without disrupting access to the current state while preserving the on-disk object clustering for the current state. This approach may take advantage of the separation between snapshot and current states to provide efficient snapshot discrimination.
  • a specialized snapshot representation may be created which is tailored to the discrimination policy while copying out the past states.
  • Thor has a client/ server architecture.
  • Servers provide persistent storage (called database storage) for objects.
  • Clients may cache copies of the objects and run applications that interact with the system by making calls to methods of cached objects. Method calls may occur within the context of a transaction.
  • a transaction commit may cause all modifications to become persistent, while an abort may leave no transaction changes in the persistent state.
  • the system may use optimistic concurrency control.
  • a client may send its read and write object sets with modified object states to the server when the application asks to commit the transaction. If no conflicts are detected, the server may commit the transaction.
  • An object may belong to a particular server. The object within a server may be uniquely identified by an object reference (Oref). Objects may be clustered into 8KB pages.
  • An object Oref is composed of a PageID and an oid.
  • the PageID identifies the containing page and allows the lookup of an object location using a page table.
  • the oid is an index into an offset table stored in the page.
  • the offset table contains the object offsets within the page. This indirection may allow moving an object within a page without changing the references to it.
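  • The Oref indirection described above can be sketched as follows. This is an illustrative model, not Thor's actual implementation; all names (`Page`, `store`, `move`, `lookup`) are assumptions.

```python
# Sketch of Oref = (PageID, oid) indirection: the page table maps PageID to
# a page, and the page's offset table maps oid to the object's offset, so
# an object can move within its page without invalidating any Oref.

class Page:
    def __init__(self):
        self.data = {}       # offset within page -> object state
        self.offsets = []    # oid (index) -> offset within the page
        self._next = 0

    def store(self, obj):
        offset = self._next
        self._next += len(obj)
        self.data[offset] = obj
        self.offsets.append(offset)
        return len(self.offsets) - 1   # the new object's oid

    def move(self, oid, new_offset):
        # Relocate the object inside the page; only the offset-table
        # entry changes, so existing Orefs stay valid.
        offset = self.offsets[oid]
        self.data[new_offset] = self.data.pop(offset)
        self.offsets[oid] = new_offset

def lookup(page_table, oref):
    page_id, oid = oref
    page = page_table[page_id]           # PageID -> page via page table
    return page.data[page.offsets[oid]]  # oid -> offset -> object
```

  • Because a reference holds only (PageID, oid), relocating an object inside its page updates one offset-table entry and leaves every reference to it unchanged, which is the indirection property the paragraph above describes.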
  • the client may fetch the containing page from the server. Only modified objects may be shipped back to the server when the transaction commits. Thor provides transaction durability using the ARIES no-force no-steal redo log protocol. Since only modified objects may be shipped back at commit time, the server may need to do an installation read (iread) to obtain the containing page from disk.
  • An in-memory, recoverable cache called the Modified Object Buffer (MOB) may store the committed modifications allowing ireads to be deferred and write absorption to be increased. The modifications may be propagated to the disk by a background "cleaner" thread that cleans the MOB. The cleaner may process the MOB in transaction log order to facilitate the truncation of the transaction log. For each modified object encountered, it may read the page containing the object from disk (iread). If the page is not cached, it may install all modifications in the MOB for objects in that page, write the updated page back to disk, and remove the objects from the MOB.
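  • The cleaning loop described above can be sketched as below. All names are illustrative assumptions, not Thor's actual API: the cleaner walks the MOB in log order; for each modified object it ireads the containing page, installs all pending modifications for that page (write absorption), writes the page back, and removes the absorbed objects from the MOB.

```python
# Minimal sketch of MOB cleaning with write absorption.

def clean_mob(mob, disk):
    """mob:  ordered list of (page_id, oid, new_state) in transaction-log order.
       disk: dict page_id -> dict oid -> state (the database pages)."""
    while mob:
        page_id, _, _ = mob[0]
        page = dict(disk[page_id])               # installation read (iread)
        remaining = []
        for pid, oid, state in mob:
            if pid == page_id:
                page[oid] = state                # install modification
            else:
                remaining.append((pid, oid, state))
        disk[page_id] = page                     # write updated page back
        mob[:] = remaining                       # drop absorbed entries
```

  • Processing in log order, as above, is what lets the transaction log be truncated up to the cleaned prefix.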
  • the server also may manage an in-memory page cache used to serve client fetch requests. Before returning a requested page to the client, the server may update the cache copy, installing all modifications in the MOB for that page so that the fetched page reflects the up-to-date committed state.
  • the page cache may use Least Recently Used (LRU) replacement but discards old dirty pages (it may depend on ireads to read them back during MOB cleaning) rather than writing them back to disk immediately. Therefore, the cleaner thread may be the only component of the system that writes pages to disk.
  • SNAP may create snapshots by copying out the past storage system states onto a separate snapshot archive disk.
  • a snapshot may provide the same abstraction as the storage system, consisting of "snapshot pages” and a “snapshot page table”. This may allow unmodified application code running in the storage system to run as BITE over a snapshot.
  • SNAP may copy snapshot pages and snapshot page table mappings into the archive during cleaning. It may use an incremental copy-on-write technique specialized for split snapshots: a snapshot page is constructed and copied into the archive when a page on the database disk is about to be overwritten the first time after a snapshot is declared. Archiving a page may create a snapshot page table mapping for the archived page.
  • a client application may send a "snapshot access request" to the server.
  • the server may construct an Archive Page Table (APT) for version v (APT_v) and "mount" it for the client.
  • APT_v may map each page in snapshot v into its archive address or indicate the page is in the database.
  • on receiving a page fetch request from the client, the server may look up the page in APT_v and read it from either the archive or the database. Since snapshots may be accessed read-only, APT_v can be shared by all clients mounting snapshot v.
  • Figure 1 shows an embodiment of how unmodified client application code accesses objects in snapshot v that includes both archived and database pages.
  • client code may request object y on page Q.
  • the server may look up Q in APT_v, load page Q_v from the archive and send it to the client. Later on client code may follow a reference from y to x in the client cache, requesting object x in page P from the server.
  • the server may look up P in APT V and find out that the page P for snapshot v is still in the database.
  • the server reads P from the database and sends it to the client.
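  • The fetch path through APT_v in Figure 1 can be sketched as follows; `IN_DATABASE` and the dict-based stores are hypothetical stand-ins for SNAP's actual structures.

```python
# A mounted APT_v maps each page either to an archive address or to a
# marker meaning "still in the database"; the server serves the fetch
# from whichever store holds snapshot v's version of the page.

IN_DATABASE = None

def fetch_snapshot_page(apt_v, archive, database, page_id):
    addr = apt_v[page_id]
    if addr is IN_DATABASE:
        return database[page_id]     # snapshot page not yet overwritten
    return archive[addr]             # archived copy-on-write pre-state
```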
  • the archive representation for a snapshot page may include the complete storage system page. This representation is referred to as "page-based".
  • the following describe different snapshot page representations, specialized to various discrimination policies.
  • a snapshot page can have a more compact representation based on modified object diffs, or it can have two different representations.
  • Such variations in snapshot representation are transparent to the application code running BITE, since the archive read operation reconstructs the snapshot page into storage system representation before sending it to the client.
  • snapshot declarations may partition transaction history into "spans".
  • the span of a snapshot v starts with its declaration and ends with the declaration of the next snapshot (v+1).
  • when a page P is first modified in the span of snapshot v, snapshot v "records" its version of P.
  • snapshot v records pages P and S (retaining the pre-states modified by transaction tx_2) and the page T (retaining the pre-state modified by transaction tx_1). Note that there may not be a need to retain the pre-state of page P modified by transaction tx_3, since it is not the first modification of P in the span.
  • the later snapshot may record v's version of P.
  • v's version of page Q may be recorded by a later snapshot v+1, which also records its own version of P.
  • Snapshot pages may be constructed and copied into the archive during cleaning when the pre- states of modified pages about to be overwritten in the database are available in memory. Since the cleaner may run asynchronously with the snapshot declaration, the snapshot system needs to prevent snapshot states from being overwritten by the on-going transactions. For example, if several snapshots are declared between two successive cleaning rounds, and a page P gets modified after each snapshot declaration, the snapshot system may have to retain a different version of P for each snapshot.
  • SNAP may prevent snapshot state overwriting, without blocking the on-going transactions. It may retain the pre-states needed for snapshot creation in an in-memory data structure called Versioned Modified Object Buffer (VMOB).
  • VMOB may contain a queue of buckets, one for each snapshot. Bucket v may hold modifications committed in v's span. As transactions commit modifications, modified objects may be added to the bucket of the latest snapshot (Step 1, Figure 3).
  • the declaration of a new snapshot may create a new mutable bucket, and make the preceding snapshot bucket immutable, preventing the overwriting of the needed snapshot states.
  • a cleaner may update the database by cleaning the modifications in the VMOB, and in the process of cleaning, construct the snapshot pages for archiving.
  • Steps 2-5 in Figure 3 trace this process.
  • the cleaner may first obtain a database copy of P.
  • the cleaner may then use P and the modifications in the buckets to create all the needed snapshot versions of P before updating P in the database.
  • the cleaner may construct the version of P recorded by v simply by using the database copy of P.
  • the cleaner may then update P by applying modifications in bucket v, remove the modifications from the bucket v, and proceed to the following bucket.
  • the updated P will be the version of P recorded by the snapshot that has the next modification to P in its bucket. This process may be repeated for all pages with modifications in VMOB, constructing the recorded snapshot pages for the snapshots corresponding to the immutable VMOB buckets.
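  • The per-page cleaning pass just described can be sketched as follows (a minimal model; `archive_versions` and its data shapes are assumptions, not SNAP's actual code): the current database copy of P is the version recorded by the first snapshot whose bucket modifies P; applying that bucket's modifications yields the version recorded by the next such snapshot, and so on.

```python
# Construct and archive all needed snapshot versions of one page from the
# immutable VMOB buckets, then return the final state to write to the db.

def archive_versions(page_id, db_page, buckets, archive):
    """db_page: dict oid -> state (current database copy of the page).
       buckets: ordered list of (snapshot_id, {page_id: {oid: state}})
                for the immutable VMOB buckets, oldest first.
       archive: dict (snapshot_id, page_id) -> recorded page version."""
    page = dict(db_page)
    for snap, mods_by_page in buckets:
        mods = mods_by_page.get(page_id)
        if not mods:
            continue                            # snap does not record P
        archive[(snap, page_id)] = dict(page)   # snap records this pre-state
        page.update(mods)                       # advance to next version
    return page                                 # final state for the db
```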
  • the cleaner may write the recorded pages into the archive sequentially in snapshot order, thus creating "incremental" snapshots.
  • the mappings for the archived snapshot pages may be collected in versioned incremental snapshot page tables.
  • VPT_v, the versioned page table for snapshot v, may be a data structure containing the mappings (from page id to archive address) for the pages recorded by snapshot v.
  • mappings may be inserted into VPT_v.
  • VPT_v may be archived as well.
  • the cleaner may write the VPTs sequentially, in snapshot order, into a separate archive data structure. This way, a forward sequential scan through the archived incremental page tables from VPT_v onward may find the mappings for all the archived pages that belong to snapshot v. Namely, the mapping for v's version of page P may be found either in VPT_v or, if not there, in the VPT of the first subsequently declared snapshot that records P. SNAP efficiently bounds the length of the scan. For brevity, the bounded scan protocol is not reviewed here.
  • SNAP may need to identify the snapshot v pages that are in the current database.
  • Highest Archived Version (HAV) may be an auxiliary data structure that tracks the highest archived version for each page. If HAV(P) < v, the snapshot v page P may be in the database.
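  • The forward-scan lookup combined with the HAV check can be sketched as below. The names and data shapes are illustrative assumptions, and SNAP's bounded-scan optimization is omitted.

```python
# Locate snapshot v's version of a page: the HAV check decides whether the
# version is still in the database; otherwise a forward scan over the
# incremental VPTs finds the first recorder of the page at or after v.

def locate_snapshot_page(page_id, v, vpts, hav):
    """vpts: dict snapshot_id -> {page_id: archive_addr} (incremental VPTs).
       hav:  dict page_id -> highest snapshot whose version was archived."""
    if hav.get(page_id, 0) < v:
        return ('database', None)     # v's version was never archived
    s = v
    while s in vpts:                  # forward scan: VPT_v, VPT_{v+1}, ...
        if page_id in vpts[s]:
            return ('archive', vpts[s][page_id])
        s += 1
    return ('database', None)
```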
  • a snapshot discrimination policy may specify that older snapshots outlive more recently declared snapshots. Since snapshots may be archived incrementally, managing storage for snapshots according to such a discrimination policy can be costly. Pages that belong to a longer-lived snapshot may be recorded by a later short-lived snapshot, thus entangling short-lived and long-lived pages. When different lifetime pages are entangled, discarding shorter-lived pages may create archive storage fragmentation. For example, consider two consecutive snapshots v and v+1 in Figure 2, with v recording page versions P_v and S_v, and v+1 recording pages P_{v+1}, Q_{v+1} and S_{v+1}. The page Q_{v+1} recorded by v+1 belongs to both snapshots v and v+1.
  • Thresher exploits the copying of past states in a split snapshot system.
  • Thresher may "segregate" the long-lived and the short-lived snapshot pages and copy different lifetime pages into different archive areas.
  • reclamation may create no fragmentation.
  • if the archive system knows at snapshot v+1 creation time that it is shorter-lived than v, it can store the long-lived snapshot pages P_v, S_v and Q_{v+1} in a long-lived archive area and the transient pages P_{v+1} and S_{v+1} in a short-lived area, so that the shorter-lived pages can be reclaimed without fragmentation.
  • This approach, therefore, combines a discrimination policy and a discrimination mechanism.
  • the discrimination policies supported in Thresher are categorized first; the discrimination mechanisms for the different policies are described afterwards.
  • a snapshot discrimination policy may convey to the snapshot storage management system the importance of snapshots so that more important snapshots can have longer lifetimes, or can be stored with faster access. Thresher may support a class of flexible discrimination policies described below using an example.
  • An application may specify a discrimination policy by providing a relative snapshot ranking. Higher-ranked snapshots may be deemed more important. By default, every snapshot may be created with a lowest rank. An application can "bump up" the importance of a snapshot by assigning it a higher rank.
  • a policy may assign the lowest rank to snapshots corresponding to minute by minute vital signs monitor readings, a higher rank to the monitor readings that correspond to hourly nurses' checkups, yet a higher rank to the readings viewed in doctors' rounds. Within a given rank level, more recent snapshots may be considered more important.
  • the discrimination policy may assign longer lifetimes to more important snapshots, defining a 3-level sliding window hierarchy of snapshot lifetimes.
  • a general class of discrimination policies, referred to as "rank-tree", may be defined where rank levels are given integer values 1 through k:
  • RT1: A snapshot ranked at level i, i > 1, corresponds to a snapshot at each lower rank level from 1 to (i - 1).
  • RT2: Ranking a snapshot at a higher rank level increases its lifetime.
  • RT3: Within a rank level, more recent snapshots outlive older snapshots.
  • Figure 4 depicts an embodiment of a 3-level rank-tree policy for the hospital example, where snapshot number 1, ranked at level 3, corresponds to a monitor reading that was sent for inspection to both the nurse and the doctor, but snapshot number 4 was only sent to the nurse.
  • An application can specify a rank-tree policy "eagerly" by providing a snapshot rank at snapshot declaration time, or "lazily", by providing the rank after declaring a snapshot.
  • An application can also ask to store recent snapshots with faster access. In the hospital example above, the importance and the relative lifetimes of the snapshots associated with routine procedure are likely to be known in advance, so the hospital application can specify a snapshot discrimination policy eagerly.
  • the eager "ranked segregation" protocol may provide efficient discrimination for eager rank-tree policies.
  • the protocol may assign a separate archive region to hold the snapshot pages ("volume^i") and snapshot page tables (VPT^i) for snapshots at level i.
  • the protocol may segregate the different lifetime pages and copy them into the corresponding regions. This way, each region may contain pages and page tables with the same lifetime, and temporal reclamation of snapshots (satisfying policy property RT2) within a region does not create disk fragmentation.
  • Figure 3 shows an embodiment of a segregated archive.
  • snapshots ranked at level i may be archived in the same incremental manner as in SNAP and at the same low sequential cost.
  • the cost is low because, by using sufficiently large write buffers (one for each volume), archiving to multiple volumes can be as efficient as strictly sequential archiving into one volume. Since the rank-tree is expected to be quite shallow, the total amount of memory allocated to write buffers is small.
  • the eager ranked segregation works as follows.
  • the declaration of a snapshot v with a rank specified at level k (k > 1) may create a separate incremental snapshot page table, VPT_v^i, for every rank level i (i ≤ k).
  • the incremental page table VPT_v^i may collect the mappings for the pages recorded by snapshot v at level i. Since the incremental tables in VPT^i may map the pages recorded by all the snapshots at level i, the basic snapshot page table reconstruction protocol based on a forward scan through VPT^i, detailed above, can be used in region i to reconstruct snapshot tables for level i snapshots.
  • the recorded pages may contain the pre-state before the first page modification in the snapshot span. Since the span for snapshot v at level i (denoted S_v^i) may include the spans of all the lower level snapshots declared during S_v^i, pages recorded by a level i snapshot v may also be recorded by some of these lower-ranked snapshots. In Figure 4, the span of snapshot 4 ranked at level 2 includes the spans of snapshots (4), 5 and 6 at level 1. Therefore, a page recorded by the snapshot 4 at level 2 may also be recorded by one of the snapshots (4), 5, or 6 at level 1.
  • a page P recorded by snapshots at multiple levels may be archived in the volume of the highest-ranked snapshot that records P.
  • the highest recorder “captures" P. Segregating archived pages this way guarantees that a volume of the shorter-lived snapshots contains no longer-lived pages and therefore temporal reclamation within a volume creates no fragmentation.
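  • The capture rule just described can be sketched as follows (an illustrative model; the function and data shapes are assumptions): the pre-state goes into the volume of the highest-ranked recorder, while every recorder's page table at every level points at that single copy.

```python
# Archive one recorded pre-state under eager ranked segregation: store it
# once, in the highest recorder's volume, and share the mapping across the
# VPTs of all recording snapshots.

def archive_recorded_page(page_id, pre_state, recorders, volumes, vpts):
    """recorders: list of (level, snapshot_id) pairs that record this
       pre-state (per RT1, one per rank level up to the highest).
       volumes:   dict level -> list of archived pages (the volume).
       vpts:      dict (level, snapshot_id) -> {page_id: (level, addr)}."""
    top = max(level for level, _ in recorders)
    addr = len(volumes[top])
    volumes[top].append(pre_state)              # captured by highest level
    for level, snap in recorders:
        vpts[(level, snap)][page_id] = (top, addr)   # shared mapping
```

  • Because a volume of shorter-lived snapshots then never holds a longer-lived page, temporal reclamation within a volume frees contiguous storage, with no fragmentation.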
  • the mappings in a snapshot page table VPT_v^i in area i may point to the pages recorded by snapshot v in whatever area these pages are archived. Snapshot reclamation may need to ensure that the snapshot page table mappings are safe, that is, they do not point to reclaimed pages.
  • the segregation protocol may guarantee the safety of the snapshot page table mappings by enforcing invariants that constrain the intra-level and inter-level reclamation order for snapshot pages and page tables:
  • I1: VPT_v^i and the pages recorded by snapshot v that are captured in volume^i are reclaimed together, in temporal snapshot order.
  • Invariant I1 may ensure that, in each given rank-tree level, the snapshot page table mappings are safe when they point to pages captured in volumes within the same level.
  • Invariant I2 may ensure that the snapshot page table mappings are safe when they point to pages captured in volumes above their level. Note that the rank-tree policy property RT2 only requires that "bumping up" a lower-ranked snapshot v to level k extend its lifetime; it does not constrain the lifetimes of the lower-level snapshots declared in the span of v at level k. Invariant I2 may ensure the safety of the snapshot table mappings for these later lower-level snapshots.
  • Figure 5 depicts an embodiment of the eager segregation protocol for a two-level rank-tree policy.
  • Snapshot v_4, specified at level 2, has a snapshot page table at both level 1 and level 2.
  • the archived page P modified within the span of snapshot v_5 is recorded by snapshot v_5, and also by the level 2 snapshot v_4.
  • This version of P is archived in the volume of the highest recording snapshot (denoted volume^2).
  • the snapshot page tables of both recording snapshots, VPT_{v5}^1 and VPT_{v4}^2, contain this mapping for P.
  • the pre-state of page Q modified within the span of v_6 is also captured in volume^2.
  • P is modified again within the span of snapshot v_6.
  • This later version of P may not be recorded by snapshot v_4 at level 2, since v_4 has already recorded its version of P.
  • This later version of P may be archived in volume^1 and its mapping may be inserted into VPT_{v6}^1.
  • Invariant I1 may guarantee that in VPT_{v6}^1 the mapping for page P in volume^1 is safe.
  • Invariant I2 may guarantee that in VPT_{v6}^1 the mapping for page Q in volume^2 is safe.
  • Some applications may need to defer snapshot ranking until after the snapshot has already been declared, i.e. use a lazy rank-tree policy.
  • snapshot discrimination may be costly because it requires copying.
  • the lazy segregation protocol may provide efficient lazy discrimination by combining two techniques to reduce the cost of copying. First, it may use a more compact diff-based representation for snapshot pages so that there is less to copy. Second, the diff-based representation (as explained below) may include a component that has a page-based snapshot representation. This page-based component may be segregated without copying using the eager segregation protocol.
  • the compact diff-based representation may implement the same abstraction of snapshot pages and snapshot page tables as the page-based snapshot representation. It may be similar to a database redo recovery log, consisting of sequential repetitions of two types of components: checkpoints and diffs.
  • the checkpoints may be incremental page-based snapshots declared periodically by the storage management system.
  • the diffs may be versioned page diffs, consisting of versioned object modifications clustered by page. Since typically only a small fraction of objects in a page may be modified by a transaction, and moreover, many attributes do not change, it is expected that the diffs will be compact.
  • the log repetitions containing the diffs and the checkpoints may be archived sequentially, with diffs and checkpoints written into different archive data structures.
  • the incremental snapshot page tables may collect the archived page mappings for the checkpoint snapshots.
  • a simple page index structure may keep track of page-diffs in each log repetition (the diffs in one log repetition are referred to as "diff extent").
  • the cleaner may sort the diffs in an in-memory buffer, assembling the page-based diffs for the diff extents.
  • the available sorting buffer size may determine the length of diff extents. Since frequent checkpoints decrease the compactness of the diff-based representation, to get better compactness, the cleaner may create several diff extents in a single log repetition. Increasing the number of diff extents may slow down BITE. This trade-off is similar to the recovery log. For brevity, the details of how the diff-based representation is constructed are omitted. The details can be found in "Timebox: A High Performance Archive for Split Snapshots" (Hao Xu, PhD Thesis, Brandeis University, 2005) which is incorporated by reference in its entirety. Below some of the issues related to the diff-based representation compactness that are relevant to the snapshot storage management performance are discussed.
  • the snapshots declared between checkpoints may be reconstructed by first mounting the snapshot page table for the closest (referred to as "base") checkpoint and the corresponding diff-page index. This may allow BITE to access the checkpoint pages, and the corresponding page-diffs.
  • the server may read page P from the checkpoint, and then read in order, the diff-pages for P from all the needed diff extents and apply them to the checkpoint P in order.
  • Figure 6 shows an embodiment of reconstructing a page P in a diff-based snapshot v from a checkpoint page P_0 and diff-pages contained in several diff extents.
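  • The reconstruction step described above can be sketched as below. This is a minimal model; the names and data shapes are assumptions, and selecting which diff extents are "needed" for a given snapshot (via the diff-page index) is elided.

```python
# Reconstruct a page of a diff-based snapshot: start from the base
# checkpoint's copy of the page, then apply the page-diffs from each
# needed diff extent in archive order.

def reconstruct_page(page_id, checkpoint, diff_extents):
    """checkpoint:   dict page_id -> dict oid -> state (base checkpoint).
       diff_extents: ordered list of {page_id: {oid: state}} page-diffs,
                     oldest first, up to the target snapshot's span."""
    page = dict(checkpoint[page_id])
    for extent in diff_extents:            # apply diffs in order
        page.update(extent.get(page_id, {}))
    return page
```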
  • the system may simply read back the archived diff extents, assemble the diff extents for the longer-lived snapshots, create the corresponding long-lived base checkpoints, and archive the retained snapshots sequentially into a longer-lived area. If diffs are compact, the cost of copying may be low.
  • the long-lived base checkpoints may be created without copying by separating the long-lived and short-lived checkpoint pages using eager segregation. Since checkpoints may simply be page-based snapshots declared periodically by the system, the system can derive the ranks for the base checkpoints once the application specifies the snapshot ranks. Knowing ranks at checkpoint declaration time may enable eager segregation.
  • Figure 7 shows an embodiment of eager rank-tree policy for checkpoints in lazy segregation.
  • a representation for level-1 snapshots has the diff extents E_1, E_2 and E_3 (in the archive region G_1^d) associated with the base checkpoints B_1, B_2 and B_3.
  • E_1, E_2 and E_3 may be merged into an extent E (in region G_2^d).
  • This extent E has a base checkpoint B_1.
  • Eventually, extents E_1, E_2 and E_3 may be reclaimed from region G_1^d.
  • the lazy segregation protocol may be optimized for the case where the application specifies snapshot rank within a limited time period after snapshot declaration, which is expected to be the common case. If the limit is exceeded, the system may reclaim shorter-lived base checkpoints by copying out longer-lived pages at a much higher cost. The same approach can also be used if the application needs to change the discrimination policy.
  • the diff-based representation may be more compact but has a slower BITE than the page-based representation.
  • Some applications require lazy discrimination but also need low-latency BITE on a recent window of snapshots, for example to examine the recent snapshots and identify the ones to be retained.
  • the eager segregation protocol may allow efficient composition of diff-based and page-based representations to provide fast BITE on recent snapshots, and lazy snapshot discrimination.
  • the composed representation, called "Hybrid", works as follows. When an application declares a snapshot, Hybrid may create two snapshot representations. A page-based representation may be created in a separate archive region that maintains a sliding window of W recent snapshots, reclaimed temporally. BITE on snapshots within W may run on the fast page-based representation.
  • Hybrid may also create a diff-based representation for the snapshots.
  • BITE on snapshots outside W may run on the slower diff-based representation. Snapshots within W therefore may have two representations (page-based and diff-based).
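The read path described above can be sketched as a simple routing rule: snapshots inside the sliding window W are served from the fast page-based copy, and older snapshots fall back to the diff-based archive. This is a minimal illustrative sketch with hypothetical names, not the actual Thresher code:

```python
class HybridArchive:
    """Toy model of Hybrid's read path and temporal reclamation."""

    def __init__(self, window_size, page_based, diff_based):
        self.W = window_size            # number of recent snapshots kept page-based
        self.latest = 0                 # most recently declared snapshot number
        self.page_based = page_based    # snapshot id -> {page_id: page bytes}
        self.diff_based = diff_based    # fallback reader: (snap_id, page_id) -> bytes

    def declare_snapshot(self, snap_id):
        self.latest = max(self.latest, snap_id)
        # Page-based copies that age out of the window are reclaimed temporally.
        for old in [s for s in self.page_based if s <= self.latest - self.W]:
            del self.page_based[old]

    def bite_read(self, snap_id, page_id):
        if snap_id > self.latest - self.W:           # inside window: fast path
            return self.page_based[snap_id][page_id]
        return self.diff_based(snap_id, page_id)     # outside window: diff path
```

The design point this illustrates is that a snapshot inside W has two representations, so aging out of the window only deletes the redundant fast copy; the diff-based copy remains accessible.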
  • the eager segregation protocol can be used to efficiently compose the two representations and provide efficient reclamation.
  • the system may specify an eager rank-tree policy that ranks the page-based snapshots as lowest-rank (level-0) rank-tree snapshots, but specifies the ones that correspond to the system-declared checkpoints in the diff-based representation as level-1.
  • the checkpoints may be further discriminated by bumping up the rank of the longer-lived checkpoints.
  • the eager segregation protocol may retain the snapshots declared by the system as checkpoints without copying, and may reclaim the aged snapshots in the page-based window W without fragmentation.
  • the cost of checkpoint creation and segregation may be completely absorbed into the cost of creating the page-based snapshots, resulting in lower archiving cost than the simple sum of the two representations.
  • Figure 8 shows an embodiment of reclamation in the Hybrid system that adds faster BITE to the snapshots in Figure 7.
  • the system may create the page-based snapshots Vi, and use them to run fast BITE on recent snapshots. Snapshots V1 and V4 may be used as base checkpoints B1 and B2 for the diff-based representation, and checkpoint B1 may be retained as a longer-lived checkpoint.
  • the system may specify an eager rank-tree policy, ranking snapshots Vi at level-0, and bumping up V1 to level-2 and V4 to level-1. This may allow the eager segregation protocol to create the checkpoints B1 and B2, and eventually reclaim B2, V5 and V6 without copying.
  • the metric λ_DP captures the non-disruptiveness of an I/O-bound storage system. This metric is used to gauge the impact of snapshot discrimination.
  • let r_pour be the "pouring rate", the average object cache (MOB/VMOB) free-space consumption speed due to incoming transaction commits, which insert modified objects.
  • let r_drain be the "draining rate", the average growth rate of object cache free space produced by MOB/VMOB cleaning.
  • λ_DP is defined as: λ_DP = r_drain / r_pour.
  • λ_DP indicates how well the draining keeps up with the pouring. If λ_DP > 1, the system operates within its capacity and foreground transaction performance is not affected by background cleaning activities. If λ_DP < 1, the system is overloaded, transaction commits eventually block on free object cache space, and clients experience commit delay.
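The metric defined above is a simple ratio; the following minimal sketch (function names are illustrative) makes the overload condition explicit:

```python
def lambda_dp(r_drain, r_pour):
    """Non-disruptiveness metric: how well draining keeps up with pouring.
    r_pour: rate at which commits consume object-cache free space.
    r_drain: rate at which MOB/VMOB cleaning produces free space.
    A value > 1 means the system operates within its capacity."""
    return r_drain / r_pour

def is_overloaded(r_drain, r_pour):
    """When lambda_DP < 1, commits eventually block on cache free space."""
    return lambda_dp(r_drain, r_pour) < 1
```

For instance, a draining rate twice the pouring rate gives λ_DP = 2, with headroom for the extra archiving work that snapshot discrimination adds.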
  • let t_clean be the average cleaning time per dirty database page; t_clean determines r_drain. In Thresher, t_clean reflects, in addition to the database ireads and writes, the cost of snapshot creation and snapshot discrimination. Since snapshots are created on a separate disk in parallel with the database cleaning, the cost of snapshot-related activity can be partially hidden behind the cleaning.
  • α is an update workload parameter, defined as the percentage of repeated modifications to the same object or page. α affects both r_pour and r_drain.
  • repeated updates cause less cleaning in the storage system because the object cache (MOB/VMOB) absorbs repeated modifications, but high-frequency snapshots may need to archive most of the repeated modifications. With less cleaning, it may be harder to hide archiving behind cleaning, so snapshots may become more disruptive.
  • workloads with repeated modifications reduce the amount of copying when lazy discrimination copies diffs.
  • r_pour and r_drain are measured experimentally in a system with and without discrimination for a range of workloads with low, medium and high degrees of overwriting, and the resulting λ_DP is analyzed.
  • λ_DP may determine the maximum throughput of an I/O-bound storage system. Measuring the maximum throughput in a system with and without discrimination could provide an end-to-end metric for gauging the impact of discrimination. λ_DP is focused on here because it allows better explanation of the complex dependency between workload parameters and the cost of discrimination.
  • the effectiveness of diff-based representation in reducing copying cost may depend on the "compactness" of the representation.
  • Compactness is characterized by a relative snapshot retention metric R, defined as the size of snapshot state written into the archive for a given snapshot history length H, relative to the size of the snapshot state for H captured in full snapshot pages.
  • R = 1 for the page-based representation.
  • R of the diff-based representation has two contributing components, R_ckp for the checkpoints, and R_diff for the diffs.
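The retention metric defined above can be computed directly from archive sizes. The numbers below are assumed for illustration only (they are not measurements from this text):

```python
def retention(archived_bytes, full_page_history_bytes):
    """Relative retention metric R: bytes written to the archive for a
    history H, relative to the size of H captured as full snapshot pages."""
    return archived_bytes / full_page_history_bytes

# Assumed example: a history H of 1000 full 8KB snapshot pages.
full_pages = 1000 * 8192
ckp_bytes  = 20 * 8192      # checkpoint pages archived for H (assumed)
diff_bytes = 30 * 1024      # diff-pages archived for H (assumed)

R_page = retention(full_pages, full_pages)   # page-based: R = 1
R_ckp  = retention(ckp_bytes, full_pages)    # checkpoint component
R_diff = retention(diff_bytes, full_pages)   # diff component
R      = R_ckp + R_diff                      # diff-based representation total
```

A compact diff-based representation is one where R is much smaller than 1, which is what keeps the copying cost of lazy segregation low.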
  • "Density" (δ), a workload parameter defined as the fraction of the page that gets modified by an update, may determine R_diff.
  • R_ckp may depend on the frequency of checkpoints, determined by L, the number of snapshots declared in the history interval corresponding to one log repetition.
  • increasing L may decrease R_ckp, since checkpoints are page-based snapshots that record the first pre-state for each page modified in the log repetition.
  • increasing L by increasing d, the number of diff extents in a log repetition, may raise the snapshot page reconstruction cost for BITE.
  • Increasing L without increasing d may require additional server memory for the cleaner to sort diffs when assembling diff pages.
  • Diff-based representation may not be compact if transactions modify all the objects in a page. Common update workloads have sparse modifications because most applications modify far fewer objects than they read. The compactness of the diff-based representation is determined by measuring R_diff and R_ckp for workloads with expected medium and low update density.
  • Thresher implements the techniques described on top of SNAP, and also supports recovery during normal operation without the failure recovery procedure. This allows evaluation of system performance in the absence of failures. Comparing the performance of Thresher and SNAP reveals a meaningful snapshot discrimination cost because SNAP is very efficient: even at high snapshot frequencies it has low impact on the storage system.
  • An OO7 transaction includes a read-only traversal (T1), or a read-write traversal (T2a or T2b).
  • the traversals T2a and T2b generate workloads with a fixed amount of object overwriting and density.
  • Extended traversals, summarized below, have been implemented to allow control of these parameters.
  • a variant traversal T2a' is used that extends T2a to update a randomly selected AtomicPart object of a CompositePart, instead of always modifying the same (root) object as in T2a.
  • each T2a' traversal modifies 500 objects.
  • the desired amount of overwriting is achieved by adjusting the object update history in a sequence of T2a' traversals.
  • Workload parameter α controls the amount of overwriting.
  • the experiments use three settings for α, corresponding to low (0.08), medium (0.30) and very high (0.50) degrees of overwriting.
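One way a workload generator might achieve a target overwriting fraction α is to re-modify a previously updated object with probability α, and a fresh object otherwise. This is a hypothetical sketch illustrating the parameter, not the actual T2a' implementation:

```python
import random

def pick_updates(n_updates, n_objects, alpha, seed=0):
    """Return a sequence of object ids in which roughly an alpha fraction
    of updates repeat a previously modified object."""
    rng = random.Random(seed)
    seen, picks = [], []
    for _ in range(n_updates):
        if seen and rng.random() < alpha:
            obj = rng.choice(seen)           # overwrite an earlier update
        else:
            obj = rng.randrange(n_objects)   # modify a fresh object
            seen.append(obj)
        picks.append(obj)
    return picks

def measured_overwriting(picks):
    """Fraction of updates that hit an already-modified object."""
    return 1 - len(set(picks)) / len(picks)
```

With a large object population, the measured overwriting fraction converges to α, so settings like 0.08, 0.30 and 0.50 can be dialed in directly.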
  • T2f modifies 500 objects
  • the density δ corresponds to the average number of modified Atomic-Part objects on a dirty page when the dirty page is written back to the database (on average, a page in OO7 has 27 such objects).
  • T2f modifies a group of Atomic-Part objects around the chosen one. Denote by T2f-g the workload with group of size g.
  • T2f-1 is essentially T2a'.
  • the workload density δ is controlled by specifying the size of the group.
  • since repeated T2f-g traversals update multiple objects on each data page, due to the write-absorption provided by the MOB, T2f-g, like T2a', also controls the overwriting between traversals.
  • the single-client experiments run with snapshot frequency 1, declaring a snapshot after each transaction, in a 3-user OO7 database (185MB in size).
  • the multi-client scalability experiments run with snapshot frequency 10 in a large database (140GB in size).
  • the size of a single private client module is the same in both configurations. All the reported results show the mean of at least three trials with maximum standard deviation at 3%.
  • the storage system server runs on a Linux (kernel 2.4.20) workstation with dual 64-bit Xeon 3GHz CPUs and 1GB RAM.
  • Two Seagate Cheetah disks (model ST3146707LC, 10000 rpm, 4.7ms avg seek, Ultra320 SCSI) directly attach to the server via LSI Fusion MPT adapter.
  • the database and the archive reside on separate raw hard disks.
  • the implementation uses Linux raw devices and direct I/O to bypass file system cache.
  • the client(s) run on workstations with single P3 850Mhz CPU and 512MB of RAM.
  • the clients and server are inter-connected via a 100Mbps switched network.
  • in single-client experiments, the server is configured with 18MB of page cache (10% of the database size) and a 2MB MOB in Thor. In multi-client experiments, the server is configured with 30MB of page cache and 8-11MB of MOB in Thor.
  • the snapshot systems are configured with slightly more memory for VMOB so that the same number of dirty database pages is generated in all snapshot systems, normalizing the r_drain comparison to Thor.
  • the cost of eager discrimination in Thresher includes the cost of creating VPTs for higher-level snapshots.
  • Table 1 shows t_clean in Thresher for a two-level eager rank-tree with inter-level retention fraction f set to one snapshot in 200, 400, 800, and 1600.
  • the t_clean in SNAP is 5.07ms.
  • the small incremental page tables contribute a very small fraction (0.02% to 0.14% ) of the overall archiving cost even for the lowest-level snapshots, rendering eager segregation essentially free of cost. This result is important, because eager segregation is used to reduce the cost of lazy segregation and hybrid representation.
  • a key factor affecting the cleaning costs in the diff-based systems is the compactness of the diff-based representation.
  • a diff-based system configured with a 4MB sorting buffer, with medium overwriting, has a very low R_ckp (0.5%-2%) for the low density workload (R_diff is 0.3%).
  • R_diff is 3.7% for the medium density workload.
  • Table 2 shows the cleaning costs and λ_DP for all four systems for the medium density workload with low, medium, and high overwriting.
  • the t_clean measured in the Lazy and Diff systems includes the database iread and write cost, the CPU cost for processing VMOB, the page archiving and checkpointing cost via parallel I/O, snapshot page table archiving, and the cost of sorting diffs and creating diff extents, but does not directly include the cost of reading and archiving diffs, since this activity is performed asynchronously with cleaning.
  • the measured t_bg reflects these diff-related costs (including I/O on diff extents, and diff page index maintenance) per dirty database page.
  • the t_clean measured for SNAP and Thor includes the (obvious) relevant cost components.
  • compared to Diff, Lazy has a higher t_bg, reflecting the diff copying overhead. This overhead decreases as the overwriting rate increases; t_bg does not drop proportionally to the overwriting increase because the dominant cost component of creating higher-level extents (reading back the extents in the lowest level) is insensitive to the overwriting rate. Lazy pays no checkpoint segregation cost because it uses the eager protocol.
  • the Hybrid system incurs the archiving costs of a page-based snapshot system, plus the costs of diff extent creation and segregation, making it the costliest of the Thresher configurations. Workload density impacts the diff-related costs.
  • Figure 9 shows how the non-disruptiveness λ_DP of Hybrid decreases relative to Thor for workloads with low, medium and high density and a fixed medium overwriting. A denser workload implies more diff I/O. Under the densest possible workload, included here for comparison, the drop of λ_DP of Hybrid relative to Thor is 13.6%, whereas for the common expected medium and low density workloads the drop is 3.2% and 1.8% respectively.
  • BITE in Diff and SNAP is compared to Thor.
  • the experiment creates Diff and SNAP archives by running 16000 medium density, medium overwriting traversals declaring a snapshot after each traversal.
  • the incremental VPT protocol checkpoints VPTs at 4MB intervals to bound reconstruction scan cost.
  • the APT mounting time, which depends on the distance from the VPT checkpoint, follows a seesaw pattern between 21.05ms and 48.77ms.
  • the latency of BITE is determined by the average fetch cost via APT (4.10ms per page).
  • Diff mounts a snapshot by mounting the closest checkpoint, i.e. reconstructing the checkpoint page table (same cost as the VPT in SNAP), and mounting the involved page index structures, with an average page index mounting time of 7.61ms.
  • Diff reads the checkpoint page and the d diff-pages of P.
  • the average cost to fetch a checkpoint page is 5.80ms, and the average cost to fetch a diff-page from one extent is 5.42ms.
  • the cost of constructing the requested page version by applying the diff-pages back to the checkpoint page is negligible.
  • Table 3 shows the average end-to-end BITE cost measured at the client side by running one standard OO7 T1 traversal against Thor, SNAP and Diff respectively. Hybrid has the latency of SNAP for recent snapshots, and the latency of Diff otherwise. The end-to-end BITE latency (page fetch cost) increases over time as pages are archived. Table 3 lists the numbers corresponding to a particular point in system execution history, with the intention of providing a general indication of BITE performance on the different representations compared to the performance of accessing the current database. The performance gap between page-based and diff-based BITE motivates the hybrid representation.
  • the database size is 140GB, which virtually contains over 3000 OO7 modules. Each client accesses its private module (45MB in size) in the database.
  • the private modules of the testing clients are evenly allocated within the space of 140GB.
  • the λ_DP of Thor is 2.85, 1.64 and 1.30 respectively.
  • Figure 10 shows the decrease of λ_DP in Hybrid relative to Thor as load increases. Note that adding more concurrent clients does not cause Hybrid to perform worse. In fact, with an 8-client concurrent workload, Hybrid performs better than with a single-client workload.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Thresher is a new snapshot storage management system, based on a novel snapshot technique that is the first to provide applications with the ability to discriminate efficiently among snapshots. An application can provide a discrimination policy that ranks snapshots. The policy may be specified when snapshots are taken, or later, after the snapshots have been created. Thresher can efficiently disentangle differently-ranked snapshots, allowing useful snapshots to be stored with faster access or to remain accessible longer, while also allowing less useful snapshots to be discarded. These operations can be performed in Thresher without creating disk fragmentation.
PCT/US2008/006366 2007-05-18 2008-05-19 Dispositif et procédé pour un gestionnaire de mémorisation efficace pour des instantanés copie sur écriture WO2008144018A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US92453107P 2007-05-18 2007-05-18
US60/924,531 2007-05-18

Publications (1)

Publication Number Publication Date
WO2008144018A1 true WO2008144018A1 (fr) 2008-11-27

Family

ID=40122068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/006366 WO2008144018A1 (fr) 2007-05-18 2008-05-19 Dispositif et procédé pour un gestionnaire de mémorisation efficace pour des instantanés copie sur écriture

Country Status (1)

Country Link
WO (1) WO2008144018A1 (fr)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5956713A (en) * 1997-04-23 1999-09-21 Oracle Corporation Techniques for reducing the number of snapshots of a database
US20030069880A1 (en) * 2001-09-24 2003-04-10 Ask Jeeves, Inc. Natural language query processing
US20030078918A1 (en) * 2001-10-23 2003-04-24 Souvignier Todd J. Method, apparatus and system for file sharing between computers
US20030159007A1 (en) * 2002-02-15 2003-08-21 International Business Machines Corporation Deferred copy-on-write of a snapshot
US6981114B1 (en) * 2002-10-16 2005-12-27 Veritas Operating Corporation Snapshot reconstruction from an existing snapshot and one or more modification logs


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8527722B1 (en) 2012-04-24 2013-09-03 Hitachi, Ltd. Selecting a snapshot method based on cache memory consumption
WO2013160934A1 (fr) * 2012-04-24 2013-10-31 Hitachi, Ltd. Appareil de stockage, système informatique et procédé de création de capture d'écran
US20220156225A1 (en) * 2017-08-29 2022-05-19 Cohesity, Inc. Snapshot archive management
US11880334B2 (en) * 2017-08-29 2024-01-23 Cohesity, Inc. Snapshot archive management
US11874805B2 (en) 2017-09-07 2024-01-16 Cohesity, Inc. Remotely mounted file system with stubs
US11914485B2 (en) 2017-09-07 2024-02-27 Cohesity, Inc. Restoration of specified content from an archive
US11841824B2 (en) 2020-09-24 2023-12-12 Cohesity, Inc. Incremental access requests for portions of files from a cloud archival storage tier
US11989436B2 (en) 2022-07-18 2024-05-21 Hewlett Packard Enterprise Development Lp Variable extent size
CN115982096A (zh) * 2022-12-09 2023-04-18 北京水脉科技有限公司 一种基于热点文件的实时数据库快照存储方法及系统
CN115982096B (zh) * 2022-12-09 2023-09-08 北京水脉科技有限公司 一种基于热点文件的实时数据库快照存储方法及系统

Similar Documents

Publication Publication Date Title
Kaiyrakhmet et al. SLM-DB: Single-Level Key-Value store with persistent memory
Bonwick et al. The zettabyte file system
Ganger et al. Soft updates: a solution to the metadata update problem in file systems
Graefe The five-minute rule twenty years later, and how flash memory changes the rules
O’Neil et al. The log-structured merge-tree (LSM-tree)
Rosenblum et al. The design and implementation of a log-structured file system
Saxena et al. Flashtier: a lightweight, consistent and durable storage cache
Zhou et al. Second-level buffer cache management
US6785771B2 (en) Method, system, and program for destaging data in cache
Do et al. Turbocharging DBMS buffer pool using SSDs
Rodeh B-trees, shadowing, and clones
Ousterhout et al. Beating the I/O bottleneck: A case for log-structured file systems
Zhou et al. Spitfire: A three-tier buffer manager for volatile and non-volatile memory
WO2008144018A1 (fr) Dispositif et procédé pour un gestionnaire de mémorisation efficace pour des instantanés copie sur écriture
Shirriff Sawmill: A Logging File System for a High-Performance RAID Disk Array
Spillane et al. An efficient multi-tier tablet server storage architecture
Graefe The five-minute rule 20 years later (and how flash memory changes the rules)
Shapiro et al. Design Evolution of the EROS Single-Level Store.
Carniel et al. A generic and efficient framework for flash-aware spatial indexing
Ghemawat The modified object buffer: a storage management technique for object-oriented databases
Shrira et al. Snap: Efficient snapshots for back-in-time execution
Shrira et al. Thresher: An Efficient Storage Manager for Copy-on-write Snapshots.
Hulse et al. A log-structured persistent store
Lomet The case for log structuring in database systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08767793

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08767793

Country of ref document: EP

Kind code of ref document: A1