EP2534570A1 - Method and system for providing efficient access to a tape storage system - Google Patents

Method and system for providing efficient access to a tape storage system

Info

Publication number
EP2534570A1
Authority
EP
European Patent Office
Prior art keywords
storage system
tape
sub
staging
tape storage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
EP20110704385
Other languages
English (en)
French (fr)
Inventor
Rebekah C. Vickrey
Frank C. Dachille
Stefan V. Gheorghita
Yonatan Zunger
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/022,579 (US8341118B2)
Priority claimed from US13/022,236 (US8560292B2)
Priority claimed from US13/023,498 (US8874523B2)
Application filed by Google LLC
Publication of EP2534570A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/065 Replication mechanisms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Definitions

  • the disclosed embodiments relate generally to database replication, and more specifically to replication of data between a distributed storage system and a tape storage system.
  • Tape-based storage systems have proven to be reliable and cost-effective for managing large volumes of data. But because tape is a medium that supports only serial access, it is challenging to integrate tape seamlessly into a data storage system that requires random access. Moreover, compared with other types of storage media such as disk and flash, tape's relatively low throughput is another important factor that limits its adoption by many large-scale data-intensive applications.
  • changes to an individual piece of data are tracked as deltas, and the deltas are transmitted to other instances of the database rather than transmitting the piece of data itself.
  • reading the data includes reading both an underlying value and any subsequent deltas, and thus a client reading the data sees the updated value even if the deltas have not been incorporated into the underlying data value.
  • distribution of the data to other instances takes advantage of the network tree structure to reduce the amount of data transmitted across the long-haul links in the network. For example, data that needs to be transmitted from Los Angeles to both Paris and Frankfurt could be transmitted to Paris, with a subsequent transmission from Paris to Frankfurt.
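
The delta-based reads described above can be illustrated with a minimal sketch. The `Item` class, its field names, and the use of a numeric sequence to order deltas are assumptions made for illustration; the source does not specify how deltas are represented.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical data item: a base value plus deltas not yet merged into it."""
    base: dict
    deltas: list = field(default_factory=list)  # (sequence, changes) pairs

    def add_delta(self, sequence, changes):
        self.deltas.append((sequence, changes))

    def read(self):
        """Return the base value with every pending delta applied in sequence order."""
        value = dict(self.base)
        for _, changes in sorted(self.deltas, key=lambda d: d[0]):
            value.update(changes)
        return value

    def compact(self):
        """Fold the pending deltas into the base value."""
        self.base = self.read()
        self.deltas.clear()


item = Item(base={"owner": "alice", "copies": 1})
item.add_delta(2, {"copies": 3})
item.add_delta(1, {"copies": 2, "tier": "tape"})
current = item.read()  # the reader sees the updated value before compaction
```

Only the small delta records would need to travel between instances in such a scheme; the base value stays where it is until compaction.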
  • a computer-implemented method for asynchronously replicating data onto a tape medium is implemented at one or more server computers, each having one or more processors and memory.
  • the memory stores one or more programs for execution by the one or more processors on each server computer, which is associated with a distributed storage system and connected to a tape storage system.
  • Upon receiving a first request from a client for storing an object within the tape storage system, the server computer stores the object within a staging sub-system of the distributed storage system.
  • the staging sub-system includes a plurality of objects scheduled to be transferred to the tape storage system.
  • the server computer then provides a first response to the requesting client, the first response indicating that the first request will be performed asynchronously. If a predefined condition is met, the server computer transfers one or more objects from the staging sub-system to the tape storage system.
  • For each transferred object, the server computer adds a reference to the object to a tape management sub-system of the tape storage system, identifies a corresponding parent object associated with the object and its metadata within a parent object management sub-system of the distributed storage system, and updates the parent object's metadata to include the object's location within the tape storage system.
  • In some embodiments, upon receipt of the first request, the server computer submits a second request for the object to a source storage sub-system and receives a second response that includes the requested object from the source storage sub-system.
  • the server computer before storing the object within the staging sub-system, queries the tape management sub-system to determine whether there is a replica of the object within the tape storage system. If there is a replica of the object within the tape storage system, the server computer adds a reference to the replica of the object to the tape management sub-system.
  • the tape management sub-system of the distributed storage system includes a staging object index table and an external object index table.
  • An entry in the staging object index table identifies an object that has been scheduled to be transferred to the tape storage system and an entry in the external object index table identifies an object that has been transferred to the tape storage system.
  • the server computer adds to the staging object index table an entry that corresponds to the object.
  • For each newly-transferred object, the server computer removes from the staging object index table an entry that corresponds to the newly-transferred object and adds to the external object index table an entry that corresponds to the newly-transferred object.
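
The hand-off between the two index tables can be sketched as follows. The `TapeManagementIndex` class, the in-memory dictionaries, the location strings, and the choice to count a reference at scheduling time are all assumptions for illustration, not the patent's data layout.

```python
class TapeManagementIndex:
    """Hypothetical in-memory stand-in for the staging and external index tables."""

    def __init__(self):
        self.staging = {}     # object id -> location in the staging region
        self.external = {}    # object id -> location on tape
        self.references = {}  # object id -> number of references held

    def schedule(self, object_id, staging_location):
        """Return True if the object must be staged, False if a tape replica exists."""
        self.references[object_id] = self.references.get(object_id, 0) + 1
        if object_id in self.external:
            return False  # replica already on tape: only a reference was added
        self.staging[object_id] = staging_location
        return True

    def mark_transferred(self, object_id, tape_location):
        """Move the entry from the staging table to the external table."""
        del self.staging[object_id]
        self.external[object_id] = tape_location


index = TapeManagementIndex()
first = index.schedule("chunk-1", "staging/chunk-1")
index.mark_transferred("chunk-1", "tape-07:block-42")
second = index.schedule("chunk-1", "staging/chunk-1-again")
```

The second request for the same object finds the tape replica and skips staging, which is the deduplication behavior the text describes.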
  • the staging sub-system includes one or more batches of object transfer entries and an object data staging region.
  • the server computer stores the object to be transferred within the object data staging region, identifies a respective batch in accordance with a locality hint provided with the first request, the locality hint identifying a group of objects that are likely to be collectively restored from the tape storage system or expire from the tape storage system, inserts an object transfer entry into the identified batch, the object transfer entry identifying a location of the object within the object data staging region, and updates a total size of the identified batch in accordance with the newly-inserted object transfer entry and the object to be transferred.
  • each object to be transferred includes content and metadata.
  • the server computer writes the object's content into a first file in the object data staging region and the object's metadata into a second file or a bigtable in the object data staging region.
  • the first response is provided to the requesting client before the object is transferred to the tape storage system.
  • the staging sub-system of the distributed storage system includes one or more batches of object transfer entries and an object data staging region.
  • the server computer periodically scans the one or more batches to determine their respective states and identifies a respective batch of object transfer entries if a total size of the batch reaches a predefined threshold or the batch has been opened for at least a predefined time period.
  • the server computer then closes the identified batch so that it accepts no further object transfer entries and submits an object transfer request to the tape storage system for the identified batch.
  • the server computer retrieves the corresponding object from the object data staging region and transfers the object to the tape storage system.
  • using separate batches for tape backup or restore makes it possible to prioritize the jobs submitted by different users or for different purposes.
  • the server computer deletes the identified batch from the staging sub-system after the last object transfer entry within the identified batch is processed.
  • the closure of the identified batch triggers a creation of a new batch for incoming object transfer entries in the staging sub-system.
  • the server computer deletes the object transfer entry and the corresponding object from the identified batch and the object data staging region, respectively.
  • the server computer sets the parent object's state as "finalized" if the object is the last object of the parent object to be transferred to the tape storage system and sets the parent object's state as "finalizing" if the object is not the last object of the parent object to be transferred to the tape storage system.
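
The batching behavior described above (grouping by locality hint, then closing a batch once it is large enough or old enough) can be sketched as below. `StagingArea`, `Batch`, the byte and age thresholds, and the explicit `now` parameter are hypothetical names and values chosen for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Batch:
    locality_hint: str
    opened_at: float
    entries: list = field(default_factory=list)  # (object id, staging location, size)
    total_size: int = 0
    closed: bool = False


class StagingArea:
    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age_seconds = max_age_seconds
        self.open_batches = {}  # locality hint -> open Batch

    def add(self, object_id, location, size, locality_hint, now=None):
        """Insert an object transfer entry into the open batch for its locality hint."""
        now = time.time() if now is None else now
        batch = self.open_batches.get(locality_hint)
        if batch is None:
            batch = self.open_batches[locality_hint] = Batch(locality_hint, now)
        batch.entries.append((object_id, location, size))
        batch.total_size += size

    def scan(self, now=None):
        """Close and return every batch that is full or has been open too long."""
        now = time.time() if now is None else now
        ready = []
        for hint, batch in list(self.open_batches.items()):
            too_big = batch.total_size >= self.max_bytes
            too_old = now - batch.opened_at >= self.max_age_seconds
            if too_big or too_old:
                batch.closed = True
                del self.open_batches[hint]  # the next add() for this hint opens a new batch
                ready.append(batch)
        return ready


staging = StagingArea(max_bytes=100, max_age_seconds=60)
staging.add("a", "staging/a", 70, "project-x", now=0)
staging.add("b", "staging/b", 40, "project-x", now=1)
staging.add("c", "staging/c", 10, "project-y", now=2)
ready_now = staging.scan(now=5)     # project-x crossed the size threshold
ready_later = staging.scan(now=70)  # project-y crossed the age threshold
```

Each closed batch would then be submitted to the tape storage system as a single transfer request, so objects that share a locality hint land together on tape.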
  • a computer-implemented method for asynchronously replicating data from a tape medium is implemented at one or more server computers, each having one or more processors and memory.
  • the memory stores one or more programs for execution by the one or more processors on each server computer, which is associated with a distributed storage system and connected to a tape storage system.
  • Upon receiving a first request from a client for restoring an object from the tape storage system to a destination storage sub-system of the distributed storage system, the server computer generates an object restore entry that identifies the object to be restored and the destination storage sub-system and stores the object restore entry within a staging sub-system of the distributed storage system.
  • the staging sub-system includes a plurality of object restore entries scheduled to be applied to the tape storage system.
  • the server computer then provides a first response to the requesting client, indicating that the first request will be performed asynchronously, and applies one or more object restore entries within the staging sub-system to the tape storage system if a predefined condition is met.
  • For each restored object, the server computer transfers the object to the destination storage sub-system, identifies a corresponding parent object associated with the object and its metadata within a parent object management sub-system of the distributed storage system, and updates the parent object's metadata to identify the object's location within the destination storage sub-system.
  • the staging sub-system includes one or more batches of object restore entries and an object data staging region.
  • the server computer identifies a respective batch in accordance with a restore priority provided with the first request, inserts the object restore entry into the identified batch, and updates a total size of the identified batch in accordance with the newly-inserted object restore entry.
  • the first response is provided to the requesting client before the object is restored from the tape storage system.
  • the server computer periodically scans the one or more batches to determine their respective states and identifies a respective batch of object restore entries if a total size of the batch reaches a predefined threshold or the batch has been opened for at least a predefined time period.
  • the server computer closes the identified batch so that it accepts no further object restore entries and submits an object restore request to the tape storage system for the closed batch.
  • the server computer retrieves the object's content and metadata from the tape storage system and then writes the object's content into a first file in the object data staging region and the object's metadata into a second file in the object data staging region.
  • the closure of the identified batch triggers a creation of a new batch for incoming object restore entries in the staging sub-system.
  • the server computer associates the destination storage sub-system in the object restore entry with the object in the object data staging region, sends a request to an object management sub-system, the request identifying the destination storage sub-system and including a copy of the object in the object data staging region, and deletes the object restore entry and the corresponding object from the identified batch and the object data staging region, respectively.
  • the server computer sets the parent object's state as "finalized" if the object is the last object of the parent object to be restored from the tape storage system and sets the parent object's state as "finalizing" if the object is not the last object of the parent object to be restored from the tape storage system.
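
The parent-object bookkeeping described for restores (and mirrored in the backup path) can be sketched as a small state tracker. The `ParentObject` class, the initial "pending" state, and the location strings are assumptions for illustration; only the "finalizing" and "finalized" states come from the text.

```python
class ParentObject:
    """Hypothetical parent object that tracks the restore of its constituent objects."""

    def __init__(self, object_ids):
        self.pending = set(object_ids)  # objects still waiting to be restored
        self.locations = {}             # object id -> location in the destination store
        self.state = "pending"          # assumed initial state, not named in the source

    def record_restored(self, object_id, location):
        """Record a restored object's location and update the parent's state."""
        self.pending.discard(object_id)
        self.locations[object_id] = location
        self.state = "finalized" if not self.pending else "finalizing"
        return self.state


parent = ParentObject(["c1", "c2"])
states = [
    parent.record_restored("c1", "filestore/c1"),
    parent.record_restored("c2", "filestore/c2"),
]
```

A client polling the parent object's metadata could treat "finalized" as the signal that every piece is readable from the destination store.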
  • Figure 1A is a conceptual illustration for placing multiple instances of a database at physical sites all over the globe according to some embodiments.
  • Figure 1B illustrates basic functionality at each instance according to some embodiments.
  • Figure 2 is a block diagram illustrating multiple instances of a replicated database, with an exemplary set of programs and/or processes shown for the first instance according to some embodiments.
  • Figure 3 is a block diagram that illustrates an exemplary instance for the system, and illustrates what blocks within the instance a user interacts with according to some embodiments.
  • Figure 4 is a block diagram of an instance server that may be used for the various programs and processes illustrated in Figures 1B, 2, and 3, according to some embodiments.
  • Figure 5 illustrates a typical allocation of instance servers to various programs or processes illustrated in Figures 1B, 2, and 3, according to some embodiments.
  • Figure 6 illustrates how metadata is stored according to some embodiments.
  • Figure 7 illustrates a data structure that is used to store deltas according to some embodiments.
  • Figures 8A - 8E illustrate data structures used to store metadata according to some embodiments.
  • Figures 9A - 9F illustrate block diagrams and data structures used for replicating data between a planetary-scale distributed storage system and a tape storage system according to some embodiments.
  • Figures 10A - 10D illustrate flow charts of computer-implemented methods used for replicating data between a planetary-scale distributed storage system and a tape storage system according to some embodiments.
  • the present specification describes a distributed storage system.
  • the distributed storage system is implemented on a global or planet-scale.
  • an instance (such as instance 102-1) corresponds to a data center.
  • multiple instances are physically located at the same data center.
  • Although the conceptual diagram of Figure 1 shows a limited number of network communication links 104-1, etc., typical embodiments would have many more network communication links.
  • each network communication link has a specified bandwidth and/or a specified cost for the use of that bandwidth.
  • statistics are maintained about the transfer of data across one or more of the network communication links, including throughput rate, times of availability, reliability of the links, etc.
  • Each instance typically has data stores and associated databases (as shown in Figures 2 and 3), and utilizes a farm of server computers ("instance servers," see Figure 4) to perform all of the tasks.
  • Limited functionality instances may or may not have any of the data stores depicted in Figures 3 and 4.
  • Figure 1B illustrates data and programs at an instance 102-i that store and replicate data between instances.
  • the underlying data items 122-1, 122-2, etc. are stored and managed by one or more database units 120.
  • Each instance 102-i has a replication unit 124 that replicates data to and from other instances.
  • the replication unit 124 also manages one or more egress maps 134 that track data sent to and acknowledged by other instances.
  • the replication unit 124 manages one or more ingress maps, which track data received at the instance from other instances.
  • Each instance 102-i has one or more clock servers 126 that provide accurate time.
  • the clock servers 126 provide time as the number of microseconds past a well-defined point in the past.
  • the clock servers provide time readings that are guaranteed to be monotonically increasing.
  • each instance server 102-i stores an instance identifier 128 that uniquely identifies itself within the distributed storage system.
  • the instance identifier may be saved in any convenient format, such as a 32-bit integer, a 64-bit integer, or a fixed length character string.
  • the instance identifier is incorporated (directly or indirectly) into other unique identifiers generated at the instance.
  • an instance 102-i stores a row identifier seed 130, which is used when new data items 122 are inserted into the database.
  • a row identifier is used to uniquely identify each data item 122.
  • the row identifier seed is used to create a row identifier, and simultaneously incremented, so that the next row identifier will be greater.
  • unique row identifiers are created from a timestamp provided by the clock servers 126, without the use of a row identifier seed.
  • a tie breaker value 132 is used when generating row identifiers or unique identifiers for data changes (described below with respect to Figures 6 - 7).
  • a tie breaker 132 is stored permanently in non-volatile memory (such as a magnetic or optical disk).
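
The two ways of generating row identifiers described above can be sketched as follows. The `RowIdAllocator` class and the tuple layouts are assumptions: the source says that the instance identifier and a tie breaker are incorporated into generated identifiers but does not say how they are combined.

```python
class RowIdAllocator:
    """Hypothetical allocator that builds row identifiers as comparable tuples."""

    def __init__(self, instance_id, seed=0, tie_breaker=0):
        self.instance_id = instance_id
        self.seed = seed
        self.tie_breaker = tie_breaker

    def from_seed(self):
        """Create an identifier from the seed and increment the seed in the same step."""
        row_id = (self.seed, self.instance_id)
        self.seed += 1
        return row_id

    def from_timestamp(self, timestamp_us):
        """Create an identifier from a clock-server timestamp (microseconds)."""
        return (timestamp_us, self.tie_breaker, self.instance_id)


alloc = RowIdAllocator(instance_id=7, seed=100, tie_breaker=3)
a = alloc.from_seed()
b = alloc.from_seed()
t = alloc.from_timestamp(1_000_000)
```

Including the instance identifier keeps identifiers from different instances distinct even when their seeds or timestamps coincide.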
  • The elements described in Figure 1B are incorporated in embodiments of the distributed storage system 200 illustrated in Figures 2 and 3.
  • the functionality described in Figure 1B is included in a blobmaster 204 and metadata store 206.
  • the primary data storage (i.e., blobs) is kept in the data stores, while the metadata for the blobs is in the metadata store 206 and is managed by the blobmaster 204.
  • the metadata corresponds to the functionality identified in Figure 1B.
  • Although the metadata for storage of blobs provides an exemplary embodiment of the present invention, one of ordinary skill in the art would recognize that the present invention is not limited to this embodiment.
  • the distributed storage system 200 shown in Figures 2 and 3 includes certain global applications and configuration information 202, as well as a plurality of instances 102-1, ... 102-N.
  • the global configuration information includes a list of instances and information about each instance.
  • the information for each instance includes: the set of storage nodes (data stores) at the instance; the state information, which in some embodiments includes whether the metadata at the instance is global or local; and network addresses to reach the blobmaster 204 and bitpusher 210 at the instance.
  • the global configuration information 202 resides at a single physical location, and that information is retrieved as needed. In other embodiments, copies of the global configuration information 202 are stored at multiple locations.
  • copies of the global configuration information 202 are stored at some or all of the instances.
  • the global configuration information can only be modified at a single location, and changes are transferred to other locations by one-way replication.
  • there are certain global applications such as the location assignment daemon 346 (see Figure 3) that can only run at one location at any given time.
  • the global applications run at a selected instance, but in other embodiments, one or more of the global applications runs on a set of servers distinct from the instances.
  • the location where a global application is running is specified as part of the global configuration information 202, and is subject to change over time.
  • Figures 2 and 3 illustrate an exemplary set of programs, processes, and data that run or exist at each instance, as well as a user system that may access the distributed storage system 200 and some global applications and configuration.
  • a user 302 interacts with a user system 304, which may be a computer or other device that can run a web browser 306.
  • a user application 308 runs in the web browser, and uses functionality provided by database client 310 to access data stored in the distributed storage system 200 using network 328.
  • Network 328 may be the Internet, a local area network (LAN), a wide area network (WAN), a wireless network (WiFi), a local intranet, or any combination of these.
  • a load balancer 314 distributes the workload among the instances, so multiple requests issued by a single client 310 need not all go to the same instance.
  • database client 310 uses information in a global configuration store 312 to identify an appropriate instance for a request. The client uses information from the global configuration store 312 to find the set of blobmasters 204 and bitpushers 210 that are available, and where to contact them.
  • a blobmaster 204 uses a global configuration store 312 to identify the set of peers for all of the replication processes.
  • a bitpusher 210 uses information in a global configuration store 312 to track which stores it is responsible for.
  • user application 308 runs on the user system 304 without a web browser 306. Exemplary user applications are an email application and an online video application.
  • each instance has a blobmaster 204, which is a program that acts as an external interface to the metadata table 206.
  • an external user application 308 can request metadata corresponding to a specified blob using client 310.
  • a "blob" (i.e., a binary large object) is a collection of binary data (e.g., images, videos, binary files, executable code, etc.).
  • every instance 102 has metadata in its metadata table 206 corresponding to every blob stored anywhere in the distributed storage system 200.
  • the instances come in two varieties: those with global metadata (for every blob in the distributed storage system 200) and those with only local metadata (only for blobs that are stored at the instance).
  • blobs typically reside at only a small subset of the instances.
  • the metadata table 206 includes information relevant to each of the blobs, such as which instances have copies of a blob, who has access to a blob, and what type of data store is used at each instance to store a blob.
  • the exemplary data structures in Figures 8A - 8E illustrate other metadata that is stored in metadata table 206 in some embodiments.
  • the blobmaster 204 provides one or more read tokens to the client 310, which the client 310 provides to a bitpusher 210 in order to gain access to the relevant blob.
  • when the client 310 writes data, the client 310 writes to a bitpusher 210.
  • the bitpusher 210 returns write tokens indicating that data has been stored, which the client 310 then provides to the blobmaster 204, in order to attach that data to a blob.
  • a client 310 communicates with a bitpusher 210 over network 328, which may be the same network used to communicate with the blobmaster 204.
  • communication between the client 310 and bitpushers 210 is routed according to a load balancer 314. Because of load balancing or other factors, communication with a blobmaster 204 at one instance may be followed by communication with a bitpusher 210 at a different instance.
  • the first instance may be a global instance with metadata for all of the blobs, but may not have a copy of the desired blob.
  • the metadata for the blob identifies which instances have copies of the desired blob, so in this example the subsequent communication with a bitpusher 210 to read or write is at a different instance.
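
The token exchange between client, bitpusher, and blobmaster can be sketched as below. The `Bitpusher` and `Blobmaster` classes, the random hex tokens, and the simplification that a read token is the same value as the write token are assumptions for illustration; the source does not describe token contents.

```python
import secrets


class Bitpusher:
    """Hypothetical bitpusher: stores opaque chunks and never interprets them."""

    def __init__(self):
        self.chunks = {}  # token -> chunk bytes

    def write(self, data):
        token = secrets.token_hex(8)  # write token proving the data was stored
        self.chunks[token] = data
        return token

    def read(self, token):
        return self.chunks[token]


class Blobmaster:
    """Hypothetical blobmaster: holds only metadata that maps blobs to tokens."""

    def __init__(self):
        self.metadata = {}  # blob id -> ordered list of tokens

    def attach(self, blob_id, write_token):
        self.metadata.setdefault(blob_id, []).append(write_token)

    def read_tokens(self, blob_id):
        return list(self.metadata[blob_id])


bitpusher = Bitpusher()
blobmaster = Blobmaster()

# Write path: the client writes to the bitpusher, then hands the token to the blobmaster.
for chunk in (b"hello ", b"world"):
    blobmaster.attach("blob-1", bitpusher.write(chunk))

# Read path: the client gets tokens from the blobmaster, then reads from the bitpusher.
data = b"".join(bitpusher.read(t) for t in blobmaster.read_tokens("blob-1"))
```

Keeping the bulk data path (bitpusher) separate from the metadata path (blobmaster) is what lets the two run at different instances, as the text notes.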
  • a bitpusher 210 copies data to and from data stores.
  • the read and write operations comprise entire blobs.
  • each blob comprises one or more chunks, and the read and write operations performed by a bitpusher are solely on chunks.
  • a bitpusher deals only with chunks, and has no knowledge of blobs.
  • a bitpusher has no knowledge of the contents of the data that is read or written, and does not attempt to interpret the contents.
  • Embodiments of a bitpusher 210 support one or more types of data store.
  • a bitpusher supports a plurality of data store types, including inline data stores 212, BigTable stores 214, file server stores 216, and tape stores 218. Some embodiments support additional other stores 220, or are designed to accommodate other types of data stores as they become available or technologically feasible.
  • Inline stores 212 actually use storage space 208 in the metadata store 206.
  • Inline stores provide faster access to the data, but have limited capacity, so inline stores are generally for relatively "small” blobs.
  • inline stores are limited to blobs that are stored as a single chunk.
  • "small” means blobs that are less than 32 kilobytes. In some embodiments, "small” means blobs that are less than 1 megabyte. As storage technology facilitates greater storage capacity, even blobs that are currently considered large may be "relatively small” compared to other blobs.
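
The inline-store rule can be expressed as a small predicate. The function name `choose_store`, the "external" label for every non-inline store, and the default limit are assumptions for illustration; the 32 kilobyte and 1 megabyte limits are the two examples given in the text.

```python
INLINE_LIMIT_BYTES = 32 * 1024  # one of the "small" thresholds mentioned in the text


def choose_store(blob_size, chunk_count, inline_limit=INLINE_LIMIT_BYTES):
    """Return "inline" for small single-chunk blobs, otherwise "external"."""
    if chunk_count == 1 and blob_size < inline_limit:
        return "inline"
    return "external"


small = choose_store(10_000, 1)
at_limit = choose_store(32 * 1024, 1)
multi_chunk = choose_store(10_000, 2)
larger_limit = choose_store(500_000, 1, inline_limit=1024 * 1024)
```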
  • BigTable stores 214 store data in BigTables located on one or more BigTable database servers 316. BigTables are described in several publicly available publications, including “Bigtable: A Distributed Storage System for Structured Data,” Fay Chang et al, OSDI 2006, which is incorporated herein by reference in its entirety. In some embodiments, the BigTable stores save data on a large array of servers 316.
  • File stores 216 store data on one or more file servers 318. In some embodiments, the file servers use file systems provided by computer operating systems, such as UNIX.
  • the file servers 318 implement a proprietary file system, such as the Google File System (GFS).
  • GFS is described in multiple publicly available publications, including "The Google File System,” Sanjay Ghemawat et al, SOSP'03, October 19-22, 2003, which is incorporated herein by reference in its entirety.
  • the file servers 318 implement NFS (Network File System) or other publicly available file systems not implemented by a computer operating system.
  • the file system is distributed across many individual servers 318 to reduce risk of loss or unavailability of any individual computer.
  • Tape stores 218 store data on physical tapes 320. Unlike a tape backup, the tapes here are another form of storage. This is described in greater detail in co-pending U.S. Provisional Patent Application Serial No. 61/302,909, "Method and System for Providing Efficient Access to a Tape Storage System," filed February 9, 2010, which is incorporated herein by reference in its entirety.
  • a Tape Master application 222 assists in reading and writing from tape.
  • there are two types of tape: those that are physically loaded in a tape device, so that the tapes can be robotically loaded; and those that are physically located in a vault or other offline location, and require human action to mount the tapes on a tape device.
  • the tapes in the latter category are referred to as deep storage or archived.
  • a large read/write buffer is used to manage reading and writing data to tape. In some embodiments, this buffer is managed by the tape master application 222. In some embodiments there are separate read buffers and write buffers.
  • a client 310 cannot directly read or write to a copy of data that is stored on tape. In these embodiments, a client must read a copy of the data from an alternative data source, even if the data must be transmitted over a greater distance.
  • bitpushers 210 are designed to accommodate additional storage technologies as they become available.
  • Each of the data store types has specific characteristics that make them useful for certain purposes. For example, inline stores provide fast access, but use up more expensive limited space. As another example, tape storage is very inexpensive, and provides secure long-term storage, but a client cannot directly read or write to tape.
  • data is automatically stored in specific data store types based on matching the characteristics of the data to the characteristics of the data stores.
  • users 302 who create files may specify the type of data store to use.
  • the type of data store to use is determined by the user application 308 that creates the blobs of data. In some embodiments, a combination of the above selection criteria is used.
  • each blob is assigned to a storage policy 326, and the storage policy specifies storage properties.
  • a blob policy 326 may specify the number of copies of the blob to save, in what types of data stores the blob should be saved, locations where the copies should be saved, etc. For example, a policy may specify that there should be two copies on disk (Big Table stores or File Stores), one copy on tape, and all three copies at distinct metro locations.
  • blob policies 326 are stored as part of the global configuration and applications 202.
  • each instance 102 has a quorum clock server 228, which comprises one or more servers with internal clocks.
  • the order of events, including metadata deltas 608, is important, so maintenance of a consistent time clock is important.
  • a quorum clock server regularly polls a plurality of independent clocks, and determines if they are reasonably consistent. If the clocks become inconsistent and it is unclear how to resolve the inconsistency, human intervention may be required.
  • the resolution of an inconsistency may depend on the number of clocks used for the quorum and the nature of the inconsistency. For example, if there are five clocks, and only one is inconsistent with the other four, then the consensus of the four is almost certainly right. However, if each of the five clocks has a time that differs significantly from the others, there would be no clear resolution.
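  • The consistency check described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `quorum_time`, the use of the median as the reference point, and the `tolerance` parameter are not taken from the specification, which does not say how "reasonably consistent" is decided.

```python
from statistics import median
from typing import Optional, Sequence


def quorum_time(readings: Sequence[float], tolerance: float) -> Optional[float]:
    """Return a consensus time if a strict majority of clocks agree.

    Hypothetical rule (an assumption, not the patented method): a clock
    "agrees" when it is within `tolerance` seconds of the median reading.
    Returns None when no majority exists, i.e. human intervention is needed.
    """
    if not readings:
        return None
    reference = median(readings)
    agreeing = [t for t in readings if abs(t - reference) <= tolerance]
    if len(agreeing) * 2 > len(readings):
        # Average only the clocks that form the majority.
        return sum(agreeing) / len(agreeing)
    return None
```

  • With five clocks of which one is far off, the four consistent clocks determine the result; with five clocks that all differ significantly, the function returns `None`, mirroring the "no clear resolution" case.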
  • each instance has a replication module 224, which identifies blobs or chunks that will be replicated to other instances.
  • the replication module 224 may use one or more queues 226-1, 226-2, ... Items to be replicated are placed in a queue 226, and the items are replicated when resources are available.
  • items in a replication queue 226 have assigned priorities, and the highest priority items are replicated as bandwidth becomes available. There are multiple ways that items can be added to a replication queue 226. In some embodiments, items are added to replication queues 226 when blob or chunk data is created or modified.
  • replication items based on blob content changes have a relatively high priority.
  • items are added to the replication queues 226 based on a current user request for a blob that is located at a distant instance. For example, if a user in California requests a blob that exists only at an instance in India, an item may be inserted into a replication queue 226 to copy the blob from the instance in India to a local instance in California.
  • a background replication process creates and deletes copies of blobs based on blob policies 326 and blob access data provided by a statistics server 324.
  • the blob policies specify how many copies of a blob are desired, where the copies should reside, and in what types of data stores the data should be saved.
  • a policy may specify additional properties, such as the number of generations of a blob to save, or time frames for saving different numbers of copies. E.g., save three copies for the first 30 days after creation, then two copies thereafter.
  • a location assignment daemon 322 determines where to create new copies of a blob and what copies may be deleted. When new copies are to be created, records are inserted into a replication queue 226, with the lowest priority.
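  • A prioritized replication queue of the kind described above might look like the following sketch. The class and field names, and the convention that a smaller number means a higher priority, are assumptions made for illustration; the specification only states that items have priorities and that background items receive the lowest one.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical priority levels (smaller value = replicated sooner).
PRIORITY_CONTENT_CHANGE = 0   # blob or chunk data created or modified
PRIORITY_USER_REQUEST = 1     # a user asked for a blob held at a distant instance
PRIORITY_BACKGROUND = 9       # inserted by the location assignment daemon


@dataclass(order=True)
class ReplicationItem:
    priority: int
    seq: int                                  # insertion order, breaks ties
    blob_id: str = field(compare=False)
    source: str = field(compare=False)
    destination: str = field(compare=False)


class ReplicationQueue:
    def __init__(self) -> None:
        self._heap: List[ReplicationItem] = []
        self._counter = itertools.count()

    def add(self, blob_id: str, source: str, destination: str, priority: int) -> None:
        item = ReplicationItem(priority, next(self._counter), blob_id, source, destination)
        heapq.heappush(self._heap, item)

    def pop_next(self) -> Optional[ReplicationItem]:
        """Return the highest-priority item, or None if the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None
```

  • In this sketch, a background copy queued before a user-driven request is still replicated after it, because the heap orders items by priority first and by insertion order second.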
  • blob policies 326 and the operation of a location assignment daemon 322 are described in more detail in co-pending U.S. Provisional Patent Application Serial No. 61/302,936, "System and Method for managing Replicas of Objects in a Distributed Storage System," filed February 9, 2010, which is incorporated herein by reference in its entirety.
  • Figure 4 is a block diagram illustrating an Instance Server 400 used for operations identified in Figures 2 and 3 in accordance with some embodiments of the present invention.
  • An Instance Server 400 typically includes one or more processing units (CPU's) 402 for executing modules, programs and/or instructions stored in memory 414 and thereby performing processing operations; one or more network or other communications interfaces 404; memory 414; and one or more communication buses 412 for interconnecting these components.
  • an Instance Server 400 includes a user interface 406 comprising a display device 408 and one or more input devices 410.
  • memory 414 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM or other random access solid state memory devices.
  • memory 414 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some embodiments, memory 414 includes one or more storage devices remotely located from the CPU(s) 402. Memory 414, or alternately the non-volatile memory device(s) within memory 414, comprises a computer readable storage medium. In some embodiments, memory 414 or the computer readable storage medium of memory 414 stores the following programs, modules and data structures, or a subset thereof:
  • an operating system 416 that includes procedures for handling various basic system services and for performing hardware dependent tasks
  • a communications module 418 that is used for connecting an Instance Server 400 to other Instance Servers or computers via the one or more communication network interfaces 404 (wired or wireless) and one or more communication networks 328, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;
  • server applications 420 such as a blobmaster 204 that provides an
  • a bitpusher 210 that provides access to read and write data from data stores; a replication module 224 that copies data from one instance to another; a quorum clock server 228 that provides a stable clock; a location assignment daemon 322 that determines where copies of a blob should be located; and other server functionality as illustrated in Figures 2 and 3.
  • two or more server applications 422 and 424 may execute on the same physical computer;
  • the databases 428 may provide storage for metadata 206, replication queues 226, blob policies 326, global configuration 312, the statistics used by statistics server 324, as well as ancillary databases used by any of the other functionality.
  • Each database 428 has one or more tables with data records 430.
  • some databases include aggregate tables 432, such as the statistics used by statistics server 324; and • one or more file servers 434 that provide access to read and write files, such as file #1 (436) and file #2 (438).
  • File server functionality may be provided directly by an operating system (e.g., UNIX or Linux), or by a software application, such as the Google File System (GFS).
  • Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above.
  • the above identified modules or programs i.e., sets of instructions
  • memory 414 may store a subset of the modules and data structures identified above.
  • memory 414 may store additional modules or data structures not described above.
  • Figure 4 shows an instance server used for performing various operations or storing data as illustrated in Figures 2 and 3
  • Figure 4 is intended more as functional description of the various features which may be present in a set of one or more computers rather than as a structural schematic of the embodiments described herein.
  • items shown separately could be combined and some items could be separated.
  • some items shown separately in Figure 4 could be implemented on individual computer systems and single items could be implemented by one or more computer systems.
  • the actual number of computers used to implement each of the operations, databases, or file storage systems, and how features are allocated among them will vary from one implementation to another, and may depend in part on the amount of data at each instance, the amount of data traffic that an instance must handle during peak usage periods, as well as the amount of data traffic that an instance must handle during average usage periods.
  • each program or process that runs at an instance is generally distributed among multiple computers.
  • the number of instance servers 400 assigned to each of the programs or processes can vary, and depends on the workload.
  • Figure 5 provides exemplary information about a typical number of instance servers 400 that are assigned to each of the functions.
  • each instance has about 10 instance servers performing (502) as blobmasters.
  • each instance has about 100 instance servers performing (504) as bitpushers.
  • each instance has about 50 instance servers performing (506) as BigTable servers.
  • each instance has about 1000 instance servers performing (508) as file system servers.
  • File system servers store data for file system stores 216 as well as the underlying storage medium for BigTable stores 214.
  • each instance has about 10 instance servers performing (510) as tape servers. In some embodiments, each instance has about 5 instance servers performing (512) as tape masters. In some embodiments, each instance has about 10 instance servers performing (514) replication management, which includes both dynamic and background replication. In some embodiments, each instance has about 5 instance servers performing (516) as quorum clock servers.
  • Figure 6 illustrates the storage of metadata data items 600 according to some embodiments.
  • Each data item 600 has a unique row identifier 602.
  • Each data item 600 is a row 604 that has a base value 606 and zero or more deltas 608-1, 608-2, ..., 608-L.
  • the value of the data item 600 is the base value 606.
  • the "value" of the data item 600 is computed by starting with the base value 606 and applying the deltas 608-1, etc. in order to the base value.
  • a row thus has a single value, representing a single data item or entry.
  • the deltas store the entire new value. In some embodiments, the deltas store as little data as possible to identify the change.
  • Metadata for a blob includes specifying what instances have the blob as well as who has access to the blob. If the blob is copied to an additional instance, the metadata delta only needs to specify that the blob is available at the additional instance. The delta need not specify where the blob is already located. As the number of deltas increases, the time to read data increases. The compaction process merges the deltas 608-1, etc. into the base value 606 to create a new base value that incorporates the changes in the deltas.
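  • The relationship between a base value, its deltas, and compaction can be sketched as below. For brevity the metadata is reduced to the set of instances holding the blob, and the names `Delta`, `current_value`, and `compact` are hypothetical; the specification does not prescribe this representation.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Delta:
    sequence: int                                   # order in which deltas apply
    add_instances: FrozenSet[str] = frozenset()     # only the change is stored
    remove_instances: FrozenSet[str] = frozenset()


def current_value(base: FrozenSet[str], deltas: Iterable[Delta]) -> FrozenSet[str]:
    """Start from the base value and apply the deltas in sequence order."""
    value = set(base)
    for delta in sorted(deltas, key=lambda d: d.sequence):
        value |= delta.add_instances
        value -= delta.remove_instances
    return frozenset(value)


def compact(base: FrozenSet[str], deltas: Iterable[Delta]) -> Tuple[FrozenSet[str], List[Delta]]:
    """Merge all deltas into a new base value, leaving no deltas to replay."""
    return current_value(base, deltas), []
```

  • After compaction a read no longer replays any deltas, which is why read time stops growing with the number of accumulated changes.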
  • an access control list may be implemented as a multi-byte integer in which each bit position represents an item, location, or person.
  • deltas may be encoded as instructions for how to make changes to a stream of binary data.
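  • An access control list stored as an integer bitmask, as mentioned above, can be manipulated with ordinary bit operations. The helper names and the mapping of bit positions to people or locations are illustrative assumptions only.

```python
def grant(acl: int, position: int) -> int:
    """Set the bit for the item, location, or person at `position`."""
    return acl | (1 << position)


def revoke(acl: int, position: int) -> int:
    """Clear the bit at `position`."""
    return acl & ~(1 << position)


def has_access(acl: int, position: int) -> bool:
    """Report whether the bit at `position` is set."""
    return bool(acl & (1 << position))
```

  • Python integers have arbitrary width, so the same code works for a "multi-byte" mask of any size; a fixed-width language would need an explicit array of words.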
  • Figure 7 illustrates an exemplary data structure to hold a delta.
  • Each delta applies to a unique row, so the delta includes the row identifier 702 of the row to which it applies.
  • the sequence identifier 704 is globally unique, and specifies the order in which the deltas are applied.
  • the sequence identifier comprises a timestamp 706 and a tie breaker value 708 that is uniquely assigned to each instance where deltas are created.
  • the timestamp is the number of microseconds past a well-defined point in time.
  • the tie breaker is computed as a function of the physical machine running the blobmaster as well as a process id.
  • the tie breaker includes an instance identifier, either alone, or in conjunction with other characteristics at the instance.
  • the tie breaker 708 is stored as a tie breaker value 132.
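  • One way to picture the sequence identifier is as a (timestamp, tie breaker) pair compared lexicographically. The sketch below assumes integer tie breakers and uses the Unix epoch as the "well-defined point in time"; both choices, and the names used, are assumptions rather than details from the specification.

```python
import time
from typing import NamedTuple


class SequenceId(NamedTuple):
    timestamp_us: int   # microseconds past a fixed point in time
    tie_breaker: int    # unique per instance (or per machine and process)


def new_sequence_id(tie_breaker: int) -> SequenceId:
    """Build a sequence identifier from the current time (Unix epoch assumed)."""
    return SequenceId(time.time_ns() // 1000, tie_breaker)
```

  • Because tuples compare element by element, sorting a list of `SequenceId` values orders deltas by timestamp and uses the tie breaker only when two deltas carry the same microsecond.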
  • a change to metadata at one instance is replicated to other instances.
  • the actual change to the base value 712 may be stored in various formats.
  • data structures similar to those in Figures 8A - 8E are used to store the changes, but the structures are modified so that most of the fields are optional. Only the actual changes are filled in, so the space required to store or transmit the delta is small.
  • the changes are stored as key/value pairs, where the key uniquely identifies the data element changed, and the value is the new value for the data element.
  • deltas may include information about forwarding. Because blobs may be dynamically replicated between instances at any time, and the metadata may be modified at any time as well, there are times that a new copy of a blob does not initially have all of the associated metadata. In these cases, the source of the new copy maintains a "forwarding address," and transmits deltas to the instance that has the new copy of the blob for a certain period of time (e.g., for a certain range of sequence identifiers).
  • Figures 8A - 8E illustrate data structures that are used to store metadata in some embodiments.
  • these data structures exist within the memory space of an executing program or process.
  • these data structures exist in non-volatile memory, such as magnetic or optical disk drives.
  • these data structures form a protocol buffer, facilitating transfer of the structured data between physical devices or processes. See, for example, the Protocol Buffer Language Guide, available at http://code.google.com/apis/protocolbuffers/docs/proto.html.
  • the overall metadata structure 802 includes three major parts: the data about blob generations 804, the data about blob references 808, and inline data 812.
  • read tokens 816 are also saved with the metadata, but the read tokens are used as a means to access data instead of representing characteristics of the stored blobs.
  • the blob generations 804 can comprise one or more "generations" of each blob.
  • the stored blobs are immutable, and thus are not directly editable. Instead, a "change" of a blob is implemented as a deletion of the prior version and the creation of a new version.
  • Each of these blob versions 806-1, 806-2, etc. is a generation, and has its own entry.
  • a fixed number of generations are stored before the oldest generations are physically removed from storage. In other embodiments, the number of generations saved is set by a blob policy 326.
  • (a policy can set the number of saved generations as 1, meaning that the old one is removed when a new generation is created.) In some embodiments, removal of old generations is intentionally "slow," providing an opportunity to recover an old "deleted" generation for some period of time.
  • the specific metadata associated with each generation 806 is described below with respect to Figure 8B.
  • Blob references 808 can comprise one or more individual references 810-1,
  • Inline data 812 comprises one or more inline data items 814-1, 814-2, etc.
  • Inline data is not "metadata" - it is the actual content of the saved blob to which the metadata applies.
  • access to the blobs can be optimized by storing the blob contents with the metadata.
  • the blobmaster returns the actual blob contents rather than read tokens 816 and information about where to find the blob contents.
  • blobs are stored in the metadata table only when they are small, there is generally at most one inline data item 814-1 for each blob. The information stored for each inline data item 814 is described below in Figure 8D.
  • each generation 806 includes several pieces of information.
  • a generation number 822 (or generation ID) uniquely identifies the generation.
  • the generation number can be used by clients to specify a certain generation to access.
  • the blobmaster 204 will return information about the most current generation.
  • each generation tracks several points in time.
  • some embodiments track the time the generation was created (824). Some embodiments track the time the blob was last accessed by a user (826).
  • in some embodiments, last access refers to end user access, and in other embodiments, last access includes administrative access as well.
  • Some embodiments track the time the blob was last changed (828). In some embodiments that track when the blob was last changed, changes apply only to metadata because the blob contents are immutable.
  • Some embodiments provide a block flag 830 that blocks access to the generation. In these embodiments, a blobmaster 204 would still allow access to certain users or clients who have the privilege of seeing blocked blob generations.
  • Some embodiments provide a preserve flag 832 that will guarantee that the data in the generation is not removed. This may be used, for example, for data that is subject to a litigation hold or other order by a court.
  • a generation has one or more representations 818. The individual representations 820-1, 820-2, etc. are described below with respect to Figure 8E.
  • Figure 8C illustrates a data structure to hold an individual reference according to some embodiments.
  • Each reference 810 includes a reference ID 834 that uniquely identifies the reference.
  • When a user 302 accesses a blob, the user application 308 must specify a reference ID in order to access the blob.
  • each reference has an owner 836, which may be the user or process that created the reference.
  • Each reference has its own access control list ("ACL"), which may specify who has access to the blob, and what those access rights are. For example, a group that has access to read the blob may be larger than the group that may edit or delete the blob. In some embodiments, removal of a reference is intentionally slow, in order to provide for recovery from mistakes.
  • this slow deletion of references is provided by tombstones.
  • Tombstones may be implemented in several ways, including the specification of a tombstone time 840, at which point the reference will be truly removed.
  • the tombstone time is 30 days after the reference is marked for removal.
  • certain users or accounts with special privileges can view or modify references that are already marked with a tombstone, and have the rights to remove a tombstone (i.e., revive a blob).
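  • The tombstone mechanism can be illustrated with a small sketch. The `Reference` class and its method names are hypothetical, and privilege checks are omitted; the 30-day delay is the example value given above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TOMBSTONE_DELAY = timedelta(days=30)  # example delay from the text


@dataclass
class Reference:
    reference_id: str
    tombstone_time: Optional[datetime] = None

    def mark_for_removal(self, now: datetime) -> None:
        """Schedule removal instead of deleting immediately."""
        self.tombstone_time = now + TOMBSTONE_DELAY

    def revive(self) -> None:
        """Remove the tombstone (a privileged operation in practice)."""
        self.tombstone_time = None

    def is_removable(self, now: datetime) -> bool:
        """True once the tombstone time has been reached."""
        return self.tombstone_time is not None and now >= self.tombstone_time
```

  • A reference marked on 1 January becomes removable on 31 January; any time before that, `revive()` restores it without data loss.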
  • each reference has its own blob policy, which may be specified by a policy ID 842.
  • the blob policy specifies the number of copies of the blob, where the copies are located, what types of data stores to use for the blobs, etc.
  • the applicable "policy" is the union of the relevant policies. For example, if one policy requests 2 copies, at least one of which is in Europe, and another requests 3 copies, at least one of which is in North America, then the minimal union policy is 3 copies, with at least one in Europe and at least one in North America.
  • in some embodiments, individual references also have a block flag 844 and preserve flag 846, which function the same way as block and preserve flags 830 and 832 defined for each generation.
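  • The "minimal union" of several policies, as in the Europe and North America example above, can be computed by taking the maximum of each requirement. The `Policy` class below models only a total copy count and per-region minimums; that simplification and all names are assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Policy:
    min_copies: int
    min_per_region: Dict[str, int] = field(default_factory=dict)


def union_policy(policies: Iterable[Policy]) -> Policy:
    """Smallest policy that satisfies every input policy."""
    copies = 0
    regions: Dict[str, int] = {}
    for policy in policies:
        copies = max(copies, policy.min_copies)
        for region, count in policy.min_per_region.items():
            regions[region] = max(regions.get(region, 0), count)
    # The total can never be smaller than the sum of the regional minimums.
    copies = max(copies, sum(regions.values()))
    return Policy(copies, regions)
```

  • For the example in the text, `union_policy([Policy(2, {"Europe": 1}), Policy(3, {"North America": 1})])` yields 3 copies with at least one in each of the two regions.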
  • a user or owner of a blob reference may specify additional information about a blob, which may include on disk information 850 or in memory information 848. A user may save any information about a blob in these fields.
  • Figure 8D illustrates inline data items 814 according to some embodiments.
  • Each inline data item 814 is assigned to a specific generation, and thus includes a generation number 822.
  • the inline data item also specifies the representation type 852, which, in combination with the generation number 822, uniquely identifies a representation item 820. (See Figure 8E and associated description below.)
  • the inline data item 814 also specifies the chunk ID 856.
  • the inline data item 814 specifies the chunk offset 854, which specifies the offset of the current chunk from the beginning of the blob.
  • the chunk offset is specified in bytes.
  • there is a Preload Flag 858 that specifies whether the data on disk is preloaded into memory for faster access.
  • Figure 8E illustrates a data structure to store blob representations according to some embodiments. Representations are distinct views of the same physical data. For example, one representation of a digital image could be a high resolution photograph. A second representation of the same blob of data could be a small thumbnail image.
  • Each representation data item 820 specifies a representation type 852, which would correspond to "high resolution photo" and "thumbnail image" in the above example.
  • the Replica Information 862 identifies where the blob has been replicated, the list of storage references (i.e., which chunk stores have the chunks for the blob). In some embodiments, the Replica Information 862 includes other auxiliary data needed to track the blobs and their chunks.
  • Each representation data item also includes a collection of blob extents 864, which specify the offset to each chunk within the blob, to allow reconstruction of the blob.
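  • Reconstruction of a blob from its extents can be sketched as follows. The extents are modelled as (offset, chunk ID) pairs and `fetch_chunk` is a hypothetical callback standing in for a read from a chunk store; neither is defined by the specification.

```python
from typing import Callable, Iterable, Tuple


def reconstruct_blob(
    extents: Iterable[Tuple[int, str]],
    fetch_chunk: Callable[[str], bytes],
) -> bytes:
    """Concatenate chunks in offset order, checking that they are contiguous."""
    data = bytearray()
    for offset, chunk_id in sorted(extents):
        if offset != len(data):
            raise ValueError(f"gap or overlap at offset {offset}")
        data += fetch_chunk(chunk_id)
    return bytes(data)
```

  • The contiguity check is a design choice of this sketch: it detects a missing or misplaced chunk early instead of silently returning a corrupt blob.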
  • When a blob is initially created, it goes through several phases, and some embodiments track these phases in each representation data item 820.
  • a finalization status field 866 indicates when the blob is UPLOADING, when the blob is FINALIZING, and when the blob is FINALIZED. Most representation data items 820 will have the FINALIZED status.
  • certain finalization data 868 is stored during the finalization process.
  • a distributed storage system 200 may include multiple instances 102, and a particular instance 102 may include multiple data stores based on different types of storage media, one of which is a tape store 218 that is configured to store data on a physical tape 320 as a backup for the other data stores.
  • special approaches are developed for the tape store 218 to make these tape-related features less visible such that the tape store 218 can be treated in effectively the same manner as the other types of data stores like the bigtable store 214 and the file store 216.
  • Figure 9A depicts a block diagram illustrating how a chunk is transferred from a distributed storage system to a tape storage system, with Figures 10A and 10B showing the corresponding flowcharts of the chunk backup process.
  • Figure 9B depicts how a chunk is restored from the tape storage system back to the distributed storage system, with Figures 10C and 10D showing the corresponding flowcharts of the chunk restore process.
  • Figures 9C - 9F depict block diagrams of data structures used by different components of the distributed storage system to support the back and forth chunk replication between the distributed storage system and the tape storage system.
  • Figures 9A and 9B only depict a subset of components of the distributed storage system 200 as shown in Figures 1 and 3, including the LAD 902 and two blobstores 904, 906.
  • the blobstore 904 is coupled to a tape storage system 908 that may be external to the distributed storage system 200 in some embodiments or part of the distributed storage system 200 in some other embodiments.
  • the term "blobstore" in this application corresponds to an instance 102 of the system 200 because it stores a plurality of blobs, each blob being a data object (e.g., an image, a text document, or an audio/video stream) that is comprised of one or more chunks.
  • the LAD 902 determines that a chunk associated with a blob should be backed up onto a tape and then issues a chunk backup request including an identifier of the chunk to the repqueue 904-1 of the blobstore 904 (1001 of Figure 10A).
  • the term "repqueue" is a collective representation of a replication module 224 and its associated queues 226 as shown in Figure 3.
  • the LAD 902 may make this decision in accordance with the blob's replication policy that requires a replica of the blob being stored in a tape storage system as a backup.
  • the chunk backup request may be initiated by a client residing with an application (e.g., the client 310 inside the user application 308 as shown in Figure 3). Since the distributed storage system 200 includes multiple blobstores, the LAD 902 typically issues the request to a load-balanced blobstore that has access to tape storage. As part of the chunk backup request, the LAD 902 also identifies a source storage reference that has a replica of the chunk to be backed up. Depending on where the replica is located, the source storage reference may be a chunk store within the same blobstore that receives the backup request or a different blobstore. In this example, it is assumed that the replica is within the blobstore 906.
  • the repqueue 904-1 then issues a chunk write request to a load-balanced bitpusher 904-3 (1003 of Figure 10A).
  • this step involves the process of adding the request to a particular queue of tasks to be performed by the blobstore's bitpushers in accordance with the priority of the backup request that the LAD 902 has chosen.
  • the chunk write request may reach the bitpusher 904-3 at a later time if the backup request has a relatively low priority.
  • the bitpusher 904-3 may check whether the chunk is already backed up upon receipt of the chunk backup or write request (1005 of Figure 10A).
  • the bitpusher 904-3 does so by attempting to add a reference to the chunk using the chunk identifier and its source storage reference to the chunk index table 904-6 of the tape store 904-4.
  • the chunk index table 904-6 includes a plurality of chunk index records, each record identifying a chunk that has been backed up or scheduled to be backed up by the tape storage system. In some other embodiments, the chunk index table 904-6 only includes chunk index records for those chunks that have been backed up by the tape storage system.
  • Figure 9C illustrates the data structure of an exemplary chunk index record 920.
  • Each chunk index record 920 has a globally unique chunk ID 922.
  • the chunk ID 922 is a function of a hash of the chunk's content 922-1 and a sequence ID 922-3 assigned to a particular incarnation of the chunk, e.g., the creation timestamp of the incarnation.
  • the chunk index record 920 includes a storage reference 924, which is a function of a blobstore ID 924-1 and a chunk store ID 924-3.
  • the chunk metadata 926 of the chunk index record 920 includes: another hash of the chunk's content 926-1; a chunk creation time 926-3; a reference count 926-5; and a chunk size 926-7.
  • the blob references list 928 of the chunk index record 920 identifies a set of blobs each of which considers the chunk as part of the blob.
  • Each blob reference is identified by a combination of a blob base ID 928-1 and a blob generation ID 928-3.
  • the blob reference also keeps a chunk offset 928-5 indicating the position of the chunk within the blob and an optional
  • the bitpusher 904-3 queries the chunk index table 904-6 for a chunk index record corresponding to the given chunk ID. If the chunk index record is indeed found (yes, 1007 of Figure 10A), the reference count 926-5 of the chunk is increased by one and a new entry may be added to the blob references list 928 identifying another blob that considers the chunk as part of the blob.
  • the tape store 904-4 then sends a response to the bitpusher 904-3, indicating that the chunk backup/write operation is complete (1009 of Figure 10A).
  • the bitpusher 904-3 then forwards the response back to the repqueue 904-1, which then sends a blob metadata update to the blobmaster 904-5.
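  • The "already backed up" check against the chunk index table can be sketched as a lookup that either bumps a reference count or creates a new record. The in-memory dictionary, the simplified record fields, and the function name `add_reference` are assumptions for illustration; the real table also carries the chunk metadata 926 described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BlobRef = Tuple[str, int, int]  # (blob base ID, blob generation ID, chunk offset)


@dataclass
class ChunkIndexRecord:
    chunk_id: str
    storage_reference: str
    reference_count: int = 0
    blob_references: List[BlobRef] = field(default_factory=list)


def add_reference(
    index: Dict[str, ChunkIndexRecord],
    chunk_id: str,
    storage_reference: str,
    blob_ref: BlobRef,
) -> bool:
    """Return True if the chunk was already indexed (no transfer needed)."""
    record = index.get(chunk_id)
    if record is not None:
        record.reference_count += 1
        record.blob_references.append(blob_ref)
        return True
    index[chunk_id] = ChunkIndexRecord(chunk_id, storage_reference, 1, [blob_ref])
    return False
```

  • When the function returns `True`, the caller can report the backup as complete immediately; when it returns `False`, the chunk content still has to be fetched and staged.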
  • the bitpusher 904-3 requests the chunk from a bitpusher 906-1 at the blobstore 906.
  • the bitpusher 906-1 retrieves the requested chunk from a corresponding chunk store 906-3 and returns it to the bitpusher 904-3 at the blobstore 904.
  • the bitpusher 904-3 places the chunk in a staging area of the tape store 904-4 with other chunks scheduled to be backed up.
  • the tape master 904-12 is triggered to upload the chunks into the tape storage system 908 in a batch mode.
  • the staging area of the tape store 904-4 is composed of three components: a batch table 904-9, a chunk metadata staging region 904-8 (e.g., a file directory or a bigtable), and a chunk data staging region 904-10 (e.g., a file directory).
  • given a chunk to be backed up, the bitpusher 904-3 generates a chunk transfer entry and inserts the chunk transfer entry into the batch table.
  • the chunk transfer entry includes a reference to a file in the chunk data staging region that contains a list of chunks to be backed up.
  • the bitpusher 904-3 writes the chunk's backup metadata into the chunk metadata staging region 904-8 and the chunk's content into a file in the chunk data staging region 904-10.
  • Figures 9D - 9F illustrate the data structures of an exemplary batch table record 930, a chunk backup record 950, and a chunk restore record 960, respectively.
  • a batch table record within the batch table 904-9 has the following attributes: a unique batch ID 932, a batch type 934 (backup or restore), a locality range 936 (start, limit), a current batch state 938, a batch creation time 940, a tape storage system job status 942, a chunk files list 944, and a batch size 946.
  • a chunk backup metadata record 950 includes the following attributes: a chunk ID 954 and a blob back reference 956 that further includes: a blob base ID 956-1, a chunk offset within the blob 956-3, a chunk size 956-5, a representation type 956-7, and a blob generation ID 956-9.
  • a chunk restore metadata record 960 includes the following attributes: a chunk ID 964 and a blob back reference 966 that further includes: a blob base ID 956-1, a chunk offset within the blob 956-3, a chunk size 956-5, a representation type 956-7, and a blob generation ID 956-9. Note that a combination of a blob base ID and a blob generation ID can uniquely identify a particular generation of a blob that includes the chunk.
  • the batch table 904-9 includes multiple batches and each batch includes a list of chunk files to be backed up or restored.
  • the bitpusher 904-3 needs to identify one of the multiple batches for the chunk to be backed up on the tape storage system (1011 of Figure 10A).
  • the bitpusher 904-3 chooses the batch by comparing a locality hint provided with the chunk backup request with the batch's locality range.
  • the locality hint provided by the LAD 902 or another client indicates a group of chunks that should be restored together with the chunk in one batch or a group of chunks that should expire together with the chunk.
  • the bitpusher 904-3 identifies a batch whose locality range most closely matches the chunk's locality hint and inserts a new chunk transfer entry into the identified batch as well as writing the chunk's backup metadata and content into respective files in the corresponding chunk data staging region (1013 of Figure 10A).
  • the metadata includes a hash of the chunk used by the tape storage system 908. In some embodiments, if the chosen batch's size reaches a threshold or the chosen batch has been open for at least a predefined time period, the bitpusher 904-3 may close the batch and create a new batch for the chunk to be backed up.
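  • Batch selection by locality hint, together with the size and age closure conditions, might be sketched as below. Several details are assumptions made only for this example: locality ranges are integer intervals with an exclusive limit, the "best match" is the narrowest open range containing the hint, and the names `Batch` and `choose_batch` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Batch:
    batch_id: int
    start: int                 # locality range start (inclusive)
    limit: int                 # locality range limit (exclusive)
    created_at: float          # seconds
    size: int = 0
    state: str = "OPEN"
    chunk_files: List[str] = field(default_factory=list)


def choose_batch(
    batches: List[Batch], hint: int, now: float, max_size: int, max_age_s: float
) -> Optional[Batch]:
    """Pick an open batch for a chunk, rolling over to a new batch if needed."""
    candidates = [b for b in batches if b.state == "OPEN" and b.start <= hint < b.limit]
    if not candidates:
        return None
    batch = min(candidates, key=lambda b: b.limit - b.start)  # narrowest range
    if batch.size >= max_size or now - batch.created_at >= max_age_s:
        batch.state = "CLOSED"
        new_id = max(b.batch_id for b in batches) + 1
        batch = Batch(new_id, batch.start, batch.limit, created_at=now)
        batches.append(batch)
    return batch
```

  • Closing the full batch and opening a fresh one with the same locality range keeps chunks that should be restored or expire together in adjacent batches.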
  • the latency between the bitpusher 904-3 sending the response and the chunk being backed up on the tape storage system 908 may range from a few minutes to multiple hours.
  • the repqueue 904-1 may send a blob metadata update to the blobmaster 904-5 to update the blob's extents table within the metadata table 904-7 (1017 of Figure 10A).
  • a similar set of operations may be performed at the blobstore 906. For example, a previous request to delete a blob including the chunk that was suspended due to the blob's "in transfer" state may resume after the blob's state returns to "finalized."
  • the tape master 904-12 is responsible for periodically uploading the chunks from the staging area of the distributed storage system to the tape storage system 908 in a batch mode.
  • the tape master 904-12 scans the batch table 904-9 for batches closed by the bitpusher 904-3 or open batches that meet one or more predefined batch closure conditions (1021 of Figure 10B). For example, a batch may be closed if the batch's size or the time it has been open since creation exceeds a predefined threshold. If the tape master 904-12 identifies no batch for further processing (no, 1023 of Figure 10B), it waits for a predefined time period before the next scan of the batch table 904-
  • If the tape master 904-12 identifies a closed batch or a batch that is ready to be closed (yes, 1023 of Figure 10B), it updates the current batch state of the batch record in the batch table 904-9 and initiates the process of uploading the files associated with the closed batch into the tape storage system 908 (1027 of Figure 10B). In some embodiments, the tape master also opens a new batch in the batch table after closing an old one.
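One pass of the periodic scan described above might look like the sketch below. The state names (`"open"`, `"closed"`, `"uploading"`), the `BatchRecord` fields, and both function names are assumptions made for illustration; the text only says that the batch state is updated and the upload is initiated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BatchRecord:
    batch_id: str
    state: str                 # hypothetical states: "open", "closed", "uploading"
    size_bytes: int
    created_at: float
    chunk_files: List[str]


def scan_batch_table(table: List[BatchRecord], now: float,
                     max_bytes: int, max_age_s: float) -> List[BatchRecord]:
    """Return batches that are closed or meet a closure condition, marking them for upload."""
    ready = []
    for rec in table:
        meets_closure = (rec.size_bytes > max_bytes
                         or now - rec.created_at > max_age_s)
        if rec.state == "closed" or (rec.state == "open" and meets_closure):
            rec.state = "uploading"
            ready.append(rec)
    return ready


def files_to_upload(ready: List[BatchRecord]) -> List[str]:
    """Flatten the chunk file lists that would be sent to the tape storage system."""
    return [name for rec in ready for name in rec.chunk_files]


table = [
    BatchRecord("b1", "closed", 10, 90.0, ["c1.meta", "c1.data"]),
    BatchRecord("b2", "open", 10, 90.0, ["c2.meta", "c2.data"]),    # young and small
    BatchRecord("b3", "open", 10, 0.0, ["c3.meta", "c3.data"]),     # open too long
]
ready = scan_batch_table(table, now=100.0, max_bytes=1000, max_age_s=60.0)
```

An empty result corresponds to the "no" branch in the text, where the tape master waits before scanning again.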
  • the tape master 904-12 extracts the list of chunk files from the batch table 904-9 and sends the list to the tape storage system 908 for chunk backup (1029 of Figure 10B).
  • each chunk is divided into two parts, the metadata being stored in one file and the content being stored in another file. Therefore, for each chunk, the tape storage system 908 uses the two file names provided by the tape master 904-12 to retrieve the metadata from the chunk metadata region 904-8 and the content from the chunk data region 904-10.
  • the tape storage system 908 keeps one copy for each chunk on the tape. For security, a private key is used for encrypting the chunk and the private key is kept on a separate tape such that a chunk deletion request is honored by deleting the private key.
  • the tape storage system 908 may repeat the retrieval process until either it receives the chunk or it has tried for at least a predefined number of times. In either case, the tape storage system 908 sends a backup status for each file back to the tape master 904-12 (1031 of Figure 10B).
  • the tape master 904-12 then uses the status information to update the batch table (e.g., the tape system job status attribute 942 of the batch table record 920).
  • the tape master 904-12 also updates the chunk index table 904-6 for each chunk processed by the tape storage system regardless of whether the backup succeeds or not (1033 of Figure 10B).
  • the bitpusher 904-3 may insert an entry into the chunk index table 904-6 for each chunk it places into the staging area.
  • the entry is marked as "Staging," indicating that the chunk has not yet been transferred to the tape storage system 908.
  • When the tape master 904-12 receives the backup status from the tape storage system 908, it either changes the state of the entry in the chunk index table 904-6 from "Staging" to "External" (if the backup succeeds) or deletes the entry from the chunk index table 904-6 (if the backup fails).
  • the entries in the chunk index table 904-6 are used by the bitpusher 904-3 to determine whether a chunk backup request is a new request or a repeated request.
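The chunk index life cycle described above can be sketched as a small state table. The `"Staging"` and `"External"` state names come from the text; the `ChunkIndex` class and its method names are hypothetical.

```python
from typing import Dict, Optional


class ChunkIndex:
    """Toy stand-in for the chunk index table; not the patent's actual schema."""

    def __init__(self) -> None:
        self._state: Dict[str, str] = {}

    def stage(self, chunk_id: str) -> bool:
        """Record a backup request. Returns False if the request is a repeat."""
        if chunk_id in self._state:
            return False
        self._state[chunk_id] = "Staging"
        return True

    def on_backup_status(self, chunk_id: str, succeeded: bool) -> None:
        """Apply the status reported by the tape storage system."""
        if chunk_id not in self._state:
            return
        if succeeded:
            self._state[chunk_id] = "External"
        else:
            # A failed backup removes the entry, so a later request counts as new.
            del self._state[chunk_id]

    def state(self, chunk_id: str) -> Optional[str]:
        return self._state.get(chunk_id)


index = ChunkIndex()
first = index.stage("chunk-a")      # new request
repeat = index.stage("chunk-a")     # repeated request
index.on_backup_status("chunk-a", succeeded=True)
index.stage("chunk-b")
index.on_backup_status("chunk-b", succeeded=False)
```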
  • For each successfully backed up chunk, the tape master 904-12 also updates the extents table of the corresponding blob through the blobmaster 904-5 (1035 of Figure 10B). Through subsequent metadata replication, the existence of the chunk in the tape storage system (in the form of a replica of the blob) is propagated to the other instances or blobstores of the distributed storage system. In some embodiments, the tape master 904-12 also deletes the chunk metadata file and the chunk content file from the staging area for each successfully backed up chunk to free space for subsequent chunk backup or restore tasks. After the last chunk within a batch is processed, the tape master 904-12 may remove the batch from the batch table 904-9 to conclude a batch of chunk backup requests (1037 of Figure 10C).
  • the tape storage system 908 divides a chunk into multiple (e.g., four) segments and generates a redundancy segment from the multiple segments such that a lost segment can be reconstructed from the other segments using approaches such as error correction codes (ECC).
  • Each segment (including the redundancy segment) is kept on a separate tape for security and safety reasons.
  • tapes corresponding to different chunk segments are shipped by different vehicles such that any tape loss due to an accident can be recovered from the other tapes that are shipped separately.
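The text does not specify which error correction code is used. As a stand-in, the sketch below uses plain XOR parity, the simplest scheme that lets any single lost segment be rebuilt from the remaining ones. The function names and the zero-padding convention are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data: bytes, k: int = 4) -> Tuple[List[bytes], bytes]:
    """Split data into k equal segments (zero-padded) plus one XOR parity segment."""
    seg_len = -(-len(data) // k)                  # ceiling division
    padded = data.ljust(seg_len * k, b"\0")
    segments = [padded[i * seg_len:(i + 1) * seg_len] for i in range(k)]
    parity = bytes(seg_len)
    for seg in segments:
        parity = xor_bytes(parity, seg)
    return segments, parity


def reconstruct(segments: List[Optional[bytes]], parity: bytes, lost_index: int) -> bytes:
    """Rebuild the segment at lost_index from the parity and the surviving segments."""
    out = parity
    for i, seg in enumerate(segments):
        if i != lost_index:
            out = xor_bytes(out, seg)
    return out


chunk = b"tape backup chunk!"
segments, parity = split_with_parity(chunk, k=4)

# Simulate losing the tape that holds segment 2.
damaged: List[Optional[bytes]] = list(segments)
damaged[2] = None
recovered = reconstruct(damaged, parity, lost_index=2)
```

XOR parity tolerates the loss of only one of the five tapes; a production system would more likely use a stronger code, which the text leaves open.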
  • the bitpusher 904-3 deletes a reference to the chunk from the chunk index table 904-6 and reduces the chunk's reference count by one. Once the chunk's reference count reaches zero, the corresponding chunk index record may be deleted from the chunk index table 904-6.
  • the tape master 904-12 may issue a delete request to the tape storage system 908 to delete the chunk from the tape storage system.
  • the tape storage system 908 manages a private key for each chunk backed up on tape. Upon receipt of the delete request, the tape storage system 908 can simply eliminate the private key, indicating the expiration of the chunk.
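The interplay of reference counting and key deletion described above can be sketched as follows. No encryption is performed here; the sketch only models the bookkeeping, on the assumption that a chunk on tape is readable exactly as long as its key exists. `TapeKeyStore` and `ChunkRefCounter` are hypothetical names.

```python
import secrets
from typing import Dict


class TapeKeyStore:
    """Holds one key per chunk; deleting the key expires the chunk."""

    def __init__(self) -> None:
        self._keys: Dict[str, bytes] = {}

    def register(self, chunk_id: str) -> None:
        self._keys[chunk_id] = secrets.token_bytes(32)

    def delete(self, chunk_id: str) -> None:
        self._keys.pop(chunk_id, None)

    def is_readable(self, chunk_id: str) -> bool:
        return chunk_id in self._keys


class ChunkRefCounter:
    """Tracks references to a chunk and issues a delete when none remain."""

    def __init__(self, key_store: TapeKeyStore) -> None:
        self._counts: Dict[str, int] = {}
        self._key_store = key_store

    def add_reference(self, chunk_id: str) -> None:
        if chunk_id not in self._counts:
            self._key_store.register(chunk_id)
        self._counts[chunk_id] = self._counts.get(chunk_id, 0) + 1

    def drop_reference(self, chunk_id: str) -> None:
        count = self._counts.get(chunk_id, 0)
        if count == 0:
            return
        if count == 1:
            # Last reference gone: remove the index record and expire the chunk.
            del self._counts[chunk_id]
            self._key_store.delete(chunk_id)
        else:
            self._counts[chunk_id] = count - 1


keys = TapeKeyStore()
refs = ChunkRefCounter(keys)
refs.add_reference("chunk-a")
refs.add_reference("chunk-a")
refs.drop_reference("chunk-a")
still_readable = keys.is_readable("chunk-a")
refs.drop_reference("chunk-a")
readable_after_expiry = keys.is_readable("chunk-a")
```

Deleting the key is cheap compared with rewriting a tape, which is why the space held by expired chunks is reclaimed later, in a separate rewrite pass.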
  • the tape storage system 908 reads back the chunk segments using the private keys for the valid chunks from the different tapes and rewrites the chunk segments back to the tapes using the same approach as described above. In doing so, the space occupied by the expired chunks is reclaimed by valid chunks and the chunk redundancy is kept intact.
  • One reason for replicating chunks on the tape storage system is to restore the chunks. This may happen when one or more instances of the distributed storage system become unavailable due to a catastrophic accident and the LAD determines that restoring chunks from the tape storage system is necessary. Note that although tape is a serial access medium, restoring chunks from the tape storage system can happen at a chunk level, not at a tape level, partly because the chunks that are likely to be restored together have been grouped into one batch at the time of chunk backup operation.
  • Figure 9B is a block diagram that is similar to the one in Figure 9A except that Figure 9B illustrates the process of restoring a chunk from the tape storage system 908.
  • the LAD 902 instructs the repqueue 904-1 of the blobstore 904 to restore a chunk from the tape storage system 908 (1041 of Figure 10C).
  • the instruction may include the chunk ID and a destination storage reference for hosting the restored chunk.
  • the repqueue 904-1 then issues a request (e.g., a remote procedure call) to a load-balanced bitpusher 904-3 to restore the chunk (1043 of Figure 10C).
  • the bitpusher 904-3 requests quota for the chunk in the batch table 904-9 (1045 of Figure 10C).
  • the bitpusher 904-3 may send an error notification to the repqueue 904-1 (1049 of Figure 10C), which can submit a quota request to a quota server (not shown in Figure 10C) to get the necessary quota for the chunk replication to continue.
  • the bitpusher 904-3 selects a batch in the batch table 904-9 for the chunk to be restored (1051 of Figure 10C). In some embodiments, this step is followed by inserting information such as the chunk ID, sequence ID, and the destination storage reference into a corresponding batch table record in the batch table or a bigtable in the chunk metadata staging region 904-8 (1053 of Figure 10C). In some embodiments, the bitpusher 904-3 also closes a batch that is ready for closure and opens a new batch for subsequent tape-related operations in the staging area. At this point, the responsibility for restoring the chunks identified in the batch table 904-9 is transferred from the bitpusher 904-3 to the tape master 904-12.
  • the bitpusher 904-3 sends a response to the repqueue 904-1 indicating that the chunk restore request has been scheduled and will be performed asynchronously at a later time (1055 of Figure 10C).
  • the repqueue 904-1 may send metadata updates to the blobmaster 904-5 at the blobstore 904 as well as the blobmaster at the blobstore 906 (which is assumed to be the destination storage reference in this example).
  • the latency between the bitpusher 904-3 providing the response acknowledging the receipt of the chunk restore request and the bitpusher 904-3 providing the requested chunk may range from a few minutes to a few days, partly depending on when the chunk was backed up. The more recent the backup, the shorter the latency between the two steps.
  • Similar to the chunk backup process described above, the tape master 904-12 periodically scans the batch table for closed batches or batches that are ready for closure (1061 of Figure 10D). If no batch is identified (no, 1063 of Figure 10D), the tape master 904-12 then waits for the next scan (1065 of Figure 10D). For each closed batch or batch that is ready for closure (yes, 1063 of Figure 10D), the tape master 904-12 retrieves the chunk IDs and a list of chunk file names from the corresponding batch table record in the batch table 904-9 and updates the batch table if necessary (1067 of Figure 10D).
  • each chunk scheduled for backup or restoration has two components, one component stored in a bigtable within the chunk metadata region 904-8 for storing the chunk's restore metadata (a chunk restore metadata record 960 is depicted in Figure 9F) and the other component (including the chunk's key, content data, and hash) stored in a file within the chunk content data region 904-10 for storing the chunk's content.
  • Based on the chunk IDs (in addition to the chunk sequence IDs), the tape storage system 908 identifies one or more tapes that store those chunks and then writes the chunks back to the staging area of the distributed storage system using the list of chunk file names.
  • the tape storage system 908 sends a job status report to the tape master 904-12 regarding the restore state of each file identified in the batch (1071 of Figure 10D). For each successfully restored chunk of each file, the tape master 904-12 retrieves the chunk's content and metadata from the chunk data region 904-10 and the chunk metadata region 904-8 and sends them to the bitpusher 904-3 (1073 of Figure 10D). In some embodiments, for efficiency, the restored chunks are stored within a local chunk store before being restored to a remote chunk store. In some embodiments, the tape master 904-12 updates the chunk index table 904-6 to reflect the new reference to the chunk restored from the tape storage system.
  • the tape master 904-12 sends a metadata update to the blobmaster 904-5 to update the extents table of the corresponding blob that includes the restored chunk. For each unsuccessful restore of a chunk, the tape master 904-12 removes the original chunk restore request from the batch table (1075 of Figure 10D). At the end of the process, the tape master 904-12 removes the restored chunks from the staging area and the processed batch from the batch table (1077 of Figure 10D).
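The handling of the restore job status report described above can be sketched as a single pass over the batch. This assumes, for illustration only, that a batch maps chunk IDs to destination storage references and that the status report maps chunk IDs to a success flag; `process_restore_status` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def process_restore_status(
    batch_requests: Dict[str, str],
    job_status: Dict[str, bool],
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Split a batch into chunks to forward to the bitpusher and requests to drop.

    batch_requests: chunk ID -> destination storage reference (assumed shape).
    job_status: chunk ID -> True if the tape storage system restored it.
    """
    forwarded: List[Tuple[str, str]] = []
    dropped: List[str] = []
    for chunk_id, destination in batch_requests.items():
        if job_status.get(chunk_id, False):
            forwarded.append((chunk_id, destination))
        else:
            # An unsuccessful restore removes the original request from the batch.
            dropped.append(chunk_id)
    return forwarded, dropped


requests = {"chunk-a": "blobstore-906", "chunk-b": "blobstore-906", "chunk-c": "blobstore-904"}
status = {"chunk-a": True, "chunk-b": False}
forwarded, dropped = process_restore_status(requests, status)
```

A chunk missing from the status report is treated as failed here; that is a design choice of the sketch, not something the text states.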

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP20110704385 2010-02-09 2011-02-09 Method and system for providing efficient access to a tape storage system Ceased EP2534570A1 (de)

Applications Claiming Priority (7)

Application Number Priority Date Filing Date Title
US30289610P 2010-02-09 2010-02-09
US30293610P 2010-02-09 2010-02-09
US30290910P 2010-02-09 2010-02-09
US13/022,579 US8341118B2 (en) 2010-02-09 2011-02-07 Method and system for dynamically replicating data within a distributed storage system
US13/022,236 US8560292B2 (en) 2010-02-09 2011-02-07 Location assignment daemon (LAD) simulation system and method
US13/023,498 US8874523B2 (en) 2010-02-09 2011-02-08 Method and system for providing efficient access to a tape storage system
PCT/US2011/024249 WO2011100368A1 (en) 2010-02-09 2011-02-09 Method and system for providing efficient access to a tape storage system

Publications (1)

Publication Number Publication Date
EP2534570A1 (de) 2012-12-19

Family

ID=43797888

Family Applications (2)

Application Number Title Priority Date Filing Date
EP20110704385 Ceased EP2534570A1 (de) 2010-02-09 2011-02-09 Method and system for providing efficient access to a tape storage system
EP11705357.9A Active EP2534571B1 (de) 2010-02-09 2011-02-09 Method and system for dynamically replicating data within a distributed storage system

Country Status (3)

Country Link
EP (2) EP2534570A1 (de)
CN (1) CN103038742B (de)
WO (2) WO2011100368A1 (de)

Also Published As

Publication number Publication date
WO2011100365A1 (en) 2011-08-18
CN103038742A (zh) 2013-04-10
EP2534571B1 (de) 2016-12-07
EP2534571A1 (de) 2012-12-19
WO2011100368A1 (en) 2011-08-18
CN103038742B (zh) 2015-09-30

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20120907

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAX Request for extension of the european patent (deleted)
17Q First examination report despatched

Effective date: 20131108

REG Reference to a national code

Ref country code: DE

Ref legal event code: R003

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION HAS BEEN REFUSED

18R Application refused

Effective date: 20150326

P01 Opt-out of the competence of the unified patent court (upc) registered

Effective date: 20230522