EP3532939A1 - Garbage collection system and process - Google Patents
- Publication number
- EP3532939A1 (application EP17876888.3A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- block
- data
- locations
- shard
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
- All under G—PHYSICS › G06—COMPUTING; CALCULATING OR COUNTING › G06F—ELECTRIC DIGITAL DATA PROCESSING › G06F3/00 (input/output arrangements) › G06F3/06 (digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers) › G06F3/0601 (interfaces specially adapted for storage systems):
- G06F3/0608 — Saving storage space on storage systems
- G06F3/0641 — De-duplication techniques
- G06F3/0643 — Management of files
- G06F3/0652 — Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
- G06F3/067 — Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Definitions
- a garbage collection system using an intermediary networked device to store data objects on a remotely located object storage device(s) is disclosed.
- Deduplication is a specialized data compression technique for eliminating duplicate copies of data.
- Deduplication of data is typically done to decrease the cost of storage of the data using a specially configured storage device having a deduplication engine internally connected directly to a storage drive.
- the deduplication engine within the storage device receives data from an external device.
- the deduplication engine creates a hash from the received data which is stored in a table.
- the table is scanned to determine if an identical hash was previously stored in the table. If it was not, the received data is stored in the Cloud Object Store, and a location pointer for the received data is stored in an entry within the table along with hash of the received data.
- when a duplicate of the received data is detected, an entry is stored in the table containing the hash and an index pointing to the location within the Cloud Object Store where the duplicated data is stored.
- This system has the deduplication engine directly coupled to an internal storage drive to maintain low latency and fast storage of the hash table.
- the data is stored in a Cloud Object Store.
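The hash-table flow described above can be sketched in a few lines. This is a hedged illustration only: the `DedupEngine` class, the in-memory table layout, and the location naming are assumptions made for this sketch, not the patented implementation.

```python
import hashlib

class DedupEngine:
    """Minimal in-memory sketch of the hash-table deduplication flow."""

    def __init__(self, object_store):
        self.object_store = object_store    # stands in for the Cloud Object Store
        self.table = {}                     # hash -> location pointer

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()   # hash the received data
        if digest in self.table:            # duplicate detected: reuse the pointer
            return self.table[digest]
        location = f"obj-{len(self.object_store)}"  # next location (illustrative)
        self.object_store[location] = data  # store the unique data once
        self.table[digest] = location       # record hash -> location in the table
        return location

store = {}
engine = DedupEngine(store)
loc_a = engine.put(b"hello world")
loc_b = engine.put(b"hello world")          # duplicate: nothing new is stored
assert loc_a == loc_b and len(store) == 1
```

The second `put` of identical data returns the existing location pointer without touching the object store, which is the storage saving the system aims for.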
- a method is disclosed to perform garbage collection that works effectively across a system spread over multiple servers (a scale-out cluster) and across very large amounts of data by compacting data in data blocks in an object store.
- Compacting data in the object store includes storing backend objects in the object store and examining data in a reference map of the object store to determine which of the locations within a back-end object in the object store are referenced in the map, and which locations are no longer referenced.
- the back-end object in the object store is altered to remove block data from locations which are no longer referenced, and a hash-to-location table is updated to remove the entries for block data that has been removed.
- the method describes a series of messages, data structures and data stores that can be used to perform garbage collection for a deduplication system spread across multiple servers in a scale-out cluster.
- the method may be a two-phase process - a trace process followed by a compaction process.
- the trace process determines which locations contain data that is still active or referenced.
- the compaction process removes data from locations that are no longer referenced.
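The two-phase process can be modeled as follows. This is an illustrative sketch only: the function names and in-memory structures are assumptions, not the claimed implementation.

```python
def trace(key_shard):
    """Phase 1 (trace): collect every block location still referenced by an object."""
    referenced = set()
    for locations in key_shard.values():
        referenced.update(locations)
    return referenced

def compact(block_store, hash_to_location, referenced):
    """Phase 2 (compaction): drop block data and table entries at dead locations."""
    for location in list(block_store):
        if location not in referenced:
            del block_store[location]       # remove unreferenced block data
    for h, location in list(hash_to_location.items()):
        if location not in referenced:
            del hash_to_location[h]         # remove stale hash-to-location entries

key_shard = {"file-a": ["loc1"], "file-b": ["loc1", "loc2"]}
block_store = {"loc1": b"A", "loc2": b"B", "loc3": b"no longer referenced"}
hash_to_location = {"h1": "loc1", "h2": "loc2", "h3": "loc3"}
compact(block_store, hash_to_location, trace(key_shard))
assert "loc3" not in block_store and "h3" not in hash_to_location
```

Separating trace from compaction is what lets the work be distributed: each key shard can report its references independently before any block shard removes data.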
- a system to perform garbage collection to compact data.
- the system includes an object store storing a backend object and one or more network-capable servers.
- the system includes circuitry to create a reference map in the object store to indicate which locations within a back-end object stored in the object store are currently referenced, and which locations within the back-end object stored in the object store are no longer referenced.
- the system includes circuitry to alter the back-end object stored in the object store to remove block data from the locations within the back-end object which are no longer referenced, and circuitry to remove entries within a hash-to-location table identifying locations of block data within the back-end object that have been removed.
- FIG. 1 is a simplified schematic diagram of a deduplication storage system using an intermediary networked device to perform deduplication;
- FIG. 2 is a simplified schematic and flow diagram of a storage system in which a client application on a client device communicates through an application program interface (API) directly connected to a cloud object store;
- FIG. 3 is a simplified schematic diagram and flow diagram of a deduplication storage system in which a client application communicates via a network to an application program interface (API) at an intermediary computing device which performs deduplication, and then stores data via a network to a cloud object store.
- Fig. 3A is a simplified schematic diagram and flow diagram of an alternate deduplication storage system in which a client application communicates via a network to a scale-out cluster that includes an application program interface (API) at multiple intermediary computing devices which perform deduplication and then transmit data via a network to be stored in a cloud object store.
- Fig. 3A also shows how the intermediary computing devices can initiate a garbage collection by exchanging messages.
- Fig. 4 is a simplified schematic diagram of an intermediary computing device shown in Fig. 3.
- FIG. 5 is a flow chart of a process for storing and deduplicating data executed by the intermediary computing device shown in Figure 3;
- Fig. 6 is a flow diagram illustrating the process for storing and deduplicating data
- Fig. 7 is a flow diagram illustrating the process for storing and deduplicating data executed on the client device of Fig. 3.
- Fig. 8 is a data diagram illustrating how data is partitioned into blocks for storage.
- Fig. 9 is a data diagram illustrating how the partitioned data blocks are stored in memory.
- Fig. 10 is a data diagram illustrating a relation between a hash and the data blocks that are stored in memory.
- Fig. 11 is a data diagram illustrating the file or object table which maps file or object names to the location addresses where the files are stored.
- Fig. 12 is a data diagram illustrating a garbage collection coordination process for coordinating garbage collection by an arbitrarily selected StorReduce server in Fig. 3A.
- Fig. 13 is a data diagram illustrating a trace process for tracing references in each key shard on StorReduce Servers in Fig. 3A.
- Fig. 14 is a data diagram illustrating a compaction process for compacting data stored in each block shard on StorReduce Servers in Fig. 3A.
- Fig. 15 is a data diagram illustrating a compact data process for compacting data in the cloud object store that provides a more detailed view of the process shown in step 1414 on Fig. 14.
- Storage system 100 includes a client system 102 coupled via network 104 to intermediate computing system 106.
- Intermediate computing system 106 is coupled via network 108 to remotely located file storage system 110.
- Client system 102 transmits data objects to intermediate computing system 106 via network 104.
- Intermediate computing system 106 includes a process for storing the received data objects on file storage system 110 that reduces duplication of the data objects when stored on file storage system 110.
- Client system 102 transmits requests via network 104 to intermediate computing system 106 for data stored on file storage system 110.
- Intermediate computing system 106 responds to the requests by obtaining the deduplicated data from file storage system 110 and transmitting the obtained data to client system 102.
- a storage system 200 that includes a client application 202 on a client device 204 that communicates via a network 206 through an application program interface (API) 203 directly connected to a cloud object store 204.
- the cloud object store may be a non-transitory memory storage device coupled with a server.
- a deduplication storage system 300 including a client application 302 that communicates data via a network 304 to an application program interface (API) 311 at an intermediary computing device 308.
- the data is deduplicated on intermediary computing device 308 and then the unique data is stored via a network 310 and API 311 (API 203 in Fig. 2) on a remotely disposed computing device 312 such as a cloud object store system that may typically be administered by an object store service.
- Exemplary Networks 304 and 310 include, but are not limited to, an Ethernet Local Area Network, a Wide Area Network, a Wireless Local Area Network, an 802.11g standard network, a WiFi network, and a Wireless Wide Area Network running protocols such as GSM (Global System for Mobile communications), WiMAX (Worldwide Interoperability for Microwave Access), or LTE (Long Term Evolution).
- Examples of the intermediary computing device 308 include, but are not limited to, a Physical Server, a personal computing device, a Virtual Server, a Virtual Private Server, a Network Appliance, and a Router/Firewall.
- Exemplary remotely disposed computing device 312 may include, but is not limited to, a Network Fileserver, an Object Store, an Object Store Service, a Network Attached device, a Web server with or without WebDAV.
- Examples of the cloud object store include, but are not limited to, OpenStack Swift, IBM Cloud Object Storage, and Cloudian HyperStore.
- object store service examples include, but are not limited to, Amazon® S3, Microsoft® Azure Blob Service and Google® Cloud Storage.
- Client application 302 transmits a file via network 304 for storage by providing an API endpoint (such as http://my-storreduce.com) 306 corresponding to a network address of the intermediary device 308.
- the intermediary device 308 then deduplicates the file as described herein.
- the intermediary device 308 then stores the deduplicated data on the remotely disposed computing device 312 via API endpoint 311.
- the API endpoint 306 on the intermediary device is virtually identical to the API endpoint 311 on the remotely disposed computing device 312.
- client application 302 transmits a request for the file to the API endpoint 306.
- the intermediary device 308 responds to the request by requesting the deduplicated data from remotely disposed computing device 312 via API endpoint 311.
- the cloud object store 312 and API endpoint 311 accommodate the request by returning the deduplicated data to the intermediate device 308, which then un-deduplicates it.
- the intermediate device 308 via API 306 returns the file to client application 302.
- a deduplication engine on device 308 and a cloud object store on device 312 present the same API to the network.
- the client application 302 uses the same set of operations for storing and retrieving objects.
- the intermediate device 308 is almost transparent to the client application.
- the client application 302 does not need to know that the intermediate API 306 and intermediate device 308 are present.
- the only change for the client application 302 is that the location of the endpoint where it stores data has changed in its configuration (e.g., from http://objectstore to http://mystorreduce).
- the location of the intermediate processing device can be physically close to the client application to reduce the amount of data crossing Network 310 which can be a low bandwidth Wide Area Network.
- in Fig. 3A there is shown an alternate deduplication storage system 300a including a client application 302a that communicates data via a network 304a to a StorReduce scale-out cluster 305.
- Cluster 305 includes an application program interface (API) 306a and a load balancer 308 coupled to server 1 309a through server n 309n.
- Server 1 309a through server n 309n are coupled to cloud object store 312a via network 310a and API 311a.
- Computing device 308 may be a load balancer at exemplary network address http://my-storreduce.
- Servers 309a-309n may be located at exemplary network addresses http://storreduce-1 through http://storreduce-n.
- the data is deduplicated using server 1 309a through server n 309n to determine unique data.
- the unique data determined from the deduplicating process is stored via a network 310a and API 311a (API 203 in Fig. 2) on a remotely disposed computing device 312a such as a public cloud object store system providing an object store service, or a private object store system.
- Exemplary Networks 304a and 310a include, but are not limited to, an Ethernet Local Area Network, a Wide Area Network, an Internet Wireless Local Area Network, an 802.11g standard network, a WiFi network, and a Wireless Wide Area Network running protocols such as GSM, WiMAX, or LTE.
- Examples of the load balancer 308a and servers 309a-309n include, but are not limited to, a Physical Server, a personal computing device, a Virtual Server, a Virtual Private Server, a Network Appliance, and a Router/Firewall.
- Exemplary remotely disposed computing device 312a may include, but is not limited to, a Network Fileserver, an Object Store, an Object Store Service, a Network Attached device, a Web server with or without WebDAV.
- Examples of the cloud object store include, but are not limited to, OpenStack Swift, IBM Cloud Object Storage and Cloudian HyperStore.
- Examples of the object store service include, but are not limited to, Amazon® S3, Microsoft® Azure Blob Storage and Google® Cloud Storage.
- the Client application 302a transmits a file (request 1A) via network 304a for storage by using an API endpoint (such as http://my-storreduce.com) 306a corresponding to a network address of the load balancer 308.
- the load balancer 308 chooses a server to send the request to and forwards the request (1A), in this case to Server 309a.
- This Coordinating Server (309a) will split the file into blocks of data and calculate the hash of each block. Each block will be assigned to a shard based on its hash, and each shard is assigned to one of servers 309a-309n.
- the Coordinating Server will send each block of data to the server (309a to 309n) responsible for that shard, shown as "Key Shard and Block Shard Requests" in the diagram.
- Servers 309a-309n each perform deduplication for the blocks of data sent to them as described herein (step 1B), and store the deduplicated data on the remotely disposed computing device 312a via API endpoint 311a (requests "1C (shard 1)" through to "1C (shard n)" on Fig. 3A).
- the API endpoint 306a on the intermediary device is virtually identical to the API endpoint 311a on the remotely disposed computing device 312a.
- Servers 309a-309n each send location information for their Block data back to the Coordinating Server.
- the Coordinating Server then arranges for this location information to be stored.
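The split-hash-assign step performed by the Coordinating Server can be sketched as follows. The block size, shard count, and the rule for selecting a shard from a hash are illustrative assumptions, not details from the patent.

```python
import hashlib

NUM_SHARDS = 4
BLOCK_SIZE = 8      # illustrative; a real system would use much larger blocks

def split_into_blocks(data: bytes):
    """Split a file's data into fixed-size blocks."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def shard_for(block: bytes) -> int:
    """Assign a block to a shard based on its hash."""
    return hashlib.sha256(block).digest()[0] % NUM_SHARDS

data = b"the quick brown fox jumps over the lazy dog"
blocks = split_into_blocks(data)
assignments = [(shard_for(b), b) for b in blocks]   # what goes to each shard server
assert all(0 <= s < NUM_SHARDS for s, _ in assignments)
assert b"".join(b for _, b in assignments) == data  # blocks still reassemble the file
```

Deriving the shard from the block's own hash means any server computes the same assignment independently, with no central lookup needed to route a block.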
- client application 302a transmits a request (2A) for the file to the API endpoint 306a.
- the load balancer 308 chooses a server to send the request to and forwards the request (2A), in this case to Server 309b.
- This Coordinating Server (309b) will fetch location information for each block in the file, including the shard to which each block of data was assigned.
- the Coordinating server will send a request to fetch each block of data to the server (309a to 309n) responsible for that shard, shown as "Key Shard and Block Shard Requests" in the diagram.
- Servers 309a-309n respond to the Block shard requests by requesting the deduplicated data from remotely disposed computing device 312a via API endpoint 311a (requests "2B (Shard 1)" through to "2B (Shard n)" on Fig. 3A).
- the cloud object store 312a and API endpoint 311a accommodate the requests by returning the deduplicated data to servers 309a-309n (responses "2C (shard 1)" through to "2C (shard n)" on Fig. 3A).
- Servers 309a-309n return the block data to the Coordinating Server (in this case Server 309b).
- the Coordinating server will directly fetch each block of data from remotely disposed computing device 312a via API endpoint 311a.
- the cloud object store 312a and API endpoint 311a accommodate the requests by returning the deduplicated data to the Coordinating server.
- the data is then un-deduplicated by the Coordinating Server.
- the resulting file (2E) is returned to the load balancer (308) which then returns the file via API 306a to client application 302a.
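The read path described above can be sketched with in-memory stand-ins for the key shard and cloud object store. All names and structures here are illustrative assumptions.

```python
# Read-path sketch: fetch the block location list for the requested key,
# retrieve each block, and reassemble (un-deduplicate) the file.
key_shard = {"photo.jpg": ["loc0", "loc1"]}                  # key -> block locations
cloud_object_store = {"loc0": b"header-", "loc1": b"pixels"}

def read_object(key: str) -> bytes:
    locations = key_shard[key]                               # fetch location info
    blocks = [cloud_object_store[loc] for loc in locations]  # fetch each block
    return b"".join(blocks)                                  # reassemble the file

assert read_object("photo.jpg") == b"header-pixels"
```

Note that the reader never sees hashes or shards; the key shard's ordered location list is enough to rebuild the original byte stream.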
- device 309a and the cloud object store on device 312a present the same API to the network.
- the client application 302a uses the same set of operations for storing and retrieving objects.
- the intermediate scale-out cluster 300a is almost transparent to the client application.
- the client application 302a does not need to know that the intermediate API 306a and intermediate scale-out cluster 300a are present.
- the only change for the client application 302a is that the location of the endpoint where it stores data has changed in its configuration (e.g., from http://objectstore to http://mystorreduce).
- the location of the intermediate scale-out cluster 300a can be physically close to the client application to reduce the amount of data crossing network 310a, which can be a low bandwidth Wide Area Network.
- the objects being managed by the system 300a each have an object key, and these keys are used to divide the set of objects into sets known as key shards.
- Each key shard is assigned to a server within the cluster, which is then responsible for managing information for each object in that key shard. In particular, information about the set of blocks which make up the data for the object is managed by the key shard server for that object.
- the unique blocks of data being managed by the system 300a are each identified by their hash, using a cryptographic hash algorithm.
- the hash namespace is divided into subsets known as block shards.
- Each block shard is assigned to a server within the cluster, which is then responsible for operations on blocks whose hashes fall within that subset of the hash namespace.
- the block shard server can answer the question "is this block with this hash new/unique, or do we already have it stored?".
- the block shard server is also responsible for storing and retrieving blocks whose hashes fall within its subset of the hash namespace. During garbage collection the block shard server collects and merges the reference maps from every key shard (as described in Figure 14) and then runs the compaction process (as described in Figure 15) to remove blocks that are no longer referenced.
- Each block shard is responsible for storing blocks into the underlying object store (also known as the 'cloud object store'). Multiple blocks may be grouped together into an aggregate block in which case all blocks in the aggregate block are stored in a single 'file' (object) in the underlying object store.
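Aggregate blocks can be sketched as follows: several deduplicated blocks are packed into one object-store 'file', with per-block offsets recorded so each block can still be read individually. The layout and names are illustrative assumptions, not the patented format.

```python
def write_aggregate(object_store, blocks, name):
    """Pack multiple blocks into a single underlying object, returning offsets."""
    offsets, buf = {}, bytearray()
    for i, block in enumerate(blocks):
        offsets[i] = (len(buf), len(block))   # (offset, length) inside the aggregate
        buf.extend(block)
    object_store[name] = bytes(buf)           # one 'file' holds all the blocks
    return offsets

def read_block(object_store, name, offsets, i):
    """Read one block back out of the aggregate using its recorded offset."""
    off, length = offsets[i]
    return object_store[name][off:off + length]

store = {}
offs = write_aggregate(store, [b"alpha", b"beta", b"gamma"], "agg-0")
assert len(store) == 1                        # a single object in the store
assert read_block(store, "agg-0", offs, 1) == b"beta"
```

Grouping blocks this way reduces the number of objects (and per-request overhead) in the underlying store while keeping individual blocks addressable.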
- each block is hashed and sent to the appropriate block shard, which will look up the block hash, store the block data if it is unique, and return a reference to the block. After all blocks are stored, the references are collated from the various block shards. A key is assigned to the object and the corresponding key shard stores the list of references for the blocks making up the object.
- the key When reading an object back from the system, the key is provided by the client and the corresponding key shard supplies the list of references for the blocks making up the object. For each reference the block data is retrieved from the cloud object store. The data for all blocks is then assembled and returned to the client.
- the key When deleting an object, the key is provided by the client, and the corresponding key shard deletes the information held about this object, including the list of references for the blocks making up the object. No changes are made within the block shards for those blocks.
- each block may or may not still be referenced by other objects, so no blocks are deleted at this stage and no storage space is reclaimed - this is the purpose of the garbage collection process. Deleting an object simply removes that object's references to its data blocks from the key shard for the object.
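The point that deletion alone reclaims no space can be shown concretely. The dictionaries below are illustrative stand-ins for the key shard and block storage.

```python
# Deleting an object removes only its references in the key shard; shared or
# orphaned blocks remain in storage until garbage collection traces the live
# references and compacts away the rest.
key_shard = {
    "report.doc": ["loc1", "loc2"],
    "copy.doc":   ["loc1"],          # shares a deduplicated block with report.doc
}
block_store = {"loc1": b"shared block", "loc2": b"unique block"}

del key_shard["report.doc"]          # delete the object: references only
assert len(block_store) == 2         # no storage reclaimed yet

# A later trace finds loc1 still live (via copy.doc) while loc2 is garbage:
live = {loc for refs in key_shard.values() for loc in refs}
assert live == {"loc1"}
```

This is exactly why the trace phase must consult every key shard before any block is removed: a block unreferenced by one object may still be referenced by another.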
- Illustrated in Fig. 4 are selected modules in computing device 400 using processes 500 and 600, shown in Figs. 5 and 6 respectively, to store and retrieve deduplicated data objects.
- Computing device 400 (such as intermediary computing device 308 shown in Fig. 3 and the intermediary computing devices 309a-n shown in Fig. 3A) includes a processing device 404 and memory 412.
- Computing device 400 may include one or more microprocessors, microcontrollers or any such devices for accessing memory 412 (also referred to as a non-transitory media) and hardware 422.
- Computing device 400 has processing capabilities and memory suitable to store and execute computer- executable instructions.
- Computing device 400 executes instructions stored in memory 412 and, in response thereto, processes signals from hardware 422.
- Hardware 422 may include an optional display 424, an optional input device 426 and an I/O communications device 428.
- I/O communications device 428 may include a network and communication circuitry for communicating with network 304, 310 or an external memory storage device.
- Optional Input device 426 receives inputs from a user of the computing device 400 and may include a keyboard, mouse, track pad, microphone, audio input device, video input device, or touch screen display.
- Optional display device 424 may include an LED, LCD, CRT or any type of display device to enable the user to preview information being stored or processed by computing device 400.
- Memory 412 may include volatile and nonvolatile memory, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules or other data.
- Such memory includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, RAID storage systems, or any other medium which can be used to store the desired information, and which can be accessed by a computer system.
- Operating system 414 may be used by application 420 to control hardware and various software components within computing device 400.
- the operating system 414 may include drivers for device 400 to communicate with I/O communications device 428.
- a database or library 418 may include preconfigured parameters (or parameters set by the user before or after initial operation) such as server operating parameters, server libraries, HTML libraries, APIs and configurations.
- An optional graphic user interface or command line interface 423 may be provided to enable application 420 to communicate with display 424.
- Application 420 includes a receiver module 430, a partitioner module 432, a hash value creator module 434, determiner/comparer module 438 and a storing module 436.
- the receiver module 430 includes instructions to receive one or more files via the network 304 from the remotely disposed computing device 302.
- the partitioner module 432 includes instructions to partition the one or more received files into one or more data objects.
- the hash value creator module 434 includes instructions to create one or more hash values for the one or more data objects. Exemplary algorithms to create hash values include, but are not limited to, MD2, MD4, MD5, SHA1, SHA2, SHA3, RIPEMD, WHIRLPOOL, SKEIN, Buzhash, Cyclic Redundancy Checks (CRCs), CRC32, CRC64, and Adler-32.
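Several of the listed algorithms are available in Python's standard library; the short check below confirms their digest sizes. This is a generic illustration of computing hash values for a data block, not the module's actual code.

```python
import hashlib
import zlib

block = b"example data object"
# Cryptographic hashes from the list above, via the standard library:
assert hashlib.md5(block).digest_size == 16       # MD5: 128-bit digest
assert hashlib.sha1(block).digest_size == 20      # SHA1: 160-bit digest
assert hashlib.sha256(block).digest_size == 32    # a SHA2 family member
# A non-cryptographic checksum from the same list:
crc = zlib.crc32(block)
assert 0 <= crc < 2**32                           # CRC32 is a 32-bit value
```

For deduplication, a cryptographic hash such as SHA2 is the usual choice, since accidental collisions would silently map two different blocks to one stored copy.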
- the determiner/comparer module 438 includes instructions to determine, in response to a receipt from a networked computing device (e.g. device hosting application 302) of one of the one or more additional files that include one or more data objects, if the one or more data objects are identical to one or more data objects previously stored on the one or more remotely disposed storage systems (e.g. device 312) by comparing one or more hash values for the one or more data objects against one or more hash values stored in one or more records of the storage table.
- the storing module 436 includes instructions to store the one or more data objects on one or more remotely disposed storage systems (such as remotely disposed computing device 312 using API 311) at one or more location addresses, and instructions to store in one or more records of a storage table, for each of the one or more data objects, the one or more hash values and a corresponding one or more location addresses.
- the storing module also includes instructions to store, in one or more records of the storage table, for each of the received one or more data objects that are identical to one or more data objects previously stored on the one or more remotely disposed storage systems (e.g. device 312), the one or more hash values and the corresponding one or more location addresses of the received data objects, without storing the received duplicate data objects again on the one or more remotely disposed storage systems.
- Illustrated in Figs. 5 and 6 are exemplary processes 500 and 600 for deduplicating storage across a network.
- Such exemplary processes 500 and 600 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer- executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the process.
- the processes are described with reference to Fig. 4, although they may be implemented in other system architectures.
- process 500 executed by a deduplication application 420 (See Fig. 4) (hereafter also referred to as "application 420") is shown.
- application 420 when executed by the processing devices, uses the processor 404 and modules 416-438 shown in Fig. 4.
- application 420 in computing device 308 receives one or more files via network 304 from a remotely disposed computing device (e.g. device hosting application 302).
- application 420 divides the received files into data objects, creates hash values for the data objects or portions thereof, and stores the hash values into a storage table in memory on intermediate computing device (e.g. an external computing device, or system 312).
- application 420 stores the one or more files via the network 310 onto a remotely disposed storage system 312 via API 311.
- an API within system 312 stores within records of the storage table disposed on system 312 the hash values and corresponding location addresses.
- application 420 stores in one or more records of a storage table disposed on the intermediate device 308 or a secondary remote storage system (not shown) for each of the one or more data objects the one or more hash values and a corresponding one or more network location addresses.
- Application 420 also stores in a file table (Fig. 11) the names of the files received in block 502 and the location addresses created at block 505.
- the one or more records of a storage table are stored for each of the one or more data objects the one or more hash values and a corresponding one or more location addresses of the data object without storage of an identical data object on the one or more remotely disposed storage systems.
- the one or more hash values are transmitted to the remotely disposed storage systems for storage with the one or more data objects.
- the hash value and a corresponding one or more new location addresses may be stored in the one or more records of the storage table.
- the one or more data objects may be stored on one or more remotely disposed storage systems at one or more location addresses with the one or more hash values.
- application 420 receives from the networked computing device another of the one or more files.
- application 420 determines if the one or more data objects were previously stored on one or more remotely disposed storage systems 312 by comparing one or more hash values for the data object against one or more hash values stored in one or more records of the storage table.
- the application 420 may deduplicate data objects previously stored on any storage system by including instructions that read one or more files stored on the remotely disposed storage system, divide the one or more files into one or more data objects, and create one or more hash values for the one or more file data objects.
- application 420 may store the one or more data objects on one or more remotely disposed storage systems at one or more location addresses, store in one or more records of the storage table, for each of the one or more data objects, the one or more hash values and a corresponding one or more location addresses, and in response to the receipt from the networked computing device of the another of the one or more files including the one or more data objects, determine if the one or more data objects were previously stored on one or more remotely disposed storage systems by comparing one or more hash values for the data object against one or more hash values stored in one or more records of the storage table.
- the filenames of the files are stored in the file table (Fig. 11) along with the location addresses of the duplicate data objects (from the first files) and the location addresses of the unique data objects from the files.
- In FIG. 6 there is shown an alternate embodiment of a system architecture diagram illustrating a process 600 for storing data objects with deduplication.
- Process 600 may be implemented using an application 420 in intermediate computing device 308 shown in Fig. 3.
- the process includes an application (such as application 420) that receives a request to store an object (e.g., a file) from a client (e.g., the "Client System" in Fig. 1 ).
- the request typically consists of an object key (e.g., a filename), the object data (a stream of bytes) and some metadata.
- the application splits the stream of data into blocks, using a block splitting algorithm.
- the block splitting algorithm could generate variable length blocks like the algorithm described in U.S. patent number 5,990,810 (which is hereby incorporated by reference) or, could generate fixed length blocks of a predetermined size, or could use some other algorithm that produces blocks that have a high probability of matching already stored blocks.
- when a block boundary is found in the data stream, a block is emitted to the next stage. The block could be almost any size.
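As a minimal illustration, the fixed-length variant mentioned above could look like the sketch below (the variable-length algorithm of U.S. 5,990,810 is considerably more involved); the function name and block size are assumptions for illustration only:

```python
def split_fixed(data: bytes, block_size: int = 4096):
    """Split a byte stream into fixed-length blocks; the final
    block may be shorter than block_size."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

Any splitter works for the stages that follow, provided it tends to produce the same blocks for the same data, so that duplicates can be detected.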
- each block is hashed using a cryptographic hash algorithm like MD5, SHA1 or SHA2 (or one of the other algorithms previously mentioned).
- the constraint is that there must be a very low probability that the hashes of different blocks are the same.
- each data block hash is looked up in a table mapping block hashes that have already been encountered to data block locations in the cloud object store (e.g. a hash-to-location table). If the hash is found, then that block location is recorded, the data block is discarded and block 616 is run. If the hash is not found in the table, then the data block is compressed in block 610 using a lossless text compression algorithm (e.g., algorithms described in Deflate U.S. Patent 5,051,745, or LZW U.S. Patent 4,558,302, the contents of which are hereby incorporated by reference).
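A minimal sketch of the lookup-then-compress path of blocks 608 - 610, assuming an in-memory dict for the hash-to-location table, SHA-256 as the hash (the text permits MD5, SHA1 or SHA2), and zlib standing in for the Deflate compression; the sequential location counter is an assumption for illustration:

```python
import hashlib
import zlib

hash_to_location = {}  # block hash -> block location in the object store
next_location = 0      # assumed: locations handed out sequentially

def store_block(block: bytes):
    """Return the location of `block`, storing it only if its hash
    has not been encountered before (blocks 606-610 of process 600)."""
    global next_location
    digest = hashlib.sha256(block).digest()
    if digest in hash_to_location:       # duplicate: record location, discard data
        return hash_to_location[digest]
    compressed = zlib.compress(block)    # lossless (Deflate) compression
    location = next_location
    # ... write `compressed` to the cloud object store at `location` ...
    hash_to_location[digest] = location
    next_location += 1
    return location
```

Storing two identical blocks thus yields the same location while writing the block data only once.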
- the data blocks are optionally aggregated into a sequence of larger aggregated data blocks to enable efficient storage.
- the blocks (or aggregate blocks) are then stored into the underlying object store 618 (the cloud object store).
- the hash-to-location table is updated, adding the hash of each block and its location in the cloud object store 618.
- the hash-to-location table (referenced here and in block 608) is stored in a database (e.g. database 620) that is in turn stored in fast, unreliable storage directly attached to the computer receiving the request.
- the block location takes the form of either the number of the aggregate block stored in block 614, the offset of the block in the aggregate, and the length of the block; or, the number of the block stored in block 614.
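The first location form described above (aggregate number, offset within the aggregate, block length) might be modeled as, for example (the type and field names are assumptions):

```python
from typing import NamedTuple

class BlockLocation(NamedTuple):
    """Location of a block inside an aggregate block (see block 614)."""
    aggregate_number: int  # which aggregate block in the object store
    offset: int            # byte offset of the block within the aggregate
    length: int            # length of the block in bytes
```

The second form, for unaggregated blocks, would reduce to the block number alone.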
- the list of network locations from blocks 608 - 614 may be stored in the object-key-to-location table (Fig. 11), in fast, unreliable storage directly attached to the computer receiving the request.
- the object key and block locations are stored into the cloud object store 618 using the same monotonically increasing naming scheme as the block records.
- Each file sent to the system is identified by an Object Key.
- the Object-Key-to-Location table contains a list of locations for the blocks making up the file. Each of these Locations is known as a 'reference' to the corresponding block.
- the hash-to-location table is independent of the object-key-to-location table.
- a response is then transmitted to the client device (mentioned in block 602) indicating that the data object has been stored, and the process may then revert to block 602.
- exemplary process 700 implemented by the client application 302 (See Fig. 3) for deduplicating storage across a network.
- Such exemplary process 700 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- client application 302 prepares a request for transmission to intermediate computing device 308 to store a data object.
- client application 302 transmits the data object to intermediate computing device 308 for storage.
- process 500 or 600 is executed by device 308 to store the data object.
- the client application receives a response notification from the intermediate computing system indicating the data object has been stored.
- In FIG. 8, an exemplary aggregate data object 800 as produced by block 612 is shown.
- the data object includes a header 802n - 802nm, with a block number 804n - 804nm and an offset indication 806n - 806nm, and includes a data block.
- the data objects 902a - 902n each include the header (e.g. 904a) (as described in connection with Fig. 8) and a data block (e.g. 906a).
- In FIG. 10, an exemplary relation between the hashes (e.g. H1 - H8) (which are stored in a separate deduplication table) and two separate data objects D1 and D2 is shown. Portions within blocks B1 - B4 of file D1 are shown with hashes H1 - H4, and portions within blocks B1, B2, B4, B6, B7, and B8 of file D2 are shown with hashes H1, H2, H4, H6, H7, and H8 respectively. It is noted that portions of data objects having the same hash value are only stored in memory once, with their location of storage within memory recorded in the deduplication table along with the hash value.
- a table 1100 is shown with filenames ("Filename 1" - "Filename N") of the files stored in the file table along with the network location addresses of the files' data objects.
- Exemplary data objects of Filename 1 are stored at network location addresses 1-5.
- Exemplary data objects of Filename 2 are stored at location addresses 6, 7, 3, 4, 8 and 9.
- The data objects of "Filename 2" stored at location addresses 3 and 4 are shared with "Filename 1".
- "Filename 3" is a clone of "Filename 1", sharing the data objects at location addresses 1, 2, 3, 4 and 5.
- "Filename N" shares data objects with "Filename 1" and "Filename 2" at location addresses 7, 3 and 9.
- exemplary process 1200 implemented by servers 309a-309n (See Fig. 3a) and garbage collection coordinator module 438 (Fig. 4) for deduplicating storage and garbage collection across a network.
- Garbage collection coordinator module 438 in one of servers 309a-309n is nominated to orchestrate the garbage collection process by whichever server the load balancer happened to forward the 'start garbage collection' request to.
- This will be abbreviated to "GC Coordinator" in the following text and in Figures 12 to 15.
- Such exemplary process 1200 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the process. For discussion purposes, the process is described with reference to Fig. 3a, although it may be implemented in other system architectures.
- Each key shard is allocated to a specific server from 309a to 309n, known as the key shard server for that shard.
- Each block shard is allocated to a specific server from 309a to 309n, known as the block shard server for that shard.
- the message is actually sent to the key shard server or block shard server (309a-309n) for that shard, and then the message is internally routed to the key shard component or block shard component for the shard within that server.
- a reference map is a data structure used to record a set of references to specific block locations, to determine which blocks are 'in-use', versus those able to be deleted. A variety of data structures can be used to implement the reference map.
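As one possible implementation (the text notes that a variety of data structures can be used), a reference map could be a bitmap over the shard's block location range, where a set bit marks an in-use location; the class below is a sketch, not the patented structure:

```python
class ReferenceMap:
    """Bitmap over a block-location range [start, end): bit i set
    means the block at location (start + i) is still referenced."""

    def __init__(self, start: int, end: int):
        self.start = start
        self.bits = bytearray((end - start + 7) // 8)

    def add(self, location: int):
        """Record a reference to `location`."""
        i = location - self.start
        self.bits[i // 8] |= 1 << (i % 8)

    def contains(self, location: int) -> bool:
        """True if `location` is referenced (must not be deleted)."""
        i = location - self.start
        return bool(self.bits[i // 8] & (1 << (i % 8)))
```

A bitmap keeps the map compact (one bit per location) and, as noted later in the text, lets partial maps be merged with a bitwise OR.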
- the GC coordinator sends a message to each key shard to begin a trace operation for that key shard.
- Each request will include the block range information for every block shard.
- the trace operation will find all references to blocks that should prevent those blocks from being deleted, across all block shards.
- an incoming request to Start Garbage Collection arrives into the scale-out cluster, via the Load Balancer.
- each block shard in servers 309a-309n is messaged to prepare for garbage collection (see 1402).
- the GC coordinator waits for an 'acknowledge ready for garbage collection' message to be received from each block shard (see 1406).
- This message will include a block range for the shard.
- each key shard (in servers 309a-309n) is sent a message to begin a trace (see 1302) and in block 1208, the coordinator waits for an acknowledgement from each key shard that the trace is complete.
- the coordinator sends a message to each block shard to perform compaction (see 1414).
- the coordinator waits for an acknowledgement from each block shard that compaction has been complete (see 1416).
- exemplary process 1300 implemented by key shard modules in servers 309a-309n (Fig. 3a) for performing a trace operation during a garbage collection process across a network.
- Such exemplary process 1300 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the key shard server performs the following trace process:
- a partial reference map is created for each block shard, to record the references found.
- the location of each block that is referenced (i.e. still used) as part of a file is recorded in the reference map.
- the aim is to find blocks that are no longer referenced so they can be deleted.
- the key shard server traces through every entry in the object-key-to-location table for every shard, and collects all the references.
- the references can be compared with the list of blocks being managed to find blocks that are no longer needed (because the files that used to reference them have been removed).
- each partial reference map is sent to its corresponding block shard server.
- After all reference maps have been sent, the key shard server responds to the GC coordinator, acknowledging that the trace operation is complete for that key shard.
- the key shard reads the partial reference map for each block shard and sends each partial reference map to the corresponding block shard (see 1410).
- an acknowledgement that the trace is complete is sent to the garbage collection coordinator (see 1208). Once all trace operations have been completed, the Garbage Collection coordinator can begin compaction operations.
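The trace performed by each key shard (blocks 1302 - 1304) can be sketched as follows; the modulo placement rule mapping locations to block shards is an assumption for illustration, as the text does not specify how locations are assigned to shards:

```python
def trace_key_shard(object_key_to_location, num_block_shards):
    """Walk every entry in this key shard's object-key-to-location
    table and build one partial reference map per block shard (here
    modeled simply as a set of referenced locations)."""
    partial_maps = [set() for _ in range(num_block_shards)]
    for key, locations in object_key_to_location.items():
        for loc in locations:
            shard = loc % num_block_shards  # assumed placement rule
            partial_maps[shard].add(loc)
    return partial_maps
```

Each partial map would then be sent to its corresponding block shard server for merging.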
- exemplary process 1400 implemented by block shard modules in servers 309a-309n (Fig. 3a) for performing a compaction operation during a garbage collection process across a network.
- Such exemplary process 1400 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the corresponding block shard server performs the following process:
- the current maximum block location for the shard is recorded. This defines the block location range for this shard, which is the set of block locations that will be covered by this GC operation.
- the block shard server responds to the GC coordinator, acknowledging that it is now ready for GC and providing information about the block range covered by this GC operation.
- the block shard server will receive partial reference maps from each key server containing the results of that key server's trace operation. Each incoming partial reference map is merged with the existing reference map for the block shard, contributing more references to blocks. Once the partial reference maps from all key shard servers have been received and merged, the resulting map will contain an exhaustive list of references to blocks in this block shard (within the block location range).
- the block shard module waits for an incoming message from the GC Coordinator and defines a block location range for this garbage collection run, referencing the hash-to-location table.
- the block shard module creates an empty reference map in the reference map table, and in block 1406 the block shard module sends an acknowledgement to the GC Coordinator.
- the block shard module waits for incoming partial reference maps from each key shard (see 1304), and then, in block 1410, merges each incoming partial reference map into the existing reference map for the shard.
- the merge operation is implemented by performing a bitwise OR operation on each corresponding bit in the two bitmaps to merge the two sets of references.
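The merge described above can be sketched over raw byte buffers; a bit set in either input is set in the result, so no reference from any key shard is lost:

```python
def merge_bitmaps(existing: bytearray, incoming: bytes) -> bytearray:
    """Merge an incoming partial reference map into the existing
    reference map by OR-ing corresponding bytes in place."""
    for i, b in enumerate(incoming):
        existing[i] |= b
    return existing
```

OR is commutative and associative, so partial maps may arrive from the key shard servers in any order without affecting the final merged map.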
- exemplary process 1500 implemented by block shard modules in servers 309a-309n (Fig. 3a) for compacting data in the Cloud Object Store during a compaction operation, specifically for block 1414 (Fig. 14) of the garbage collection process.
- Such exemplary process 1500 may be a collection of blocks in a logical flow diagram, which represents a sequence of operations that can be implemented in hardware, software, or a combination thereof.
- the blocks represent computer-executable instructions that, when executed by one or more processors, perform the recited operations.
- computer-executable instructions include routines, programs, objects, components, data structures, and the like that perform particular functions or implement particular abstract data types.
- the order in which the operations are described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order and/or in parallel to implement the process. For discussion purposes, the process is described with reference to Fig. 3a and Fig. 14, although it may be implemented in other system architectures.
- For each block shard, the block shard server performs the following compaction process: the block shard server iterates through each back-end object in the Cloud Object Store managed by the shard. Each back-end object can contain one or more blocks of data, and therefore can span multiple locations within the block shard.
- Each back-end object may be compacted using the following process:
- the reference map is examined to determine which of the locations within the back-end object are referenced, and which locations are no longer referenced.
- the back-end object is altered in the Cloud Object Store to remove the block data from locations which are no longer referenced. Only block data which is still referenced will remain.
- the hash-to-location table is updated to remove the entries for blocks that have been removed during the compaction process.
- the reference map for the block shard can be deleted.
- the block shard server responds to the GC coordinator acknowledging that the compaction operation is completed for this block shard.
- In block 1504, a determination is made as to which blocks in the back-end object are still referenced, using information from the hash-to-location table and the reference map.
- the back-end objects are modified or re-written into the cloud object store to remove unused blocks.
- Back end objects may be modified, or may be re-written by writing a new version of the object that replaces the old version.
- the new version of the object omits the data blocks which are no longer required.
- If a back-end object contains exemplary blocks 1, 2, 3, 4, 5 and 6, and the system determines that blocks 3 and 4 are no longer referenced and can be deleted, then the system will re-write the back-end object so that it contains only blocks 1, 2, 5 and 6. This changes the offset within the back-end object at which blocks 5 and 6 are stored; they are now closer to the start of the back-end object. The offset of blocks 1 and 2 does not change.
- the amount of storage required for the back-end object is reduced because it no longer contains blocks 3 and 4.
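A sketch of this re-write, using the example blocks above; the block contents are invented for illustration:

```python
def compact_backend_object(blocks, referenced):
    """Re-write a back-end object keeping only referenced blocks,
    and compute each surviving block's new offset in the object.
    `blocks` is a list of (block_number, data) pairs in storage order."""
    new_object = b""
    new_offsets = {}
    for number, data in blocks:
        if number in referenced:
            new_offsets[number] = len(new_object)
            new_object += data
    return new_object, new_offsets

# The example from the text: blocks 3 and 4 are unreferenced.
blocks = [(1, b"AA"), (2, b"BB"), (3, b"CC"), (4, b"DD"), (5, b"EE"), (6, b"FF")]
obj, offsets = compact_backend_object(blocks, {1, 2, 5, 6})
# blocks 1 and 2 keep offsets 0 and 2; blocks 5 and 6 move to offsets 4 and 6
```

Because offsets of surviving blocks can change, the hash-to-location table must be updated for the moved blocks as well as purged of the deleted ones.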
- Each location is an offset within a particular back-end object (for example, shard 5, object number 1,234,567, offset 20,000 bytes from the start of the object). In one implementation this is the location where the bytes making up the data block are stored within the object store.
- the hash-to-location table is updated to remove entries for blocks which have been removed from the Cloud Object Store.
- In block 1512, a determination is made as to whether more back-end objects exist within the block location range for this compact data process. If there are more back-end objects, blocks 1504 - 1508 are repeated. If there are no more objects, then this process completes.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662427353P | 2016-11-29 | 2016-11-29 | |
US15/825,073 US20180107404A1 (en) | 2015-11-02 | 2017-11-28 | Garbage collection system and process |
PCT/US2017/063673 WO2018102392A1 (en) | 2016-11-29 | 2017-11-29 | Garbage collection system and process |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3532939A1 true EP3532939A1 (en) | 2019-09-04 |
EP3532939A4 EP3532939A4 (en) | 2020-06-17 |
Family
ID=62242710
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP17876888.3A Withdrawn EP3532939A4 (en) | 2016-11-29 | 2017-11-29 | Garbage collection system and process |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP3532939A4 (en) |
CN (1) | CN110226153A (en) |
WO (1) | WO2018102392A1 (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597070B (en) * | 2020-11-16 | 2022-10-21 | 新华三大数据技术有限公司 | Object recovery method and device |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7085789B1 (en) * | 2001-07-30 | 2006-08-01 | Microsoft Corporation | Compact garbage collection tables |
US20060173939A1 (en) * | 2005-01-31 | 2006-08-03 | Baolin Yin | Garbage collection and compaction |
US20080270436A1 (en) * | 2007-04-27 | 2008-10-30 | Fineberg Samuel A | Storing chunks within a file system |
US8880775B2 (en) * | 2008-06-20 | 2014-11-04 | Seagate Technology Llc | System and method of garbage collection in a memory device |
US8224875B1 (en) * | 2010-01-05 | 2012-07-17 | Symantec Corporation | Systems and methods for removing unreferenced data segments from deduplicated data systems |
US20120159098A1 (en) * | 2010-12-17 | 2012-06-21 | Microsoft Corporation | Garbage collection and hotspots relief for a data deduplication chunk store |
US9489133B2 (en) * | 2011-11-30 | 2016-11-08 | International Business Machines Corporation | Optimizing migration/copy of de-duplicated data |
US8930648B1 (en) * | 2012-05-23 | 2015-01-06 | Netapp, Inc. | Distributed deduplication using global chunk data structure and epochs |
US9208080B2 (en) * | 2013-05-30 | 2015-12-08 | Hewlett Packard Enterprise Development Lp | Persistent memory garbage collection |
US9268806B1 (en) * | 2013-07-26 | 2016-02-23 | Google Inc. | Efficient reference counting in content addressable storage |
- 2017
- 2017-11-29 WO PCT/US2017/063673 patent/WO2018102392A1/en unknown
- 2017-11-29 EP EP17876888.3A patent/EP3532939A4/en not_active Withdrawn
- 2017-11-29 CN CN201780073649.8A patent/CN110226153A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
EP3532939A4 (en) | 2020-06-17 |
WO2018102392A1 (en) | 2018-06-07 |
CN110226153A (en) | 2019-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11797510B2 (en) | Key-value store and file system integration | |
US11868312B2 (en) | Snapshot storage and management within an object store | |
US9792306B1 (en) | Data transfer between dissimilar deduplication systems | |
US10852976B2 (en) | Transferring snapshot copy to object store with deduplication preservation and additional compression | |
US20170300550A1 (en) | Data Cloning System and Process | |
US9558073B2 (en) | Incremental block level backup | |
US20180107404A1 (en) | Garbage collection system and process | |
US11797477B2 (en) | Defragmentation for objects within object store | |
US11016943B2 (en) | Garbage collection for objects within object store | |
US9501365B2 (en) | Cloud-based disaster recovery of backup data and metadata | |
US20190012091A1 (en) | Deduplicating data based on boundary identification | |
US9928210B1 (en) | Constrained backup image defragmentation optimization within deduplication system | |
TWI534614B (en) | Data deduplication | |
JP5485866B2 (en) | Information management method and information providing computer | |
US11287994B2 (en) | Native key-value storage enabled distributed storage system | |
US20180060348A1 (en) | Method for Replication of Objects in a Cloud Object Store | |
US20140201168A1 (en) | Deduplication in an extent-based architecture | |
US20140172928A1 (en) | Extent-based storage architecture | |
US11036394B2 (en) | Data deduplication cache comprising solid state drive storage and the like | |
US10437682B1 (en) | Efficient resource utilization for cross-site deduplication | |
US20220138048A1 (en) | Data connector component for implementing management requests | |
US8918378B1 (en) | Cloning using an extent-based architecture | |
US11567913B2 (en) | Method and system for improving efficiency in the management of data references | |
US20170124107A1 (en) | Data deduplication storage system and process | |
EP3532939A1 (en) | Garbage collection system and process |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase | Free format text: ORIGINAL CODE: 0009012 |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
| 17P | Request for examination filed | Effective date: 20190528 |
| AK | Designated contracting states | Kind code of ref document: A1; Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
| AX | Request for extension of the european patent | Extension state: BA ME |
| DAV | Request for validation of the european patent (deleted) | |
| DAX | Request for extension of the european patent (deleted) | |
| STAA | Information on the status of an ep patent application or granted ep patent | Free format text: STATUS: THE APPLICATION HAS BEEN WITHDRAWN |
| 18W | Application withdrawn | Effective date: 20200512 |
| A4 | Supplementary search report drawn up and despatched | Effective date: 20200519 |
| RIC1 | Information provided on ipc code assigned before grant | Ipc: G06F 12/02 20060101AFI20200513BHEP; Ipc: G06F 3/06 20060101ALI20200513BHEP |