US9244623B1 - Parallel de-duplication of data chunks of a shared data object using a log-structured file system - Google Patents

Parallel de-duplication of data chunks of a shared data object using a log-structured file system

Info

Publication number
US9244623B1
Authority
US
United States
Prior art keywords
data chunk
duplication
log
node
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active, expires
Application number
US13/799,325
Inventor
John M. Bent
Sorin Faibish
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
EMC Corp
Original Assignee
EMC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by EMC Corp filed Critical EMC Corp
Priority to US13/799,325 priority Critical patent/US9244623B1/en
Assigned to EMC CORPORATION reassignment EMC CORPORATION ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: FAIBISH, SORIN, BENT, JOHN M.
Application granted granted Critical
Publication of US9244623B1 publication Critical patent/US9244623B1/en
Assigned to CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT reassignment CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT SECURITY AGREEMENT Assignors: ASAP SOFTWARE EXPRESS, INC., AVENTAIL LLC, CREDANT TECHNOLOGIES, INC., DELL INTERNATIONAL L.L.C., DELL MARKETING L.P., DELL PRODUCTS L.P., DELL SOFTWARE INC., DELL SYSTEMS CORPORATION, DELL USA L.P., EMC CORPORATION, EMC IP Holding Company LLC, FORCE10 NETWORKS, INC., MAGINATICS LLC, MOZY, INC., SCALEIO LLC, SPANNING CLOUD APPS LLC, WYSE TECHNOLOGY L.L.C.
Assigned to THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT reassignment THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT SECURITY AGREEMENT Assignors: ASAP SOFTWARE EXPRESS, INC., AVENTAIL LLC, CREDANT TECHNOLOGIES, INC., DELL INTERNATIONAL L.L.C., DELL MARKETING L.P., DELL PRODUCTS L.P., DELL SOFTWARE INC., DELL SYSTEMS CORPORATION, DELL USA L.P., EMC CORPORATION, EMC IP Holding Company LLC, FORCE10 NETWORKS, INC., MAGINATICS LLC, MOZY, INC., SCALEIO LLC, SPANNING CLOUD APPS LLC, WYSE TECHNOLOGY L.L.C.
Assigned to EMC IP Holding Company LLC reassignment EMC IP Holding Company LLC ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: EMC CORPORATION
Assigned to THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A. reassignment THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A. SECURITY AGREEMENT Assignors: CREDANT TECHNOLOGIES, INC., DELL INTERNATIONAL L.L.C., DELL MARKETING L.P., DELL PRODUCTS L.P., DELL USA L.P., EMC CORPORATION, EMC IP Holding Company LLC, FORCE10 NETWORKS, INC., WYSE TECHNOLOGY L.L.C.
Assigned to THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A. reassignment THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A. SECURITY AGREEMENT Assignors: CREDANT TECHNOLOGIES INC., DELL INTERNATIONAL L.L.C., DELL MARKETING L.P., DELL PRODUCTS L.P., DELL USA L.P., EMC CORPORATION, EMC IP Holding Company LLC, FORCE10 NETWORKS, INC., WYSE TECHNOLOGY L.L.C.
Assigned to DELL SYSTEMS CORPORATION, DELL MARKETING L.P., MAGINATICS LLC, CREDANT TECHNOLOGIES, INC., WYSE TECHNOLOGY L.L.C., DELL PRODUCTS L.P., DELL INTERNATIONAL, L.L.C., FORCE10 NETWORKS, INC., DELL SOFTWARE INC., SCALEIO LLC, DELL USA L.P., EMC CORPORATION, ASAP SOFTWARE EXPRESS, INC., EMC IP Holding Company LLC, AVENTAIL LLC, MOZY, INC. reassignment DELL SYSTEMS CORPORATION RELEASE BY SECURED PARTY (SEE DOCUMENT FOR DETAILS). Assignors: CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH
Assigned to DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO FORCE10 NETWORKS, INC. AND WYSE TECHNOLOGY L.L.C.), EMC IP HOLDING COMPANY LLC (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MOZY, INC.), DELL PRODUCTS L.P., DELL MARKETING L.P. (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO CREDANT TECHNOLOGIES, INC.), SCALEIO LLC, DELL INTERNATIONAL L.L.C., EMC CORPORATION (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MAGINATICS LLC), DELL USA L.P., DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO ASAP SOFTWARE EXPRESS, INC.) reassignment DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO FORCE10 NETWORKS, INC. AND WYSE TECHNOLOGY L.L.C.) RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001) Assignors: THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT
Assigned to DELL INTERNATIONAL L.L.C., DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO FORCE10 NETWORKS, INC. AND WYSE TECHNOLOGY L.L.C.), DELL USA L.P., EMC IP HOLDING COMPANY LLC (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MOZY, INC.), DELL PRODUCTS L.P., DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO ASAP SOFTWARE EXPRESS, INC.), EMC CORPORATION (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MAGINATICS LLC), DELL MARKETING L.P. (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO CREDANT TECHNOLOGIES, INC.), SCALEIO LLC reassignment DELL INTERNATIONAL L.L.C. RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001) Assignors: THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G06F 3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/174 Redundancy elimination performed by the file system
    • G06F 16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • G06F 16/1752 De-duplication implemented within the file system based on file chunks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/18 File system types
    • G06F 16/182 Distributed file systems
    • G06F 16/1824 Distributed file systems implemented using Network-attached Storage [NAS] architecture
    • G06F 17/30197
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0614 Improving the reliability of storage systems
    • G06F 3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0673 Single storage device
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06F 3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Abstract

Parallel de-duplication of data chunks being written to a shared object is provided. A client executing on a compute node or a burst buffer node in a parallel computing system stores a data chunk to a shared data object on a storage node by processing the data chunk to obtain a de-duplication fingerprint; comparing the de-duplication fingerprint to de-duplication fingerprints of other data chunks; and providing original data chunks to the storage node that stores the shared object. A reference to an original data chunk can be stored when the de-duplication fingerprint matches another data chunk. The client and storage node may employ Log-Structured File techniques. A storage node stores a data chunk in the shared object by receiving only an original version of the data chunk from a compute node; and storing the original version of the data chunk to the shared data object on the storage node as a shared object.

Description

FIELD
The present invention relates to parallel storage in high performance computing environments.
BACKGROUND
Parallel storage systems are widely used in many computing environments. Parallel storage systems provide high degrees of concurrency in which many distributed processes within a parallel application simultaneously access a shared file namespace.
Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations. For example, the Department of Energy uses a large number of distributed compute nodes tightly coupled into a supercomputer to model physics experiments. In the oil and gas industry, parallel computing techniques are often used for computing geological models that help predict the location of natural resources. Generally, each parallel process generates a portion, referred to as a data chunk, of a shared data object.
De-duplication is a common technique to reduce redundant data by eliminating duplicate copies of repeating data. De-duplication improves storage utilization and also reduces the number of bytes that must be sent in network data transfers. Typically, unique chunks of data are identified and their "fingerprints" are stored during an analysis process. As the analysis progresses, the fingerprints of other chunks are compared to those already stored and, when a match is detected, the redundant chunk is replaced with a reference that points to the stored chunk.
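To make the mechanism concrete, the following minimal Python sketch stores the first copy of each chunk and replaces repeats with references. The choice of SHA-256 as the fingerprint function and the in-memory dictionary are illustrative assumptions; the patent does not prescribe a particular hash or index structure.
```python
import hashlib

# Illustrative in-memory fingerprint index: fingerprint -> stored chunk.
fingerprint_store: dict[str, bytes] = {}

def write_chunk(chunk: bytes) -> tuple[str, str]:
    """Store an original chunk, or a reference to an identical stored chunk."""
    fp = hashlib.sha256(chunk).hexdigest()  # the de-duplication fingerprint
    if fp in fingerprint_store:
        return ("ref", fp)                  # redundant chunk: reference only
    fingerprint_store[fp] = chunk           # first copy is stored as-is
    return ("data", fp)
```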
Existing approaches de-duplicate the shared data object after it has been sent to the storage system. The de-duplication is applied to offset ranges on the shared data object in sizes that are pre-defined by the file system.
In parallel computing systems, such as High Performance Computing (HPC) applications, the inherently complex and large datasets increase the resources required for data storage and transmission. A need therefore exists for parallel techniques for de-duplicating data chunks being written to a shared object.
SUMMARY
Embodiments of the present invention provide improved techniques for parallel de-duplication of data chunks being written to a shared object. In one embodiment, a method is provided for a client executing on one or more of a compute node and a burst buffer node in a parallel computing system to store a data chunk generated by the parallel computing system to a shared data object on a storage node in the parallel computing system by processing the data chunk to obtain a de-duplication fingerprint; comparing the de-duplication fingerprint to de-duplication fingerprints of other data chunks; and providing original data chunks to the storage node that stores the shared object. In addition, a reference to an original data chunk can be stored when the de-duplication fingerprint matches a de-duplication fingerprint of another data chunk.
The client may be embodied, for example, as a Log-Structured File System (LSFS) client, and the storage node may be embodied, for example, as a Log-Structured File server.
According to another aspect of the invention, a storage node in a parallel computing system stores a data chunk as part of a shared object by receiving only an original version of the data chunk from a compute node in the parallel computing system; and storing the original version of the data chunk to the shared data object on the storage node as a shared object. The storage node can provide the original version of the data chunk to a compute node when the data chunk is read from the storage node.
Advantageously, illustrative embodiments of the invention provide techniques for parallel de-duplication of data chunks being written to a shared object. These and other features and advantages of the present invention will become more readily apparent from the accompanying drawings and the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates an exemplary conventional technique for de-duplicating data being stored to a shared object by a plurality of processes in a storage system;
FIG. 2 illustrates an exemplary distributed technique for de-duplication of data being stored to a shared object by a plurality of processes in a storage system in accordance with aspects of the present invention;
FIG. 3 illustrates an exemplary alternate distributed technique for de-duplication of data being stored to a shared object by a plurality of processes in a storage system in accordance with an alternate embodiment of the present invention; and
FIG. 4 is a flow chart describing an exemplary LSFS de-duplication process incorporating aspects of the present invention.
DETAILED DESCRIPTION
The present invention provides improved techniques for cooperative parallel writing of data to a shared object. Generally, one aspect of the present invention leverages the parallelism of concurrent writes to a shared object and the high interconnect speed of parallel supercomputer networks to de-duplicate the data in parallel as it is written.
Embodiments of the present invention will be described herein with reference to exemplary computing systems and data storage systems and associated servers, computers, storage units and devices and other processing devices. It is to be appreciated, however, that embodiments of the invention are not restricted to use with the particular illustrative system and device configurations shown. Moreover, the phrases “computing system” and “data storage system” as used herein are intended to be broadly construed, so as to encompass, for example, private or public cloud computing or storage systems, as well as other types of systems comprising distributed virtual infrastructure. However, a given embodiment may more generally comprise any arrangement of one or more processing devices. As used herein, the term “files” shall include complete files and portions of files, such as sub-files or shards.
FIG. 1 illustrates an exemplary conventional storage system 100 that employs a conventional technique for de-duplication of data being stored to a shared object 150 by a plurality of processes. The exemplary storage system 100 may be implemented, for example, as a Parallel Log-Structured File System (PLFS) to make placement decisions automatically, as described in U.S. patent application Ser. No. 13/536,331, filed Jun. 28, 2012, entitled “Storing Files in a Parallel Computing System Using List-Based Index to Identify Replica Files,” (now U.S. Pat. No. 9,087,075), incorporated by reference herein, or it can be explicitly controlled by the application and administered by a storage daemon.
As shown in FIG. 1, the exemplary storage system 100 comprises a plurality of compute nodes 110-1 through 110-N (collectively, compute nodes 110) where a distributed application process generates a corresponding portion 120-1 through 120-N of a distributed shared data structure 150 or other information to store. The compute nodes 110 optionally store the portions 120 of the distributed data structure 150 in one or more nodes of the exemplary storage system 100, such as an exemplary flash based storage node 140. In addition, the exemplary hierarchical storage tiering system 100 optionally comprises one or more hard disk drives (not shown).
As shown in FIG. 1, the compute nodes 110 send their distributed data chunks 120 into a single file 150. The single file 150 is striped into file system defined blocks, and then a de-duplication fingerprint 160-1 through 160-i is generated for each block. As indicated above, existing de-duplication approaches process the shared data structure 150 only after it has been sent to the storage node 140 of the storage system 100. Thus, as shown in FIG. 1, the de-duplication is applied to offset ranges on the data in sizes that are pre-defined by the file system 100. The offset size of the de-duplication does not typically align with the size of the data portions 120 (i.e., the file system defined blocks will typically not match the original memory layout).
FIG. 2 illustrates an exemplary storage system 200 that de-duplicates data chunks 220 being stored to a shared object 250 by a plurality of processes in accordance with aspects of the present invention. The exemplary storage system 200 may be implemented, for example, as a Parallel Log-Structured File System.
As shown in FIG. 2, the exemplary storage system 200 comprises a plurality of compute nodes 210-1 through 210-N (collectively, compute nodes 210) where a distributed application process generates a corresponding data chunk portion 220-1 through 220-N (collectively, data chunks 220) of a distributed shared data object 250 to store. The distributed application executing on a given compute node 210 in the parallel computing system 200 writes and reads the data chunks 220 that are part of the shared data object 250 using a log-structured file system (LSFS) client 205-1 through 205-N executing on the given compute node 210.
In accordance with one aspect of the present invention, on a write operation, each LSFS client 205 applies a corresponding de-duplication function 260-1 through 260-N to each data chunk 220-1 through 220-N to generate a corresponding fingerprint 265-1 through 265-N that is compared to other fingerprints. When a match is detected, the redundant chunk 220 is replaced with a reference that points to the stored chunk. In the example of FIG. 2, chunk 220-3 is a duplicate of chunk 220-2 so only chunk 220-2 is stored and a reference pointing to the stored chunk 220-2 is stored for chunk 220-3, in a known manner.
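A hedged sketch of this client-side write path follows. The fingerprint index and the server's log-append interface are hypothetical stand-ins for whatever the LSFS client and server actually expose, which the patent does not specify.
```python
import hashlib

def lsfs_client_write(chunk: bytes, offset: int, index, server) -> None:
    """Write one data chunk of the shared object, de-duplicating on the way.

    `index` (a shared fingerprint index) and `server` (an LSFS server
    connection) are hypothetical objects used only for illustration.
    """
    fp = hashlib.sha256(chunk).hexdigest()
    if index.lookup(fp):
        # Duplicate: log only a reference to the chunk already stored.
        server.append(offset, length=len(chunk), ref=fp)
    else:
        # Original: publish the fingerprint and ship the chunk itself.
        index.insert(fp)
        server.append(offset, length=len(chunk), ref=fp, data=chunk)
```
Either way the server appends a log entry, so the chunk's original boundaries survive in the log rather than being re-blocked by the file system.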
Each original data chunk 220 is then stored by the corresponding LSFS client 205 on the compute nodes 210 onto one or more storage nodes of the exemplary storage system 200, such as an exemplary LSFS server 240. The LSFS server 240 may be implemented, for example, as a flash based storage node. In addition, the exemplary hierarchical storage tiering system 200 optionally comprises one or more hard disk drives (not shown).
The parallelism of the compute nodes 210 can also be leveraged to build a parallel key server to help find the de-duplicated fingerprints 265. The keys can be cached across the compute server network 200.
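One plausible realization, assumed here purely for illustration since the patent fixes no partitioning scheme, shards the fingerprint space across the compute nodes and caches lookups on each client:
```python
from functools import lru_cache

NUM_NODES = 64                               # illustrative cluster size
_shards = [set() for _ in range(NUM_NODES)]  # stand-in for per-node index shards

def owner_node(fingerprint: str) -> int:
    """Map a hex fingerprint to the compute node owning its index shard."""
    return int(fingerprint[:8], 16) % NUM_NODES

@lru_cache(maxsize=1_000_000)
def is_known(fingerprint: str) -> bool:
    """Cached lookup against the owning shard. A real system would issue an
    RPC here, and a production design would cache only positive hits so that
    later insertions are not masked by stale negative results."""
    return fingerprint in _shards[owner_node(fingerprint)]
```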
FIG. 3 illustrates an exemplary storage system 300 that de-duplicates data chunks 220 being stored to a shared object 250 by a plurality of processes in accordance with an alternate embodiment of the present invention. The exemplary storage system 300 may be implemented, for example, as a Parallel Log-Structured File System. As shown in FIG. 3, the exemplary storage system 300 comprises a plurality of compute nodes 210-1 through 210-N (collectively, compute nodes 210) where a distributed application process generates a corresponding data chunk portion 220-1 through 220-N (collectively, data chunks 220) of a distributed shared data object 250 to store, in a similar manner to FIG. 2. The distributed application executing on a given compute node 210 in the parallel computing system 300 writes and reads the data chunks 220 that are part of the shared data object 250 using a log-structured file system (LSFS) client 205-1 through 205-N executing on the given compute node 210, in a similar manner to FIG. 2.
As discussed hereinafter, following de-duplication, each original data chunk 220 from the distributed data structure 250 is stored in one or more storage nodes of the exemplary storage system 300, such as an exemplary LSFS server 240. The LSFS server 240 may be implemented, for example, as a flash based storage node. In addition, the exemplary hierarchical storage tiering system 300 optionally comprises one or more hard disk drives (not shown).
The exemplary storage system 300 also comprises one or more flash-based burst buffer nodes 310-1 through 310-k that process the data chunks 220 that are written by the LSFS clients 205 to the LSFS server 240, and are read by the LSFS clients 205 from the LSFS server 240. The exemplary flash-based burst buffer nodes 310 comprise LSFS clients 305 in a similar manner to the LSFS clients 205 of FIG. 2.
In accordance with one aspect of the present invention, on a write operation, each burst buffer node 310 applies a de-duplication function 360-1 through 360-k to each data chunk 220-1 through 220-N to generate a corresponding fingerprint 365-1 through 365-N. Each original data chunk 220 is then stored on the LSFS server 240, in a similar manner to FIG. 2.
On a burst buffer node 310, due to the bursty nature of the workloads, there is additional time to run computationally intensive de-duplication.
It is noted that the embodiments of FIGS. 2 and 3 can be combined such that a first level de-duplication is performed by the LSFS clients 205 executing on the compute nodes 210 and additional more computationally intensive de-duplication is performed by the burst buffer nodes 310.
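A sketch of how such a split might look is shown below; the specific pairing of a cheap CRC32 first pass on the compute nodes with a SHA-256 second pass on the burst buffers is our assumption, not a choice made by the patent.
```python
import hashlib
import zlib

def first_level_fp(chunk: bytes) -> int:
    """Cheap filter run by the LSFS clients on the compute nodes."""
    return zlib.crc32(chunk)

def second_level_fp(chunk: bytes) -> str:
    """Costlier, collision-resistant fingerprint run on the burst buffers."""
    return hashlib.sha256(chunk).hexdigest()

def maybe_duplicate(chunk: bytes, seen_crcs: set[int]) -> bool:
    """First-level test: a CRC miss proves the chunk is original, while a
    hit only flags a candidate for second-level verification downstream."""
    crc = first_level_fp(chunk)
    if crc in seen_crcs:
        return True
    seen_crcs.add(crc)
    return False
```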
While such distributed de-duplication may reduce performance due to added latency, this is outweighed by the improved storage and transmission efficiency. Additionally, on the burst buffer nodes 310, this additional latency will not be incurred by the application, since the latency is added not between the application on the compute nodes 210 and the burst buffer nodes 310, but during the asynchronous transfer from the burst buffer nodes 310 to the lower storage servers 240.
FIG. 4 is a flow chart describing an exemplary LSFS de-duplication process 400 incorporating aspects of the present invention. The exemplary LSFS de-duplication process 400 is implemented by the LSFS clients 205 executing on the compute nodes 210 in the embodiment of FIG. 2 and by the flash-based burst buffer nodes 310 in the embodiment of FIG. 3.
As shown in FIG. 4, the exemplary LSFS de-duplication process 400 initially obtains the data chunk from the application during step 420. The exemplary LSFS de-duplication process 400 then de-duplicates the data chunk during step 430 on the compute nodes 210 or the burst buffer nodes 310. Finally, the original data chunks are stored on the LSFS server 240 as part of the shared object 250 during step 440.
Among other benefits, the number of compute servers 210 shown in FIG. 2 is at least an order of magnitude greater than the number of storage servers 240 in HPC systems, so it is much faster to perform the de-duplication on the compute servers 210. In addition, the de-duplication is performed on the data chunks 220 as they are being written by the LSFS client 205, as opposed to after they have been placed into the file 250 by the server 240. The advantage is that, in a conventional approach, the data chunks 120 on the compute node 110 may be completely reorganized when the server 140 puts them into the shared file 150, as shown in FIG. 1. In fact, the data chunks 120 may be split into many smaller sub-chunks and interspersed with small sub-chunks from other compute nodes 110. The original chunking of the data is the most likely to have commonality with other chunks. Thus, this reorganization under the conventional approach may reduce the de-duplicability of the data.
The chunks 220 in a log-structured file system retain their original data organization, whereas in existing approaches the data in the chunks will almost always be reorganized into file system defined blocks. This reorganization can introduce additional latency, as the file system will either wait for the blocks to be filled or perform the de-duplication multiple times, once each time a block is partially filled.
In this manner, aspects of the present invention leverage the parallelism of concurrent writes to a shared object and the high interconnect speed of parallel supercomputer networks to improve data de-duplication during a write operation. Aspects of the present invention thus recognize that the log-structured file system eliminates the need for artificial file system boundaries because all block sizes perform equally well in a log-structured file system.
Because PLFS files can be shared across many locations, the data processing required to implement these functions can be performed more efficiently when multiple nodes cooperate on the data processing operations. Therefore, when this is run on a parallel system with a parallel programming interface, such as the Message Passing Interface (MPI), PLFS can provide MPI versions of these functions, allowing it to exploit parallelism for more efficient data processing.
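As a hedged illustration of the kind of collective fingerprint exchange this enables (using mpi4py; this is our sketch, not code from PLFS):
```python
import hashlib
from mpi4py import MPI

comm = MPI.COMM_WORLD
my_chunk = f"chunk from rank {comm.rank}".encode()  # stand-in payload

# Each rank fingerprints its own chunk; one collective then shares all
# fingerprints so duplicates are detected without a central server.
my_fp = hashlib.sha256(my_chunk).hexdigest()
all_fps = comm.allgather(my_fp)

# Convention: the lowest rank holding a given fingerprint stores the
# original; any higher rank with the same fingerprint stores a reference.
i_store_original = all_fps.index(my_fp) == comm.rank
```
Run under, for example, `mpirun -n 4 python dedup_mpi.py`, every rank learns in a single allgather whether it holds the first copy of its chunk.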
CONCLUSION
Numerous other arrangements of servers, computers, storage devices or other components are possible. Such components can communicate with other elements over any type of network, such as a wide area network (WAN), a local area network (LAN), a satellite network, a telephone or cable network, or various portions or combinations of these and other types of networks.
In one or more exemplary embodiments, a tangible machine-readable recordable storage medium stores one or more software programs, which when executed by one or more processing devices, implement the data deduplication techniques described herein.
It should again be emphasized that the above-described embodiments of the invention are presented for purposes of illustration only. Many variations may be made in the particular arrangements shown. For example, although described in the context of particular system and device configurations, the techniques are applicable to a wide variety of other types of information processing systems, data storage systems, processing devices and distributed virtual infrastructure arrangements. In addition, any simplifying assumptions made above in the course of describing the illustrative embodiments should also be viewed as exemplary rather than as requirements or limitations of the invention. Numerous other alternative embodiments within the scope of the appended claims will be readily apparent to those skilled in the art.

Claims (20)

What is claimed is:
1. A method, comprising:
processing a data chunk generated by a parallel computing system using a Log-Structured File System client to obtain a de-duplication fingerprint, wherein said Log-Structured File System client executes on one or more of a compute node and a burst buffer node in said parallel computing system;
storing said de-duplication fingerprint using said Log-Structured File System client in a parallel key server;
comparing said de-duplication fingerprint using said Log-Structured File System client to de-duplication fingerprints of other data chunks obtained from said parallel key server; and
providing original data chunks to said storage node in said parallel computing system for storage as part of a shared object.
2. The method of claim 1, wherein said storage node comprises a Log-Structured File server.
3. The method of claim 1, further comprising the step of storing a reference to an original data chunk when said de-duplication fingerprint matches a de-duplication fingerprint of another data chunk.
4. The method of claim 1, wherein said storage node receives only said original version of said data chunk from said compute node.
5. The method of claim 4, further comprising the step of said storage node providing said original version of said data chunk to one or more compute nodes when said data chunk is read from said storage node.
6. The method of claim 1, wherein said data chunk has a variable block size.
7. The method of claim 1, wherein said method is performed during a write operation.
8. A non-transitory machine-readable recordable storage medium, wherein one or more software programs when executed by one or more processing devices implement the following steps:
processing a data chunk generated by a parallel computing system using a Log-Structured File System client to obtain a de-duplication fingerprint, wherein said Log-Structured File System client executes on one or more of a compute node and a burst buffer node in said parallel computing system;
storing said de-duplication fingerprint using said Log-Structured File System client in a parallel key server;
comparing said de-duplication fingerprint using said Log-Structured File System client to de-duplication fingerprints of other data chunks obtained from said parallel key server; and
providing original data chunks to said storage node in said parallel computing system for storage as part of a shared object.
9. The storage medium of claim 8, further comprising the step of storing a reference to an original data chunk when said de-duplication fingerprint matches a de-duplication fingerprint of another data chunk.
10. The storage medium of claim 9, further comprising the step of said storage node providing said original version of said data chunk to one or more compute nodes when said data chunk is read from said storage node.
11. The apparatus of claim 10, wherein said storage node is configured to provide said original version of said data chunk to one or more compute nodes when said data chunk is read from said storage node.
12. The storage medium of claim 8, wherein said storage node receives only said original version of said data chunk from said compute node.
13. The storage medium of claim 8, wherein said data chunk has a variable block size.
14. The storage medium of claim 8, wherein said processing, storing, comparing and providing steps are performed during a write operation.
15. An apparatus, said apparatus comprising:
a memory; and
at least one hardware device operatively coupled to the memory and configured to:
process a data chunk generated by a parallel computing system using a Log-Structured File System client to obtain a de-duplication fingerprint, wherein said Log-Structured File System client executes on one or more of a compute node and a burst buffer node in said parallel computing system;
store said de-duplication fingerprint using said Log-Structured File System client in a parallel key server;
compare said de-duplication fingerprint using said Log-Structured File System client to de-duplication fingerprints of other data chunks obtained from said parallel key server; and
provide original data chunks to said storage node in said parallel computing system for storage as part of a shared object.
16. The apparatus of claim 15, wherein said storage node comprises a Log-Structured File server.
17. The apparatus of claim 15, wherein said apparatus comprises one or more of a compute node and a burst buffer node.
18. The apparatus of claim 15, wherein said at least one hardware device is further configured to store a reference to an original data chunk when said de-duplication fingerprint matches a de-duplication fingerprint of another data chunk.
19. The apparatus of claim 15, wherein said storage node receives only said original version of said data chunk from said compute node.
20. The apparatus of claim 15, wherein said data chunk has a variable block size.
US13/799,325 2013-03-13 2013-03-13 Parallel de-duplication of data chunks of a shared data object using a log-structured file system Active 2034-01-28 US9244623B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US13/799,325 US9244623B1 (en) 2013-03-13 2013-03-13 Parallel de-duplication of data chunks of a shared data object using a log-structured file system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US13/799,325 US9244623B1 (en) 2013-03-13 2013-03-13 Parallel de-duplication of data chunks of a shared data object using a log-structured file system

Publications (1)

Publication Number Publication Date
US9244623B1 true US9244623B1 (en) 2016-01-26

Family

ID=55086120

Family Applications (1)

Application Number Title Priority Date Filing Date
US13/799,325 Active 2034-01-28 US9244623B1 (en) 2013-03-13 2013-03-13 Parallel de-duplication of data chunks of a shared data object using a log-structured file system

Country Status (1)

Country Link
US (1) US9244623B1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110258245A1 (en) * 2010-04-14 2011-10-20 International Business Machines Corporation Performing A Local Reduction Operation On A Parallel Computer
US20140136789A1 (en) * 2011-09-20 2014-05-15 Netapp Inc. Host side deduplication

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Rosenblum, Mendel, "The Design and Implementation of a Log-Structured File System", Feb. 1992, pp. 26-52. *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11100051B1 (en) * 2013-03-15 2021-08-24 Comcast Cable Communications, Llc Management of content
US20160224595A1 (en) * 2015-01-29 2016-08-04 HGST Netherlands B.V. Hardware Efficient Fingerprinting
US10078646B2 (en) * 2015-01-29 2018-09-18 HGST Netherlands B.V. Hardware efficient fingerprinting
US10108659B2 (en) 2015-01-29 2018-10-23 Western Digital Technologies, Inc. Hardware efficient rabin fingerprints
US10078643B1 (en) 2017-03-23 2018-09-18 International Business Machines Corporation Parallel deduplication using automatic chunk sizing
US11157453B2 (en) 2017-03-23 2021-10-26 International Business Machines Corporation Parallel deduplication using automatic chunk sizing
US10621144B2 (en) 2017-03-23 2020-04-14 International Business Machines Corporation Parallel deduplication using automatic chunk sizing
US10795859B1 (en) 2017-04-13 2020-10-06 EMC IP Holding Company LLC Micro-service based deduplication
US10795860B1 (en) 2017-04-13 2020-10-06 EMC IP Holding Company LLC WAN optimized micro-service based deduplication
US10459633B1 (en) 2017-07-21 2019-10-29 EMC IP Holding Company LLC Method for efficient load balancing in virtual storage systems
US10860212B1 (en) 2017-07-21 2020-12-08 EMC IP Holding Company LLC Method or an apparatus to move perfect de-duplicated unique data from a source to destination storage tier
US10936543B1 (en) 2017-07-21 2021-03-02 EMC IP Holding Company LLC Metadata protected sparse block set for SSD cache space management
US10949088B1 (en) * 2017-07-21 2021-03-16 EMC IP Holding Company LLC Method or an apparatus for having perfect deduplication, adapted for saving space in a deduplication file system
US11461269B2 (en) 2017-07-21 2022-10-04 EMC IP Holding Company Metadata separated container format
US11113153B2 (en) 2017-07-27 2021-09-07 EMC IP Holding Company LLC Method and system for sharing pre-calculated fingerprints and data chunks amongst storage systems on a cloud local area network
US10481813B1 (en) 2017-07-28 2019-11-19 EMC IP Holding Company LLC Device and method for extending cache operational lifetime
US10929382B1 (en) 2017-07-31 2021-02-23 EMC IP Holding Company LLC Method and system to verify integrity of a portion of replicated data
US11093453B1 (en) 2017-08-31 2021-08-17 EMC IP Holding Company LLC System and method for asynchronous cleaning of data objects on cloud partition in a file system with deduplication
US11093176B2 (en) * 2019-04-26 2021-08-17 EMC IP Holding Company LLC FaaS-based global object compression
CN112527186A (en) * 2019-09-18 2021-03-19 Huawei Technologies Co., Ltd. Storage system, storage node and data storage method
CN112527186B (en) * 2019-09-18 2023-09-08 Huawei Technologies Co., Ltd. Storage system, storage node and data storage method

Similar Documents

Publication Publication Date Title
US9244623B1 (en) Parallel de-duplication of data chunks of a shared data object using a log-structured file system
US9477682B1 (en) Parallel compression of data chunks of a shared data object using a log-structured file system
US11416452B2 (en) Determining chunk boundaries for deduplication of storage objects
CN104871155B (en) Optimizing data block size for deduplication
US9251160B1 (en) Data transfer between dissimilar deduplication systems
CN116431072A (en) Accessible fast durable storage integrated into mass storage device
EP3376393B1 (en) Data storage method and apparatus
US9501488B1 (en) Data migration using parallel log-structured file system middleware to overcome archive file system limitations
Manogar et al. A study on data deduplication techniques for optimized storage
US10261946B2 (en) Rebalancing distributed metadata
US9471582B2 (en) Optimized pre-fetch ordering using de-duplication information to enhance network performance
US10242021B2 (en) Storing data deduplication metadata in a grid of processors
CN116601596A (en) Selecting segments for garbage collection using data similarity
US9965487B2 (en) Conversion of forms of user data segment IDs in a deduplication system
US10255288B2 (en) Distributed data deduplication in a grid of processors
US20160371295A1 (en) Removal of reference information for storage blocks in a deduplication system
US10963177B2 (en) Deduplication using fingerprint tries
US11314432B2 (en) Managing data reduction in storage systems using machine learning
Nicolae Leveraging naturally distributed data redundancy to reduce collective I/O replication overhead
Kumar et al. Differential Evolution based bucket indexed data deduplication for big data storage
Shieh et al. De-duplication approaches in cloud computing environment: a survey
US10521400B1 (en) Data reduction reporting in storage systems
US20200249862A1 (en) System and method for optimal order migration into a cache based deduplicated storage array
Karthika et al. Perlustration on techno level classification of deduplication techniques in cloud for big data storage
US9836475B2 (en) Streamlined padding of deduplication repository file systems

Legal Events

Date Code Title Description
AS Assignment

Owner name: EMC CORPORATION, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:BENT, JOHN M.;FAIBISH, SORIN;SIGNING DATES FROM 20130422 TO 20130521;REEL/FRAME:030575/0860

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH, AS COLLATERAL AGENT, NORTH CAROLINA

Free format text: SECURITY AGREEMENT;ASSIGNORS:ASAP SOFTWARE EXPRESS, INC.;AVENTAIL LLC;CREDANT TECHNOLOGIES, INC.;AND OTHERS;REEL/FRAME:040134/0001

Effective date: 20160907

Owner name: THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT, TEXAS

Free format text: SECURITY AGREEMENT;ASSIGNORS:ASAP SOFTWARE EXPRESS, INC.;AVENTAIL LLC;CREDANT TECHNOLOGIES, INC.;AND OTHERS;REEL/FRAME:040136/0001

Effective date: 20160907

AS Assignment

Owner name: EMC IP HOLDING COMPANY LLC, MASSACHUSETTS

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:EMC CORPORATION;REEL/FRAME:040203/0001

Effective date: 20160906

CC Certificate of correction

AS Assignment

Owner name: THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., TEXAS

Free format text: SECURITY AGREEMENT;ASSIGNORS:CREDANT TECHNOLOGIES, INC.;DELL INTERNATIONAL L.L.C.;DELL MARKETING L.P.;AND OTHERS;REEL/FRAME:049452/0223

Effective date: 20190320

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4

AS Assignment

Owner name: THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., TEXAS

Free format text: SECURITY AGREEMENT;ASSIGNORS:CREDANT TECHNOLOGIES INC.;DELL INTERNATIONAL L.L.C.;DELL MARKETING L.P.;AND OTHERS;REEL/FRAME:053546/0001

Effective date: 20200409

AS Assignment

Owner name: WYSE TECHNOLOGY L.L.C., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: SCALEIO LLC, MASSACHUSETTS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: MOZY, INC., WASHINGTON

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: MAGINATICS LLC, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: FORCE10 NETWORKS, INC., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: EMC IP HOLDING COMPANY LLC, TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: EMC CORPORATION, MASSACHUSETTS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL SYSTEMS CORPORATION, TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL SOFTWARE INC., CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL PRODUCTS L.P., TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL MARKETING L.P., TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL INTERNATIONAL, L.L.C., TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: DELL USA L.P., TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: CREDANT TECHNOLOGIES, INC., TEXAS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: AVENTAIL LLC, CALIFORNIA

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

Owner name: ASAP SOFTWARE EXPRESS, INC., ILLINOIS

Free format text: RELEASE BY SECURED PARTY;ASSIGNOR:CREDIT SUISSE AG, CAYMAN ISLANDS BRANCH;REEL/FRAME:058216/0001

Effective date: 20211101

AS Assignment

Owner name: SCALEIO LLC, MASSACHUSETTS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: EMC IP HOLDING COMPANY LLC (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MOZY, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: EMC CORPORATION (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MAGINATICS LLC), MASSACHUSETTS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO FORCE10 NETWORKS, INC. AND WYSE TECHNOLOGY L.L.C.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL PRODUCTS L.P., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL INTERNATIONAL L.L.C., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL USA L.P., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL MARKETING L.P. (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO CREDANT TECHNOLOGIES, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

Owner name: DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO ASAP SOFTWARE EXPRESS, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (040136/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061324/0001

Effective date: 20220329

AS Assignment

Owner name: SCALEIO LLC, MASSACHUSETTS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: EMC IP HOLDING COMPANY LLC (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MOZY, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: EMC CORPORATION (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO MAGINATICS LLC), MASSACHUSETTS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO FORCE10 NETWORKS, INC. AND WYSE TECHNOLOGY L.L.C.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL PRODUCTS L.P., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL INTERNATIONAL L.L.C., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL USA L.P., TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL MARKETING L.P. (ON BEHALF OF ITSELF AND AS SUCCESSOR-IN-INTEREST TO CREDANT TECHNOLOGIES, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

Owner name: DELL MARKETING CORPORATION (SUCCESSOR-IN-INTEREST TO ASAP SOFTWARE EXPRESS, INC.), TEXAS

Free format text: RELEASE OF SECURITY INTEREST IN PATENTS PREVIOUSLY RECORDED AT REEL/FRAME (045455/0001);ASSIGNOR:THE BANK OF NEW YORK MELLON TRUST COMPANY, N.A., AS NOTES COLLATERAL AGENT;REEL/FRAME:061753/0001

Effective date: 20220329

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 8TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1552); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 8