WO2018013541A1 - Improved data deduplication for eventual consistency system and method - Google Patents

Improved data deduplication for eventual consistency system and method

Info

Publication number
WO2018013541A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
vessel
client
manifest
name
Prior art date
Application number
PCT/US2017/041499
Other languages
English (en)
Inventor
Kurt J. MILLER
Anthony MULIERI
Shaun R.J. MCDOWELL
Original Assignee
Neverfail Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neverfail Limited
Priority to EP17828287.7A (EP3485386A4)
Publication of WO2018013541A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • Data deduplication, also known as data optimization, reduces the number of physical bytes that must be stored on disk or transmitted over the network, without compromising the fidelity and integrity of the original data.
  • Data deduplication reduces the storage capacity required to store data, and may thus lead to savings in data storage hardware costs and management costs.
  • Data deduplication provides a solution for handling the fast-growing volume of digitally stored data.
  • the method comprises receiving a request to synchronize client data on the client computer with data in the data storage connected to the server computer; breaking the client data to be stored in the data storage connected to the server computer into pieces; using the server computer, running an algorithm on the client data and comparing the client data to existing data in the data storage to determine if the client data already exists in data storage; if the client data is not present in data storage, combining the client data into a grouping called a first vessel (V1), creating a first vessel manifest (M1) having a first name/identifier (N1) that identifies the first vessel (V1), storing the first vessel and the first vessel manifest (M1) in the data storage, and storing a pointer to the first vessel manifest (M1) in a first metadata file.
  • the present invention further comprises a computer-implemented method for synchronizing data between a server computer, a client computer and data storage accessible to the server computer to provide a strongly consistent data repository by retrieving data that has been modified and stored in a vessel selected from the group consisting of the first or second vessel (V1, V2) and accessing the first name/identifier (N1), which may identify M2 or M1. If the same first name/identifier (N1) identifies the second manifest (M2), the method attempts to retrieve the data from the second vessel (V2); if this retrieval succeeds, the data in the second vessel (V2) is valid.
  • FIG. 2 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 3 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 4 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 6 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 7 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 1 depicts a computer system and network 100 suitable for implementing the system and method of the present system.
  • a server computer 105 includes an operating system 107 for controlling the overall operation of the server 105 and the deduplication software 106 of the present solution.
  • the server 105 may connect through a wide area or local area communications network (wired or wireless) 102 to one or more client computers 101.
  • the server 105 may also connect via the same or another wide area or local area communications network (wired or wireless) 110 to a standard eventually consistent object storage 111.
  • the communications networks 102 and 110 may be a mixture of local or remote networks so some client computers 101 are local while others are remotely located.
  • File system volumes are configured on and shared from the server 105.
  • the file system shares can be either network attached storage (NAS) or common internet file system (CIFS) type shares.
  • Any data sent to the server 105, whether new data 103 or deleted data 104, is processed through the deduplication software 106.
  • the processing includes breaking the data into pieces, in this example approximately 20 KB each, and running algorithms to determine whether the same data is already present. If so, pointers are used to avoid storing the same data again.
  • the result of the deduplication process is that data that needs to be added to or deleted from the back-end object storage system is combined into new or updated groupings called vessels 108, 109, with new or updated pointer indices called manifests 108, 109.
  • the vessels and manifests 108, 109 are then written via a communications network 110 to local or remote standard eventually consistent object storage 111.
  • FIG. 2 is an overview of an exemplary embodiment of a data deduplication architecture diagram that depicts a design without the use of manifests.
  • each source file 201, 202 and 203 is broken into segments averaging twenty (20) KB in length.
  • a hashing scheme is used to generate a unique fingerprint for the data in each segment 204, 205 and 206.
  • Those segments are grouped together into a unit of approximately ten (10) MB and stored as a file (called a vessel 207, 208, 209, 210) in the storage system 211.
  • a metadata file 204, 205 and 206 contains the original location of each segment in the source file, the fingerprint, and where the data for that fingerprint (i.e., segment) can be found in the storage.
  • Figs. 4, 5 and 6 are overviews of exemplary embodiments of a data deduplication architecture diagram.
  • Fig. 4 depicts an exemplary embodiment of metadata and a manifest before deletion.
  • Fig. 5 depicts an exemplary embodiment of a data retrieval where a new manifest has been retrieved.
  • Fig. 6 depicts an exemplary embodiment of a failed retrieval where an old manifest has been returned.
  • metadata 401; in Fig. 5, metadata 501 include pointers to a manifest 502, and that manifest 502 points to a vessel 503.
  • In Fig. 6, metadata 601 include pointers to an old manifest 602, and that manifest points to a vessel 603.
  • the vessel in question is read back and any data no longer needed is removed as shown in Fig. 5, 503. This vessel 503 is then given a new vessel name with a different identification.
  • the manifest file 502 still has the same identification and name as the old one (Fig. 6, 602). In this case, the old manifest points to a vessel that has been deleted and no longer exists. The new manifest now points to the new vessel 603. The old vessel 503 is then deleted. If the new manifest file 502 is properly retrieved, it will now point to the new vessel 603. However, if the old manifest 602 is retrieved because of stale access, the old vessel 503 will likely be unavailable because it was deleted, and an error condition will occur.
  • Fig. 7 is an overview of an exemplary embodiment of a data deduplication architecture diagram that depicts a successful retrieval where both manifests have been retrieved.
  • the process is as follows.
  • the vessel in question 702 is accessed and data within that vessel that is no longer needed is deleted.
  • a new vessel 701 is saved with a different identification and name.
  • a stub vessel file 702 is saved with a variation of the name of the old vessel and may have an expiration date.
  • This stub vessel file 702 indicates that the vessel no longer exists and contains redirection information to the new vessel 701.
  • Stream 3 metadata 705 is incorporated into the new manifest 3 file 703.
  • a new manifest file 703 is saved with the same identification and name as the old manifest file 704.
  • the new manifest 703 now points to the new vessel 701.
  • the old vessel 702 may be deleted. If the new manifest file 703 is properly retrieved, it will point to the new vessel 701. If the old manifest 704 is retrieved because of lack of consistency, the retrieval code will attempt to retrieve the old vessel. If the old vessel 702 is retrieved (because of inconsistency), the data is taken from there. In this example, the old vessel 702 is now a stub vessel 702 whose redirection information points to the new vessel 701, which contains the most current data, and that data will be retrieved. In either case, whether the old manifest 704 or the new manifest 703 is retrieved, it will ultimately point to the most current data in vessel 701.
  • FIGs. 8a and 8b are flowcharts of an exemplary embodiment of a data deduplication system for eventual consistency object storage.
  • a computer-implemented method for synchronizing data between a server computer, a client computer and data storage accessible to the server computer to provide a strongly consistent data repository is shown 800.
  • client data to be stored in the data storage connected to the server computer is broken into pieces 805.
  • an algorithm is run on the client data, comparing the client data to existing data in the data storage to determine if the client data already exists in data storage 810.
  • the client data is combined into a grouping called a first vessel (V1), a first vessel manifest (M1) is created having a first name/identifier (N1) that identifies the first vessel (V1), the first vessel and the first vessel manifest (M1) are stored in the data storage, a pointer to the first vessel manifest (M1) is stored in a first metadata file 820, and processing ends 825.
  • a second vessel is created and the client data is stored in the second vessel (V2) and a second vessel manifest (M2) is created having the same first name/identifier (N1a) but its contents identify the second vessel (V2) 830.
  • a stub redirect vessel (V3) is created having a redirect vessel name/identifier that describes the data in the second vessel (V2) 835.
  • the second vessel (V2) is deleted 840.
  • If the same first name/identifier (N1) identifies the first manifest (M1) 860, an attempt is made to retrieve the data from the first vessel (V1), and if this retrieval succeeds, the data in the first vessel (V1) is valid and is retrieved from the first vessel (V1) 870. If the same first name/identifier (N1) identifies the first manifest (M1), an attempt is made to retrieve the data from the first vessel (V1), and if this retrieval fails, the data in the first vessel (V1) was deleted; the stub redirect vessel (V3), which describes the data in the second vessel (V2), is accessed, the data is retrieved from the second vessel (V2), and processing ends 875.
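
The stub-vessel redirection described for Fig. 7 and FIGs. 8a and 8b can be illustrated with a minimal Python sketch. It is not the patented implementation: the in-memory `ObjectStore` class, the `.stub` naming convention, the dictionary layouts and the function names `compact` and `read_segments` are all hypothetical, chosen only to show how a reader holding a stale manifest can still reach the current data.

```python
class ObjectStore:
    """Toy in-memory stand-in for an object store (hypothetical API)."""

    def __init__(self) -> None:
        self.objects: dict[str, dict] = {}

    def put(self, key: str, value: dict) -> None:
        self.objects[key] = value

    def get(self, key: str):
        # Returns None when the object does not exist (e.g. it was deleted).
        return self.objects.get(key)

    def delete(self, key: str) -> None:
        self.objects.pop(key, None)


def stub_name(vessel_name: str) -> str:
    # Assumed convention for "a variation of the name of the old vessel".
    return vessel_name + ".stub"


def compact(store: ObjectStore, manifest_name: str, new_vessel_name: str, keep: set) -> dict:
    """Rewrite a vessel without unneeded segments; return the old (now stale) manifest."""
    old_manifest = store.get(manifest_name)
    old_vessel_name = old_manifest["vessel"]
    old_vessel = store.get(old_vessel_name)
    kept = {fp: seg for fp, seg in old_vessel["segments"].items() if fp in keep}
    # 1. Save the new vessel under a different name.
    store.put(new_vessel_name, {"segments": kept})
    # 2. Save a stub that redirects readers of the old vessel to the new one.
    store.put(stub_name(old_vessel_name), {"redirect": new_vessel_name})
    # 3. Save the new manifest under the SAME name as the old manifest.
    store.put(manifest_name, {"vessel": new_vessel_name})
    # 4. The old vessel may now be deleted.
    store.delete(old_vessel_name)
    return old_manifest


def read_segments(store: ObjectStore, manifest: dict) -> dict:
    """Resolve a manifest (fresh or stale) to the current segment data."""
    name = manifest["vessel"]
    vessel = store.get(name)
    if vessel is None:
        # The vessel is gone: follow the redirect stub to the new vessel.
        stub = store.get(stub_name(name))
        if stub is None:
            raise KeyError(f"vessel {name!r} and its stub are both missing")
        vessel = store.get(stub["redirect"])
        if vessel is None:
            raise KeyError(f"redirect target {stub['redirect']!r} is missing")
    return vessel["segments"]
```

In this sketch a caller that received the old manifest before compaction and a caller that receives the new one both end up reading the same, current vessel, which is the property the description attributes to the stub vessel.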

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system and method for improving data deduplication for eventually consistent distributed storage systems. Eventual consistency poses significant challenges for deduplication systems. This is because data storage is often spread across many different nodes and the most recent data may not always be available. An important component of deduplication creates pointers to older copies of identical data and deletes the latest copy. In a basic embodiment, this creates an eventual-consistency vulnerability. If an older copy of the pointers to the data is retrieved (in the form of a metadata file), it may point to a copy of the data that is no longer available. The system and method of the present invention solve the problem of eventual-consistency vulnerabilities by introducing a level of indirection and creating manifest files for each file (vessel).
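
The pointer-based deduplication that the abstract refers to can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: fixed-size 20 KB segments, SHA-256 as the hashing scheme, and the `DedupStore` class with its `put` and `get` methods are all hypothetical choices (the description only speaks of segments of roughly 20 KB and "a hashing scheme").

```python
import hashlib

SEGMENT_SIZE = 20 * 1024  # assumed fixed size; the description says "approximately 20 KB"


def split_segments(data: bytes, size: int = SEGMENT_SIZE) -> list:
    """Break data into consecutive segments of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(segment: bytes) -> str:
    """Fingerprint a segment; SHA-256 is an assumed choice of hash."""
    return hashlib.sha256(segment).hexdigest()


class DedupStore:
    """Toy store: one copy per unique segment, files kept as lists of pointers."""

    def __init__(self) -> None:
        self.segments = {}  # fingerprint -> segment bytes
        self.metadata = {}  # file name -> ordered list of fingerprints

    def put(self, name: str, data: bytes) -> int:
        """Store a file and return how many new segments had to be written."""
        written = 0
        pointers = []
        for segment in split_segments(data):
            fp = fingerprint(segment)
            if fp not in self.segments:
                # Unseen data: store it once.
                self.segments[fp] = segment
                written += 1
            # Seen or unseen, the file only records a pointer to the segment.
            pointers.append(fp)
        self.metadata[name] = pointers
        return written

    def get(self, name: str) -> bytes:
        """Rebuild a file by following its pointers in order."""
        return b"".join(self.segments[fp] for fp in self.metadata[name])
```

Storing a second file that shares its first 20 KB with an earlier file writes only the segments that differ; the shared segment is referenced by pointer from both files.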
PCT/US2017/041499 2016-07-12 2017-07-11 Improved data deduplication for eventual consistency system and method WO2018013541A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP17828287.7A EP3485386A4 (en) 2016-07-12 2017-07-11 Improved data deduplication for eventual consistency system and method

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662361321P 2016-07-12 2016-07-12
US62/361,321 2016-07-12

Publications (1)

Publication Number Publication Date
WO2018013541A1 2018-01-18

Family

ID=60953326

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2017/041499 WO2018013541A1 (en) Improved data deduplication for eventual consistency system and method 2016-07-12 2017-07-11

Country Status (2)

Country Link
EP (1) EP3485386A4 (fr)
WO (1) WO2018013541A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10691666B1 (en) 2017-08-23 2020-06-23 CloudBD, LLC Providing strong consistency for object storage

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090171888A1 (en) * 2007-12-28 2009-07-02 International Business Machines Corporation Data deduplication by separating data from meta data
US20110184921A1 (en) * 2010-01-25 2011-07-28 Sepaton, Inc. System and Method for Data Driven De-Duplication
US20130018855A1 (en) * 2010-06-18 2013-01-17 Kave Eshghi Data deduplication
US20140189040A1 (en) * 2012-12-27 2014-07-03 Akamai Technologies, Inc. Stream-based data deduplication with cache synchronization
US20140250066A1 (en) * 2013-03-04 2014-09-04 Vmware, Inc. Cross-file differential content synchronization
US20140344229A1 (en) * 2012-02-02 2014-11-20 Mark D. Lillibridge Systems and methods for data chunk deduplication
US20150106345A1 (en) * 2013-10-15 2015-04-16 Sepaton, Inc. Multi-node hybrid deduplication
US20150154220A1 (en) * 2009-07-08 2015-06-04 Commvault Systems, Inc. Synchronized data duplication
US20150205663A1 (en) * 2014-01-17 2015-07-23 Netapp, Inc. Clustered raid data organization
US20160092312A1 (en) * 2014-09-30 2016-03-31 Code 42 Software, Inc. Deduplicated data distribution techniques

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633033B2 (en) * 2013-01-11 2017-04-25 Commvault Systems, Inc. High availability distributed deduplicated storage system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BARRETO ET AL.: "Efficient Locally Trackable Deduplication in Replicated Systems", EFFICIENT LOCALLY TRACKABLE DEDUPLICATION IN REPLICATED SYSTEMS, 2009, XP019134822, Retrieved from the Internet <URL:https://link.springer.com/content/pdf/10.1007/978-3-642-10445-9_6.pdf> [retrieved on 20170908] *
See also references of EP3485386A4 *

Also Published As

Publication number Publication date
EP3485386A4 (fr) 2020-03-11
EP3485386A1 (fr) 2019-05-22

Similar Documents

Publication Publication Date Title
US7366859B2 (en) Fast incremental backup method and system
US20190114288A1 (en) Transferring differences between chunks during replication
US9934237B1 (en) Metadata optimization for network replication using representative of metadata batch
US7472254B2 (en) Systems and methods for modifying a set of data objects
US9798486B1 (en) Method and system for file system based replication of a deduplicated storage system
US8312006B2 (en) Cluster storage using delta compression
US11182256B2 (en) Backup item metadata including range information
CN110096891B (en) Object signatures in object stores
US8166012B2 (en) Cluster storage using subsegmenting
JP4473694B2 (en) Long-term data protection system and method
US9367448B1 (en) Method and system for determining data integrity for garbage collection of data storage systems
US9262280B1 (en) Age-out selection in hash caches
WO2017049764A1 (en) Data read and write method and distributed storage system
US8825626B1 (en) Method and system for detecting unwanted content of files
US20150339314A1 (en) Compaction mechanism for file system
US10387271B2 (en) File system storage in cloud using data and metadata merkle trees
US10191915B2 (en) Information processing system and data synchronization control scheme thereof
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
US9917894B2 (en) Accelerating transfer protocols
US10459886B2 (en) Client-side deduplication with local chunk caching
US8756249B1 (en) Method and apparatus for efficiently searching data in a storage system
US10684920B2 (en) Optimized and consistent replication of file overwrites
US10339124B2 (en) Data fingerprint strengthening
US10380141B1 (en) Fast incremental backup method and system
US9594643B2 (en) Handling restores in an incremental backup storage system

Legal Events

Date Code Title Description
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (pct application filed from 20040101)
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 17828287; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2017828287; Country of ref document: EP; Effective date: 20190212)