EP3485386A1 - Improved data deduplication for eventual consistency system and method - Google Patents

Improved data deduplication for eventual consistency system and method

Info

Publication number
EP3485386A1
EP3485386A1 EP17828287.7A EP17828287A EP3485386A1 EP 3485386 A1 EP3485386 A1 EP 3485386A1 EP 17828287 A EP17828287 A EP 17828287A EP 3485386 A1 EP3485386 A1 EP 3485386A1
Authority
EP
European Patent Office
Prior art keywords
data
vessel
client
manifest
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP17828287.7A
Other languages
German (de)
English (en)
Other versions
EP3485386A4 (fr)
Inventor
Kurt J. MILLER
Anthony MULIERI
Shaun R.J. MCDOWELL
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Neverfail Ltd
Original Assignee
Neverfail Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Neverfail Ltd
Publication of EP3485386A1
Publication of EP3485386A4
Current legal status: Withdrawn

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • Data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. This technique is used to improve storage utilization and can also be applied to network data transfers to reduce the number of bytes that must be sent.
  • In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy, and whenever a match occurs, the redundant chunk is replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times (the match frequency often varies with the chunk size), the amount of data that must be stored or transferred can be greatly reduced.
  • Data deduplication, also known as data optimization, reduces the number of physical bytes that must be stored on disk or transmitted over the network, without compromising the fidelity and integrity of the original data.
  • Data deduplication reduces the storage capacity required to store data, and may thus lead to savings in data storage hardware and management costs.
  • Data deduplication provides a way to handle fast-growing volumes of digitally stored data.
  • Eventual consistency poses significant challenges for deduplication systems. This is because data storage is often spread over many different nodes and up-to-date data may not always be available.
  • A significant component of deduplication is creating pointers to older copies of identical data and deleting the later copy. In a basic implementation, this creates an eventual consistency vulnerability. If an older copy of the pointers to the data is retrieved (usually in the form of a metadata file), it may point to a copy of the data that is no longer available.
  • The present system and method solve the problem of eventual consistency vulnerabilities by introducing a level of indirection and creating manifest files for each file (vessel).
  • The present solution addresses data deduplication limitations and deals with the problems of eventual consistency by adding a layer of indirection above the deduplicated data. When this layer is combined with a precisely ordered multi-step update process and the use of temporary redirection stub files, the result is a strongly consistent data repository built out of the eventually consistent object storage.
  • The present invention relates to a system and method for improving data deduplication for eventually consistent distributed storage systems.
  • The present invention comprises a computer-implemented method for synchronizing data between a server computer, a client computer and data storage accessible to the server computer to provide a strongly consistent data repository.
  • The method comprises receiving a request to synchronize client data on the client computer with data in the data storage connected to the server computer; breaking the client data to be stored in the data storage connected to the server computer into pieces; using the server computer, running an algorithm on the client data and comparing the client data to existing data in the data storage to determine if the client data already exists in data storage; if the client data is not present in data storage, combining the client data into a grouping called a first vessel (V1), creating a first vessel manifest (M1) having a first name/identifier (N1) that identifies the first vessel (V1), storing the first vessel (V1) and the first vessel manifest (M1) in the data storage, and storing a pointer to the first vessel manifest (M1) in a first metadata file.
  • If the client data is present in data storage, the method comprises comparing the contents of the client data to existing data to determine if the client data needs to be revised and, if so, creating a second vessel (V2) and storing the client data in the second vessel (V2); creating a second vessel manifest (M2) having the same first name/identifier (N1) but whose contents identify the second vessel (V2); creating a stub redirect vessel (V3) having a redirect vessel name/identifier that describes the data in the second vessel (V2); and deleting the first vessel (V1).
  • The present invention further comprises a computer-implemented method for synchronizing data between a server computer, a client computer and data storage accessible to the server computer to provide a strongly consistent data repository by retrieving data that has been modified and stored in a vessel selected from the group consisting of the first and second vessels (V1, V2) and accessing the first name/identifier (N1), which may identify either the first manifest (M1) or the second manifest (M2). If the same first name/identifier (N1) identifies the second manifest (M2), an attempt is made to retrieve the data from the second vessel (V2), and if this retrieval succeeds, the data in the second vessel (V2) is valid.
  • If the same first name/identifier (N1) identifies the first manifest (M1), an attempt is made to retrieve the data from the first vessel (V1); if this retrieval succeeds, the data in the first vessel (V1) is valid and is retrieved from the first vessel (V1). If the same first name/identifier (N1) identifies the first manifest (M1) and the attempt to retrieve the data from the first vessel (V1) fails, the first vessel (V1) was deleted; the stub redirect vessel (V3), which describes the data in the second vessel (V2), is then accessed and the data is retrieved from the second vessel (V2).
  • FIG. 1 illustrates an exemplary embodiment block diagram of the present system.
  • FIG. 2 is an exemplary embodiment of a data deduplication architecture diagram overview.
  • FIG. 3 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • Fig. 4 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 5 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 6 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIG. 7 is an overview of an exemplary embodiment of a data deduplication architecture diagram.
  • FIGs. 8a and 8b are flowcharts of an exemplary embodiment of a data deduplication system for eventual consistency object storage.
  • FIG. 1 depicts a computer system and network 100 suitable for implementing the present system and method.
  • A server computer 105 includes an operating system 107 for controlling the overall operation of the server 105 and the deduplication software 106 of the present solution.
  • The server 105 may connect through a wide area or local area communications network (wired or wireless) 102 to one or more client computers 101.
  • The server 105 may also connect via the same or another wide area or local area communications network (wired or wireless) 110 to a standard eventually consistent object storage 111.
  • The communications networks 102 and 110 may be a mixture of local and remote networks, so some client computers 101 are local while others are remotely located.
  • File system volumes are configured on and shared from the server 105.
  • The file system shares can be either network attached storage (NAS) or common internet file system (CIFS) type shares.
  • Any data sent to the server 105, whether new data 103 or deleted data 104, is processed through the deduplication software 106.
  • The processing includes breaking the data into pieces, in this example approximately 20 KB each, and running algorithms to determine if the same data is already present. If so, pointers are used to avoid storing the same data again.
  • The result of the deduplication process is that data that needs to be added to or deleted from the back-end object storage system is combined into new or updated groupings called vessels 108, 109, with new or updated pointer indices called manifests 108, 109.
  • The vessels and manifests 108, 109 are then written via a communications network 110 to local or remote standard eventually consistent object storage 111.
  • FIG. 2 is an overview of an exemplary embodiment of a data deduplication architecture diagram that depicts a design without the use of manifests.
  • Each source file 201, 202 and 203 is broken into segments averaging twenty (20) KB in length.
  • A hashing scheme is used to generate a unique fingerprint for the data in each segment 204, 205 and 206.
  • Those segments are grouped together into a unit of approximately ten (10) MB and stored as a file (called a vessel 207, 208, 209, 210) in the storage system 211.
  • A metadata file 204, 205 and 206 contains the original location of each segment in the source file, the fingerprint, and where the data for that fingerprint (i.e., segment) can be found in the storage.
  • FIG. 3 is an overview of an exemplary embodiment of a data deduplication architecture diagram that depicts a design alternative which uses manifests.
  • This deduplication design 300 is a modification of the design shown in Fig. 2.
  • Each source file 301, 302 and 303 is broken into segments of an average of twenty
  • Figs. 4, 5 and 6 are overviews of exemplary embodiments of a data deduplication architecture diagram.
  • Fig.4 depicts an exemplary embodiment of metadata and a manifest before deletion.
  • Fig. 5 depicts an exemplary embodiment of a data retrieval where a new manifest has been retrieved.
  • Fig. 6 depicts an exemplary embodiment of a failed retrieval where an old manifest has been returned.
  • Metadata (Fig. 4, 401; Fig. 5, 501) include pointers to a manifest 502, and that manifest 502 points to a vessel 503.
  • In Fig. 6, metadata 601 include pointers to an old manifest 602, and that manifest points to a vessel 603.
  • The vessel in question is read back and any data no longer needed is removed, as shown in Fig. 5, 503. This vessel 503 is then given a new vessel name with a different identification.
  • The manifest file 502 still has the same identification and name as the old one (Fig. 6, 602). In this case, the old manifest points to a vessel that has been deleted and no longer exists. The new manifest now points to the new vessel 603. The old vessel 503 is then deleted. If the new manifest file 502 is properly retrieved, it will point to the new vessel 603. However, if the old manifest 602 is retrieved because of stale access, the old vessel 503 will likely be unavailable because it was deleted, and an error condition will occur.
  • Fig. 7 is an overview of an exemplary embodiment of a data deduplication architecture diagram that depicts a successful retrieval where both manifests have been retrieved.
  • The process is as follows.
  • The vessel in question 702 is accessed and data within that vessel that is no longer needed is deleted.
  • A new vessel 701 is saved with a different identification and name.
  • A stub vessel file 702 is saved with a variation of the name of the old vessel and may have an expiration date.
  • This stub vessel file 702 indicates that the vessel no longer exists and contains redirection information to the new vessel 701.
  • Stream 3 metadata 705 is incorporated into the new manifest 3 file 703.
  • A new manifest file 703 is saved with the same identification and name as the old manifest file 704.
  • The new manifest 703 now points to the new vessel 701.
  • The old vessel 702 may be deleted. If the new manifest file 703 is properly retrieved, it will point to the new vessel 701. If the old manifest 704 is retrieved because of lack of consistency, the retrieval code will attempt to retrieve the old vessel. If the old vessel 702 is retrieved (because of inconsistency), the data is taken from there. In this example, the old vessel 702 is now a stub vessel 702 that points to the new vessel 701, which contains the most current data, and that data will be retrieved via the redirection stub 702. In either case, whether the old manifest 704 or the new manifest 703 is retrieved, it will ultimately lead to the most current data in vessel 701.
  • FIGs. 8a and 8b are flowcharts of an exemplary embodiment of a data deduplication system for eventual consistency object storage.
  • A computer-implemented method 800 for synchronizing data between a server computer, a client computer and data storage accessible to the server computer to provide a strongly consistent data repository is shown.
  • Client data to be stored in the data storage connected to the server computer is broken into pieces 805.
  • An algorithm is run on the client data, and the client data is compared to existing data in the data storage to determine if the client data already exists in data storage 810.
  • The client data is combined into a grouping called a first vessel (V1), a first vessel manifest (M1) is created having a first name/identifier (N1) that identifies the first vessel (V1), the first vessel (V1) and the first vessel manifest (M1) are stored in the data storage, and a pointer to the first vessel manifest (M1) is stored in a first metadata file 820, and processing ends 825.
  • A second vessel (V2) is created and the client data is stored in the second vessel (V2), and a second vessel manifest (M2) is created having the same first name/identifier (N1) but whose contents identify the second vessel (V2) 830.
  • A stub redirect vessel (V3) is created having a redirect vessel name/identifier that describes the data in the second vessel (V2) 835.
  • The first vessel (V1) is deleted 840.
  • If the same first name/identifier (N1) identifies the first manifest (M1) 860, an attempt is made to retrieve the data from the first vessel (V1); if this retrieval succeeds, the data in the first vessel (V1) is valid and is retrieved from the first vessel (V1) 870. If the same first name/identifier (N1) identifies the first manifest (M1) and the attempt to retrieve the data from the first vessel (V1) fails, the first vessel (V1) was deleted; the stub redirect vessel (V3), which describes the data in the second vessel (V2), is accessed, the data is retrieved from the second vessel (V2), and processing ends 875.
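
The ordered update and the redirect-tolerant read described above can be illustrated with a small sketch. It is not the applicants' implementation: the `ObjectStore` class, the `.redirect` stub naming, and the function names are hypothetical, and eventual consistency is simulated only for manifest reads (a reader may be handed the previous manifest). The update follows the order given for Fig. 7: new vessel, then stub, then manifest, then deletion of the old vessel.

```python
class ObjectStore:
    """Toy in-memory object store (hypothetical). `stale` keeps the previous
    value of an overwritten object to simulate a lagging replica."""

    def __init__(self):
        self.objects = {}  # name -> latest value
        self.stale = {}    # name -> previous value, if the object was overwritten

    def put(self, name, value):
        if name in self.objects:
            self.stale[name] = self.objects[name]
        self.objects[name] = value

    def delete(self, name):
        self.objects.pop(name, None)
        self.stale.pop(name, None)

    def get(self, name, read_stale=False):
        if read_stale and name in self.stale:
            return self.stale[name]
        return self.objects.get(name)


def stub_name(vessel_name):
    # Assumption: the stub is stored under a variation of the old vessel's name.
    return vessel_name + ".redirect"


def write_new(store, manifest_name, vessel_name, data):
    store.put(vessel_name, {"data": data})             # first vessel (V1)
    store.put(manifest_name, {"vessel": vessel_name})  # manifest M1 under name N1
    return manifest_name                               # pointer kept in the metadata file


def update(store, manifest_name, new_vessel_name, data):
    old_vessel = store.get(manifest_name)["vessel"]
    store.put(new_vessel_name, {"data": data})                       # 1. second vessel (V2)
    store.put(stub_name(old_vessel), {"redirect": new_vessel_name})  # 2. stub redirect (V3)
    store.put(manifest_name, {"vessel": new_vessel_name})            # 3. manifest M2, same name N1
    store.delete(old_vessel)                                         # 4. delete first vessel (V1)


def read(store, manifest_name, read_stale=False):
    manifest = store.get(manifest_name, read_stale=read_stale)
    vessel = store.get(manifest["vessel"])
    if vessel is None:
        # The vessel named by an old manifest is gone: follow the stub redirect.
        stub = store.get(stub_name(manifest["vessel"]))
        vessel = store.get(stub["redirect"])
    return vessel["data"]
```

For example, after `write_new(store, "N1", "vessel-1", b"old")` and `update(store, "N1", "vessel-2", b"new")`, both `read(store, "N1")` and `read(store, "N1", read_stale=True)` resolve to the data in the new vessel. A real object store could also delay visibility of the stub itself, which this sketch does not model.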

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a system and method for improving data deduplication for eventually consistent distributed storage systems. Eventual consistency poses significant challenges for deduplication systems. This is because data storage is often spread over many different nodes and up-to-date data may not always be available. A significant component of deduplication is creating pointers to older copies of identical data and deleting the later copy. In a basic implementation, this creates an eventual consistency vulnerability. If an older copy of the pointers to the data is retrieved (in the form of a metadata file), it may point to a copy of the data that is no longer available. The present system and method solve the problem of eventual consistency vulnerabilities by introducing a level of indirection and creating manifest files for each file (vessel).
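
The level of indirection mentioned in the abstract (per-file metadata points to a manifest name, and the manifest identifies the vessel holding the segment data) can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: SHA-256 stands in for the unspecified hashing scheme, fixed 20 KB segments stand in for the roughly 20 KB average segments, each stored file yields at most one new vessel (the approximately 10 MB vessel size is not enforced), and `DedupStore` and its fields are hypothetical names.

```python
import hashlib

SEGMENT_SIZE = 20 * 1024  # assumption: fixed-size segments instead of an average size


def fingerprint(segment: bytes) -> str:
    # Assumption: SHA-256 as the fingerprinting hash.
    return hashlib.sha256(segment).hexdigest()


class DedupStore:
    def __init__(self):
        self.vessels = {}    # vessel name -> {fingerprint: segment bytes}
        self.manifests = {}  # manifest name -> vessel name (the indirection layer)
        self.index = {}      # fingerprint -> manifest name that locates the segment
        self._counter = 0

    def store_file(self, data: bytes):
        """Store only unseen segments; return the file's metadata as a list of
        (offset, fingerprint, manifest name) entries."""
        self._counter += 1
        manifest_name = f"manifest-{self._counter}"
        metadata, new_segments = [], {}
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fp = fingerprint(segment)
            if fp not in self.index:          # new data: goes into this file's vessel
                new_segments[fp] = segment
                self.index[fp] = manifest_name
            metadata.append((offset, fp, self.index[fp]))
        if new_segments:
            vessel_name = f"vessel-{self._counter}"
            self.vessels[vessel_name] = new_segments
            self.manifests[manifest_name] = vessel_name
        return metadata

    def read_file(self, metadata) -> bytes:
        parts = []
        for _offset, fp, manifest_name in metadata:
            vessel_name = self.manifests[manifest_name]  # metadata -> manifest -> vessel
            parts.append(self.vessels[vessel_name][fp])
        return b"".join(parts)
```

Because metadata entries name a manifest rather than a vessel, a vessel can later be rewritten under a new name by updating only the manifest stored under the same name, leaving every metadata file untouched.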
EP17828287.7A 2016-07-12 2017-07-11 Improved data deduplication for eventual consistency system and method Withdrawn EP3485386A4 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201662361321P 2016-07-12 2016-07-12
PCT/US2017/041499 WO2018013541A1 (fr) 2016-07-12 2017-07-11 Improved data deduplication for eventual consistency system and method

Publications (2)

Publication Number Publication Date
EP3485386A1 (fr) 2019-05-22
EP3485386A4 (fr) 2020-03-11

Family

ID=60953326

Family Applications (1)

Application Number Title Priority Date Filing Date
EP17828287.7A Withdrawn EP3485386A4 (fr) 2016-07-12 2017-07-11 Improved data deduplication for eventual consistency system and method

Country Status (2)

Country Link
EP (1) EP3485386A4 (fr)
WO (1) WO2018013541A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10691666B1 (en) 2017-08-23 2020-06-23 CloudBD, LLC Providing strong consistency for object storage

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962452B2 (en) * 2007-12-28 2011-06-14 International Business Machines Corporation Data deduplication by separating data from meta data
US8930306B1 (en) * 2009-07-08 2015-01-06 Commvault Systems, Inc. Synchronized data deduplication
US8495028B2 (en) * 2010-01-25 2013-07-23 Sepaton, Inc. System and method for data driven de-duplication
US8799238B2 (en) * 2010-06-18 2014-08-05 Hewlett-Packard Development Company, L.P. Data deduplication
EP2810171B1 (fr) * 2012-02-02 2019-07-03 Hewlett-Packard Enterprise Development LP Systems and methods for data chunk deduplication
US9451000B2 (en) * 2012-12-27 2016-09-20 Akamai Technologies, Inc. Stream-based data deduplication with cache synchronization
US9633033B2 (en) * 2013-01-11 2017-04-25 Commvault Systems, Inc. High availability distributed deduplicated storage system
US9418072B2 (en) * 2013-03-04 2016-08-16 Vmware, Inc. Cross-file differential content synchronization
US9678973B2 (en) * 2013-10-15 2017-06-13 Hitachi Data Systems Corporation Multi-node hybrid deduplication
US9483349B2 (en) * 2014-01-17 2016-11-01 Netapp, Inc. Clustered raid data organization
US9053124B1 (en) * 2014-09-30 2015-06-09 Code 42 Software, Inc. System for a distributed file system element collection

Also Published As

Publication number Publication date
EP3485386A4 (fr) 2020-03-11
WO2018013541A1 (fr) 2018-01-18

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20190111

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20200212

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 21/60 20130101ALI20200206BHEP

Ipc: H04L 29/08 20060101ALI20200206BHEP

Ipc: G06F 16/174 20190101ALI20200206BHEP

Ipc: G06F 15/16 20060101AFI20200206BHEP

Ipc: G06F 16/00 20190101ALI20200206BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20210325

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20210805