CN106326035A - File-metadata-based incremental backup method - Google Patents

File-metadata-based incremental backup method

Info

Publication number
CN106326035A
Authority
CN
China
Prior art keywords
file
metadata
characteristic table
code value
incremental backup
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610671739.4A
Other languages
Chinese (zh)
Inventor
闫旋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Chicha Information Technology Co Ltd
Original Assignee
Nanjing Chicha Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2016-08-13
Filing date
2016-08-13
Publication date
2017-01-11
Application filed by Nanjing Chicha Information Technology Co Ltd
Priority to CN201610671739.4A
Publication of CN106326035A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore

Abstract

The invention discloses a file-metadata-based incremental backup method comprising the following steps: establishing, in a memory, a file feature table B that records the MD5 (Message-Digest Algorithm 5) code values of source-disk files that have already been copied; establishing, on a destination disk, a file feature table A whose attributes are the same as those of file feature table B; dividing a file to be stored into chunks using CDC (content-defined chunking); calculating the MD5 code value of the chunked metadata and recording it in file feature table A; and comparing the MD5 code values in file feature table A and file feature table B. Compared with the conventional art, the method helps eliminate duplicate data between files, reduces the space occupied by data to a greater extent, alleviates the growth in storage-system space, makes maximum use of existing resources, and reduces storage cost.

Description

An incremental backup method based on file metadata
Technical field
The present invention relates to the field of file storage technology, and in particular to an incremental backup method based on file metadata.
Background
In recent years, digitization has become a major trend in international development, and countries attach great importance to it. As China's digitization advances, the volume of digital information is growing explosively and data occupies ever more storage space. Centralized storage systems used for archiving and backup hold a large amount of redundant data: research has found that up to 60% of the data kept in a storage system is redundant, and the proportion increases over time. Eliminating duplicate data and saving storage space have therefore become key problems that storage systems need to solve.
Summary of the invention
To address the technical problem described in the background, the present invention proposes an incremental backup method based on file metadata.
The technical solution of the present invention is implemented as follows:
An incremental backup method based on file metadata, characterized by comprising the steps of:
S1: establishing, on the destination disk, a file feature table A for the files that currently need to be backed up;
S2: establishing, in the memory, a file feature table B that records the MD5 code values of the source-disk files that have previously been copied;
S3: dividing the file to be stored into chunks, calculating the MD5 code value of the metadata obtained by chunking, and recording it in file feature table A on the destination disk;
S4: comparing the MD5 code values in file feature table A with those in file feature table B; if an MD5 code value in file feature table A cannot be found in file feature table B, the corresponding metadata is copied and backed up in the memory; if an MD5 code value in file feature table A matches an MD5 code value in file feature table B, the metadata is not backed up in the memory;
S5: updating file feature table B in the memory.
Preferably, the metadata attributes recorded in file feature table B in S2 include some or all of: file name, file size, file creation time, file modification time, user-defined file metadata, and file storage path.
Preferably, the metadata attributes recorded in file feature table A in S3 include some or all of: file name, file size, file creation time, file modification time, user-defined file metadata, and file storage path.
Preferably, in S4, every MD5 code value in file feature table A is compared with the MD5 code values in file feature table B; after the comparison is complete, all metadata in file feature table A that is identical to metadata in file feature table B is deleted.
Preferably, in S3 the file to be stored is chunked using the CDC (content-defined chunking) method.
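Steps S1 to S5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than the method as claimed: the names `FeatureEntry`, `incremental_backup`, `table_b` and `store` are hypothetical, the tables are plain Python dictionaries keyed by MD5 code value, the chunks are passed in already divided, and only two of the listed attributes (file name and size) are recorded.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FeatureEntry:
    """One row of a file feature table (hypothetical layout)."""
    md5: str        # MD5 code value of the chunk
    file_name: str  # one of the attributes named in the text
    size: int       # chunk size in bytes


def incremental_backup(file_name: str, chunks: List[bytes],
                       table_b: Dict[str, FeatureEntry],
                       store: Dict[str, bytes]) -> List[str]:
    """Back up only the chunks whose MD5 code value is not yet in table B."""
    # S1 + S3: build table A for the file being backed up.
    table_a: Dict[str, FeatureEntry] = {}
    data: Dict[str, bytes] = {}
    for chunk in chunks:
        digest = hashlib.md5(chunk).hexdigest()
        table_a[digest] = FeatureEntry(digest, file_name, len(chunk))
        data[digest] = chunk
    # S4: drop the entries of A that already appear in B; the rest is new.
    new_entries = {d: e for d, e in table_a.items() if d not in table_b}
    for digest in new_entries:
        store[digest] = data[digest]   # copy the new chunk to the backup store
    # S5: update table B for the next backup.
    table_b.update(new_entries)
    return list(new_entries)


table_b: Dict[str, FeatureEntry] = {}   # S2: starts empty before any backup
store: Dict[str, bytes] = {}            # stand-in for the backup destination
first = incremental_backup("a.txt", [b"alpha", b"beta"], table_b, store)
second = incremental_backup("a.txt", [b"alpha", b"gamma"], table_b, store)
print(len(first), len(second))          # chunks copied in each run
```

In the second run only the chunk that was not seen before is copied, which is the saving the comparison in S4 is meant to deliver.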
MD5, as referred to in the present invention, is short for Message-Digest Algorithm 5. It was designed by Rivest in 1991 as an improved version of MD4 and is a hash algorithm widely used in the computing field to verify that information is transmitted completely and consistently; mainstream programming languages generally provide an MD5 implementation. MD5 is more complex than MD4 and somewhat slower, but it is more secure and performs better in resisting analysis and differential attacks.
MD5 processes its input in 512-bit blocks and, after a series of operations, produces four 32-bit values, which are finally concatenated into a 128-bit hash value. The basic procedure is to pad the message so that its length satisfies the required remainder condition, append the length, and run the round operations over the chaining variables to obtain the result.
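As a small illustration of the 128-bit result, the sketch below computes the MD5 code value of one chunk with Python's standard `hashlib` module and splits the digest into its four 32-bit words. The chunk content and the variable names are assumptions made for the example.

```python
import hashlib
import struct

chunk = b"example chunk of file data"   # hypothetical chunk content
digest = hashlib.md5(chunk).digest()    # raw digest: 16 bytes = 128 bits
words = struct.unpack("<4I", digest)    # the four 32-bit words, little-endian
code_value = digest.hex()               # 32 hex characters, usable as a table key

print(len(digest) * 8, len(words), code_value)
```

The hexadecimal string is a convenient key for a feature table because two chunks with identical content always yield the same code value.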
In the present invention, the chunking process of the CDC method used in S3 is as follows:
A Rabin fingerprint function is computed over the content of a sliding window, giving a value h. Because the fingerprint values are well dispersed, the probability that h mod D equals a chosen value r is 1/D. From a probabilistic point of view, h mod D equals r about once for every D positions the window slides, so the expected length of a variable-length block is D. This is only an expected value; individual pieces of metadata may still turn out too large or too small. The CDC method divides identical content in two files into exactly the same metadata. Moreover, because the Rabin function identifies character strings well, when a file is modified by insertion, deletion or change, only the few breakpoints following the point of change need to be re-determined, while the boundaries of the other metadata remain essentially unchanged. A small change to a file therefore does not divide it into completely different metadata, and the duplicate content can still be found.
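The chunking behaviour described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it uses a simple polynomial rolling hash as a stand-in for a true Rabin fingerprint, and the function name `cdc_chunks`, the window size, the divisor D = 64 and the residue r = 0 are all assumed values. No minimum or maximum chunk size is enforced.

```python
import random
from typing import List


def cdc_chunks(data: bytes, window: int = 16,
               divisor: int = 64, residue: int = 0) -> List[bytes]:
    """Cut `data` wherever the rolling hash of the last `window` bytes
    satisfies h mod divisor == residue."""
    base, mod = 257, (1 << 31) - 1
    top = pow(base, window - 1, mod)     # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i >= window:                  # slide: remove the oldest byte
            h = (h - data[i - window] * top) % mod
        h = (h * base + byte) % mod      # add the newest byte
        if i + 1 >= window and h % divisor == residue:
            chunks.append(data[start:i + 1])   # breakpoint after byte i
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])      # trailing bytes form the last chunk
    return chunks


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(5000))
modified = b"INSERTED" + original        # small edit at the front of the file
orig_chunks = cdc_chunks(original)
mod_chunks = cdc_chunks(modified)
# Every original chunk after the first reappears unchanged in the edited file.
print(set(orig_chunks[1:]) <= set(mod_chunks))
```

Because each breakpoint depends only on the bytes inside the window, the insertion shifts the later breakpoints without changing the chunks between them. Practical implementations usually also bound the chunk size, which this sketch leaves out.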
Compared with the prior art, the present invention has the following advantageous technical effects:
The present invention adopts an incremental backup approach based on file metadata. First, a file feature table B is established in the memory to record the MD5 code values of source-disk files that have previously been copied; at the same time, a file feature table A with the same attributes as file feature table B is established on the destination disk. The file to be stored is chunked using the CDC method, and the MD5 code value of the metadata obtained by chunking is calculated and recorded in file feature table A. The MD5 code values in file feature table A and file feature table B are then compared: if an MD5 code value in file feature table A cannot be found in file feature table B, the metadata is copied and backed up in the memory; if it matches an MD5 code value in file feature table B, the metadata is not backed up. Finally, file feature table B in the memory is updated in preparation for the next backup. Compared with conventional techniques, this incremental backup approach helps eliminate duplicate data between files, reduces the space occupied by data to a greater extent, alleviates the growth in storage-system space, makes the fullest use of existing resources, and lowers storage cost.
Brief description of the drawings
Fig. 1 is a block flow diagram of a specific embodiment of the incremental backup method based on file metadata proposed by the present invention.
Detailed description
The present invention is further explained below with reference to a specific embodiment.
An incremental backup method based on file metadata comprises the following steps:
S1: a file feature table A for the files that currently need to be backed up is established on the destination disk;
S2: a file feature table B is established in the memory to record the MD5 code values of the source-disk files that have previously been copied;
S3: the file to be stored is divided into chunks, and the MD5 code value of the metadata obtained by chunking is calculated and recorded in file feature table A on the destination disk;
S4: the MD5 code values in file feature table A and file feature table B are compared; if an MD5 code value in file feature table A cannot be found in file feature table B, the metadata is copied and backed up in the memory; if an MD5 code value in file feature table A matches an MD5 code value in file feature table B, the metadata is not backed up in the memory;
S5: file feature table B in the memory is updated in preparation for the next backup.
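The text does not say how file feature table B is kept available between backup runs. One possible arrangement, shown purely as an assumption, is to serialize the table to a JSON file after S5 and reload it before the next run. The function names, the file name `table_b.json` and the entry layout below are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def save_table_b(path, table_b):
    """Write table B to a temporary file, then swap it into place, so an
    interrupted write leaves the previous table intact."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(table_b, fh, indent=2, sort_keys=True)
    os.replace(tmp, path)


def load_table_b(path):
    """Return the saved table, or an empty table before the first backup."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "table_b.json")   # hypothetical location
table_b = load_table_b(path)                   # first run: nothing saved yet
digest = hashlib.md5(b"hello").hexdigest()
table_b[digest] = {"file_name": "a.txt", "size": 5}
save_table_b(path, table_b)                    # S5: keep table B for next time
reloaded = load_table_b(path)
print(reloaded == table_b)
```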
MD5, as referred to in the present invention, is short for Message-Digest Algorithm 5. It was designed by Rivest in 1991 as an improved version of MD4 and is a hash algorithm widely used in the computing field to verify that information is transmitted completely and consistently; mainstream programming languages generally provide an MD5 implementation. MD5 is more complex than MD4 and somewhat slower, but it is more secure and performs better in resisting analysis and differential attacks.
MD5 processes its input in 512-bit blocks and, after a series of operations, produces four 32-bit values, which are finally concatenated into a 128-bit hash value. The basic procedure is to pad the message so that its length satisfies the required remainder condition, append the length, and run the round operations over the chaining variables to obtain the result.
In the present invention, the chunking process of the CDC method used in S3 is as follows:
A Rabin fingerprint function is computed over the content of a sliding window, giving a value h. Because the fingerprint values are well dispersed, the probability that h mod D equals a chosen value r is 1/D. From a probabilistic point of view, h mod D equals r about once for every D positions the window slides, so the expected length of a variable-length block is D. This is only an expected value; individual pieces of metadata may still turn out too large or too small. The CDC method divides identical content in two files into exactly the same metadata. Moreover, because the Rabin function identifies character strings well, when a file is modified by insertion, deletion or change, only the few breakpoints following the point of change need to be re-determined, while the boundaries of the other metadata remain essentially unchanged. A small change to a file therefore does not divide it into completely different metadata, and the duplicate content can still be found.
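The expected block length can be worked out explicitly. The derivation below is a sketch that assumes each window position is a breakpoint independently with probability p = 1/D; the symbol L for the block length is introduced here for illustration.

```latex
\begin{align*}
  P(L = k) &= (1 - p)^{k-1}\, p, \qquad p = \tfrac{1}{D}, \quad k = 1, 2, \dots \\
  E[L] &= \sum_{k \ge 1} k\,(1 - p)^{k-1}\, p = \frac{1}{p} = D \\
  P(L > k) &= \left(1 - \tfrac{1}{D}\right)^{k} \approx e^{-k/D} \\
  P(L > 3D) &\approx e^{-3} \approx 0.05, \qquad
  P\!\left(L \le \tfrac{D}{4}\right) \approx 1 - e^{-1/4} \approx 0.22
\end{align*}
```

Under this assumption roughly one block in twenty is longer than three times the expected length and about one in five is shorter than a quarter of it, which matches the remark that blocks may turn out too large or too small.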
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims (5)

1. An incremental backup method based on file metadata, characterized by comprising the steps of:
S1: establishing, on the destination disk, a file feature table A for the files that currently need to be backed up;
S2: establishing, in the memory, a file feature table B that records the MD5 code values of the source-disk files that have previously been copied;
S3: dividing the file to be stored into chunks, calculating the MD5 code value of the metadata obtained by chunking, and recording it in file feature table A on the destination disk;
S4: comparing the MD5 code values in file feature table A with those in file feature table B; if an MD5 code value in file feature table A cannot be found in file feature table B, the corresponding metadata is copied and backed up in the memory; if an MD5 code value in file feature table A matches an MD5 code value in file feature table B, the metadata is not backed up in the memory;
S5: updating file feature table B in the memory.
2. The incremental backup method based on file metadata according to claim 1, characterized in that the metadata attributes recorded in file feature table B in S2 include some or all of: file name, file size, file creation time, file modification time, user-defined file metadata, and file storage path.
3. The incremental backup method based on file metadata according to claim 1, characterized in that the metadata attributes recorded in file feature table A in S3 include some or all of: file name, file size, file creation time, file modification time, user-defined file metadata, and file storage path.
4. The incremental backup method based on file metadata according to claim 1, characterized in that, in S4, every MD5 code value in file feature table A is compared with the MD5 code values in file feature table B, and after the comparison is complete, all metadata in file feature table A that is identical to metadata in file feature table B is deleted.
5. The incremental backup method based on file metadata according to claim 1, characterized in that, in S3, the file to be stored is chunked using the CDC method.
CN201610671739.4A 2016-08-13 2016-08-13 File-metadata-based incremental backup method Pending CN106326035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610671739.4A CN106326035A (en) 2016-08-13 2016-08-13 File-metadata-based incremental backup method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610671739.4A CN106326035A (en) 2016-08-13 2016-08-13 File-metadata-based incremental backup method

Publications (1)

Publication Number Publication Date
CN106326035A true CN106326035A (en) 2017-01-11

Family

ID=57739356

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610671739.4A Pending CN106326035A (en) 2016-08-13 2016-08-13 File-metadata-based incremental backup method

Country Status (1)

Country Link
CN (1) CN106326035A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN101989929A (en) * 2010-11-17 2011-03-23 中兴通讯股份有限公司 Disaster recovery data backup method and system
CN102810108A (en) * 2011-06-02 2012-12-05 英业达股份有限公司 Method for processing repeated data
EP2905709A2 (en) * 2014-02-11 2015-08-12 Atlantis Computing, Inc. Method and apparatus for replication of files and file systems using a deduplication key space
CN104375905A (en) * 2014-11-07 2015-02-25 北京云巢动脉科技有限公司 Incremental backing up method and system based on data block
CN104932841A (en) * 2015-06-17 2015-09-23 南京邮电大学 Saving type duplicated data deleting method in cloud storage system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
段梦博等: "基于内容的重复数据删除技术的研究", 《电脑知识与技术》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106873908A (en) * 2017-01-17 2017-06-20 北京联想核芯科技有限公司 Date storage method and device
CN106873908B (en) * 2017-01-17 2019-11-12 深圳忆联信息系统有限公司 Date storage method and device
CN106850342B (en) * 2017-01-20 2020-11-24 苏州浪潮智能科技有限公司 Method and device for testing compatibility and stability of switch
CN106850342A (en) * 2017-01-20 2017-06-13 郑州云海信息技术有限公司 The method and device of test interchanger compatibility and stability
CN107704342A (en) * 2017-09-26 2018-02-16 郑州云海信息技术有限公司 A kind of snap copy method, system, device and readable storage medium storing program for executing
CN111367871B (en) * 2020-02-29 2022-06-10 华南理工大学 Method for increment synchronization among files based on SAPCI (software application programming interface) variable-length blocks
CN111367871A (en) * 2020-02-29 2020-07-03 华南理工大学 Method for increment synchronization among files based on SAPCI (software application programming interface) variable-length blocks
CN112507100A (en) * 2020-12-18 2021-03-16 北京百度网讯科技有限公司 Method and device for updating question-answering system
CN112507100B (en) * 2020-12-18 2023-12-22 北京百度网讯科技有限公司 Update processing method and device of question-answering system
CN112882866A (en) * 2021-02-24 2021-06-01 上海泰宇信息技术股份有限公司 Backup method suitable for massive files
CN112882866B (en) * 2021-02-24 2023-12-15 上海泰宇信息技术股份有限公司 Backup method suitable for mass files
CN115145943A (en) * 2022-09-06 2022-10-04 北京麦聪软件有限公司 Multi-data-source metadata rapid comparison method, system, device and storage medium
CN115145943B (en) * 2022-09-06 2023-02-28 北京麦聪软件有限公司 Method, system, equipment and storage medium for rapidly comparing metadata of multiple data sources

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170111
