CN103970852A - Data deduplication method of backup server - Google Patents

Publication number
CN103970852A
Authority
CN
China
Prior art keywords
data
hash
value
hash value
backup
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410186755.5A
Other languages
Chinese (zh)
Inventor
付丽莉
于建彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-05-06
Filing date
2014-05-06
Publication date
2014-08-06
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201410186755.5A
Publication of CN103970852A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1446 Point-in-time backing up or restoration of persistent data
    • G06F11/1448 Management of the data involved in backup or backup restore
    • G06F11/1453 Management of the data involved in backup or backup restore using de-duplication of the data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data deduplication method for a backup server. The method operates as follows: a partition is set aside on a logical volume formed from the backup server's hard disks to temporarily hold the data to be backed up on the server, so that the partition acts as a data cache; the cached data in the partition are split into blocks and, taking each block as the unit of calculation, the hash value of each block is computed and stored in a database; the hash values are then compared, that is, the hash value of the new data in each backup is compared with the hash values already stored in the database. If the hash values are the same, the data are duplicates; if not, the data block is non-duplicate data and its hash value is added to the hash library. Compared with the prior art, the method effectively removes duplicate data and saves storage space, thereby reducing an enterprise's operating costs.

Description

A data deduplication method for a backup server
Technical field
The present invention relates to the field of computer technology, and specifically to a data deduplication method for a backup server that effectively improves the running speed and quality of the system.
Background
The core idea of data deduplication is this: when data are stored, they are checked and compared against data that already exist; if they are identical, the new copy is filtered out and the existing data are referenced through a pointer instead. Data deduplication is currently a popular research topic in the storage field because it brings clear benefits to the whole storage system and even to the whole enterprise. It reduces the space that storage occupies and the number of disk drives a user needs, and lowers spending on manpower, energy and electricity, which significantly cuts storage costs. It also reduces the amount of data transmitted over the network, which lowers energy consumption and network costs and saves a large amount of network bandwidth for data replication.
When a backup program repeatedly backs up the same file from the same directory on the network, or backs up the same file from several addresses, duplicate data end up in the temporary storage area. The amount of duplicate data on most networks is striking. Take a backup server that backs up the mail of a company's 100 employees: a good many of these employees receive the same messages every day. If the same message is kept by 80 people, the backup server holds 80 copies of it, one from each user, and as the daily mail volume grows, so does the number of duplicate messages. A single backup server therefore often receives backup requests for duplicate data, and removing the duplicates naturally reduces storage usage by a large margin.
To remove duplicate data, an enterprise sometimes has to assign dedicated staff to the deletion work. The manual process is tedious and error-prone and can easily lead to loss of backup data. Completing deduplication on the backup server automatically, without losing data, is therefore becoming the direction of future development. For small enterprises in particular, saving storage space can greatly reduce operating costs. On this basis, a method is provided that effectively deduplicates duplicate data on a backup server.
Summary of the invention
The technical task of the present invention is to remedy the deficiencies of the prior art by providing a data deduplication method for a backup server that reduces an enterprise's operating costs and saves storage space.
The technical solution of the present invention is implemented as follows. The data deduplication method for a backup server proceeds through the following steps:
Step 1: on the logical volume formed from the hard disks of the backup server, set aside a partition to temporarily hold the data to be backed up on the server; this partition serves as a data cache.
Step 2: split the data cached in this partition into blocks and, taking each block as the unit of calculation, compute its hash value and store it in a database.
Step 3: compare hash values, that is, compare the hash value of the new data in each backup with the hash values already stored in the library. If they are identical, the data are duplicates; if not, the data are non-duplicates and the hash value of the data block is added to the hash library.
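The three steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the names `split_blocks` and `classify_blocks` are hypothetical, an in-memory set stands in for the hash database, the 4-byte block size is chosen only to keep the demo readable, and MD5 is used as one common choice of hash function.

```python
import hashlib


def split_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut the cached data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def classify_blocks(data: bytes, block_size: int,
                    hash_library: set[str]) -> list[tuple[str, bool]]:
    """Return (hash, is_duplicate) for each block and record new hashes."""
    results = []
    for block in split_blocks(data, block_size):
        digest = hashlib.md5(block).hexdigest()
        is_duplicate = digest in hash_library
        if not is_duplicate:
            # Non-duplicate block: its hash joins the library for later backups.
            hash_library.add(digest)
        results.append((digest, is_duplicate))
    return results


# Hypothetical demo data: two backups that share the block b"BBBB".
library: set[str] = set()
first = classify_blocks(b"AAAABBBBAAAA", 4, library)
second = classify_blocks(b"BBBBCCCC", 4, library)
print([dup for _, dup in first])   # [False, False, True]
print([dup for _, dup in second])  # [True, False]
```

The library persists between the two calls, which is what lets the second backup recognise a block that the first one already stored.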
The logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks. In use, this logical volume first holds all of the data to be backed up. Then, according to the decision made in step 3, the non-duplicate data in this cache are saved on the server and the duplicate data are replaced with pointers. Once the duplicate data have been removed, the logical volume is quick-formatted and waits for the user's next backup.
The data splitting in step 2 proceeds as follows: a block size is defined first, the file is then cut into blocks of that size, and the hash function values are computed. Here the hash function values are a weak checksum and a strong MD5 checksum. The weak checksum is computed first and used for a hash lookup; if a match is found, the strong MD5 checksum is computed and used for a further hash lookup.
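The two-tier lookup might look like the following sketch. The text does not say which weak checksum is used, so `zlib.adler32` is an assumption here, and the class name `TwoTierIndex`, the `Entry` record and the lazy computation of MD5 for stored blocks are illustrative design choices rather than details taken from the patent.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    block: bytes                 # stand-in for the stored block's location
    md5: Optional[str] = None    # computed lazily, only after a weak match


class TwoTierIndex:
    def __init__(self) -> None:
        # weak checksum -> entries that share that checksum
        self._by_weak: dict[int, list[Entry]] = {}
        self.md5_calls = 0

    def _md5(self, data: bytes) -> str:
        self.md5_calls += 1
        return hashlib.md5(data).hexdigest()

    def is_duplicate(self, block: bytes) -> bool:
        weak = zlib.adler32(block)            # assumed weak checksum
        candidates = self._by_weak.setdefault(weak, [])
        if not candidates:
            # Weak miss: the block is new, so MD5 is not needed yet.
            candidates.append(Entry(block))
            return False
        # Weak hit: confirm with the strong MD5 checksum.
        strong = self._md5(block)
        for entry in candidates:
            if entry.md5 is None:
                entry.md5 = self._md5(entry.block)
            if entry.md5 == strong:
                return True
        # Same weak checksum but different content: keep it as a new entry.
        candidates.append(Entry(block, strong))
        return False


index = TwoTierIndex()
flags = [index.is_duplicate(b) for b in (b"alpha", b"beta", b"alpha", b"gamma")]
print(flags)            # [False, False, True, False]
print(index.md5_calls)  # 2
```

In this run MD5 is computed only twice for four blocks, because the three distinct blocks are all settled by the weak checksum alone and only the repeated block needs strong confirmation.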
Step 3 proceeds as follows: the hash function values computed in step 2 form a hash function value library, which is kept at its own fixed location in the partition. The hash values of the new data in each backup are compared with the hash values in this library. If a data block's hash value matches, a pointer is saved that points to the storage location of the duplicate data. If the data block's hash value does not match, the block is non-duplicate data: it is saved in a unique data area of the backup server, and its hash value is added to the hash library.
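To make the role of the pointers concrete, the sketch below stores each backup as a list of pointers into a unique data area and rebuilds the original bytes from that list. The class `BlockStore`, its method names, the use of list indices as pointers and the sample message are all hypothetical; the patent does not describe the pointer format or a restore procedure.

```python
import hashlib


class BlockStore:
    """Unique data area plus a hash library mapping hash -> block position."""

    def __init__(self) -> None:
        self.unique_blocks: list[bytes] = []     # the unique data area
        self.hash_library: dict[str, int] = {}   # hash -> index in unique_blocks

    def backup(self, data: bytes, block_size: int) -> list[int]:
        """Store the data and return one pointer per block."""
        pointers = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.md5(block).hexdigest()
            pointer = self.hash_library.get(digest)
            if pointer is None:
                # Non-duplicate block: save it and register its hash.
                pointer = len(self.unique_blocks)
                self.unique_blocks.append(block)
                self.hash_library[digest] = pointer
            pointers.append(pointer)
        return pointers

    def restore(self, pointers: list[int]) -> bytes:
        """Rebuild the original data by following the pointers."""
        return b"".join(self.unique_blocks[p] for p in pointers)


store = BlockStore()
mail = b"Dear all, the meeting moves to 3pm."   # 35 bytes, hypothetical message
backups = [store.backup(mail, 8) for _ in range(80)]   # 80 users keep the same mail
print(len(store.unique_blocks))                 # 5
print(all(store.restore(p) == mail for p in backups))  # True
```

Eighty backups of the same message occupy five stored blocks, while every backup remains logically complete because its pointer list still resolves to the full message.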
Compared with the prior art, the present invention has the following beneficial effects:
The data deduplication method for a backup server of the present invention replaces duplicate data with pointers while keeping the data logically complete. It effectively removes duplicate data and saves storage space, thereby reducing an enterprise's operating costs. It is widely applicable and is particularly suited to backup applications in small and medium-sized enterprises, such as mail backup systems and general-purpose data backup systems, where multi-user backup tasks contain a fair amount of duplicate data. It removes the duplicate data and keeps only the unique data, saving disk space and reducing costs. It is practical and easy to adopt.
Brief description of the drawings
Figure 1 is a schematic diagram of the backup server modules of the present invention;
Figure 2 is a schematic diagram of the data splitting step of the present invention;
Figure 3 is a schematic diagram of the hash comparison step of the present invention;
Figure 4 is a schematic diagram of an embodiment backed up with the prior art;
Figure 5 is a schematic diagram of an embodiment backed up with the present invention.
Detailed description of the embodiments
The data deduplication method for a backup server of the present invention is described in detail below with reference to the drawings.
The basis for implementing the method is the backup server. As shown in Figure 1, the server contains four modules: a data cache module, a data splitting module, a hash comparison module and a unique data storage module.
The data cache module is built by setting aside a partition on the RAID 5 logical volume to temporarily hold the data to be backed up on the server. The size of this partition can be chosen according to the user's backup habits, that is, the amount of data in one backup.
The data splitting module splits the data in the data cache module into blocks; each block is the unit of calculation whose hash value is compared with the entries of the hash set in the hash comparison module described next.
The hash comparison module takes the blocks split from the data in the cache partition and, with each block as the unit of calculation, computes its hash function value (usually with MD5 or SHA-1). These values are organized into a hash function value library that is kept at its own location in the partition. The hash of the new data in each backup is compared with the hashes in the library: if the values are identical, the data are duplicates; otherwise they are non-duplicates, the data block must be saved in the unique data area of the server, and its hash value is added to the hash library.
The unique data storage module sits at the end of this flow: after a backup request is made, the data are first stored in the cache module, then pass through the data splitting module, and once the hash values are computed they enter the hash comparison module. If the hash value matches, the data are duplicates; otherwise they are non-duplicate, that is, unique, data and must be saved on the designated partition of the server.
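The flow through the four modules can be illustrated with ordinary directories standing in for the cache partition and the unique data area. Everything named here is an assumption made for the sketch: the function `run_backup`, the 4-byte `BLOCK_SIZE`, the file names, the use of the MD5 hex digest as both pointer and file name, and deleting staged files as a stand-in for quick-formatting the cache volume.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

BLOCK_SIZE = 4  # hypothetical block size, tiny so the demo is easy to follow


def run_backup(staging: Path, unique_area: Path,
               hash_library: set[str]) -> dict[str, list[str]]:
    """Process every staged file, then leave the staging area empty."""
    pointers_by_file = {}
    for staged in sorted(staging.iterdir()):
        data = staged.read_bytes()                         # data cache module
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):          # data splitting module
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.md5(block).hexdigest()        # hash comparison module
            if digest not in hash_library:
                (unique_area / digest).write_bytes(block)  # unique data storage module
                hash_library.add(digest)
            pointers.append(digest)                        # pointer to the stored block
        pointers_by_file[staged.name] = pointers
        staged.unlink()                                    # clear the cache for the next backup
    return pointers_by_file


root = Path(tempfile.mkdtemp())
staging, unique_area = root / "staging", root / "unique"
staging.mkdir()
unique_area.mkdir()
(staging / "a.txt").write_bytes(b"AAAABBBB")
(staging / "b.txt").write_bytes(b"BBBBCCCC")

library: set[str] = set()
recipes = run_backup(staging, unique_area, library)
stored = len(list(unique_area.iterdir()))
staged_left = len(list(staging.iterdir()))
print(stored, staged_left)  # 3 0
shutil.rmtree(root)
```

Four blocks arrive in the cache, three are written to the unique data area, and the shared block `BBBB` is represented in both files by the same pointer.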
The deduplication method is carried out across the four modules of the backup server described above. As shown in Figures 2 and 3, it is implemented as follows:
Step 1: on the backup server, build a RAID 5 logical volume from three or more hard disks. In use, this logical volume first holds all of the data to be backed up. Then, according to the decision made in step 3, the non-duplicate data in this cache are saved on the server and the duplicate data are replaced with pointers. Once the duplicate data have been removed, the logical volume is quick-formatted and waits for the user's next backup;
Step 2: split the data held in the data cache into blocks. The file is cut using a predefined block size, and the weak checksum and the strong MD5 checksum are computed. The weak checksum exists mainly to improve the performance of differential encoding: it is computed first and used for a hash lookup, and only if a match is found is the strong MD5 checksum computed for a further hash lookup. Because the weak checksum is much cheaper to compute than MD5, this effectively improves encoding performance;
Step 3: the hash function values computed in step 2 form a hash function value library, which is kept at its own fixed location in the partition. The hash values of the new data in each backup are compared with the hash values in this library. If a data block's hash value matches, a pointer is saved that points to the storage location of the duplicate data. If the data block's hash value does not match, the block is non-duplicate data, which shows that it is new content that has not been stored before; the block must be saved in the data partition of the backup server, and its hash value is added to the hash library.
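Keeping the hash library at a fixed location means that it survives from one backup run to the next. One way to sketch that is a small SQLite file; the table name `hash_library`, the helper names `open_library` and `lookup_or_add`, the integer `location` column and the temporary file path are all assumptions for illustration, since the patent does not specify how the library is stored.

```python
import hashlib
import os
import shutil
import sqlite3
import tempfile


def open_library(path: str) -> sqlite3.Connection:
    """Open (or create) the hash library stored at a fixed path."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hash_library ("
        " digest TEXT PRIMARY KEY,"
        " location INTEGER NOT NULL)"
    )
    return conn


def lookup_or_add(conn: sqlite3.Connection, block: bytes,
                  next_location: int) -> tuple[int, bool]:
    """Return (location, is_duplicate); register the block if it is new."""
    digest = hashlib.md5(block).hexdigest()
    row = conn.execute(
        "SELECT location FROM hash_library WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return row[0], True          # pointer to the existing copy
    conn.execute(
        "INSERT INTO hash_library (digest, location) VALUES (?, ?)",
        (digest, next_location),
    )
    conn.commit()
    return next_location, False


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "hash_library.db")   # hypothetical fixed location

conn = open_library(path)
first = lookup_or_add(conn, b"block-1", 0)
conn.close()

conn = open_library(path)                         # a later backup run
second = lookup_or_add(conn, b"block-1", 1)
third = lookup_or_add(conn, b"block-2", 1)
conn.close()
shutil.rmtree(workdir)

print(first, second, third)  # (0, False) (0, True) (1, False)
```

The second run finds `block-1` even though the connection was closed in between, which is the property that a library held only in memory would not have.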
Figures 4 and 5 show an embodiment that takes as its example the growth of an enterprise's backup data volume over the five days of a week. A large amount of duplicate data is produced every day, which occupies storage resources and raises costs. The method of the present invention is then used to remove the duplicate data (the removed data are shown in black in the figures), and this is repeated every day. As can be seen, the total volume of backup data is clearly reduced.
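The effect over five days can be imitated with purely synthetic numbers. The figures from the patent are not reproduced here, so the 100 blocks per day, the 10 new blocks per day and the helper `daily_backup` are invented solely to show how the two totals diverge.

```python
import hashlib


def daily_backup(day: int, blocks_per_day: int = 100,
                 new_per_day: int = 10) -> list[bytes]:
    """Synthetic backup: mostly shared blocks plus a few that are new that day."""
    shared = [f"shared-{i}".encode() for i in range(blocks_per_day - new_per_day)]
    fresh = [f"day{day}-{i}".encode() for i in range(new_per_day)]
    return shared + fresh


seen: set[str] = set()
without_dedup: list[int] = []
with_dedup: list[int] = []
total_raw = 0
for day in range(1, 6):
    blocks = daily_backup(day)
    total_raw += len(blocks)
    for block in blocks:
        seen.add(hashlib.md5(block).hexdigest())
    without_dedup.append(total_raw)   # every block stored again each day
    with_dedup.append(len(seen))      # only unique blocks stored

print(without_dedup)  # [100, 200, 300, 400, 500]
print(with_dedup)     # [100, 110, 120, 130, 140]
```

With these made-up proportions the undeduplicated total grows by a full backup each day, while the deduplicated total grows only by the blocks that are actually new.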
The foregoing is merely an embodiment of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (4)

1. A data deduplication method for a backup server, characterized in that it proceeds through the following steps:
Step 1: on the logical volume formed from the hard disks of the backup server, set aside a partition to temporarily hold the data to be backed up on the server; this partition serves as a data cache;
Step 2: split the data cached in this partition into blocks and, taking each block as the unit of calculation, compute its hash value and store it in a database;
Step 3: compare hash values, that is, compare the hash value of the new data in each backup with the hash values already stored in the library; if they are identical, the data are duplicates; if not, the data are non-duplicates and the hash value of the data block is added to the hash library.
2. The data deduplication method for a backup server according to claim 1, characterized in that: the logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks; in use, this logical volume first holds all of the data to be backed up; then, according to the decision made in step 3, the non-duplicate data in this cache are saved on the server and the duplicate data are replaced with pointers; once the duplicate data have been removed, the logical volume is quick-formatted and waits for the user's next backup.
3. The data deduplication method for a backup server according to claim 1 or 2, characterized in that: the data splitting in step 2 proceeds as follows: a block size is defined first, the file is then cut into blocks of that size, and the hash function values are computed; the hash function values are a weak checksum and a strong MD5 checksum; the weak checksum is computed first and used for a hash lookup, and if a match is found, the strong MD5 checksum is computed and used for a further hash lookup.
4. The data deduplication method for a backup server according to claim 3, characterized in that: step 3 proceeds as follows: the hash function values computed in step 2 form a hash function value library, which is kept at its own fixed location in the partition; the hash values of the new data in each backup are compared with the hash values in this library; if a data block's hash value matches, a pointer is saved that points to the storage location of the duplicate data; if the data block's hash value does not match, the block is non-duplicate data, it is saved in a unique data area of the backup server, and its hash value is added to the hash library.
CN201410186755.5A 2014-05-06 2014-05-06 Data deduplication method of backup server Pending CN103970852A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410186755.5A CN103970852A (en) 2014-05-06 2014-05-06 Data deduplication method of backup server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410186755.5A CN103970852A (en) 2014-05-06 2014-05-06 Data deduplication method of backup server

Publications (1)

Publication Number Publication Date
CN103970852A true CN103970852A (en) 2014-08-06

Family

ID=51240349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410186755.5A Pending CN103970852A (en) 2014-05-06 2014-05-06 Data deduplication method of backup server

Country Status (1)

Country Link
CN (1) CN103970852A (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140806
