CN103970852A - Data deduplication method of backup server - Google Patents
- Publication number
- CN103970852A (application CN201410186755.5A)
- Authority
- CN
- China
- Prior art keywords
- data
- hash
- value
- hash value
- backup
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1453—Management of the data involved in backup or backup restore using de-duplication of the data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
- G06F3/0641—De-duplication techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a data deduplication method for a backup server. The specific operation process is as follows: a partition is created on a logical volume formed from the hard disks of the backup server to temporarily hold the data to be backed up on the server, so that the partition acts as a data cache; the cached data in the partition is divided into blocks and, taking each block as a unit of calculation, its hash value is calculated and stored in a database; the hash values are then compared, that is, the hash value of the new data in each backup is compared with the hash values already stored in the database. If the hash values are the same, the data is duplicate data; if not, the data block is non-duplicate data and its hash value is added to the hash library. Compared with the prior art, the method can effectively remove duplicate data and greatly save storage space, thereby reducing the operating costs of enterprises.
Description
Technical field
The present invention relates to the field of computer technology, and specifically to a data deduplication method for a backup server that effectively improves system running speed and quality.
Background technology
The core idea of data deduplication technology is this: when data is stored, it is checked and compared against data that already exists; if the two are identical, the backup of that portion of the data is filtered out and the existing data is referenced through a pointer instead. Data deduplication is currently a popular research topic in the storage field, because it brings many clear benefits to the whole storage system and even the whole enterprise. Data deduplication fundamentally reduces the space occupied by storage and the number of disk drives a user needs, and eases spending on manpower, energy, and electricity, thereby significantly reducing storage costs. In addition, data deduplication reduces the amount of data transmitted over the network, which lowers energy consumption and network costs and saves a great deal of network bandwidth for data replication.
When a backup program repeatedly backs up the same file from the same directory on the network, or backs up the same file from multiple addresses, the duplicate data is backed up in the staging area. The amount of duplicate data on most networks is striking. Take a backup server that backs up the mail of a company's 100 employees: a considerable number of these employees receive identical mail every day. If the same message is kept by 80 people, then the backup server holds 80 copies of it, one from each of the 80 users, and as the users' daily mail volume grows, the number of duplicate messages grows too. A single backup server often receives backup requests for duplicate data; deleting the duplicates naturally reduces the storage space occupied considerably.
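The mail example above can be worked through as simple arithmetic. The sketch below is only illustrative: the figure of 80 copies comes from the text, while the message size and the decision to ignore pointer overhead are assumptions made for this example.

```python
# Only COPIES comes from the text; the other values are assumed for illustration.
COPIES = 80              # users who keep the same message
MESSAGE_SIZE_KB = 500    # hypothetical size of one message
POINTER_SIZE_KB = 0      # pointer overhead is ignored in this sketch

# Without deduplication every copy is stored in full.
without_dedup_kb = COPIES * MESSAGE_SIZE_KB

# With deduplication one copy is stored and the rest become pointers.
with_dedup_kb = MESSAGE_SIZE_KB + (COPIES - 1) * POINTER_SIZE_KB

saved_fraction = 1 - with_dedup_kb / without_dedup_kb

print(f"without deduplication: {without_dedup_kb} KB")
print(f"with deduplication:    {with_dedup_kb} KB")
print(f"space saved:           {saved_fraction:.2%}")
```

Under these assumed numbers the 80 copies shrink from 40000 KB to 500 KB, a saving of 98.75%. Real savings depend on how much of the backed-up data is actually duplicated.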
To delete duplicate data, an enterprise sometimes has to assign dedicated staff to the deletion work. The deletion process is tedious and error-prone and can easily lead to loss of backup data. Completing deduplication on the backup server automatically, while ensuring that no data is lost, has therefore become the direction of future development. For small enterprises in particular, saving storage space can greatly reduce operating costs. On this basis, a method is now provided that can effectively deduplicate data on a backup server.
Summary of the invention
The technical task of the present invention is to remedy the deficiencies of the prior art by providing a data deduplication method for a backup server that reduces enterprise operating costs and saves storage space.
The technical solution of the present invention is realized as follows. The specific operation process of the data deduplication method for a backup server is:
Step 1: on a logical volume formed from the hard disks of the backup server, create a partition to temporarily hold the data to be backed up on the server; this partition acts as a data cache;
Step 2: divide the data cached in the partition into blocks and, taking each block as a unit of calculation, calculate its hash value and store it in a database;
Step 3: compare hash values, that is, compare the hash value of the new data in each backup with the hash values already stored in the library; if they are identical, the data is duplicate data; if not, the data is non-duplicate data, and the hash value of the data block is added to the hash library.
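The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the cache is an in-memory byte string, the database of hash values is a Python set, the block size of 4 bytes is chosen only to keep the example readable, and the function names `split_blocks` and `classify` are hypothetical.

```python
import hashlib

BLOCK_SIZE = 4  # assumed; tiny so the example is easy to follow


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Step 2: cut the cached data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def classify(data: bytes, hash_library: set[str]) -> list[tuple[bytes, bool]]:
    """Step 3: mark each block as duplicate (True) or new (False)."""
    results = []
    for block in split_blocks(data):
        digest = hashlib.md5(block).hexdigest()
        is_duplicate = digest in hash_library
        if not is_duplicate:
            hash_library.add(digest)  # new block: record its hash value
        results.append((block, is_duplicate))
    return results


hash_library: set[str] = set()   # stands in for the database of hash values
cache = b"AAAABBBBAAAACCCC"      # step 1: data staged in the cache
report = classify(cache, hash_library)

for block, is_duplicate in report:
    print(block, "duplicate" if is_duplicate else "new")
```

Here the third block repeats the first, so it is reported as a duplicate, while the other three blocks are new and their hash values enter the library.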
The logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks. In use, this logical volume first holds all the data to be backed up; then, according to the determination in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers. After the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup.
The detailed process of the data segmentation in step 2 is: first define a block size, then cut each file into blocks of the defined size and calculate the hash function values. The hash function values here are a weak checksum and a strong MD5 checksum: the weak checksum is calculated first and used for a hash lookup; if it is found, the strong MD5 checksum is calculated and used for a further hash lookup.
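The two-level lookup described above might look like the following sketch. The text does not name the weak checksum, so Adler-32 (via `zlib.adler32`) is used here purely as an assumed stand-in; the class name `TwoLevelHashLibrary` and its layout are likewise hypothetical. In this sketch a new block still has its MD5 digest computed once so that it can be recorded for later comparisons.

```python
import hashlib
import zlib


class TwoLevelHashLibrary:
    """Weak checksum first, strong MD5 checksum only on a weak match."""

    def __init__(self) -> None:
        # weak checksum -> set of MD5 digests of blocks with that checksum
        self._index: dict[int, set[bytes]] = {}

    def is_duplicate(self, block: bytes) -> bool:
        weak = zlib.adler32(block)  # assumed weak checksum
        strong_set = self._index.get(weak)
        if strong_set is None:
            # Weak checksum not found: the block is new, so no strong
            # lookup is needed. Record both values for future lookups.
            self._index[weak] = {hashlib.md5(block).digest()}
            return False
        # Weak checksum found: confirm with the strong checksum.
        strong = hashlib.md5(block).digest()
        if strong in strong_set:
            return True
        strong_set.add(strong)  # weak collision, but different content
        return False


library = TwoLevelHashLibrary()
blocks = (b"ACAA", b"BABA", b"ACAA", b"ZZZZ")
results = [library.is_duplicate(b) for b in blocks]
print(results)
```

The blocks `b"ACAA"` and `b"BABA"` were chosen because they share the same Adler-32 value while differing in content, so the second one passes the weak lookup and is then correctly rejected by the MD5 lookup. Only the repeated `b"ACAA"` is reported as a duplicate.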
The detailed process of step 3 is: the hash function values calculated in step 2 are assembled into a hash function value library, which is kept separately at a fixed location in the partition. The hash values of the new data in each backup are compared with the hash values in this library. If the hash value of a data block is the same, a pointer is saved that points to the storage location of the duplicate data; if the hash value of a data block is different, the block is non-duplicate data, is saved in the unique data area of the backup server, and its hash value is added to the hash library.
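To make the role of the pointers concrete, the sketch below stores each backup as a list of pointers into a unique data area and shows that the original data can be rebuilt from them. It is an illustration under assumptions: the class `DedupStore`, its in-memory lists and dictionaries, and the 4-byte block size are all hypothetical choices, and a pointer is modelled simply as an index into the list of unique blocks.

```python
import hashlib

BLOCK_SIZE = 4  # assumed block size


class DedupStore:
    def __init__(self, block_size: int = BLOCK_SIZE) -> None:
        self.block_size = block_size
        self.unique_blocks: list[bytes] = []      # unique data area
        self.hash_library: dict[bytes, int] = {}  # MD5 digest -> block location
        self.backups: dict[str, list[int]] = {}   # backup name -> pointers

    def backup(self, name: str, data: bytes) -> None:
        pointers = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.md5(block).digest()
            location = self.hash_library.get(digest)
            if location is None:
                # Non-duplicate block: store it and record its hash value.
                location = len(self.unique_blocks)
                self.unique_blocks.append(block)
                self.hash_library[digest] = location
            # Duplicate or not, the backup keeps only a pointer.
            pointers.append(location)
        self.backups[name] = pointers

    def restore(self, name: str) -> bytes:
        """Follow the pointers to rebuild the logically complete data."""
        return b"".join(self.unique_blocks[p] for p in self.backups[name])


store = DedupStore()
store.backup("first", b"AAAABBBBAAAA")
store.backup("second", b"AAAACCCC")
print(store.backups)
print(len(store.unique_blocks), "unique blocks stored")
```

Five blocks are backed up across the two sessions, but only three are physically stored; `restore` returns each backup unchanged because every pointer resolves to the block it replaced.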
Compared with the prior art, the beneficial effects of the present invention are:
The data deduplication method for a backup server of the present invention uses pointers in place of duplicate data while keeping the data logically complete. It can effectively remove duplicate data and greatly save storage space, thereby reducing enterprise operating costs. It is widely applicable, especially to backup applications in small and medium-sized enterprises, such as mail data backup systems and general data backup systems, where backup jobs from multiple users contain some duplicate data. It effectively deletes duplicate data and keeps only unique data, saving storage disk space and reducing costs; it is practical and easy to popularize.
Brief description of the drawings
Fig. 1 is a schematic diagram of the backup server modules of the present invention;
Fig. 2 is a schematic diagram of the data segmentation step of the present invention;
Fig. 3 is a schematic diagram of the hash comparison step of the present invention;
Fig. 4 is a schematic diagram of an embodiment using a prior-art backup;
Fig. 5 is a schematic diagram of an embodiment using the backup of the present invention.
Embodiment
The data deduplication method for a backup server of the present invention is described in detail below with reference to the drawings.
A data deduplication method for a backup server is provided. The basis for implementing it is built first: the backup server. As shown in Fig. 1, the server is provided with four modules: a data cache module, a data segmentation module, a hash comparison module, and a unique data storage module.
The data cache module is built by creating a partition on a RAID 5 logical volume to temporarily hold the data to be backed up on the server. The size of this partition can be determined from the user's backup habits (the amount of data in one backup).
The data segmentation module divides the data in the data cache module into blocks; taking each block as a unit of calculation, its hash value is compared with the data in the hash set of the hash comparison module described below.
The hash comparison module works on the data in the cache partition divided into blocks: taking each block as a unit of calculation, it calculates the block's hash function value (usually with MD5 or SHA-1), organizes these values into a hash function value library, and keeps the library separately at one location in the partition. The hash values of the new data in each backup are compared with the hash values in the library. If the values are identical, the data is duplicate data; otherwise it is non-duplicate data, the data block must be saved in the unique data area of the server, and its hash value is also added to the hash library.
The unique data storage module: once a backup request has been made, the data is first stored in the cache module, then passes to the data segmentation module, and after its hash values have been calculated enters the hash comparison module. If the hash values are identical, the data is duplicate data; otherwise it is non-duplicate data, that is, unique data, and must be saved on the designated partition of the server.
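The module descriptions above say the block hash is usually computed with MD5 or SHA-1. As a small illustration, the hypothetical helper `block_hash` below makes that choice a parameter using Python's standard `hashlib`; the function name and the restriction to these two algorithms are assumptions of this sketch.

```python
import hashlib

SUPPORTED = ("md5", "sha1")  # the two algorithms named in the text


def block_hash(block: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest of one data block using the chosen algorithm."""
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    return hashlib.new(algorithm, block).hexdigest()


block = b"example block"
md5_value = block_hash(block, "md5")
sha1_value = block_hash(block, "sha1")
print(len(md5_value) * 4, "bit MD5:  ", md5_value)
print(len(sha1_value) * 4, "bit SHA-1:", sha1_value)
```

Whichever algorithm is chosen, the same one has to be used for every block, since hash values from different algorithms cannot be compared in the same hash library.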
Further, the data deduplication method is carried out among the four modules of the backup server described above. As shown in Figs. 2 and 3, the specific implementation process is:
Step 1: on the backup server, build a RAID 5 logical volume from three or more hard disks. In use, this logical volume first holds all the data to be backed up; then, according to the determination in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers. After the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup;
Step 2: divide the data held in the data cache into blocks, cutting each file according to a predefined block size, and calculate a weak checksum and a strong MD5 checksum. The weak checksum mainly serves to improve the performance of differential encoding: the weak checksum is calculated first and used for a hash lookup; if it is found, the strong MD5 checksum is calculated and used for a further hash lookup. Because the weak checksum is much cheaper to compute than MD5, this effectively improves encoding performance;
Step 3: assemble the hash function values calculated in step 2 into a hash function value library, kept separately at a fixed location in the partition. The hash values of the new data in each backup are compared with the hash values in this library. If the hash value of a data block is the same, a pointer is saved that points to the storage location of the duplicate data; if the hash value of a data block is different, the block is non-duplicate data, which shows that it is new content that has not been stored before. The data block must be saved in the data partition of the backup server, and its hash value is added to the hash library.
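The lifecycle of the cache volume in the steps above (stage everything, deduplicate, then clear the cache for the next backup) can be sketched with ordinary directories. Everything here is an assumption made for illustration: temporary directories stand in for the cache partition and the data partition, deleting and recreating the cache directory stands in for quick-formatting the volume, unique blocks are stored as files named after their MD5 digest so that the directory itself plays the role of the hash library, and `process_cache` is a hypothetical name.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

BLOCK_SIZE = 4  # assumed block size


def process_cache(cache_dir: Path, unique_dir: Path) -> dict[str, list[str]]:
    """Deduplicate every staged file, then empty the cache directory."""
    recipes = {}
    for staged in sorted(cache_dir.iterdir()):
        data = staged.read_bytes()
        pointers = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            digest = hashlib.md5(block).hexdigest()
            target = unique_dir / digest
            if not target.exists():      # hash value not in the library
                target.write_bytes(block)
            pointers.append(digest)      # pointer = name of the stored block
        recipes[staged.name] = pointers
    # Stand-in for quick-formatting the cache volume after deduplication.
    shutil.rmtree(cache_dir)
    cache_dir.mkdir()
    return recipes


root = Path(tempfile.mkdtemp())
cache_dir, unique_dir = root / "cache", root / "unique"
cache_dir.mkdir()
unique_dir.mkdir()

# Step 1: all incoming data is first staged in the cache.
(cache_dir / "a.txt").write_bytes(b"AAAABBBB")
(cache_dir / "b.txt").write_bytes(b"BBBBAAAA")

recipes = process_cache(cache_dir, unique_dir)
stored = len(list(unique_dir.iterdir()))
cache_empty = not any(cache_dir.iterdir())
print(stored, "unique blocks stored; cache empty:", cache_empty)

shutil.rmtree(root)  # clean up the temporary directories
```

The two staged files contain four blocks in total but only two distinct ones, so two block files are written and the cache is left empty, ready for the next backup session.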
An embodiment is shown in Figs. 4 and 5. This embodiment takes as its example a chart of the growth in an enterprise's backup data volume over the five working days of a week. A large amount of duplicate data is produced every day, occupying storage resources and raising production costs. The method of the present invention is then used to delete the duplicate data, namely the data shown in bold black in the figures. With this repeated every day, it can be seen that the total volume of backup data is markedly reduced.
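The effect described in this embodiment can be imitated with a toy simulation. The daily data below is invented for illustration and is not taken from the figures; the 4-byte block size and the use of MD5 alone are also assumptions of this sketch.

```python
import hashlib

BLOCK_SIZE = 4  # assumed block size

# Hypothetical daily backups: each day repeats most of the earlier data.
daily_backups = {
    "Mon": b"AAAABBBB",
    "Tue": b"AAAABBBBCCCC",
    "Wed": b"AAAABBBBCCCCDDDD",
    "Thu": b"AAAABBBBCCCCDDDD",
    "Fri": b"AAAABBBBCCCCDDDDEEEE",
}

hash_library = set()
raw_total = 0     # bytes stored when every backup is kept in full
dedup_total = 0   # bytes stored when only unique blocks are kept

for day, data in daily_backups.items():
    raw_total += len(data)
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.md5(block).digest()
        if digest not in hash_library:
            hash_library.add(digest)
            dedup_total += len(block)
    print(f"{day}: cumulative raw={raw_total} bytes, "
          f"deduplicated={dedup_total} bytes")
```

With this invented data the week's backups total 72 bytes without deduplication and 20 bytes with it, because only five distinct blocks ever appear. The size of the reduction in practice depends entirely on how repetitive the real backups are.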
The foregoing describes only embodiments of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (4)
1. A data deduplication method for a backup server, characterized in that its specific operation process is as follows:
Step 1: on a logical volume formed from the hard disks of the backup server, create a partition to temporarily hold the data to be backed up on the server; this partition acts as a data cache;
Step 2: divide the data cached in the partition into blocks and, taking each block as a unit of calculation, calculate its hash value and store it in a database;
Step 3: compare hash values, that is, compare the hash value of the new data in each backup with the hash values already stored in the library; if they are identical, the data is duplicate data; if not, the data is non-duplicate data, and the hash value of the data block is added to the hash library.
2. The data deduplication method for a backup server according to claim 1, characterized in that: the logical volume in step 1 is a RAID 5 logical volume built from three or more hard disks; in use, this logical volume first holds all the data to be backed up; then, according to the determination in step 3, the non-duplicate data in the cache is saved on the server and the duplicate data is replaced with pointers; after the duplicate data has been deleted, the logical volume is quick-formatted and waits for the user's next backup.
3. The data deduplication method for a backup server according to claim 1 or 2, characterized in that: the detailed process of the data segmentation in step 2 is: first define a block size, then cut each file into blocks of the defined size and calculate the hash function values, the hash function values being a weak checksum and a strong MD5 checksum; the weak checksum is calculated first and used for a hash lookup; if it is found, the strong MD5 checksum is calculated and used for a further hash lookup.
4. The data deduplication method for a backup server according to claim 3, characterized in that: the detailed process of step 3 is: the hash function values calculated in step 2 are assembled into a hash function value library, kept separately at a fixed location in the partition; the hash values of the new data in each backup are compared with the hash values in this library; if the hash value of a data block is the same, a pointer is saved that points to the storage location of the duplicate data; if the hash value of a data block is different, the block is non-duplicate data, is saved in the unique data area of the backup server, and its hash value is added to the hash library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410186755.5A CN103970852A (en) | 2014-05-06 | 2014-05-06 | Data deduplication method of backup server |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410186755.5A CN103970852A (en) | 2014-05-06 | 2014-05-06 | Data deduplication method of backup server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103970852A true CN103970852A (en) | 2014-08-06 |
Family
ID=51240349
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410186755.5A Pending CN103970852A (en) | 2014-05-06 | 2014-05-06 | Data deduplication method of backup server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103970852A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095027A (en) * | 2015-09-11 | 2015-11-25 | 浪潮(北京)电子信息产业有限公司 | Data backup method and apparatus |
CN105871705A (en) * | 2016-06-07 | 2016-08-17 | 北京赛思信安技术股份有限公司 | Method for judging E-mail repeated contents during massive E-mail analysis processing process |
CN106227901A (en) * | 2016-09-19 | 2016-12-14 | 郑州云海信息技术有限公司 | A kind of based on heavily deleting and compressing parallel space method for saving |
CN106598765A (en) * | 2015-10-15 | 2017-04-26 | 北京国双科技有限公司 | Data check method and device |
CN107066352A (en) * | 2017-03-02 | 2017-08-18 | 陈辉 | With delete again and remote functionality portable intelligent device backup devices and methods therefor |
CN107113164A (en) * | 2014-12-18 | 2017-08-29 | 诺基亚技术有限公司 | The deduplication of encryption data |
CN107172112A (en) * | 2016-03-07 | 2017-09-15 | 阿里巴巴集团控股有限公司 | A kind of computer documents transmission method and device |
CN107315653A (en) * | 2017-03-02 | 2017-11-03 | 陈辉 | A kind of band deletes the portable storage device and implementation method of calculating and processing function again |
CN107346271A (en) * | 2016-05-05 | 2017-11-14 | 华为技术有限公司 | The method and calamity of Backup Data block are for end equipment |
CN108255422A (en) * | 2017-12-28 | 2018-07-06 | 浪潮通用软件有限公司 | A kind of storage method and storage device |
CN108427539A (en) * | 2018-03-15 | 2018-08-21 | 深信服科技股份有限公司 | Offline duplicate removal compression method, device and the readable storage medium storing program for executing of buffer memory device data |
CN109070345A (en) * | 2016-02-23 | 2018-12-21 | Abb瑞士股份有限公司 | Robot controller system and method |
CN110830361A (en) * | 2019-10-22 | 2020-02-21 | 新华三信息安全技术有限公司 | Mail data storage method and device |
CN111210352A (en) * | 2020-01-10 | 2020-05-29 | 李�荣 | Economic data statistical device and method based on block chain |
WO2021077313A1 (en) * | 2019-10-23 | 2021-04-29 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
CN115580594A (en) * | 2022-12-12 | 2023-01-06 | 四川大学 | E-mail processing and transmitting method, system and storage medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101158954A (en) * | 2007-11-07 | 2008-04-09 | 上海爱数软件有限公司 | Method for recognizing repeat data in computer storage |
CN101989929A (en) * | 2010-11-17 | 2011-03-23 | 中兴通讯股份有限公司 | Disaster recovery data backup method and system |
CN102281321A (en) * | 2011-04-25 | 2011-12-14 | 程旭 | Data cloud storage partitioning and backup method and device |
-
2014
- 2014-05-06 CN CN201410186755.5A patent/CN103970852A/en active Pending
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107113164A (en) * | 2014-12-18 | 2017-08-29 | 诺基亚技术有限公司 | The deduplication of encryption data |
CN105095027A (en) * | 2015-09-11 | 2015-11-25 | 浪潮(北京)电子信息产业有限公司 | Data backup method and apparatus |
CN106598765A (en) * | 2015-10-15 | 2017-04-26 | 北京国双科技有限公司 | Data check method and device |
CN109070345A (en) * | 2016-02-23 | 2018-12-21 | Abb瑞士股份有限公司 | Robot controller system and method |
CN107172112A (en) * | 2016-03-07 | 2017-09-15 | 阿里巴巴集团控股有限公司 | A kind of computer documents transmission method and device |
CN107172112B (en) * | 2016-03-07 | 2020-10-02 | 阿里巴巴集团控股有限公司 | Computer file transmission method and device |
CN107346271A (en) * | 2016-05-05 | 2017-11-14 | 华为技术有限公司 | The method and calamity of Backup Data block are for end equipment |
CN105871705A (en) * | 2016-06-07 | 2016-08-17 | 北京赛思信安技术股份有限公司 | Method for judging E-mail repeated contents during massive E-mail analysis processing process |
CN106227901A (en) * | 2016-09-19 | 2016-12-14 | 郑州云海信息技术有限公司 | A kind of based on heavily deleting and compressing parallel space method for saving |
CN107066352A (en) * | 2017-03-02 | 2017-08-18 | 陈辉 | With delete again and remote functionality portable intelligent device backup devices and methods therefor |
CN107315653A (en) * | 2017-03-02 | 2017-11-03 | 陈辉 | A kind of band deletes the portable storage device and implementation method of calculating and processing function again |
CN108255422A (en) * | 2017-12-28 | 2018-07-06 | 浪潮通用软件有限公司 | A kind of storage method and storage device |
CN108427539A (en) * | 2018-03-15 | 2018-08-21 | 深信服科技股份有限公司 | Offline duplicate removal compression method, device and the readable storage medium storing program for executing of buffer memory device data |
CN110830361A (en) * | 2019-10-22 | 2020-02-21 | 新华三信息安全技术有限公司 | Mail data storage method and device |
CN110830361B (en) * | 2019-10-22 | 2021-12-07 | 新华三信息安全技术有限公司 | Mail data storage method and device |
WO2021077313A1 (en) * | 2019-10-23 | 2021-04-29 | Beijing Voyager Technology Co., Ltd. | Systems and methods for autonomous driving |
CN111210352A (en) * | 2020-01-10 | 2020-05-29 | 李�荣 | Economic data statistical device and method based on block chain |
CN115580594A (en) * | 2022-12-12 | 2023-01-06 | 四川大学 | E-mail processing and transmitting method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103970852A (en) | Data deduplication method of backup server | |
CN101989929B (en) | Disaster recovery data backup method and system | |
CN102222085B (en) | Data de-duplication method based on combination of similarity and locality | |
CN103473239B (en) | A kind of data of non relational database update method and device | |
CN102467572B (en) | Data block inquiring method for supporting data de-duplication program | |
CN104932841A (en) | Saving type duplicated data deleting method in cloud storage system | |
CN104301360A (en) | Method, log server and system for recording log data | |
CN102200936A (en) | Intelligent configuration storage backup method suitable for cloud storage | |
CN102156727A (en) | Method for deleting repeated data by using double-fingerprint hash check | |
CN103488687A (en) | Searching system and searching method of big data | |
CN105630810B (en) | A method of mass small documents are uploaded in distributed memory system | |
CN104462389A (en) | Method for implementing distributed file systems on basis of hierarchical storage | |
CN103530388A (en) | Performance improving data processing method in cloud storage system | |
CN103279502B (en) | A kind of framework and method with the data de-duplication file system be combined with parallel file system | |
CN102591864B (en) | Data updating method and device in comparison system | |
CN109800185A (en) | A kind of data cache method in data-storage system | |
CN106990914B (en) | Data deleting method and device | |
CN105487942A (en) | Backup and remote copy method based on data deduplication | |
CN109299115A (en) | A kind of date storage method, device, server and storage medium | |
CN103051671A (en) | Repeating data deletion method for cluster file system | |
CN102880671A (en) | Method for actively deleting repeated data of distributed file system | |
US11397706B2 (en) | System and method for reducing read amplification of archival storage using proactive consolidation | |
CN103916459A (en) | Big data filing and storing system | |
CN103631933A (en) | Distributed duplication elimination system-oriented data routing method | |
CN105824881A (en) | Repeating data and deleted data placement method and device based on load balancing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20140806 |