CN102982122A - Repeating data deleting method suitable for mass storage system - Google Patents
- Publication number
- CN102982122A (application number CN2012104528309A)
- Authority
- CN
- China
- Prior art keywords
- data
- ssd
- storage
- fingerprint
- hash
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a data deduplication method suitable for mass storage systems. Small blocks of 4KB-8KB are used to split the data stream. Double verification with a hash value and a data fingerprint is used to ensure the deduplication rate, and the hash index is stored on a solid state disk (SSD) to improve retrieval efficiency. Data deduplication is suited to highly repetitive data. It reaches its highest efficiency, that is, the highest deduplication rate, in environments with a low data change rate, full data backups, long-term data retention, and inactive data. The deduplication rate can also be understood as the disk space release ratio. Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between random access memory (RAM) and hard disk drive (HDD) disks. This avoids the performance penalty of blocks that are too small, but the oversized blocks lower the deduplication rate.
Description
Technical field
The present invention relates to computer systems and mass storage systems, and specifically to a data deduplication method suitable for mass storage systems.
Background technology
With the rapid development of magnetic disks in recent years, enterprises and users increasingly use high-capacity disks for backup and archiving. Traditional backup policies tend to produce large amounts of redundant data on the storage device, which consumes disk space unnecessarily and wastes device resources. The excess redundant data also lengthens disk seek and positioning time, which lowers overall system performance.
Data deduplication addresses these problems. It compares the data to be backed up and, where identical data already exists, replaces the duplicate with a link or pointer to the single stored copy. This saves the storage space the data would otherwise need and, because there is less data, also shortens disk positioning time. Deduplication curbs data growth, increases effective storage space, improves storage efficiency, and lowers the total cost and management cost of storage. It also saves network bandwidth during data transmission and reduces operation and maintenance costs such as floor space and power supply.
Data deduplication also has drawbacks. It requires comparison, which adds computation and verification time. It must store a hash index, and the large speed imbalance between memory and disk makes lookups more time-consuming. The block size chosen by the chunking technique also affects how accurately duplicate data is removed.
This method improves on the shortcomings of existing methods: it offers faster verification, a finer block granularity, and more accurate removal of duplicate data. It improves, to varying degrees, on both criteria by which deduplication is judged: deduplication rate and performance.
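The pointer-substitution idea can be illustrated with a short Python sketch. It is not the patent's implementation: the `BlockStore` class, the use of SHA-256 as the block identifier, and the default 4 KB block size are assumptions chosen only for this example.

```python
import hashlib


class BlockStore:
    """Toy content-addressed store: each distinct block is kept once."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}  # digest -> block bytes (stored a single time)

    def put(self, data):
        """Store data and return its 'recipe': a list of block digests."""
        recipe = []
        for off in range(0, len(data), self.block_size):
            block = data[off:off + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            # A repeated block only adds a reference, not a second copy.
            self.blocks.setdefault(digest, block)
            recipe.append(digest)
        return recipe

    def get(self, recipe):
        """Rebuild the original data by following the references."""
        return b"".join(self.blocks[d] for d in recipe)
```

A backup is then represented by its recipe, and restoring it means concatenating the referenced blocks; identical blocks across backups share one stored copy.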
Summary of the invention
The purpose of the invention is to provide a data deduplication method suitable for mass storage systems.
This purpose is achieved as follows. The data stream is split into blocks of 4KB-8KB; blocks this small ensure a high deduplication rate and allow duplicate data to be located and removed more precisely. To address the performance impact of the difference in data transfer rate between HDD disks and RAM, the storage architecture is rebuilt around an SSD that acts as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and needs to be verified, the hash index stored on the SSD is read into RAM. Because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is maintained, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved. To address the low deduplication ratio and low deduplication accuracy, a double check of the hash value and the data fingerprint is used to maximize deduplication accuracy. Because the SSD serves as an intermediate tier and sustains I/O speed, the double verification does not degrade system performance. The concrete steps are as follows:
1) Receive the data stream to be stored;
2) Send it to the Dedupe System module;
3) Split the data stream into blocks of the specified size (4KB-8KB);
4) Compute the hash value of each data block;
5) Compute the data fingerprint of each data block;
6) Read the hash index table stored on the SSD;
7) Determine whether the hash value already exists in the hash index table;
A. If it exists: (1) read the data fingerprint index table and go on to compare the data fingerprint; (2) if the data fingerprint also exists, create a data link and do not store the data block; (3) if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD;
B. If it does not exist: (1) store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD; (2) send the data blocks stored on the SSD to the RAID array disks for storage.
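The steps above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not name the algorithms, so CRC-32 stands in for the "hash value" and SHA-1 for the "data fingerprint"; in-memory dictionaries stand in for the SSD and the RAID array; and the `DedupSystem` class and its method names are hypothetical. The sketch also flushes blocks stored under branch A(3) to RAID, which the source states explicitly only for branch B.

```python
import hashlib
import zlib


class DedupSystem:
    """Toy write path: hash check first, fingerprint check second."""

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.hash_index = set()       # hash index table (on SSD in the source)
        self.fingerprint_index = {}   # fingerprint -> block id (on SSD in the source)
        self.ssd_blocks = {}          # new blocks staged on the SSD
        self.raid_blocks = {}         # blocks at rest on the RAID array
        self.next_id = 0

    def _store_new(self, block, h, fp):
        """Steps 7A(3) / 7B(1): store block, hash value and fingerprint."""
        block_id = self.next_id
        self.next_id += 1
        self.ssd_blocks[block_id] = block
        self.hash_index.add(h)
        self.fingerprint_index[fp] = block_id
        return block_id

    def write(self, stream):
        """Process one data stream and return the block ids it links to."""
        links = []
        for off in range(0, len(stream), self.block_size):   # step 3
            block = stream[off:off + self.block_size]
            h = zlib.crc32(block)                             # step 4 (assumed hash)
            fp = hashlib.sha1(block).digest()                 # step 5 (assumed fingerprint)
            if h in self.hash_index and fp in self.fingerprint_index:
                # Steps 7A(1)-(2): both match, so only a link is created.
                links.append(self.fingerprint_index[fp])
            else:
                # Hash miss (7B) or hash hit with fingerprint miss (7A(3)).
                links.append(self._store_new(block, h, fp))
        self.flush()                                          # step 7B(2)
        return links

    def flush(self):
        """Move blocks staged on the SSD to the RAID array."""
        self.raid_blocks.update(self.ssd_blocks)
        self.ssd_blocks.clear()
```

The cheap hash lookup acts as a prefilter; the fingerprint comparison only decides the outcome when the hash value is already known, which is how a hash collision between different blocks is kept from causing a wrong link.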
The beneficial effects of the invention are as follows. Small blocks of 4KB-8KB are used to split the data stream, double verification with a hash value and a data fingerprint ensures the deduplication rate, and storing the hash index on an SSD improves retrieval efficiency. Data deduplication is suited to highly repetitive data, for example data with a low change rate, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency, that is, the highest deduplication rate, which can also be understood as the disk space release ratio. Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed difference between RAM and HDD disks. This avoids the performance penalty of blocks that are too small, but the oversized blocks lower the deduplication ratio.
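The effect of block size on the deduplication rate can be seen in a small synthetic experiment. The sketch below is an illustration only: the data set (two 1 MB backup generations that differ in eight single bytes), the helper names `dedup_rate` and `two_backups`, and the use of SHA-256 to identify blocks are all assumptions made for the example.

```python
import hashlib
import random


def dedup_rate(data, block_size):
    """Fraction of fixed-size blocks that duplicate an earlier block."""
    seen = set()
    total = 0
    for off in range(0, len(data), block_size):
        seen.add(hashlib.sha256(data[off:off + block_size]).digest())
        total += 1
    return 1 - len(seen) / total


def two_backups(size=1 << 20, edits=8):
    """Return two backup generations back to back; the second one
    differs from the first in `edits` single bytes spread evenly."""
    rng = random.Random(0)  # fixed seed keeps the example repeatable
    first = rng.getrandbits(8 * size).to_bytes(size, "big")
    second = bytearray(first)
    step = size // edits
    for i in range(edits):
        second[i * step + 10] ^= 0xFF  # flip one byte
    return first + bytes(second)


data = two_backups()
rate_4k = dedup_rate(data, 4 * 1024)
rate_64k = dedup_rate(data, 64 * 1024)
```

Each changed byte invalidates the whole block that contains it, so with 64 KB blocks half of the second generation has to be stored again, while with 4 KB blocks only eight small blocks do. In this synthetic case `rate_4k` is about 0.48 and `rate_64k` is 0.25.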
Description of drawings
Fig. 1 is a schematic diagram of the system architecture;
Fig. 2 is a diagram of the data deduplication principle;
Fig. 3 is a flow chart of the data deduplication process.
Embodiment
The method of the invention is explained below with reference to the accompanying drawings.
This method addresses the low deduplication rate and low performance of existing deduplication methods. It splits the data stream into small blocks of 4KB-8KB, uses double verification with a hash value and a data fingerprint to ensure the deduplication rate, and stores the hash index on an SSD to improve retrieval efficiency. The overall architecture is shown in Fig. 1.
Data deduplication is suited to highly repetitive data, for example data with a low change rate, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency (that is, the highest deduplication rate, which can also be understood as the disk space release ratio). The principle of data deduplication is shown in Fig. 2.
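The relationship between the deduplication rate and the space released can be written out explicitly. The following is a sketch under an assumed, commonly used definition, since the source gives no formula: $S_{\text{in}}$ is the logical amount of data written and $S_{\text{out}}$ is the physical amount stored after deduplication. Both symbols are introduced here only for illustration.

```latex
r = 1 - \frac{S_{\text{out}}}{S_{\text{in}}},
\qquad
\frac{S_{\text{in}}}{S_{\text{out}}} = \frac{1}{1 - r}
```

For example, if 10 TB of backups occupy 2 TB after deduplication, then $r = 0.8$, meaning 80% of the disk space is released, which corresponds to a 5:1 reduction.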
Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed difference between RAM and HDD disks. This avoids the performance penalty of blocks that are too small, but the oversized blocks lower the deduplication ratio. We split the data stream into blocks of 4KB-8KB; blocks this small ensure a high deduplication rate and allow duplicate data to be located and removed more precisely. To address the performance impact of the difference in data transfer rate between HDD disks and RAM, we rebuild the storage architecture around an SSD that acts as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and needs to be verified, the hash index stored on the SSD is read into RAM. Because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is maintained, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved. In existing methods, moreover, the deduplication ratio and deduplication accuracy are rarely satisfactory. To address this, we use a double check of the hash value and the data fingerprint to maximize deduplication accuracy. Because the SSD serves as an intermediate tier and sustains I/O speed, the double verification does not degrade system performance.
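A rough calculation shows why smaller blocks enlarge the hash index table and why its placement matters. The sketch below is back-of-the-envelope arithmetic, not a figure from the source: the 32-byte entry size (for instance a 20-byte fingerprint, a 4-byte hash value, and an 8-byte block location) and the function names are assumptions.

```python
KIB = 1 << 10
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40


def index_entries(capacity_bytes, block_size):
    """Number of index entries if every block is unique (worst case)."""
    return -(-capacity_bytes // block_size)  # ceiling division


def index_bytes(capacity_bytes, block_size, entry_bytes=32):
    """Index size in bytes for an assumed per-entry size."""
    return index_entries(capacity_bytes, block_size) * entry_bytes


# Index size per 1 TiB of unique data for several block sizes.
sizes = {kb: index_bytes(TIB, kb * KIB) for kb in (4, 8, 64, 128)}
```

Under these assumptions, 1 TiB of unique data needs an 8 GiB index with 4 KB blocks but only 512 MiB with 64 KB blocks, a 16-fold difference. An index of that size may not fit in RAM, which is the case the SSD tier is meant to handle.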
The data storage flow in this method is roughly as follows: receive the data stream to be stored -> send it to the Dedupe System module -> split the data stream into blocks of the specified size (4KB or 8KB) -> compute the hash value of each data block -> compute the data fingerprint of each data block -> read the hash index table stored on the SSD -> determine whether the hash value already exists in the hash index table -> A. if it exists, read the data fingerprint index table and go on to compare the data fingerprint -> if the data fingerprint also exists, create a data link and do not store the data block -> if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> B. if the hash value does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> send the data blocks stored on the SSD to the RAID array disks for storage, as shown in Fig. 3.
Everything other than the technical features described in this specification is known to those skilled in the art.
Claims (1)
1. A data deduplication method suitable for mass storage systems, characterized in that the data stream is split into blocks of 4KB-8KB, blocks this small ensuring a high deduplication rate and allowing duplicate data to be located and removed more precisely; to address the performance impact of the difference in data transfer rate between HDD disks and RAM, the storage architecture is rebuilt around an SSD that acts as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed; when the hash value of a new data block arrives and needs to be verified, the hash index stored on the SSD is read into RAM; because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is maintained, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved; to address the low deduplication ratio and low deduplication accuracy, a double check of the hash value and the data fingerprint is used to maximize deduplication accuracy; because the SSD serves as an intermediate tier and sustains I/O speed, the double verification does not degrade system performance; the concrete steps are as follows:
1) Receive the data stream to be stored;
2) Send it to the Dedupe System module;
3) Split the data stream into blocks of the specified size (4KB-8KB);
4) Compute the hash value of each data block;
5) Compute the data fingerprint of each data block;
6) Read the hash index table stored on the SSD;
7) Determine whether the hash value already exists in the hash index table;
A. If it exists: (1) read the data fingerprint index table and go on to compare the data fingerprint; (2) if the data fingerprint also exists, create a data link and do not store the data block; (3) if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD;
B. If it does not exist: (1) store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD; (2) send the data blocks stored on the SSD to the RAID array disks for storage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012104528309A CN102982122A (en) | 2012-11-13 | 2012-11-13 | Repeating data deleting method suitable for mass storage system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2012104528309A CN102982122A (en) | 2012-11-13 | 2012-11-13 | Repeating data deleting method suitable for mass storage system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN102982122A (en) | 2013-03-20 |
Family
ID=47856140
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2012104528309A Pending CN102982122A (en) | 2012-11-13 | 2012-11-13 | Repeating data deleting method suitable for mass storage system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102982122A (en) |
-
2012
- 2012-11-13 CN CN2012104528309A patent/CN102982122A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591947A (en) * | 2010-12-28 | 2012-07-18 | 微软公司 | Fast and low-RAM-footprint indexing for data deduplication |
CN102591946A (en) * | 2010-12-28 | 2012-07-18 | 微软公司 | Using index partitioning and reconciliation for data deduplication |
CN102609442A (en) * | 2010-12-28 | 2012-07-25 | 微软公司 | Adaptive Index for Data Deduplication |
CN102156727A (en) * | 2011-04-01 | 2011-08-17 | 华中科技大学 | Method for deleting repeated data by using double-fingerprint hash check |
CN102663086A (en) * | 2012-04-09 | 2012-09-12 | 华中科技大学 | Method for retrieving data block indexes |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105612489B (en) * | 2014-09-15 | 2017-08-29 | 华为技术有限公司 | Data de-duplication method and storage array |
WO2016041127A1 (en) * | 2014-09-15 | 2016-03-24 | 华为技术有限公司 | Data duplication method and storage array |
CN105612489A (en) * | 2014-09-15 | 2016-05-25 | 华为技术有限公司 | Data duplication method and storage array |
US10346245B2 (en) | 2014-12-09 | 2019-07-09 | Tsinghua University | Data storage system and data storage method |
WO2016090541A1 (en) * | 2014-12-09 | 2016-06-16 | 清华大学 | Data storage system and data storage method |
CN107250975B (en) * | 2014-12-09 | 2020-07-10 | 清华大学 | Data storage system and data storage method |
CN107250975A (en) * | 2014-12-09 | 2017-10-13 | 清华大学 | Data-storage system and date storage method |
CN104462388B (en) * | 2014-12-10 | 2017-12-29 | 上海爱数信息技术股份有限公司 | A kind of redundant data method for cleaning based on tandem type storage medium |
CN104462388A (en) * | 2014-12-10 | 2015-03-25 | 上海爱数软件有限公司 | Redundant data cleaning method based on cascade storage media |
CN105824881A (en) * | 2016-03-10 | 2016-08-03 | 中国人民解放军国防科学技术大学 | Repeating data and deleted data placement method and device based on load balancing |
CN105824881B (en) * | 2016-03-10 | 2019-03-29 | 中国人民解放军国防科学技术大学 | A kind of data de-duplication data placement method based on load balancing |
WO2017187334A1 (en) * | 2016-04-29 | 2017-11-02 | International Business Machines Corporation | Data deduplication with reduced hash computations |
CN106610792A (en) * | 2016-07-28 | 2017-05-03 | 四川用联信息技术有限公司 | Repeating data deleting algorithm in cloud storage |
CN106713489A (en) * | 2017-01-17 | 2017-05-24 | 郑州云海信息技术有限公司 | Deduplication based synchronous remote copying system and method |
CN106843760A (en) * | 2017-01-17 | 2017-06-13 | 郑州云海信息技术有限公司 | It is a kind of based on the asynchronous remote copy system deleted and method again |
CN106951192A (en) * | 2017-03-25 | 2017-07-14 | 广州硕点电子科技有限公司 | A kind of date storage method, apparatus and system |
TWI709857B (en) * | 2017-12-08 | 2020-11-11 | 日商東芝記憶體股份有限公司 | Memory system and control method |
CN108427538A (en) * | 2018-03-15 | 2018-08-21 | 深信服科技股份有限公司 | Storage data compression method, device and the readable storage medium storing program for executing of full flash array |
CN108427538B (en) * | 2018-03-15 | 2021-06-04 | 深信服科技股份有限公司 | Storage data compression method and device of full flash memory array and readable storage medium |
CN113010104A (en) * | 2020-01-27 | 2021-06-22 | 慧与发展有限责任合伙企业 | Deduplication system threshold based on amount of wear of storage device |
US11609849B2 (en) | 2020-01-27 | 2023-03-21 | Hewlett Packard Enterprise Development Lp | Deduplication system threshold based on a type of storage device |
CN112783446A (en) * | 2021-01-22 | 2021-05-11 | 苏州浪潮智能科技有限公司 | Data writing method and system of storage system |
CN112783446B (en) * | 2021-01-22 | 2023-01-10 | 苏州浪潮智能科技有限公司 | Data writing method and system of storage system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20130320 |