CN102982122A - Repeating data deleting method suitable for mass storage system - Google Patents


Info

Publication number
CN102982122A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012104528309A
Other languages
Chinese (zh)
Inventor
梁吉林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN2012104528309A priority Critical patent/CN102982122A/en
Publication of CN102982122A publication Critical patent/CN102982122A/en
Pending legal-status Critical Current

Abstract

The invention provides a data deduplication method suitable for mass storage systems. Small blocks of 4KB-8KB are used to split the data stream. Double verification by hash value and data fingerprint guarantees the deduplication rate, and the hash index is stored on a solid state disk (SSD) to improve retrieval efficiency. Deduplication suits highly redundant data: the highest efficiency, that is, the highest deduplication rate, is reached with low data change rates, full data backups, long-term data retention, and inactive data. The deduplication rate can also be understood as the disk space release ratio. Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between random access memory (RAM) and hard disk drive (HDD) disks. Blocks of that size avoid the performance penalty of very small blocks, but they are large enough to lower the deduplication rate.

Description

A data deduplication method suitable for mass storage systems
Technical field
The present invention relates to computer systems and mass storage systems, and specifically to a data deduplication method suitable for mass storage systems.
Background
With the rapid development of magnetic disks in recent years, enterprises and users increasingly use high-capacity disks for backup and archiving. Traditional backup policies tend to produce large amounts of redundant data on the storage device, which consumes disk space unnecessarily and wastes device resources. The redundant data also lengthens disk seek and positioning times, which lowers overall system performance.
Data deduplication addresses these problems. It compares the data to be backed up and, where identical data already exists, replaces the copy with a link or pointer to the existing data. This saves the storage space the data would otherwise need and, because the data volume is smaller, also shortens disk positioning time. Deduplication curbs data growth, increases effective storage capacity, improves storage efficiency, lowers total storage and management costs, saves network bandwidth for data transmission, and reduces operating costs such as floor space and power.
Deduplication also has drawbacks. The comparison adds computation and verification time. The hash index must be stored, and the severe speed imbalance between memory and disk makes each check slower. The block size used by the chunking technique also affects how accurately duplicate data is removed.
This method addresses the shortcomings of existing methods: it offers faster verification, a finer block granularity, and more accurate removal of duplicate data. It improves, to varying degrees, on both measures of deduplication: deduplication rate and performance.
Summary of the invention
The purpose of the invention is to provide a data deduplication method suitable for mass storage systems.
The purpose of the invention is achieved as follows. The data stream is split into blocks of 4KB-8KB; such small blocks ensure a high deduplication rate and allow duplicate data to be located and removed more precisely.
To counter the performance impact of the difference in data transfer rate between HDD disks and RAM, the storage architecture is rebuilt around an SSD that acts as a staging tier between RAM and the RAID array. This improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and needs to be verified, RAM reads the hash index stored on the SSD. Because the SSD's actual data transfer rate is three times higher than that of an HDD, system performance is preserved, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved.
To counter a low deduplication ratio and low deduplication accuracy, a double check of hash value and data fingerprint is used to maximize deduplication accuracy. Because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance. The specific steps are as follows:
1) Receive the data stream to be stored;
2) Send it to the Dedupe System module;
3) Split the data stream into blocks of the specified size (4KB-8KB);
4) Compute the hash value of each data block;
5) Compute the data fingerprint of each data block;
6) Read the hash index table stored on the SSD;
7) Determine whether the hash value already exists in the hash index table;
A. If it exists: (1) read the data fingerprint index table and go on to check the data fingerprint; (2) if the data fingerprint also exists, create a data link and do not store the data block; (3) if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD;
B. If it does not exist: (1) store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD; (2) send the data blocks stored on the SSD to the RAID array disks for storage.
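The steps above can be sketched in Python as a toy in-memory model. This is only an illustration under stated assumptions, not the patented implementation: the patent does not name the hash or fingerprint algorithms, so CRC32 and SHA-256 are stand-ins, and the class `DedupStore`, its dictionaries (standing in for the SSD-resident tables and the RAID array), and the `destage` method are hypothetical names. The sketch also destages every staged block, whereas the text mentions the transfer to RAID only under branch B.

```python
import hashlib
import zlib

BLOCK_SIZE = 4 * 1024  # the text specifies 4KB-8KB blocks; 4 KiB is an arbitrary pick


class DedupStore:
    """Toy model: dicts stand in for the SSD tables and the RAID array."""

    def __init__(self, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.hash_index = {}         # hash value -> set of fingerprints ("on SSD")
        self.fingerprint_index = {}  # fingerprint -> block id ("on SSD")
        self.ssd_staging = {}        # block id -> bytes waiting to be destaged
        self.raid = {}               # block id -> bytes (final storage)

    def write(self, stream: bytes):
        """Steps 1-7: return one block id (the "data link") per block."""
        links = []
        for off in range(0, len(stream), self.block_size):
            block = stream[off:off + self.block_size]        # step 3
            h = zlib.crc32(block)                            # step 4 (assumed hash)
            fp = hashlib.sha256(block).hexdigest()           # step 5 (assumed fingerprint)
            candidates = self.hash_index.get(h)              # steps 6-7
            if candidates is not None and fp in candidates:  # A(2): duplicate block
                links.append(self.fingerprint_index[fp])
                continue
            # A(3) or B(1): new block, so store it and update both index tables
            block_id = len(self.fingerprint_index)
            self.ssd_staging[block_id] = block
            self.hash_index.setdefault(h, set()).add(fp)
            self.fingerprint_index[fp] = block_id
            links.append(block_id)
        return links

    def destage(self):
        """B(2): move staged blocks from the SSD tier to the RAID array."""
        self.raid.update(self.ssd_staging)
        self.ssd_staging.clear()
```

A caller would invoke `write` for each incoming stream and `destage` afterwards. The cheap hash acts as a first filter, and the fingerprint comparison only runs when the hash already appears in the index, which mirrors the two-stage check in step 7.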
The beneficial effects of the invention are as follows. Small blocks of 4KB-8KB are used to split the data stream, double verification by hash value and data fingerprint guarantees the deduplication rate, and the hash index is stored on an SSD to improve retrieval efficiency. Deduplication suits highly redundant data, for example low data change rates, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency, that is, the highest deduplication rate, which can also be understood as the disk space release ratio. Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between RAM and HDD disks. Blocks of that size avoid the performance penalty of very small blocks, but they are large enough to lower the deduplication ratio.
Description of drawings
Fig. 1 is a schematic diagram of the system architecture;
Fig. 2 is a schematic diagram of the deduplication principle;
Fig. 3 is a flowchart of the deduplication process.
Embodiment
The method of the invention is explained below with reference to the accompanying drawings.
This method addresses the low deduplication rate and low performance of existing deduplication methods. It splits the data stream into small blocks of 4KB-8KB, uses double verification by hash value and data fingerprint to guarantee the deduplication rate, and stores the hash index on an SSD to improve retrieval efficiency. The overall architecture is shown in Fig. 1.
Deduplication suits highly redundant data, for example low data change rates, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency (that is, the highest deduplication rate, which can also be understood as the disk space release ratio). The principle of deduplication is shown in Fig. 2.
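To make the disk space release ratio concrete, the following Python sketch measures it for two nearly identical "full backups" at two block sizes. It is an illustration only: the function `space_release_ratio` is a hypothetical helper, SHA-256 stands in for whatever fingerprint a real system would use, and the ratio is counted in blocks, which equals the byte ratio here because all blocks have the same size.

```python
import hashlib


def space_release_ratio(data: bytes, block_size: int) -> float:
    """Fraction of fixed-size blocks that do not need to be stored again."""
    total = 0
    unique = set()
    for off in range(0, len(data), block_size):
        unique.add(hashlib.sha256(data[off:off + block_size]).digest())
        total += 1
    return 1 - len(unique) / total if total else 0.0


# A 64 KiB "file" made of 16 distinct 4 KiB regions, backed up twice.
# The second backup differs from the first in a single byte.
first = b"".join(bytes([i]) * 4096 for i in range(16))
second = bytearray(first)
second[100] ^= 0xFF
backups = first + bytes(second)

small = space_release_ratio(backups, 4 * 1024)   # 4 KB blocks
large = space_release_ratio(backups, 64 * 1024)  # 64 KB blocks
print(f"4 KB blocks: {small:.1%} released, 64 KB blocks: {large:.1%} released")
```

With 4 KB blocks, 15 of the 32 blocks are duplicates and can be released. With 64 KB blocks the single changed byte makes the two backups look completely different, so nothing is released.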
Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between RAM and HDD disks. Blocks of that size avoid the performance penalty of very small blocks, but they are large enough to lower the deduplication ratio. We split the data stream into blocks of 4KB-8KB; such small blocks ensure a high deduplication rate and allow duplicate data to be located and removed more precisely. To counter the performance impact of the difference in data transfer rate between HDD disks and RAM, we rebuild the storage architecture around an SSD that acts as a staging tier between RAM and the RAID array. This improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and needs to be verified, RAM reads the hash index stored on the SSD. Because the SSD's actual data transfer rate is three times higher than that of an HDD, system performance is preserved, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved. In existing methods, moreover, the deduplication ratio and deduplication accuracy are rarely satisfactory. To address this, we use a double check of hash value and data fingerprint to maximize deduplication accuracy. Because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance.
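The link between block size and index size can be checked with simple arithmetic. The Python sketch below is a rough estimate under assumptions that are not in the patent: each index entry is taken to be 40 bytes (for instance a 32-byte digest plus an 8-byte block location), every block is assumed unique (the worst case), and `index_entries` and `index_bytes` are hypothetical helper names.

```python
def index_entries(capacity_bytes: int, block_size: int) -> int:
    """Number of index entries if every block is unique (worst case)."""
    return -(-capacity_bytes // block_size)  # ceiling division


def index_bytes(capacity_bytes: int, block_size: int, entry_bytes: int = 40) -> int:
    """Index size in bytes; entry_bytes=40 is an assumed entry layout."""
    return index_entries(capacity_bytes, block_size) * entry_bytes


TIB = 1 << 40
for kb in (4, 8, 64, 128):
    size = index_bytes(TIB, kb * 1024)
    print(f"{kb:>3} KB blocks: {size / (1 << 30):.2f} GiB of index per TiB stored")
```

Under these assumptions, 4 KB blocks need about 10 GiB of index per TiB of unique data, sixteen times more than 64 KB blocks. An index of that size is unlikely to stay in RAM on a large system, which is consistent with the patent's choice to keep the index tables on an SSD rather than on HDD.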
The data storage flow of the method is roughly as follows: receive the data stream to be stored -> send it to the Dedupe System module -> split the data stream into blocks of the specified size (4KB or 8KB) -> compute the hash value of each data block -> compute the data fingerprint of each data block -> read the hash index table stored on the SSD -> determine whether the hash value already exists in the hash index table -> A. if it exists, read the data fingerprint index table and go on to check the data fingerprint -> if the data fingerprint also exists, create a data link and do not store the data block -> if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> B. if it does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> send the data blocks stored on the SSD to the RAID array disks for storage, as shown in Fig. 3.
Apart from the technical features described in this specification, everything else is known to those skilled in the art.

Claims (1)

1. A data deduplication method suitable for mass storage systems, characterized in that the data stream is split into blocks of 4KB-8KB, such small blocks ensuring a high deduplication rate and allowing duplicate data to be located and removed more precisely; that, to counter the performance impact of the difference in data transfer rate between HDD disks and RAM, the storage architecture is rebuilt around an SSD that acts as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed; that, when the hash value of a new data block arrives and needs to be verified, RAM reads the hash index stored on the SSD, and because the SSD's actual data transfer rate is three times higher than that of an HDD, system performance is preserved and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved; and that, to counter a low deduplication ratio and low deduplication accuracy, a double check of hash value and data fingerprint is used to maximize deduplication accuracy, and because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance; the specific steps being as follows:
1) Receive the data stream to be stored;
2) Send it to the Dedupe System module;
3) Split the data stream into blocks of the specified size (4KB-8KB);
4) Compute the hash value of each data block;
5) Compute the data fingerprint of each data block;
6) Read the hash index table stored on the SSD;
7) Determine whether the hash value already exists in the hash index table;
A. If it exists: (1) read the data fingerprint index table and go on to check the data fingerprint; (2) if the data fingerprint also exists, create a data link and do not store the data block; (3) if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD;
B. If it does not exist: (1) store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD; (2) send the data blocks stored on the SSD to the RAID array disks for storage.
CN2012104528309A 2012-11-13 2012-11-13 Repeating data deleting method suitable for mass storage system Pending CN102982122A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012104528309A CN102982122A (en) 2012-11-13 2012-11-13 Repeating data deleting method suitable for mass storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012104528309A CN102982122A (en) 2012-11-13 2012-11-13 Repeating data deleting method suitable for mass storage system

Publications (1)

Publication Number Publication Date
CN102982122A (en) 2013-03-20

Family

ID=47856140

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012104528309A Pending CN102982122A (en) 2012-11-13 2012-11-13 Repeating data deleting method suitable for mass storage system

Country Status (1)

Country Link
CN (1) CN102982122A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462388A (en) * 2014-12-10 2015-03-25 上海爱数软件有限公司 Redundant data cleaning method based on cascade storage media
WO2016041127A1 (en) * 2014-09-15 2016-03-24 华为技术有限公司 Data duplication method and storage array
WO2016090541A1 (en) * 2014-12-09 2016-06-16 清华大学 Data storage system and data storage method
CN105824881A (en) * 2016-03-10 2016-08-03 中国人民解放军国防科学技术大学 Repeating data and deleted data placement method and device based on load balancing
CN106610792A (en) * 2016-07-28 2017-05-03 四川用联信息技术有限公司 Repeating data deleting algorithm in cloud storage
CN106713489A (en) * 2017-01-17 2017-05-24 郑州云海信息技术有限公司 Deduplication based synchronous remote copying system and method
CN106843760A (en) * 2017-01-17 2017-06-13 郑州云海信息技术有限公司 It is a kind of based on the asynchronous remote copy system deleted and method again
CN106951192A (en) * 2017-03-25 2017-07-14 广州硕点电子科技有限公司 A kind of date storage method, apparatus and system
WO2017187334A1 (en) * 2016-04-29 2017-11-02 International Business Machines Corporation Data deduplication with reduced hash computations
CN108427538A (en) * 2018-03-15 2018-08-21 深信服科技股份有限公司 Storage data compression method, device and the readable storage medium storing program for executing of full flash array
TWI709857B (en) * 2017-12-08 2020-11-11 日商東芝記憶體股份有限公司 Memory system and control method
CN112783446A (en) * 2021-01-22 2021-05-11 苏州浪潮智能科技有限公司 Data writing method and system of storage system
CN113010104A (en) * 2020-01-27 2021-06-22 慧与发展有限责任合伙企业 Deduplication system threshold based on amount of wear of storage device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102156727A (en) * 2011-04-01 2011-08-17 华中科技大学 Method for deleting repeated data by using double-fingerprint hash check
CN102591947A (en) * 2010-12-28 2012-07-18 微软公司 Fast and low-RAM-footprint indexing for data deduplication
CN102591946A (en) * 2010-12-28 2012-07-18 微软公司 Using index partitioning and reconciliation for data deduplication
CN102609442A (en) * 2010-12-28 2012-07-25 微软公司 Adaptive Index for Data Deduplication
CN102663086A (en) * 2012-04-09 2012-09-12 华中科技大学 Method for retrieving data block indexes


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130320