CN102982122A

CN102982122A - Data deduplication method suitable for mass storage systems

Info

Publication number: CN102982122A
Application number: CN2012104528309A
Authority: CN
Inventors: 梁吉林
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2012-11-13
Filing date: 2012-11-13
Publication date: 2013-03-20

Abstract

The invention provides a data deduplication method suitable for mass storage systems. The data stream is split into small blocks of 4KB-8KB. Double verification by hash value and data fingerprint guarantees the deduplication rate, and the hash index is stored on a solid state disk (SSD) to improve retrieval efficiency. Deduplication is suited to highly repetitive data. It reaches its highest efficiency, that is, the highest deduplication rate, in environments with a low data change rate, full data backups, long-term data retention, and inactive data. The deduplication rate can also be understood as the disk space release ratio. Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between random access memory (RAM) and hard disk drives (HDD): blocks of that size avoid the performance penalty of blocks that are too small, but they are large enough to lower the deduplication rate.

Description

A data deduplication method suitable for mass storage systems

Technical field

The present invention relates to computer systems and mass storage systems, and specifically to a data deduplication method suitable for mass storage systems.

Background technology

With the rapid development of disks in recent years, enterprises and users increasingly use large-capacity disks for backup and archiving. Traditional backup strategies tend to produce a large amount of redundant data on the storage device. This consumes disk space unnecessarily and wastes device resources; the excess redundant data also lengthens disk seek and positioning times, which lowers overall system performance.

Data deduplication addresses these problems. It compares the data to be backed up and, where identical data already exists, replaces the duplicate with a link or pointer to the single stored copy. This saves the storage space the data would otherwise need and, because the data volume shrinks, also shortens disk positioning time. Deduplication curbs rapid data growth, increases effective storage space, improves storage efficiency, and lowers the total cost and management cost of storage. It also saves network bandwidth during data transfer and reduces operation and maintenance costs such as floor space and power supply.

Deduplication also has drawbacks. It requires comparison, which adds computation and verification time. It must store a hash index, and the severe speed imbalance between memory and disk lengthens each check. The block size used by the chunking technique also affects how accurately duplicate data is removed.

This method improves on the defects of existing methods: it offers faster verification, a finer block granularity, and more accurate removal of duplicate data. It improves, to varying degrees, on both criteria by which deduplication is judged: deduplication rate and performance.

Summary of the invention

The purpose of the invention is to provide a data deduplication method suitable for mass storage systems.

The purpose of the invention is achieved as follows. The data stream is split into blocks of 4KB-8KB; blocks this small ensure a high deduplication rate and allow duplicate data to be located and removed more precisely. To address the performance impact of the transfer-rate gap between the HDD and RAM, the storage architecture is rebuilt around an SSD that serves as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and must be verified, the hash index stored on the SSD is read into RAM. Because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is assured, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved. To address low deduplication ratio and low deduplication accuracy, each block is checked twice, by hash value and by data fingerprint, to maximize deduplication accuracy. Because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance. The concrete steps are as follows:

1) Receive the data stream to be stored;

2) Send it to the Dedupe System module;

3) Split the data stream into blocks of the specified 4KB-8KB size;

4) Compute the hash value of each data block;

5) Compute the data fingerprint of each data block;

6) Read the hash index table stored on the SSD;

7) Determine whether the hash value already exists in the hash index table;

A. If it exists: (1) read the data fingerprint index table and go on to check the data fingerprint; (2) if the data fingerprint also exists, create a data link and do not store the data block; (3) if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD;

B. If it does not exist: (1) store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD; (2) send the data block stored on the SSD to the RAID array disks for storage.
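The steps above can be illustrated with a minimal Python sketch. It is not the patented implementation: the text does not name the hash or fingerprint algorithms, so SHA-1 and SHA-256 are assumed here, plain dictionaries stand in for the index tables and block store held on the SSD, and the names `DedupeSystem`, `store`, and `flush_to_raid` are hypothetical.

```python
import hashlib


class DedupeSystem:
    """Toy model of steps 1-7; in-memory containers stand in for the SSD and RAID."""

    def __init__(self, block_size=4 * 1024):
        self.block_size = block_size   # 4KB, lower end of the 4KB-8KB range
        self.hash_index = set()        # hash index table (on the SSD in the text)
        self.fingerprint_index = {}    # fingerprint index table: fingerprint -> block id
        self.ssd_blocks = {}           # blocks staged on the SSD
        self.raid_blocks = {}          # blocks after transfer to the RAID array

    def store(self, stream: bytes) -> list:
        """Steps 1-7: return the block ids (the "data links") describing the stream."""
        links = []
        for off in range(0, len(stream), self.block_size):      # step 3
            block = stream[off:off + self.block_size]
            h = hashlib.sha1(block).hexdigest()                  # step 4 (assumed algorithm)
            fp = hashlib.sha256(block).hexdigest()               # step 5 (assumed algorithm)
            # Steps 6-7: check the hash first, then the fingerprint (branch A).
            if h in self.hash_index and fp in self.fingerprint_index:
                block_id = self.fingerprint_index[fp]            # duplicate: link only
            else:
                # Branch A(3) or branch B: store the block and both index entries.
                block_id = len(self.fingerprint_index)
                self.ssd_blocks[block_id] = block
                self.hash_index.add(h)
                self.fingerprint_index[fp] = block_id
            links.append(block_id)
        return links

    def flush_to_raid(self):
        """Move blocks staged on the SSD to the RAID array."""
        self.raid_blocks.update(self.ssd_blocks)
        self.ssd_blocks.clear()


dedupe = DedupeSystem()
stream = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
links = dedupe.store(stream)
dedupe.flush_to_raid()
print(links, len(dedupe.raid_blocks))
```

In this sketch the third block repeats the first, so only two blocks reach the RAID stand-in and the stream is described by three links. Flushing every newly stored block to RAID, including blocks stored under branch A(3), is a simplifying assumption; the text lists the transfer only under branch B.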

The beneficial effects of the invention are as follows. The data stream is split into small blocks of 4KB-8KB, double verification by hash value and data fingerprint guarantees the deduplication rate, and storing the hash index on an SSD improves retrieval efficiency. Deduplication is suited to highly repetitive data, for example data with a low change rate, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency, that is, the highest deduplication rate, which can also be understood as the disk space release ratio. Traditional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between RAM and the HDD: this avoids the performance penalty of blocks that are too small, but the blocks are large enough to reduce the deduplication ratio.
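The effect of block size on the deduplication rate can be seen in a small experiment. The sketch below is illustrative only: the workload (a 1 MiB backup followed by a second copy with one byte changed in every 64KB region) is a hypothetical worst case for large blocks, SHA-256 is an assumed stand-in for the block hash, and `dedupe_rate` is a name introduced here.

```python
import hashlib
import random


def dedupe_rate(data: bytes, block_size: int) -> float:
    """Fraction of fixed-size blocks that repeat an earlier block."""
    seen = set()
    total = 0
    for off in range(0, len(data), block_size):
        seen.add(hashlib.sha256(data[off:off + block_size]).digest())
        total += 1
    return 1 - len(seen) / total


KB = 1024
size = 1024 * KB
rng = random.Random(0)  # fixed seed so the run is repeatable
first = rng.getrandbits(8 * size).to_bytes(size, "little")

# Second "backup": identical except for one flipped byte per 64KB region.
second = bytearray(first)
for off in range(0, size, 64 * KB):
    second[off] ^= 0xFF
data = first + bytes(second)

rate_64k = dedupe_rate(data, 64 * KB)
rate_4k = dedupe_rate(data, 4 * KB)
print(f"64KB blocks: {rate_64k:.3f}")
print(f"4KB blocks:  {rate_4k:.3f}")
```

With 64KB blocks every block of the second copy contains a changed byte, so nothing is deduplicated. With 4KB blocks only 16 of the 256 blocks in the second copy change, so 240 of the 512 blocks in total are recognized as duplicates. Real workloads fall between these extremes.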

Description of drawings

Fig. 1 is a schematic diagram of the system architecture;

Fig. 2 is a schematic diagram of the data deduplication principle;

Fig. 3 is a flowchart of the data deduplication process.

Embodiment

The method of the invention is explained below with reference to the accompanying drawings.

This method addresses the low deduplication rate and low performance of existing deduplication methods. It splits the data stream into small blocks of 4KB-8KB, uses double verification by hash value and data fingerprint to guarantee the deduplication rate, and stores the hash index on an SSD to improve retrieval efficiency. The overall architecture is shown in Fig. 1.

Deduplication is suited to highly repetitive data, for example data with a low change rate, full data backups, long-term data retention, and inactive data. In these environments it reaches its highest efficiency (that is, the highest deduplication rate, which can also be understood as the disk space release ratio). The principle of deduplication is shown in Fig. 2.
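One way to read "disk space release ratio" is the share of logical data that never has to be written to disk. The text does not give a formula, so the definition below is an assumption, and the backup figures in the example are hypothetical.

```python
def space_release_ratio(logical_bytes: float, stored_bytes: float) -> float:
    """Share of the logical data volume that deduplication avoids storing.

    Assumed definition: 1 - stored / logical.
    """
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 1 - stored_bytes / logical_bytes


# Hypothetical case: ten full backups of a 100 GB data set, where each
# backup after the first differs from its predecessor by 2 GB.
logical_gb = 10 * 100
stored_gb = 100 + 9 * 2
ratio = space_release_ratio(logical_gb, stored_gb)
print(f"released: {ratio:.1%}")
```

Under these assumptions 118 GB is stored for 1000 GB of logical data, which is why full backups of slowly changing data are named as the most favorable case.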

Conventional fixed-block deduplication generally splits the data stream into blocks of 64KB-128KB because of the speed gap between RAM and the HDD: this avoids the performance penalty of blocks that are too small, but the blocks are large enough to reduce the deduplication ratio. We split the data stream into blocks of 4KB-8KB; blocks this small ensure a high deduplication rate and allow duplicate data to be located and removed more precisely. To address the performance impact of the transfer-rate gap between the HDD and RAM, we rebuild the storage architecture around an SSD that serves as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed. When the hash value of a new data block arrives and must be verified, the hash index stored on the SSD is read into RAM. Because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is assured, and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved. In existing methods, moreover, the deduplication ratio and deduplication accuracy are rarely satisfactory. To address this, we check each block twice, by hash value and by data fingerprint, to maximize deduplication accuracy. Because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance.
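A rough calculation shows why smaller blocks produce a relatively large hash index table. The entry layout below (a 20-byte hash value, a 32-byte data fingerprint, and an 8-byte block location) is an assumption made for illustration; the text does not specify the size of an index entry.

```python
# Assumed entry layout (not from the source): 20-byte hash value,
# 32-byte data fingerprint, 8-byte block location.
ENTRY_BYTES = 20 + 32 + 8


def index_size_bytes(unique_bytes: int, block_size: int,
                     entry_bytes: int = ENTRY_BYTES) -> int:
    """Index size when every block of `unique_bytes` needs its own entry."""
    entries = (unique_bytes + block_size - 1) // block_size  # ceiling division
    return entries * entry_bytes


TIB = 2 ** 40
GIB = 2 ** 30
for block_size in (4 * 1024, 8 * 1024, 64 * 1024, 128 * 1024):
    size = index_size_bytes(TIB, block_size)
    print(f"{block_size // 1024:>3}KB blocks: {size / GIB:.2f} GiB of index per TiB")
```

Whatever the real entry size, the number of entries scales inversely with block size, so a 4KB block size needs 16 times as many entries as a 64KB block size. An index of that size is harder to keep in RAM, which is consistent with the text's choice to hold it on an SSD.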

The data storage flow of the method is roughly as follows: receive the data stream to be stored -> send it to the Dedupe System module -> split the data stream into blocks of the specified size (4KB or 8KB) -> compute the hash value of each data block -> compute the data fingerprint of each data block -> read the hash index table stored on the SSD -> determine whether the hash value already exists in the hash index table -> A. if it exists, read the data fingerprint index table and go on to check the data fingerprint -> if the data fingerprint also exists, create a data link and do not store the data block -> if the data fingerprint does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> B. if it does not exist, store the data block on the SSD, store its hash value in the hash index table on the SSD, and store its data fingerprint in the data fingerprint index table on the SSD -> send the data block stored on the SSD to the RAID array disks for storage, as shown in Fig. 3.

Except for the technical features described in the specification, everything else is known to those skilled in the art.

Claims

1. A data deduplication method suitable for mass storage systems, characterized in that the data stream is split into blocks of 4KB-8KB, blocks this small ensuring a high deduplication rate and allowing duplicate data to be located and removed more precisely; to address the performance impact of the transfer-rate gap between the HDD and RAM, the storage architecture is rebuilt around an SSD that serves as a staging tier between RAM and the RAID array, which improves both retrieval speed and data storage speed; when the hash value of a new data block arrives and must be verified, the hash index stored on the SSD is read into RAM, and because the SSD offers an actual data transfer rate 3 times higher than that of an HDD, system performance is assured and the performance problem caused by the relatively large hash index table that 4KB-8KB blocks produce is also resolved; to address low deduplication ratio and low deduplication accuracy, each block is checked twice, by hash value and by data fingerprint, to maximize deduplication accuracy, and because the SSD serves as the intermediate tier and sustains I/O speed, the double verification does not degrade system performance; the concrete steps are as follows:

1) Receive the data stream to be stored;

2) Send it to the Dedupe System module;

3) Split the data stream into blocks of the specified 4KB-8KB size;

4) Compute the hash value of each data block;

5) Compute the data fingerprint of each data block;

6) Read the hash index table stored on the SSD;

7) Determine whether the hash value already exists in the hash index table;