CN102999605A - Method and device for optimizing data placement to reduce data fragments

Info

Publication number: CN102999605A
Application number: CN2012104746888A
Authority: CN
Inventors: 谭玉娟; 沙行勉; 晏志超; 诸葛晴凤; 刘铎
Original assignee: Chongqing University
Current assignee: Chongqing University
Priority date: 2012-11-21
Filing date: 2012-11-21
Publication date: 2013-03-27

Abstract

The invention relates to a method and a device for reducing data fragmentation by optimizing data placement. The method comprises the following steps: chunking each file to be backed up and computing a fingerprint for each data block to be backed up; organizing a plurality of consecutive data blocks to be backed up into a data segment to be backed up; for each data block in the data segment to be backed up, searching the system for a backed-up data segment that already contains an identical data block, and if none exists, treating the block as a non-duplicate data block and proceeding to the data read/write step, or if one exists, treating the block as a duplicate data block and proceeding to the next step; calculating and quantifying the data redundancy locality between the data segment to be backed up and the backed-up data segment, and if the value is smaller than a preset threshold, proceeding to the data read/write step, otherwise proceeding to the next step; and deleting from the data segment to be backed up the duplicate data blocks that it shares with the backed-up data segment. The method reduces out-of-order data placement and data fragments, slows the worsening of fragmentation at the cost of a small loss in data compression ratio, and improves the read/write performance of the system.

Description

A method and device for reducing data fragmentation by optimizing data placement

Technical field

The invention belongs to the field of computer information storage technology, and specifically relates to a method and a device for reducing data fragmentation by optimizing data placement.

Background technology

Data deduplication is an advanced lossless data compression technique, used mainly to save the storage space required in information storage and backup systems. Its basic principle is to cut each file into a sequence of consecutive data blocks and to delete the duplicate data blocks that occur within a single file or across multiple files, thereby reducing the storage space occupied by the data. Most existing information storage and backup systems use this technique to optimize storage space and to lower data storage and management costs.

In an information storage and backup system that uses data deduplication (referred to below as a data deduplication system), there are two main classes of data blocks. One class is new data blocks that need to be written to disk; the other is duplicate data blocks that need to be eliminated. New data blocks are written to disk sequentially, whereas duplicate data blocks that are to be eliminated are not stored again. Consequently, for any file to be backed up, the new data blocks and the duplicate data blocks it contains cannot be stored together, and the storage locations of the duplicate data blocks are determined by the earlier backup files that first wrote them. This mechanism of eliminating duplicate data blocks across files breaks the rule, followed by earlier backup systems, that all data blocks of a backup file are stored together sequentially. As a result, the data blocks of one backup file can be left in many different locations, producing many data fragments.
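The effect described above can be illustrated with a toy model. The following is a minimal sketch, assuming a disk modeled as consecutive integer addresses and files modeled as lists of block labels; the names `backup` and `count_fragments` and the two sample files are hypothetical and do not come from the patent.

```python
def count_fragments(addresses):
    """Count runs of consecutive disk addresses in the order a file is read back."""
    if not addresses:
        return 0
    fragments = 1
    for prev, cur in zip(addresses, addresses[1:]):
        if cur != prev + 1:      # a jump on disk starts a new fragment
            fragments += 1
    return fragments


def backup(files):
    """Deduplicate block by block; return the disk addresses of each file's blocks."""
    disk = {}                    # block content -> address of its single stored copy
    layouts = []
    for blocks in files:
        layout = []
        for block in blocks:
            if block not in disk:
                disk[block] = len(disk)   # new block: append at the next free address
            layout.append(disk[block])    # duplicate block: reuse the old address
        layouts.append(layout)
    return layouts


# Hypothetical files: file_b shares blocks "a2" and "a4" with file_a.
file_a = ["a1", "a2", "a3", "a4"]
file_b = ["b1", "a2", "b2", "a4", "b3"]

layouts = backup([file_a, file_b])
fragments = [count_fragments(layout) for layout in layouts]
```

In this toy run the first file occupies one contiguous run, while the second file, whose duplicate blocks point back into the first file's area, is split into five fragments.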

The data deduplication methods of existing information storage and backup systems focus mainly on improving the data compression ratio and the deduplication throughput. They do not take into account that deleting duplicate data blocks causes data blocks to be placed out of order and introduces many data fragments, and that these fragments can seriously degrade data read/write performance and thus the performance of the information storage and backup system.

Summary of the invention

The technical problem to be solved by the present invention is to reduce out-of-order data placement and data fragments, to slow the worsening of data fragmentation at the cost of a small loss in data compression ratio, and to improve the read/write performance of the system.

To solve the above technical problem, the present invention provides a method for reducing data fragmentation by optimizing data placement, comprising the following steps:

Step 1: chunk each file to be backed up, and compute a data block fingerprint for each data block to be backed up;

Step 2: organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;

Step 3: for each data block to be backed up in the data segment to be backed up, search the system for a backed-up data segment that has already backed up an identical data block; if there is none, the block is a non-duplicate data block and the method proceeds to step 6; if there is one, the block is a duplicate data block and the method proceeds to step 4;

Step 4: calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment, and quantify it; if the value of the data redundancy locality is less than a predetermined threshold, proceed to step 6; otherwise proceed to step 5;

Step 5: delete from the data segment to be backed up the duplicate data blocks shared by the data segment to be backed up and the backed-up data segment;

Step 6: write the data blocks to disk sequentially.

The present invention also provides a device for reducing data fragmentation by optimizing data placement, comprising:

a data chunking and fingerprint computation unit, configured to chunk each file to be backed up that is transmitted to the storage server, to obtain data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;

a data segment organization unit, configured to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;

a duplicate data block query unit, configured to search the backed-up data segments for data blocks identical to those in the data segment to be backed up; if there is none, the block is a non-duplicate data block and is passed to the data read/write unit; if there is one, the block is a duplicate data block and is passed to the duplicate data block screening unit;

a duplicate data block screening unit, configured to calculate the data redundancy locality between the backed-up data segment in which these duplicate data blocks reside and the data segment to be backed up, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit; otherwise they are passed to the data deletion unit;

a data deletion unit, configured to delete the duplicate data blocks confirmed by the duplicate data block screening unit;

a data read/write unit, configured to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.

During the search for and deletion of duplicate data blocks, the present invention retains those duplicate data blocks whose redundancy locality is below the predetermined threshold and stores them sequentially together with the non-duplicate data blocks, so the present invention can reduce the number of data fragments generated.

Compared with existing data deduplication methods, the present invention has the following advantages:

1. By retaining some of the duplicate data blocks and storing them sequentially together with the non-duplicate data blocks, the number of data fragments produced can be reduced;

2. By gathering together more of the data blocks that belong to the same file and reducing the number of data fragments, the redundancy locality of the data can be greatly strengthened;

3. The improvement in data redundancy locality improves not only deduplication throughput and data write performance but also data read performance;

4. By expressing the data redundancy locality quantitatively and setting a threshold on it to control the amount of duplicate data retained, a large number of data fragments can be eliminated at the cost of only a small loss in data compression ratio, yielding better data read/write performance. If the data redundancy locality threshold is set larger, more duplicate data is retained and more compression ratio is sacrificed; conversely, if the threshold is smaller, less duplicate data is retained and less compression ratio is sacrificed.
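The relationship between the threshold and the amount of duplicate data retained can be checked with a small calculation. This is an illustrative sketch only: the function name `retained_percent` and the sample locality figures are assumptions, and localities are expressed as whole percentages of the incoming segment to keep the arithmetic exact.

```python
def retained_percent(locality_percents, threshold_percent):
    """Sum the shares of duplicate data that stay in the segment.

    Each entry is the locality (in percent of the incoming segment) between
    that segment and one backed-up segment. Duplicates are kept only when
    their locality is below the threshold.
    """
    return sum(p for p in locality_percents if p < threshold_percent)


# Hypothetical segment: its duplicate blocks come from four backed-up segments,
# which share 5%, 10%, 30% and 40% of the incoming segment's data respectively.
localities = [5, 10, 30, 40]

for threshold in (0, 20, 50):
    print(threshold, retained_percent(localities, threshold))
```

With these assumed figures, a threshold of 0% retains no duplicates, 20% retains 15% of the segment as duplicate data, and 50% retains 85%, which matches the stated trend that a larger threshold keeps more duplicate data and gives up more compression.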

In summary, by retaining a small portion of the duplicate data blocks at the cost of a small loss in data compression ratio, the present invention can store more of the data blocks belonging to the same file together sequentially, greatly reducing the number of data fragments generated, strengthening data redundancy locality, and improving data read/write performance.

Description of drawings

The drawings of the present invention are described as follows:

Fig. 1 is a flow chart of the method of the present invention for reducing data fragmentation by optimizing data placement;

Fig. 2 is a structural schematic diagram of the device of the present invention for reducing data fragmentation by optimizing data placement.

Embodiment

The invention is further described below with reference to the drawings and an embodiment:

The entities involved in the present invention are a backup server and a storage server. The backup server provides the data that needs to be backed up, and the storage server stores that data. The search for and deletion of duplicate data are carried out on the storage server.

Fig. 1 is a flow chart of the method of the present invention for reducing data fragmentation by optimizing data placement. The flow starts at S101.

In step S102, each file to be backed up is chunked, for example with a variable-length chunking algorithm, to obtain data blocks to be backed up whose average block size is a fixed value, such as data blocks of 8 KB. A data block fingerprint is then computed for each data block to be backed up; for example, the SHA-1 hash algorithm can be used to compute the hash value of each data block, and the resulting hash value is called the data block fingerprint. A data block fingerprint can be used to uniquely represent a data block, and any two data blocks with the same fingerprint are considered identical.
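Step S102 might be sketched as follows. The patent names only "a variable-length chunking algorithm" and SHA-1; the gear-style rolling hash used here to find block boundaries is a swapped-in illustrative choice, and the minimum and maximum block sizes, the `GEAR` table, and the function names `chunk` and `fingerprint` are all assumptions. The sketch targets blocks on the order of 8 KB rather than an exact average.

```python
import hashlib

AVG_SIZE = 8 * 1024          # target block size named in the text (8 KB)
MIN_SIZE = AVG_SIZE // 4     # assumed lower bound, not from the source
MAX_SIZE = AVG_SIZE * 4      # assumed upper bound, not from the source
MASK = AVG_SIZE - 1          # cut when the low 13 bits of the rolling hash are zero

# Deterministic pseudo-random value per byte for the rolling hash (illustrative).
GEAR = [int.from_bytes(hashlib.sha1(bytes([i])).digest()[:8], "big")
        for i in range(256)]


def chunk(data: bytes):
    """Split data into variable-length blocks at content-defined boundaries."""
    blocks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_SIZE and (h & MASK) == 0) or length >= MAX_SIZE:
            blocks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        blocks.append(data[start:])   # trailing partial block
    return blocks


def fingerprint(block: bytes) -> str:
    """SHA-1 hash of a block, used as its data block fingerprint."""
    return hashlib.sha1(block).hexdigest()


# Illustrative input: 102,400 bytes of repeating content.
sample = bytes(range(256)) * 400
fingerprints = [fingerprint(block) for block in chunk(sample)]
```

Because boundaries depend on content rather than on byte offsets, inserting a few bytes near the start of a file tends to shift only nearby boundaries, so later blocks keep their fingerprints and can still be recognized as duplicates.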

In step S103, a plurality of consecutive data blocks to be backed up are organized into a data segment to be backed up; for example, each data segment contains 256 data blocks.
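Grouping blocks into segments, as in step S103, can be sketched in a few lines. The function name `make_segments` is hypothetical, and the handling of a short final segment is an assumption, since the text does not say what happens when the block count is not a multiple of 256.

```python
SEGMENT_BLOCKS = 256   # blocks per segment, the example figure from the text


def make_segments(blocks, blocks_per_segment=SEGMENT_BLOCKS):
    """Group consecutive blocks into segments, preserving their order.

    The last segment may hold fewer blocks (assumed behavior).
    """
    return [blocks[i:i + blocks_per_segment]
            for i in range(0, len(blocks), blocks_per_segment)]


# Illustrative use: 600 block identifiers yield segments of 256, 256 and 88.
segments = make_segments(list(range(600)))
segment_lengths = [len(s) for s in segments]
```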

In step S104, the backed-up data segments are searched for data blocks identical to those in the data segment to be backed up; such identical data blocks are duplicate data blocks. If duplicate data blocks exist, the flow proceeds to step S105; if not, it proceeds to step S107.

In step S105, for each duplicate data block, the amount of duplicate data shared between the backed-up data segment in which it resides and the data segment to be backed up is counted, and this amount is divided by the size of the data segment to be backed up. The resulting value is the quantified data redundancy locality. If this value is less than the predetermined threshold, the flow proceeds to step S107; otherwise, if it is greater than the predetermined threshold, the flow proceeds to step S106. The threshold is a predetermined data redundancy locality threshold, through which the amount of duplicate data retained can be controlled. If the threshold is larger, more duplicate data is retained, more compression ratio is sacrificed, and the data redundancy locality maintained is stronger; conversely, if the threshold is smaller, less duplicate data is retained, less compression ratio is sacrificed, and the data redundancy locality maintained is weaker. The threshold thus strikes a balance between the compression ratio sacrificed and the data redundancy locality maintained.
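The screening decision in step S105 can be sketched as below, under several assumptions that the text does not spell out: the fingerprint index is modeled as a dictionary from fingerprint to the identifier of the backed-up segment holding that block, sizes are in bytes, every block has a non-zero size, and a locality exactly equal to the threshold is treated as "not less than" it. The name `classify_segment` and the sample data are hypothetical.

```python
from collections import Counter


def classify_segment(segment, index, threshold):
    """Decide which blocks of an incoming segment are written and which removed.

    segment:   list of (fingerprint, size_in_bytes) in file order
    index:     dict mapping fingerprint -> id of the backed-up segment storing it
    threshold: data redundancy locality threshold, between 0 and 1
    Returns (to_write, to_remove) as lists of fingerprints.
    """
    segment_size = sum(size for _, size in segment)

    # Amount of duplicate data shared with each backed-up segment.
    shared = Counter()
    for fp, size in segment:
        if fp in index:
            shared[index[fp]] += size

    to_write, to_remove = [], []
    for fp, size in segment:
        stored_in = index.get(fp)
        if stored_in is None:
            to_write.append(fp)                          # non-duplicate block
        elif shared[stored_in] / segment_size < threshold:
            to_write.append(fp)                          # weak locality: keep a local copy
        else:
            to_remove.append(fp)                         # strong locality: deduplicate
    return to_write, to_remove


# Hypothetical incoming segment of eight 8 KB blocks.
segment = [(fp, 8192) for fp in ["n1", "x1", "x2", "x3", "x4", "n2", "y1", "n3"]]
# x1..x4 already live in backed-up segment 7, y1 in backed-up segment 3.
index = {"x1": 7, "x2": 7, "x3": 7, "x4": 7, "y1": 3}

to_write, to_remove = classify_segment(segment, index, threshold=0.25)
```

In this example the four blocks shared with segment 7 give a locality of 0.5 and are removed, while the single block shared with segment 3 gives 0.125, falls below the assumed threshold of 0.25, and is written again alongside the new blocks so that it does not create an isolated fragment.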

In step S106, these duplicate data blocks are deleted from the data segment to be backed up, and the flow ends.

In step S107, these data blocks are stored in sequence, and the flow ends.

Fig. 2 is a structural schematic diagram of the device of the present invention for reducing data fragmentation by optimizing data placement. In the figure, 1 denotes the data chunking and fingerprint computation unit, 2 the data segment organization unit, 3 the duplicate data block query unit, 4 the duplicate data block screening unit, 5 the data deletion unit, and 6 the data read/write unit.

The data chunking and fingerprint computation unit 1 is configured to chunk each file to be backed up that is transmitted to the storage server, to obtain data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;

The data segment organization unit 2 is configured to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;

The duplicate data block query unit 3 is configured to search the backed-up data segments for data blocks identical to those in the data segment to be backed up; if there is none, the block is a non-duplicate data block and is passed to the data read/write unit 6; if there is one, the block is a duplicate data block and is passed to the duplicate data block screening unit 4;

The duplicate data block screening unit 4 is configured to calculate the data redundancy locality between the backed-up data segment in which these duplicate data blocks reside and the data segment to be backed up, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit 6; otherwise they are passed to the data deletion unit 5;

The data deletion unit 5 is configured to delete the duplicate data blocks confirmed by the duplicate data block screening unit;

The data read/write unit 6 is configured to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.

The advantages of the present invention are that it reduces out-of-order data placement and data fragments, slows the worsening of data fragmentation at the cost of a small loss in data compression ratio, and improves the read/write performance of the system.

Claims

1. A method for reducing data fragmentation by optimizing data placement, characterized by comprising the following steps:

Step 6: write the data blocks to disk sequentially.

2. The method for reducing data fragmentation by optimizing data placement according to claim 1, characterized in that: the data redundancy locality in step 4 is quantified by counting the amount of duplicate data shared between the backed-up data segment in which the duplicate data block resides and the data segment to be backed up, and dividing this amount by the size of the data segment to be backed up.

3. The method for reducing data fragmentation by optimizing data placement according to claim 1, characterized in that: the threshold in step 4 is a predetermined data redundancy locality threshold, and this threshold controls the amount of duplicate data written to disk.

4. A device for reducing data fragmentation by optimizing data placement, characterized by comprising:

a data chunking and fingerprint computation unit (1), configured to chunk each file to be backed up that is transmitted to the storage server, to obtain data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;

a data segment organization unit (2), configured to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;

a duplicate data block query unit (3), configured to search the backed-up data segments for data blocks identical to those in the data segment to be backed up; if there is none, the block is a non-duplicate data block and is passed to the data read/write unit (6); if there is one, the block is a duplicate data block and is passed to the duplicate data block screening unit (4);

a duplicate data block screening unit (4), configured to calculate the data redundancy locality between the backed-up data segment in which these duplicate data blocks reside and the data segment to be backed up, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit (6); otherwise they are passed to the data deletion unit (5);

a data deletion unit (5), configured to delete the duplicate data blocks confirmed by the duplicate data block screening unit;

a data read/write unit (6), configured to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.