CN102999605A - Method and device for optimizing data placement to reduce data fragments - Google Patents

Method and device for optimizing data placement to reduce data fragments

Info

Publication number
CN102999605A
Authority
CN
China
Prior art keywords
data
backed
repeating
segment
locality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012104746888A
Other languages
Chinese (zh)
Inventor
谭玉娟
沙行勉
晏志超
诸葛晴凤
刘铎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN2012104746888A priority Critical patent/CN102999605A/en
Publication of CN102999605A publication Critical patent/CN102999605A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for optimizing data placement to reduce data fragments. The method comprises the following steps of: carrying out data partitioning on each file to be backed up, and determining a data block fingerprint of each data block to be backed up; organizing a plurality of continuous data blocks to be backed up into a data segment to be backed up; searching whether the data block same as that backed up by the backed-up data segment in the system as to each data block to be backed up in the data segment to be backed up, if not, judging the data block to be a non-repeated data block, entering a data reading and writing step, if so, judging the data block to be a repeated data block, and entering the next step; calculating the data redundancy locality of the data segment to be backed up and the backed-up data segment, and quantifying the data redundancy locality, if the value of the data redundancy locality is smaller than a preset threshold, entering the data reading and writing step, or else, entering the next step; and deleting the repeated data block shared by the data segment to be backed up and the backed-up data segment from the data segment to be backed up. According to the method disclosed by the invention, non-sequenced placement of the data and the data fragment are reduced; deterioration of the data fragment is slowed down under the premise of sacrificing a little of data compression ratio; and the reading and writing performance of the system is improved.

Description

A method and device for reducing data fragmentation by optimizing data placement
Technical field
The invention belongs to the field of computer information storage technology, and specifically relates to a method and device for reducing data fragmentation by optimizing data placement.
Background art
Data deduplication is an advanced lossless data compression technique, used mainly to save the storage space required in information storage and backup systems. Its basic principle is to cut each file into a sequence of consecutive data blocks and to delete the duplicate data blocks that occur within a single file or across multiple files, thereby reducing the storage space the data occupies. Most existing information storage and backup systems adopt this technique to optimize storage space and to reduce data storage and management costs.
In an information storage and backup system that uses data deduplication (referred to below as a deduplication system), there are two main classes of data blocks. One class is new data blocks that must be written to disk; the other is duplicate data blocks that must be eliminated. New data blocks are written to disk sequentially, one after another, whereas duplicate data blocks that are eliminated are not stored again. Consequently, for any file to be backed up, its new data blocks and its duplicate data blocks cannot be stored together: the storage location of each duplicate data block is determined by the earlier backup file that first wrote that block. Eliminating duplicate data blocks across files therefore breaks the rule, followed by earlier backup systems, that all data blocks of one backup file are stored together sequentially. The data blocks of a single backup file end up in several different locations, which produces multiple data fragments.
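The fragmentation effect described above can be illustrated with a small sketch. It is not part of the patent: the in-memory `disk` list, the `store` dictionary, and the function names `backup` and `count_fragments` are all hypothetical, and each block is represented by a short string rather than real data.

```python
def count_fragments(block_locations):
    """Count runs of physically consecutive blocks, in the file's logical order."""
    if not block_locations:
        return 0
    fragments = 1
    for prev, cur in zip(block_locations, block_locations[1:]):
        if cur != prev + 1:  # the next block is not adjacent on disk
            fragments += 1
    return fragments


def backup(file_blocks, store, disk):
    """Deduplicating write: new blocks are appended, duplicates are only referenced."""
    locations = []
    for block in file_blocks:
        if block not in store:
            store[block] = len(disk)  # disk address assigned to the new block
            disk.append(block)
        locations.append(store[block])
    return locations


store, disk = {}, []
first = backup(["a", "b", "c", "d"], store, disk)   # all blocks are new
second = backup(["a", "x", "c", "y"], store, disk)  # "a" and "c" are duplicates

print(first, count_fragments(first))    # [0, 1, 2, 3] 1
print(second, count_fragments(second))  # [0, 4, 2, 5] 4
```

In this toy layout the first file occupies one contiguous run, while the second file, which shares two blocks with the first, is spread over four separate runs.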
The deduplication methods of existing information storage and backup systems focus mainly on improving the data compression ratio and deduplication throughput. They do not take into account that deleting duplicate data blocks causes data blocks to be placed out of order and introduces many data fragments, and that these fragments seriously degrade data read/write performance and thus the performance of the information storage and backup system.
Summary of the invention
The technical problem to be solved by the invention is to reduce the out-of-order placement of data and the resulting data fragments, to slow the worsening of data fragmentation at the cost of only a small loss in data compression ratio, and to improve the read/write performance of the system.
To solve the above technical problem, the invention provides a method for reducing data fragmentation by optimizing data placement, comprising the following steps:
Step 1: divide each file to be backed up into data blocks, and compute a data block fingerprint for each data block to be backed up;
Step 2: organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;
Step 3: for each data block to be backed up in the data segment to be backed up, search the system for a backed-up data segment that has already backed up an identical data block; if none exists, the data block is a non-duplicate data block and the method proceeds to step 6; if one exists, the data block is a duplicate data block and the method proceeds to step 4;
Step 4: calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment, and quantify it; if the value of the data redundancy locality is less than a predetermined threshold, proceed to step 6; otherwise proceed to step 5;
Step 5: delete from the data segment to be backed up the duplicate data blocks that it shares with the backed-up data segment;
Step 6: write the data blocks to disk sequentially, one after another.
The invention also provides a device for reducing data fragmentation by optimizing data placement, comprising:
A data chunking and fingerprint computing unit, used to divide each file to be backed up that is sent to the storage server into data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;
A data segment organizing unit, used to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;
A duplicate data block query unit, used to search the already backed-up data segments for data blocks identical to those in the data segment to be backed up; if none exists, the data blocks are non-duplicate data blocks and are passed to the data read/write unit; if they exist, they are duplicate data blocks and are passed to the duplicate data block screening unit;
A duplicate data block screening unit, used to calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment in which these duplicate data blocks reside, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit; otherwise they are passed to the data deletion unit;
A data deletion unit, used to delete the duplicate data blocks confirmed by the duplicate data block screening unit;
A data read/write unit, used to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.
During the search for and deletion of duplicate data blocks, the invention retains the duplicate data blocks whose data redundancy locality is below the predetermined threshold and stores them sequentially together with the non-duplicate data blocks, so the invention can reduce the number of data fragments generated.
Compared with existing deduplication methods, the invention has the following advantages:
1. By retaining some of the duplicate data blocks and storing them sequentially together with the non-duplicate data blocks, the number of data fragments produced can be reduced;
2. By gathering more of the data blocks that belong to the same file together and reducing the number of data fragments, data redundancy locality can be greatly strengthened;
3. Stronger data redundancy locality improves not only deduplication throughput and data write performance but also data read performance;
4. By expressing data redundancy locality quantitatively and setting a threshold on it to control the amount of duplicate data retained, a large number of data fragments can be eliminated at the cost of only a small loss in data compression ratio, yielding better data read/write performance. The larger the data redundancy locality threshold, the more duplicate data is retained and the more compression ratio is sacrificed; conversely, the smaller the threshold, the less duplicate data is retained and the less compression ratio is sacrificed.
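The effect of the threshold in advantage 4 can be made concrete with a short sketch. It assumes, purely for illustration, a 256-block segment to be backed up whose duplicate blocks come from three backed-up segments; the segment names, the counts, and the function `retained_duplicates` are hypothetical, and block counts stand in for data amounts.

```python
def retained_duplicates(shared_counts, segment_size, threshold):
    """Number of duplicate blocks kept (written again) for a given threshold.

    shared_counts maps a backed-up segment id to the number of blocks it
    shares with the segment to be backed up. Duplicates are kept only when
    the shared fraction is below the threshold.
    """
    return sum(count for count in shared_counts.values()
               if count / segment_size < threshold)


# Hypothetical sharing pattern for one 256-block segment to be backed up.
shared_counts = {"seg_a": 200, "seg_b": 40, "seg_c": 10}

for threshold in (0.0, 0.05, 0.2, 0.9):
    kept = retained_duplicates(shared_counts, 256, threshold)
    print(f"threshold={threshold}: {kept} duplicate blocks written again")
```

With these numbers, raising the threshold from 0.05 to 0.2 to 0.9 raises the retained duplicates from 10 to 50 to 250 blocks, which is the loss in compression ratio that is traded for sequential placement.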
In summary, by retaining a small portion of the duplicate data blocks at the cost of a small loss in data compression ratio, the invention can store more of the data blocks belonging to the same file together sequentially, greatly reduce the number of data fragments generated, strengthen data redundancy locality, and improve data read/write performance.
Description of the drawings
The drawings of the invention are described as follows:
Fig. 1 is a flow chart of the method of the invention for reducing data fragmentation by optimizing data placement;
Fig. 2 is a structural schematic of the device of the invention for reducing data fragmentation by optimizing data placement.
Detailed description of the embodiments
The invention is further described below with reference to the drawings and embodiments:
The entities involved in the invention are a backup server and a storage server. The backup server supplies the data that needs to be backed up, and the storage server stores that data. The search for and deletion of duplicate data are carried out on the storage server.
Fig. 1 is a flow chart of the method of the invention for reducing data fragmentation by optimizing data placement. The flow starts at S101.
In step S102, each file to be backed up is divided into data blocks, for example with a variable-length chunking algorithm, yielding data blocks to be backed up whose average block size is a fixed value, such as 8 KB. A data block fingerprint is then computed for each data block to be backed up; for example, the SHA-1 hash algorithm can be used to compute a hash value for each data block, and the resulting hash value is called the data block fingerprint. A data block fingerprint uniquely represents a data block, and any two data blocks with the same fingerprint are considered identical.
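Step S102 can be sketched as follows. The patent names a variable-length chunking algorithm and SHA-1 but does not specify the chunking details, so the shift-and-add rolling hash, the `min_size` and `max_size` limits, and the function names below are assumptions chosen for illustration; production systems typically use a stronger rolling hash, and the minimum-size limit means the average chunk size is only roughly `avg_size`.

```python
import hashlib


def chunk_data(data, avg_size=8192, min_size=2048, max_size=65536):
    """Split bytes into variable-length chunks at content-defined cut points."""
    mask = avg_size - 1  # avg_size is assumed to be a power of two
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF  # simplified rolling hash (assumption)
        length = i - start + 1
        # Cut when the hash hits the pattern, or when the chunk grows too large.
        if (length >= min_size and (h & mask) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def fingerprint(chunk):
    """SHA-1 hash of a data block, used as its fingerprint."""
    return hashlib.sha1(chunk).hexdigest()


data = bytes((i * 31 + 7) % 256 for i in range(50000))
chunks = chunk_data(data)
fingerprints = [fingerprint(c) for c in chunks]
print(len(chunks), "chunks; first fingerprint:", fingerprints[0])
```

Because cut points depend on content rather than on fixed offsets, an insertion near the start of a file tends to change only the nearby chunks, which is why variable-length chunking is commonly preferred for backup workloads.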
In step S103, a plurality of consecutive data blocks to be backed up are organized into a data segment to be backed up; for example, each data segment contains 256 data blocks.
In step S104, the already backed-up data segments are searched for data blocks identical to those in the data segment to be backed up; such identical data blocks are duplicate data blocks. If duplicate data blocks exist, the flow proceeds to step S105; if not, it proceeds to step S107.
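The grouping in step S103 and the lookup in step S104 can be sketched together. The patent does not describe the index structure, so the dictionary `fingerprint_index` (fingerprint to id of the backed-up segment that holds the block) and the function names are assumptions for illustration only.

```python
def build_segments(fingerprints, blocks_per_segment=256):
    """Group consecutive block fingerprints into segments of a fixed block count."""
    return [fingerprints[i:i + blocks_per_segment]
            for i in range(0, len(fingerprints), blocks_per_segment)]


def classify_blocks(segment, fingerprint_index):
    """Split a segment into duplicate blocks and new (non-duplicate) blocks.

    Returns (duplicates, new_blocks), where duplicates is a list of
    (fingerprint, backed_up_segment_id) pairs in their original order.
    """
    duplicates = [(fp, fingerprint_index[fp])
                  for fp in segment if fp in fingerprint_index]
    new_blocks = [fp for fp in segment if fp not in fingerprint_index]
    return duplicates, new_blocks


# Hypothetical data: 600 block fingerprints and a tiny index of stored blocks.
fps = [f"fp{i}" for i in range(600)]
segments = build_segments(fps)
fingerprint_index = {"fp0": 7, "fp1": 7, "fp300": 9}

duplicates, new_blocks = classify_blocks(segments[0], fingerprint_index)
print([len(s) for s in segments])        # [256, 256, 88]
print(duplicates, len(new_blocks))       # [('fp0', 7), ('fp1', 7)] 254
```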
In step S105, for each duplicate data block, the amount of duplicate data shared by the data segment to be backed up and the backed-up data segment in which the block resides is counted, and this amount is divided by the size of the data segment to be backed up. The resulting value is the quantitative expression of data redundancy locality. If the quantized data redundancy locality is less than the predetermined threshold, the flow proceeds to step S107; otherwise, if it is greater than the predetermined threshold, the flow proceeds to step S106. The threshold is a predetermined data redundancy locality threshold, and it controls the amount of duplicate data retained. The larger the threshold, the more duplicate data is retained, the more compression ratio is sacrificed, and the stronger the data redundancy locality that is maintained; conversely, the smaller the threshold, the less duplicate data is retained, the less compression ratio is sacrificed, and the weaker the data redundancy locality that is maintained. The threshold thus strikes a balance between the compression ratio sacrificed and the data redundancy locality maintained.
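The screening in step S105 can be sketched as below, as one possible reading of the text rather than the inventors' implementation. Block counts stand in for data amounts (with variable-length blocks, byte counts could be used instead), the mapping `owner_of` from fingerprint to backed-up segment id is hypothetical, and a value equal to the threshold is treated as "not less than", following the "otherwise" wording of step 4.

```python
from collections import Counter


def screen_duplicates(pending, owner_of, threshold):
    """Decide which blocks of a segment to write and which to delete.

    pending   : fingerprints of the segment to be backed up, in order
    owner_of  : fingerprint -> id of the backed-up segment holding that block
    threshold : predetermined data redundancy locality threshold
    """
    # Amount of duplicate data shared with each backed-up segment.
    shared = Counter(owner_of[fp] for fp in pending if fp in owner_of)
    # Quantified locality: shared amount divided by the pending segment size.
    locality = {seg: n / len(pending) for seg, n in shared.items()}

    to_write, to_delete = [], []
    for fp in pending:
        seg = owner_of.get(fp)
        if seg is None or locality[seg] < threshold:
            to_write.append(fp)    # new block, or low-locality duplicate kept
        else:
            to_delete.append(fp)   # high-locality duplicate removed
    return locality, to_write, to_delete


pending = [f"f{i}" for i in range(8)]
owner_of = {"f0": "A", "f1": "A", "f2": "A", "f3": "A", "f4": "A", "f5": "B"}

locality, to_write, to_delete = screen_duplicates(pending, owner_of, 0.25)
print(locality)    # {'A': 0.625, 'B': 0.125}
print(to_write)    # ['f5', 'f6', 'f7']
print(to_delete)   # ['f0', 'f1', 'f2', 'f3', 'f4']
```

Here the single block shared with segment B is written again next to the new blocks `f6` and `f7`, so restoring this segment touches two on-disk regions instead of three.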
In step S106, these duplicate data blocks are deleted from the data segment to be backed up, and the flow ends.
In step S107, these data blocks are stored sequentially, one after another, and the flow ends.
Fig. 2 is a structural schematic of the device of the invention for reducing data fragmentation by optimizing data placement. In the figure, 1 denotes the data chunking and fingerprint computing unit, 2 the data segment organizing unit, 3 the duplicate data block query unit, 4 the duplicate data block screening unit, 5 the data deletion unit, and 6 the data read/write unit.
The data chunking and fingerprint computing unit 1 is used to divide each file to be backed up that is sent to the storage server into data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;
The data segment organizing unit 2 is used to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;
The duplicate data block query unit 3 is used to search the already backed-up data segments for data blocks identical to those in the data segment to be backed up; if none exists, the data blocks are non-duplicate data blocks and are passed to the data read/write unit 6; if they exist, they are duplicate data blocks and are passed to the duplicate data block screening unit 4;
The duplicate data block screening unit 4 is used to calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment in which these duplicate data blocks reside, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit 6; otherwise they are passed to the data deletion unit 5;
The data deletion unit 5 is used to delete the duplicate data blocks confirmed by the duplicate data block screening unit;
The data read/write unit 6 is used to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.
The advantage of the invention is that it reduces the out-of-order placement of data and the resulting data fragments, slows the worsening of data fragmentation at the cost of only a small loss in data compression ratio, and improves the read/write performance of the system.

Claims (4)

1. A method for reducing data fragmentation by optimizing data placement, characterized in that it comprises the following steps:
Step 1: divide each file to be backed up into data blocks, and compute a data block fingerprint for each data block to be backed up;
Step 2: organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;
Step 3: for each data block to be backed up in the data segment to be backed up, search the system for a backed-up data segment that has already backed up an identical data block; if none exists, the data block is a non-duplicate data block and the method proceeds to step 6; if one exists, the data block is a duplicate data block and the method proceeds to step 4;
Step 4: calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment, and quantify it; if the value of the data redundancy locality is less than a predetermined threshold, proceed to step 6; otherwise proceed to step 5;
Step 5: delete from the data segment to be backed up the duplicate data blocks that it shares with the backed-up data segment;
Step 6: write the data blocks to disk sequentially, one after another.
2. The method for reducing data fragmentation by optimizing data placement according to claim 1, characterized in that: the data redundancy locality in step 4 is quantified by counting the amount of duplicate data shared by the data segment to be backed up and the backed-up data segment in which the duplicate data blocks reside, and dividing this amount of duplicate data by the size of the data segment to be backed up.
3. The method for reducing data fragmentation by optimizing data placement according to claim 1, characterized in that: the threshold in step 4 is a predetermined data redundancy locality threshold, and this threshold controls the amount of duplicate data written to disk.
4. A device for reducing data fragmentation by optimizing data placement, characterized in that it comprises:
A data chunking and fingerprint computing unit (1), used to divide each file to be backed up that is sent to the storage server into data blocks to be backed up whose average block size is a fixed value, and to compute a data block fingerprint for each data block to be backed up;
A data segment organizing unit (2), used to organize a plurality of consecutive data blocks to be backed up into a data segment to be backed up;
A duplicate data block query unit (3), used to search the already backed-up data segments for data blocks identical to those in the data segment to be backed up; if none exists, the data blocks are non-duplicate data blocks and are passed to the data read/write unit (6); if they exist, they are duplicate data blocks and are passed to the duplicate data block screening unit (4);
A duplicate data block screening unit (4), used to calculate the data redundancy locality between the data segment to be backed up and the backed-up data segment in which these duplicate data blocks reside, and to quantify it; if the value of the data redundancy locality is less than a predetermined threshold, the blocks are passed to the data read/write unit (6); otherwise they are passed to the data deletion unit (5);
A data deletion unit (5), used to delete the duplicate data blocks confirmed by the duplicate data block screening unit;
A data read/write unit (6), used to write the duplicate data blocks that need to be retained to disk together with the other, non-duplicate data blocks.
CN2012104746888A 2012-11-21 2012-11-21 Method and device for optimizing data placement to reduce data fragments Pending CN102999605A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012104746888A CN102999605A (en) 2012-11-21 2012-11-21 Method and device for optimizing data placement to reduce data fragments

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2012104746888A CN102999605A (en) 2012-11-21 2012-11-21 Method and device for optimizing data placement to reduce data fragments

Publications (1)

Publication Number Publication Date
CN102999605A true CN102999605A (en) 2013-03-27

Family

ID=47928173

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012104746888A Pending CN102999605A (en) 2012-11-21 2012-11-21 Method and device for optimizing data placement to reduce data fragments

Country Status (1)

Country Link
CN (1) CN102999605A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103473150A (en) * 2013-08-28 2013-12-25 华中科技大学 Fragment rewriting method for data repetition removing system
CN103609091A (en) * 2013-06-24 2014-02-26 华为技术有限公司 Method and device for data transmission
CN103885859A (en) * 2014-03-12 2014-06-25 华中科技大学 Fragment removing method and system based on global statistics
CN104216890A (en) * 2013-05-30 2014-12-17 北京赛科世纪数码科技有限公司 Method and system for compressing ELF file
CN105824720A (en) * 2016-03-10 2016-08-03 中国人民解放军国防科学技术大学 Continuous data reading oriented data placement method of deduplication and erasure correcting combined system
CN105897921A (en) * 2016-05-27 2016-08-24 重庆大学 Data block routing method combining fingerprint sampling and reducing data fragments
CN105930534A (en) * 2016-06-20 2016-09-07 重庆大学 Method for reducing data fragments on basis of cloud storage service prices
CN106066818A (en) * 2016-05-25 2016-11-02 重庆大学 A kind of data layout's method improving data de-duplication standby system restorability
CN106294002A (en) * 2016-07-26 2017-01-04 广州杰赛科技股份有限公司 A kind of cloud backup method and device
CN107623788A (en) * 2017-09-22 2018-01-23 努比亚技术有限公司 Using the raising method, apparatus and computer-readable recording medium of toggle speed
CN110442555A (en) * 2019-07-26 2019-11-12 华中科技大学 A kind of method and system of the reduction fragment of selectivity reserved space
CN111124259A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array
CN112463058A (en) * 2020-11-27 2021-03-09 杭州海康威视系统技术有限公司 Fragmented data sorting method and device and storage node
CN113632059A (en) * 2020-03-06 2021-11-09 华为技术有限公司 Apparatus and method for eliminating defragmentation in deduplication
WO2023279833A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Data processing method and apparatus

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101582076A (en) * 2009-06-24 2009-11-18 浪潮电子信息产业股份有限公司 Data de-duplication method based on data base
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
CN102033924A (en) * 2010-12-08 2011-04-27 浪潮(北京)电子信息产业有限公司 Data storage method and system
CN102222085A (en) * 2011-05-17 2011-10-19 华中科技大学 Data de-duplication method based on combination of similarity and locality
CN102385554A (en) * 2011-10-28 2012-03-21 华中科技大学 Method for optimizing duplicated data deletion system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭玉娟: "数据备份系统中数据去重技术研究", 《中国博士学位论文全文数据库电子期刊》 *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216890A (en) * 2013-05-30 2014-12-17 北京赛科世纪数码科技有限公司 Method and system for compressing ELF file
CN103609091A (en) * 2013-06-24 2014-02-26 华为技术有限公司 Method and device for data transmission
CN103609091B (en) * 2013-06-24 2017-01-11 华为技术有限公司 Method and device for data transmission
CN103473150A (en) * 2013-08-28 2013-12-25 华中科技大学 Fragment rewriting method for data repetition removing system
CN103885859B (en) * 2014-03-12 2017-09-26 华中科技大学 It is a kind of to go fragment method and system based on global statistics
CN103885859A (en) * 2014-03-12 2014-06-25 华中科技大学 Fragment removing method and system based on global statistics
CN105824720A (en) * 2016-03-10 2016-08-03 中国人民解放军国防科学技术大学 Continuous data reading oriented data placement method of deduplication and erasure correcting combined system
CN105824720B (en) * 2016-03-10 2018-11-20 中国人民解放军国防科学技术大学 What a kind of data-oriented was continuously read delete again entangles the data placement method for deleting hybrid system
CN106066818B (en) * 2016-05-25 2019-05-17 重庆大学 A kind of data layout method improving data de-duplication standby system restorability
CN106066818A (en) * 2016-05-25 2016-11-02 重庆大学 A kind of data layout's method improving data de-duplication standby system restorability
CN105897921A (en) * 2016-05-27 2016-08-24 重庆大学 Data block routing method combining fingerprint sampling and reducing data fragments
CN105897921B (en) * 2016-05-27 2019-02-26 重庆大学 A kind of data block method for routing of the sampling of combination fingerprint and reduction fragmentation of data
CN105930534A (en) * 2016-06-20 2016-09-07 重庆大学 Method for reducing data fragments on basis of cloud storage service prices
CN106294002A (en) * 2016-07-26 2017-01-04 广州杰赛科技股份有限公司 A kind of cloud backup method and device
CN107623788A (en) * 2017-09-22 2018-01-23 努比亚技术有限公司 Using the raising method, apparatus and computer-readable recording medium of toggle speed
CN107623788B (en) * 2017-09-22 2020-10-27 海南飞特同创科技有限公司 Method and device for improving application starting speed and computer readable storage medium
CN111124259A (en) * 2018-10-31 2020-05-08 深信服科技股份有限公司 Data compression method and system based on full flash memory array
CN110442555A (en) * 2019-07-26 2019-11-12 华中科技大学 A kind of method and system of the reduction fragment of selectivity reserved space
CN110442555B (en) * 2019-07-26 2021-08-31 华中科技大学 Method and system for reducing fragments of selective reserved space
CN113632059A (en) * 2020-03-06 2021-11-09 华为技术有限公司 Apparatus and method for eliminating defragmentation in deduplication
CN112463058A (en) * 2020-11-27 2021-03-09 杭州海康威视系统技术有限公司 Fragmented data sorting method and device and storage node
WO2023279833A1 (en) * 2021-07-08 2023-01-12 华为技术有限公司 Data processing method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20130327