Background Art
With the continuous development of networks and platform systems of every kind, modern society has become an ocean of data. People produce identity information in daily communication, browsing records in website interaction, order data in e-commerce, and document data in study and office work; every computer user is both a producer and a consumer of data. Information processing systems face enormous data sources to store and process every day. Confronted with massive data, how to store and manage it effectively and how to mine useful information from it have become focal points of modern intelligent technology. Effective storage of data ultimately means storing a larger volume of data in the same space resource. Many operations can serve this goal, but the methods that act on the data itself are data compression and redundant data elimination. Deduplication and compression of the data itself are the most direct approaches and are currently the most widely researched fields.
Data deduplication technology has many years of application and research foundation in both industry and academia. Viewed across the development of this technology, its model framework always works by continuously comparing data, eliminating repeated data fragments, and establishing and maintaining metadata; the deduplication ratio and the time efficiency are the main concerns of the technology. From the original file to the deduplicated data, and then from the deduplicated data back to the original file, the emphases differ, and data deduplication technology has been extended to varying degrees beyond storage itself.
Surveying data compression and data deduplication, whichever processing means is used, processing the data and mining its information cannot avoid one requirement: the processed file data must eventually be restored from storage. Moreover, a storage system is not only intended to preserve big data; when a client requests access, or when the system server needs to verify and compare data, the system's file data must be restored from the storage medium. File restoration therefore becomes another key technology point of data processing. An effective file restoration technique can respond quickly to system requests and improve the system's capability to compute and process big data.
Summary of the Invention
The purpose of the present invention is to provide a data restoration optimization method for an online data deduplication system. The objects processed are the packages produced after data deduplication; the distribution of the deduplicated data within the deduplication package directly affects the system's response time to clients. By optimizing the storage organization, the system can answer users' access requests more promptly.
This purpose of the present invention is achieved by the following technical scheme:
A data restoration optimization method for an online data deduplication system comprises the following steps:
(1) After the online data deduplication system deduplicates the original files, a deduplication package is generated. The deduplication system responds to users' file-level access requests to the stored data, and user access to storage is served through file restoration. Within a preset measurement period, the online data deduplication system counts the number of accesses to each file in the deduplication package, classifies the files whose access frequency is above a critical value as the active file set and the files whose access frequency is below this critical value as the inactive file set, and then performs step (2);
(2) Suspend the data access requests of the deduplication system and perform a file-level data block rearrangement. An active file filter splits the processing of the file entities in the deduplication package according to the active file set obtained in step (1). The procedure is: following the order in which the original files are arranged in the deduplication package, read the file entities one by one and compare the file name and file type recorded in the metadata section of each file entity; if the file name is present in the active file set generated in step (1), perform step (3);
(3) Read the unique data block number region of the file entity and, according to the data block mapping rule, find the storage position in the deduplication package of the unique data block corresponding to each number; write the corresponding unique data blocks into the file to be restored, and finally also write the last unique data block of the file entity into the file to be restored. Once step (2) has been completed for all files, perform step (4); otherwise return to step (2);
(4) Re-chunk the files of the active file set and recompute their fingerprints, generate new logical data block units and file description metadata, write the newly generated data into a new deduplication package, and then perform step (5);
(5) Perform file-level data restoration on the unique data blocks in the old deduplication package that correspond to the inactive file set, append the files of the inactive file set to the new deduplication package at the rear end of its data fragments, and delete the old deduplication package after completion;
(6) The data distribution in the newly generated deduplication package is based on prefetching and concentrating the data blocks and file metadata of the active files; the data deduplication system then resumes responding to users' data access requests.
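Step (1) above, the split into an active and an inactive file set by access frequency, can be sketched as follows. This is a minimal illustration only: the counting structure, the threshold value, and the strict-greater comparison are assumptions, not fixed by the method.

```python
def classify_files(access_counts, threshold):
    """Split files into an active set and an inactive set by access frequency.

    access_counts: dict mapping file name -> number of accesses observed
                   during the preset measurement window.
    threshold:     critical access-count value separating active from inactive.
    """
    active = {name for name, count in access_counts.items() if count > threshold}
    inactive = set(access_counts) - active
    return active, inactive

# Illustrative counts observed over one measurement window.
counts = {"a.doc": 12, "b.log": 1, "c.jpg": 7, "d.tmp": 0}
active_set, inactive_set = classify_files(counts, threshold=5)
print(sorted(active_set))    # files accessed more than 5 times
print(sorted(inactive_set))  # all remaining files
```

In practice the measurement window and threshold would be tuned so that the active set stays small relative to the package, since only those files are re-chunked in step (4).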
Preferably, in step (2), the prerequisite for the file-level data block rearrangement is to find all data blocks contained in a single file and to schedule the corresponding data blocks in a unified way. Before the data blocks of a file can be located, the files in the deduplication package must be restored; file restoration is a process of reading data blocks and writing files, which recovers the initial file data from the file metadata and data block information contained in each file entity of the deduplication package. In the file-level data block rearrangement, not only are the unique data blocks concentrated and prefetched to the front end of the data fragment in the deduplication package, but the related description information, such as the data block fingerprints and logical data blocks, is also prefetched to the front end of the corresponding data fragment.
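The read-blocks-and-write-file restoration just described can be sketched under assumed data layouts: a file entity holds an ordered list of unique block numbers plus the file's separately stored last (tail) block, a logical block table maps each number to an (offset, size) pair, and the package is a flat byte buffer. All names and shapes here are hypothetical illustrations.

```python
def restore_file(entity, logical_table, package):
    """Rebuild one file's bytes from the deduplication package.

    entity:        dict with "block_ids" (ordered unique-block numbers)
                   and "tail" (the file's final, separately stored block).
    logical_table: list where logical_table[n] = (offset, size) of unique
                   data block n inside the package buffer.
    package:       bytes object holding the unique data blocks.
    """
    out = bytearray()
    for block_id in entity["block_ids"]:
        offset, size = logical_table[block_id]    # data block mapping rule
        out += package[offset:offset + size]      # copy the unique block
    out += entity["tail"]                         # append the last block
    return bytes(out)

package = b"AAAABBBBCCCC"
table = [(0, 4), (4, 4), (8, 4)]                  # block n -> (offset, size)
entity = {"block_ids": [2, 0], "tail": b"ZZ"}
print(restore_file(entity, table, package))       # b'CCCCAAAAZZ'
```

Note that the block order written to the file follows the entity's list, not the blocks' physical order in the package, which is exactly why rearranging the package layout can speed restoration without changing its result.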
Preferably, in step (2), the active file filter is used to manage the distribution of file data blocks. By changing the order in which files enter the data deduplication system, a data block rearrangement based on the active file set is realized. The file filter first scans the files in the deduplication package in the system file order; when a scanned file belongs to the active file set, the retrieval of the data blocks, fingerprints, logical data and file entity corresponding to that file is performed directly. The retrieval process includes the addressing and recovery of the data blocks and the writing of the data region in the new deduplication package. After all files have been scanned, the remaining files that are not in the active file set are arranged, in their original order, after the data fragments of the active file set in the deduplication package.
Preferably, in step (3), the storage format of a data block in the deduplication package is one copy with multiple indexes, and the addressing unit of data blocks is the byte. The physical information of each unique data block is recorded in the corresponding logical data block in the deduplication package; all logical data blocks have the same size, and the unique data blocks are numbered consecutively starting from 0.
Preferably, data block addressing comprises two mapping processes. First, the corresponding logical data block is found from the data block number in the file entity; because all logical blocks have the same size, the addressing calculation is: the data block number multiplied by the logical block size gives the physical address of the corresponding logical data block. Second, the corresponding unique data block is found from the physical displacement and block size of the unique data block recorded in the logical data block just read. The addressing and physical mapping of data blocks is in effect an "index → unique data block" conversion.
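The two mapping steps can be sketched as follows, under assumed encodings: the logical-block region is an array of fixed-size records, each storing the displacement and size of one unique data block as two 4-byte little-endian integers, so record n starts at byte n × 8. These field widths are illustrative assumptions, not part of the claimed format.

```python
import struct

LOGICAL_RECORD_SIZE = 8  # assumed: 4-byte displacement + 4-byte size

def address_unique_block(block_id, logical_region, data_region):
    """Two-level mapping: block number -> logical record -> unique block."""
    # First mapping: number * record size gives the logical record's address
    # (all logical records have the same size, so this is pure arithmetic).
    rec_addr = block_id * LOGICAL_RECORD_SIZE
    displacement, size = struct.unpack_from("<II", logical_region, rec_addr)
    # Second mapping: displacement and size locate the unique block itself.
    return data_region[displacement:displacement + size]

data = b"xxxxHELLOyyy"
logical = struct.pack("<II", 4, 5)  # record 0: block at offset 4, length 5
print(address_unique_block(0, logical, data))  # b'HELLO'
```

Because the first mapping is a multiplication rather than a lookup, addressing any block is O(1) regardless of where it sits in the package.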
Preferably, after the file filter has screened and restored the original file data in the deduplication package on the basis of the active file set, the data blocks contained in the files and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file chunking, fingerprint generation, and the establishment and maintenance of the data. After the system chunks a file, each data block is processed by first computing its hash value, then performing a hash comparison, and finally storing the deduplicated data; the storage management module of the system handles new unique data blocks as a concurrently executable schedule.
Preferably, data restoration is the unified recovery of all unique data blocks, logical data blocks, data block fingerprints and file metadata contained in a single file.
Preferably, the processing of the data blocks contained in a file that has undergone data deduplication is divided into four parallel threads: unique data block storage, logical data block storage, data block fingerprint storage and file metadata storage; the threading programming mechanism is OpenMP.
Preferably, the active file filter scans the files in the deduplication package in the chronological order in which the original files entered the data deduplication system, compares one by one whether the file name of each file entity in the deduplication package is present in the active file set, and processes files of different access frequencies on separate paths.
Preferably, the characteristic that the original files in the deduplication package are discretely distributed in the order in which the files entered the system is changed: the data content in the deduplication package, including the unique data blocks, logical data blocks, data block fingerprints and file metadata, is uniformly collected and dispatched, by file access frequency and with a single file as the basic unit, to the front end of the corresponding data fragment in the deduplication package.
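This rearrangement can be illustrated with a toy package layout: per-file data units, originally ordered by entry time, are regrouped so that the files of the active set come first, with each file's blocks, fingerprints and metadata kept together as one schedulable unit. The entry structure is hypothetical.

```python
def rearrange_package(entries, active_set):
    """Move active files' data units to the front end of the package.

    entries: list of (file_name, data_unit) pairs in original entry order,
             where data_unit bundles the file's unique blocks, logical
             blocks, fingerprints and metadata as one unit.
    """
    active = [e for e in entries if e[0] in active_set]      # old order kept
    inactive = [e for e in entries if e[0] not in active_set]
    return active + inactive  # inactive files appended at the rear, in order

entries = [("a", "u1"), ("b", "u2"), ("c", "u3"), ("d", "u4")]
print(rearrange_package(entries, {"b", "d"}))
# active files b and d move to the front; a and c follow in original order
```

This is a stable partition: relative order inside each group is preserved, matching the method's requirement that inactive files keep their original arrangement behind the active data fragments.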
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The data rearrangement of the present invention is based on active files: with a file as the processing unit, all the data blocks contained in a single file and their corresponding data information are scheduled and distributed in a unified way, which is consistent in content and manner with user-level access requests.
(2) The present invention separates the data of active files from that of inactive files and prefetches the active file set to the front end of the data fragment in the deduplication package, saving the time overhead of the system in locating file entities.
(3) File restoration termination mechanism: the present invention adds a termination judgment to the file restoration process in the deduplication package after the active file rearrangement; that is, once all files of the file set have been restored from the package, the system no longer scans the remaining file entities in the package, which saves unnecessary file retrieval time.
Embodiment
As shown in Fig. 1, the application scenario of the data restoration optimization method for an online data deduplication system of the present invention is an online data deduplication system comprising two parts, a server end and a client:
The main function of the client is to chunk files, compute the hash value of each data block, store the hash value, and use it as the fingerprint of the data block. By comparing the fingerprints of the data blocks, it is determined whether a data block is a duplicate; the system stores only the unique data blocks and records the ID of every data block. A file entity is created for each file to preserve the original metadata, including the file name, the number of data blocks, the data block ID size, the size of the last data block, the group of unique data block numbers, and the last data block of the file (because this block is usually smaller than a normal data block and its probability of recurring is very small, it is stored separately). The unique data blocks, data block fingerprints and all file entities are saved in a deduplication package, and the data in the deduplication package is sent to the server end in the form of a file.
The server parses the data in the deduplication package and saves the unique data blocks, the data block fingerprint table, the logical data and the file entities; the operation interval of the file-level data block rearrangement on the server is exactly the reading and writing of these four kinds of data. File-level rearrangement reorganizes the order of the data in the deduplication package so as to obtain better file retrieval and restoration time efficiency for the system.
In order to illustrate the implementation model of the present invention more clearly, a detailed analysis is given below with reference to the workflow diagram of the file-level data block rearrangement (Fig. 2), the data block mapping and addressing scheme in the deduplication package (Fig. 3), and the schematic diagram of the data stream storage organization (Fig. 4).
As shown in Fig. 2 system enters rearrangement to file is divided into two stages.First stage is file access pattern, process
To as if duplicate removal bag.Based on the data recovery of file, first, read the document entity in duplicate removal bag, document entity contains phase
Answer the numbering of file corresponding unique data block;Then, corresponding logic data block is found according to data block numbering, read logic
The displacement of data block and size information, find the unique data block in duplicate removal bag;Finally, the data block based on document entity arranges
Sequentially, unique data block is written in corresponding file.Second stage is that file is reset, and file is reset has three orders to hold
The module of row.(1) file filter device, (2) data block cutting, (3) data block is processed, the function of each several part around process unit
It is all file, the base unit of data processing is data block.
As shown in Fig. 3, the file filter retrieves the data of the active file set with a file as the basic unit; file retrieval in the deduplication package performs the corresponding data block addressing and operations according to the file entity. The storage format of a data block in the deduplication package is one copy with multiple indexes, so the data deduplication system needs to establish logical description information for the data blocks, providing convenient indexes to the unique data blocks shared between different files. The addressing unit of data blocks is the byte, and the physical information of each unique data block in the deduplication package is recorded in the corresponding logical data block. All logical data blocks have the same size, and the unique data blocks are numbered consecutively from 0. Data block addressing comprises two mapping processes. First, the corresponding logical data block is found from the data block number in the file entity; because all logical blocks have the same size, the addressing calculation is: the data block number multiplied by the logical block size gives the physical address of the corresponding logical data block. Second, the corresponding unique data block is found from the physical displacement and block size of the unique data block recorded in the logical data block just read. The addressing and physical mapping of data blocks is in effect an "index → unique data block" conversion.
As shown in Fig. 4, after the file filter has screened and restored the original file data in the deduplication package on the basis of the active file set, the data blocks contained in the files and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file chunking, fingerprint generation, and the establishment and maintenance of the data. After the system chunks a file, each data block is processed by first computing its hash value, then performing a hash comparison, and finally storing the deduplicated data. The storage management module of the system handles new unique data blocks as a concurrently executable schedule. In order to improve the processing efficiency of data blocks, the model proposed by the present invention divides the storing process into four concurrently executing threads using OpenMP multithreading: hash value insertion into the hash table, unique data block processing, logical data block processing and metadata processing. Because each thread writes data to a different position in the deduplication package, concurrent storage management not only improves the delivery efficiency of the system but also maintains the independence of the data to a certain extent.
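The four-way concurrent store described for Fig. 4 can be sketched in Python with a thread pool standing in for the OpenMP threads; the key property mirrored here is that each stream writes to its own independent destination, which is what makes the concurrency safe. The stream names follow the text, but the data shapes are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def store_stream(name, items, sink):
    """One storage stream: write its items to its own destination."""
    sink[name] = list(items)  # each stream has a distinct target, so no races
    return name

def concurrent_store(unique_blocks, logical_blocks, fingerprints, metadata):
    """Store the four kinds of deduplication data in four concurrent tasks."""
    sink = {}
    streams = {
        "unique": unique_blocks,
        "logical": logical_blocks,
        "fingerprint": fingerprints,
        "metadata": metadata,
    }
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(store_stream, name, items, sink)
                   for name, items in streams.items()]
        for f in futures:
            f.result()  # wait for completion and propagate any exception
    return sink

sink = concurrent_store([b"blk"], [(0, 3)], ["fp1"], [{"name": "a"}])
print(sorted(sink))  # ['fingerprint', 'logical', 'metadata', 'unique']
```

As in the OpenMP design, correctness rests on the four streams writing disjoint regions; if they shared a destination, explicit synchronization would be required.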
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.