Background technology
With the development of networks and platform systems of all kinds, modern society has become an ocean of data. Order data from e-commerce, the identity information produced in daily life, browsing information exchanged on websites, academic research, and office documents are linked together and generated every day; every computer user is both a producer and a consumer of data. Information handling systems must face and process enormous data sources daily. Confronted with mass data, how to store and manage it effectively, and how to mine useful information from it, has become a focus of modern intelligent technology. Effective storage ultimately means using the same space resources to hold a greater volume of data. Many operations may be involved, but the methods that act on the data itself are data compression and redundant-data deletion. Deduplication and compression techniques that target the data itself are the most direct, and are currently the most widely studied fields.
Data deduplication technology has years of application and research foundations in both industry and academia. Throughout the development of this technology, the model framework has remained constant: compare data, eliminate repeated data fragments, and establish and maintain metadata, where the deduplication ratio and time efficiency are the points of emphasis. From the generation of deduplicated data out of an original file, to the restoration of that data back into the original file, the emphasis of concern differs; beyond its use in storage itself, data deduplication technology has been extended to varying degrees.
Surveying data compression and data deduplication, whatever the processing means, neither the processing nor the mining of information can do without the recovery of file data after storage processing. Moreover, a storage system does not exist merely to preserve large data: when a client requests access, or when the system server needs to verify or compare data, the system must recover file data from the storage medium. File restoration is therefore another key technical point of data processing. An effective file restoration technique responds rapidly to system requests and improves the system's ability to compute on and process large data.
Summary of the invention
The object of the invention is to realize a data reconstruction optimization method for an online data deduplication system. The object processed is the data package produced after data deduplication; the distribution of the deduplicated data within the deduplication package directly affects the system's response time to clients. By optimizing the storage organization, the system can respond to users' access requests closer to real time.
The object of the present invention is achieved by the following technical scheme:
A data reconstruction optimization method for an online data deduplication system comprises the following steps:
(1) After the online data deduplication system deduplicates the original files, it generates a deduplication package. The system responds to users' file-level access requests to the data, realizing users' storage access through file restoration. Over a period of preset length, the online data deduplication system counts the number of accesses to each file in the deduplication package; files whose access frequency is higher than a critical value are classified into the active file set, and files whose access frequency is lower than this critical value are classified into the non-active file set. Then the operation of step (2) is executed.
(2) Suspend the data access requests of the deduplication system and carry out file-level data block rearrangement. The active file filter splits the file entities in the deduplication package according to the active file set obtained in step (1). The processing procedure is: following the order in which the original files are arranged in the deduplication package, read the file entities one by one, and compare the file name and file type recorded in the metadata section of each file entity; if the file name is present in the active file set generated in step (1), execute the operation of step (3).
(3) Read the unique-data-block number field of the file entity; according to the data block mapping rule, find the storage position in the deduplication package of the unique data block for each recorded number, and write the corresponding unique data block into the file to be recovered. The last unique data block in the file entity is also written into the file to be recovered. If the scan of step (2) has completed for all file entities, execute step (4); otherwise return to step (2).
(4) Re-perform data block cutting and fingerprint calculation on the files in the active file set, generate new logical data block units and file description metadata, write the newly generated data information into a new deduplication package, and then execute the operation of step (5).
(5) Perform file-level data recovery on the unique data blocks in the old deduplication package that correspond to the non-active file set, append the files of the non-active file set to the new deduplication package, placing them at the rear end of the data fragments in the new deduplication package, and delete the old deduplication package after completion.
(6) The data distribution in the newly generated deduplication package now embodies the prefetching and concentration of data blocks and file metadata based on the active file set, and the data deduplication system resumes responding to users' data access requests.
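The classification and rearrangement performed by the steps above can be sketched as follows. This is a minimal illustration under assumed inputs, not the claimed implementation: the access log, the threshold, and the file names are all hypothetical, and files are represented only by name.

```python
from collections import Counter

def classify_files(access_log, threshold):
    """Step (1): count per-file accesses over the preset window and return
    the active file set (files accessed at least `threshold` times)."""
    counts = Counter(access_log)          # access_log: iterable of file names
    return {f for f, n in counts.items() if n >= threshold}

def rearrange(package_order, active):
    """Steps (2)-(5): active files keep their original relative order at the
    front of the new package; the remaining files are appended at the rear."""
    front = [f for f in package_order if f in active]
    rear = [f for f in package_order if f not in active]
    return front + rear

log = ["a.txt", "b.txt", "a.txt", "c.txt", "a.txt", "b.txt"]
active = classify_files(log, threshold=2)              # {'a.txt', 'b.txt'}
print(rearrange(["c.txt", "a.txt", "b.txt"], active))  # ['a.txt', 'b.txt', 'c.txt']
```

Note that the active files' relative order is preserved, matching the requirement that files inside each partition keep the original package order.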
Preferably, in step (2), the prerequisite of file-based data block rearrangement is to find all the data blocks that a single file comprises and to schedule the corresponding data blocks in a unified way. Before locating a file's corresponding data blocks, the files in the deduplication package must be recovered; file recovery is a process of reading data blocks and writing files, restoring the initial file data by reading the file metadata information and the data block information contained in each file entity in the deduplication package. File-level data block rearrangement not only concentrates and prefetches the unique data blocks to the front end of the data fragments in the deduplication package, but also prefetches the related descriptive information, such as data block fingerprints and logical data blocks, to the front end of the corresponding data fragments.
Preferably, in step (2), the active file filter is used to realize file data block distribution management; by changing the order in which files enter the data deduplication system, data block rearrangement based on the active file set is realized. The file filter first scans the files in the deduplication package in the system's file order; when a scanned file is in the active file set, it directly carries out the retrieval of the file's corresponding data blocks, fingerprints, logical data and file entity, where retrieval comprises the addressing and recovery of data blocks and the writing of the data region in the new deduplication package. After all files have been scanned, the remaining files not in the active file set are arranged in the deduplication package in their original order, after the data fragments of the active file set.
Preferably, in step (3), the storage format of data blocks in the deduplication package is one copy, multiple indexes; the addressing unit of a data block is the byte; the physical information of each unique data block in the deduplication package is recorded in the corresponding logical data block; the size of every logical data block is identical; and the numbering of the unique data blocks starts from 0 and increases successively.
Preferably, data block addressing comprises two mapping processes. First, the corresponding logical data block is found according to the data block number in the file entity; because every logical block has the same size, the addressing calculation is: multiply the data block number by the size of a logical block to obtain the physical address of the corresponding logical data block. Second, according to the physical offset and block size of the unique data block recorded in the logical data block just read, the corresponding data block is found. The addressing and physical mapping of a data block is in fact an "index → unique data block" conversion.
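Under these rules the two mappings reduce to one multiplication plus one indirection. The following sketch assumes a fixed-size logical record holding an (offset, size) pair as two little-endian 32-bit integers; this record layout is illustrative, not specified by the invention.

```python
import struct

LOGICAL_RECORD_SIZE = 8  # assumed layout: offset (u32) + size (u32)

def lookup_block(package, logical_region_start, block_number):
    """Two-step addressing: block number -> logical record -> unique block."""
    # Mapping 1: number * record size yields the logical record's address.
    rec_addr = logical_region_start + block_number * LOGICAL_RECORD_SIZE
    offset, size = struct.unpack_from("<II", package, rec_addr)
    # Mapping 2: the record's (offset, size) locate the unique data block.
    return package[offset:offset + size]

# Toy package: two unique blocks followed by their logical records.
blocks = b"HELLOWORLD"  # block 0 = b"HELLO" at offset 0, block 1 = b"WORLD" at 5
records = struct.pack("<II", 0, 5) + struct.pack("<II", 5, 5)
pkg = blocks + records
print(lookup_block(pkg, len(blocks), 1))  # b'WORLD'
```

Because every logical record has the same size, the first mapping is pure arithmetic and needs no search, which is what makes byte-granularity addressing into a shared unique block cheap.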
Preferably, after the file filter screens and recovers the original file data in the deduplication package based on the active file set, the data blocks that the files comprise and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file cutting, fingerprint generation and the establishment of service data. After the system cuts a file, its processing of a data block is to first compute the hash value of the data block, then perform hash comparison, and finally store the deduplicated data. The storage management module's processing of new unique data blocks is a schedulable procedure that can be executed concurrently.
Preferably, data recovery is the unified recovery of all the unique data blocks, logical data blocks, data block fingerprints and file metadata that a single file comprises.
Preferably, the processing of the data blocks comprised by a file after data deduplication is divided into four parallel threads: unique data block storage, logical data block storage, data block fingerprint storage and file metadata storage; the programming mechanism used for threading is OpenMP.
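The text names OpenMP as the threading mechanism; below is a Python `threading` analogue of the four parallel storage sections, offered only as a sketch of the decomposition. The four store functions are hypothetical stand-ins; since each section writes a distinct data class to a distinct region of the package, the threads share no write targets.

```python
import threading

def store_unique_blocks(out):  out["blocks"] = "written"
def store_logical_blocks(out): out["logical"] = "written"
def store_fingerprints(out):   out["fingerprints"] = "written"
def store_file_metadata(out):  out["metadata"] = "written"

def parallel_store():
    """Run the four storage sections concurrently, one thread per data
    class, mirroring the four-way split described above."""
    out = {}
    sections = [store_unique_blocks, store_logical_blocks,
                store_fingerprints, store_file_metadata]
    threads = [threading.Thread(target=s, args=(out,)) for s in sections]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out

print(sorted(parallel_store()))  # ['blocks', 'fingerprints', 'logical', 'metadata']
```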
Preferably, the active file filter scans the files in the deduplication package in the time order in which the original files entered the data deduplication system, compares one by one whether the file name of each file entity in the deduplication package is present in the active file set, and splits the processing of files of different access frequencies.
Preferably, the characteristic discrete distribution, in the deduplication package, of the time order in which the original files entered the data deduplication system is changed: the data content comprised in the deduplication package, namely the unique data blocks, logical data blocks, data block fingerprints and file metadata, is gathered in a unified way with a single file as the base unit and, according to the access frequency of the file, dispatched to the front end of the corresponding data fragments in the deduplication package.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The data rearrangement of the present invention is based on active files: taking the file as the processing unit, all the data blocks comprised in a single file, together with the data information corresponding to those blocks, are scheduled and distributed in a unified way, which is consistent with the content and manner of user-level access requests.
(2) The present invention splits the data of active files and non-active files, concentrating and prefetching the active file set to the front end of the data fragments in the deduplication package, saving the time overhead the system spends searching for file entities.
(3) File recovery termination mechanism: the present invention adds a termination judgement to the process of recovering files from the deduplication package after active-file rearrangement. Once all the files in the file set have been recovered from the package, the system no longer scans the remaining file entities in the deduplication package, which saves unnecessary file retrieval time.
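The termination judgement amounts to an early exit from the entity scan. A minimal sketch, assuming entities are represented by file names and actual block recovery is abstracted away:

```python
def recover_active(entities, active):
    """Scan file entities in package order, recovering those in the active
    set; stop as soon as every active file has been seen (termination
    judgement), skipping the remaining entities entirely."""
    remaining = set(active)
    recovered = []
    for name in entities:
        if name in remaining:
            recovered.append(name)  # stand-in for actual block recovery
            remaining.discard(name)
        if not remaining:           # all active files recovered: stop scanning
            break
    return recovered

print(recover_active(["x", "a", "b", "y"], {"a", "b"}))  # ['a', 'b']
```

Here the scan stops at "b" and never inspects "y"; after rearrangement the active entities sit at the front of the package, so the scan terminates early in the common case.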
Embodiment
The present invention is described in further detail below in conjunction with the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Figure 1, the application scenario model of the data reconstruction optimization method of an online data deduplication system of the present invention is an online data deduplication system comprising two parts, a server end and a client:
The main functions realized by the client are to cut files into blocks, compute the hash value of each data block, and store the hash value as the fingerprint of that data block. By comparing the fingerprints of the data blocks, it judges whether a block is a repeated block; the system stores only unique data blocks and records the ID of each data block. Each file creates a file entity, which preserves the metadata of the original file, including the file name, the number of data blocks, the data block ID size, the size of the last data block and the numbers of its group of unique data blocks, as well as the last data block of the file itself (because this block is usually smaller than a normal data block and its probability of recurring is very small, it is stored separately). The unique data blocks, data block fingerprints and all file entities are kept in one deduplication package, and the data in the deduplication package is sent to the server end in the form of a file.
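The client-side pipeline described above can be sketched as follows: fixed-size cutting, per-block fingerprints, a fingerprint table of unique blocks, and a file entity recording the block IDs. The tiny block size and the choice of SHA-1 are assumptions for illustration; the invention does not fix either.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small block size, for illustration only

def dedup_file(name, data, fingerprint_table, unique_blocks):
    """Cut `data` into blocks, fingerprint each block, store only unique
    blocks, and return a file entity recording the name and block IDs."""
    ids = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        fp = hashlib.sha1(block).hexdigest()
        if fp not in fingerprint_table:           # new unique block
            fingerprint_table[fp] = len(unique_blocks)
            unique_blocks.append(block)
        ids.append(fingerprint_table[fp])         # repeated block: reuse its ID
    return {"name": name, "block_count": len(ids), "block_ids": ids}

table, blocks = {}, []
entity = dedup_file("a.txt", b"ABCDABCDXY", table, blocks)
print(entity["block_ids"], len(blocks))  # [0, 0, 1] 2
```

The repeated b"ABCD" block is stored once but referenced twice, and the short final block (b"XY") ends up as its own entry, consistent with the separate handling of the last block described above.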
The server parses the data in the deduplication package and preserves the unique data blocks, the data block fingerprint table, the logical data and the file entities; the operating scope of file-based data block rearrangement is exactly the reading and writing of these four classes of data on the server. File-based rearrangement reorganizes the order of the data in the deduplication package so that the system obtains better file retrieval and recovery time efficiency.
To illustrate the specific embodiment model of the present invention more clearly, a detailed analysis is made below in conjunction with the workflow schematic diagram of file-based data block rearrangement (Fig. 2), the schematic diagram of data block mapping and addressing in the deduplication package (Fig. 3) and the schematic diagram of the data stream storage organization (Fig. 4).
As shown in Figure 2, the system's file rearrangement is divided into two stages. The first stage is file recovery; the object processed is the deduplication package. In file-based data recovery, first the file entities in the deduplication package are read; a file entity contains the numbers of the unique data blocks corresponding to the file. Then the corresponding logical data block is found according to each data block number, the offset and size information in the logical data block is read, and the unique data block in the deduplication package is located. Finally, following the data block order in the file entity, the unique data blocks are written into the corresponding file. The second stage is file rearrangement, which comprises three sequentially executed modules: (1) the file filter, (2) data block cutting, (3) data block processing. The processing unit around which each part's function revolves is the file; the base unit of data processing is the data block.
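The first-stage recovery is the inverse of the client-side packaging: follow the block numbers recorded in the entity and concatenate the unique blocks in entity order. A minimal sketch, assuming an in-memory list of unique blocks and a dictionary-shaped file entity (both illustrative):

```python
def recover_file(entity, unique_blocks):
    """Stage 1: follow the block numbers in the file entity, fetch each
    unique block by its number, and emit them in the entity's order."""
    return b"".join(unique_blocks[i] for i in entity["block_ids"])

unique_blocks = [b"ABCD", b"XY"]
entity = {"name": "a.txt", "block_ids": [0, 0, 1]}
print(recover_file(entity, unique_blocks))  # b'ABCDABCDXY'
```

A block referenced several times in the entity (here block 0) is read from its single stored copy each time, which is precisely the one-copy, multiple-indexes format at work.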
As shown in Figure 3, the file filter retrieves the active files in the data set taking the file as the base unit, and the retrieval of a file in the deduplication package performs the corresponding data block addressing and operations according to the file entity. The storage format of data blocks in the deduplication package is one copy, multiple indexes. The data deduplication system therefore needs to establish logical description information for the data blocks, to make it convenient to build indexes to the unique data blocks shared between different files. The addressing unit of a data block is the byte, and the physical information of each unique data block in the deduplication package is recorded in the corresponding logical data block. The size of every logical data block is identical, and the numbering of the unique data blocks starts from 0 and increases successively. Data block addressing comprises two mapping processes: first, the corresponding logical data block is found according to the data block number in the file entity; because every logical block has the same size, the addressing calculation is to multiply the data block number by the size of a logical block, which yields the physical address of the corresponding logical data block. Then, the second addressing finds the corresponding data block according to the physical offset and block size of the unique data block recorded in the logical data block just read. The addressing and physical mapping of a data block is in fact an "index → unique data block" conversion.
As shown in Figure 4, after the file filter recovers the original file data in the deduplication package based on the active file set, the data blocks the files comprise and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file cutting, fingerprint generation and the establishment of service data. After the system cuts a file, the processing of a data block is to first compute its hash value, then perform hash comparison, and finally store the deduplicated data. The storage management module's processing of new unique data blocks is a schedulable procedure that can be executed concurrently. To improve the processing efficiency of data blocks, the model proposed by the present invention uses OpenMP multithreading to divide the storage process into four concurrently executed threads: inserting hash values into the hash table, unique data block processing, logical data block processing and metadata processing. Because each thread writes data to a different location in the deduplication package, concurrent storage management not only improves the output efficiency of the system but also, to a certain extent, maintains the independence of the data.
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not restricted to the described embodiment; any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.