Background Art
With the continuous development of networks and platform systems of every kind, modern society has become an ocean of data. People produce identity information in daily communication, browsing records in website interaction, order data in e-commerce, and document data in study and office work; every computer user is both a producer and a consumer of data. Information processing systems face enormous data sources to store and process every day. Confronted with massive data, how to store and manage it effectively and how to mine useful information from it have become focal points of modern intelligent technology. Effective storage of data ultimately means storing a larger volume of data in the same space resource. Many operations can serve this goal, but the methods that act on the data itself are data compression and redundant data elimination. Deduplication and compression of the data itself are the most direct approaches and are currently the most widely researched fields.
Data deduplication technology has many years of application and research foundation in both industry and academia. Viewed across the development of this technology, its model framework always works by continuously comparing data, eliminating repeated data fragments, and establishing and maintaining metadata; the deduplication ratio and the time efficiency are the main concerns of the technology. From the original file to the deduplicated data, and then from the deduplicated data back to the original file, the emphases differ, and data deduplication technology has been extended to varying degrees beyond storage itself.
Surveying data compression and data deduplication, whichever processing means is used, processing the data and mining its information cannot avoid one requirement: the processed file data must eventually be restored from storage. Moreover, a storage system is not only intended to preserve big data; when a client requests access, or when the system server needs to verify and compare data, the system's file data must be restored from the storage medium. File restoration therefore becomes another key technology point of data processing. An effective file restoration technique can respond quickly to system requests and improve the system's capability to compute and process big data.
Summary of the Invention
The purpose of the present invention is to provide a data restoration optimization method for an online data deduplication system. The objects processed are the packages produced after data deduplication; the distribution of the deduplicated data within the deduplication package directly affects the system's response time to clients. By optimizing the storage organization, the system can answer users' access requests more promptly.
This purpose of the present invention is achieved by the following technical scheme:
A data restoration optimization method for an online data deduplication system comprises the following steps:
(1) After the online data deduplication system deduplicates the original files, a deduplication package is generated. The deduplication system responds to users' file-level access requests to the stored data, and user access to storage is served through file restoration. Within a preset measurement period, the online data deduplication system counts the number of accesses to each file in the deduplication package, classifies the files whose access frequency is above a critical value as the active file set and the files whose access frequency is below this critical value as the inactive file set, and then performs step (2);
(2) Suspend the data access requests of the deduplication system and perform a file-level data block rearrangement. An active file filter splits the processing of the file entities in the deduplication package according to the active file set obtained in step (1). The procedure is: following the order in which the original files are arranged in the deduplication package, read the file entities one by one and compare the file name and file type recorded in the metadata section of each file entity; if the file name is present in the active file set generated in step (1), perform step (3);
(3) Read the unique data block number region of the file entity and, according to the data block mapping rule, find the storage position in the deduplication package of the unique data block corresponding to each number; write the corresponding unique data blocks into the file to be restored, and finally also write the last unique data block of the file entity into the file to be restored. Once step (2) has been completed for all files, perform step (4); otherwise return to step (2);
(4) Re-chunk the files of the active file set and recompute their fingerprints, generate new logical data block units and file description metadata, write the newly generated data into a new deduplication package, and then perform step (5);
(5) Perform file-level data restoration on the unique data blocks in the old deduplication package that correspond to the inactive file set, append the files of the inactive file set to the new deduplication package at the rear end of its data fragments, and delete the old deduplication package after completion;
(6) The data distribution in the newly generated deduplication package is based on prefetching and concentrating the data blocks and file metadata of the active files; the data deduplication system then resumes responding to users' data access requests.
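Step (1) above, the split into an active and an inactive file set by access frequency, can be sketched as follows. This is a minimal illustration only: the counting structure, the threshold value, and the strict-greater comparison are assumptions, not fixed by the method.

```python
def classify_files(access_counts, threshold):
    """Split files into an active set and an inactive set by access frequency.

    access_counts: dict mapping file name -> number of accesses observed
                   during the preset measurement window.
    threshold:     critical access-count value separating active from inactive.
    """
    active = {name for name, count in access_counts.items() if count > threshold}
    inactive = set(access_counts) - active
    return active, inactive

# Illustrative counts observed over one measurement window.
counts = {"a.doc": 12, "b.log": 1, "c.jpg": 7, "d.tmp": 0}
active_set, inactive_set = classify_files(counts, threshold=5)
print(sorted(active_set))    # files accessed more than 5 times
print(sorted(inactive_set))  # all remaining files
```

In practice the measurement window and threshold would be tuned so that the active set stays small relative to the package, since only those files are re-chunked in step (4).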
Preferably, in step (2), the prerequisite for the file-level data block rearrangement is to find all data blocks contained in a single file and to schedule the corresponding data blocks in a unified way. Before the data blocks of a file can be located, the files in the deduplication package must be restored; file restoration is a process of reading data blocks and writing files, which recovers the initial file data from the file metadata and data block information contained in each file entity of the deduplication package. In the file-level data block rearrangement, not only are the unique data blocks concentrated and prefetched to the front end of the data fragment in the deduplication package, but the related description information, such as the data block fingerprints and logical data blocks, is also prefetched to the front end of the corresponding data fragment.
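The read-blocks-and-write-file restoration just described can be sketched under assumed data layouts: a file entity holds an ordered list of unique block numbers plus the file's separately stored last (tail) block, a logical block table maps each number to an (offset, size) pair, and the package is a flat byte buffer. All names and shapes here are hypothetical illustrations.

```python
def restore_file(entity, logical_table, package):
    """Rebuild one file's bytes from the deduplication package.

    entity:        dict with "block_ids" (ordered unique-block numbers)
                   and "tail" (the file's final, separately stored block).
    logical_table: list where logical_table[n] = (offset, size) of unique
                   data block n inside the package buffer.
    package:       bytes object holding the unique data blocks.
    """
    out = bytearray()
    for block_id in entity["block_ids"]:
        offset, size = logical_table[block_id]    # data block mapping rule
        out += package[offset:offset + size]      # copy the unique block
    out += entity["tail"]                         # append the last block
    return bytes(out)

package = b"AAAABBBBCCCC"
table = [(0, 4), (4, 4), (8, 4)]                  # block n -> (offset, size)
entity = {"block_ids": [2, 0], "tail": b"ZZ"}
print(restore_file(entity, table, package))       # b'CCCCAAAAZZ'
```

Note that the block order written to the file follows the entity's list, not the blocks' physical order in the package, which is exactly why rearranging the package layout can speed restoration without changing its result.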
Preferably, in step (2), the active file filter is used to manage the distribution of file data blocks. By changing the order in which files enter the data deduplication system, a data block rearrangement based on the active file set is realized. The file filter first scans the files in the deduplication package in the system file order; when a scanned file belongs to the active file set, the retrieval of the data blocks, fingerprints, logical data and file entity corresponding to that file is performed directly. The retrieval process includes the addressing and recovery of the data blocks and the writing of the data region in the new deduplication package. After all files have been scanned, the remaining files that are not in the active file set are arranged, in their original order, after the data fragments of the active file set in the deduplication package.
Preferably, in step (3), the storage format of a data block in the deduplication package is one copy with multiple indexes, and the addressing unit of data blocks is the byte. The physical information of each unique data block is recorded in the corresponding logical data block in the deduplication package; all logical data blocks have the same size, and the unique data blocks are numbered consecutively starting from 0.
Preferably, data block addressing comprises two mapping processes. First, the corresponding logical data block is found from the data block number in the file entity; because all logical blocks have the same size, the addressing calculation is: the data block number multiplied by the logical block size gives the physical address of the corresponding logical data block. Second, the corresponding unique data block is found from the physical displacement and block size of the unique data block recorded in the logical data block just read. The addressing and physical mapping of data blocks is in effect an "index → unique data block" conversion.
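The two mapping steps can be sketched as follows, under assumed encodings: the logical-block region is an array of fixed-size records, each storing the displacement and size of one unique data block as two 4-byte little-endian integers, so record n starts at byte n × 8. These field widths are illustrative assumptions, not part of the claimed format.

```python
import struct

LOGICAL_RECORD_SIZE = 8  # assumed: 4-byte displacement + 4-byte size

def address_unique_block(block_id, logical_region, data_region):
    """Two-level mapping: block number -> logical record -> unique block."""
    # First mapping: number * record size gives the logical record's address
    # (all logical records have the same size, so this is pure arithmetic).
    rec_addr = block_id * LOGICAL_RECORD_SIZE
    displacement, size = struct.unpack_from("<II", logical_region, rec_addr)
    # Second mapping: displacement and size locate the unique block itself.
    return data_region[displacement:displacement + size]

data = b"xxxxHELLOyyy"
logical = struct.pack("<II", 4, 5)  # record 0: block at offset 4, length 5
print(address_unique_block(0, logical, data))  # b'HELLO'
```

Because the first mapping is a multiplication rather than a lookup, addressing any block is O(1) regardless of where it sits in the package.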
Preferably, after the file filter has screened and restored the original file data in the deduplication package on the basis of the active file set, the data blocks contained in the files and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file chunking, fingerprint generation, and the establishment and maintenance of the data. After the system chunks a file, each data block is processed by first computing its hash value, then performing a hash comparison, and finally storing the deduplicated data; the storage management module of the system handles new unique data blocks as a concurrently executable schedule.
Preferably, data restoration is the unified recovery of all unique data blocks, logical data blocks, data block fingerprints and file metadata contained in a single file.
Preferably, the processing of the data blocks contained in a file that has undergone data deduplication is divided into four parallel threads: unique data block storage, logical data block storage, data block fingerprint storage and file metadata storage; the threading programming mechanism is OpenMP.
Preferably, the active file filter scans the files in the deduplication package in the chronological order in which the original files entered the data deduplication system, compares one by one whether the file name of each file entity in the deduplication package is present in the active file set, and processes files of different access frequencies on separate paths.
Preferably, the characteristic that the original files in the deduplication package are discretely distributed in the order in which the files entered the system is changed: the data content in the deduplication package, including the unique data blocks, logical data blocks, data block fingerprints and file metadata, is uniformly collected and dispatched, by file access frequency and with a single file as the basic unit, to the front end of the corresponding data fragment in the deduplication package.
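This rearrangement can be illustrated with a toy package layout: per-file data units, originally ordered by entry time, are regrouped so that the files of the active set come first, with each file's blocks, fingerprints and metadata kept together as one schedulable unit. The entry structure is hypothetical.

```python
def rearrange_package(entries, active_set):
    """Move active files' data units to the front end of the package.

    entries: list of (file_name, data_unit) pairs in original entry order,
             where data_unit bundles the file's unique blocks, logical
             blocks, fingerprints and metadata as one unit.
    """
    active = [e for e in entries if e[0] in active_set]      # old order kept
    inactive = [e for e in entries if e[0] not in active_set]
    return active + inactive  # inactive files appended at the rear, in order

entries = [("a", "u1"), ("b", "u2"), ("c", "u3"), ("d", "u4")]
print(rearrange_package(entries, {"b", "d"}))
# active files b and d move to the front; a and c follow in original order
```

This is a stable partition: relative order inside each group is preserved, matching the method's requirement that inactive files keep their original arrangement behind the active data fragments.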
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The data rearrangement of the present invention is based on active files: with a file as the processing unit, all the data blocks contained in a single file and their corresponding data information are scheduled and distributed in a unified way, which is consistent in content and manner with user-level access requests.
(2) The present invention separates the data of active files from that of inactive files and prefetches the active file set to the front end of the data fragment in the deduplication package, saving the time overhead of the system in locating file entities.
(3) File restoration termination mechanism: the present invention adds a termination judgment to the file restoration process in the deduplication package after the active file rearrangement; that is, once all files of the file set have been restored from the package, the system no longer scans the remaining file entities in the package, which saves unnecessary file retrieval time.
Embodiment
As shown in Fig. 1, the application scenario of the data restoration optimization method for an online data deduplication system of the present invention is an online data deduplication system comprising two parts, a server end and a client:
The main function of the client is to chunk files, compute the hash value of each data block, store the hash value, and use it as the fingerprint of the data block. By comparing the fingerprints of the data blocks, it is determined whether a data block is a duplicate; the system stores only the unique data blocks and records the ID of every data block. A file entity is created for each file to preserve the original metadata, including the file name, the number of data blocks, the data block ID size, the size of the last data block, the group of unique data block numbers, and the last data block of the file (because this block is usually smaller than a normal data block and its probability of recurring is very small, it is stored separately). The unique data blocks, data block fingerprints and all file entities are saved in a deduplication package, and the data in the deduplication package is sent to the server end in the form of a file.
The server parses the data in the deduplication package and saves the unique data blocks, the data block fingerprint table, the logical data and the file entities; the operation interval of the file-level data block rearrangement on the server is exactly the reading and writing of these four kinds of data. File-level rearrangement reorganizes the order of the data in the deduplication package so as to obtain better file retrieval and restoration time efficiency for the system.
In order to illustrate the implementation model of the present invention more clearly, a detailed analysis is given below with reference to the workflow diagram of the file-level data block rearrangement (Fig. 2), the data block mapping and addressing scheme in the deduplication package (Fig. 3), and the schematic diagram of the data stream storage organization (Fig. 4).
As shown in Fig. 2 system enters rearrangement to file is divided into two stages.First stage is file access pattern, process
To as if duplicate removal bag.Based on the data recovery of file, first, read the document entity in duplicate removal bag, document entity contains phase
Answer the numbering of file corresponding unique data block;Then, corresponding logic data block is found according to data block numbering, read logic
The displacement of data block and size information, find the unique data block in duplicate removal bag;Finally, the data block based on document entity arranges
Sequentially, unique data block is written in corresponding file.Second stage is that file is reset, and file is reset has three orders to hold
The module of row.(1) file filter device, (2) data block cutting, (3) data block is processed, the function of each several part around process unit
It is all file, the base unit of data processing is data block.
As shown in Fig. 3, the file filter retrieves the data of the active file set with a file as the basic unit; file retrieval in the deduplication package performs the corresponding data block addressing and operations according to the file entity. The storage format of a data block in the deduplication package is one copy with multiple indexes, so the data deduplication system needs to establish logical description information for the data blocks, providing convenient indexes to the unique data blocks shared between different files. The addressing unit of data blocks is the byte, and the physical information of each unique data block in the deduplication package is recorded in the corresponding logical data block. All logical data blocks have the same size, and the unique data blocks are numbered consecutively from 0. Data block addressing comprises two mapping processes. First, the corresponding logical data block is found from the data block number in the file entity; because all logical blocks have the same size, the addressing calculation is: the data block number multiplied by the logical block size gives the physical address of the corresponding logical data block. Second, the corresponding unique data block is found from the physical displacement and block size of the unique data block recorded in the logical data block just read. The addressing and physical mapping of data blocks is in effect an "index → unique data block" conversion.
As shown in Fig. 4, after the file filter has screened and restored the original file data in the deduplication package on the basis of the active file set, the data blocks contained in the files and the corresponding metadata need to be stored into the deduplication package again. The concrete steps are file chunking, fingerprint generation, and the establishment and maintenance of the data. After the system chunks a file, each data block is processed by first computing its hash value, then performing a hash comparison, and finally storing the deduplicated data. The storage management module of the system handles new unique data blocks as a concurrently executable schedule. In order to improve the processing efficiency of data blocks, the model proposed by the present invention divides the storing process into four concurrently executing threads using OpenMP multithreading: hash value insertion into the hash table, unique data block processing, logical data block processing and metadata processing. Because each thread writes data to a different position in the deduplication package, concurrent storage management not only improves the delivery efficiency of the system but also maintains the independence of the data to a certain extent.
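The four-way concurrent store described for Fig. 4 can be sketched in Python with a thread pool standing in for the OpenMP threads; the key property mirrored here is that each stream writes to its own independent destination, which is what makes the concurrency safe. The stream names follow the text, but the data shapes are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def store_stream(name, items, sink):
    """One storage stream: write its items to its own destination."""
    sink[name] = list(items)  # each stream has a distinct target, so no races
    return name

def concurrent_store(unique_blocks, logical_blocks, fingerprints, metadata):
    """Store the four kinds of deduplication data in four concurrent tasks."""
    sink = {}
    streams = {
        "unique": unique_blocks,
        "logical": logical_blocks,
        "fingerprint": fingerprints,
        "metadata": metadata,
    }
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = [pool.submit(store_stream, name, items, sink)
                   for name, items in streams.items()]
        for f in futures:
            f.result()  # wait for completion and propagate any exception
    return sink

sink = concurrent_store([b"blk"], [(0, 3)], ["fp1"], [{"name": "a"}])
print(sorted(sink))  # ['fingerprint', 'logical', 'metadata', 'unique']
```

As in the OpenMP design, correctness rests on the four streams writing disjoint regions; if they shared a destination, explicit synchronization would be required.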
The above embodiment is a preferred implementation of the present invention, but the implementations of the present invention are not limited by the above embodiment. Any other change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall be included within the protection scope of the present invention.