CN105354246A - Distributed memory calculation based data deduplication method - Google Patents

Distributed memory calculation based data deduplication method

Info

Publication number
CN105354246A
Authority
CN
China
Prior art keywords
fingerprint
fingerprint set
memory
block
set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510670867.2A
Other languages
Chinese (zh)
Other versions
CN105354246B (en)
Inventor
林伟伟
钟坯平
利业鞑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510670867.2A
Publication of CN105354246A
Application granted
Publication of CN105354246B
Active legal status
Anticipated expiration legal status


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 - Querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 - Design, administration or maintenance of databases
    • G06F16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a data deduplication method based on distributed in-memory computing. The method comprises the following steps: creating file-block fingerprint sets and caching them in distributed memory; splitting a file into blocks according to an optimal file-blocking policy, computing the block fingerprints, comparing them with the fingerprint sets cached in memory to find matching blocks, and adding corresponding references for those blocks; storing block fingerprint sets under a multi-level caching policy, in which fingerprint sets with high weights are cached in memory and those with low weights are cached on disk; and dividing memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on a file. The method improves the efficiency of mass-data deduplication, saving host space and network bandwidth and reducing data operation and maintenance costs for service providers.

Description

A data deduplication method based on distributed in-memory computing
Technical field
The present invention relates to the field of mass data deduplication, and in particular to a data deduplication method based on distributed in-memory computing.
Background art
Distributed systems are now widely used throughout the information industry to cope with ever-growing volumes of data. Although distributed systems solve the storage problem for mass data, they bring new challenges: data backup and recovery take longer and longer, data redundancy keeps increasing, and the storage and maintenance costs of data keep rising. Although the unit price of storage has dropped significantly, total storage cost continues to climb, so data deduplication technology has attracted more and more attention. How to deduplicate the secondary storage of mass data efficiently, minimizing the time consumed by the deduplication process, has become an urgent problem.
Research on data deduplication has surged in recent years. At the FAST 2011 conference, the paper "A Study of Practical Deduplication" analyzed deduplication on primary storage systems, and the paper "Tradeoffs in Scalable Data Routing for Deduplication Clusters" weighed scalable data routing in deduplication clusters. For distributed systems, Jianming Young et al. proposed a method using HDFS and HBase that combines the MD5 and SHA-1 hash functions to compute file hashes, transmits the values to HBase, and compares new hashes against the existing value domain to decide whether the client must upload the file; combining MD5 with SHA-1 avoids sporadic collisions. Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, applies MapReduce to entity resolution over big data and exemplifies the most mature MapReduce usage patterns in deduplication. Its entity-matching-based blocking partitions the input data semantically by similarity and restricts matching to entities within the same block. Entity resolution is split into two MapReduce jobs: an analysis job that mainly gathers record-frequency statistics, and a matching job that handles load balancing and similarity computation. The matching job uses a "greedy" load-balancing regulation: matching tasks are sorted in descending order of data size and dispatched to the least-loaded Reduce task. Dedoop also uses effective techniques to avoid unnecessary pairwise comparisons: the MapReduce program must explicitly define which Reduce task processes which pairwise comparison, so identical comparisons need not be repeated on multiple nodes. Ashish Kathpal et al. combined MapReduce with a storage controller, replacing NetApp's original duplicate-detection stage with Hadoop MapReduce: the storage controller migrates data fingerprints to HDFS, builds a fingerprint database stored persistently on HDFS, uses MapReduce to filter duplicate records out of the fingerprint record set, and writes the deduplicated fingerprint table back to the storage controller. In China, Liu et al. proposed a scalable fingerprint-query method, a sampling-based fingerprint-query optimization that reduces the scale of fingerprints to be queried and organizes fingerprint storage with an extensible index structure, further improving fingerprint lookup efficiency. Also in China, Wang Jianhui et al. studied an HDFS distributed backup system supporting file deduplication, using the open-source framework Lucene to build file indexes for fast querying over massive files. In the deduplication process, each block is first checked for identity: on first backup, the file is stored to the storage medium; if a block to be backed up is identical to one already backed up, the block is not stored again, a pointer to the duplicate data is used instead, and backup information is recorded to facilitate recovery.
Although much work has been done in recent years on deduplication for cloud data backup, current approaches to mass-data deduplication focus mainly on optimal file blocking: they require data preprocessing and data modeling in advance, read fingerprint information from a database or disk for real-time analysis and computation, and only then compare. This approach is inefficient and consumes time and system resources. The present invention therefore models the deduplication system around distributed in-memory computing, fully exploiting multi-core capability to process data in parallel; memory reads are accelerated severalfold, solving the problem that today's mass data cannot be deduplicated quickly.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a data deduplication method based on distributed in-memory computing. The method compares file-block fingerprints against a repository cached in distributed memory to filter out identical file blocks, and assigns different tasks to each host in the distributed system to achieve load balancing. It improves the efficiency of mass-data deduplication, saving host space and network bandwidth and reducing data operation and maintenance costs for service providers.
The object of the present invention is achieved through the following technical scheme:
A data deduplication method based on distributed in-memory computing comprises the following steps in order:
S1. Create file-block fingerprint sets and cache them in distributed memory.
S2. Split a file into blocks according to the optimal file-blocking policy, compute the block fingerprints, compare them with the fingerprint sets cached in memory, find the matching blocks, and add corresponding references for them.
S3. Store block fingerprint sets under a multi-level caching policy: sets with high weights are cached in memory, and sets with low weights are cached on disk.
S4. Divide memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on a file.
The data deduplication method based on distributed in-memory computing further comprises: after a file-block fingerprint set is created, adding an initial weight to it.
The initial weight of a fingerprint set decays gradually over time until it reaches zero.
The information of a file-block fingerprint set comprises: the block path, block creation time, block hash value, fingerprint-set creation time, fingerprint-set reference count, and fingerprint-set weight. The fingerprint-set weight is determined jointly by the initial weight, the reference count, and the creation time, and unifies the initial state of fingerprint sets; the creation time drives the decay of the weight; and the reference count reflects how active the fingerprint set is.
Step S2 specifically comprises the following steps:
S201. Compute the fingerprint value of the file block to be compared and compare it with the fingerprint sets cached in distributed memory; comparison proceeds in descending order of fingerprint-set weight, so sets with higher weights are compared first.
S202. If no matching fingerprint set is found in distributed memory, read the fingerprint sets not cached in memory from disk and complete the comparison, again in descending order of weight.
S203. If a fingerprint set with an identical fingerprint is found in memory or on disk, add that set's reference to the block and update the set's reference count and weight.
S204. If none is found, create a new fingerprint set, initialize all of its information, and add a reference to the new set for the block.
In step S3, the multi-level caching policy comprises: if not all fingerprint sets can be cached in memory, the disk serves as a second-level cache for fingerprint sets; the sets are sorted in descending order of weight and cached in memory in that order, and the sets that cannot be cached in memory are cached on disk.
The multi-level caching policy further comprises: when a new fingerprint set is created or an existing set is matched, deciding according to its weight whether the set replaces another set in memory or is cached directly on disk.
In step S4, dividing memory into multiple regions specifically means: the memory used for caching fingerprint sets is divided into two parts, one caching file-level fingerprint-set information and the other caching block-level fingerprint-set information. Fingerprint sets of either level that cannot fit in their own part of memory can only be cached on disk; they may not occupy the other part's memory.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The present invention is based on distributed in-memory computing. Unlike general distributed deduplication systems, this method caches fingerprint information in memory in advance, so real-time analysis and computation over massive data can be done in memory. When checking whether a file block already exists, the fingerprint sets in memory are compared directly, without disk reads, so deduplication is fast.
(2) Distributed in-memory computing fully exploits multi-core capability. Fingerprint data is laid out in memory in an optimized row-storage format, so fingerprint comparisons can be processed in parallel and memory reads are accelerated severalfold.
(3) The weight of each fingerprint set is computed comprehensively from several factors, and the weights decide whether a set may be cached in memory: a set with a higher weight is more active and can be matched quickly within a given period. When memory is insufficient, the multi-level caching policy caches some fingerprint sets on disk so that they can be reused.
(4) The in-memory fingerprint storage region is divided into two parts, and deduplication is performed in both file-level and block-level modes. Combining the two modes reduces the volume of block fingerprint information, and some files can be processed without any blocking at all.
Brief description of the drawings
Fig. 1 is the flowchart of the data deduplication method based on distributed in-memory computing of the present invention.
Fig. 2 is a schematic diagram of the region division between file fingerprint sets and block fingerprint sets.
Fig. 3 is a schematic diagram of the fingerprint-set information cached in memory or on disk.
Fig. 4 is a schematic diagram of file-level deduplication in the data deduplication method based on distributed in-memory computing.
Fig. 5 is a schematic diagram of block-level deduplication in the data deduplication method based on distributed in-memory computing.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment 1
A data deduplication method based on distributed in-memory computing comprises the following steps:
(1) Create file-block fingerprint sets in distributed memory and cache them there. Each fingerprint set contains two kinds of content: one part is the block path, block creation time, block hash value, and so on; the other part is the fingerprint-set creation time, reference count, weight, and so on. The first part maps the fingerprint set to its block; the second part controls whether the set is cached in distributed memory or on disk.
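As an illustration, a minimal Python sketch of such a fingerprint-set record follows; the field names and types are assumptions made for this sketch, not taken from the patent text.

```python
import time
from dataclasses import dataclass, field

@dataclass
class FingerprintSet:
    """One fingerprint-set record.

    The first three fields map the set to its file block; the last three
    control whether the set stays in distributed memory or spills to disk.
    """
    block_path: str       # path of the file block this set maps to
    block_ctime: float    # creation time of the block
    block_hash: str       # hash value (fingerprint) of the block content
    set_ctime: float = field(default_factory=time.time)  # fingerprint-set creation time
    ref_count: int = 0    # reference count (initially 0)
    weight: float = 0.0   # current weight, drives cache placement
```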
(2) When a fingerprint set is created, a unified initial weight is added to it, which determines the set's cache location. The initial weight of each fingerprint set decays gradually over time until it reaches zero.
(3) When a file backup or upload operation occurs in the distributed system, the operation controller creates a deduplication task for the file on a host, splits the file into blocks according to the optimal file-blocking policy, and computes the block fingerprints. Each block fingerprint is compared with the fingerprint sets cached in memory to find a matching block, for which a corresponding reference is added. The block-fingerprint comparison method is:
(3.1) compute the fingerprint value of the file block to be compared and compare it with the fingerprint sets cached in distributed memory, in descending order of weight, so that sets with higher weights are compared first;
(3.2) if no matching fingerprint set is found in distributed memory, read the fingerprint sets not cached in memory from disk and complete the comparison, again in descending order of weight;
(3.3) if a fingerprint set with an identical fingerprint is found in memory or on disk, add that set's reference to the block and update the set's reference count and weight;
(3.4) if none is found, create a new fingerprint set, initialize its information as in step (1), and add a reference to the new set for the block. A sketch of this lookup follows.
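The following sketch outlines steps (3.1) through (3.4) under the same assumed record layout; `weight_fn` stands for the weight formula given as equation (1) in Embodiment 2, and the function and parameter names are illustrative, not from the patent.

```python
def match_block(fingerprint, block_path, block_ctime, mem_sets, disk_sets, weight_fn):
    """Match one block fingerprint against the cached fingerprint sets.

    Memory-resident sets are scanned first (3.1), then disk-resident
    ones (3.2), each tier in descending order of weight; returns the
    fingerprint set that the block now references.
    """
    for tier in (mem_sets, disk_sets):
        for fps in sorted(tier, key=lambda s: s.weight, reverse=True):
            if fps.block_hash == fingerprint:   # (3.3) identical fingerprint found
                fps.ref_count += 1              # record the block's reference
                fps.weight = weight_fn(fps)     # refresh the weight accordingly
                return fps
    # (3.4) no match in memory or on disk: create a new fingerprint set;
    # its reference count starts at 0 per the weight definition below
    return FingerprintSet(block_path, block_ctime, fingerprint)
```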
(4) In the distributed in-memory computing method, weights control the order in which fingerprint sets are compared against blocks and whether a given set is cached in memory. The height of a set's weight represents how active the set is in memory. The weight is computed as follows: the current weight of a fingerprint set is determined jointly by its initial weight, reference count, and creation time. The initial weight gives all fingerprint sets a unified starting state and ensures that newly created file blocks have relatively high weights; the creation time drives the decay of the weight, so that fingerprint sets unused for a long time end up with lower weights; and the reference count is the key factor affecting the weight, realizing control over a set's activity.
(5) In the deduplication method based on distributed in-memory computing, a large number of fingerprint sets accumulate over time during mass-data deduplication, so a multi-level caching policy is applied with fingerprint-set weights as the criterion. The caching decision method is: if the memory of the distributed system cannot cache all fingerprint sets, the disk serves as the second-level cache location; fingerprint sets are sorted in descending order of weight and cached in memory in that order, and once memory can hold no more, the remaining lower-weight sets are cached on disk. In addition, when a new fingerprint set is created or an existing set is referenced, its weight determines whether the set replaces another set in memory or is cached directly on disk.
(6) To reduce the number of file-block fingerprint sets and achieve fast fingerprint-set matching in distributed memory, the memory used for caching fingerprint sets is divided into two parts: one caches file-level fingerprint-set information, enabling fast first-pass processing of a file; the other caches block-level fingerprint-set information. Fingerprint sets of either level that cannot fit in their own part can only be cached on disk and may not occupy the other part's memory, as in the sketch below.
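A sketch of the two-part memory split described in step (6); capacities are counted in number of sets here purely for brevity, where a real system would budget bytes.

```python
class RegionedCache:
    """Memory split into a file-level region and a block-level region.

    Each region has its own capacity; a set that does not fit in its own
    region spills to disk and never borrows space from the other region.
    """
    def __init__(self, file_capacity, block_capacity):
        self.regions = {"file": ([], file_capacity),
                        "block": ([], block_capacity)}
        self.disk = []  # second-level cache shared by both levels

    def insert(self, level, fps):
        sets, capacity = self.regions[level]
        if len(sets) < capacity:
            sets.append(fps)       # fits in its own region of memory
        else:
            self.disk.append(fps)  # overflow to disk, never to the other region
```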
Embodiment 2
Applying the present invention to data deduplication on a Spark system:
Fig. 1 shows the flowchart of the present invention. First, file-block fingerprint sets are built in distributed memory, and each newly created set is given an initial weight that determines its cache location; the initial weight decays over time until it reaches zero. A file is split into blocks according to the optimal file-blocking policy, and the block fingerprints are computed and compared with the fingerprint sets cached in memory; if a matching set is found, a corresponding reference is added for the block, and if not, the block and a new fingerprint set are created on disk. How active a fingerprint set is in memory is expressed by the height of its weight, and the weights, in descending order, control which sets are cached in memory. Block fingerprint sets are stored under the multi-level caching policy, with high-weight sets cached in memory and low-weight sets cached on disk, to ensure that fingerprint sets can be reused. Memory is divided into multiple regions that store file-level and block-level fingerprint information respectively, so that different processing operations can be applied to a file.
This embodiment of the data deduplication method based on distributed in-memory computing is built on the Spark in-memory computing system. Fingerprint sets, FPD (Fingerprint Datasets), are constructed and divided into two classes, file-level and block-level. As shown in Fig. 2, the two classes occupy different regions of memory whose storage spaces do not overlap; a file fingerprint set is simply the case where the block count is 1. As shown in Fig. 3, both kinds of FPD contain the block path, block modification time, block hash value, fingerprint-set creation time, reference count, weight, and so on.
Each newly created fingerprint set is given a unified initial weight. On one hand this makes the initial state of all new sets consistent; on the other hand it controls whether a newly created set may be cached in memory. As fingerprint sets accumulate over time, free cache space in memory shrinks. Normally, following an LRU (Least Recently Used) policy, result sets from an earlier period are less active in the current period, so the elapsed time interval is taken as an influence factor on the weight. To make the overall FPD weights more reasonable, the product of the initial weight's decay rate and the time interval is taken as the reduction of the set's initial weight over that period; once the initial weight has decayed to zero, the decay factor no longer applies.
When a data backup or file upload operation occurs in the Spark system, the data must be deduplicated, and the task manager creates a file deduplication task on a host. To improve deduplication efficiency, file-level deduplication is performed on the file first, comparing against the memory region that stores file-level fingerprint information. As shown in Fig. 4, for an upload operation the file is hashed directly to extract its fingerprint information and then compared. For a data backup operation, the file's owner, path, modification time, and similar information are obtained first and used to check whether the file exists; if it does, a reference to the file is added in the database and the fingerprint set's reference count is updated at the same time. If it is not found, the file is hashed and looked up by hash value; if it exists, a corresponding reference is added.
If it is still not found, the file is split into blocks and compared in the memory region that stores block fingerprint sets. As shown in Fig. 5, the file is split into blocks of reasonable size and the corresponding region is searched for a matching fingerprint set. If one exists, a reference to the block is added in the database and the set's reference count is updated at the same time; if not, the block and a block fingerprint set are created in the system, and the new set is given a creation time, weight, and so on.
Before the weights of fingerprint sets can be compared, each set's weight must be computed. The weight represents how active the set is and controls whether the set is cached in memory or on disk. Besides the initial weight and the time interval discussed above as influence factors, the number of times a fingerprint set has been referenced is the most important factor affecting its weight. Let the initial weight be W0, the decay rate of the initial weight be Rw, the fingerprint set's creation date be Dc, the current date be Dn, and the file or file-block reference count be Cn (initially 0). The current weight of the fingerprint set is then:
W = Cn + W0 - (Dn - Dc) * Rw,  if W0 > (Dn - Dc) * Rw
W = Cn,                        otherwise          (1)
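Equation (1) translates directly into code. In the sketch below, W0 and RW are deployment parameters whose values are placeholders, and `set_ctime` and `ref_count` come from the record sketched in Embodiment 1.

```python
import time

W0 = 100.0  # unified initial weight (placeholder value)
RW = 1.0    # decay rate of the initial weight per unit time (placeholder)

def current_weight(fps, now=None):
    """Current weight of a fingerprint set, per equation (1)."""
    now = time.time() if now is None else now
    decay = (now - fps.set_ctime) * RW      # (Dn - Dc) * Rw
    if W0 > decay:                          # initial weight not yet fully decayed
        return fps.ref_count + W0 - decay   # Cn + W0 - (Dn - Dc) * Rw
    return fps.ref_count                    # only the reference count remains
```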
To make fingerprint sets easy to reuse, the multi-level caching policy is adopted: fingerprint sets replaced out of memory are cached on disk, from which they can be read directly next time without being rebuilt. Suppose the set to be cached is FPDn. If the memory cache is not full, FPDn is cached directly in the memory region. If the memory cache is full, the Min algorithm selects the lowest-weight fingerprint set FPDmin cached in memory, and FPDn's weight Wn is compared with FPDmin's weight Wmin: if Wn > Wmin, FPDmin is replaced onto disk and FPDn is cached in memory; if Wn < Wmin, FPDn is cached on disk. A sketch of this decision follows.
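A sketch of that replacement decision; the tie case (Wn equal to Wmin) is not specified by the text, so this sketch sends ties to disk, and `mem_capacity` counts sets for simplicity.

```python
def cache_fpd(fpd_n, mem_sets, disk_sets, mem_capacity):
    """Place FPDn according to the multi-level caching policy."""
    if len(mem_sets) < mem_capacity:
        mem_sets.append(fpd_n)              # memory cache not full: cache directly
        return
    fpd_min = min(mem_sets, key=lambda s: s.weight)  # Min step: lightest resident set
    if fpd_n.weight > fpd_min.weight:       # Wn > Wmin: evict FPDmin to disk
        mem_sets.remove(fpd_min)
        disk_sets.append(fpd_min)
        mem_sets.append(fpd_n)
    else:                                   # Wn <= Wmin: FPDn goes straight to disk
        disk_sets.append(fpd_n)
```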
The above embodiments are preferred implementations of the present invention, but the implementations of the present invention are not limited to them. Any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and shall fall within the protection scope of the present invention.

Claims (8)

1. A data deduplication method based on distributed in-memory computing, characterized in that it comprises the following steps in order:
S1. create file-block fingerprint sets and cache them in distributed memory;
S2. split a file into blocks according to the optimal file-blocking policy, compute the block fingerprints, compare them with the fingerprint sets cached in memory, find the matching blocks, and add corresponding references for them;
S3. store block fingerprint sets under a multi-level caching policy, with high-weight sets cached in memory and low-weight sets cached on disk;
S4. divide memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on a file.
2. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that it further comprises: after a file-block fingerprint set is created, adding an initial weight to the fingerprint set.
3. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the initial weight of the fingerprint set decays gradually over time until it reaches zero.
4. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the information of the file-block fingerprint set comprises: the block path, block creation time, block hash value, fingerprint-set creation time, fingerprint-set reference count, and fingerprint-set weight; the fingerprint-set weight is determined jointly by the initial weight, the reference count, and the creation time, and unifies the initial state of fingerprint sets; the creation time drives the decay of the weight; and the reference count reflects how active the fingerprint set is.
5. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that step S2 specifically comprises the following steps:
S201. compute the fingerprint value of the file block to be compared and compare it with the fingerprint sets cached in distributed memory, in descending order of fingerprint-set weight, so that sets with higher weights are compared first;
S202. if no matching fingerprint set is found in distributed memory, read the fingerprint sets not cached in memory from disk and complete the comparison, again in descending order of weight;
S203. if a fingerprint set with an identical fingerprint is found in memory or on disk, add that set's reference to the block and update the set's reference count and weight;
S204. if none is found, create a new fingerprint set, initialize all of its information, and add a reference to the new set for the block.
6. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S3 the multi-level caching policy comprises: if not all fingerprint sets can be cached in memory, the disk serves as a second-level cache for fingerprint sets; the sets are sorted in descending order of weight and cached in memory in that order, and the sets that cannot be cached in memory are cached on disk.
7. The data deduplication method based on distributed in-memory computing according to claim 6, characterized in that the multi-level caching policy further comprises: when a new fingerprint set is created or an existing set is matched, deciding according to its weight whether the set replaces another set in memory or is cached directly on disk.
8. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S4 dividing memory into multiple regions specifically means: the memory used for caching fingerprint sets is divided into two parts, one caching file-level fingerprint-set information and the other caching block-level fingerprint-set information; fingerprint sets of either level that cannot fit in their own part of memory can only be cached on disk and may not occupy the other part's memory.
CN201510670867.2A 2015-10-13 2015-10-13 Data deduplication method based on distributed in-memory computing Active CN105354246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510670867.2A CN105354246B (en) 2015-10-13 2015-10-13 Data deduplication method based on distributed in-memory computing

Publications (2)

Publication Number Publication Date
CN105354246A 2016-02-24
CN105354246B CN105354246B (en) 2018-11-02

Family

ID=55330219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510670867.2A Active CN105354246B (en) 2015-10-13 2015-10-13 Data deduplication method based on distributed in-memory computing

Country Status (1)

Country Link
CN (1) CN105354246B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079034A (en) * 2006-07-10 2007-11-28 腾讯科技(深圳)有限公司 System and method for eliminating redundancy file of file storage system
CN101706825A (en) * 2009-12-10 2010-05-12 华中科技大学 Replicated data deleting method based on file content types
EP2557514A1 (en) * 2011-08-12 2013-02-13 Nexenta Systems, Inc. Cloud Storage System with Distributed Metadata
WO2013119201A1 (en) * 2012-02-06 2013-08-15 Hewlett-Packard Development Company, L.P. De-duplication
CN104869140A (en) * 2014-02-25 2015-08-26 阿里巴巴集团控股有限公司 Multi-cluster system and method for controlling data storage of multi-cluster system

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372105A (en) * 2016-08-19 2017-02-01 中国科学院信息工程研究所 Spark platform-based microblog data preprocessing method
CN107368545B (en) * 2017-06-28 2019-08-27 深圳神州数码云科数据技术有限公司 A kind of De-weight method and device based on Merkle Tree deformation algorithm
CN107368545A (en) * 2017-06-28 2017-11-21 深圳神州数码云科数据技术有限公司 A kind of De-weight method and device based on MerkleTree deformation algorithms
CN107273536A (en) * 2017-06-30 2017-10-20 郑州云海信息技术有限公司 A kind of repeated data determines method, system and distributed memory system
CN107329846A (en) * 2017-07-11 2017-11-07 深圳市信义科技有限公司 Big finger data comparison method based on big data technology
CN107329846B (en) * 2017-07-11 2020-06-12 深圳市信义科技有限公司 Big finger data comparison method based on big data technology
CN109144417A (en) * 2018-08-16 2019-01-04 广州杰赛科技股份有限公司 A kind of cloud storage method, system and equipment
CN109240605A (en) * 2018-08-17 2019-01-18 华中科技大学 A kind of quick repeated data block identifying method stacking memory based on 3D
CN109189577A (en) * 2018-08-31 2019-01-11 武汉达梦数据库有限公司 A kind of data prevent memory from overflowing method and apparatus when synchronous
CN109241023A (en) * 2018-09-21 2019-01-18 郑州云海信息技术有限公司 Distributed memory system date storage method, device, system and storage medium
CN109522305A (en) * 2018-12-06 2019-03-26 北京千方科技股份有限公司 A kind of big data De-weight method and device
CN110147331A (en) * 2019-05-16 2019-08-20 重庆大学 Caching data processing method, system and readable storage medium storing program for executing
CN111444167A (en) * 2020-03-25 2020-07-24 厦门市美亚柏科信息股份有限公司 Method, device and storage medium for removing duplicate data based on data abstract

Also Published As

Publication number Publication date
CN105354246B (en) 2018-11-02


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant