CN105354246B - A data deduplication method based on distributed in-memory computing - Google Patents

A data deduplication method based on distributed in-memory computing

Info

Publication number
CN105354246B
CN105354246B (application number CN201510670867.2A)
Authority
CN
China
Prior art keywords
fingerprint set
fingerprint
memory
block
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510670867.2A
Other languages
Chinese (zh)
Other versions
CN105354246A (en)
Inventor
林伟伟
钟坯平
利业鞑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201510670867.2A priority Critical patent/CN105354246B/en
Publication of CN105354246A publication Critical patent/CN105354246A/en
Application granted granted Critical
Publication of CN105354246B publication Critical patent/CN105354246B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/24 - Querying
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/21 - Design, administration or maintenance of databases
    • G06F 16/215 - Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Abstract

The invention discloses a data deduplication method based on distributed in-memory computing, comprising the following steps: creating file-block fingerprint sets and caching them in distributed memory; splitting files into blocks according to an optimal file-blocking strategy, computing the block fingerprints, comparing them against the fingerprint sets cached in memory, finding matching blocks, and adding the corresponding references; storing block fingerprint sets with a multi-level caching strategy, under which sets with high weights are cached in memory and sets with low weights are cached on disk; and dividing memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on files. The data deduplication method of the invention improves the efficiency of deduplicating massive data, saving storage space and network bandwidth and reducing data operation and maintenance costs for service providers.

Description

A data deduplication method based on distributed in-memory computing
Technical field
The present invention relates to the field of massive-data deduplication, and more particularly to a data deduplication method based on distributed in-memory computing.
Background technology
Distributed systems are now widely used in the information industry to cope with the continual growth of massive data. Although distributed systems solve the storage problem for massive data, they bring new challenges at the same time: data backup and restore take ever longer, data redundancy keeps increasing, and the cost of storing and maintaining data keeps rising. Although the unit price of storage has dropped significantly, total storage cost continues to climb, so data deduplication technology has attracted more and more attention. How to deduplicate the secondary storage of massive data efficiently, minimizing the time the deduplication process consumes, has become a problem to be solved.
Research on data deduplication has surged in recent years. At the FAST 2011 conference, the paper "A Study of Practical Deduplication" analyzed deduplication in primary storage systems, and the paper "Tradeoffs in Scalable Data Routing for Deduplication Clusters" weighed scalable data routing in deduplication clusters. In distributed systems, Jianming Young et al. proposed a method using HDFS and HBase that computes file hashes with both the MD5 and SHA-1 hash functions, passes the values to HBase, and compares new hashes against the stored value domain to decide whether the client needs to transmit the file; combining MD5 with SHA-1 avoids sporadic collisions. Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, applies MapReduce, the most mature application model in data deduplication technology, to entity resolution over big data. Blocking based on entity matching partitions the input data semantically by similarity and defines the entities of each block. Entity resolution is split into two MapReduce jobs: an analysis job mainly used to count record frequencies, and a matching job that handles load balancing and similarity computation. The matching job regulates load balancing with a "greedy" scheme: matching tasks are sorted in descending order of data size and assigned to the Reduce job with the lowest load. Dedoop also uses an effective technique to avoid redundant pairwise comparisons: it requires the MR program to define explicitly which Reduce task handles which pairwise comparisons, so identical comparisons need not be repeated on multiple nodes. Ashish Kathpal et al. combined MapReduce with a storage controller, replacing NetApp's original duplicate detection with a duplicate-detection mechanism based on Hadoop MapReduce: the storage controller moves data fingerprints into HDFS and builds a fingerprint database stored permanently on HDFS, duplicate records are filtered out of the fingerprint record set with MapReduce, and the deduplicated fingerprint table is written back to the storage controller. In China, Liu et al. proposed a scalable fingerprint query method: a sampling-based fingerprint query optimization reduces the scale of the fingerprints that must be queried, and data fingerprints are organized and stored with a scalable index structure, further improving fingerprint lookup efficiency. In addition, Wang Jianhui et al. studied HDFS distributed backup systems supporting file deduplication, using the open-source framework Lucene to build a file index that enables quick retrieval over large numbers of files. During deduplication, each block is first checked for identity: if it is backed up for the first time, the file is stored on the storage medium; if a block to be backed up is identical to an already-backed-up block, the block is not backed up again, a pointer to the duplicate data is used instead, and the backup information is recorded to facilitate recovery.
Although much research work on cloud backup deduplication has been carried out in recent years, current massive-data deduplication mainly targets optimal file blocking and requires data preprocessing and data modeling in advance; fingerprint information is read from a database or disk and analyzed and operated on in real time before comparison. The deduplication efficiency of this approach is low, and it consumes time and system resources. Therefore, modeling the deduplication system on a distributed in-memory computing method can fully exploit multi-core capability to process data in parallel, with memory reads several times faster, solving the problem that massive data currently cannot be deduplicated quickly.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a data deduplication method based on distributed in-memory computing. The method filters identical file blocks by comparing file-block fingerprints against the library cached in distributed memory, and assigns different tasks to each host in the distributed system to achieve load balancing. It improves the efficiency of massive-data deduplication, saves storage space and network bandwidth, and reduces data operation and maintenance costs for service providers.
The object of the invention is achieved by the following technical solution:
A data deduplication method based on distributed in-memory computing, comprising the following steps:
S1. creating file-block fingerprint sets and caching them in distributed memory;
S2. splitting files into blocks according to an optimal file-blocking strategy, computing the block fingerprints, comparing them against the fingerprint sets cached in memory, finding matching blocks, and adding the corresponding references;
S3. storing block fingerprint sets using a multi-level caching strategy, under which sets with high weights are cached in memory and sets with low weights are cached on disk;
S4. dividing memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations can be performed on files.
The data deduplication method based on distributed in-memory computing further comprises: after a file-block fingerprint set is created, adding an initial weight to the newly created fingerprint set.
The initial weight of a fingerprint set decays gradually over time until it reaches zero.
The information of a file-block fingerprint set specifically comprises: the block's path, the block's creation time, the block's hash value, the fingerprint set's creation time, the fingerprint set's citation count, and the fingerprint set's weight. The weight is jointly determined by the fingerprint set's initial weight, citation count, and creation time: the initial weight unifies the starting state of fingerprint sets, the creation time drives the decay of the weight, and the citation count indicates how active the fingerprint set is.
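For illustration, a minimal Python sketch of one such fingerprint-set record and its weight follows; the names (FingerprintSet, current_weight, DECAY_RATE), the default values, and the additive combination of decayed initial weight and citation count are assumptions of this sketch, not text from the patent.

```python
from dataclasses import dataclass, field
import time

DECAY_RATE = 1e-5   # assumed decay rate of the initial weight, per second

@dataclass
class FingerprintSet:
    block_path: str        # path of the block this fingerprint set maps to
    block_created: float   # block creation time (epoch seconds)
    block_hash: str        # hash value of the block's content
    set_created: float = field(default_factory=time.time)  # fingerprint-set creation time
    citations: int = 0     # citation count: how many files/blocks reference this set
    initial_weight: float = 100.0  # unified starting weight (assumed value)

def current_weight(fp: FingerprintSet, now: float = None) -> float:
    """Weight = decayed initial weight (floored at zero) plus the citation count."""
    now = time.time() if now is None else now
    decayed = max(fp.initial_weight - DECAY_RATE * (now - fp.set_created), 0.0)
    return decayed + fp.citations
```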
Step S2 specifically comprises the following steps:
S201. computing the fingerprint value of the file block to be compared and comparing it against the fingerprint sets cached in distributed memory; the comparison order is arranged by descending fingerprint-set weight, so sets with higher weights are compared first;
S202. if no matching fingerprint set is found in distributed memory, reading the uncached fingerprint sets from disk to complete the comparison, again in descending order of weight;
S203. if an identical fingerprint set is found in memory or on disk, adding a reference from that fingerprint set to the block, and updating the fingerprint set's citation count and weight;
S204. if none is found, creating a new fingerprint set, initializing all of its fields, and adding a reference to the new fingerprint set for the block.
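A minimal sketch of this S201-S204 flow, reusing the FingerprintSet and current_weight sketches above; dedup_block and the two tier lists are hypothetical names.

```python
import time

def dedup_block(path: str, block_hash: str, mem_sets: list, disk_sets: list) -> FingerprintSet:
    for tier in (mem_sets, disk_sets):               # S201: memory first; S202: then disk
        for fp in sorted(tier, key=current_weight, reverse=True):  # descending weight
            if fp.block_hash == block_hash:          # S203: matching fingerprint set found
                fp.citations += 1                    # record the new reference; weight rises
                return fp
    fp = FingerprintSet(block_path=path, block_created=time.time(),
                        block_hash=block_hash)       # S204: miss, create and initialize
    mem_sets.append(fp)                              # the block now references the new set
    return fp
```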
In step S3, the multi-level caching strategy comprises: if memory cannot cache all the fingerprint sets, disk serves as a second-level cache for fingerprint sets; the sets are arranged in descending order of weight and cached into memory in that order, and the fingerprint sets that cannot be cached in memory are cached on disk.
The multi-level caching strategy further comprises: when a new fingerprint set is created or an existing fingerprint set is matched, its weight determines whether the fingerprint set is swapped into memory or cached directly on disk.
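Tier assignment under this strategy might be sketched as follows, reusing current_weight from above; measuring memory capacity as a count of sets is an assumption (the patent does not fix a capacity unit).

```python
def assign_tiers(all_sets: list, mem_capacity: int):
    """Sort all fingerprint sets by descending weight; the top mem_capacity
    sets stay in memory and the remainder spill to the disk tier."""
    ranked = sorted(all_sets, key=current_weight, reverse=True)
    return ranked[:mem_capacity], ranked[mem_capacity:]  # (memory tier, disk tier)
```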
In step S4, the memory is divided into multiple regions, specifically: the memory used for caching fingerprint sets is divided into two parts, one part caching file-level fingerprint-set information and the other part caching block-level fingerprint-set information. The portion of file-level or block-level fingerprint sets that cannot be cached in memory can only be cached on disk and must not occupy the other part's memory.
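The two fixed regions can then be sketched as independently tiered pools built on assign_tiers above; the capacities are illustrative assumptions.

```python
def partition_memory(file_level_sets: list, block_level_sets: list,
                     file_capacity: int = 10_000, block_capacity: int = 100_000):
    """Each region is tiered on its own: overflow from one region spills to
    disk and never borrows memory from the other region."""
    file_mem, file_disk = assign_tiers(file_level_sets, file_capacity)
    block_mem, block_disk = assign_tiers(block_level_sets, block_capacity)
    return (file_mem, file_disk), (block_mem, block_disk)
```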
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The invention is based on distributed in-memory computing and differs from ordinary distributed-system deduplication: by caching fingerprint information into memory in advance, data at massive scale can be analyzed and operated on in memory in real time. When checking whether a file block already exists, the fingerprint sets in memory are compared directly, without reading from disk, so the deduplication speed of the invention is high.
(2) Distributed in-memory computing can fully exploit multi-core capability. With fingerprint data stored in memory in an optimized row-storage layout, fingerprint comparisons can be processed in parallel, and memory reads are several times faster.
(3) The weight of a fingerprint set is computed comprehensively from several factors, and the weights govern whether a fingerprint set can be cached in memory: the higher the weight, the more active the set, and the faster it can be matched within a given period. When memory space is insufficient, the multi-level caching strategy caches part of the fingerprint sets to disk so they can be reused.
(4) The in-memory fingerprint cache region is divided into two parts, and data is deduplicated both at file level and at block level. Combining the two modes shrinks the block-fingerprint information set, and some files can be handled without any blocking at all.
Description of the drawings
Fig. 1 is a flow chart of the data deduplication method based on distributed in-memory computing of the present invention.
Fig. 2 is a schematic diagram of the region division between file fingerprint sets and block fingerprint sets.
Fig. 3 is a schematic diagram of the fingerprint-set information cached in memory or on disk.
Fig. 4 is a schematic diagram of the file-level deduplication of the data deduplication method based on distributed in-memory computing.
Fig. 5 is a schematic diagram of the block-level deduplication of the data deduplication method based on distributed in-memory computing.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment one
A data deduplication method based on distributed in-memory computing comprises the following steps:
(1) Create file-block fingerprint sets in distributed memory and cache the fingerprint sets into memory. A fingerprint set contains two parts of content: one part is the block's path, the block's creation time, the block's hash value, and so on; the other part is the fingerprint set's creation time, citation count, weight, and so on. The first part maps the fingerprint set to its block; the second part controls whether the fingerprint set is cached in distributed memory or cached to disk.
(2) When a fingerprint set is created, add a unified initial weight to it, which determines the fingerprint set's cache location. The initial weight of each fingerprint set decays gradually over time until it reaches zero.
(3) When a file backup or upload operation occurs in the distributed system, operation control creates a deduplication task for the file on a host, splits the file into blocks according to the optimal file-blocking strategy, and computes the block fingerprints. Each block fingerprint is compared against the fingerprint sets cached in memory to find a matching block, for which the corresponding reference is added. The block fingerprint comparison method is:
(3.1) compute the fingerprint value of the file block to be compared and compare it against the fingerprint sets cached in distributed memory; the comparison order is arranged by descending fingerprint-set weight, so sets with higher weights are compared first;
(3.2) if no matching fingerprint set is found in distributed memory, read the uncached fingerprint sets from disk to complete the comparison, again in descending order of weight;
(3.3) if an identical fingerprint set is found in memory or on disk, add a reference from that fingerprint set to the block, and update the fingerprint set's citation count and weight;
(3.4) if none is found, create a new fingerprint set, initialize all the fields described in step (1), and add a reference to the new fingerprint set for the block.
(4) In the distributed in-memory computing method, fingerprint-set weights control the order of block comparison and whether a given fingerprint set is cached in memory; the magnitude of a fingerprint set's weight represents how active the set is in memory. The weight is computed as follows: the current weight of a fingerprint set is jointly determined by its initial weight, citation count, and creation time. The initial weight gives all fingerprint sets a unified starting state and ensures that newly created file blocks hold relatively high weights; the creation time drives the decay of the weight, so that fingerprint sets unused for a long time end up with low weights; and the citation count, as a key factor influencing the weight, realizes control over the fingerprint set's activity.
(5) Over the course of massive-data deduplication, the in-memory deduplication method generates a large number of fingerprint sets as time goes on, so a multi-level caching strategy for fingerprint sets is implemented with fingerprint-set weights as the criterion. The multi-level caching decision method is: if memory in the distributed system cannot cache all fingerprint sets, disk serves as the second-level cache location for fingerprint sets; the fingerprint sets are arranged in descending order of weight and cached into memory in that order, and when more fingerprint sets exist than memory can hold, the remaining lower-weight sets are cached on disk. In addition, when a new fingerprint set is created or an existing fingerprint set is cited, its weight determines whether the set is swapped into memory or cached directly on disk.
(6) To reduce the number of file-block fingerprint sets and realize fast fingerprint-set matching in distributed memory, the memory used for caching fingerprint sets is divided into two parts: one part caches file-level fingerprint-set information, allowing a file to be handled quickly up front; the other part caches block-level fingerprint-set information. The portion of file-level or block-level fingerprint sets that cannot be cached in memory can only be cached on disk and must not occupy the other part's memory.
Embodiment two
The invention is applied to data deduplication based on a Spark system:
Fig. 1 shows the flow chart of the invention. First, file-block fingerprint sets are built in distributed memory, and an initial weight is added to each created fingerprint set to determine its cache location; the initial weight decays gradually over time until it reaches zero. A file is split into blocks according to the optimal file-blocking strategy, the block fingerprints are computed and compared against the fingerprint sets cached in memory, and the corresponding reference is added when a matching fingerprint set is found; if none is found on disk either, the block and a new fingerprint set are created. The activity of a fingerprint set in memory is expressed by the magnitude of its weight, and the weight ordering controls whether a fingerprint set is cached in memory. Block fingerprint sets are stored with the multi-level caching strategy: sets with high weights are cached in memory and sets with low weights on disk, ensuring fingerprint sets can be reused. Memory is divided into multiple regions that store file-level and block-level fingerprint information separately, so that different processing operations can be performed on files.
This embodiment gives an implementation of the data deduplication method based on distributed in-memory computing. The deduplication system is based on the Spark in-memory computing system and builds fingerprint sets FPD (Fingerprint Datasets). FPDs are divided into two classes, file-level fingerprint sets and block-level fingerprint sets; as shown in Fig. 2, the two classes are stored in different regions of memory whose storage spaces do not overlap. The so-called file fingerprint set is simply the case where the file consists of a single block. As shown in Fig. 3, both kinds of FPD contain the block's path, the block's modification time, the block's hash value, the fingerprint set's creation time, the fingerprint set's citation count, the fingerprint set's weight, and so on.
Adding a unified initial weight to each newly created fingerprint set on the one hand makes every new fingerprint set's starting state consistent, and on the other hand controls whether the created fingerprint set can be cached in memory. Fingerprint sets are created and accumulate over time, so free cache space in memory shrinks. In general, following an LRU (Least Recently Used) strategy, result sets from earlier periods become less active in the current period, so the time interval is taken as one influence factor of the weight. To keep the total FPD weight reasonable, the product of the decay factor and the time interval serves as the decrement of the fingerprint set's initial weight over that period; only once the initial weight has been reduced to zero does the decay factor cease to apply.
When a backup or file-upload operation occurs in the Spark system, the data must be deduplicated, and job management creates a file deduplication task on a host. To improve deduplication efficiency, file-level deduplication is performed on the file first, the comparison taking place in the memory region that stores file-level fingerprint information. As shown in Fig. 4, for an upload operation, the file is hashed directly to extract its fingerprint, which is then compared. For a data backup operation, information such as the file's owner, path, and modification time is obtained first, and the file is looked up by this information: if it exists, a reference to the file is added to the database and the fingerprint set's citation count is updated; if it is not found, the file is hashed and looked up by its hash value, and if it exists the corresponding reference is added.
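A sketch of this file-level pass (Fig. 4); the index structures, the SHA-1 choice, and the helper name are assumptions of this sketch rather than the patent's prescribed implementation.

```python
import hashlib

def file_level_dedup(meta: dict, content: bytes, meta_index: dict, hash_index: dict):
    key = (meta["owner"], meta["path"], meta["mtime"])
    fp = meta_index.get(key)             # backup path: cheap metadata lookup first
    if fp is not None:
        fp.citations += 1                # file exists: add a reference, update the count
        return fp
    digest = hashlib.sha1(content).hexdigest()
    fp = hash_index.get(digest)          # upload path / metadata miss: hash lookup
    if fp is not None:
        fp.citations += 1
        return fp
    return None                          # not found: fall through to block-level dedup
```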
If the file is still not found, it is split into blocks and compared in the memory region for block fingerprint sets. As shown in Fig. 5, the file is split into blocks of a reasonable size, and each fingerprint set is looked up in the corresponding region: if it exists, a reference to the block is added to the database and the fingerprint set's citation count is updated; if it does not exist, the block and its block fingerprint set are created in the system, and a creation time, weight, and so on are added to the new fingerprint set.
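The block-level pass (Fig. 5) might then be sketched as follows, assuming simple fixed-size blocking in place of the unspecified optimal blocking strategy and reusing the FingerprintSet sketch above.

```python
import hashlib, time

def block_level_dedup(content: bytes, block_index: dict, block_size: int = 4096):
    refs = []
    for off in range(0, len(content), block_size):
        digest = hashlib.sha1(content[off:off + block_size]).hexdigest()
        fp = block_index.get(digest)
        if fp is None:                   # miss: create the block and its fingerprint set
            fp = FingerprintSet(block_path=f"blk:{digest}",
                                block_created=time.time(), block_hash=digest)
            block_index[digest] = fp     # creation time and weight set on construction
        else:                            # hit: add the block's reference to the set
            fp.citations += 1
        refs.append(fp)
    return refs
```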
Before fingerprint-set weights can be compared, each fingerprint set's weight must be computed; the weight expresses how active the fingerprint set is and controls whether the set is cached in memory or on disk. Besides the initial weight and the time interval discussed above, the number of times a fingerprint set has been cited is the most important factor influencing the weight. Let the initial weight be W0, the decay rate of the initial weight be Rw, the fingerprint set's creation date be Dc, the current date be Dn, and the citation count of the file or file block be Cn (initially 0). From these definitions, the weight of the current fingerprint set is:

Wn = max(W0 - Rw × (Dn - Dc), 0) + Cn
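As a quick worked check under assumed values: with W0 = 100, Rw = 1 per day, a fingerprint set created 30 days ago (Dn - Dc = 30), and Cn = 12 citations, Wn = max(100 - 1 × 30, 0) + 12 = 70 + 12 = 82. After 100 days with no further citations, the decayed term bottoms out at zero and Wn = 0 + 12 = 12, so only continued citation keeps a set competitive for memory.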
For the ease of reuse of fingerprint sets, the multi-level caching strategy caches a fingerprint set replaced out of memory onto disk, from which it can be read directly next time without being rebuilt. Suppose the set to be cached is FPDn. If the memory cache is not full, FPDn is cached directly into the corresponding memory region. If the memory cache is full, the Min algorithm retrieves the minimum-weight fingerprint set FPDmin cached in memory, and the weight Wn of FPDn is compared with the weight Wmin of FPDmin: if Wn > Wmin, FPDmin is displaced onto disk and FPDn is cached in memory; if Wn < Wmin, FPDn is cached on disk.
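A sketch of this replacement decision, reusing current_weight from above and rendering the Min algorithm as a simple minimum-weight scan; the count-based capacity and the helper name are assumptions.

```python
def cache_fingerprint_set(fpd_n, mem_cache: list, disk_cache: list, capacity: int):
    if len(mem_cache) < capacity:                  # memory cache not full: keep FPDn in memory
        mem_cache.append(fpd_n)
        return
    fpd_min = min(mem_cache, key=current_weight)   # lowest-weight set cached in memory
    if current_weight(fpd_n) > current_weight(fpd_min):
        mem_cache.remove(fpd_min)                  # displace FPDmin onto disk
        disk_cache.append(fpd_min)
        mem_cache.append(fpd_n)
    else:
        disk_cache.append(fpd_n)                   # FPDn is cached straight to disk
```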
The above embodiment is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited by it; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (8)

1. A data deduplication method based on distributed in-memory computing, characterized by comprising the following steps:
S1. creating file-block fingerprint sets and caching them in distributed memory;
S2. splitting files into blocks according to an optimal file-blocking strategy, computing the block fingerprints, comparing them against the fingerprint sets cached in memory, finding matching blocks, and adding the corresponding references;
S3. storing block fingerprint sets using a multi-level caching strategy, under which sets with high weights are cached in memory and sets with low weights are cached on disk;
S4. dividing memory into multiple regions that store different types of fingerprint information, so that different fingerprint comparison operations are performed on files;
wherein the data deduplication method based on distributed in-memory computing is based on a Spark system, specifically as follows:
the Spark system builds fingerprint sets FPD, the FPDs being divided into two classes, file-level fingerprint sets and block-level fingerprint sets, both of which contain the block's path, the block's modification time, the block's hash value, the fingerprint set's creation time, the fingerprint set's citation count, and the fingerprint set's weight;
adding a unified initial weight to each newly created fingerprint set on the one hand makes every new fingerprint set's starting state consistent, and on the other hand controls whether the created fingerprint set can be cached in memory; following an LRU strategy, the time interval is taken as one influence factor of the weight, the product of the decay factor and the time interval serving as the decrement of the fingerprint set's initial weight over that period; only once the initial weight has been reduced to zero does the decay factor cease to apply;
when a backup or file-upload operation occurs in the Spark system, the data must be deduplicated, and job management creates a file deduplication task on a host; file-level deduplication is performed on the file first, the comparison taking place in the memory region that stores file-level fingerprint information; for an upload operation, the file is hashed directly to extract the fingerprint, which is then compared; for a data backup operation, the file's owner, path, and modification time are obtained first, and the file is looked up by this information; if it exists, a reference to the file is added to the database and the fingerprint set's citation count is updated; if it is not found, the file is hashed and looked up by hash value, and if it exists the corresponding reference is added;
if the file is not found, the file is split into blocks and compared in the memory region for block fingerprint sets; the fingerprint set of each block is looked up in the corresponding region; if it exists, a reference to the block is added to the database and the fingerprint set's citation count is updated; if it does not exist, the block and its block fingerprint set are created in the system, and a creation time and weight are added to the new fingerprint set;
before fingerprint-set weights are compared, each fingerprint set's weight must be computed, the weight expressing the fingerprint set's activity and controlling whether the set is cached in memory or on disk; with the initial weight W0, the decay rate of the initial weight Rw, the fingerprint set's creation date Dc, the current date Dn, and the citation count of the file or file block Cn (initially 0), the weight of the current fingerprint set is: Wn = max(W0 - Rw × (Dn - Dc), 0) + Cn;
using the multi-level caching strategy, a fingerprint set replaced out of memory is cached to disk, from which it can be read directly next time without being rebuilt; supposing the set to be cached is FPDn: if the memory cache is not full, FPDn is cached directly into the corresponding memory region; if the memory cache is full, the Min algorithm retrieves the minimum-weight fingerprint set FPDmin cached in memory, and the weight Wn of FPDn is compared with the weight Wmin of FPDmin; if Wn > Wmin, FPDmin is displaced to disk and FPDn is cached in memory; if Wn < Wmin, FPDn is cached on disk.
2. The data deduplication method based on distributed in-memory computing according to claim 1, characterized by further comprising: after a file-block fingerprint set is created, adding an initial weight to the newly created fingerprint set.
3. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the initial weight of a fingerprint set decays gradually over time until it reaches zero.
4. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the information of a file-block fingerprint set comprises: the block's path, the block's creation time, the block's hash value, the fingerprint set's creation time, the fingerprint set's citation count, and the fingerprint set's weight; the weight is jointly determined by the fingerprint set's initial weight, citation count, and creation time; the initial weight unifies the starting state of fingerprint sets; the creation time drives the decay of the weight; and the citation count indicates the fingerprint set's activity.
5. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that step S2 specifically comprises the following steps:
S201. computing the fingerprint value of the file block to be compared and comparing it against the fingerprint sets cached in distributed memory, the comparison order being arranged by descending fingerprint-set weight so that sets with higher weights are compared first;
S202. if no matching fingerprint set is found in distributed memory, reading the uncached fingerprint sets from disk to complete the comparison, again in descending order of weight;
S203. if an identical fingerprint set is found in memory or on disk, adding a reference from that fingerprint set to the block, and updating the fingerprint set's citation count and weight;
S204. if none is found, creating a new fingerprint set, initializing all of its fields, and adding a reference to the new fingerprint set for the block.
6. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S3 the multi-level caching strategy comprises: if memory cannot cache all the fingerprint sets, disk serves as a second-level cache for fingerprint sets; the sets are arranged in descending order of weight and cached into memory in that order, and the fingerprint sets that cannot be cached in memory are cached on disk.
7. The data deduplication method based on distributed in-memory computing according to claim 6, characterized in that the multi-level caching strategy further comprises: when a new fingerprint set is created or an existing fingerprint set is matched, determining according to its weight whether the fingerprint set is swapped into memory or cached directly on disk.
8. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S4 the memory is divided into multiple regions, specifically: the memory used for caching fingerprint sets is divided into two parts, one part caching file-level fingerprint-set information and the other part caching block-level fingerprint-set information; the portion of file-level or block-level fingerprint sets that cannot be cached in memory can only be cached on disk and must not occupy the other part's memory.
CN201510670867.2A 2015-10-13 2015-10-13 A data deduplication method based on distributed in-memory computing Active CN105354246B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510670867.2A CN105354246B (en) 2015-10-13 2015-10-13 A data deduplication method based on distributed in-memory computing

Publications (2)

Publication Number Publication Date
CN105354246A CN105354246A (en) 2016-02-24
CN105354246B 2018-11-02

Family

ID=55330219

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510670867.2A Active CN105354246B (en) A data deduplication method based on distributed in-memory computing

Country Status (1)

Country Link
CN (1) CN105354246B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106372105A * 2016-08-19 2017-02-01 中国科学院信息工程研究所 Spark platform-based microblog data preprocessing method
CN107368545B * 2017-06-28 2019-08-27 深圳神州数码云科数据技术有限公司 A deduplication method and device based on a Merkle tree variant algorithm
CN107273536A * 2017-06-30 2017-10-20 郑州云海信息技术有限公司 A duplicate-data determination method and system, and a distributed storage system
CN107329846B * 2017-07-11 2020-06-12 深圳市信义科技有限公司 Fingerprint big-data comparison method based on big-data technology
CN109144417A * 2018-08-16 2019-01-04 广州杰赛科技股份有限公司 A cloud storage method, system, and device
CN109240605B * 2018-08-17 2020-05-19 华中科技大学 Fast duplicate data block identification method based on 3D stacked memory
CN109189577B * 2018-08-31 2020-05-19 武汉达梦数据库有限公司 Method and device for preventing memory overflow during data synchronization
CN109241023A * 2018-09-21 2019-01-18 郑州云海信息技术有限公司 Data storage method, device, system, and storage medium for a distributed storage system
CN109522305B * 2018-12-06 2021-02-02 北京千方科技股份有限公司 Big-data deduplication method and device
CN110147331B * 2019-05-16 2021-04-02 重庆大学 Cache data processing method and system, and readable storage medium
CN111444167A * 2020-03-25 2020-07-24 厦门市美亚柏科信息股份有限公司 Method, device, and storage medium for deduplicating data based on data digests

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013119201A1 (en) * 2012-02-06 2013-08-15 Hewlett-Packard Development Company, L.P. De-duplication

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101079034A * 2006-07-10 2007-11-28 腾讯科技(深圳)有限公司 System and method for eliminating redundant files in a file storage system
CN101706825A * 2009-12-10 2010-05-12 华中科技大学 Duplicate data deletion method based on file content types
EP2557514A1 * 2011-08-12 2013-02-13 Nexenta Systems, Inc. Cloud Storage System with Distributed Metadata
CN104869140A * 2014-02-25 2015-08-26 阿里巴巴集团控股有限公司 Multi-cluster system and method for controlling data storage of a multi-cluster system

Also Published As

Publication number Publication date
CN105354246A (en) 2016-02-24


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant