CN105354246B - A data deduplication method based on distributed in-memory computing - Google Patents
A data deduplication method based on distributed in-memory computing
- Publication number
- CN105354246B (application CN201510670867.2A)
- Authority
- CN
- China
- Prior art keywords
- fingerprint set
- fingerprint
- memory
- block
- weight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
The invention discloses a data deduplication method based on distributed in-memory computing, comprising the following steps in order: create file-block fingerprint sets and cache them in distributed memory; split files into blocks according to an optimal file-chunking strategy, compute each block's fingerprint, compare it against the fingerprint sets cached in memory, and, on finding a matching block, add a corresponding reference to it; store the block fingerprint sets under a multi-level caching strategy, keeping high-weight sets in memory and low-weight sets on disk; and partition memory into multiple regions that hold different types of fingerprint information, so that different fingerprint comparisons can be performed on a file. The deduplication method of the invention improves the efficiency of deduplicating massive data, saving storage space and network bandwidth and reducing data operation and maintenance costs for service providers.
Description
Technical field
The present invention relates to the field of massive-data deduplication, and in particular to a data deduplication method based on distributed in-memory computing.
Background technology
Distributed systems are now widely used across the information industry to cope with the ever-growing volume of data. Although distributed systems solve the storage problem for massive data, they bring a new challenge: data backup and restore take ever longer, data redundancy keeps growing, and the cost of storing and maintaining data keeps rising. Although the unit price of storage has fallen sharply, total storage cost continues to climb, so data deduplication technology is attracting more and more attention. How to deduplicate the secondary storage of massive data efficiently, minimizing the time the deduplication process consumes, has become a problem to be solved.
Research on data deduplication has surged in recent years. At FAST 2011, the paper "A Study of Practical Deduplication" analyzed deduplication in primary storage systems, and the paper "Tradeoffs in Scalable Data Routing for Deduplication Clusters" weighed scalable data-routing options in deduplication clusters. In the distributed setting, Jianming Young et al. proposed a method using HDFS and HBase that computes file hashes with both the MD5 and SHA-1 hash functions and passes the values to HBase; a new hash is compared against the stored value domain to decide whether the client needs to transmit the file, and combining MD5 with SHA-1 avoids occasional collisions. Dedoop (Deduplication with Hadoop), a prototype tool developed at the University of Leipzig, applies MapReduce to entity resolution over big data and embodies the most mature application pattern of MapReduce in deduplication technology. Its entity-matching-based blocking partitions the input data semantically by similarity and defines the entities of each block. Entity resolution is split into two MapReduce jobs: an analysis job mainly collects record-frequency statistics, and a matching job handles load balancing and similarity computation. The matching job regulates load balancing with a "greedy" scheme: matching tasks are sorted by data size in descending order and assigned to the Reduce job with the lightest load. Dedoop also employs effective techniques to avoid redundant pairwise comparisons: the MapReduce program must state explicitly which Reduce task handles which comparisons, so the same pair is never compared on multiple nodes. Ashish Kathpal et al. combined MapReduce with a storage controller, replacing NetApp's original duplicate detection with a detection mechanism built on Hadoop MapReduce: the storage controller moves data fingerprints into HDFS and builds a fingerprint database stored permanently on HDFS, MapReduce filters duplicate records out of the fingerprint record set, and the deduplicated fingerprint table is written back to the storage controller. In China, Liu Hougui et al. proposed a scalable fingerprint-query method: a sampling-based query optimization reduces the number of fingerprints that must be consulted, and a scalable index structure organizes fingerprint storage, further improving fingerprint lookup efficiency. Wang Jianhui et al. studied HDFS distributed backup systems that support deduplication, using the open-source framework Lucene to build a file index for fast information retrieval over large file sets. During deduplication, each block is first checked for identity: on a first backup the file is written to the storage medium; if a block to be backed up is identical to an already-backed-up block, the block is not stored again; instead a pointer to the duplicate data is recorded along with backup metadata, to ease recovery.
Although much research has been done in recent years on deduplication for cloud data backup, current massive-data deduplication focuses mainly on optimal file chunking and requires data preprocessing and modeling in advance; fingerprint information is read from a database or disk and analyzed on the fly before comparison. This approach is inefficient in deduplication and wastes time and system resources. Modeling the deduplication system on distributed in-memory computing instead exploits the full power of multi-core parallelism: data is processed in parallel and memory reads are many times faster, solving the problem that massive data cannot currently be deduplicated quickly.
Summary of the invention
The object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a data deduplication method based on distributed in-memory computing. It filters out identical file blocks by comparing file-block fingerprints against the library cached in distributed memory, and assigns different tasks to each host in the distributed system to balance the system load. This improves the efficiency of deduplicating massive data, saves storage space and network bandwidth, and reduces data operation and maintenance costs for service providers.
The object of the invention is achieved through the following technical solution:
A data deduplication method based on distributed in-memory computing, comprising the following steps in order:
S1. Create file-block fingerprint sets and cache them in distributed memory.
S2. Split files into blocks according to the optimal file-chunking strategy, compute each block's fingerprint, compare it against the fingerprint sets cached in memory, find the matching block, and add a corresponding reference to it.
S3. Store the block fingerprint sets under a multi-level caching strategy: high-weight sets are cached in memory, low-weight sets on disk.
S4. Partition memory into multiple regions storing different types of fingerprint information, so that different fingerprint comparisons can be performed on a file.
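Steps S1 and S2 can be sketched as follows. The names (`FingerprintSet`, `chunk`, `dedupe_file`) and the fixed-size chunker standing in for the patent's "optimal file-chunking strategy" are illustrative assumptions, not part of the patent itself.

```python
import hashlib

class FingerprintSet:
    """One fingerprint set: a block hash plus the metadata S1 attaches to it."""
    def __init__(self, block_hash, path):
        self.block_hash = block_hash
        self.path = path          # block's corresponding path
        self.references = []      # blocks that reference this set (S2)

def chunk(data, size=4):
    """Stand-in for the optimal file-chunking strategy: fixed-size blocks."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def dedupe_file(data, path, memory_cache):
    """S2: fingerprint each block, match against cached sets, add references.

    Returns the number of blocks that actually had to be stored."""
    stored_blocks = 0
    for i, block in enumerate(chunk(data)):
        fp = hashlib.sha1(block).hexdigest()
        if fp in memory_cache:                    # duplicate block: reference only
            memory_cache[fp].references.append((path, i))
        else:                                     # new block: create a fingerprint set
            memory_cache[fp] = FingerprintSet(fp, (path, i))
            stored_blocks += 1
    return stored_blocks

cache = {}
first = dedupe_file(b"ABCDABCD", "file1", cache)   # two identical 4-byte blocks
second = dedupe_file(b"ABCDXYZW", "file2", cache)  # first block already cached
```

Only two of the four blocks end up stored; the duplicates become references, which is the space saving the method targets.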
The data deduplication method based on distributed in-memory computing further comprises: after a file-block fingerprint set is created, adding an initial weight to it.
The initial weight of a fingerprint set decays gradually over time until it reaches zero.
The information in a file-block fingerprint set comprises: the block's corresponding path, block creation time, block HASH value, fingerprint-set creation time, fingerprint-set reference count, and fingerprint-set weight. The weight is determined jointly by the initial weight, the reference count, and the creation time: the initial weight unifies the starting state of all fingerprint sets, the creation time drives the weight's decay, and the reference count indicates how active the fingerprint set is.
Step S2 specifically comprises the following steps:
S201. Compute the fingerprint value of the file block to be compared and compare it against the fingerprint sets cached in distributed memory; the comparison proceeds in descending order of fingerprint-set weight, so heavier sets are compared first.
S202. If no matching fingerprint set is found in distributed memory, read the uncached fingerprint sets from disk and complete the comparison, again in descending order of weight.
S203. If an identical fingerprint set is found, in memory or on disk, add that set's reference to the block and update the set's reference count and weight.
S204. If none is found, create a new fingerprint set, initialize all of its fields, and add a reference to the new set for the block.
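Steps S201 to S204 amount to a two-tier, weight-ordered lookup; a minimal sketch follows, with the dictionary layout and function name assumed for illustration.

```python
def find_match(fp, memory_sets, disk_sets):
    """S201-S204 sketch: compare against the in-memory sets in descending
    weight order first (S201), then against the on-disk sets, also by
    descending weight (S202)."""
    for tier in (memory_sets, disk_sets):
        for s in sorted(tier, key=lambda x: x["weight"], reverse=True):
            if s["hash"] == fp:
                s["refs"] += 1       # S203: record the new reference
                return s
    return None                      # S204: caller must create a new set

mem = [{"hash": "aa", "weight": 5, "refs": 0},
       {"hash": "bb", "weight": 9, "refs": 0}]
disk = [{"hash": "cc", "weight": 2, "refs": 0}]
hit = find_match("cc", mem, disk)    # found only in the disk tier
miss = find_match("dd", mem, disk)   # not found anywhere: create new set
```

The point of the descending-weight order is that active sets, which are the likeliest matches, are checked first.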
In step S3, the multi-level caching strategy comprises: if memory cannot hold all fingerprint sets, the disk serves as the second-level fingerprint cache; the sets are ordered by descending weight and cached into memory in that order, and the sets that cannot fit in memory are cached on disk.
The multi-level caching strategy further comprises: when a new fingerprint set is created, or an existing set is matched, its weight determines whether it is swapped into memory or cached directly on disk.
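The bulk placement rule of the multi-level caching strategy can be sketched in a few lines; `tier_sets` and the capacity measured in number of sets are illustrative assumptions.

```python
def tier_sets(all_sets, mem_capacity):
    """Multi-level caching sketch: sort fingerprint sets by descending
    weight, fill memory in that order, and push the rest to the disk
    (second-level) cache."""
    ordered = sorted(all_sets, key=lambda s: s["weight"], reverse=True)
    return ordered[:mem_capacity], ordered[mem_capacity:]

sets = [{"id": "A", "weight": 3},
        {"id": "B", "weight": 8},
        {"id": "C", "weight": 5}]
in_memory, on_disk = tier_sets(sets, mem_capacity=2)
```

With room for two sets, the two heaviest (B, C) land in memory and A spills to disk, matching the strategy's weight-descending order.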
In step S4, the memory is divided into multiple regions as follows: the memory used for caching fingerprint sets is split into two parts, one caching file-level fingerprint-set information and the other caching block-level fingerprint-set information. Fingerprint sets of either level that cannot fit in their part of memory may only be cached on disk; they must not occupy the other part's memory.
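The isolation rule between the two regions can be sketched as follows; the region names, `cache_fingerprint`, and the per-region capacity are assumptions for illustration.

```python
# Two fixed, non-overlapping regions: file-level and block-level fingerprint sets.
regions = {"file": {}, "block": {}}

def cache_fingerprint(level, fp, info, capacity=2):
    """Cache a fingerprint set into its own region only; when that region
    is full the set must go to disk rather than borrow the other region's
    memory (the isolation rule of step S4)."""
    region = regions[level]
    if len(region) < capacity:
        region[fp] = info
        return "memory"
    return "disk"

cache_fingerprint("file", "f1", {})
cache_fingerprint("file", "f2", {})
spill = cache_fingerprint("file", "f3", {})   # file region full: spills to disk
other = cache_fingerprint("block", "b1", {})  # block region still has room
```

Even with the file region full, the block region keeps accepting sets, and vice versa: neither level can starve the other of memory.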
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The invention is based on distributed in-memory computing, unlike deduplication in a typical distributed system. By caching fingerprint information in memory in advance, massive data can be analyzed and operated on in memory in real time; whether a file block already exists is decided by comparing directly against the fingerprint sets in memory, with no disk reads, so the invention deduplicates quickly.
(2) Distributed in-memory computing exploits the full power of multi-core hardware. Fingerprint data stored in memory in an optimized layout can be compared in parallel, and memory reads are many times faster than disk reads.
(3) The weight of a fingerprint set is computed from several factors, and the weight decides whether the set may be cached in memory; a heavier set is more active and can be matched quickly within its period of activity. When memory runs short, the multi-level caching strategy moves some fingerprint sets to disk so they can still be reused.
(4) The in-memory fingerprint region is divided into two parts, deduplicating data at both the file level and the block level. Combining the two modes shrinks the block fingerprint-set collection, and some files can be handled without chunking at all.
Description of the drawings
Fig. 1 is a flow chart of the data deduplication method based on distributed in-memory computing of the present invention.
Fig. 2 is a schematic diagram of the region division between file fingerprint sets and block fingerprint sets.
Fig. 3 is a schematic diagram of the fingerprint-set information cached in memory or on disk.
Fig. 4 is a schematic diagram of file-level deduplication in the data deduplication method based on distributed in-memory computing.
Fig. 5 is a schematic diagram of block-level deduplication in the data deduplication method based on distributed in-memory computing.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiments and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment one
A data deduplication method based on distributed in-memory computing comprises the following steps:
(1) Create file-block fingerprint sets in distributed memory and cache them in memory. Each fingerprint set contains two kinds of content: one part is the block's corresponding path, block creation time, block HASH value, and so on; the other part is the fingerprint-set creation time, reference count, weight, and so on. The first part maps fingerprint sets to blocks; the second part controls whether the set is cached in distributed memory or on disk.
(2) Add a unified initial weight to each fingerprint set when it is created, which determines the set's cache location. The initial weight of each set decays gradually over time until it reaches zero.
(3) When a file is backed up or uploaded in the distributed system, the job controller creates a deduplication task for the file on some host, chunks the file according to the optimal file-chunking strategy, and computes each block's fingerprint. Each block fingerprint is compared against the fingerprint sets cached in memory to find the matching block, and a corresponding reference is added. The block-fingerprint comparison proceeds as follows:
(3.1) Compute the fingerprint value of the file block to be compared and compare it against the fingerprint sets cached in distributed memory, in descending order of weight, the heavier sets first;
(3.2) if no matching fingerprint set is found in distributed memory, read the uncached sets from disk and complete the comparison, again in descending weight order;
(3.3) if an identical fingerprint set is found in memory or on disk, add that set's reference to the block and update its reference count and weight;
(3.4) if none is found, create a new fingerprint set, initialize its fields as in step (1), and add a reference to the new set for the block.
(4) In the distributed in-memory computing method, the fingerprint-set weight controls both the order of block comparison and whether a given fingerprint set is cached in memory. A set's weight represents how active it is in memory. The weight is computed as follows: the current weight of a fingerprint set is determined jointly by its initial weight, its reference count, and its creation time. The initial weight unifies the starting state of all fingerprint sets and guarantees that newly created file blocks hold relatively high weights; the creation time drives the weight's decay, so that sets unused for a long time end up with low weights; and the reference count, as a key factor influencing the weight, realizes control over how active a set is.
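One possible reading of this weight rule can be sketched numerically. The linear decay and the per-reference increment `ref_bonus` are assumptions; the patent fixes only that the initial weight decays to zero over time and that references raise the weight.

```python
def current_weight(w0, rate, created_day, today, refs, ref_bonus=1.0):
    """Hedged reading of the weight rule in step (4): the initial weight
    decays linearly with age until it reaches zero, and each reference
    raises the weight; ref_bonus is an assumed per-reference increment."""
    decayed = max(w0 - rate * (today - created_day), 0.0)
    return decayed + ref_bonus * refs

fresh = current_weight(10.0, 1.0, created_day=0, today=2, refs=0)   # young, unused
stale = current_weight(10.0, 1.0, created_day=0, today=30, refs=0)  # fully decayed
active = current_weight(10.0, 1.0, created_day=0, today=30, refs=4) # kept alive by refs
```

A young set starts heavy, an old unused set decays to zero, and an old but frequently referenced set keeps a nonzero weight, which is exactly the behavior the three factors are meant to produce.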
(5) As the deduplication method based on distributed in-memory computing runs over massive data, it produces a large number of fingerprint sets over time. The multi-level caching strategy for fingerprint sets takes the weights as its basis and decides as follows: if the memory of the distributed system cannot cache all fingerprint sets, the disk serves as the second-level cache location; the sets are sorted by descending weight and cached into memory in that order, and when memory can hold no more, the remaining lower-weight sets are cached on disk. In addition, when a new fingerprint set is created or an existing set is referenced, its weight decides whether it is swapped into memory or cached directly on disk.
(6) To reduce the number of file-block fingerprint sets and achieve fast fingerprint-set matching in distributed memory, the memory used for caching fingerprint sets is divided into two parts: one caches file-level fingerprint-set information, allowing certain files to be handled early; the other caches block-level fingerprint-set information. Fingerprint sets of either level that cannot fit in their part of memory may only be cached on disk and must not occupy the other part's memory.
Embodiment two
The present invention applied to deduplication on a Spark system:
Fig. 1 shows the flow chart of the invention. First, file-block fingerprint sets are built in distributed memory, and each newly created set is given an initial weight that determines its cache location; the initial weight decays gradually over time until it reaches zero. Files are chunked according to the optimal file-chunking strategy, each block's fingerprint is computed and compared against the fingerprint sets cached in memory, and a corresponding reference is added if a matching set is found; if none is found on disk either, the block and a new fingerprint set are created. A set's weight expresses how active it is in memory, and the weight order controls whether a set is cached in memory. Block fingerprint sets are stored under the multi-level caching strategy, heavy sets in memory and light sets on disk, ensuring that fingerprint sets are reused. Memory is divided into multiple regions storing file-level and block-level fingerprint information separately, so that different processing can be applied to a file.
This embodiment provides a concrete implementation of the data deduplication method based on distributed in-memory computing. The deduplication system is built on the Spark in-memory computing system and constructs fingerprint sets, FPD (Fingerprint Datasets). FPDs fall into two classes, file-level fingerprint sets and block-level fingerprint sets; as shown in Fig. 2, the two classes are stored in different regions of memory whose storage spaces do not overlap. A so-called file fingerprint set is simply the case where the block count is 1, that is, the whole file is one block. As shown in Fig. 3, both kinds of FPD contain the block's corresponding path, block modification time, block HASH value, fingerprint-set creation time, reference count, weight, and so on.
Adding a unified initial weight to each newly created fingerprint set both makes every new set start from the same state and helps control whether the set can be cached in memory. As time passes, fingerprint sets are continually created and accumulate, and free cache space in memory shrinks. In general, following an LRU (Least Recently Used) strategy, result sets from an earlier period become less active in the current period, so the time interval is taken as one factor of the weight. To keep the total FPD weight reasonable, the product of the decay factor and the time interval is subtracted from the set's initial weight for that period; once the initial weight has been reduced to zero, the decay factor no longer applies.
When data is backed up or a file is uploaded in the Spark system, the data must be deduplicated, and the job manager creates a file deduplication task on some host. To improve deduplication efficiency, the file first goes through file-level deduplication, compared in the memory region that stores file-level fingerprint information. As shown in Fig. 4, for an upload the file is hashed directly, its fingerprint information is extracted, and then the comparison is made. For a data backup, the file's owning user, file path, file modification time, and similar metadata are obtained first, and the file is looked up by that information; if it exists, a reference to the file is added to the database and the fingerprint set's reference count is updated. If it is not found, the file is hashed and looked up by hash value; if it exists, the corresponding reference is added.
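The two-stage file-level lookup of Fig. 4, metadata first and content hash second, can be sketched as follows; the `db` layout and `backup_file` name are illustrative assumptions.

```python
import hashlib

db = {"by_meta": {}, "by_hash": {}}   # illustrative file-level fingerprint region

def backup_file(owner, path, mtime, content):
    """File-level pass from Fig. 4: look up by (owner, path, mtime) first,
    then by content hash; only a double miss falls through to the
    block-level deduplication stage."""
    meta = (owner, path, mtime)
    if meta in db["by_meta"]:
        db["by_meta"][meta]["refs"] += 1
        return "meta-hit"
    digest = hashlib.sha1(content).hexdigest()
    if digest in db["by_hash"]:
        db["by_hash"][digest]["refs"] += 1
        return "hash-hit"
    entry = {"refs": 0}
    db["by_meta"][meta] = entry
    db["by_hash"][digest] = entry
    return "miss"

r1 = backup_file("u1", "/a.txt", 100, b"hello")   # first backup: miss
r2 = backup_file("u1", "/a.txt", 100, b"hello")   # same metadata: meta-hit
r3 = backup_file("u2", "/b.txt", 200, b"hello")   # same content elsewhere: hash-hit
```

The cheap metadata check avoids hashing whole files on repeat backups; hashing is only done when the metadata lookup misses.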
If still not found, the file is chunked and compared in the memory region for block fingerprint sets. As shown in Fig. 5, the file is split into blocks of a reasonable size and the corresponding region is searched for each block's fingerprint set. If the set exists, a reference to the block is added to the database and the set's reference count is updated; if not, the block and its block fingerprint set are created in the system, and the new set is given a creation time, weight, and so on.
Before the fingerprint-set weights can be compared, each set's weight must be computed. The weight expresses how active a fingerprint set is and controls whether the set is cached in memory or on disk. Besides the initial weight and the time interval discussed above, the number of times a fingerprint set has been referenced is the most important factor influencing the weight. Let the initial weight be W0, the initial-weight decay rate be Rw, the fingerprint set's creation date be Dc, the current date be Dn, and the file's or block's reference count be Cn (initially 0). Then the weight of the current fingerprint set is:
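The formula itself appears as an image in the original patent and did not survive extraction. From the symbols just defined and the decay behavior described above, one consistent reading, offered only as a reconstruction and not as the patent's exact formula, is:

```latex
W_n \;=\; \max\!\bigl(W_0 - R_w\,(D_n - D_c),\; 0\bigr) \;+\; f(C_n)
```

Here the max term is the initial weight decayed at rate Rw over the age of the set, clamped at zero as the text requires, and f(Cn) is some increasing function of the reference count Cn; the text fixes the decay term but not the exact form of the citation term, so f is an assumption.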
To ease the reuse of fingerprint sets, the multi-level caching strategy caches sets evicted from memory onto disk, so that next time they can be read directly from disk without being rebuilt. Suppose the set to be cached is FPDn. If the memory cache is not full, FPDn is cached directly into its memory region. If the memory cache is full, a Min algorithm retrieves the lowest-weight set FPDmin cached in memory, and the weight Wn of FPDn is compared with the weight Wmin of FPDmin: if Wn > Wmin, FPDmin is displaced to disk and FPDn is cached in memory; if Wn < Wmin, FPDn is cached on disk.
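The Min-algorithm replacement just described can be sketched directly; the dictionary layout and `cache_fpd` name are assumptions for illustration.

```python
def cache_fpd(fpd, memory, disk, capacity):
    """Min-algorithm replacement from the embodiment: with memory full,
    the incoming set's weight Wn is compared against the lightest cached
    set's Wmin; whichever loses the comparison ends up on disk, where it
    is kept for later reuse rather than discarded."""
    if len(memory) < capacity:
        memory.append(fpd)
        return
    fpd_min = min(memory, key=lambda s: s["w"])   # FPDmin, lightest in memory
    if fpd["w"] > fpd_min["w"]:
        memory.remove(fpd_min)
        disk.append(fpd_min)                      # FPDmin displaced to disk
        memory.append(fpd)
    else:
        disk.append(fpd)

mem, dsk = [], []
cache_fpd({"id": "A", "w": 4}, mem, dsk, capacity=1)
cache_fpd({"id": "B", "w": 7}, mem, dsk, capacity=1)  # heavier: evicts A
cache_fpd({"id": "C", "w": 2}, mem, dsk, capacity=1)  # lighter: goes to disk
```

After the three insertions the single memory slot holds the heaviest set, and both displaced and rejected sets survive on disk, matching the reuse goal of the strategy.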
The above embodiments are preferred embodiments of the present invention, but the embodiments of the invention are not limited by them; any other change, modification, substitution, combination, or simplification made without departing from the spirit and principles of the invention shall be an equivalent replacement and fall within the scope of protection of the invention.
Claims (8)
1. A data deduplication method based on distributed in-memory computing, characterized by comprising the following steps in order:
S1. Create file-block fingerprint sets and cache them in distributed memory;
S2. Split files into blocks according to the optimal file-chunking strategy, compute each block's fingerprint, compare it against the fingerprint sets cached in memory, find the matching block, and add a corresponding reference to it;
S3. Store the block fingerprint sets under a multi-level caching strategy, heavy sets cached in memory and light sets cached on disk;
S4. Partition memory into multiple regions storing different types of fingerprint information, so that different fingerprint comparisons can be performed on a file;
The data duplicate removal method calculated based on distributed memory is to be based on Spark systems, specific as follows:
Spark systems carry out the structure of fingerprint collection FPD, and it is two class of file-level fingerprint collection and piecemeal grade fingerprint collection, the text that FPD, which is divided to,
Part grade fingerprint collection and piecemeal grade fingerprint collection all include piecemeal respective path, piecemeal modification time, piecemeal HASH values, and fingerprint collection creates
Time, fingerprint collection citation times and fingerprint collects weights;
On the one hand make each new fingerprint collection original state consistent for the unified initial weight of its addition for the fingerprint collection newly created,
On the other hand for controlling whether the fingerprint collection created can be cached in memory;According to the strategy of LRU, using time interval as
The decay factor of one influence factor of weights, initial weight is used as fingerprint collection during this period of time with the product of time interval
The decrement of initial weight, until the initial weight is reduced to zero, decay factor can just fail;
When occurring Backup Data in Spark systems or uploading file operation, need to carry out duplicate removal processing, job management to data
In certain host establishment file duplicate removal task;The duplicate removal processing based on file-level, the position compared in memory are carried out to file first
It is the region of storage file grade finger print information;When carrying out upload operation, Hashization directly is carried out to file and is taken the fingerprint information
Then it is compared;When carrying out data backup operation, this document owning user, file path, filemodetime are first obtained
Information whether there is according to information searching file, and if the reference of this document is added in database in the presence of if, while changing should
The citation times of fingerprint collection;Hashization is carried out to file again if not finding, whether there is according to hash value locating file, if depositing
Then adding corresponding reference;
If do not found, piecemeal processing is carried out to file, and compared in the region of memory of memory partitioning fingerprint collection;To file
Piecemeal, lookup fingerprint collection whether there is in corresponding region, if the reference of the piecemeal is added in database in the presence of if, simultaneously
The citation times for changing fingerprint collection create the piecemeal and piecemeal fingerprint collection in systems if being not present, and are the new finger
Line collection adds creation time, weights;
Before comparing each fingerprint collects weights, the calculating to each fingerprint collects weights, weights is needed to indicate the active degree of the fingerprint collection,
It is cached in memory or on disk to control fingerprint collection;Initial weight Wo, initial weight rate of decay are Rw, fingerprint collection
Date created is Dc, and current date Dn, file or blocks of files citation times are Cn, and initial number of quoting is 0, then currently refers to
The weights of line collection are:
A multi-level caching strategy is used: fingerprint sets displaced from memory are cached to disk, so that the next access reads them directly from disk without rebuilding the fingerprint set. Suppose the set to be cached is FPDn. If the memory cache is not full, FPDn is cached directly in the corresponding memory region. If the memory cache is full, the minimum-weight fingerprint set FPDmin cached in memory is located with the Min algorithm, and the weight Wn of FPDn is compared with the weight Wmin of FPDmin: if Wn>Wmin, FPDmin is displaced to disk and FPDn is cached in memory; if Wn<Wmin, FPDn is cached on disk.
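The placement rule just described can be sketched directly. This is an illustration under assumptions: the two tiers are modelled as plain dicts of name -> weight, and "Min algorithm" is taken to mean a minimum-weight scan.

```python
def cache_fingerprint_set(memory, disk, fpd_n, weight, capacity):
    """Sketch of the two-level placement rule: `memory` holds at most
    `capacity` fingerprint sets; on overflow the minimum-weight resident
    set is compared against the newcomer. `memory` and `disk` as
    name -> weight dicts are illustrative assumptions."""
    if len(memory) < capacity:                 # cache not full: keep in memory
        memory[fpd_n] = weight
        return
    fpd_min = min(memory, key=memory.get)      # "Min algorithm": lowest weight
    if weight > memory[fpd_min]:               # newcomer wins: displace to disk
        disk[fpd_min] = memory.pop(fpd_min)
        memory[fpd_n] = weight
    else:                                      # newcomer loses: straight to disk
        disk[fpd_n] = weight
```

Note that the displaced set keeps its weight on disk, so a later reference can promote it back without rebuilding the fingerprint set, as the claim requires.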
2. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that it further comprises: after the file-block fingerprint set is created, assigning an initial fingerprint-set weight to the created file-block fingerprint set.
3. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the initial fingerprint-set weight decays gradually over time until it reaches zero.
4. The data deduplication method based on distributed in-memory computing according to claim 2, characterized in that the detailed information of a file-block fingerprint set comprises: the block's path, the block creation time, the block HASH value, the fingerprint-set creation time, the fingerprint-set reference count, and the fingerprint-set weight. The fingerprint-set weight is jointly determined by the initial fingerprint-set weight, the fingerprint-set reference count, and the fingerprint-set creation time; the initial weight unifies the starting state of fingerprint sets, the creation time drives the decay of the weight, and the reference count indicates how active the fingerprint set is.
5. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that step S2 specifically comprises the following steps:
S201. computing the fingerprint value of the file block to be compared and comparing it against the fingerprint sets cached in distributed memory, the comparison order being arranged by descending fingerprint-set weight so that higher-weight fingerprint sets are compared first;
S202. if no matching fingerprint set is found in distributed memory, reading the uncached fingerprint sets from disk and completing the comparison there, likewise in descending weight order;
S203. if an identical fingerprint is found in a fingerprint set in memory or on disk, adding a reference from that fingerprint set to the block, and updating the fingerprint set's reference count and weight;
S204. if no match is found, creating a new fingerprint set, initializing all of its information, and adding a reference from the block to the new fingerprint set.
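The lookup order of steps S201-S202 can be sketched as a two-tier scan. The representation of each tier as a list of (weight, set-of-hashes) pairs is an assumption made for the example, not the patent's data structure.

```python
def match_fingerprint(block_hash, memory_sets, disk_sets):
    """Sketch of the claim's lookup order: the memory tier is scanned
    first, then the disk tier, each in descending weight so that hot
    fingerprint sets are compared first. Each tier is assumed to be a
    list of (weight, set_of_hashes) pairs for illustration."""
    for tier in (memory_sets, disk_sets):
        # S201/S202: compare in descending weight order within the tier
        for weight, hashes in sorted(tier, key=lambda t: t[0], reverse=True):
            if block_hash in hashes:
                return hashes          # S203: caller adds a reference here
    return None                        # S204: caller creates a new fingerprint set
```

Returning the matching set (or None) leaves the reference bookkeeping of S203/S204 to the caller, which mirrors how the claim separates lookup from reference creation.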
6. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S3 the multi-level caching strategy comprises: if not all fingerprint sets can be cached in memory, using disk as a second-level cache for fingerprint sets; the fingerprint sets are arranged by descending weight and cached into memory in that order, and the fingerprint sets that cannot be cached in memory are cached on disk.
7. The data deduplication method based on distributed in-memory computing according to claim 6, characterized in that the multi-level caching strategy further comprises: when a new fingerprint set is created or an existing fingerprint set is matched, deciding according to its weight whether the fingerprint set displaces one in memory or is cached directly on disk.
8. The data deduplication method based on distributed in-memory computing according to claim 1, characterized in that in step S4 the memory is divided into multiple regions, specifically: the memory used for caching fingerprint sets is divided into two parts, one part caching file-level fingerprint-set information and the other part caching block-level fingerprint-set information; fingerprint sets of either level that cannot be cached in their own part of memory can only be cached on disk, and may not occupy the other part's memory.
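The partitioning rule of claim 8 can be sketched as two fixed-capacity regions that spill to disk independently. The class name, the dict-based regions, and the capacities are illustrative assumptions.

```python
class PartitionedCache:
    """Sketch of claim 8's split: separate fixed regions for file-level
    and block-level fingerprint sets; a full region spills to disk only,
    and never borrows space from the other region. Structures and
    capacities are assumptions for illustration."""

    def __init__(self, file_cap, block_cap):
        self.regions = {"file": ({}, file_cap), "block": ({}, block_cap)}
        self.disk = {}

    def put(self, kind, name, weight):
        region, cap = self.regions[kind]
        if len(region) < cap:
            region[name] = weight     # fits in its own region
        else:
            self.disk[name] = weight  # spill to disk; other region untouched
```

With capacity 1 per region, a second file-level set spills to disk even while the block-level region is still empty, which is exactly the "may not occupy the other part's memory" constraint.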
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510670867.2A CN105354246B (en) | 2015-10-13 | 2015-10-13 | A kind of data duplicate removal method calculated based on distributed memory |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105354246A CN105354246A (en) | 2016-02-24 |
CN105354246B true CN105354246B (en) | 2018-11-02 |
Family
ID=55330219
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510670867.2A Active CN105354246B (en) | 2015-10-13 | 2015-10-13 | A kind of data duplicate removal method calculated based on distributed memory |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105354246B (en) |
Families Citing this family (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106372105A (en) * | 2016-08-19 | 2017-02-01 | 中国科学院信息工程研究所 | Spark platform-based microblog data preprocessing method |
CN107368545B (en) * | 2017-06-28 | 2019-08-27 | 深圳神州数码云科数据技术有限公司 | A kind of De-weight method and device based on Merkle Tree deformation algorithm |
CN107273536A (en) * | 2017-06-30 | 2017-10-20 | 郑州云海信息技术有限公司 | A kind of repeated data determines method, system and distributed memory system |
CN107329846B (en) * | 2017-07-11 | 2020-06-12 | 深圳市信义科技有限公司 | Big finger data comparison method based on big data technology |
CN109144417A (en) * | 2018-08-16 | 2019-01-04 | 广州杰赛科技股份有限公司 | A kind of cloud storage method, system and equipment |
CN109240605B (en) * | 2018-08-17 | 2020-05-19 | 华中科技大学 | Rapid repeated data block identification method based on 3D stacked memory |
CN109189577B (en) * | 2018-08-31 | 2020-05-19 | 武汉达梦数据库有限公司 | Method and device for preventing memory overflow during data synchronization |
CN109241023A (en) * | 2018-09-21 | 2019-01-18 | 郑州云海信息技术有限公司 | Distributed memory system date storage method, device, system and storage medium |
CN109522305B (en) * | 2018-12-06 | 2021-02-02 | 北京千方科技股份有限公司 | Big data deduplication method and device |
CN110147331B (en) * | 2019-05-16 | 2021-04-02 | 重庆大学 | Cache data processing method and system and readable storage medium |
CN111444167A (en) * | 2020-03-25 | 2020-07-24 | 厦门市美亚柏科信息股份有限公司 | Method, device and storage medium for removing duplicate data based on data abstract |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101079034A (en) * | 2006-07-10 | 2007-11-28 | 腾讯科技(深圳)有限公司 | System and method for eliminating redundancy file of file storage system |
CN101706825A (en) * | 2009-12-10 | 2010-05-12 | 华中科技大学 | Replicated data deleting method based on file content types |
EP2557514A1 (en) * | 2011-08-12 | 2013-02-13 | Nexenta Systems, Inc. | Cloud Storage System with Distributed Metadata |
CN104869140A (en) * | 2014-02-25 | 2015-08-26 | 阿里巴巴集团控股有限公司 | Multi-cluster system and method for controlling data storage of multi-cluster system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2013119201A1 (en) * | 2012-02-06 | 2013-08-15 | Hewlett-Packard Development Company, L.P. | De-duplication |
- 2015-10-13: Application filed in China (CN201510670867.2A, patent CN105354246B); status: Active
Also Published As
Publication number | Publication date |
---|---|
CN105354246A (en) | 2016-02-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105354246B (en) | A kind of data duplicate removal method calculated based on distributed memory | |
US10268697B2 (en) | Distributed deduplication using locality sensitive hashing | |
US10761758B2 (en) | Data aware deduplication object storage (DADOS) | |
US8930648B1 (en) | Distributed deduplication using global chunk data structure and epochs | |
Fu et al. | A scalable inline cluster deduplication framework for big data protection | |
US9424274B2 (en) | Management of intermediate data spills during the shuffle phase of a map-reduce job | |
CN104199815B (en) | The method and system of summary storage consumption is reduced in data deduplication system | |
US20150127621A1 (en) | Use of solid state storage devices and the like in data deduplication | |
Lee et al. | Large-scale incremental processing with MapReduce | |
CN110162528A (en) | Magnanimity big data search method and system | |
US8874860B2 (en) | Logical buffer pool extension | |
CN103365954A (en) | Method and system for increasing in-line deduplication efficiency | |
EP3379415B1 (en) | Managing memory and storage space for a data operation | |
Niazi et al. | Size matters: Improving the performance of small files in hadoop | |
CN108089816A (en) | A kind of query formulation data de-duplication method and device based on load balancing | |
CN108021333A (en) | The system of random read-write data, device and method | |
Ciritoglu et al. | Hard: a heterogeneity-aware replica deletion for hdfs | |
Liu et al. | Hadoop based scalable cluster deduplication for big data | |
Gupta et al. | An efficient approach for storing and accessing small files with big data technology | |
US10114878B2 (en) | Index utilization in ETL tools | |
CN107357921A (en) | A kind of small documents storage localization method and system | |
Tang et al. | Tuning object-centric data management systems for large scale scientific applications | |
EP3832476A1 (en) | Accelerated and memory efficient similarity matching | |
Prabavathy et al. | Multi-index technique for metadata management in private cloud storage | |
Ge et al. | Cinhba: A secondary index with hotscore caching policy on key-value data store |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||