CN103631933B - Distributed duplication elimination system-oriented data routing method

Info

Publication number: CN103631933B
Application number: CN201310655727.9A
Authority: CN
Inventors: 刘厚贵; 邢晶; 霍志刚; 安学军
Original assignee: Institute of Computing Technology of CAS
Current assignee: Institute of Computing Technology of CAS
Priority date: 2013-12-06
Filing date: 2013-12-06
Publication date: 2017-04-12
Anticipated expiration: 2033-12-06
Also published as: CN103631933A

Abstract

The invention provides a data routing method for a distributed deduplication system, comprising the following steps: a server classifies the fingerprints of all data blocks constituting the data and sends the fingerprints of different classes to different summary storage nodes, each of which stores the data summaries of the fingerprints of the corresponding class; the received fingerprints are queried in the summary storage nodes to obtain the hit score of each fingerprint at each deduplication node; the hit scores are returned to the server; and the server obtains an aggregate score for each deduplication node from the hit scores of the fingerprints at that node, and determines a target deduplication node by combining the aggregate scores with the storage conditions of the deduplication nodes. The method balances the deduplication effect against storage utilization, effectively suppresses the communication and computation overhead of fingerprint queries, and improves the scalability of data routing in a distributed deduplication system.

Description

A data routing method for a distributed deduplication system

Technical field

The present invention relates generally to data deduplication technology, and in particular to a data routing method for a distributed deduplication system.

Background art

Since mankind entered the digital information age, vast amounts of information have been recorded as data. From basic necessities such as clothing, food, housing and transportation to education, medical care and commerce, and from the traditional Internet to the mobile Internet that grew up with smartphones, more and more people and devices take part in creating data, and the total amount of data produced each year is growing explosively. At the same time, because data holds potential commercial and scientific value, more and more data is being recorded and stored. A research report by International Data Corporation (IDC) points out that 1.8 ZB of data was created and replicated worldwide in 2011 and that, following this trend, the figure would approach 8 ZB by 2015. IDC's research also found that nearly 75% of the data in the digital world is duplicated. Similarly, the Enterprise Strategy Group (ESG) points out that data redundancy in backup and archival storage systems exceeds 90%.

Optimizing storage with data deduplication technology can effectively reduce the disk space that data occupies. However, in the face of rapidly growing data, a single deduplication server can hardly meet scalability requirements. Cluster deduplication technology emerged as a result. A cluster deduplication system, also called a distributed deduplication system, raises the processing capacity of the deduplication service by distributing deduplication tasks to different server nodes. In a cluster deduplication system, in addition to deduplication within a single node, the data routing mechanism that distributes data to the deduplication server nodes must also be considered, because the data routing mechanism determines the balance between the system's overall deduplication effect and its storage utilization.

At present, data routing in distributed deduplication systems falls into two methods, according to whether existing data (data already stored) is consulted. The first is stateless data routing, which consults only the fingerprint information of the current data and, according to a fixed mapping rule, distributes the data to different deduplication server nodes (deduplication nodes for short) for deduplication. Here a fingerprint (FP) is used to judge whether a data block constituting the data is a duplicate; the fingerprint of a data block is generally computed with the SHA1 or MD5 function. This routing method is simple to implement and has small time and space overhead. It has two shortcomings, however. First, because stored data is not consulted, the deduplication rate of the data at the target deduplication node cannot be guaranteed. Second, because the current storage utilization of the deduplication nodes is not considered and the deduplication effect of data differs between deduplication server nodes, a data silo problem arises, that is, one deduplication server node stores far more data than the others.

The second is stateful data routing, which routes data by consulting both the data summaries of the data already in the system and the storage condition of the deduplication nodes. Here a data summary is obtained by inserting the fingerprints of the data blocks constituting the data into a Bloom Filter (BF). Briefly, the method first accesses the summary storage node, which stores the data summary of every deduplication node, queries a fingerprint to obtain its hit score at each deduplication node, and then selects the target deduplication node by also taking the storage utilization of the deduplication nodes into account. The advantage of this method is that it ensures the overall deduplication effect of the distributed deduplication system while balancing the storage utilization of the deduplication nodes. Its drawback is that it requires an additional summary storage node for queries and that the memory overhead of the data summaries is very large, so it is difficult for this method to scale well.

It can be seen that how to improve the scalability of data routing in a distributed deduplication system, and to suppress the growth of communication and computation overhead during fingerprint queries, while still balancing deduplication effect and storage utilization, remains an unsolved problem.

Summary of the invention

To solve the above problems, the present invention provides a data routing method for a distributed deduplication system, wherein the distributed deduplication system comprises summary storage nodes, deduplication nodes, and a server that communicates with the other nodes in the system, the method comprising:

Step 1): the server classifies the fingerprints of all data blocks constituting the data, and sends the fingerprints of different classes to the different summary storage nodes that store the data summaries of the fingerprints of the respective classes;

Step 2): the received fingerprints are queried in the summary storage node to obtain the hit score of each fingerprint at each deduplication node, and the hit scores are returned to the server;

Step 3): the server obtains an aggregate score for each deduplication node from the hit scores of the fingerprints at each deduplication node, and determines the target deduplication node according to the aggregate scores.

In one embodiment, in step 3), determining the target deduplication node according to the aggregate scores includes: the server determines the target deduplication node by combining the storage condition of each deduplication node with the aggregate scores.

In one embodiment, each summary storage node stores, for each deduplication node, the data summary of one class of the fingerprints of all data blocks stored by that deduplication node, wherein the total number of fingerprint classes equals the number of summary storage nodes.

In one embodiment, the summary storage node uses Bloom Filters to store the data summary of each deduplication node.

In one embodiment, in step 1), the server takes each fingerprint of the data blocks constituting the data modulo the number of summary storage nodes, and places fingerprints with the same remainder in the same class.

In one embodiment, step 2) includes:

Step 21): in the summary storage node, the hash values of the received fingerprints are computed using the hash functions adopted by the Bloom Filters that store the data summary of each deduplication node;

Step 22): the corresponding bits of the Bloom Filter of each deduplication node are queried according to the hash values;

Step 23): the hit scores are computed from the corresponding bits;

Step 24): the hit scores are returned to the server.

In one embodiment, step 3) includes:

Step 31): for each deduplication node, the server computes the sum of the hit scores of all fingerprints at that deduplication node to obtain the aggregate score of that deduplication node;

Step 32): for each deduplication node, the server computes a weighted sum of the aggregate score and the reciprocal of the storage utilization, and the deduplication node with the largest value is taken as the target deduplication node.

In one embodiment, the method further includes:

Step 0): the server receives data from a client, divides the data into blocks, and computes the fingerprint of each data block.

In one embodiment, the method further includes:

Step 4): the server sends the data to the target deduplication node for deduplication.

In one embodiment, the method further includes:

Step 5): the summary storage node updates the data summary of the target deduplication node.

The present invention achieves the following beneficial effects:

Using multiple summary storage nodes to store the data summaries used in data routing solves the problem of insufficient memory in a single summary storage node. At the same time, because the data summaries are stored by class using the distributed Bloom Filter storage method, the growth of communication and computation overhead during fingerprint queries is effectively suppressed even as summary storage nodes are added, which improves the scalability of data routing.

Brief description of the drawings

Fig. 1 is a flow chart of the data routing method for a distributed deduplication system according to an embodiment of the invention;

Fig. 2 is a schematic diagram of the distributed Bloom Filter storage method according to an embodiment of the invention;

Fig. 3 is a schematic diagram of splitting an original Bloom Filter into smaller Bloom Filters according to an embodiment of the invention;

Fig. 4 is a schematic diagram of the data structure of a Bloom Filter according to an embodiment of the invention; and

Fig. 5 is a block diagram of the functional modules of a summary storage node according to an embodiment of the invention.

Detailed description

The present invention is described below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.

According to one embodiment of the invention, a data routing method for a distributed deduplication system is provided. Briefly, the method includes: using multiple nodes as summary storage nodes, with each summary storage node storing the data summaries of one class of fingerprints; sending the fingerprints of all data blocks of the data to be deduplicated, by class, to the corresponding summary storage nodes, and querying the fingerprints in the summary storage nodes to obtain each fingerprint's hit score for each deduplication node; then aggregating, for each deduplication node, the hit scores of all fingerprints of the data, and determining the target deduplication node from the aggregate scores and the storage condition of the deduplication nodes, so that deduplication can be performed. A detailed description follows with reference to Fig. 1.

I. Storing data summaries in a distributed manner

Setting up multiple nodes as summary storage nodes is a prerequisite for determining the target deduplication node. To improve the scalability of data routing in a distributed deduplication system, the memory shortage of the single summary storage node in existing systems must be solved, that is, multiple nodes must be used to store the data summaries. If the data summaries were simply spread across different nodes, however, the server could end up repeatedly broadcasting when it queries fingerprints, because to obtain one query result it would need to send the fingerprint to every summary storage node for querying.

To overcome the communication and computation overhead of such naive distributed storage of data summaries, in one embodiment the data summaries may be stored using a distributed Bloom Filter storage method. This storage method relies on a property of the Bloom Filter itself: for a given false positive rate (for example 1%) and a given number of hash functions, the size of a Bloom Filter is proportional to the number of elements inserted.
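As an illustration of this proportionality, the following minimal Python sketch sizes a Bloom Filter for a fixed false positive rate and a fixed number of hash functions. It relies on the standard approximation p ≈ (1 − e^(−kn/m))^k, which this document does not state; the function name `bloom_bits` and the element counts are assumptions chosen only for the example.

```python
import math

def bloom_bits(n_items: int, fp_rate: float, n_hashes: int) -> int:
    """Bits needed for a filter with n_hashes hash functions to hold
    n_items elements at roughly fp_rate (standard approximation)."""
    # p ~= (1 - exp(-k*n/m))**k  =>  m = -k*n / ln(1 - p**(1/k))
    bits_per_item = -n_hashes / math.log(1.0 - fp_rate ** (1.0 / n_hashes))
    return math.ceil(bits_per_item * n_items)

# Hypothetical workload: 9 million fingerprints, 1% false positive rate, 7 hashes.
total_bits = bloom_bits(9_000_000, 0.01, 7)
# The same fingerprints spread evenly over 3 summary storage nodes.
per_node_bits = bloom_bits(3_000_000, 0.01, 7)

print(total_bits, per_node_bits)
```

Under these assumptions, a filter that holds one third of the fingerprints needs about one third of the bits, which is why one large filter can be replaced by m smaller filters without increasing total memory.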

In one embodiment, the Bloom Filters may be created in each summary storage node. In another embodiment, an existing distributed deduplication system may be modified so that the Bloom Filters of each summary storage node in the present invention are derived from the Bloom Filter of the summary storage node in the existing system, as follows:

1. The relatively large Bloom Filter of the existing system is split, according to the number of summary storage nodes, into several relatively small Bloom Filters, and these smaller Bloom Filters are stored across the different summary storage nodes. Referring to Fig. 2, the single summary storage node of the existing system stores Bloom Filters (denoted BF node 0 to BF node N) holding the data summaries of N+1 deduplication nodes (the data summaries of all fingerprints in Fig. 2). For a system with m summary storage nodes, each original Bloom Filter used in the existing system to store the data summary of a deduplication node can be split into m smaller Bloom Filters, each of which stores the data summary of a different class of fingerprints; each summary storage node then stores the corresponding smaller Bloom Filters of the N+1 deduplication nodes (still denoted BF node 0 to BF node N). In the embodiment shown in Fig. 2, summary node 0 stores the data summaries of those fingerprints of each deduplication node whose remainder modulo m is 0, summary node 1 stores the data summaries of those fingerprints of each deduplication node whose remainder modulo m is 1, and so on. In the embodiment shown in Fig. 3, the number of summary storage nodes is 3, so one larger Bloom Filter of the system can be divided into 3 Bloom Filters of equal size. Fig. 4 shows the data structure of a Bloom Filter, in which asize denotes the size of the Bloom Filter and the pointer a points to the space allocated for the Bloom Filter.

2. Each summary storage node stores the data summaries of one class of fingerprints. The number of fingerprint classes equals the number of summary storage nodes. In one embodiment, the fingerprints may be classified by taking a modulus. It should be understood that other existing classification methods may also be used.
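The modulo classification described above can be sketched as follows. This is a minimal illustration, assuming SHA1 fingerprints interpreted as big-endian integers; the function name `classify_fingerprints` and the sample blocks are hypothetical.

```python
import hashlib
from collections import defaultdict

def classify_fingerprints(fingerprints, n_summary_nodes):
    """Group fingerprints by (fingerprint mod n_summary_nodes).

    The remainder doubles as the index of the summary storage node
    that holds the data summaries for that class.
    """
    classes = defaultdict(list)
    for fp in fingerprints:
        node_id = int.from_bytes(fp, "big") % n_summary_nodes
        classes[node_id].append(fp)
    return dict(classes)

# Hypothetical data blocks; the first and last are identical.
blocks = [b"block-a", b"block-b", b"block-c", b"block-a"]
fps = [hashlib.sha1(b).digest() for b in blocks]
by_node = classify_fingerprints(fps, 3)
```

Because the class depends only on the fingerprint, identical blocks always map to the same summary storage node, so a fingerprint never has to be broadcast to every node.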

The benefit of storing the data summaries in this distributed manner is that, when fingerprints need to be queried in the summary storage nodes, the system sends different fingerprints to different summary storage nodes according to their class, which avoids broadcast communication. In addition, during a fingerprint query the hash functions of each fingerprint need to be computed only once, which avoids repeating the computation for the same fingerprint.

II. Querying the summary storage nodes

In general, after the coordinating server (which knows the node structure of the system and can interact with any summary storage node or deduplication node) receives data to be deduplicated from a client, it divides the data into blocks, computes the fingerprint of each data block, classifies the fingerprints, and sends each class of fingerprints to the corresponding summary storage node. The received fingerprints are then queried in the summary storage node to obtain each fingerprint's hit score for each deduplication node.

According to one embodiment of the invention, querying the summary storage nodes before distributing the data to the target deduplication node makes a higher deduplication rate achievable. Once the data summaries are stored using distributed Bloom Filters, the procedure for querying the summary storage nodes differs from the prior art and specifically includes:

1. The server divides the data into blocks and computes a fingerprint for each data block. The usual approach is to compute the fingerprint with SHA1 or MD5.

2. The fingerprints are classified according to the number of summary storage nodes. For example, each fingerprint is taken modulo the number of summary storage nodes, and fingerprints with the same remainder are placed in the same class.

3. The classified fingerprints are sent to their corresponding summary storage nodes for querying, which avoids the broadcast sending of the original storage scheme. In each summary storage node, the hash functions are computed for the fingerprints and the Bloom Filters are queried, and the resulting hit scores are returned to the server. In one embodiment, the query procedure includes: in the summary storage node, computing the hash values of a received fingerprint using the hash functions adopted by the Bloom Filters that store the data summaries; then querying the corresponding bits of each deduplication node's Bloom Filter according to those hash values; if the corresponding bits are all 1 (there is generally more than one hash function), the fingerprint hits and the hit score is incremented by 1.

4. The hit scores are returned to the server. Fig. 5 shows the functional modules of a summary storage node: the request queue buffers the requests the server sends to the summary storage node, the BF manager manages the Bloom Filters in the node, and the summary storage node as a whole exposes Bloom Filter query and update functions.
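The query procedure in items 3 and 4 can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the class name `SummaryNode`, the filter size, the number of hash functions, and the use of salted SHA1 digests as the hash family are all illustrative choices that the document does not specify.

```python
import hashlib

class SummaryNode:
    """One summary storage node holding one Bloom Filter per deduplication node."""

    def __init__(self, n_dedup_nodes, n_bits=1 << 16, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        # All filters share size and hash functions, so positions are reusable.
        self.filters = [bytearray(n_bits // 8) for _ in range(n_dedup_nodes)]

    def _positions(self, fp):
        # Assumed hash family: SHA1 of a one-byte salt followed by the fingerprint.
        return [int.from_bytes(hashlib.sha1(bytes([i]) + fp).digest()[:8], "big")
                % self.n_bits for i in range(self.n_hashes)]

    def insert(self, dedup_id, fp):
        for p in self._positions(fp):
            self.filters[dedup_id][p >> 3] |= 1 << (p & 7)

    def query(self, fingerprints):
        """Return the hit score of this batch for every deduplication node."""
        scores = [0] * len(self.filters)
        for fp in fingerprints:
            positions = self._positions(fp)  # hashed once per fingerprint
            for dedup_id, bits in enumerate(self.filters):
                if all(bits[p >> 3] & (1 << (p & 7)) for p in positions):
                    scores[dedup_id] += 1    # all bits set: count a hit
        return scores

node = SummaryNode(n_dedup_nodes=2)
known = hashlib.sha1(b"stored block").digest()
node.insert(1, known)              # deduplication node 1 already stores this block
scores = node.query([known])       # hit scores returned to the server
```

Computing the bit positions once and testing them against every deduplication node's filter reflects the point made earlier that each fingerprint's hash functions need to be evaluated only once per query.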

III. Determining the target deduplication node

After receiving the hit scores returned by the summary storage nodes, the server adds up, for each deduplication node, the hit scores of all fingerprints to obtain that node's aggregate score. The aggregate score gives the server an estimate of the expected deduplication effect; for the system as a whole, however, the imbalance of storage utilization among deduplication nodes must be considered in addition to the deduplication effect.

Therefore, in one embodiment, before sending the data to the target deduplication node, the server also takes the storage utilization of each deduplication node into account. In one embodiment, the server may maintain a table recording the storage utilization of each deduplication node; at regular intervals the server queries each deduplication node for its storage condition and then updates the storage utilization table. The data structure of the storage utilization table may be as shown in Table 1, where deduperID is the identifier of a deduplication node and Container_num is the number of objects stored by that node; the number of objects reflects how much of the node's storage is occupied.

Table 1

From the storage condition of the deduplication nodes recorded in this table and the aggregate scores of the deduplication nodes, the server obtains a composite score, which is used to determine the target deduplication node to which the data is sent. For example, a composite score is obtained as a weighted sum of a deduplication node's aggregate score and the reciprocal of its storage utilization, and the deduplication node with the largest value is taken as the target deduplication node. As another example, among the nodes whose aggregate score exceeds a certain threshold, the server may select the node with the lowest storage utilization as the target deduplication node.
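Both selection rules can be sketched as follows. The document does not give weights, a threshold, or how an empty node is handled, so the weights `w_score` and `w_space`, the `+ 1` guard against division by zero, the fallback when no node passes the threshold, and the sample numbers are all assumptions for illustration. Storage utilization is represented here by the Container_num value of each deduplication node.

```python
def choose_target(agg_scores, containers, w_score=1.0, w_space=1000.0):
    """Weighted sum of aggregate score and reciprocal storage utilization.

    agg_scores: {deduperID: aggregate score}
    containers: {deduperID: Container_num}, used as the utilization measure
    """
    def composite(node_id):
        # "+ 1" avoids dividing by zero for an empty node (assumption).
        return w_score * agg_scores[node_id] + w_space / (containers[node_id] + 1)
    return max(agg_scores, key=composite)

def choose_target_by_threshold(agg_scores, containers, min_score):
    """Among nodes scoring above min_score, pick the least utilized one."""
    candidates = [n for n, s in agg_scores.items() if s > min_score]
    if not candidates:                 # fallback is an assumption
        candidates = list(agg_scores)
    return min(candidates, key=lambda n: containers[n])

# Hypothetical state for three deduplication nodes.
agg_scores = {0: 40, 1: 35, 2: 5}
containers = {0: 900, 1: 100, 2: 50}

target = choose_target(agg_scores, containers)
target_alt = choose_target_by_threshold(agg_scores, containers, min_score=30)
```

With these sample values both rules pick node 1: node 0 has a slightly higher aggregate score but is far fuller, while node 2 is nearly empty but would deduplicate little of the data.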

IV. The server sends the data to the target deduplication node for deduplication

In this step, the server sends the data received from the client to the target deduplication node, and the target deduplication node removes the duplicate data. This step can be carried out using data deduplication techniques well known in the art.

V. Updating the Bloom Filters in the summary storage nodes

In one embodiment, the server may send the identifier of the target deduplication node to the summary storage nodes. Each summary storage node then updates its Bloom Filter for the target deduplication node: during the update it computes the hash values of the fingerprints and sets the corresponding bits of the Bloom Filter according to those hash values. This completes the data routing work.

A specific embodiment of the data routing method for a distributed deduplication system provided herein is given below and includes the following steps:

1. The server receives data from the client, and divides the data into blocks and computes the fingerprints of the data blocks by calling the dedup_get_chunked function.

2. After the fingerprints have been computed, the fingerprints are classified according to the number of summary storage nodes, and the dedup_server_send function is called to send the fingerprints.

3. A summary storage node receives the fingerprint data and calls Bloom_query to look up the corresponding fingerprint information. Bloom_query calls the corresponding hash functions to obtain the hash values and then calls Bloom_check to query whether the corresponding bits of the Bloom Filter are set, thereby recording whether the fingerprint scores a hit. When the query is complete, the query result is sent back to the server.

4. The server receives the results and aggregates them. Because the fingerprints were sent by class before the query, the scores received are the results of per-class queries, so the scores of each node need to be aggregated. After the aggregate scores are obtained, the server looks up the storage utilization table and derives the final composite scores with a weighting algorithm. The highest composite score is chosen, and the corresponding deduplication node is the target deduplication node for routing the data.

5. The dedup_server_send function is called to send the data and fingerprints to the target deduplication node for deduplication.

6. After deduplication is complete, the server sends the identifier of that deduplication node to the summary storage nodes. Each summary storage node calls the Bloom_update function to update the Bloom Filter in the node; the corresponding hash values are likewise computed during the update, and Bloom_set_bit is called to set the corresponding bits according to the hash values.

The relevant functions are described as follows:

Function dedup_get_chunked (segement)

1. Call the Rabin function to determine the boundaries of the data blocks, completing variable-length chunking.

2. Call the SHA1 function to compute the fingerprint of each data block.

3. Combine the information of all data blocks and return it.
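The three steps of dedup_get_chunked can be sketched as follows. Note that this sketch substitutes a simple polynomial rolling hash over a sliding window for a true Rabin fingerprint, to keep the example short; the window size, mask, and chunk size limits are toy values, and the function name `chunk_and_fingerprint` is hypothetical.

```python
import hashlib

WINDOW = 16                      # sliding window length in bytes
BASE, MOD = 257, (1 << 31) - 1   # rolling hash parameters
MASK = (1 << 6) - 1              # boundary when the low 6 hash bits are zero
MIN_CHUNK, MAX_CHUNK = 32, 256   # toy limits on chunk length

def chunk_and_fingerprint(data: bytes):
    """Split data at content-defined boundaries; return (sha1, block) pairs."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)   # weight of the byte leaving the window
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            # Drop the oldest byte of the window before adding the new one.
            h = (h - data[i - WINDOW] * top) % MOD
        h = (h * BASE + byte) % MOD
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            block = data[start:i + 1]
            chunks.append((hashlib.sha1(block).digest(), block))
            start, h = i + 1, 0        # restart the hash for the next chunk
    if start < len(data):              # trailing partial chunk
        block = data[start:]
        chunks.append((hashlib.sha1(block).digest(), block))
    return chunks

# Deterministic sample input built from SHA1 digests (2000 bytes).
sample = b"".join(hashlib.sha1(bytes([i])).digest() for i in range(100))
result = chunk_and_fingerprint(sample)
```

Because boundaries depend on content rather than fixed offsets, inserting bytes near the start of a file tends to shift only nearby boundaries, which is the usual reason for preferring variable-length chunking in deduplication.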

Function dedup_server_send (data, length)

1. Place data in the send buffer and record its length, length.

2. Send the data and wait for a reply.

3. Receive the reply and return.

Function Bloom_query (sha)

1. Call the hash functions to compute the hash values of the fingerprint.

2. Call the Bloom_check function to query whether the corresponding bits are all 1; if not, the function returns 0.

3. The function returns 1.

Function Bloom_check (bloom, n ...)

1. Call the va_start function to obtain the address of the first variable argument.

2. Call va_arg to fetch an argument and check whether the bit to be tested is 0; if so, return 0.

3. Go back to step 2 until all arguments have been consumed.

4. The function returns 1.

Function Bloom_update (sha_cache, bloom_id)

1. Take a fingerprint out of the fingerprint cache sha_cache.

2. Call the hash functions to compute the hash values of the fingerprint.

3. Call Bloom_check to test whether the corresponding bits are 1; if they are not, call the Bloom_set_bit function to set those bits to 1.

4. Check whether the cache is empty; if it is, the function returns, otherwise go to step 1.

Function Bloom_set_bit (bloom, n ...)

1. Call the va_start function to obtain the address of the first variable argument.

2. Call va_arg to fetch an argument and set the bit that the argument specifies.

3. Go back to step 2 until all arguments have been consumed.

4. The function returns 1.
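The bit-level helpers described above can be sketched in Python, where `*positions` plays the role of the variable argument list walked with va_start and va_arg. This is a minimal sketch with several assumptions: the filter is a `bytearray`, `bloom_update` receives the filter itself rather than a bloom_id, and the helper `hash_positions` (salted SHA1, three hash functions) is a hypothetical stand-in for the unspecified hash functions.

```python
import hashlib

def hash_positions(fp: bytes, n_bits: int, n_hashes: int = 3):
    """Assumed hash family: bit positions from salted SHA1 digests."""
    return [int.from_bytes(hashlib.sha1(bytes([i]) + fp).digest()[:8], "big")
            % n_bits for i in range(n_hashes)]

def bloom_check(bloom: bytearray, *positions: int) -> int:
    """Return 0 as soon as one tested bit is 0, otherwise 1."""
    for pos in positions:
        if not bloom[pos >> 3] & (1 << (pos & 7)):
            return 0
    return 1

def bloom_set_bit(bloom: bytearray, *positions: int) -> int:
    """Set every bit named in the argument list, then return 1."""
    for pos in positions:
        bloom[pos >> 3] |= 1 << (pos & 7)
    return 1

def bloom_update(sha_cache: list, bloom: bytearray, n_bits: int) -> None:
    """Drain the fingerprint cache, inserting each fingerprint into the filter."""
    while sha_cache:                       # return once the cache is empty
        fp = sha_cache.pop()
        positions = hash_positions(fp, n_bits)
        if not bloom_check(bloom, *positions):
            bloom_set_bit(bloom, *positions)

bloom = bytearray(128)                     # 1024-bit filter
pending = [hashlib.sha1(b"a").digest(), hashlib.sha1(b"b").digest()]
inserted = list(pending)
bloom_update(pending, bloom, n_bits=1024)
```

After the update, `bloom_check` reports a hit for each inserted fingerprint, which is the state a later Bloom_query for the same data would observe.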

It should be noted and understood that various modifications and improvements can be made to the invention described in detail above without departing from the spirit and scope of the invention as required by the appended claims. Accordingly, the scope of the claimed technical solution is not limited by any specific exemplary teaching given.

Claims

1. A data routing method for a distributed deduplication system, the distributed deduplication system comprising summary storage nodes, deduplication nodes, and a server that communicates with the other nodes in the system, the method comprising:

Step 1): the server classifies the fingerprints of all data blocks constituting the data, and sends the fingerprints of different classes to the different summary storage nodes that store the data summaries of the fingerprints of the respective classes;

Step 2): the received fingerprints are queried in the summary storage node to obtain the hit score of each fingerprint at each deduplication node, and the hit scores are returned to the server;

Step 3): the server obtains an aggregate score for each deduplication node from the hit scores of the fingerprints at each deduplication node, and determines the target deduplication node according to the aggregate scores.

2. The method according to claim 1, wherein in step 3), determining the target deduplication node according to the aggregate scores includes:

the server determines the target deduplication node by combining the storage condition of each deduplication node with the aggregate scores.

3. The method according to claim 1 or 2, wherein each summary storage node stores, for each deduplication node, the data summary of one class of the fingerprints of all data blocks stored by that deduplication node, and wherein the total number of fingerprint classes equals the number of summary storage nodes.

4. The method according to claim 1 or 2, wherein the summary storage node uses Bloom Filters to store the data summary of each deduplication node.

5. The method according to claim 1 or 2, wherein in step 1), the server takes each fingerprint of the data blocks constituting the data modulo the number of summary storage nodes, and places fingerprints with the same remainder in the same class.

6. The method according to claim 4, wherein step 2) includes:

Step 21): in the summary storage node, the hash values of the received fingerprints are computed using the hash functions adopted by the Bloom Filters that store the data summary of each deduplication node;

Step 23): the hit scores are computed from the corresponding bits;

Step 24): the hit scores are returned to the server.

7. The method according to claim 2, wherein step 3) includes:

Step 31): for each deduplication node, the server computes the sum of the hit scores of all fingerprints at that deduplication node to obtain the aggregate score of that deduplication node;

Step 32): for each deduplication node, the server computes a weighted sum of the aggregate score and the reciprocal of the storage utilization, and the deduplication node with the largest value is taken as the target deduplication node.

8. method according to claim 1 and 2, also includes：

Step 0）, server from client receiving data, the data are carried out into piecemeal, and calculate the fingerprint of each data block.

9. method according to claim 1 and 2, also includes：

10. method according to claim 9, also includes：