CN109213738A

CN109213738A - Cloud storage file-level data deduplication retrieval system and method

Info

Publication number: CN109213738A
Application number: CN201811384763.5A
Authority: CN
Inventors: 董志勇; 邱琳; 赵航; 刘梦
Original assignee: Beacon Fire Technology Group Co Ltd; Wuhan Ligong Guangke Co Ltd
Current assignee: Beacon Fire Technology Group Co Ltd; Wuhan Ligong Guangke Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2019-01-15
Anticipated expiration: 2038-11-20
Also published as: CN109213738B

Abstract

The invention discloses a cloud storage file-level data deduplication retrieval system and method. The method stores the characteristic information of files in a fingerprint server. When a client requests to store a file, coarse filtering is performed first: the fingerprint server is searched, and if no file record with the same characteristics is found, the file is treated as a new file. If matching records are found, fine filtering is performed: the files found are treated as comparison files, and the random intervals and characteristic intervals of each comparison file are selected in turn and compared exactly to confirm whether the requested file already exists. If it does, the metadata of the requested file in the name server is set to point to the metadata of the comparison file; if it does not, the file is stored and its characteristic information is recorded in the fingerprint server. Through the two steps of coarse and fine filtering, the invention can greatly reduce the storage of duplicate files, offers high execution efficiency and a high deduplication rate, and is suitable for big data and cloud storage environments.

Description

A cloud storage file-level data deduplication retrieval system and method

Technical field

The present invention relates to the fields of duplicate data deletion, cloud storage, and retrieval in computer storage, and in particular to a cloud storage file-level data deduplication retrieval system and method.

Background art

The rapid development of the Internet has produced massive amounts of data, and scenarios that require transmitting and storing such data are increasingly common. Against this background, data storage technology has developed rapidly, and data deduplication and compression are techniques that can save a large amount of storage. Data deduplication minimizes data volume by identifying duplicate content, removing it, and leaving a pointer at the corresponding storage location. At present only a few mainstream arrays offer data deduplication as an additional product feature. Duplicate data wastes valuable cloud resources and generates additional overhead; reportedly, fewer than 5% of disk arrays truly support online data deduplication and compression, and the space that deduplication can save is considerable. Deleting duplicate data requires comparing files, and because a storage system holds a large number of files, comparison efficiency inevitably suffers. The file-level data deduplication method proposed by the present invention eliminates data redundancy, reduces the required storage capacity, and effectively addresses the problem of file comparison efficiency.

Summary of the invention

The technical problem to be solved by the present invention is that, in the prior art, duplicate data in cloud space wastes valuable cloud resources and generates additional overhead, and that comparing duplicate files is inefficient. To address these problems, a cloud storage file-level data deduplication retrieval system and method are provided.

The technical solution adopted by the present invention to solve its technical problem is as follows:

The present invention provides a cloud storage file-level data deduplication retrieval system. The system includes a client, a cloud storage platform, a fingerprint server, and a name server; the cloud storage platform consists of multiple data nodes, wherein:

The multiple data nodes are connected to the fingerprint server through the name server. The fingerprint server stores the characteristic information of the files held on the data nodes. The client sends requests to search for and filter files. During file filtering, coarse filtering is performed on a file using its characteristic information. After coarse filtering is complete, if further confirmation of the file is needed, the name server generates a fine-filtering task and hands it to a data node, which completes the second round of filtering.

Further, the characteristic information of the present invention comprises the local fingerprint, size, metadata pointer, and characteristic intervals of a file.

Further, the data in the fingerprint server of the present invention is fingerprinted using MD5 to eliminate redundant data blocks, and further data deduplication is then performed on the name server. The key-value information produced by fingerprint extraction is as follows: the key is the local fingerprint of the file, and the value is the size, metadata pointer, and characteristic intervals of the file.
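The key-value layout described above can be sketched in memory as follows. This is only an illustration under stated assumptions: the names `FileRecord`, `FingerprintIndex`, `register`, and `coarse_filter` are hypothetical, the metadata pointer is modelled as a plain string, and a real fingerprint server would use a persistent store rather than a Python dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Value part of the key-value pair (field names are illustrative)."""
    size: int                 # file size in bytes
    metadata_pointer: str     # hypothetical handle to metadata on the name server
    feature_intervals: list = field(default_factory=list)  # [(offset, length), ...]


class FingerprintIndex:
    """In-memory stand-in for the fingerprint server's key-value store."""

    def __init__(self):
        # key: local fingerprint -> value: all records sharing that fingerprint
        self._by_fingerprint = defaultdict(list)

    def register(self, fingerprint, record):
        """Record the characteristic information of a newly stored file."""
        self._by_fingerprint[fingerprint].append(record)

    def coarse_filter(self, fingerprint, size):
        """Return the records whose fingerprint and size both match."""
        candidates = self._by_fingerprint.get(fingerprint, [])
        return [r for r in candidates if r.size == size]
```

An empty result from `coarse_filter` corresponds to the "new file" case; a non-empty result yields the candidate files that fine filtering would then check.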

Further, the local fingerprint of a file of the present invention is the file signature obtained by performing a hash operation on the head and tail of the file; if the file is too small for a head-and-tail hash operation, the entire file is used as the signature information.
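A minimal sketch of such a local fingerprint, using Python's standard `hashlib`, is given below. The function name `local_fingerprint`, the default `part_size` of 4096 bytes, and the choice to merge the two digests by concatenation are assumptions made for illustration; the text leaves the part size and the merging rule to the implementer. For a small file the sketch hashes the whole file, following the detailed embodiment.

```python
import hashlib
import os


def local_fingerprint(path, part_size=4096):
    """Return (fingerprint, size) for the file at `path`.

    part_size is an assumed value; head and tail use the same size.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size < 2 * part_size:
            # File too small for separate head and tail parts: hash it whole.
            return hashlib.md5(f.read()).hexdigest(), size
        head = f.read(part_size)
        f.seek(-part_size, os.SEEK_END)   # jump to the last part_size bytes
        tail = f.read(part_size)
    # Merge the two signatures by concatenation (an assumed merging rule).
    return hashlib.md5(head).hexdigest() + hashlib.md5(tail).hexdigest(), size
```

Only `2 * part_size` bytes are read for a large file, which is what keeps the coarse-filter fingerprint cheap compared with hashing the full content.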

Further, a characteristic interval of a file of the present invention is a difference interval produced when the file to be uploaded is compared exactly with a similar file; a similar file is a file that has, in part or in whole, the same fingerprint and file size as the file to be uploaded.

Further, the name server of the present invention determines the number of random intervals according to the file size and the number of characteristic intervals, and determines the positions of the random intervals according to how the file is stored.
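One way to realise this selection is sketched below. The function name `choose_random_intervals` and the parameters `interval_size` and `intervals_per_mib` are hypothetical, and the positions are drawn uniformly from fixed-size slots, which is a simplification: the text says positions depend on how the file is stored, and that dependency is not modelled here.

```python
import math
import random


def choose_random_intervals(file_size, feature_intervals,
                            interval_size=4096, intervals_per_mib=4, rng=None):
    """Pick fixed-size random check intervals as (offset, length) pairs.

    The total number of intervals (random + characteristic) is kept
    proportional to the file size; the ratio is an assumed tunable.
    """
    if rng is None:
        rng = random.Random()
    total = max(1, math.ceil(file_size / (1024 * 1024) * intervals_per_mib))
    wanted = max(0, total - len(feature_intervals))

    def overlaps_feature(start):
        end = start + interval_size
        return any(start < off + length and off < end
                   for off, length in feature_intervals)

    # Candidate positions are aligned slots, so random intervals never
    # overlap each other; slots touching a characteristic interval are skipped.
    free_slots = [s * interval_size
                  for s in range(file_size // interval_size)
                  if not overlaps_feature(s * interval_size)]
    chosen = rng.sample(free_slots, min(wanted, len(free_slots)))
    return sorted((start, interval_size) for start in chosen)
```

Enumerating every slot is acceptable for a sketch; for very large files a real implementation would likely sample positions without materialising the full slot list.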

Further, a data node of the present invention receives the comparison request passed on by the name server, receives the comparison data, performs the comparison over the comparison intervals, and reports the comparison result.
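The data node's comparison can be sketched as a loop over the check intervals that stops at the first mismatch. The function name `compare_intervals` and its argument layout are assumptions; the stored file is represented by any seekable binary file object.

```python
def compare_intervals(stored_file, intervals, uploaded_chunks):
    """Compare uploaded data with the stored file over the check intervals.

    stored_file     -- seekable binary file object for the stored copy
    intervals       -- list of (offset, length) pairs to check
    uploaded_chunks -- bytes sent by the client, one entry per interval

    Returns (True, None) if every interval matches, otherwise
    (False, first_failed_interval).
    """
    for (offset, length), chunk in zip(intervals, uploaded_chunks):
        stored_file.seek(offset)
        if stored_file.read(length) != chunk:
            # Report only the first failing interval, as the method describes.
            return False, (offset, length)
    return True, None
```

The failing interval returned here is what the name server would cache and later register as a characteristic interval of the newly stored file.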

The present invention provides a cloud storage file-level data deduplication retrieval method, which includes the following steps:

S1. The client selects the head and tail of the file, performs a hash operation on them, and obtains the file signature by MD5 fingerprint extraction; the signature serves as the local fingerprint of the file;

Because hash-based MD5 fingerprint extraction is fast and has low CPU usage, the data in the fingerprint server is fingerprinted using MD5 to eliminate redundant data blocks, and further data deduplication is then performed on the name server. The key-value information produced by fingerprint extraction is as follows: the key is the local fingerprint of the file, and the value is the size, metadata pointer, and characteristic intervals of the file.

S2. The size information and file signature of the file to be uploaded are sent to the fingerprint server, which performs coarse filtering of the file to be stored: the fingerprint server directly retrieves all files corresponding to the fingerprint, compiles statistics on them, and returns the resulting statistics to the client;

S3. The client receives the file information returned by the fingerprint server. If the number of files is 0, coarse filtering of the file to be stored found no matching characteristic information in the fingerprint store, and the file to be uploaded is a completely new file. The client sends a storage request to the name server, carrying the local fingerprint of the file; the name server determines the storage location of the file and registers the characteristic information of the file with the fingerprint server;

S4. If the number of files is not 0, coarse filtering of the file to be stored matched characteristic information in the fingerprint store. The client enters the cyclic verification stage, in which it sends file comparison requests one at a time, each carrying a file metadata pointer and characteristic intervals, so that the file to be stored is further fine-filtered;

S5. The name server receives the verification request sent by the client and locates the file metadata according to the file metadata pointer or index. It sets the number and distribution of the random check intervals according to how the file is stored and the state of its characteristic intervals. The sum of the number of random check intervals and the number of characteristic intervals should be proportional to the file size, with the ratio set as the situation requires. Characteristic intervals do not overlap with random intervals, and the size of a random interval is a fixed value, likewise set as the situation requires. The name server sends the computed random intervals to the client, and preparation for exact file comparison begins;

S6. The client sends the data of the characteristic intervals and random intervals to the name server. The name server forwards the data and the check intervals to a data node, which performs the exact comparison, and waits for the data node to return the check result;

S7. The data node obtains the check interval information and the check data and compares the data in the check intervals exactly. If the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the information of the first interval that failed to the name server;

S8. The name server collects the comparison results. If the comparison succeeds completely, it adds new file metadata that points to the file that matched completely, and returns to the client the information that the file has been found and is already stored;

S9. If the comparison fails, the name server caches the information of the interval that failed and asks the client to start the next file comparison;

S10. The client sends a new comparison request and the above comparison steps are repeated. If all comparisons have finished and the name server has not returned a comparison success message, the client reports that comparison is complete and requests file storage;

S11. After receiving the comparison-complete notice and the storage request, the name server begins allocating a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it;

S12. After the file has been stored, the failed intervals cached for this file during comparison are taken as the relative characteristic intervals of the file. If some of these intervals intersect, only one of the intersecting intervals is retained, so that the characteristic intervals are mutually disjoint. If there are too many characteristic intervals, a subset is selected so that the number of characteristic intervals does not exceed the set range;

S13. The characteristic intervals, the local fingerprint of the file, the file size, and the file metadata pointer are registered in the fingerprint server, and the client is notified that the file transfer is complete.

The beneficial effects of the present invention are as follows: through the two steps of coarse and fine filtering, the cloud storage file-level data deduplication retrieval system and method of the present invention can greatly reduce the storage of duplicate files. The method has high execution efficiency and a high deduplication rate, can quickly report whether a file is a duplicate, and is well suited to massive data storage and cloud storage environments.

Brief description of the drawings

The present invention is further described below with reference to the accompanying drawings and embodiments. In the drawings:

Fig. 1 is a system block diagram of an embodiment of the present invention;

Fig. 2 is a method flowchart of an embodiment of the present invention.

Detailed description of the embodiments

In order to make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

As shown in Fig. 1, the cloud storage file-level data deduplication retrieval system of the embodiment of the present invention includes a client, a cloud storage platform, a fingerprint server, and a name server; the cloud storage platform consists of multiple data nodes, wherein:

A fingerprint server is introduced to store the characteristic information of files, which includes the local fingerprint of the file, the file size, the relative characteristic intervals, the metadata pointer, and so on.

The present invention requires that the client be able to communicate with the fingerprint server. When submitting a file upload request, the client first computes the local fingerprint and size of the file and passes them to the fingerprint server for lookup. The role of the fingerprint server is to perform a coarse-grained file comparison, which implements coarse filtering. The fingerprint server returns the comparison result to the client, and depending on that result the client sends the name server either a further comparison request or a file storage request. For fine filtering, the name server obtains from the client the metadata pointer and characteristic intervals of a possibly duplicate file. It first queries how that file is stored, then selects random comparison intervals by jointly considering factors such as file size, storage partitioning, and the number of characteristic intervals, and returns them to the client. The client extracts the corresponding parts of the file according to the returned interval information and transmits them. After receiving the data, the name server forwards it to a data node, which performs the comparison. The data node returns to the name server whether the comparison succeeded and, if not, the first interval that failed. The name server then tells the client whether the file is a duplicate and plans the next step, completing fine filtering.

The specific execution procedure of the method of the present invention is as follows:

Step 1. The client performs a hash operation on the head and on the tail of the selected file to obtain their hash signatures and merges the two. The head and tail portions are the same size, and the specific size can be set as the situation requires. If the file is too small, the hash signature of the entire file is computed directly. The client caches the hash signature.

Step 2. The size information and file signature of the file to be uploaded are sent to the fingerprint server. The fingerprint server directly retrieves all files corresponding to the fingerprint and then compares file sizes. It counts the files whose fingerprint and size are both identical and returns that count to the client, together with information such as each file's metadata index or pointer and its characteristic intervals.

Step 3. The client receives the information returned by the fingerprint server and first checks whether the number of files is 0. If it is 0, the file is a completely new file: the client sends a storage request to the name server, carrying the local fingerprint of the file; the name server determines the storage location of the file and registers the characteristic information of the file with the fingerprint server.

Step 4. If the number of possibly duplicate files received by the client is not 0, the client enters the cyclic verification stage, in which it sends file comparison requests one at a time, each carrying a file metadata pointer and characteristic intervals.

Step 5. The name server receives the verification request sent by the client and locates the file metadata according to the file metadata pointer or index. It sets the number and distribution of the random check intervals according to how the file is stored, the state of its characteristic intervals, and similar factors. The sum of the number of random check intervals and the number of characteristic intervals should be proportional to the file size, and the ratio can be set as the situation requires. Characteristic intervals should, as far as possible, not overlap with random intervals. The size of a random interval is a fixed value that can likewise be set as the situation requires. The name server sends the computed random intervals to the client, and preparation for exact file comparison begins.

Step 6. The client sends the data of the characteristic intervals and random intervals to the name server. The name server forwards the data and the check intervals to a data node, which performs the exact comparison, and waits for the data node to return the check result.

Step 7. The data node obtains the check interval information and the check data and compares the data in the check intervals exactly. If the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the information of the first interval that failed to the name server.

Step 8. The name server collects the comparison results. If the comparison succeeds completely, it adds new file metadata that points to the file that matched completely, and returns to the client the information that the file has been found and is already stored.

Step 9. If the comparison fails, the name server caches the information of the interval that failed and asks the client to start the next file comparison.

Step 10. The client sends a new comparison request and the above comparison steps are repeated. If all comparisons have finished and the name server has not returned a comparison success message, the client reports that comparison is complete and requests file storage.

Step 11. After receiving the comparison-complete notice and the storage request, the name server begins allocating a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it.

Step 12. After the file has been stored, the failed intervals cached for this file during comparison are taken as the relative characteristic intervals of the file. Some of these intervals may intersect; in that case only one of the intersecting intervals is retained, so that the characteristic intervals are mutually disjoint. If there are too many characteristic intervals, a subset is selected so that the number of characteristic intervals does not exceed a certain range.

Step 13. The characteristic intervals, the local fingerprint of the file, the file size, the file metadata pointer, and so on are registered in the fingerprint server, and the client is notified that the file transfer is complete.
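The interval clean-up in step 12 can be sketched as follows. The function name `normalize_feature_intervals` and the `max_count` default are hypothetical, and two rules are assumptions, since the text does not fix them: among intersecting intervals the one with the smallest offset is kept, and when there are too many intervals the earliest ones are kept.

```python
def normalize_feature_intervals(failed_intervals, max_count=8):
    """Turn cached failed intervals into disjoint characteristic intervals.

    failed_intervals -- iterable of (offset, length) pairs cached during
                        the failed comparisons
    max_count        -- assumed upper bound on stored characteristic intervals
    """
    kept = []
    for offset, length in sorted(set(failed_intervals)):
        if kept and offset < kept[-1][0] + kept[-1][1]:
            # Intersects the interval kept just before it: retain only one.
            continue
        kept.append((offset, length))
    # Too many intervals: keep the first max_count (an assumed selection rule).
    return kept[:max_count]
```

The resulting list is what step 13 would register in the fingerprint server alongside the local fingerprint, file size, and metadata pointer.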

It should be understood that those of ordinary skill in the art can make improvements or changes based on the above description, and all such improvements and changes shall fall within the protection scope of the appended claims of the present invention.

Claims

1. A cloud storage file-level data deduplication retrieval system, characterized in that the system includes a client, a cloud storage platform, a fingerprint server, and a name server, the cloud storage platform consisting of multiple data nodes, wherein:

the multiple data nodes are connected to the fingerprint server through the name server; the fingerprint server stores the characteristic information of the files held on the data nodes; the client sends requests to search for and filter files; during file filtering, coarse filtering is performed on a file using its characteristic information; after coarse filtering is complete, if further confirmation of the file is needed, the name server generates a fine-filtering task and hands it to a data node, which completes the second round of filtering.

2. The cloud storage file-level data deduplication retrieval system according to claim 1, characterized in that the characteristic information comprises the local fingerprint, size, metadata pointer, and characteristic intervals of a file.

3. The cloud storage file-level data deduplication retrieval system according to claim 1, characterized in that the data in the fingerprint server is fingerprinted using MD5 to eliminate redundant data blocks, and further data deduplication is then performed on the name server, wherein the key-value information produced by fingerprint extraction is as follows: the key is the local fingerprint of the file, and the value is the size, metadata pointer, and characteristic intervals of the file.

4. The cloud storage file-level data deduplication retrieval system according to claim 2, characterized in that the local fingerprint of a file is the file signature obtained by performing a hash operation on the head and tail of the file; if the file is too small for a head-and-tail hash operation, the entire file is used as the signature information.

5. The cloud storage file-level data deduplication retrieval system according to claim 2, characterized in that a characteristic interval of a file is a difference interval produced when the file to be uploaded is compared exactly with a similar file; a similar file is a file that has, in part or in whole, the same fingerprint information and file size as the file to be uploaded.

6. The cloud storage file-level data deduplication retrieval system according to claim 2, characterized in that the name server determines the number of random intervals according to the file size and the number of characteristic intervals, and determines the positions of the random intervals according to how the file is stored.

7. The cloud storage file-level data deduplication retrieval system according to claim 1, characterized in that a data node receives the comparison request passed on by the name server, receives the comparison data, performs the comparison over the comparison intervals, and reports the comparison result.

8. A data deduplication retrieval method using the cloud storage file-level data deduplication retrieval system according to claim 1, characterized in that the method includes the following steps:

S1. The client selects the head and tail of the file, performs a hash operation on them, and obtains the file signature by MD5 fingerprint extraction; the signature serves as the local fingerprint of the file;

S2. The size information and file signature of the file to be uploaded are sent to the fingerprint server, which performs coarse filtering of the file to be stored: the fingerprint server directly retrieves all files corresponding to the fingerprint, compiles statistics on them, and returns the resulting statistics to the client;

S3. The client receives the file information returned by the fingerprint server. If the number of files is 0, coarse filtering of the file to be stored found no matching characteristic information in the fingerprint store, and the file to be uploaded is a completely new file. The client sends a storage request to the name server, carrying the local fingerprint of the file; the name server determines the storage location of the file and registers the characteristic information of the file with the fingerprint server;

S4. If the number of files is not 0, coarse filtering of the file to be stored matched characteristic information in the fingerprint store. The client enters the cyclic verification stage, in which it sends file comparison requests one at a time, each carrying a file metadata pointer and characteristic intervals, so that the file to be stored is further fine-filtered;

S5. The name server receives the verification request sent by the client and locates the file metadata according to the file metadata pointer or index. It sets the number and distribution of the random check intervals according to how the file is stored and the state of its characteristic intervals. The sum of the number of random check intervals and the number of characteristic intervals should be proportional to the file size, with the ratio set as the situation requires. Characteristic intervals do not overlap with random intervals, and the size of a random interval is a fixed value, likewise set as the situation requires. The name server sends the computed random intervals to the client, and preparation for exact file comparison begins;

S6. The client sends the data of the characteristic intervals and random intervals to the name server. The name server forwards the data and the check intervals to a data node of the cloud storage platform, which performs the exact comparison, and waits for the data node to return the check result;

S7. The data node obtains the check interval information and the check data and compares the data in the check intervals exactly. If the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the information of the first interval that failed to the name server;

S8. The name server collects the comparison results. If the comparison succeeds completely, it adds new file metadata that points to the file that matched completely, and returns to the client the information that the file has been found and is already stored;

S9. If the comparison fails, the name server caches the information of the interval that failed and asks the client to start the next file comparison;

S10. The client sends a new comparison request and the above comparison steps are repeated. If all comparisons have finished and the name server has not returned a comparison success message, the client reports that comparison is complete and requests file storage;

S11. After receiving the comparison-complete notice and the storage request, the name server begins allocating a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it;

S12. After the file has been stored, the failed intervals cached for this file during comparison are taken as the relative characteristic intervals of the file. If some of these intervals intersect, only one of the intersecting intervals is retained, so that the characteristic intervals are mutually disjoint. If there are too many characteristic intervals, a subset is selected so that the number of characteristic intervals does not exceed the set range;

S13. The characteristic intervals, the local fingerprint of the file, the file size, and the file metadata pointer are registered in the fingerprint server, and the client is notified that the file transfer is complete.