CN109213738B - Cloud storage file-level repeated data deletion retrieval system and method

Info

Publication number: CN109213738B
Application number: CN201811384763.5A
Authority: CN
Inventors: 董志勇; 邱琳; 赵航; 刘梦
Original assignee: Fiberhome Tech Group Co ltd; Wuhan Ligong Guangke Co Ltd
Current assignee: Fiberhome Tech Group Co ltd; Wuhan Ligong Guangke Co Ltd
Priority date: 2018-11-20
Filing date: 2018-11-20
Publication date: 2022-01-25
Anticipated expiration: 2038-11-20
Also published as: CN109213738A

Abstract

The invention discloses a cloud storage file-level deduplication retrieval system and method in which a fingerprint server stores the characteristic information of files. When a client requests to store a file, coarse filtering is performed first: the fingerprint server is searched, and if no file record with the same characteristics is found, the file is treated as a new file. If matching records are found, fine filtering is performed: each file in the matched set is treated as a comparison file, random points and characteristic intervals of the comparison file are selected in turn, and an exact comparison confirms whether the requested file already exists. If it does, the metadata of the requested file in the name server is set to point to the metadata of the comparison file; if it does not, the file is stored and its characteristic information is recorded in the fingerprint server. By filtering in a coarse step and a fine step, the method greatly reduces the storage of duplicate files, offers high execution efficiency and a high deduplication rate, and is suited to big data and cloud storage environments.

Description

Cloud storage file-level repeated data deletion retrieval system and method

Technical Field

The invention relates to data deduplication and retrieval in computer storage and cloud storage, and in particular to a cloud storage file-level deduplication retrieval system and method.

Background

The rapid development of the internet generates massive amounts of data, and scenarios that involve transmitting and storing such data are increasingly common. Against this background, data storage technology has developed rapidly, and deduplication and compression are technologies that can save a large amount of storage space. Deduplication minimizes the amount of data by identifying duplicate content, removing the duplicates, and leaving pointers at the corresponding storage locations. Currently only a few primary storage arrays provide deduplication as an additional product function, and reportedly fewer than 5% of disk arrays truly support online deduplication and compression. Duplicate data wastes valuable cloud resources and generates additional overhead, so the space that deduplication can save is considerable. The file-level deduplication method eliminates data redundancy, reduces the required storage capacity, and effectively addresses the efficiency of file comparison.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a cloud storage file-level deduplication retrieval system and method that address two problems in the prior art: duplicate data in cloud space wastes valuable cloud resources and generates extra overhead, and duplicate files are inefficient to compare.

The technical solution adopted by the invention to solve this technical problem is as follows:

The invention provides a cloud storage file-level deduplication retrieval system comprising a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes; wherein:

The plurality of data nodes are connected with the fingerprint server through the name server. The fingerprint server stores the characteristic information of the files held in the data nodes. The client sends requests to search for and filter a file. During filtering, the file is first coarsely filtered by means of its characteristic information; after the coarse filtering is finished, if further confirmation of the file is needed, the name server generates a fine filtering task and delivers it to the data nodes, which complete the secondary filtering.

Further, the characteristic information of the present invention comprises the partial fingerprint, the size, the metadata pointer and the characteristic intervals of the file.

Further, in the present invention, fingerprints of the data in the fingerprint server are extracted using MD5 so that redundant data blocks can be eliminated, and further deduplication is then performed on the name server. The extracted fingerprint information is stored as key-value pairs: the key is the partial fingerprint of the file, and the value is the size, metadata pointer and characteristic intervals of the file.
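The key-value layout described above can be sketched as a small in-memory index. This is only an illustration under stated assumptions: the names `FileRecord`, `FingerprintIndex`, `register` and `coarse_filter` are hypothetical, intervals are assumed to be half-open byte ranges, and a real fingerprint server would use persistent storage.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Value of the key-value pair: size, metadata pointer, characteristic intervals."""
    size: int
    metadata_pointer: str
    # Assumed representation: list of (start, end) half-open byte ranges.
    feature_intervals: list = field(default_factory=list)


class FingerprintIndex:
    """In-memory stand-in for the fingerprint server (hypothetical class)."""

    def __init__(self):
        # Different files can share one partial fingerprint, so each key maps to a list.
        self._records = defaultdict(list)

    def register(self, fingerprint: str, record: FileRecord) -> None:
        """Record the characteristic information of a newly stored file."""
        self._records[fingerprint].append(record)

    def coarse_filter(self, fingerprint: str, size: int) -> list:
        """Return stored records with the same partial fingerprint and the same size."""
        return [r for r in self._records.get(fingerprint, []) if r.size == size]
```

An empty result from `coarse_filter` corresponds to the "new file" case; a non-empty result is the candidate set passed on to fine filtering.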

Further, the partial fingerprint of the file of the present invention is obtained as follows: a hash operation is performed on the head and the tail of the file to obtain the file signature information; if the file is too small for separate head and tail hash operations, the whole file is used as the signature information.
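A minimal sketch of this head-and-tail signature, using Python's standard `hashlib.md5`. The function name `partial_fingerprint`, the 4096-byte `EDGE_SIZE` default, the choice to concatenate the two digests, and the decision to hash a small file whole are assumptions made for illustration; the source leaves the head/tail size configurable.

```python
import hashlib
import io

EDGE_SIZE = 4096  # assumed head/tail length in bytes; configurable in the source


def partial_fingerprint(stream, edge_size=EDGE_SIZE):
    """Return (fingerprint_hex, file_size) for a seekable binary stream."""
    size = stream.seek(0, io.SEEK_END)  # seek() returns the new absolute position
    stream.seek(0)
    if size < 2 * edge_size:
        # Too small for separate head and tail regions: sign the whole file.
        return hashlib.md5(stream.read()).hexdigest(), size
    head = stream.read(edge_size)
    stream.seek(-edge_size, io.SEEK_END)
    tail = stream.read(edge_size)
    # Merge the head and tail signatures into one fingerprint string.
    return hashlib.md5(head).hexdigest() + hashlib.md5(tail).hexdigest(), size
```

Only `2 * edge_size` bytes are read for a large file, which is what keeps the coarse step cheap; the stream can be an open file (`open(path, "rb")`) or an `io.BytesIO` buffer.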

Further, the characteristic intervals of the file of the present invention are the difference intervals produced when the file to be uploaded is exactly compared with a similar file, where a similar file is a file whose fingerprint and file size are partially or entirely the same as those of the file to be uploaded.

Further, the name server of the present invention determines the number of random intervals according to the size of the file and the number of characteristic intervals, and determines the positions of the random intervals according to how the file is stored.
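One way the name server might choose random intervals is sketched below. The function names, the default `ratio` and `length`, and the uniform sampling of positions are assumptions: the source only says that the count depends on the file size and the number of characteristic intervals, and that positions depend on how the file is stored, which this sketch does not model. Intervals are treated as half-open byte ranges.

```python
import random


def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) intersect."""
    return a[0] < b[1] and b[0] < a[1]


def pick_random_intervals(file_size, feature_intervals, ratio=1e-6,
                          length=4096, rng=None, max_tries=1000):
    """Pick fixed-size random check intervals that avoid the characteristic intervals."""
    if file_size < length:
        return []  # file is smaller than one check interval
    rng = rng or random.Random()
    # Count grows with (number of characteristic intervals + file size); ratio is tunable.
    count = max(1, round(ratio * (len(feature_intervals) + file_size)))
    chosen = []
    tries = 0
    while len(chosen) < count and tries < max_tries:
        tries += 1
        start = rng.randrange(0, file_size - length + 1)
        candidate = (start, start + length)
        # Reject candidates that touch a characteristic interval or an earlier pick.
        if any(overlaps(candidate, iv) for iv in list(feature_intervals) + chosen):
            continue
        chosen.append(candidate)
    return sorted(chosen)
```

The `max_tries` bound is a design choice of this sketch: when most of the file is already covered by characteristic intervals, the function returns fewer intervals rather than looping indefinitely.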

Further, the data node of the present invention receives the comparison request and the comparison data transmitted by the name server, performs the comparison over the comparison intervals, and reports the comparison result.
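The data node's part can be sketched as a byte-range comparison. The function name `compare_intervals` and the in-memory `bytes` representation of the stored file are assumptions for illustration; the returned pair mirrors the success or failure flag plus the first failed interval that the method reports to the name server.

```python
def compare_intervals(stored, intervals, chunks):
    """Compare client-supplied chunks against the stored file over the given intervals.

    stored    -- bytes of the file held by the data node
    intervals -- list of (start, end) half-open byte ranges
    chunks    -- client data for each interval, in the same order
    Returns (True, None) on success or (False, first_failed_interval) on failure.
    """
    for (start, end), chunk in zip(intervals, chunks):
        if stored[start:end] != chunk:
            return False, (start, end)  # stop at the first mismatch
    return True, None
```

For example, `compare_intervals(b"0123456789", [(0, 3), (5, 8)], [b"012", b"5X7"])` reports a failure at `(5, 8)`.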

The invention provides a cloud storage file-level repeated data deletion retrieval method, which comprises the following steps:

S1, the client selects the head and the tail of the file and performs a hash operation on them, obtaining a file signature by MD5 fingerprint extraction; this signature serves as the partial fingerprint of the file;

Because hash-based MD5 fingerprint extraction is fast and has low CPU usage, fingerprints of the data in the fingerprint server are extracted using MD5 so that redundant data blocks can be eliminated, and further deduplication is then performed on the name server. The extracted fingerprint information is stored as key-value pairs: the key is the partial fingerprint of the file, and the value is the size, metadata pointer and characteristic intervals of the file.

S2, the client sends the size of the file to be uploaded and the file signature to the fingerprint server, which coarsely filters the stored files: the fingerprint server directly retrieves all files corresponding to the fingerprint, compiles statistics on them, and returns the resulting statistics to the client;

S3, the client receives the file information returned by the fingerprint server; if the number of files is 0, coarse filtering found no match for the characteristic information of the file in the fingerprint database, so the file to be uploaded is a brand-new file: the client sends a storage request carrying the partial fingerprint of the file to the name server, the name server determines the storage location of the file, and the characteristic information of the file is registered with the fingerprint server;

S4, if the number of files is not 0, coarse filtering matched the characteristic information of the file in the fingerprint database, and the client enters a cyclic verification stage: it sends file comparison requests one after another, each carrying a file metadata pointer and characteristic intervals, so that the file to be stored is further finely filtered;

S5, the name server receives a check request sent by the client, locates the file metadata by the file metadata pointer or index, and sets the number and distribution of the random check intervals according to how the file is stored and to its characteristic intervals; the number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, with the ratio set as circumstances require; the characteristic intervals and the random intervals do not overlap; each random interval has a fixed size, likewise set as circumstances require; the name server sends the calculated random intervals to the client, and the exact comparison of the files begins;

S6, the client sends the data of the characteristic intervals and the random intervals to the name server, the name server forwards the data and the check intervals to the data node, the data node performs the exact comparison, and the check result from the data node is awaited;

S7, the data node obtains the check intervals and the check data and performs an exact comparison within the check intervals; if the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the first interval that failed to the name server;

S8, the name server tallies the comparison results; if the comparison succeeded, it adds file metadata that points to the successfully matched file and informs the client that the file was found and has been stored;

S9, if the comparison failed, the name server caches the failed interval and asks the client to start the comparison with the next file;

S10, the client sends a new comparison request and the comparison steps are repeated; if all comparisons have finished and the name server has not returned a success message, the client reports that the comparison is complete and applies to store the file;

S11, after receiving the comparison-complete message and the storage application, the name server allocates a storage location for the new file and notifies the client that it is ready; the client sends the file, and the name server begins storing it;

S12, after the file is stored, the failed comparison intervals cached during the comparisons are taken as the relative characteristic intervals of the file; if some of these intervals intersect, only part of them is kept so that the characteristic intervals are disjoint, and if there are too many characteristic intervals, a subset is selected so that their number does not exceed the set limit;

S13, the characteristic intervals, the partial fingerprint of the file, the file size and the file metadata pointer are registered with the fingerprint server, and the client is informed that the file transfer is complete.
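The cyclic verification stage (S4 to S10) can be sketched as a single loop over the candidates returned by coarse filtering. This is a simplified, single-process illustration, not the patented message flow: the function name `find_duplicate`, the candidate tuple layout, and the injected `check_intervals_for` callable (standing in for the name server's interval selection) are all hypothetical, and files are held as in-memory `bytes`.

```python
def find_duplicate(new_file, candidates, check_intervals_for):
    """Fine-filter a new file against coarse-filter candidates.

    new_file            -- bytes of the file the client wants to store
    candidates          -- list of (metadata_pointer, stored_bytes, feature_intervals)
    check_intervals_for -- callable(size, feature_intervals) -> list of random intervals
    Returns (metadata_pointer or None, cached_failed_intervals).
    """
    failed = []  # intervals cached by the name server after each failed comparison
    for pointer, stored, features in candidates:
        # Check the characteristic intervals first, then the random intervals.
        intervals = list(features) + check_intervals_for(len(stored), features)
        mismatch = next(
            (iv for iv in intervals
             if new_file[iv[0]:iv[1]] != stored[iv[0]:iv[1]]),
            None,
        )
        if mismatch is None:
            return pointer, failed  # duplicate found: point metadata at this file
        failed.append(mismatch)     # remember the first failed interval, try the next
    return None, failed             # no candidate matched: the file must be stored
```

When the function returns `None`, the cached `failed` list is the raw material for the characteristic intervals registered in S12 and S13. Note that a candidate is accepted once every sampled interval matches, so the strength of the check depends on how many intervals are sampled and where.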

The invention has the following beneficial effects: by filtering in a coarse step and a fine step, the cloud storage file-level deduplication retrieval system and method greatly reduce the storage of duplicate files. The algorithm quickly determines whether a file is a duplicate, combines high execution efficiency with a high deduplication rate, and is well suited to mass data storage and cloud storage environments.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a system block diagram of an embodiment of the present invention;

FIG. 2 is a flow chart of a method of an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

As shown in FIG. 1, a cloud storage file-level deduplication retrieval system according to an embodiment of the present invention comprises a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes.

The fingerprint server is introduced to store the characteristic information of files, including the partial fingerprint of the file, the file size, the relative characteristic intervals, the metadata pointer and the like.

The invention requires that the client can communicate with the fingerprint server. When the client submits a file upload request, it first calculates the partial fingerprint and the size of the file and sends them to the fingerprint server for lookup. The fingerprint server compares files at coarse granularity to implement coarse filtering and returns the comparison result to the client, and the client then sends either a further comparison request or a file storage request to the name server according to that result. For fine filtering, the name server receives from the client the metadata pointer and characteristic intervals of a possible duplicate file. It first queries how that file is stored, then weighs factors such as the file size, the block layout of the stored file and the number of characteristic intervals, randomly selects comparison intervals, and returns them to the client. The client extracts the corresponding parts of the file according to the returned intervals and transmits them; the name server receives this data and forwards it to the data node, which compares it with its own stored data. The data node returns to the name server whether the comparison succeeded and, if not, the first interval that failed, and the name server informs the client whether the file is a duplicate and what to do next, which completes the fine filtering.

The specific implementation process of the technical method comprises the following steps:

Step 1, the client selects the head and the tail of the file, performs a hash operation on each to obtain the head and tail hash signatures, and merges them. The head and the tail have the same size, which can be set as circumstances require; if the file is too small, the hash signature of the whole file is obtained directly. The client caches the hash signature.

Step 2, the client sends the size of the file to be uploaded and the file signature to the fingerprint server. The fingerprint server directly retrieves all files corresponding to the fingerprint, compares the file sizes, counts the files whose fingerprint and size both match, and returns information such as their file metadata indexes or pointers and characteristic intervals to the client.

Step 3, the client receives the information returned by the fingerprint server and first checks whether the number of files is 0. If so, the file is a brand-new file: the client sends a storage request carrying the partial fingerprint of the file to the name server, the name server determines the storage location of the file, and the characteristic information of the file is registered with the fingerprint server.

Step 4, if the number of possible duplicate files received by the client is not 0, the client enters a cyclic verification stage and sends file comparison requests one after another, each carrying a file metadata pointer and characteristic intervals.

Step 5, the name server receives a check request sent by the client, locates the file metadata by the file metadata pointer or index, and sets the number and distribution of the random check intervals according to how the file is stored, its characteristic intervals and the like. The number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, and the name server can set the ratio as circumstances require. The characteristic intervals and the random intervals overlap as little as possible. Each random interval has a fixed size, which the name server can also set as circumstances require. The name server sends the calculated random intervals to the client, and the exact comparison of the files begins.

Step 6, the client sends the data of the characteristic intervals and the random intervals to the name server, the name server forwards the data and the check intervals to the data nodes, the data nodes perform the exact comparison, and the check result from the data nodes is awaited.

Step 7, the data node obtains the check intervals and the check data and performs an exact comparison within the check intervals. If the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the first interval that failed to the name server.

Step 8, the name server tallies the comparison results. If every interval compared successfully, it adds file metadata that points to the successfully matched file and informs the client that the file was found and has been stored.

Step 9, if the comparison failed, the name server caches the failed interval and asks the client to start the comparison with the next file.

Step 10, the client sends a new comparison request and the comparison steps are repeated. If all comparisons have finished and the name server has not returned a success message, the client reports that the comparison is complete and applies to store the file.

Step 11, after receiving the comparison-complete message and the storage application, the name server allocates a storage location for the new file and notifies the client that it is ready; the client sends the file, and the name server begins storing it.

Step 12, after the file is stored, the failed comparison intervals cached during the comparisons are taken as the relative characteristic intervals of the file. Some of these intervals may intersect; in that case only part of them is kept so that the characteristic intervals are disjoint. If there are too many characteristic intervals, a subset is selected so that their number does not exceed a certain limit.

Step 13, the characteristic intervals, the partial fingerprint of the file, the file size, the file metadata pointer and the like are registered with the fingerprint server, and the client is informed that the file transfer is complete.
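The clean-up of cached failure intervals in Step 12 can be sketched as follows. This is one possible reading, with assumptions labeled: when intervals intersect, the sketch keeps the one that starts first and drops the others, and when there are too many it simply keeps the first `max_count`. The function name, the `max_count` default and both selection policies are hypothetical; the source does not specify them.

```python
def normalize_feature_intervals(failed, max_count=8):
    """Turn cached failed intervals into disjoint characteristic intervals.

    failed    -- list of (start, end) half-open byte ranges, possibly overlapping
    max_count -- assumed upper bound on the number of intervals to register
    """
    kept = []
    for start, end in sorted(set(failed)):
        if kept and start < kept[-1][1]:
            continue  # intersects an interval already kept: drop it
        kept.append((start, end))
    # Assumed policy: keep the lowest-offset intervals when there are too many.
    return kept[:max_count]
```

For example, `normalize_feature_intervals([(8, 12), (0, 4), (2, 6), (8, 12), (20, 24)])` yields `[(0, 4), (8, 12), (20, 24)]`; the result is what would be registered alongside the partial fingerprint, size and metadata pointer in Step 13.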

It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims

1. A cloud storage file-level deduplication retrieval system, the system comprising a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes; wherein:

the plurality of data nodes are connected with the fingerprint server through the name server; the fingerprint server is used for storing the characteristic information of the file in the data node; the client is used for sending a request for searching and filtering the file; in the process of filtering the file, coarsely filtering the file through the characteristic information of the file; after the coarse filtering is finished, if further file confirmation is needed, a fine filtering task is generated by the name server and is delivered to the data node to finish secondary filtering;

the deduplication retrieval method implemented by the cloud storage file-level deduplication retrieval system comprises the following steps:

S2, sending the size of the file to be uploaded and the file signature to the fingerprint server, which coarsely filters the stored files: the fingerprint server directly retrieves all files corresponding to the partial fingerprint, compiles statistics on them, and returns the resulting statistics to the client;

S5, the name server receives a check request sent by the client, locates the file metadata by the file metadata pointer or index, and sets the number and distribution of the random check intervals according to how the file is stored and to its characteristic intervals, wherein the number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, with the ratio set as circumstances require, the characteristic intervals and the random check intervals do not overlap, and each random check interval has a fixed size set as circumstances require; the name server sends the calculated random check intervals to the client, and the exact comparison of the file begins;

S6, the client sends the data of the characteristic intervals and the random check intervals to the name server, the name server forwards the data and the random check intervals to the data nodes of the cloud storage platform, the data nodes perform the exact comparison, and the client waits for the data nodes to return the check result;

S7, the data node obtains the random check intervals and the check data and performs an exact comparison within the random check intervals; if the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag and returns the first interval that failed to the name server;

S10, the client sends a new comparison request and steps S6 to S9 are repeated; if all comparisons have finished and the name server has not returned a success message, the client reports that the comparison is complete and applies to store the file;

S12, after the file is stored, the failed comparison intervals cached during the comparisons are taken as the relative characteristic intervals of the file; if some of these intervals intersect, only part of them is kept so that the relative characteristic intervals are disjoint, and if there are too many relative characteristic intervals, a subset is selected so that their number does not exceed the set limit;

S13, the relative characteristic intervals, the partial fingerprint of the file, the file size and the file metadata pointer are registered with the fingerprint server, and the client is informed that the file transfer is complete.

2. The cloud storage file level deduplication retrieval system of claim 1, wherein the characteristic information represents a partial fingerprint, a size, a metadata pointer, and a characteristic interval of the file.

3. The cloud storage file-level deduplication retrieval system of claim 1, wherein fingerprints of the data in the fingerprint server are extracted using MD5 so that redundant data blocks can be eliminated, and further deduplication is then performed on the name server, the extracted fingerprint information being stored as key-value pairs in which the key is the partial fingerprint of the file and the value is the size, metadata pointer and characteristic intervals of the file.

4. The cloud storage file-level deduplication retrieval system of claim 2, wherein the local fingerprint information of the file is: carrying out Hash operation on the head and the tail of the file to obtain file signature information; and if the file size is not enough to carry out head-to-tail hash operation, taking the whole file as signature information.

5. The cloud storage file-level deduplication retrieval system of claim 2, wherein the characteristic intervals of the file are: the difference interval is generated when the file to be uploaded and the similar file are accurately compared; the similar file means a file having the same fingerprint information and file size as the file to be uploaded in part or in whole.

6. The cloud storage file-level deduplication retrieval system of claim 2, wherein the name server determines a number of random check intervals according to a file size and a number of feature intervals; and determining the position of the random inspection interval according to the file storage condition.

7. The cloud storage file-level deduplication retrieval system of claim 1, wherein the data node accepts the comparison request transmitted by the name server, accepts the comparison data, performs comparison according to the comparison interval, and notifies the comparison result.