CN109213738B - Cloud storage file-level repeated data deletion retrieval system and method - Google Patents

Cloud storage file-level repeated data deletion retrieval system and method

Info

Publication number
CN109213738B
Authority
CN
China
Prior art keywords
file
information
comparison
fingerprint
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811384763.5A
Other languages
Chinese (zh)
Other versions
CN109213738A (en)
Inventor
董志勇
邱琳
赵航
刘梦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fiberhome Tech Group Co ltd
Wuhan Ligong Guangke Co Ltd
Original Assignee
Fiberhome Tech Group Co ltd
Wuhan Ligong Guangke Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fiberhome Tech Group Co ltd and Wuhan Ligong Guangke Co Ltd
Priority to CN201811384763.5A
Publication of CN109213738A
Application granted
Publication of CN109213738B
Legal status: Active

Abstract

The invention discloses a cloud storage file-level deduplication retrieval system and method in which a fingerprint server stores the characteristic information of files. When a client requests to store a file, coarse filtering is performed first by searching the fingerprint server; if no file record with the same characteristics is found, the file is treated as a new file. If matching records are found, fine filtering is performed: each file in the matched set is treated as a comparison file, random positions and the characteristic intervals of the comparison file are selected in turn, and an exact comparison confirms whether the requested file already exists. If it does, the metadata of the requested file is set in the name server to point to the metadata of the comparison file; if not, the file is stored and its characteristic information is recorded in the fingerprint server. The two-stage coarse and fine filtering greatly reduces the storage of duplicate files, offers high execution efficiency and a high deduplication rate, and is suitable for big data and cloud storage environments.

Description

Cloud storage file-level repeated data deletion retrieval system and method
Technical Field
The invention relates to the field of deletion and retrieval of repeated data in computer storage and cloud storage, in particular to a cloud storage file-level repeated data deletion retrieval system and method.
Background
The rapid development of the Internet generates massive amounts of data, and scenarios that require transmitting and storing such data keep multiplying. Against this background, data storage technology has developed rapidly, and deduplication and compression are technologies that can save a large amount of storage. Deduplication minimizes the amount of data by identifying duplicate content, removing the duplicates, and leaving pointers at the corresponding storage locations. Currently only a few primary storage arrays provide deduplication as an additional product function. Duplicate data wastes valuable cloud resources and generates additional overhead; reportedly fewer than 5% of disk arrays truly support online deduplication and compression, although the space that deduplication can save is considerable. A file-level deduplication method eliminates data redundancy and reduces the required storage capacity, provided that the efficiency of file comparison is effectively addressed.
Disclosure of Invention
The technical problem to be solved by the invention is that, in the prior art, duplicate data wastes precious cloud resources and generates extra overhead, and the comparison of duplicate files is inefficient. To address this, the invention provides a cloud storage file-level deduplication retrieval system and method.
The technical scheme adopted by the invention for solving the technical problems is as follows:
The invention provides a cloud storage file-level deduplication retrieval system comprising a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes; wherein:
the plurality of data nodes are connected with the fingerprint server through the name server; the fingerprint server stores the characteristic information of the files held in the data nodes; the client sends requests to search for and filter files; during filtering, a file is first coarsely filtered using its characteristic information; after the coarse filtering, if further confirmation is needed, the name server generates a fine filtering task and delivers it to the data nodes, which complete the second-stage filtering.
Further, the characteristic information of the invention comprises the local fingerprint, the size, the metadata pointer and the characteristic intervals of the file.
Further, fingerprints of the data in the fingerprint server are extracted using MD5 so that redundant data can be eliminated, and further deduplication is then performed through the name server. Each fingerprint is stored as a key-value pair: the key is the local fingerprint of the file, and the value is the size, metadata pointer and characteristic intervals of the file.
Further, the local fingerprint information of the file is obtained by hashing the head and the tail of the file to produce the file signature; if the file is too small for separate head and tail hashing, the whole file is used to produce the signature.
Further, the characteristic intervals of a file are the intervals of difference found when the file to be uploaded is exactly compared with similar files; a similar file is a file whose fingerprint and file size are partly or entirely the same as those of the file to be uploaded.
Further, the name server determines the number of random intervals according to the file size and the number of characteristic intervals, and determines the positions of the random intervals according to how the file is stored.
Further, the data node receives the comparison request and the comparison data transmitted by the name server, performs the comparison over the comparison intervals, and reports the comparison result.
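The local fingerprint described above can be illustrated with a minimal Python sketch. The function name `local_fingerprint`, the 4096-byte head/tail size and the choice to merge the two digests by concatenation are assumptions made for illustration; the text only states that the head and tail are hashed with MD5 and that a file too small for this is hashed as a whole.

```python
import hashlib
import os

CHUNK = 4096  # assumed head/tail size in bytes; the text leaves this configurable


def local_fingerprint(path: str, chunk: int = CHUNK) -> str:
    """Return an MD5-based signature built from the head and tail of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size < 2 * chunk:
            # Too small for separate head and tail: hash the whole file.
            return hashlib.md5(f.read()).hexdigest()
        head = f.read(chunk)
        f.seek(size - chunk)
        tail = f.read(chunk)
    # Merging by concatenating the two hex digests is an assumption.
    return hashlib.md5(head).hexdigest() + hashlib.md5(tail).hexdigest()
```

Because only two small regions are read, the cost of computing the signature does not grow with the file size, which is what makes it usable as a cheap coarse-filter key.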
The invention provides a cloud storage file-level repeated data deletion retrieval method, which comprises the following steps:
S1, the client selects the head and the tail of the file, hashes them, and obtains a file signature by MD5 fingerprint extraction; the signature serves as the local fingerprint information of the file;
MD5 hash-based fingerprint extraction is fast and has low CPU usage, so fingerprints of the data in the fingerprint server are extracted using MD5 to eliminate redundant data, and further deduplication is then performed through the name server. Each fingerprint is stored as a key-value pair: the key is the local fingerprint of the file, and the value is the size, metadata pointer and characteristic intervals of the file.
S2, the client sends the size of the file to be uploaded and the file signature to the fingerprint server, which coarsely filters the stored files: it directly retrieves all files corresponding to the fingerprint information, compiles statistics on them, and returns the statistics to the client;
S3, the client receives the file information returned by the fingerprint server; if the number of files is 0, coarse filtering found no match for the characteristic information of the file in the fingerprint information base, and the file to be uploaded is a brand-new file; the client sends a storage request carrying the local fingerprint information of the file to the name server, the name server determines the storage location of the file, and the characteristic information of the file is registered with the fingerprint server;
S4, if the number of files is not 0, coarse filtering found matches for the characteristic information of the file in the fingerprint information base; the client enters a cyclic verification stage and sends file comparison requests one after another, each carrying a file metadata pointer and characteristic intervals, so that the file to be stored is further finely filtered;
S5, the name server receives a check request sent by the client, finds the file metadata according to the file metadata pointer or index, and sets the number and distribution of random check intervals according to how the file is stored and its characteristic intervals; the number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, with the ratio set as circumstances require; the characteristic intervals and the random intervals do not overlap; the size of each random interval is a fixed value set as circumstances require; the name server sends the calculated random intervals to the client, and exact comparison of the files begins;
S6, the client sends the data of the characteristic intervals and the random intervals to the name server; the name server forwards the data and the check intervals to the data node, which performs the exact comparison; the client waits for the data node to return the check result;
S7, the data node obtains the check intervals and the check data and performs an exact comparison within the check intervals; if the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag together with the first interval at which the comparison failed to the name server;
S8, the name server tallies the comparison result; if the comparison succeeded, it adds file metadata that points to the successfully compared file and informs the client that the file was found and is already stored;
S9, if the comparison failed, the name server caches the interval information of the failed comparison and requests the client to start the comparison with the next file;
S10, the client sends a new comparison request and the comparison steps are repeated; if all comparisons finish without the name server returning a success message, the client reports that the comparisons are complete and applies to store the file;
S11, on receiving the notice of completion and the storage application, the name server allocates a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it;
S12, after the file is stored, the cached intervals at which comparisons failed are taken as the relative characteristic intervals of the file; if some of these intervals intersect, only part of them is kept so that the characteristic intervals are disjoint; if there are too many characteristic intervals, a subset is selected so that their number does not exceed the set range;
S13, the characteristic intervals, the local fingerprint of the file, the file size and the file metadata pointer are registered with the fingerprint server, and the client is informed that the file transmission is complete.
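The selection of random check intervals in step S5 can be sketched as follows. This is only one possible reading: the text says the count is proportional to "the sum of the number of characteristic intervals and the file size" without fixing units, so the weighted sum below (`size_weight` per MiB, `interval_weight` per characteristic interval), the 4096-byte interval width and the name `pick_random_intervals` are all assumptions. Intervals are half-open `(start, end)` byte ranges.

```python
import math
import random


def overlaps(a, b):
    """True if half-open intervals a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


def pick_random_intervals(file_size, characteristic_intervals, width=4096,
                          size_weight=1.0, interval_weight=0.5,
                          max_tries=1000, rng=None):
    """Choose fixed-width check intervals that avoid the characteristic intervals."""
    rng = rng or random.Random()
    if file_size < width:
        return []
    # Assumed weighting: count grows with file size (in MiB) and interval count.
    count = max(1, math.ceil(size_weight * file_size / 2**20
                             + interval_weight * len(characteristic_intervals)))
    chosen = []
    for _ in range(max_tries):
        if len(chosen) == count:
            break
        start = rng.randrange(0, file_size - width + 1)
        cand = (start, start + width)
        # Reject candidates that overlap a characteristic or already chosen interval.
        if any(overlaps(cand, iv) for iv in list(characteristic_intervals) + chosen):
            continue
        chosen.append(cand)
    return sorted(chosen)
```

The retry cap `max_tries` keeps the loop bounded when the characteristic intervals already cover most of the file, in which case fewer intervals than requested are returned.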
The invention has the following beneficial effects: through two-stage coarse and fine filtering, the cloud storage file-level deduplication retrieval system and method greatly reduce the storage of duplicate files; the algorithm executes efficiently, achieves a high deduplication rate, and quickly determines whether a file is a duplicate, making it well suited to mass data storage and cloud storage environments.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a system block diagram of an embodiment of the present invention;
FIG. 2 is a flow chart of a method of an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in FIG. 1, a cloud storage file-level deduplication retrieval system according to an embodiment of the present invention comprises a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes; wherein:
the plurality of data nodes are connected with the fingerprint server through the name server; the fingerprint server stores the characteristic information of the files held in the data nodes; the client sends requests to search for and filter files; during filtering, a file is first coarsely filtered using its characteristic information; after the coarse filtering, if further confirmation is needed, the name server generates a fine filtering task and delivers it to the data nodes, which complete the second-stage filtering.
A fingerprint server is introduced to store the characteristic information of files, including the local fingerprint of the file, the file size, the relative characteristic intervals, the metadata pointer and the like.
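As a rough illustration of how such a fingerprint store might be organized, the sketch below keeps an in-memory mapping from local fingerprint to file records and performs the coarse filter by fingerprint and size. The class names `FileRecord` and `FingerprintIndex`, the field names and the in-memory dictionary are hypothetical; the text does not specify the storage backend.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Value stored under a local fingerprint (field names are assumptions)."""
    size: int
    metadata_pointer: str
    characteristic_intervals: list = field(default_factory=list)  # [(start, end), ...]


class FingerprintIndex:
    """In-memory stand-in for the fingerprint server's key-value store."""

    def __init__(self):
        self._by_fingerprint = defaultdict(list)

    def register(self, fingerprint: str, record: FileRecord) -> None:
        # Several files may share one local fingerprint, so keep a list per key.
        self._by_fingerprint[fingerprint].append(record)

    def coarse_filter(self, fingerprint: str, size: int) -> list:
        # Candidates must match both the local fingerprint and the file size.
        return [r for r in self._by_fingerprint.get(fingerprint, [])
                if r.size == size]
```

An empty result from `coarse_filter` corresponds to the "brand-new file" case; a non-empty result yields the candidates that go on to fine filtering.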
The invention requires that the client be able to communicate with the fingerprint server. When the client submits a file upload request, it first calculates the local fingerprint and the size of the file and sends them to the fingerprint server for lookup. The fingerprint server compares files at coarse granularity to implement the coarse filtering and returns the comparison result to the client, and the client then sends either a further comparison request or a file storage request to the name server according to that result. For fine filtering, the name server receives from the client the metadata pointer and the characteristic intervals of a possible duplicate file; it first queries how that file is stored, then weighs factors such as the file size, the storage block layout and the number of characteristic intervals, randomly selects comparison intervals, and returns them to the client. The client extracts the corresponding parts of the file according to the returned intervals and transmits them; the name server receives this partial file data and forwards it to the data node, which compares it with the file it stores. The data node returns to the name server whether the comparison succeeded and, if not, the first interval that failed; the name server then informs the client whether the file is a duplicate and what to do next, which completes the fine filtering.
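The exchange between the client and the data node during fine filtering can be sketched as below. For simplicity both sides hold their file as an in-memory `bytes` object, which is an assumption; a real data node would read the intervals from its stored blocks. The names `extract_chunks` and `compare_intervals` are hypothetical.

```python
def extract_chunks(data: bytes, intervals):
    """Client side: cut out the bytes of each half-open (start, end) interval."""
    return [data[start:end] for start, end in intervals]


def compare_intervals(stored: bytes, intervals, client_chunks):
    """Data-node side: exact comparison over the check intervals.

    Returns (True, None) when every interval matches, otherwise
    (False, first_failed_interval).
    """
    for (start, end), chunk in zip(intervals, client_chunks):
        if stored[start:end] != chunk:
            return False, (start, end)
    return True, None
```

Returning the first failed interval is what lets the name server cache it and later record it as a characteristic interval of the newly stored file.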
The specific implementation process of the technical method comprises the following steps:
Step 1, the client selects the head and the tail of the file, hashes each to obtain head and tail hash signatures, and merges them; the head and the tail have the same size, which can be set as circumstances require; if the file is too small, the hash signature of the whole file is computed directly; the client caches the hash signature.
Step 2, the client sends the size of the file to be uploaded and the file signature to the fingerprint server; the fingerprint server directly retrieves all files corresponding to the fingerprint, compares the file sizes, counts the files whose fingerprint and size are both the same, and returns information such as the file metadata indexes or pointers and the characteristic intervals to the client.
Step 3, the client receives the information returned by the fingerprint server and first checks whether the number of files is 0; if so, the file is a brand-new file, the client sends a storage request carrying the local fingerprint information of the file to the name server, the name server determines the storage location of the file, and the characteristic information of the file is registered with the fingerprint server.
Step 4, if the number of possible duplicate files received by the client is not 0, the client enters a cyclic verification stage and sends file comparison requests one after another, each carrying a file metadata pointer and characteristic intervals.
Step 5, the name server receives a check request sent by the client, finds the file metadata according to the file metadata pointer or index, and sets the number and distribution of random check intervals according to how the file is stored, its characteristic intervals and the like; the number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, and the ratio can be set by the name server as circumstances require; the characteristic intervals and the random intervals overlap as little as possible; the size of each random interval is a fixed value that can be set by the name server as circumstances require; the name server sends the calculated random intervals to the client, and exact comparison of the files begins.
Step 6, the client sends the data of the characteristic intervals and the random intervals to the name server; the name server forwards the data and the check intervals to the data nodes, which perform the exact comparison; the client waits for the data nodes to return the check result.
Step 7, the data node obtains the check intervals and the check data and performs an exact comparison within the check intervals; if the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag together with the first interval at which the comparison failed to the name server.
Step 8, the name server tallies the comparison result; if every interval compared successfully, it adds file metadata that points to the successfully compared file and informs the client that the file was found and is already stored.
Step 9, if the comparison failed, the name server caches the interval information of the failed comparison and requests the client to start the comparison with the next file.
Step 10, the client sends a new comparison request and the comparison steps are repeated; if all comparisons finish without the name server returning a success message, the client reports that the comparisons are complete and applies to store the file.
Step 11, on receiving the notice of completion and the storage application, the name server allocates a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it.
Step 12, after the file is stored, the cached intervals at which comparisons failed are taken as the relative characteristic intervals of the file; some of these intervals may intersect, in which case only part of them is kept so that the characteristic intervals are disjoint; if there are too many characteristic intervals, a subset is selected so that their number does not exceed a certain range.
Step 13, the characteristic intervals, the local fingerprint of the file, the file size, the file metadata pointer and the like are registered with the fingerprint server, and the client is informed that the file transmission is complete.
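The clean-up of cached failure intervals in step 12 can be sketched as follows. The text does not say which of several intersecting intervals is kept or how a subset is chosen when there are too many, so keeping the earliest interval and truncating to `max_count` entries are assumptions, as is the name `normalize_characteristic_intervals`. Intervals are half-open `(start, end)` byte ranges.

```python
def normalize_characteristic_intervals(failed, max_count=8):
    """Turn cached failure intervals into a bounded list of disjoint intervals."""
    kept = []
    for start, end in sorted(set(failed)):
        if kept and start < kept[-1][1]:
            # Intersects the interval kept last: drop it so the result stays disjoint.
            continue
        kept.append((start, end))
    # Assumed selection policy when too many remain: keep the earliest ones.
    return kept[:max_count]
```

For example, `normalize_characteristic_intervals([(0, 10), (5, 15), (20, 30), (20, 30)])` returns `[(0, 10), (20, 30)]`: the duplicate is removed and `(5, 15)` is dropped because it intersects `(0, 10)`.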
It will be understood that modifications and variations can be made by persons skilled in the art in light of the above teachings and all such modifications and variations are intended to be included within the scope of the invention as defined in the appended claims.

Claims (7)

1. A cloud storage file-level deduplication retrieval system, the system comprising a client, a cloud storage platform, a fingerprint server and a name server, wherein the cloud storage platform consists of a plurality of data nodes; wherein:
the plurality of data nodes are connected with the fingerprint server through the name server; the fingerprint server stores the characteristic information of the files held in the data nodes; the client sends requests to search for and filter files; during filtering, a file is first coarsely filtered using its characteristic information; after the coarse filtering, if further confirmation is needed, the name server generates a fine filtering task and delivers it to the data nodes, which complete the second-stage filtering;
the deduplication retrieval method implemented by the cloud storage file-level deduplication retrieval system comprises the following steps:
S1, the client selects the head and the tail of the file, hashes them, and obtains a file signature by MD5 fingerprint extraction; the signature serves as the local fingerprint information of the file;
S2, the client sends the size of the file to be uploaded and the file signature to the fingerprint server, which coarsely filters the stored files: it directly retrieves all files corresponding to the local fingerprint information, compiles statistics on them, and returns the statistics to the client;
S3, the client receives the file information returned by the fingerprint server; if the number of files is 0, coarse filtering found no match for the characteristic information of the file in the fingerprint information base, and the file to be uploaded is a brand-new file; the client sends a storage request carrying the local fingerprint information of the file to the name server, the name server determines the storage location of the file, and the characteristic information of the file is registered with the fingerprint server;
S4, if the number of files is not 0, coarse filtering found matches for the characteristic information of the file in the fingerprint information base; the client enters a cyclic verification stage and sends file comparison requests one after another, each carrying a file metadata pointer and characteristic intervals, so that the file to be stored is further finely filtered;
S5, the name server receives a check request sent by the client, finds the file metadata according to the file metadata pointer or index, and sets the number and distribution of random check intervals according to how the file is stored and its characteristic intervals; the number of random check intervals is proportional to the sum of the number of characteristic intervals and the file size, with the ratio set as circumstances require; the characteristic intervals and the random check intervals do not overlap; the size of each random check interval is a fixed value set as circumstances require; the name server sends the calculated random check intervals to the client, and exact comparison of the file begins;
S6, the client sends the data of the characteristic intervals and the random check intervals to the name server; the name server forwards the data and the random check intervals to the data nodes of the cloud storage platform, which perform the exact comparison; the client waits for the data nodes to return the check result;
S7, the data node obtains the random check interval information and the check data and performs an exact comparison within the random check intervals; if the comparison succeeds, it returns a success flag; if the comparison fails, it returns a failure flag together with the first interval at which the comparison failed to the name server;
S8, the name server tallies the comparison result; if the comparison succeeded, it adds file metadata that points to the successfully compared file and informs the client that the file was found and is already stored;
S9, if the comparison failed, the name server caches the interval information of the failed comparison and requests the client to start the comparison with the next file;
S10, the client sends a new comparison request and steps S6 to S9 are repeated; if all comparisons finish without the name server returning a success message, the client reports that the comparisons are complete and applies to store the file;
S11, on receiving the notice of completion and the storage application, the name server allocates a storage location for the new file and informs the client that it is ready; the client sends the file, and the name server starts storing it;
S12, after the file is stored, the cached intervals at which comparisons failed are taken as the relative characteristic intervals of the file; if some of these intervals intersect, only part of them is kept so that the relative characteristic intervals are disjoint; if there are too many relative characteristic intervals, a subset is selected so that their number does not exceed the set range;
S13, the relative characteristic intervals, the local fingerprint of the file, the file size and the file metadata pointer are registered with the fingerprint server, and the client is informed that the file transmission is complete.
2. The cloud storage file-level deduplication retrieval system of claim 1, wherein the characteristic information comprises the local fingerprint, the size, the metadata pointer and the characteristic intervals of the file.
3. The cloud storage file-level deduplication retrieval system of claim 1, wherein fingerprints of the data in the fingerprint server are extracted using MD5 so that redundant data can be eliminated, and further deduplication is then performed through the name server, each fingerprint being stored as a key-value pair in which the key is the local fingerprint of the file and the value is the size, metadata pointer and characteristic intervals of the file.
4. The cloud storage file-level deduplication retrieval system of claim 2, wherein the local fingerprint information of the file is obtained by hashing the head and the tail of the file to produce the file signature; and if the file is too small for separate head and tail hashing, the whole file is used to produce the signature.
5. The cloud storage file-level deduplication retrieval system of claim 2, wherein the characteristic intervals of the file are the intervals of difference found when the file to be uploaded is exactly compared with similar files; a similar file is a file whose fingerprint information and file size are partly or entirely the same as those of the file to be uploaded.
6. The cloud storage file-level deduplication retrieval system of claim 2, wherein the name server determines the number of random check intervals according to the file size and the number of characteristic intervals, and determines the positions of the random check intervals according to how the file is stored.
7. The cloud storage file-level deduplication retrieval system of claim 1, wherein the data node receives the comparison request and the comparison data transmitted by the name server, performs the comparison over the comparison intervals, and reports the comparison result.
CN201811384763.5A 2018-11-20 2018-11-20 Cloud storage file-level repeated data deletion retrieval system and method Active CN109213738B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811384763.5A CN109213738B (en) 2018-11-20 2018-11-20 Cloud storage file-level repeated data deletion retrieval system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811384763.5A CN109213738B (en) 2018-11-20 2018-11-20 Cloud storage file-level repeated data deletion retrieval system and method

Publications (2)

Publication Number Publication Date
CN109213738A CN109213738A (en) 2019-01-15
CN109213738B CN109213738B (en) 2022-01-25

Family

ID=64993843

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811384763.5A Active CN109213738B (en) 2018-11-20 2018-11-20 Cloud storage file-level repeated data deletion retrieval system and method

Country Status (1)

Country Link
CN (1) CN109213738B (en)

US9043292B2 (en) * 2011-06-14 2015-05-26 Netapp, Inc. Hierarchical identification and mapping of duplicate data in a storage system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
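Sequence of Hashes Compression in Data De-duplication; Subashini Balachandran; Data Compression Conference (DCC 2008); 20080403; full text *
A Parallel Hierarchical Data Deduplication Technique; Jia Zhikai et al.; Journal of Computer Research and Development (《计算机研究与发展》); 20110228; Vol. 48; pp. 100-104 *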

Also Published As

Publication number Publication date
CN109213738A (en) 2019-01-15

Similar Documents

Publication Publication Date Title
CN109213738B (en) Cloud storage file-level repeated data deletion retrieval system and method
CN102782643B (en) Use the indexed search of Bloom filter
US10228851B2 (en) Cluster storage using subsegmenting for efficient storage
US7478113B1 (en) Boundaries
US20080270729A1 (en) Cluster storage using subsegmenting
US20120303595A1 (en) Data restoration method for data de-duplication
CN104932841A (en) Saving type duplicated data deleting method in cloud storage system
WO2013086969A1 (en) Method, device and system for finding duplicate data
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
US20100250480A1 (en) Identifying similar files in an environment having multiple client computers
US20140222770A1 (en) De-duplication data bank
CN109766318B (en) File reading method and device
EP3610364B1 (en) Wan optimized micro-service based deduplication
CN105069111A (en) Similarity based data-block-grade data duplication removal method for cloud storage
CN111033487A (en) Microservice-based deduplication
CN111522502B (en) Data deduplication method and device, electronic equipment and computer-readable storage medium
CN110908589A (en) Data file processing method, device and system and storage medium
CN106990914B (en) Data deleting method and device
CN111522791B (en) Distributed file repeated data deleting system and method
CN110245129B (en) Distributed global data deduplication method and device
CN100357943C (en) A method for inspecting garbage files in cluster file system
CN110737389A (en) Method and device for storing data
CN110019056B (en) Container metadata separation for cloud layer
US10860212B1 (en) Method or an apparatus to move perfect de-duplicated unique data from a source to destination storage tier
US10949088B1 (en) Method or an apparatus for having perfect deduplication, adapted for saving space in a deduplication file system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant