CN113672170A - Redundant data marking and removing method - Google Patents

Publication number
CN113672170A
Authority
CN
China
Prior art keywords
data block
data
file
group
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110838878.2A
Other languages
Chinese (zh)
Inventor
朱敏俊
王奕
黄宗浩
李渊
张晖
厉励
张逸鲁
高宇
戴梅
黄麒玮
蔡云飞
曹斌
石强
王正源
王骏杰
于镆铘
崔敏杰
胡佳迎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University Shanghai Cancer Center
Original Assignee
Fudan University Shanghai Cancer Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University Shanghai Cancer Center
Priority to CN202110838878.2A
Publication of CN113672170A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604 Improving or facilitating administration, e.g. storage management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0646 Horizontal data movement in storage systems, i.e. moving data in between storage devices or systems
    • G06F3/0652 Erasing, e.g. deleting, data cleaning, moving of data to a wastebasket
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Abstract

The invention relates to a redundant data marking and removing method, and belongs to the technical field of data storage. The method comprises the following steps: when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths; grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups; processing the bloom value of the data block to form a characteristic value of the data block; judging whether the characteristic value of the data block exists in a metadata base or not; if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks; marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy. The method has the advantages of high redundancy recognition rate, high reliability, high robustness and less resource occupation.

Description

Redundant data marking and removing method
Technical Field
The invention belongs to the technical field of data storage, and particularly relates to a redundant data marking and removing method.
Background
Redundant data marking (redundancy-mark) is a data reduction technique aimed at reducing the storage capacity used in storage systems. Redundant data in the storage system is marked and deleted by searching for redundant variable-size data blocks at different positions in different files, and only one or a few necessary copies are retained, so that excess redundant data is eliminated. Redundant data marking can greatly reduce the consumption of physical storage space, improve the efficiency of services such as retrieval, save transmission bandwidth, and allow efficient and economical replication of backup data among a user's different sites.
The redundant data marking and removing technique is classified into pre-processing, online processing, and post-processing according to the stage at which the data is processed.
In the pre-processing mode, redundancy detection is performed before data is written to the system, and redundant data is marked or deleted at that point. Pre-processing effectively reduces the load on the back-end storage system and avoids unnecessary redundant writes, but it also increases write latency and reduces the response speed of the system.
In the online processing mode, a characteristic value is extracted and the redundant data marking algorithm is executed while the data is being written to disk. Online redundant data deletion reduces the data volume to some extent, but the deduplication operation itself lowers data throughput, which degrades service performance.
In the post-processing mode, redundant data is deleted after the data has been written to disk. The data is first written to temporary disk space, redundant data deletion is then started, and finally the deduplicated data is written to disk. Because the deletion is performed on a separate storage device after the data has been written, it generally has little effect on normal business processing. However, current post-processing approaches cannot dynamically adjust their use of system resources and cannot give priority to the performance of online services, so online services are affected when resource usage becomes too high.
Redundant data marking and removal techniques may be classified as file-level, block-level, or byte-level, depending on the granularity of the redundancy identification target.
File-level redundant data marking and removal uses the file as the basic unit for identifying and removing redundant data. Its advantages are that the identification algorithm is relatively simple, easy to implement, efficient, and fast. Its disadvantages are that the applicable scenarios are limited and the redundancy identification rate is very low for large files.
Block-level redundant data marking and removal performs redundancy detection using the smallest block of the storage system as the identification unit. Its advantages are that the underlying libraries of the operating system support the identification algorithm well, computation is fast, and the method is sensitive to data changes, which makes it suitable for detecting dynamic files.
The block-level redundant data marking and removing are divided into a fixed-length blocking mode and a variable-length blocking mode according to different blocking modes.
Referring to FIG. 1, the fixed-length blocking mode divides a file into blocks of fixed length. This mode is overly sensitive to data insertion and deletion: the redundancy rate of dynamic data in practical applications changes greatly over time, redundancy identification results are strongly time-sensitive, and the identification effect is relatively limited.
Byte-level redundant data marking and removal searches for and marks redundant data in units of bytes, typically generating the differing portion of the content via a differential marking strategy. Byte-level deduplication has the advantage of a high deduplication rate, but its efficiency and speed are low, and the difference content accounts for a large proportion of the content and occupies a large amount of the system's storage resources. It is generally used in post-processing redundancy removal.
In addition, given the wide use of distributed storage systems, traditional redundant data identification methods cannot deliver both performance and accuracy, and therefore have certain shortcomings.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a redundant data marking and removing method that adaptively adjusts the impact of the identification algorithm and the deduplication operation on system resources according to the system state, allows the granularity and priority of redundancy marking to be set, gives priority to the resource needs of the system's normal services, and minimizes the impact on those services. The method has the advantages of a high redundancy identification rate, high reliability, high robustness, and low resource usage.
According to one aspect of the present invention, there is provided a method for marking and removing redundant data, the method comprising the steps of:
when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths;
grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups;
processing the bloom value of the data block to form a characteristic value of the data block;
judging whether the characteristic value of the data block exists in a metadata base or not;
if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks;
marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.
Preferably, if the characteristic value of the data block does not exist in the metadata base, the data block is written into a certain data block group, and the characteristic value of the data block is stored in the metadata base as an index of the data block.
Preferably, a characteristic diagram technology of memory automatic mapping is adopted, and the file is subjected to dynamic variable-length segmentation according to the attribute of the file.
Preferably, when calculating the characteristic value of the data block, the data characteristic value of the sliding interval is calculated through a sliding window according to the original characteristic value, the characteristic value sliding-in value and the characteristic value sliding-out value of the data block.
Preferably, the grouping the plurality of data blocks to obtain a data block group includes:
extracting the characteristic value of the data block, performing similarity fitting on the acquired characteristic value of the data block and the characteristic value of the data block in the metadata database, classifying the data block into different similarity groups, and establishing a similarity group file data block index in each similarity group.
Preferably, the comparing the bloom value of each data block group in the metadata database to locate the similar group of the data block comprises:
and calculating a similarity value between the data block and the data block group in the metadata base according to the distribution value, and if the similarity value exceeds a similarity threshold set by a system, taking the data block group corresponding to the sampling data of the current file data block as a similarity group of the data block.
Preferably, the similarity threshold is dynamically adjusted.
Preferably, in each similarity group, the sample characteristic value of each file forms the characteristic value index of the group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.
Preferably, the metadata base stores related attribute information of the data blocks and the data block groups, including read-write time, redundancy marks, object sizes, and storage paths.
Preferably, when the system receives an operation request for the target file, the following processing steps are executed:
judging whether the target file is a file subjected to redundancy marking;
if the target file is not subjected to the redundancy marking operation, directly executing the requested operation on the target file;
if the target file is subjected to redundancy marking operation and has redundant content, acquiring metafile data of the target file, and determining a target data block group and a target data block path of an operation request;
and executing and completing the operation requested to the target file.
Advantageous effects: the redundant data marking and removing method gives priority to the resource needs of the system's normal services, minimizes the impact on those services, and has the advantages of a high redundancy identification rate, high reliability, high robustness, and low resource usage.
The features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of specific embodiments of the invention.
Drawings
FIG. 1 is a schematic diagram of a fixed-length chunking technique of the prior art;
FIG. 2 is a flow chart of a method for marking and removing redundant data according to the present invention;
FIG. 3 is a schematic diagram of the variable-length blocking technique of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
FIG. 2 is a flow chart of a method for marking and removing redundant data according to the present invention. As shown in fig. 2, the present invention provides a method for marking and removing redundant data, which comprises the following steps:
when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths;
grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups;
processing the bloom value of the data block to form a characteristic value of the data block;
judging whether the characteristic value of the data block exists in a metadata base or not;
if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks;
marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.
Preferably, if the characteristic value of the data block does not exist in the metadata base, the data block is written into a certain data block group, and the characteristic value of the data block is stored in the metadata base as an index of the data block.
Specifically, the system environment is first identified and parameters are set. The type of the file is then identified and the bloom value of each data block is calculated, and the metadata base is queried for that bloom value. If the bloom value does not exist, the data block is a non-redundant data block and is written into a data block group in the system. If the bloom value exists, the data block is a redundant data block: the similar group of the redundant data block is located, and mark data is generated and stored as metadata. Finally, a database record or log record is generated for future reference.
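The write path described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `MetadataBase`, `feature_of`, and `write_blocks` are hypothetical, a SHA-256 digest stands in for the characteristic value (the source derives it from bloom values by an unspecified process), and the similarity-group location step is omitted.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MetadataBase:
    """Hypothetical in-memory stand-in for the metadata base."""
    index: dict = field(default_factory=dict)   # characteristic value -> group id
    groups: dict = field(default_factory=dict)  # group id -> stored data blocks
    marks: list = field(default_factory=list)   # (characteristic value, group id)


def feature_of(block: bytes) -> str:
    # Assumption: a SHA-256 digest plays the role of the characteristic value.
    return hashlib.sha256(block).hexdigest()


def write_blocks(blocks, meta: MetadataBase, group_id: str = "group-0") -> int:
    """Store new blocks, mark known ones as redundant, return the redundant count."""
    redundant = 0
    for block in blocks:
        feature = feature_of(block)
        if feature in meta.index:
            # Characteristic value already known: record a redundancy mark only.
            meta.marks.append((feature, meta.index[feature]))
            redundant += 1
        else:
            # Unknown block: write it into a group and index its characteristic value.
            meta.groups.setdefault(group_id, []).append(block)
            meta.index[feature] = group_id
    return redundant
```

Writing the blocks `[b"a", b"b", b"a"]` to an empty `MetadataBase` stores two blocks and records one redundancy mark for the repeated block.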
Processing the bloom values comprises filtering them with a filter. Illustratively, the filter may be a bloom filter, which consists of one long binary vector and several hash functions and can be used to quickly test whether an element belongs to a set.
When an element is added to a bloom filter, the element is hashed with each hash function in the filter, so that each hash function yields one hash value. The bit at the index given by each hash value is then set to 1 in the bit array.
To determine whether an element exists in the bloom filter, the same hash calculations are performed on the given element and the bits at the resulting indices are checked. If all of these bits are 1, the element is considered to be in the bloom filter (a bloom filter can report false positives); if any of them is not 1, the element is definitely not in the bloom filter.
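The add and lookup operations just described can be illustrated with a short sketch. The class name `BloomFilter`, the default sizes, and the way the hash functions are derived (SHA-256 salted with the hash index) are assumptions made for illustration; the source does not specify them.

```python
import hashlib


class BloomFilter:
    """A bit array plus several hash functions (illustrative parameters)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item: bytes):
        # Assumption: each "hash function" is SHA-256 salted with its index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        # Set the bit at every index produced by the hash functions.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos] for pos in self._positions(item))
```

An element that has been added is always reported as possibly present, and an empty filter reports every element as absent. An element that was never added may still be reported as present, which is why a positive answer needs confirmation against the stored data.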
Preferably, a characteristic diagram technology of memory automatic mapping is adopted, and the file is subjected to dynamic variable-length segmentation according to the attribute of the file.
Referring to FIG. 3, variable-length blocking uses a characteristic diagram technique of automatic memory mapping to segment the file dynamically into variable-length blocks according to the attributes of the file object. The technique is insensitive to changes in the file object: adding or deleting part of the data affects only some of the data blocks, and the affected data blocks are re-segmented by dynamically adjusting the characteristic range, so that the number of affected data blocks is minimized.
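The source does not disclose the details of its segmentation technique. As a hedged illustration of the general idea of variable-length blocking, the sketch below uses ordinary content-defined chunking: a block ends wherever a fingerprint of the most recent bytes matches a bit pattern, so cut points depend on local content rather than on absolute offsets. The function name, the use of CRC-32 as the fingerprint, and all parameter values are assumptions.

```python
import zlib


def split_variable_length(data: bytes, window: int = 16, mask: int = 0x3F,
                          min_size: int = 32, max_size: int = 1024) -> list:
    """Split data into variable-length blocks at content-defined cut points."""
    if min_size < window:
        raise ValueError("min_size must be at least the window length")
    chunks = []
    start = 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue  # never cut before the minimum block size
        # Fingerprint of the last `window` bytes only (CRC-32 is an assumption).
        fingerprint = zlib.crc32(data[i - window + 1 : i + 1])
        if (fingerprint & mask) == 0 or size >= max_size:
            chunks.append(data[start : i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing block may be shorter than min_size
    return chunks
```

Because each cut decision looks only at a small window of bytes, an insertion or deletion tends to change the blocks around the edit while later cut points can realign with the unchanged content. Recomputing the fingerprint from scratch at every position keeps the sketch short but is slower than an incremental update.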
Preferably, when calculating the characteristic value of the data block, the data characteristic value of the sliding interval is calculated through a sliding window according to the original characteristic value, the characteristic value sliding-in value and the characteristic value sliding-out value of the data block.
When the characteristic value of the data block is calculated, the data characteristic value of the sliding interval can be quickly calculated through the original characteristic value, the characteristic value sliding-in value and the characteristic value sliding-out value in the file object, and the execution efficiency of the redundant data marking operation is improved.
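One common way to compute a window value from the previous value, the byte sliding out, and the byte sliding in is a polynomial rolling hash. The sketch below shows that technique as an illustration; the source does not state which function it uses, and `BASE`, `MOD`, `window_hash`, and `slide` are hypothetical names and constants.

```python
BASE = 257            # illustrative polynomial base
MOD = (1 << 61) - 1   # illustrative large prime modulus


def window_hash(window: bytes) -> int:
    """Hash a whole window from scratch: sum of b[j] * BASE**(w-1-j) mod MOD."""
    h = 0
    for b in window:
        h = (h * BASE + b) % MOD
    return h


def slide(prev_hash: int, out_byte: int, in_byte: int, window_len: int) -> int:
    """Update the hash in O(1) when the window moves forward by one byte."""
    top = pow(BASE, window_len - 1, MOD)  # weight of the byte leaving the window
    return ((prev_hash - out_byte * top) * BASE + in_byte) % MOD
```

For a window of length `w` starting at offset `i - 1`, calling `slide(h, data[i - 1], data[i + w - 1], w)` gives the same value as hashing `data[i:i + w]` from scratch, without revisiting the bytes that stay inside the window.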
Preferably, the grouping the plurality of data blocks to obtain a data block group includes:
extracting the characteristic value of the data block, performing similarity fitting on the acquired characteristic value of the data block and the characteristic value of the data block in the metadata database, classifying the data block into different similarity groups, and establishing a similarity group file data block index in each similarity group.
That is, the characteristic values of file data blocks are extracted through a specific algorithm, the obtained characteristic values are fitted for similarity against the characteristic values of the file data blocks in the meta index, the file data blocks are classified into different similarity groups, and a similarity group file data block index is established in each similarity group to improve query speed.
Preferably, the comparing the bloom value of each data block group in the metadata database to locate the similar group of the data block comprises:
and calculating a similarity value between the data block and the data block group in the metadata base according to the distribution value, and if the similarity value exceeds a similarity threshold set by a system, taking the data block group corresponding to the sampling data of the current file data block as a similarity group of the data block.
That is, if the similarity value of a file data block exceeds the threshold set by the system, the data group corresponding to the sampled data of the current file data block is taken as the similarity group of the file.
Further, by comparing the bloom value of the file with the bloom values of the similar groups in the meta-index, the redundant data block is located and redundantly marked, and the corresponding meta-data is updated. The method can effectively reduce the data query time and query frequency in the process of identifying the redundant data, and greatly improves the performance of marking and removing the redundant data compared with the traditional redundant deletion.
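The threshold test for locating a similarity group can be sketched as follows. The source does not define how the similarity value is computed, so this sketch assumes Jaccard similarity between sets of sampled characteristic values; the function names and the default threshold of 0.5 are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets of characteristic values."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def locate_similarity_group(block_features: set, group_features: dict,
                            threshold: float = 0.5):
    """Return the id of the best group whose similarity exceeds the threshold."""
    best_group, best_score = None, 0.0
    for group_id, features in group_features.items():
        score = jaccard(block_features, features)
        # The similarity must strictly exceed the threshold to qualify.
        if score > threshold and score > best_score:
            best_group, best_score = group_id, score
    return best_group  # None when no group is similar enough
```

With groups `{"g1": {1, 2, 3, 4}, "g2": {7, 8, 9}}`, the feature set `{1, 2, 3}` has similarity 0.75 with `g1` and is assigned to it, while `{10}` matches no group. Raising or lowering `threshold` corresponds to the dynamic adjustment of the similarity threshold mentioned in the text.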
Preferably, the similarity threshold is dynamically adjusted. In each similarity group, the sampling characteristic value of each file forms the characteristic value index of the group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.
Preferably, the metadata base stores related attribute information of the data blocks and the data block groups, including read-write time, redundancy marks, object sizes, and storage paths.
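A metadata record holding the attributes listed above might look like the following sketch. The attributes themselves (read-write time, redundancy mark, object size, storage path) come from the text; the class name, field names, and types are assumptions.

```python
import time
from dataclasses import dataclass


@dataclass
class BlockRecord:
    """Hypothetical metadata record for one data block."""
    feature_value: str             # characteristic value used as the index key
    storage_path: str              # where the block is stored
    object_size: int               # size of the block in bytes
    redundancy_mark: bool = False  # True once the block is marked redundant
    last_access_time: float = 0.0  # read-write time as a Unix timestamp

    def touch(self) -> None:
        # Record the current time as the latest read or write.
        self.last_access_time = time.time()
```

A record such as `BlockRecord("abc", "/data/g0/abc", 4096)` starts unmarked, and `touch()` updates its read-write time.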
Preferably, when the system receives an operation request for the target file, the following processing steps are executed:
judging whether the target file is a file subjected to redundancy marking;
if the target file is not subjected to the redundancy marking operation, directly executing the requested operation on the target file;
if the target file is subjected to redundancy marking operation and has redundant content, acquiring metafile data of the target file, and determining a target data block group and a target data block path of an operation request;
and executing and completing the operation requested to the target file.
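The request-handling steps above can be sketched for a read operation. This is an illustration only: `read_file`, the dictionary layouts, and the `(group, path)` reference format are hypothetical, and plain dictionaries stand in for the file system, the file metadata, and the data block groups.

```python
def read_file(name: str, plain_files: dict, file_metadata: dict,
              block_store: dict) -> bytes:
    """Read a file, resolving block references if it was redundancy-marked."""
    refs = file_metadata.get(name)
    if refs is None:
        # Not redundancy-marked: execute the operation on the file directly.
        return plain_files[name]
    # Marked: look up each (data block group, data block path) and reassemble.
    return b"".join(block_store[group][path] for group, path in refs)
```

A marked file whose metadata lists the same block reference twice is reassembled from a single stored copy of that block, which is how the removal of redundant blocks stays transparent to the requester.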
The redundant data marking and removing technique extracts the characteristic values of file data blocks through a specific algorithm and fits the obtained characteristic values for similarity against the characteristic values of the file data blocks in the meta index. The file data blocks are classified into different similarity groups. In each similarity group, a similarity group file data block index is established to improve query speed, and files that have undergone redundancy marking are divided into different similarity groups according to their similarity. In each similarity group, the sampled characteristic values of the files constitute the characteristic value index of the group. The characteristic values of all file data blocks in the same similarity group are saved in the metadata of the group.
When the system handles a file write request, the characteristic value of the file is extracted first. If the similarity value of a file data block exceeds the threshold set by the system, the data group corresponding to the sampled data of the current file data block is used as the similarity group of the file. The bloom value of the file is then compared with the bloom values of the similar groups in the meta index, the redundant data block is located and marked as redundant, and the corresponding metadata is updated.
According to the embodiment, the data query time and query frequency in the redundant data identification process are effectively reduced, and the performance of marking and removing redundant data is greatly improved compared with the traditional redundant deletion.
Depending on the service characteristics and scale of the system, the redundant data marking and removing function of this embodiment can be deployed on the management node, which gives better efficiency and a shorter optimization convergence time, or deployed independently on each distributed cluster node, which improves system reliability and prevents single points of failure. The deployment mode is therefore flexible.
The effectiveness of redundant data marking is typically measured and compared by the redundant data marking ratio. Given the total size of valid data before redundancy marking and the total size of valid data after redundancy marking, the ratio of these two sizes is the redundant data marking ratio.
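As a small worked example of this ratio, the sketch below assumes it is expressed as the size before marking divided by the size after marking; the source does not state the direction, and the function name and the figures used are illustrative only.

```python
def marking_ratio(size_before: int, size_after: int) -> float:
    """Ratio of valid data size before marking to size after marking (assumed direction)."""
    if size_after <= 0:
        raise ValueError("size_after must be positive")
    return size_before / size_after


# Illustrative figures: 100 units of data reduced to 25 units.
ratio = marking_ratio(100, 25)
```

Under this assumption, reducing 100 units of valid data to 25 units gives a ratio of 4, and a ratio close to 1 means that little redundancy was found.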
According to the embodiment, the operation parameters of the redundant data marking technology can be dynamically adjusted according to the characteristics of the system data volume, the system scale, the system file characteristics, the system IO characteristics and the like, and the redundant data marking technology is flexibly applied.
The present embodiment can be applied to different scenarios, for example: when a large file is uploaded to a server through page service, redundant data marking can be carried out by checking the characteristic value of each fragment, and the advantage of the redundant data marking technology is very obvious.
As another example, when certain data is backed up regularly and the packaged backup data is written to the system, the advantage of redundant data marking and removal is obvious: only data that has not already been backed up is written, which achieves automatic incremental backup. In conventional data backup services, the data difference between successive backup objects is usually very small, possibly only about 2% to 3%; a high change rate may also occur, but its probability is usually very small.
In another application scenario, such as a material library storing audio, video, or pictures, the dynamic blocking introduced by the redundant data marking and removing technique degrades to file-level redundancy identification and marking. Because the probability of matching redundant data segments is very small and block-level continuity of the data is low, performance degrades and the result is a low redundant data marking ratio.
The redundant data marking of this embodiment combines a fast pre-processing strategy with a slow post-processing strategy, and is imperceptible and completely transparent to users and business services.
Firstly, a user can dynamically adjust redundant data marking performance by setting a custom marking training cycle and adjusting the characteristic similarity threshold of the marking groups, thereby dynamically adjusting the use of system resources and minimizing the impact on normal services.
Secondly, by adopting variable-length blocking together with the memory-mapped sliding blocking technique and the efficient DSMT and ATMT algorithms, the operating efficiency is superior to that of traditional redundant data deletion techniques.
In addition, the redundant data marking technology filters redundant data by using a leading file characteristic value extraction technology, the technology extracts the characteristic value of a file data block through a specific algorithm, and similarity fitting is carried out on the obtained characteristic value and the characteristic value of the file data block in the meta index. The file data blocks are classified into different similarity groups.
Finally, the redundant data marking technique establishes a similarity group file data block index in each similarity group to improve query speed.
The invention combines redundant data marking with distributed clustering and provides cluster-level redundancy and reliability. In a multi-node cluster environment, redundant data marking keeps running as long as any node in the system remains available, which effectively ensures the robustness and continuity of the service system.
The redundant data tagging technique of the present invention allows a user to benefit from cost-effective and efficient storage by identifying redundant data in a data system. The method can be directly embodied as effective reduction of initial hardware purchasing cost, and meanwhile, explosive growth of data can be effectively controlled, and subsequent system elastic expansion is realized. Meanwhile, the reduction of management resources, the reduction of space sites, the reduction of power supply and distribution pressure, the reduction of costs of refrigeration, fire protection, maintenance management and the like can be brought.
When the redundant data deleting technology is applied to a backup application scene, the duplicate removal effect is very obvious. In the application scene, the backup server backs up the user data into the NAS storage space, full backup and additional backup are carried out through a certain backup strategy, and the data redundancy is high.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A method for marking and removing redundant data, the method comprising the steps of:
when a file is written, performing dynamic variable-length segmentation on the file to form a plurality of data blocks of different lengths; grouping the data blocks to obtain data block groups, and calculating a bloom value for each data block and each data block group;
processing the bloom value of the data block to form a characteristic value of the data block;
judging whether the characteristic value of the data block exists in a metadata database; if the characteristic value of the data block exists in the metadata database, calculating the bloom value of the data block again, comparing it with the bloom values of all data block groups in the metadata database, locating the similarity group of the data block, and determining redundant data blocks;
marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.
2. The method according to claim 1, wherein if the characteristic value of the data block does not exist in the metadata database, the data block is written into a data block group, and the characteristic value of the data block is stored in the metadata database as an index of the data block.
3. The method according to claim 1, wherein a profile technique of memory auto-mapping is employed to perform dynamic variable-length segmentation on the file according to its attributes.
4. The method according to claim 3, wherein, when the characteristic value of the data block is calculated, the characteristic value of the sliding interval is calculated through a sliding window from the original characteristic value, the in-value, and the out-value of the data block.
5. The method of claim 1, wherein grouping the plurality of data blocks into data block groups comprises:
extracting the characteristic value of the data block, performing similarity fitting between the extracted characteristic value and the characteristic values of the data blocks in the metadata database, classifying the data blocks into different similarity groups, and establishing a file data block index in each similarity group.
6. The method of claim 1, wherein comparing with the bloom values of all data block groups in the metadata database to locate the similarity group of the data block comprises:
calculating a similarity value between the data block and each data block group in the metadata database according to the distribution value, and, if the similarity value exceeds a similarity threshold set by the system, taking the data block group corresponding to the sampled data of the current file data block as the similarity group of the data block.
7. The method of claim 6, wherein the similarity threshold is dynamically adjusted.
8. The method of claim 6, wherein the sampled characteristic values of the files in each similarity group constitute the characteristic value index of that group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.
9. The method according to claim 1, wherein the metadata database stores attribute information of the data blocks and the data block groups, including read/write time, redundancy flag, object size, and storage path.
10. The method according to claim 6, wherein, when the system receives an operation request for a target file, the following processing steps are executed:
judging whether the target file is a file that has been subjected to redundancy marking;
if the target file has not been subjected to a redundancy marking operation, directly executing the requested operation on the target file;
if the target file has been subjected to a redundancy marking operation and has redundant content, acquiring metafile data of the target file, and determining the target data block group and the target data block path of the operation request;
executing and completing the requested operation on the target file.
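The write path of claim 1, together with the sliding-window calculation of claim 4, can be sketched roughly as follows. This is a simplified illustration under stated assumptions rather than the claimed method itself: the polynomial rolling hash and all of its constants, the use of SHA-256 as the characteristic value, and the names split_variable_length, BloomFilter, and write_file are hypothetical. The sketch confirms each Bloom filter hit against an in-memory stand-in for the metadata database and only marks duplicates. It leaves out the location of similarity groups and the deletion policy.

```python
import hashlib

WINDOW = 16                 # sliding-window width in bytes (assumed)
BASE = 257                  # polynomial base (assumed)
MOD = (1 << 31) - 1         # modulus (assumed)
MASK = 0x3F                 # cut where the low 6 bits of the hash are zero
MIN_LEN, MAX_LEN = 32, 256  # chunk length bounds (assumed)
OUT_FACTOR = pow(BASE, WINDOW, MOD)


def split_variable_length(data: bytes):
    """Content-defined chunking driven by a sliding-window rolling hash."""
    chunks, start, h = [], 0, 0
    for i, byte_in in enumerate(data):
        h = (h * BASE + byte_in) % MOD                       # add the in-value
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * OUT_FACTOR) % MOD    # drop the out-value
        length = i - start + 1
        if (length >= MIN_LEN and (h & MASK) == 0) or length >= MAX_LEN:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                          # trailing remainder
    return chunks


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1 << 16, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)

    def _positions(self, item: bytes):
        for seed in range(self.hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item: bytes):
        for pos in self._positions(item):
            self.array[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: bytes):
        return all(self.array[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def write_file(data: bytes, bloom: BloomFilter, metadata: dict):
    """Return (characteristic_value, is_redundant) for each block of the file."""
    results = []
    for block in split_variable_length(data):
        feature = hashlib.sha256(block).hexdigest()
        key = feature.encode()
        # A Bloom hit may be a false positive, so confirm it in the metadata.
        redundant = key in bloom and feature in metadata
        if redundant:
            metadata[feature]["duplicates"] += 1   # mark only; deletion is policy-driven
        else:
            bloom.add(key)
            metadata[feature] = {"size": len(block), "duplicates": 0}
        results.append((feature, redundant))
    return results
```

Because the cut points depend on the content of the last WINDOW bytes rather than on fixed offsets, writing the same data a second time yields the same blocks, and every one of them is then found in the metadata and marked as redundant.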
CN202110838878.2A 2021-07-23 2021-07-23 Redundant data marking and removing method Pending CN113672170A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110838878.2A CN113672170A (en) 2021-07-23 2021-07-23 Redundant data marking and removing method

Publications (1)

Publication Number Publication Date
CN113672170A true CN113672170A (en) 2021-11-19

Family

ID=78540047

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110838878.2A Pending CN113672170A (en) 2021-07-23 2021-07-23 Redundant data marking and removing method

Country Status (1)

Country Link
CN (1) CN113672170A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303404A (en) * 2023-05-11 2023-06-23 起点(山东)大数据科技有限责任公司 Big data storage system for preventing data redundancy based on data classification and peer comparison
CN116991332A (en) * 2023-09-26 2023-11-03 长春易加科技有限公司 Intelligent factory large-scale data storage and analysis method
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101963982A (en) * 2010-09-27 2011-02-02 清华大学 Method for managing metadata of redundancy deletion and storage system based on location sensitive Hash
CN102323958A (en) * 2011-10-27 2012-01-18 上海文广互动电视有限公司 Data de-duplication method
CN103345472A (en) * 2013-06-04 2013-10-09 北京航空航天大学 Redundancy removal file system based on limited binary tree bloom filter and construction method of redundancy removal file system
CN104881470A (en) * 2015-05-28 2015-09-02 暨南大学 Repeated data deletion method oriented to mass picture data
CN106611035A (en) * 2016-06-12 2017-05-03 四川用联信息技术有限公司 Retrieval algorithm for deleting repetitive data in cloud storage
CN109101360A (en) * 2017-06-21 2018-12-28 北京大学 A kind of data completeness protection method based on Bloom filter and intersection coding
CN110968452A (en) * 2019-11-20 2020-04-07 华北电力大学(保定) Data integrity verification method capable of safely removing duplicate in cloud storage of smart power grid

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116303404A (en) * 2023-05-11 2023-06-23 起点(山东)大数据科技有限责任公司 Big data storage system for preventing data redundancy based on data classification and peer comparison
CN116303404B (en) * 2023-05-11 2023-08-04 起点(山东)大数据科技有限责任公司 Big data storage system for preventing data redundancy based on data classification and peer comparison
CN116991332A (en) * 2023-09-26 2023-11-03 长春易加科技有限公司 Intelligent factory large-scale data storage and analysis method
CN116991332B (en) * 2023-09-26 2023-12-15 长春易加科技有限公司 Intelligent factory large-scale data storage and analysis method
CN117369731A (en) * 2023-12-07 2024-01-09 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium
CN117369731B (en) * 2023-12-07 2024-02-27 苏州元脑智能科技有限公司 Data reduction processing method, device, equipment and medium

Similar Documents

Publication Publication Date Title
US11768803B2 (en) Snapshot metadata arrangement for efficient cloud integrated data management
CN113672170A (en) Redundant data marking and removing method
US9613043B2 (en) Object deduplication and application aware snapshots
US8712963B1 (en) Method and apparatus for content-aware resizing of data chunks for replication
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US9201891B2 (en) Storage system
US9141633B1 (en) Special markers to optimize access control list (ACL) data for deduplication
US9424185B1 (en) Method and system for garbage collection of data storage systems
US7366859B2 (en) Fast incremental backup method and system
US11182256B2 (en) Backup item metadata including range information
US9183218B1 (en) Method and system to improve deduplication of structured datasets using hybrid chunking and block header removal
CN102323958A (en) Data de-duplication method
US8924366B2 (en) Data storage deduplication systems and methods
CN108255647B (en) High-speed data backup method under samba server cluster
AU2010200866B1 (en) Data reduction indexing
CN109710455B (en) Deleted file recovery method and system based on FAT32 file system
CN102999433A (en) Redundant data deletion method and system of virtual disks
CN107506466B (en) Small file storage method and system
CN111399765B (en) Data processing method and device, electronic equipment and readable storage medium
CN106980680B (en) Data storage method and storage device
CN106874399B (en) Networking backup system and backup method
Zhu et al. An intelligent data de-duplication based backup system
US20220197861A1 (en) System and method for reducing read amplification of archival storage using proactive consolidation
CN105302669A (en) Method and system for data deduplication in cloud backup process
CN109697197B (en) Method for engraving and restoring Access database file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination