CN113672170A

CN113672170A - Redundant data marking and removing method

Info

Publication number: CN113672170A
Application number: CN202110838878.2A
Authority: CN
Inventors: 朱敏俊; 王奕; 黄宗浩; 李渊; 张晖; 厉励; 张逸鲁; 高宇; 戴梅; 黄麒玮; 蔡云飞; 曹斌; 石强; 王正源; 王骏杰; 于镆铘; 崔敏杰; 胡佳迎
Original assignee: Fudan University Shanghai Cancer Center
Current assignee: Fudan University Shanghai Cancer Center
Priority date: 2021-07-23
Filing date: 2021-07-23
Publication date: 2021-11-19

Abstract

The invention relates to a redundant data marking and removing method, and belongs to the technical field of data storage. The method comprises the following steps: when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths; grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups; processing the bloom value of the data block to form a characteristic value of the data block; judging whether the characteristic value of the data block exists in a metadata base or not; if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks; marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy. The method has the advantages of high redundancy recognition rate, high reliability, high robustness and less resource occupation.

Description

Redundant data marking and removing method

Technical Field

The invention belongs to the technical field of data storage, and particularly relates to a redundant data marking and removing method.

Background

Redundant data marking (redundancy marking) is a data reduction technique that aims to reduce the storage capacity consumed in a storage system. It searches for redundant variable-size data blocks at different positions in different files, marks and deletes them, and retains only one or a few necessary copies, thereby eliminating excess redundant data. Redundant data marking can greatly reduce the consumption of physical storage space, improve the efficiency of services such as retrieval, save transmission bandwidth, and allow efficient and economical replication of backup data among a user's different sites.

According to the stage at which data is processed, redundant data marking and removing techniques are classified into pre-processing, online processing, and post-processing.

In the pre-processing mode, redundancy detection is performed before data is written into the system, and redundant data is marked or deleted at that point. Pre-processing effectively reduces the load on the back-end storage system and avoids unnecessary redundant writes, but it also increases write latency and reduces the response speed of the system.

In the online processing mode, a characteristic value is extracted and the redundant data marking algorithm is executed while the data is being written to disk. Online redundant data deletion reduces the data volume to some extent, but the deduplication operation itself lowers data throughput, which degrades service performance.

In the post-processing mode, redundant data is deleted after the data has been written to disk: the data is first written into temporary disk space, redundant data deletion is then started, and finally the deduplicated data is written to disk. Because the redundant data deletion is performed on a separate storage device after the data is written, it generally has little effect on normal business processing. However, current post-processing approaches cannot dynamically adjust their occupation of system resources and cannot give priority to the performance of online services, so online services are affected when system occupancy becomes too high.

Redundant data marking and removal techniques may be classified at the file level, block level, byte level, depending on the granularity of the redundant identification target.

File-level redundant data marking and removal identifies and removes redundant data with the file as the basic identification unit. Its advantages are a relatively simple identification algorithm, easy engineering implementation, and high identification efficiency and speed; its disadvantages are that the target scenarios are limited and the redundancy identification rate is very low for large files.

Block-level redundant data marking and removal performs redundancy detection with the minimum block of the storage system as the identification unit. Its advantages are that the underlying libraries of the operating system fully support the identification algorithm, calculation is fast, and it is sensitive to data changes, which makes it suitable for detecting dynamic files.

The block-level redundant data marking and removing are divided into a fixed-length blocking mode and a variable-length blocking mode according to different blocking modes.

Referring to FIG. 1, the fixed-length blocking mode divides a file into blocks of fixed length. This mode is overly sensitive to data insertion and deletion. In practical applications the redundancy rate of dynamic data changes greatly over time, so redundancy identification results are strongly time-dependent and the identification effect is relatively limited.

Byte-level redundant data marking and removal searches for and marks redundant data in units of bytes, typically generating the differing portion of the content through a differential marking strategy. Byte-level deduplication has the advantage of a high deduplication rate; its disadvantages are low deduplication efficiency and speed, and the difference content accounts for a large proportion of the content and occupies a large amount of system storage resources. It is generally used in post-processing redundancy removal.

In addition, given the wide use of distributed storage systems, traditional redundant data identification methods cannot achieve both performance and accuracy and have certain defects.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a redundant data marking and removing method that adaptively adjusts the impact of the identification algorithm and the deduplication operation on system resources according to the system state, and that allows the granularity and priority of redundancy marking to be set. The method preferentially guarantees the resources needed by the normal services of the system and minimizes the impact on them, and has the advantages of a high redundancy identification rate, high reliability, high robustness, and low resource occupation.

According to one aspect of the present invention, there is provided a method for marking and removing redundant data, the method comprising the steps of:

when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths;

grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups;

processing the bloom value of the data block to form a characteristic value of the data block;

judging whether the characteristic value of the data block exists in a metadata base or not;

if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks;

marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.

Preferably, if the characteristic value of the data block does not exist in the metadata base, the data block is written into a certain data block group, and the characteristic value of the data block is stored in the metadata base as an index of the data block.
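By way of illustration only, the lookup-then-write decision described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: a SHA-256 digest stands in for the characteristic value, plain dictionaries stand in for the metadata base and the block storage, and the names `feature_value` and `write_blocks` are hypothetical. Variable-length segmentation, grouping, and the second bloom-value comparison against data block groups are omitted.

```python
import hashlib


def feature_value(block: bytes) -> str:
    # Assumption: a SHA-256 digest stands in for the "characteristic value".
    return hashlib.sha256(block).hexdigest()


def write_blocks(blocks, metadata_base, store):
    """Return a list of (characteristic value, is_redundant) marks."""
    marks = []
    for block in blocks:
        fv = feature_value(block)
        if fv in metadata_base:
            # Characteristic value already indexed: mark the block as
            # redundant and keep only a reference to the stored copy.
            metadata_base[fv]["refs"] += 1
            marks.append((fv, True))
        else:
            # New block: store it and index it by its characteristic value.
            store[fv] = block
            metadata_base[fv] = {"size": len(block), "refs": 1}
            marks.append((fv, False))
    return marks


metadata_base, store = {}, {}
marks = write_blocks([b"alpha", b"beta", b"alpha"], metadata_base, store)
redundant_flags = [flag for _, flag in marks]
```

In this sketch the third block is marked redundant because its characteristic value was indexed when the first block was written, so only two blocks are physically stored.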

Preferably, a feature map technique based on automatic memory mapping is adopted, and the file is subjected to dynamic variable-length segmentation according to the attributes of the file.

Preferably, when calculating the characteristic value of the data block, the data characteristic value of the sliding interval is calculated through a sliding window according to the original characteristic value, the characteristic value sliding-in value and the characteristic value sliding-out value of the data block.

Preferably, the grouping the plurality of data blocks to obtain a data block group includes:

extracting the characteristic value of the data block, performing similarity fitting on the acquired characteristic value of the data block and the characteristic value of the data block in the metadata database, classifying the data block into different similarity groups, and establishing a similarity group file data block index in each similarity group.

Preferably, the comparing the bloom value of each data block group in the metadata database to locate the similar group of the data block comprises:

calculating a similarity value between the data block and each data block group in the metadata base according to the distribution value, and, if the similarity value exceeds a similarity threshold set by the system, taking the data block group corresponding to the sampling data of the current file data block as the similarity group of the data block.

Preferably, the similarity threshold is dynamically adjusted.

Preferably, in each similarity group, the sample characteristic value of each file forms the characteristic value index of the group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.

Preferably, the metadata base stores related attribute information of the data blocks and the data block groups, including read-write time, redundancy marks, object sizes, and storage paths.

Preferably, when the system receives an operation request for the target file, the following processing steps are executed:

judging whether the target file is a file subjected to redundancy marking;

if the target file is not subjected to the redundancy marking operation, directly executing the requested operation on the target file;

if the target file is subjected to redundancy marking operation and has redundant content, acquiring metafile data of the target file, and determining a target data block group and a target data block path of an operation request;

and executing and completing the operation requested to the target file.
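The request-handling steps above might be sketched as follows for a read request. This is only an illustrative sketch: the `FileMeta` record, the `read_file` function, and the dictionary-based stores are hypothetical stand-ins for the metafile data, the data block groups, and the data block paths, none of which are specified in this form by the source.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    # Hypothetical metafile record for one file.
    redundancy_marked: bool = False
    # Ordered (group id, block path) pairs that make up the file content.
    block_refs: list = field(default_factory=list)


def read_file(name, file_index, raw_files, block_store):
    meta = file_index.get(name)
    if meta is None or not meta.redundancy_marked:
        # Not redundancy-marked: execute the operation on the file directly.
        return raw_files[name]
    # Redundancy-marked: resolve every block through its group and path.
    return b"".join(block_store[group][path] for group, path in meta.block_refs)


raw_files = {"plain.txt": b"hello"}
block_store = {"g1": {"b0": b"foo", "b1": b"bar"}}
file_index = {
    "plain.txt": FileMeta(),
    "dedup.bin": FileMeta(True, [("g1", "b0"), ("g1", "b1"), ("g1", "b0")]),
}
plain = read_file("plain.txt", file_index, raw_files, block_store)
rebuilt = read_file("dedup.bin", file_index, raw_files, block_store)
```

Here the block at path `b0` is stored once but referenced twice, which is how a redundancy-marked file can be reassembled without keeping duplicate copies.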

Advantageous effects: the redundant data marking and removing method preferentially guarantees the resources needed by the normal services of the system, minimizes the impact on those services, and has the advantages of a high redundancy identification rate, high reliability, high robustness, and low resource occupation.

The features and advantages of the present invention will become apparent by reference to the following drawings and detailed description of specific embodiments of the invention.

Drawings

FIG. 1 is a schematic diagram of a fixed-length chunking technique of the prior art;

FIG. 2 is a flow chart of a method for marking and removing redundant data according to the present invention;

FIG. 3 is a schematic diagram of the variable-length blocking technique of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

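FIG. 2 is a flow chart of a method for marking and removing redundant data according to the present invention. As shown in FIG. 2, the present invention provides a method for marking and removing redundant data, which comprises the following steps:

when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths;

grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups;

processing the bloom value of the data block to form a characteristic value of the data block;

judging whether the characteristic value of the data block exists in a metadata base or not;

if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks;

marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.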

Specifically, system environment identification and parameter setting are carried out first. The type of the file is then identified, the bloom value of each data block is calculated, and the metadata base is queried to determine whether that bloom value already exists. If it does not exist, the data block is a non-redundant data block and is written into a data block group in the system. If it exists, the data block is a redundant data block; its similar group is then located, and mark data is generated and stored as metadata. Finally, a database record or a log record is generated for future reference.

Processing the bloom values comprises filtering them with a filter. Illustratively, the filter may be a bloom filter, which consists of one long binary vector (bit array) and several hash functions and can be used to quickly test whether an element belongs to a set.

When an element is added to a bloom filter, the element is hashed by each hash function of the bloom filter, so that several hash functions yield several hash values. The bits of the bit array at the indices given by these hash values are then set to 1.

To determine whether an element exists in the bloom filter, the same hash calculations are applied to the given element and the bit at each resulting index of the bit array is checked. If all of these bits are 1, the element is considered to be in the bloom filter (a bloom filter can report false positives); if any one of them is not 1, the element is definitely not in the bloom filter.
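The add and query operations just described can be illustrated with a small sketch. It is only an example under assumptions that the source does not state: the bit array size, the number of hash functions, and the use of salted SHA-256 digests as the hash functions are all chosen here for illustration, and the class name `BloomFilter` is hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per bit position, for readability rather than compactness.
        self.bits = bytearray(num_bits)

    def _positions(self, item: bytes):
        # Assumption: k hash functions are derived by salting SHA-256
        # with the index of the hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        # Set the bit at every hashed position to 1.
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: bytes) -> bool:
        # All bits set: possibly present. Any bit clear: definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
bf.add(b"block-1")
empty = BloomFilter()
```

An added element is always reported as present, and a filter to which nothing has been added reports every element as absent. Between those extremes, membership answers can include false positives but never false negatives.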

Referring to FIG. 3, variable-length blocking uses a feature map technique based on automatic memory mapping to segment a file object dynamically into variable-length blocks according to its attributes. The technique is insensitive to changes in the file object: adding or deleting part of the data affects only some of the data blocks, and the affected data blocks are re-segmented by dynamically adjusting the characteristic range, so that the number of affected data blocks is minimized.
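For readers unfamiliar with variable-length blocking, the general idea can be illustrated with a generic content-defined chunking sketch. This is not the memory-mapping technique of the invention, whose details are not given here; it swaps in a simple gear-style rolling hash, and the table `GEAR`, the function `chunk`, and the size parameters are assumptions chosen for the example. A block boundary is placed wherever the hash of the most recent bytes matches a bit pattern, so boundaries follow the content rather than fixed offsets.

```python
import random

# Fixed per-byte random table for the gear-style hash (illustrative values).
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes, min_size: int = 64, avg_mask: int = 0xFF,
          max_size: int = 1024):
    """Split data into variable-length blocks at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Older bytes shift out of the 32-bit hash after 32 steps.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_size and (h & avg_mask) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


data = bytes(random.Random(1).getrandbits(8) for _ in range(20000))
chunks = chunk(data)
```

Because a boundary decision depends only on nearby bytes, inserting or deleting data tends to change only the blocks around the edit, whereas fixed-length blocking shifts every later block.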

When the characteristic value of a data block is calculated, the characteristic value of the sliding interval can be computed quickly from the original characteristic value, the characteristic value sliding-in value, and the characteristic value sliding-out value in the file object, which improves the execution efficiency of the redundant data marking operation.
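The sliding-window update can be illustrated with a polynomial rolling hash, in which the value for the next window is derived from the previous value, the byte sliding out, and the byte sliding in. This is a sketch of one well-known way to realize such an update, not necessarily the calculation used by the invention; `BASE`, `MOD`, `window_hash`, and `slide` are names and constants assumed for the example.

```python
BASE, MOD = 257, (1 << 61) - 1


def window_hash(window: bytes) -> int:
    """Hash a whole window from scratch (the "original" value)."""
    h = 0
    for byte in window:
        h = (h * BASE + byte) % MOD
    return h


def slide(h: int, out_byte: int, in_byte: int, top: int) -> int:
    """Update h in O(1): drop out_byte, append in_byte.

    top must equal BASE ** (window_length - 1) % MOD.
    """
    return ((h - out_byte * top) * BASE + in_byte) % MOD


data = b"redundant data marking"
W = 8
top = pow(BASE, W - 1, MOD)
h = window_hash(data[:W])
rolled = [h]
for i in range(W, len(data)):
    h = slide(h, data[i - W], data[i], top)
    rolled.append(h)
```

Each update costs a constant number of operations regardless of the window length, which is the source of the efficiency gain over rehashing every window.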

The characteristic values of the file data blocks are extracted by a specific algorithm, and similarity fitting is performed between the obtained characteristic values and the characteristic values of the file data blocks in the meta index. The file data blocks are thereby classified into different similarity groups, and a similarity group file data block index is established in each similarity group to improve query speed.

If the similarity value of a file data block exceeds the threshold set by the system, the data group corresponding to the sampling data of the current file data block is taken as the similarity group of the file.
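As an illustration of threshold-based assignment to similarity groups, the sketch below compares sets of characteristic values using Jaccard similarity. The source does not name the similarity measure, so Jaccard similarity, the default threshold of 0.5, and the names `jaccard` and `assign_group` are all assumptions made for the example.

```python
def jaccard(a: set, b: set) -> float:
    """Share of characteristic values that two sets have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def assign_group(features: set, groups: dict, threshold: float = 0.5) -> str:
    """Return the best matching group id, creating a new group if none fits."""
    best_id, best_sim = None, 0.0
    for gid, sample in groups.items():
        sim = jaccard(features, sample)
        if sim > best_sim:
            best_id, best_sim = gid, sim
    if best_id is not None and best_sim > threshold:
        groups[best_id] |= features  # extend the group's characteristic index
        return best_id
    new_id = f"group-{len(groups)}"
    groups[new_id] = set(features)
    return new_id


groups = {}
g1 = assign_group({"a", "b", "c", "d"}, groups)
g2 = assign_group({"a", "b", "c", "e"}, groups)  # similarity 3/5 exceeds 0.5
g3 = assign_group({"x", "y"}, groups)            # nothing in common
```

Raising the threshold makes groups tighter and more numerous, while lowering it merges more blocks into fewer groups, which is one reason an adjustable threshold is useful.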

Further, by comparing the bloom value of the file with the bloom values of the similar groups in the meta index, the redundant data blocks are located and marked as redundant, and the corresponding metadata is updated. This effectively reduces the data query time and query frequency during redundant data identification, and greatly improves the performance of marking and removing redundant data compared with traditional redundancy deletion.

Preferably, the similarity threshold is dynamically adjusted. In each similarity group, the sampling characteristic value of each file forms the characteristic value index of the group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.

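Preferably, when the system receives an operation request for a target file, the following processing steps are executed:

judging whether the target file is a file subjected to redundancy marking;

if the target file is not subjected to the redundancy marking operation, directly executing the requested operation on the target file;

if the target file is subjected to the redundancy marking operation and has redundant content, acquiring the metafile data of the target file, and determining the target data block group and target data block path of the operation request;

and executing and completing the requested operation on the target file.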

The redundant data marking and removing technique extracts the characteristic values of the file data blocks by a specific algorithm and performs similarity fitting between the obtained characteristic values and the characteristic values of the file data blocks in the meta index, so that the file data blocks are classified into different similarity groups. In each similarity group, a similarity group file data block index is established to improve query speed, and files that have undergone redundancy marking are divided into different similarity groups according to their similarity. In each similarity group, the sample characteristic values of the files constitute the characteristic value index of the group, and the characteristic values of all file data blocks in the same similarity group are saved in the metadata of the group.

When the system needs to handle a file write request, the characteristic value of the file is extracted first. If the similarity value of a file data block exceeds the threshold set by the system, the data group corresponding to the sampling data of the current file data block is taken as the similarity group of the file. The bloom value of the file is then compared with the bloom values of the similar groups in the meta index, the redundant data blocks are located and marked as redundant, and the corresponding metadata is updated.

This embodiment effectively reduces the data query time and query frequency during redundant data identification, and greatly improves the performance of marking and removing redundant data compared with traditional redundancy deletion.

Depending on the service characteristics and scale of the system, the redundant data marking and removing function of this embodiment can be deployed on the management node, which gives better efficiency and a shorter optimization convergence time, or independently on each distributed cluster node, which improves system reliability and prevents single points of failure. The deployment mode is therefore flexible.

The effectiveness of redundant data marking is typically measured and compared by the redundant data marking ratio, which is the ratio of the total size of valid data before redundancy marking to the total size of valid data after redundancy marking.
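As a worked illustration of this ratio, the sketch below computes it for made-up sizes. The function names `marking_ratio` and `space_saving` and the sample figures are assumptions for the example only. The space-saving percentage is a commonly used companion figure that is not defined in the source.

```python
def marking_ratio(size_before: int, size_after: int) -> float:
    """Total valid data size before marking divided by the size after."""
    if size_after <= 0:
        raise ValueError("size_after must be positive")
    return size_before / size_after


def space_saving(size_before: int, size_after: int) -> float:
    """Fraction of the original space that is no longer needed."""
    return 1 - size_after / size_before


# Hypothetical example: 1000 units of data reduce to 250 units.
ratio = marking_ratio(1000, 250)
saving = space_saving(1000, 250)
```

With these illustrative sizes the ratio is 4 (often written 4:1), which corresponds to a 75% saving in storage space.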

According to the embodiment, the operation parameters of the redundant data marking technology can be dynamically adjusted according to the characteristics of the system data volume, the system scale, the system file characteristics, the system IO characteristics and the like, and the redundant data marking technology is flexibly applied.

The present embodiment can be applied to different scenarios. For example, when a large file is uploaded to a server through a web page service, redundant data marking can be performed by checking the characteristic value of each fragment, and the advantage of the redundant data marking technique is very obvious.

In another example, certain data is backed up periodically. After the backup data is packaged and a write request is issued to the system, the advantage of redundant data marking and removal is obvious: only the data that has not yet been backed up is written, which achieves automatic incremental backup. In conventional data backup services, the data difference between different backup objects is usually very small, possibly only about 2%-3%; a high change rate is possible, but its probability is usually very low.

In another application scenario, for example a material library that stores audio, video, or pictures, the dynamic blocking introduced by the redundant data marking and removing technique degrades to redundancy identification and marking at file-level granularity. Because the probability of matching redundant data segments is very small and block-level continuity of the data is low, performance degrades and the result shows a low redundant data marking ratio.

The redundant data marking of this embodiment combines a fast pre-processing strategy with a slow post-processing strategy, and is imperceptible and completely transparent to users and business services.

Firstly, a user can dynamically tune redundant data marking performance by setting a custom marking training cycle and adjusting the characteristic similarity threshold of the marking groups, thereby dynamically adjusting the occupation of system resources and minimizing the impact on normal services.

Secondly, the variable-length blocking technique uses memory-mapped sliding blocking together with the efficient DSMT algorithm and ATMT algorithm, so that its operating efficiency is superior to that of traditional redundant data deletion techniques.

In addition, the redundant data marking technique filters redundant data using an advanced file characteristic value extraction technique: the characteristic value of each file data block is extracted by a specific algorithm, similarity fitting is performed between the obtained characteristic value and the characteristic values of the file data blocks in the meta index, and the file data blocks are classified into different similarity groups.

Finally, the redundant data marking technique establishes a similarity group file data block index in each similarity group to improve query speed.

The invention combines redundant data marking with distributed clustering to provide cluster-level redundancy and reliability. In a multi-node cluster environment, redundant data marking keeps running as long as at least one node in the system remains available, which effectively ensures the robustness and continuity of the service system.

The redundant data marking technique of the present invention lets users benefit from cost-effective and efficient storage by identifying redundant data in a data system. This translates directly into a lower initial hardware purchasing cost, while explosive data growth is kept under control and subsequent elastic expansion of the system becomes possible. It also reduces management resources, floor space, power supply and distribution pressure, and the costs of cooling, fire protection, maintenance, and management.

When the redundant data deletion technique is applied to a backup scenario, the deduplication effect is very pronounced. In this scenario, the backup server backs up user data into NAS storage space, performing full backups and incremental backups according to a given backup strategy, so data redundancy is high.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all modifications and equivalents of the present invention, which are made by the contents of the present specification and the accompanying drawings, or directly/indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims

1. A method for marking and removing redundant data, the method comprising the steps of:

when a file is written, the file is subjected to dynamic variable-length segmentation to form a plurality of data blocks with different lengths;

grouping the data blocks to obtain data block groups, and calculating bloom values of each data block and the data block groups;

processing the bloom value of the data block to form a characteristic value of the data block;

judging whether the characteristic value of the data block exists in a metadata base or not;

if the characteristic values of the data blocks exist in the metadata base, calculating the bloom values of the data blocks again, comparing the bloom values with the bloom values of all data block groups in the metadata base, positioning similar groups of the data blocks, and determining redundant data blocks;

marking the redundant data blocks and deleting or retaining the redundant data blocks according to a predetermined strategy.

2. The method according to claim 1, wherein if the characteristic value of the data block does not exist in the metadata database, the data block is written into a certain data block group, and the characteristic value of the data block is stored in the metadata database as an index of the data block.

3. The method according to claim 1, wherein a feature map technique based on automatic memory mapping is employed to perform dynamic variable-length segmentation on the file according to its attributes.

4. The method according to claim 3, wherein when calculating the characteristic value of the data block, the data characteristic value of the sliding interval is calculated through a sliding window according to the original characteristic value, the characteristic value sliding-in value, and the characteristic value sliding-out value of the data block.

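5. The method of claim 1, wherein grouping the plurality of data blocks into data block groups comprises:

extracting the characteristic value of the data block, performing similarity fitting on the acquired characteristic value of the data block and the characteristic value of the data block in the metadata database, classifying the data block into different similarity groups, and establishing a similarity group file data block index in each similarity group.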

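6. The method of claim 1, wherein comparing the bloom value of each data block group in the metadata database to locate the similarity group of the data blocks comprises:

calculating a similarity value between the data block and the data block group in the metadata base according to the distribution value, and if the similarity value exceeds a similarity threshold set by a system, taking the data block group corresponding to the sampling data of the current file data block as a similarity group of the data block.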

7. The method of claim 6, wherein the similarity threshold is dynamically adjusted.

8. The method of claim 6, wherein the sample characteristic value of each file in each similarity group constitutes the characteristic value index of the group, and the characteristic values of all data blocks in the same similarity group are stored in the metadata of the group.

9. The method according to claim 1, wherein the metadata database stores related attribute information of the data blocks and the data block groups, including read/write time, redundancy flag, object size, and storage path.

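10. The method according to claim 6, wherein when the system receives an operation request for a target file, the following processing steps are executed:

judging whether the target file is a file subjected to redundancy marking;

if the target file is not subjected to the redundancy marking operation, directly executing the requested operation on the target file;

if the target file is subjected to the redundancy marking operation and has redundant content, acquiring the metafile data of the target file, and determining the target data block group and target data block path of the operation request;

and executing and completing the requested operation on the target file.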