CN111143343A

CN111143343A - Data efficient deleting method and system based on source-end deduplication

Info

Publication number: CN111143343A
Application number: CN201911374951.4A
Authority: CN
Inventors: 周建华; 张有成; 姚崎; 丁红; 李海鹏; 许萍萍
Original assignee: Nanjing Unary Information Technology Co ltd
Current assignee: Nanjing Unary Information Technology Co ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2020-05-12
Anticipated expiration: 2039-12-27
Also published as: CN111143343B

Abstract

The invention discloses an efficient data deletion method based on source-end deduplication. During backup, the data stream at the source end is split into data blocks, and a fingerprint is calculated for each block and compared against the existing fingerprints. If a fingerprint does not exist, the data block is new: it is transmitted to a container at the server end for storage and that container is marked 1. Once a container is full it is written to a data file and a new container is created. When a backup set expires it is cleaned automatically and its object records, identified by guid, are deleted. During idle time outside the normal business window, a preset cyclic deletion logic cleans the data blocks and fingerprints of containers marked 0; a mark of 0 indicates that the data blocks in the container and their fingerprints are not referenced and can be cleaned. The advantages are that the marking approach keeps the statistical logic simple, the cleaning logic is not affected by the size of the deduplication library, and efficiency is higher.

Description

Data efficient deleting method and system based on source-end deduplication

Technical Field

The invention relates to a source-end-deduplication-based data efficient deleting method and system, and belongs to the technical field of data protection.

Background

Source-end deduplication reduces transmission bandwidth and storage space, so it is widely used in data protection products. For convenience of explanation, data that has been deduplicated at the source end is said to be stored in the deduplication library, which comprises a deduplication fingerprint library and a deduplication data library. Index information for the data blocks is stored in the fingerprint library, and the data blocks themselves are stored in the data library. Data stored after source-end deduplication has the following characteristic: each data block in the deduplication data library is unique across the whole library, and most data blocks are shared by multiple data sources; it is this characteristic that makes the reduction in storage space possible. While it helps reduce storage space, it greatly complicates deletion, because data in the deduplication library cannot be cleaned as easily as ordinary data. The first prior-art method records a reference count for each data block: the count is increased for repeated data blocks during backup and decreased for the blocks a backup contains when that backup is deleted, and a count of 0 indicates that the block can be cleaned and its storage space released.
The drawback of this method is that backup and deletion performance degrade considerably as the deduplication library grows. The other method is centralized cleaning, which runs at a specific point in time: all data files in use are marked (data files are coarser-grained than data blocks, so the statistics are fast), and then the data files and fingerprints not in use are deleted to release space.
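The reference-counting approach described above can be illustrated with a minimal sketch. The class name `RefCountStore` and its in-memory dictionary are assumptions made for illustration only; they are not taken from any prior-art product.

```python
from collections import defaultdict


class RefCountStore:
    """Toy model of per-block reference counting (hypothetical names)."""

    def __init__(self):
        self.refs = defaultdict(int)  # fingerprint -> reference count

    def backup(self, fingerprints):
        # Every block a backup contains adds one reference.
        for fp in fingerprints:
            self.refs[fp] += 1

    def delete(self, fingerprints):
        # Deleting a backup subtracts one reference per contained block;
        # a block whose count reaches 0 can be cleaned.
        freed = []
        for fp in fingerprints:
            self.refs[fp] -= 1
            if self.refs[fp] == 0:
                del self.refs[fp]
                freed.append(fp)
        return freed


store = RefCountStore()
store.backup(["a", "b"])          # first backup
store.backup(["b", "c"])          # second backup shares block "b"
freed = store.delete(["a", "b"])  # only "a" becomes unreferenced
```

Every backup and every deletion touches one counter per block, which is why the cost of this approach grows with the deduplication library.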

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of existing source-end deduplication technology, in which the uniqueness of deduplicated data makes deletion logic complex and inefficient and prevents space from being released quickly, and to provide an efficient data deletion method and system based on source-end deduplication.

In order to solve the above technical problem, the invention provides an efficient data deletion method based on source-end deduplication. In the backup process, the data stream at the source end is split into data blocks, fingerprints are calculated and compared, and if a fingerprint does not exist, the corresponding data block is transmitted to a container at the server end for storage and that container is marked 1; after the container is full it is written to a data file and a new container is created, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of fixed-size data files, and each data file comprises a plurality of containers;

when a backup set expires it is cleaned automatically, and its object records, identified by guid, are deleted;

and during idle time outside the normal business window, data blocks and fingerprints are cleaned from containers marked 0 using preset cyclic deletion logic, wherein a mark of 0 indicates that the data blocks in the container and their fingerprints are not referenced and can be cleaned.

Further, the container is fixed in size.

Further, the process of marking each container is as follows:

determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and the index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;

and acquiring the referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.

Further, the loop deletion logic is:

S1, in the backup process, for each referenced data block, marking the container in which the data block is located as 1, and marking the corresponding object record as 1, which indicates that it has been checked;

S2, traversing the object records and finding the objects marked 0 (a mark of 0 indicates that the object has not been checked), locating, according to the index information recorded in the object files, the containers in the deduplication library that store the corresponding data blocks, and marking the containers corresponding to the fingerprints as 1;

S3, traversing the container records in the deduplication library, cleaning the data blocks and their fingerprints in each container marked 0, and then marking the container state as 2, which indicates that the container has been cleaned and can be reused;

S4, resetting the container records marked 1 in the deduplication library to 0, and marking all the object records in the object library as 0;

S5, collecting all containers marked 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;

S6, executing the above steps S1-S5 cyclically at a set period.
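Steps S2 to S5 can be sketched as a single pass over in-memory records. This is a minimal sketch under stated assumptions: the dictionary layouts, the function name `run_cleanup_cycle` and the constant names are hypothetical, and step S1 is assumed to have already set the marks during backup.

```python
UNREFERENCED, REFERENCED, RECLAIMED = 0, 1, 2  # container marks from S1-S5


def run_cleanup_cycle(objects, containers, fingerprint_index):
    """objects: guid -> {"mark": 0 or 1, "fingerprints": [...]}
    containers: container id -> {"state": 0, 1 or 2, "fingerprints": set}
    fingerprint_index: fingerprint -> container id"""
    # S2: objects still marked 0 were not checked during backup;
    # mark the containers that hold their blocks.
    for obj in objects.values():
        if obj["mark"] == 0:
            for fp in obj["fingerprints"]:
                cid = fingerprint_index.get(fp)
                if cid is not None:
                    containers[cid]["state"] = REFERENCED
    # S3: clean containers still marked 0, then mark them 2 (reusable).
    for container in containers.values():
        if container["state"] == UNREFERENCED:
            for fp in container["fingerprints"]:
                fingerprint_index.pop(fp, None)
            container["fingerprints"].clear()
            container["state"] = RECLAIMED
    # S4: reset marks for the next cycle.
    for container in containers.values():
        if container["state"] == REFERENCED:
            container["state"] = UNREFERENCED
    for obj in objects.values():
        obj["mark"] = 0
    # S5: collect reclaimed containers for reuse.
    return [cid for cid, c in containers.items() if c["state"] == RECLAIMED]


# "new" was written in this cycle (marked during backup), "old" is an earlier,
# still valid object; the object that referenced "f3" has expired and is gone.
objects = {
    "new": {"mark": 1, "fingerprints": ["f1"]},
    "old": {"mark": 0, "fingerprints": ["f2"]},
}
containers = {
    0: {"state": 1, "fingerprints": {"f1"}},
    1: {"state": 0, "fingerprints": {"f2"}},
    2: {"state": 0, "fingerprints": {"f3"}},
}
fingerprint_index = {"f1": 0, "f2": 1, "f3": 2}
reclaimed = run_cleanup_cycle(objects, containers, fingerprint_index)
```

In this example only container 2 is reclaimed, because container 1 is re-marked in step S2 on behalf of the unexpired object "old".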

A data efficient deleting system based on source-end deduplication comprises a container determining module, a backup set cleaning module and a deleting module;

the container determining module is used for segmenting data blocks of a data stream at a source end in a backup process, calculating fingerprints, comparing the fingerprints, transmitting the corresponding data blocks to a container at a server end for storage if the fingerprints do not exist and indicate that the data blocks are new blocks, marking the corresponding container as 1, writing the container into a data file after the container is full, and then creating a new container, wherein the container comprises a plurality of data blocks, a deduplication library comprises a plurality of data files with fixed sizes, and each data file comprises a plurality of containers;

the backup set cleaning module is used for automatically cleaning the backup set after the backup set is expired, and simultaneously deleting the guid object records;

and the deleting module is used for cleaning the data blocks and fingerprints of containers marked 0, using preset cyclic deletion logic, during idle time outside the normal business window, wherein a mark of 0 indicates that the data blocks in the container and their fingerprints are not referenced and can be cleaned.

Further, the size of the container determined by the container determination module is fixed.

Further, the container determination module comprises a backup set determination module and a container marking module;

the backup set determining module is used for determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and the index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;

the container marking module is used for acquiring the referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.

Furthermore, the cleaning module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module and a circulation module;

the backup module is used for marking, in the backup process and for each referenced data block, the container in which the data block is located as 1, and marking the corresponding object record as 1, which indicates that it has been checked;

the first traversal module is used for traversing the object records and finding the objects marked 0 (a mark of 0 indicates that the object has not been checked), locating, according to the index information recorded in the object files, the containers in the deduplication library that store the corresponding data blocks, and marking the containers corresponding to the fingerprints as 1;

the second traversal module is used for traversing the container records in the deduplication library, cleaning the data blocks and their fingerprints in each container marked 0, and then marking the container state as 2, which indicates that the container has been cleaned and can be reused;

the initialization module is used for resetting the container records marked 1 in the deduplication library to 0, and marking all the object records in the object library as 0;

the collection module is used for collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;

and the circulating module is used for circularly executing the processes of the backup module, the first traversal module, the second traversal module, the initialization module and the collection module in a set period.

The invention achieves the following beneficial effects:

the invention can periodically execute cleaning in the background without affecting normal backup and recovery services, and the cleaned space can be reused, which in effect releases space. The invention adopts a marking approach, so the statistical logic is simpler, the cleaning logic is not affected by the size of the deduplication library, and efficiency is higher.

Drawings

FIG. 1 is a schematic flow diagram for marking containers that are still being referenced;

FIG. 2 is a schematic flow diagram of cleaning a container.

Detailed Description

The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.

The deletion logic used by the invention relies on container marks to distinguish containers that can be cleaned from containers still in use, so the key is how to mark the containers in the deduplication library quickly. Deleting fingerprints is simple: only the containers to be deleted are read, the fingerprints stored in them are parsed, and the corresponding records are deleted from the fingerprint library. If marking were done at each cleaning, the fingerprints in the index files of all objects in the object library would have to be looked up once to determine which containers are still in use, which takes a long time. The invention moves this stage into the backup stage: fingerprints are already compared during backup, so only the hit containers need to be marked at that moment, and the backup logic is hardly affected. Because the validity of a backup set is limited in time, only a small number of objects need to be searched at cleaning time to determine which containers can be cleaned.
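The fingerprint deletion described above (read the container, parse its stored fingerprints, delete the matching records) might look as follows. The on-disk layout used here is purely an assumption for illustration, since the text does not specify one: a 4-byte block count, then for each block a 32-byte fingerprint, a 4-byte length and the block data. The helper names are hypothetical.

```python
import hashlib
import struct

FP_LEN = 32  # assumed fingerprint length in bytes (a SHA-256 digest)


def pack_container(blocks):
    """Serialize (fingerprint, data) pairs using the assumed layout."""
    parts = [struct.pack("<I", len(blocks))]
    for fp, data in blocks:
        parts.append(fp + struct.pack("<I", len(data)) + data)
    return b"".join(parts)


def fingerprints_in(container_bytes):
    """Parse the fingerprints stored in a serialized container."""
    (count,) = struct.unpack_from("<I", container_bytes, 0)
    pos, found = 4, []
    for _ in range(count):
        found.append(container_bytes[pos:pos + FP_LEN])
        (length,) = struct.unpack_from("<I", container_bytes, pos + FP_LEN)
        pos += FP_LEN + 4 + length  # skip fingerprint, length field and data
    return found


def purge_fingerprints(container_bytes, fingerprint_library):
    """Delete the records of a container's fingerprints from the library."""
    for fp in fingerprints_in(container_bytes):
        fingerprint_library.pop(fp, None)


fp1 = hashlib.sha256(b"hello").digest()
fp2 = hashlib.sha256(b"world!").digest()
fp3 = hashlib.sha256(b"kept").digest()
raw = pack_container([(fp1, b"hello"), (fp2, b"world!")])
library = {fp1: 7, fp2: 7, fp3: 8}  # fingerprint -> container id
purge_fingerprints(raw, library)
```

Only the container being deleted is read; the rest of the fingerprint library is never scanned.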

The container of the present invention has a small granularity. A container is a logical concept used to manage a batch of data blocks; containers are of a fixed size (every container is the same size), and a data file may contain multiple containers. Because the container size is fixed, a container that has been cleaned can be reused to store new data. In addition, container-level cleaning can be performed during the idle time of normal services, so normal services are hardly affected.
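One consequence of fixed-size containers is that a container's position in a data file can be computed directly and a cleaned container can be overwritten in place. The sketch below illustrates this with deliberately tiny sizes; the function names, the sequential container numbering and the zero padding are assumptions, not details from the text.

```python
import io


def locate(container_id, containers_per_file, container_size):
    """Map a container id to (data file number, byte offset in that file)."""
    file_no, slot = divmod(container_id, containers_per_file)
    return file_no, slot * container_size


def overwrite_container(data_file, offset, container_size, payload):
    """Reuse a cleaned container by writing new data over its fixed-size slot."""
    if len(payload) > container_size:
        raise ValueError("payload does not fit in one container")
    data_file.seek(offset)
    data_file.write(payload.ljust(container_size, b"\0"))


# One data file holding three 8-byte containers (illustrative sizes only).
data_file = io.BytesIO(b"A" * 8 + b"B" * 8 + b"C" * 8)
file_no, offset = locate(1, containers_per_file=3, container_size=8)
overwrite_container(data_file, offset, 8, b"new")
contents = data_file.getvalue()
```

Because the slot size never changes, reuse does not move neighbouring containers or change the size of the data file.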

The deduplication library contains multiple fixed-size data files, each data file contains multiple fixed-size containers, and each container contains multiple data blocks. In the backup process, the data stream at the source end is split into data blocks, and fingerprints are calculated and compared. If a fingerprint does not exist, the data block is new: it is transmitted to a container at the server end for storage and that container is marked 1. After the container is full it is written to the data file, and a new container is created.
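The backup path just described can be sketched in a few lines. Several choices here are assumptions rather than details from the text: fixed-length chunking, SHA-256 as the fingerprint function, a container that is "full" after a fixed number of blocks, and the names `DedupLibrary` and `backup`. Source-to-server transmission is omitted.

```python
import hashlib


class DedupLibrary:
    """Toy in-memory deduplication library (hypothetical structure)."""

    def __init__(self, blocks_per_container):
        self.blocks_per_container = blocks_per_container
        self.fingerprints = {}  # fingerprint -> id of the container holding the block
        self.marks = {}         # container id -> mark (1 = referenced this cycle)
        self.data_file = []     # full containers, in the order they were written
        self.open_id = 0        # id of the container currently being filled
        self.open_blocks = []

    def store(self, block):
        fp = hashlib.sha256(block).hexdigest()
        cid = self.fingerprints.get(fp)
        if cid is None:
            # New block: place it in the open container.
            cid = self.open_id
            self.fingerprints[fp] = cid
            self.open_blocks.append(block)
            if len(self.open_blocks) == self.blocks_per_container:
                # Container is full: write it out and create a new one.
                self.data_file.append(self.open_blocks)
                self.open_blocks = []
                self.open_id += 1
        # New or hit, the container holding the block is marked 1.
        self.marks[cid] = 1
        return fp


def backup(library, stream, block_size):
    """Split the stream into blocks and return the object's index data."""
    return [library.store(stream[i:i + block_size])
            for i in range(0, len(stream), block_size)]


library = DedupLibrary(blocks_per_container=2)
index = backup(library, b"aaaabbbbaaaacccc", block_size=4)
```

The third block repeats the first, so it is not stored again; it only re-marks the container that already holds it.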

The container is taken as the unit because the deduplication library groups a batch of fingerprints per container in order to exploit data locality. The locality principle states that if a block is used, adjacent blocks will be used with high probability. Using the container as the update unit of the cache therefore effectively improves the cache hit rate. Deletion can use the same principle: if a block needs to be cleaned, its adjacent data blocks will need cleaning with high probability.
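The idea of using the container as the cache update unit can be illustrated with a small fingerprint cache that admits and evicts whole containers. The least-recently-used eviction policy, the class name `ContainerFingerprintCache` and the `load_container` callback are assumptions for this sketch; the text does not describe the cache design.

```python
from collections import OrderedDict


class ContainerFingerprintCache:
    """Caches fingerprints one whole container at a time (LRU eviction assumed)."""

    def __init__(self, load_container, capacity):
        self.load_container = load_container  # container id -> iterable of fingerprints
        self.capacity = capacity              # maximum number of cached containers
        self.cached = OrderedDict()           # container id -> set of fingerprints

    def contains(self, fingerprint):
        for cid, fps in list(self.cached.items()):
            if fingerprint in fps:
                self.cached.move_to_end(cid)  # mark container as recently used
                return True
        return False

    def admit(self, cid):
        # After a miss is resolved to a container, load all of its fingerprints,
        # since neighbouring blocks are likely to be looked up next.
        self.cached[cid] = set(self.load_container(cid))
        self.cached.move_to_end(cid)
        while len(self.cached) > self.capacity:
            self.cached.popitem(last=False)   # evict least recently used container


on_disk = {0: ["a", "b", "c"], 1: ["x", "y"]}
cache = ContainerFingerprintCache(on_disk.__getitem__, capacity=1)
miss = cache.contains("a")       # nothing cached yet
cache.admit(0)                   # load container 0 as a unit
hit_b = cache.contains("b")      # neighbours of "a" now hit
hit_c = cache.contains("c")
cache.admit(1)                   # capacity 1: container 0 is evicted
after_evict = cache.contains("a")
```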

The backup set is stored in the background as objects and mainly comprises two parts: an object library, whose object files store the object records and the index data of the objects, and a deduplication library, whose data files store the information of each data block contained in the objects. Reading a backup set means accessing the object library and then, according to the index information recorded in the object file, going to the deduplication library to find the corresponding data blocks.

As shown in fig. 1, a referenced object file is obtained, the index data of the object in the object file is read according to its guid (object unique identifier), the corresponding container is found according to each fingerprint in the index data, and the corresponding container record is marked 1.

The method specifically comprises the following steps:

First, in the backup process, for each referenced data block, the container in which the block is located is marked 1, indicating that the container holds a data block in use. The corresponding object record is also marked 1, indicating that it has been checked.

Second, the object records are traversed to find the objects marked 0; according to the index information recorded in the object files, the containers in the deduplication library that store the corresponding data blocks are located, and the containers corresponding to the fingerprints are marked.

Third, the container records in the deduplication library are traversed; the fingerprints in a container marked 0 are not referenced and can be cleaned, after which the container state is marked 2, indicating that the container has been cleaned and can be reused.

Fourth, the mark bits are initialized: container records marked 1 in the deduplication library are set to 0, and all object records in the object library are set to 0.

Fifth, all containers marked 2 in the deduplication library are collected, and the collected containers are preferentially selected for reuse when new data needs to be stored.
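The preference for reclaimed containers in the fifth step can be sketched as a small allocator. The class name `ContainerAllocator`, the first-in-first-out reuse order and the fallback of creating a container with the next unused id are assumptions made for illustration.

```python
from collections import deque


class ContainerAllocator:
    """Hands out containers for new data, preferring reclaimed ones."""

    def __init__(self, next_new_id):
        self.reclaimed = deque()        # ids of containers in state 2
        self.next_new_id = next_new_id  # id to use when nothing can be reused

    def collect(self, container_states):
        # Gather every container marked 2 (cleaned and reusable).
        for cid, state in sorted(container_states.items()):
            if state == 2 and cid not in self.reclaimed:
                self.reclaimed.append(cid)

    def allocate(self, container_states):
        if self.reclaimed:
            cid = self.reclaimed.popleft()  # reuse a cleaned container first
        else:
            cid = self.next_new_id          # otherwise create a new container
            self.next_new_id += 1
        container_states[cid] = 1           # it is about to hold referenced data
        return cid


states = {0: 0, 1: 2, 2: 1}  # container id -> mark
allocator = ContainerAllocator(next_new_id=3)
allocator.collect(states)
first = allocator.allocate(states)   # reuses reclaimed container 1
second = allocator.allocate(states)  # nothing left to reuse: new container 3
```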

As shown in fig. 2, the steps of clearing by using the idle time outside the normal service window period are as follows:

Collection of the ids of the containers to be cleaned begins, and whether a container needs cleaning is judged from its mark. If cleaning is needed, the method checks whether the service is idle; if not, it waits 1 s and checks again. Once the service is idle, the write cache is cleaned and its secondary cache is closed. The method then checks again whether the service is idle, waiting 1 s between checks until it is. When the service is idle, one container id is taken out and the state of that container is checked again; if the container still needs cleaning, it is cleaned and its mark is cleared. The method then checks once more whether the service is idle, waiting 1 s and re-checking if it is not; when it is idle, the secondary cache of the write cache is reopened, the bloom filter is initialized, and the process ends.
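The idle-gated part of this flow can be sketched as follows. This is a simplified sketch under stated assumptions: the write cache, secondary cache and bloom filter steps are omitted, `is_idle` and `clean` are hypothetical callbacks supplied by the caller, and a cleaned container is set to state 2 as in the cyclic deletion steps.

```python
import time


def wait_until_idle(is_idle, sleep=time.sleep, poll_seconds=1.0):
    """Poll the service, waiting 1 s between checks, until it is idle."""
    while not is_idle():
        sleep(poll_seconds)


def clean_when_idle(container_states, clean, is_idle, sleep=time.sleep):
    # Collect the ids of containers whose mark says they need cleaning.
    pending = [cid for cid, state in container_states.items() if state == 0]
    cleaned = []
    for cid in pending:
        wait_until_idle(is_idle, sleep)
        # Re-check the state: the container may have been referenced meanwhile.
        if container_states.get(cid) != 0:
            continue
        clean(cid)
        container_states[cid] = 2  # cleaned and reusable
        cleaned.append(cid)
    return cleaned


# Simulated run: the service is busy on the first check and idle afterwards.
answers = iter([False, True, True])
sleeps = []        # records waits instead of actually sleeping
cleaned_log = []
states = {1: 0, 2: 1, 3: 0}
result = clean_when_idle(states, cleaned_log.append,
                         is_idle=lambda: next(answers), sleep=sleeps.append)
```

Passing `sleep` as a parameter keeps the example testable without real delays; a real deployment would use the default `time.sleep`.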

The above process is executed cyclically in the background at the set period. Backup sets in a data protection product have a life cycle: they are cleaned automatically when the life cycle expires and can also be cleaned manually. As objects in the object library are replaced, the cleaning logic above cleans the containers that have expired and are no longer used, and the reclaimed containers can store new data, so that space is recycled and storage usage is in effect reduced.

Correspondingly, the invention also provides a data high-efficiency deleting system based on source-end deduplication, which comprises a container determining module, a backup set cleaning module and a deleting module;

The size of the container determined by the container determination module is fixed.

The container determining module comprises a backup set determining module and a container marking module;

The cleaning module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module and a circulation module;

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims

1. An efficient data deletion method based on source-end deduplication, characterized in that,

in the backup process, data blocks are segmented from a data stream of a source end, fingerprints are calculated, the fingerprints are compared, if the fingerprints do not exist, the data blocks are transmitted to a container of a server end to be stored, the corresponding container is marked as 1, the container is written into a data file after the container is full, a new container is created, the container comprises a plurality of data blocks, a plurality of data files with fixed sizes are contained in a deduplication library, and each data file comprises a plurality of containers;

2. The source-side deduplication based data efficient deletion method of claim 1, wherein the container size is fixed.

3. The source-side deduplication based data efficient deletion method as claimed in claim 1, wherein the marking of each container is performed by:

4. The source-side deduplication based data efficient deletion method of claim 3, wherein the loop deletion logic is:

s6, executing the above steps S1-S5 in a set cycle.

5. A data efficient deleting system based on source-end deduplication is characterized by comprising a container determining module, a backup set cleaning module and a deleting module;

6. The source-side deduplication based data efficient deletion system of claim 5, wherein the size of the container determined by the container determination module is fixed.

7. The source-side deduplication based data efficient deletion system of claim 5, wherein the container determination module comprises a backup set determination module and a container marking module;

8. The source-side deduplication based data-efficient deletion system of claim 7, wherein the cleaning module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module, and a loop module;