CN111143343A - Data efficient deleting method and system based on source-end deduplication - Google Patents
Data efficient deleting method and system based on source-end deduplication
- Publication number
- CN111143343A (application number CN201911374951.4A)
- Authority
- CN
- China
- Prior art keywords
- container
- data
- module
- deduplication
- library
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
- G06F11/1446—Point-in-time backing up or restoration of persistent data
- G06F11/1448—Management of the data involved in backup or backup restore
- G06F11/1453—Management of the data involved in backup or backup restore using de-duplication of the data
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an efficient data deletion method based on source-end deduplication. During backup, the data stream at the source end is segmented into data blocks, fingerprints are calculated and compared, and if a fingerprint does not exist, indicating that the data block is new, the corresponding data block is transmitted to a container at the server end for storage and that container is marked as 1; once the container is full it is written into a data file and a new container is created. When a backup set expires it is cleaned up automatically and its guid object records are cleared. At idle times outside the normal business window period, preset cyclic deletion logic cleans the data blocks and fingerprints of containers marked as 0, where a mark of 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned. The advantages are that the invention adopts a marking approach, so the statistical logic is simpler, the cleaning logic is not affected by the size of the deduplication library, and efficiency is higher.
Description
Technical Field
The invention relates to a method and system for efficiently deleting data based on source-end deduplication, and belongs to the technical field of data protection.
Background
Source-end deduplication reduces transmission bandwidth and storage space, so it is widely used in data protection products. For convenience of explanation, it is agreed that data after source-end deduplication is stored in a deduplication library, which comprises a deduplication fingerprint library and a deduplication data library. The fingerprint library stores the index information of the data blocks, and the data library stores the data blocks themselves. Data that has gone through source-end deduplication has the following characteristic: each data block stored in the deduplication library is unique within the whole library, and most data blocks are shared by multiple data sources. It is precisely this characteristic that reduces the storage space. While it helps reduce storage space, it makes deletion considerably more complex, and data in the deduplication library cannot be cleaned up as conveniently as ordinary data.
The first prior-art method records the reference count of each data block: the count is incremented for repeated data blocks during backup and decremented for the contained data blocks during deletion, and a count of 0 indicates that the data block can be cleaned up and its storage space released. The drawback of this method is that backup and deletion performance are greatly affected as the deduplication library grows. The other method is centralized cleaning, which is executed at a specific point in time and marks all data files in use. Because data files are coarse-grained relative to data blocks, the statistics are fast. The data files and fingerprints that are not in use are then deleted to release space.
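For comparison, the reference-counting approach described above can be sketched as follows. This is a minimal illustration only, not taken from any product; the names `RefCountStore`, `backup` and `delete` are hypothetical. It shows why every backup and every deletion must touch a per-block counter.

```python
from collections import Counter


class RefCountStore:
    """Toy per-block reference counting (the first prior-art method)."""

    def __init__(self):
        self.refs = Counter()  # fingerprint -> number of references
        self.freed = []        # fingerprints whose space was released

    def backup(self, fingerprints):
        # Every block of a backup, new or duplicate, adds one reference.
        for fp in fingerprints:
            self.refs[fp] += 1

    def delete(self, fingerprints):
        # Deleting a backup updates the counter of every block it contains.
        for fp in fingerprints:
            self.refs[fp] -= 1
            if self.refs[fp] == 0:
                del self.refs[fp]
                self.freed.append(fp)  # count reached 0: space can be released


store = RefCountStore()
store.backup(["a", "b"])   # first backup
store.backup(["b", "c"])   # second backup shares block "b"
store.delete(["a", "b"])   # first backup is deleted
```

After the deletion only block "a" is released, because "b" is still referenced by the second backup.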
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of existing source-end deduplication technology, in which the uniqueness of deduplicated data makes deletion logic complex and inefficient and prevents space from being released quickly, by providing a method and system for efficiently deleting data based on source-end deduplication.
To solve the above technical problem, the invention provides an efficient data deletion method based on source-end deduplication. During backup, the data stream at the source end is segmented into data blocks, fingerprints are calculated and compared, and if a fingerprint does not exist, the corresponding data block is transmitted to a container at the server end for storage and the corresponding container is marked as 1; after the container is full it is written into a data file and a new container is created, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of data files of fixed size, and each data file comprises a plurality of containers;
when a backup set expires, it is cleaned up automatically and its guid object records are cleared;
and at idle times outside the normal business window period, the data blocks and fingerprints of containers marked as 0 are cleaned using preset cyclic deletion logic, wherein a container marked as 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned.
Further, the container is fixed in size.
Further, the process of marking each container is as follows:
determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;
and acquiring a referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.
Further, the cyclic deletion logic is:
S1, during backup, for each referenced data block, marking the container in which the data block is located as 1, and marking the corresponding object record as 1, indicating that the object has been checked;
S2, traversing the object records, finding the objects marked as 0, that is, objects that have not been checked, locating in the deduplication library the containers that store their data blocks according to the index information recorded in the object files, and marking the containers corresponding to the fingerprints as 1;
S3, traversing the container records in the deduplication library, cleaning the data blocks and fingerprints in the containers marked as 0, and then marking the state of each such container as 2, indicating that the container has been cleaned and can be reused;
S4, resetting to 0 the container records in the deduplication library that are marked as 1, and marking all object records in the object library as 0;
S5, collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;
S6, executing steps S1 to S5 cyclically at a set period.
A data efficient deleting system based on source-end deduplication comprises a container determining module, a backup set cleaning module and a deleting module;
the container determining module is used for segmenting the data stream at the source end into data blocks during backup, calculating and comparing fingerprints, and, if a fingerprint does not exist, indicating that the data block is a new block, transmitting the corresponding data block to a container at the server end for storage, marking the corresponding container as 1, writing the container into a data file after the container is full, and then creating a new container, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of data files of fixed size, and each data file comprises a plurality of containers;
the backup set cleaning module is used for automatically cleaning up a backup set after it expires and at the same time deleting its guid object records;
and the deleting module is used for cleaning the data blocks and fingerprints of containers marked as 0 using preset cyclic deletion logic at idle times outside the normal business window period, wherein a container marked as 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned.
Further, the size of the container determined by the container determination module is fixed.
Further, the container determination module comprises a backup set determination module and a container marking module;
the backup set determining module is used for determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;
the container marking module is used for acquiring a referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.
Furthermore, the deleting module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module and a circulation module;
the backup module is used for, during backup, marking the container in which each referenced data block is located as 1 and marking the corresponding object record as 1, indicating that the object has been checked;
the first traversal module is used for traversing the object records, finding the objects marked as 0, that is, objects that have not been checked, locating in the deduplication library the containers that store their data blocks according to the index information recorded in the object files, and marking the containers corresponding to the fingerprints as 1;
the second traversal module is used for traversing the container records in the deduplication library, cleaning the data blocks and fingerprints in the containers marked as 0, and then marking the state of each such container as 2, indicating that the container has been cleaned and can be reused;
the initialization module is used for resetting to 0 the container records in the deduplication library that are marked as 1, and marking all object records in the object library as 0;
the collection module is used for collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;
and the circulation module is used for cyclically executing the processes of the backup module, the first traversal module, the second traversal module, the initialization module and the collection module at a set period.
The invention achieves the following beneficial effects:
The invention can perform cleaning periodically in the background without affecting normal backup and recovery services, and the cleaned space can be reused, which in effect releases space. The invention adopts a marking approach, so the statistical logic is simpler, the cleaning logic is not affected by the size of the deduplication library, and efficiency is higher.
Drawings
FIG. 1 is a schematic flow diagram for marking containers that are still being referenced;
FIG. 2 is a schematic flow diagram of cleaning a container.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The deletion logic used by the invention relies on container marks to distinguish containers that can be cleaned from containers that are still in use, so the key is how to mark the containers in the deduplication library quickly. Deleting fingerprints is simple: the containers to be deleted are read, the fingerprints stored in them are parsed, and the corresponding records are deleted from the fingerprint library. If marking were done at each cleaning, the fingerprints in the index files of all objects in the object library would have to be looked up once to determine which containers are still in use, which takes a long time. The invention moves this stage into the backup stage: fingerprints are already compared during backup, so only the hit containers need to be marked at that point, and the backup logic is hardly affected. Because the validity of a backup set is limited in time, only a small number of objects need to be searched during cleaning to determine which containers can be cleaned.
The container of the invention has a small granularity. A container is a logical concept used to manage a batch of data blocks; containers are of a fixed size (every container is the same size), and a data file may contain multiple containers. Because containers are of a fixed size, a cleaned container can be reused to store new data. In addition, container-level cleaning can be performed during the idle time of normal services and hardly affects them.
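One consequence of fixed-size containers is that a container can be located by simple arithmetic and its slot handed out again after cleaning. The sketch below illustrates this under assumptions: the patent does not specify an addressing scheme, and `CONTAINER_SIZE`, `CONTAINERS_PER_FILE`, `locate` and `allocate` are hypothetical names and values.

```python
CONTAINER_SIZE = 4 * 1024 * 1024   # bytes per container; hypothetical value
CONTAINERS_PER_FILE = 256          # containers per data file; hypothetical value


def locate(container_id):
    """Return (data file number, byte offset) of a container.

    Works only because every container has the same size.
    """
    file_no, slot = divmod(container_id, CONTAINERS_PER_FILE)
    return file_no, slot * CONTAINER_SIZE


def allocate(reclaimed, next_unused_id):
    """Pick a container for new data, preferring a cleaned (reclaimed) one.

    Returns (container id to use, updated next unused id).
    """
    if reclaimed:
        return reclaimed.pop(), next_unused_id
    return next_unused_id, next_unused_id + 1
```

With these assumed values, container 257 is the second slot of data file 1, and a cleaned container is chosen before the library is extended.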
The deduplication library contains a plurality of data files of fixed size, each data file contains a plurality of containers of fixed size, and each container contains a plurality of data blocks. During backup, the data stream at the source end is segmented into data blocks, fingerprints are calculated and compared, and if a fingerprint does not exist, indicating that the data block is a new block, the corresponding data block is transmitted to a container at the server end for storage and the corresponding container is marked as 1; after the container is full it is written into a data file, and a new container is then created.
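The backup path described above can be sketched as follows. This is a minimal in-memory illustration under assumptions: fixed-size chunking, SHA-256 fingerprints, the tiny `BLOCK_SIZE` and `CONTAINER_CAPACITY` values, and the names `DedupLibrary` and `backup` are all hypothetical and chosen only to keep the example short. A container hit by a duplicate fingerprint is also marked as 1, following the description of marking hit containers during backup.

```python
import hashlib

BLOCK_SIZE = 4          # bytes per block; fixed-size chunking is an assumption
CONTAINER_CAPACITY = 2  # blocks per container; hypothetical value


class DedupLibrary:
    def __init__(self):
        self.fingerprints = {}  # fingerprint -> id of the container holding the block
        self.containers = {}    # container id -> list of (fingerprint, block)
        self.marks = {}         # container id -> 0 (not referenced) or 1 (referenced)
        self.data_file = []     # ids of full containers written to the data file
        self.open_id = self._new_container()

    def _new_container(self):
        cid = len(self.containers)
        self.containers[cid] = []
        self.marks[cid] = 0
        return cid

    def store(self, fp, block):
        # New block: store it in the open container and mark that container as 1.
        self.containers[self.open_id].append((fp, block))
        self.fingerprints[fp] = self.open_id
        self.marks[self.open_id] = 1
        if len(self.containers[self.open_id]) == CONTAINER_CAPACITY:
            self.data_file.append(self.open_id)     # container is full: write it out
            self.open_id = self._new_container()    # then create a new container


def backup(stream, library):
    """Return (index data of the object, number of blocks actually transmitted)."""
    index, sent = [], 0
    for i in range(0, len(stream), BLOCK_SIZE):
        block = stream[i:i + BLOCK_SIZE]
        fp = hashlib.sha256(block).hexdigest()
        if fp in library.fingerprints:
            # Duplicate: nothing is transmitted, only the hit container is marked.
            library.marks[library.fingerprints[fp]] = 1
        else:
            library.store(fp, block)
            sent += 1
        index.append(fp)
    return index, sent


lib = DedupLibrary()
index, sent = backup(b"AAAABBBBAAAACCCC", lib)
```

In this run the stream has four blocks but only three are transmitted, since the second "AAAA" block is a duplicate.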
Containers are used as the unit, with the deduplication library holding fingerprints in container-sized batches, in order to exploit the locality of data. The principle of locality states that if a block is used, its adjacent blocks will very probably be used as well. Using the container as the update unit of the cache therefore effectively improves the cache hit rate. The deletion function can use the same principle: if a block needs to be cleaned, its adjacent data blocks will very probably need cleaning too.
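The idea of using the container as the cache update unit can be illustrated with a small sketch. The patent does not describe the cache design, so the LRU eviction policy and the names `ContainerCache`, `lookup` and `load_container` are assumptions made for illustration only.

```python
from collections import OrderedDict


class ContainerCache:
    """Fingerprint cache that is loaded and evicted one container at a time."""

    def __init__(self, load_container, capacity=2):
        self.load_container = load_container  # container id -> iterable of fingerprints
        self.capacity = capacity              # number of containers kept in memory
        self.cached = OrderedDict()           # container id -> set of fingerprints
        self.hits = 0

    def lookup(self, fp, fingerprint_library):
        """Return True if the fingerprint is known, loading its container on a miss."""
        hit = next((cid for cid, fps in self.cached.items() if fp in fps), None)
        if hit is not None:
            self.cached.move_to_end(hit)      # keep recently used containers
            self.hits += 1
            return True
        cid = fingerprint_library.get(fp)     # slower lookup in the full library
        if cid is None:
            return False
        # Load the whole container: neighbouring fingerprints are likely needed next.
        self.cached[cid] = set(self.load_container(cid))
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)   # evict the least recently used container
        return True


containers = {0: ["a", "b", "c"], 1: ["d"]}
fingerprint_library = {fp: cid for cid, fps in containers.items() for fp in fps}
cache = ContainerCache(load_container=containers.__getitem__)
results = [cache.lookup(fp, fingerprint_library) for fp in ["a", "b", "c", "z"]]
```

Looking up "a" loads container 0, so the following lookups of "b" and "c" are served from the cache.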
The backup set is stored in the background in the form of objects and mainly comprises two parts: the object library, whose object files store the object records and index data of the objects, and the deduplication library, whose data files store the information of each data block contained in the objects. Reading a backup set means accessing the object library and then, according to the index information recorded in the object file, looking up the corresponding data blocks in the deduplication library.
As shown in FIG. 1, a referenced object file is obtained, the index data of the object in the object file is read according to its guid (object unique identifier), the corresponding container is found according to each fingerprint in the index data, and the corresponding container record is marked as 1.
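The marking flow of FIG. 1 can be sketched as below. The dictionary layouts and the name `mark_referenced` are hypothetical; the object library is modelled as a mapping from guid to the fingerprints in the object's index data, and the fingerprint library as a mapping from fingerprint to container id.

```python
def mark_referenced(object_library, fingerprint_library, container_marks):
    """Mark as 1 every container that holds a block of a referenced object."""
    for guid, index_data in object_library.items():
        for fp in index_data:                  # fingerprints read from the index data
            cid = fingerprint_library.get(fp)  # container that stores this block
            if cid is not None:
                container_marks[cid] = 1
    return container_marks


object_library = {"guid-1": ["fp-a", "fp-b"], "guid-2": ["fp-b"]}
fingerprint_library = {"fp-a": 0, "fp-b": 1, "fp-c": 2}
marks = mark_referenced(object_library, fingerprint_library, {0: 0, 1: 0, 2: 0})
```

Container 2 stays at 0 because no remaining object references "fp-c".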
The method specifically comprises the following steps:
First, during backup, for each referenced data block, the container in which the data block is located is marked as 1, indicating that the container holds a data block in use. The corresponding object record is also marked as 1, indicating that it has been checked.
Second, the object records are traversed to find the objects marked as 0; according to the index information recorded in the object files, the containers in the deduplication library that store the corresponding data blocks are located, and the containers corresponding to the fingerprints are marked.
Third, the container records in the deduplication library are traversed. The fingerprints in a container marked as 0 are not referenced and can be cleaned; after cleaning, the container state is marked as 2, indicating that the container has been cleaned and can be reused.
Fourth, the mark bits are initialized: container records in the deduplication library that are marked as 1 are set to 0, and all object records in the object library are set to 0.
Fifth, all containers marked as 2 in the deduplication library are collected, and the collected containers are preferentially selected for reuse when new data needs to be stored.
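Steps two to five above can be sketched as one cleaning cycle over in-memory records. This is an illustrative sketch only; the record layouts and the name `cleaning_cycle` are hypothetical. Step one is assumed to have happened during backup, and the object records of expired backup sets are assumed to have been deleted already.

```python
def cleaning_cycle(objects, fingerprints, containers):
    """Run steps two to five once and return the ids of reusable containers."""
    # Step two: objects still marked 0 were not checked during backup,
    # so the containers they reference are marked now.
    for obj in objects.values():
        if obj["mark"] == 0:
            for fp in obj["index"]:
                containers[fingerprints[fp]]["mark"] = 1
    # Step three: containers still marked 0 are unreferenced; clean their
    # blocks and fingerprints, then set the state to 2 (cleaned, reusable).
    for c in containers.values():
        if c["mark"] == 0:
            for fp in c["fps"]:
                fingerprints.pop(fp, None)
            c["fps"] = []
            c["mark"] = 2
    # Step four: reset the mark bits for the next cycle.
    for c in containers.values():
        if c["mark"] == 1:
            c["mark"] = 0
    for obj in objects.values():
        obj["mark"] = 0
    # Step five: collect cleaned containers so new data can reuse them first.
    return [cid for cid, c in containers.items() if c["mark"] == 2]


# "g1" was checked during backup; "g2" was not; the object that used "c" has expired.
objects = {"g1": {"mark": 1, "index": ["a"]}, "g2": {"mark": 0, "index": ["b"]}}
fingerprints = {"a": 0, "b": 1, "c": 2}
containers = {
    0: {"mark": 1, "fps": ["a"]},
    1: {"mark": 0, "fps": ["b"]},
    2: {"mark": 0, "fps": ["c"]},
}
reusable = cleaning_cycle(objects, fingerprints, containers)
```

Only container 2 is cleaned, because container 1 is rescued in step two by the unchecked object "g2".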
As shown in FIG. 2, the steps of cleaning during idle time outside the normal business window period are as follows:
Collection of the ids of the containers to be cleaned begins, and whether any container needs cleaning is judged from the container marks. If so, whether the service is idle is judged; if it is not idle, the judgment is repeated after waiting 1 s. Once the service is idle, the write cache is cleared and its second-level cache is closed. Whether the service is idle is then judged again, waiting 1 s and judging again if it is not. When the service is idle, one container id is taken out and the state of that container is checked again; if it still needs cleaning, the cleaning is executed and the mark is cleared. Whether the service is idle is judged once more, again waiting 1 s and judging again if it is not; when the service is idle, the second-level cache of the write cache is reopened, the bloom filter is initialized, and the process ends.
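The core of the FIG. 2 flow, waiting for idle time and re-checking each container before cleaning it, can be sketched as follows. This is a simplified sketch: the write cache, second-level cache and bloom filter steps are omitted, and the names `clean_when_idle`, `is_idle`, `needs_cleaning` and `clean` are hypothetical callbacks supplied by the caller.

```python
import time


def clean_when_idle(candidates, needs_cleaning, clean, is_idle,
                    wait=1.0, sleep=time.sleep):
    """Clean candidate containers one at a time, only while the service is idle."""

    def wait_until_idle():
        while not is_idle():
            sleep(wait)           # service busy: wait 1 s, then judge again

    cleaned = []
    for cid in candidates:
        wait_until_idle()
        if needs_cleaning(cid):   # check the state again just before cleaning
            clean(cid)
            cleaned.append(cid)
    return cleaned


# Usage with stand-ins: the service is busy once, then idle.
busy = iter([False, True, True])
naps = []                      # records each wait instead of really sleeping
marks = {7: 0, 8: 1, 9: 0}     # container 8 became referenced after collection
cleaned = clean_when_idle(
    candidates=[7, 8, 9],
    needs_cleaning=lambda cid: marks[cid] == 0,
    clean=lambda cid: marks.__setitem__(cid, 2),
    is_idle=lambda: next(busy, True),
    sleep=naps.append,
)
```

Re-checking the state immediately before cleaning avoids removing a container that was marked as referenced after the candidate list was collected.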
The above process is executed cyclically in the background at the set period. Because backup sets in a data protection product have a life cycle, they are cleaned up automatically when the life cycle expires and can also be cleaned up manually. As objects in the object library are replaced, the above cleaning logic cleans the containers that have expired and are no longer used; after being reclaimed they can store new data, which achieves cyclic use of space and in effect reduces the storage space.
Correspondingly, the invention also provides a data efficient deleting system based on source-end deduplication, which comprises a container determining module, a backup set cleaning module and a deleting module;
the container determining module is used for segmenting the data stream at the source end into data blocks during backup, calculating and comparing fingerprints, and, if a fingerprint does not exist, indicating that the data block is a new block, transmitting the corresponding data block to a container at the server end for storage, marking the corresponding container as 1, writing the container into a data file after the container is full, and then creating a new container, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of data files of fixed size, and each data file comprises a plurality of containers;
the backup set cleaning module is used for automatically cleaning up a backup set after it expires and at the same time deleting its guid object records;
and the deleting module is used for cleaning the data blocks and fingerprints of containers marked as 0 using preset cyclic deletion logic at idle times outside the normal business window period, wherein a container marked as 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned.
The size of the container determined by the container determination module is fixed.
The container determining module comprises a backup set determining module and a container marking module;
the backup set determining module is used for determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;
the container marking module is used for acquiring a referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.
The deleting module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module and a circulation module;
the backup module is used for, during backup, marking the container in which each referenced data block is located as 1 and marking the corresponding object record as 1, indicating that the object has been checked;
the first traversal module is used for traversing the object records, finding the objects marked as 0, that is, objects that have not been checked, locating in the deduplication library the containers that store their data blocks according to the index information recorded in the object files, and marking the containers corresponding to the fingerprints as 1;
the second traversal module is used for traversing the container records in the deduplication library, cleaning the data blocks and fingerprints in the containers marked as 0, and then marking the state of each such container as 2, indicating that the container has been cleaned and can be reused;
the initialization module is used for resetting to 0 the container records in the deduplication library that are marked as 1, and marking all object records in the object library as 0;
the collection module is used for collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;
and the circulation module is used for cyclically executing the processes of the backup module, the first traversal module, the second traversal module, the initialization module and the collection module at a set period.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.
Claims (8)
1. An efficient data deletion method based on source-end deduplication, characterized in that,
during backup, the data stream at the source end is segmented into data blocks, fingerprints are calculated and compared, and if a fingerprint does not exist, the corresponding data block is transmitted to a container at the server end for storage and the corresponding container is marked as 1; after the container is full it is written into a data file and a new container is created, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of data files of fixed size, and each data file comprises a plurality of containers;
when a backup set expires, it is cleaned up automatically and its guid object records are cleared;
and at idle times outside the normal business window period, the data blocks and fingerprints of containers marked as 0 are cleaned using preset cyclic deletion logic, wherein a container marked as 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned.
2. The source-side deduplication based data efficient deletion method of claim 1, wherein the container size is fixed.
3. The source-side deduplication based data efficient deletion method as claimed in claim 1, wherein the marking of each container is performed by:
determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, each object file stores the object record and index data of an object, the deduplication library stores data files, and the data files store the information of each data block contained in the objects;
and acquiring a referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to each fingerprint in the index data, and marking the corresponding container record as 1.
4. The source-side deduplication based data efficient deletion method of claim 3, wherein the cyclic deletion logic is:
S1, in the backup process, for each referenced data block, marking the container in which the data block is located as 1 and marking the corresponding object record as 1, which indicates that the referenced data block has been checked;
S2, traversing the object records to find the objects marked as 0, wherein an object record marked as 0 indicates that the object has not been checked, finding the positions in the deduplication library of the containers that store the corresponding data blocks according to the index information recorded in the object files, and marking the containers corresponding to the fingerprints as 1;
S3, traversing the container records in the deduplication library, cleaning the data blocks and their fingerprints in each container marked as 0, and then marking the container state as 2, which indicates that the container has been cleaned and can be reused;
S4, resetting to 0 the container records marked as 1 in the deduplication library, and marking all the object records in the object library as 0;
S5, collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;
S6, executing steps S1-S5 cyclically at a set period.
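One pass of steps S2 to S5 can be sketched as a container-level mark-and-sweep. This is one reading of the translated claim, not a definitive implementation: the function name `gc_cycle` and the dictionary layout are hypothetical, S1 is assumed to have happened during backup, and expired objects are assumed to have been removed from the object library already.

```python
def gc_cycle(objects, object_marks, containers, container_marks,
             fingerprint_index, free_list):
    """Run steps S2-S5 once; all arguments are updated in place.

    objects:           guid -> list of fingerprints (index data)
    object_marks:      guid -> 0 (not checked) or 1 (checked during backup)
    containers:        container id -> {fingerprint: block bytes}
    container_marks:   container id -> 0, 1, or 2 (cleaned, reusable)
    fingerprint_index: fingerprint -> container id
    free_list:         container ids available for reuse
    """
    # S2: objects not checked during backup still reference their containers.
    for guid, mark in list(object_marks.items()):
        if mark == 0:
            for fingerprint in objects[guid]:
                container_marks[fingerprint_index[fingerprint]] = 1

    # S3: clean every container still marked 0, then set its state to 2.
    for cid, mark in list(container_marks.items()):
        if mark == 0:
            for fingerprint in containers[cid]:
                fingerprint_index.pop(fingerprint, None)
            containers[cid] = {}
            container_marks[cid] = 2

    # S4: reset marks so the next cycle starts from a clean state.
    for cid, mark in list(container_marks.items()):
        if mark == 1:
            container_marks[cid] = 0
    for guid in object_marks:
        object_marks[guid] = 0

    # S5: collect cleaned containers for preferential reuse.
    for cid, mark in container_marks.items():
        if mark == 2 and cid not in free_list:
            free_list.append(cid)
```

Under this reading, a container whose objects have all expired is reclaimed one cycle late: its mark of 1 from the last backup survives the first pass, is reset to 0 in S4, and the container is cleaned in the following pass. A container that still holds even one referenced block is marked 1 and left untouched, so cleaning works at container granularity rather than per block.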
5. A data efficient deleting system based on source-end deduplication is characterized by comprising a container determining module, a backup set cleaning module and a deleting module;
the container determining module is used for segmenting data blocks of a data stream at a source end in a backup process, calculating fingerprints, comparing the fingerprints, transmitting the corresponding data blocks to a container at a server end for storage if the fingerprints do not exist and indicate that the data blocks are new blocks, marking the corresponding container as 1, writing the container into a data file after the container is full, and then creating a new container, wherein the container comprises a plurality of data blocks, a deduplication library comprises a plurality of data files with fixed sizes, and each data file comprises a plurality of containers;
the backup set cleaning module is used for automatically cleaning the backup set after the backup set is expired, and simultaneously deleting the guid object records;
and the deleting module is used for cleaning the data blocks and fingerprints of the containers marked as 0 by using preset cyclic deletion logic at idle time outside the normal business window period, wherein a container marked as 0 indicates that the data blocks in the container and their fingerprints are not referenced and can be cleaned.
6. The source-side deduplication based data efficient deletion system of claim 5, wherein the size of the container determined by the container determination module is fixed.
7. The source-side deduplication based data efficient deletion system of claim 5, wherein the container determination module comprises a backup set determination module and a container marking module;
the backup set determining module is used for determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, the object files store object records and index data of the objects, the deduplication library stores data files, and the data files store information of each data block contained in the objects;
the container marking module is used for acquiring the referenced object file, reading the index data in the object file according to the unique identifier of the object, finding the corresponding container according to the fingerprints in the index data, and marking the corresponding container record as 1.
8. The source-side deduplication based data efficient deletion system of claim 7, wherein the deleting module comprises a backup module, a first traversal module, a second traversal module, an initialization module, a collection module, and a loop module;
the backup module is used for, in the backup process, marking the container in which each referenced data block is located as 1 and marking the corresponding object record as 1, which indicates that the referenced data block has been checked;
the first traversal module is used for traversing the object records to find the objects marked as 0, wherein an object record marked as 0 indicates that the object has not been checked, finding the positions in the deduplication library of the containers that store the corresponding data blocks according to the index information recorded in the object files, and marking the containers corresponding to the fingerprints as 1;
the second traversal module is used for traversing the container records in the deduplication library, cleaning the data blocks and their fingerprints in each container marked as 0, and then marking the container state as 2, which indicates that the container has been cleaned and can be reused;
the initialization module is used for resetting to 0 the container records marked as 1 in the deduplication library, and marking all the object records in the object library as 0;
the collection module is used for collecting all containers marked as 2 in the deduplication library, and preferentially selecting the collected containers for reuse when new data needs to be stored;
and the loop module is used for cyclically executing the processes of the backup module, the first traversal module, the second traversal module, the initialization module and the collection module at a set period.
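The claims run the cleaning cycle only at idle time outside the normal business window. A minimal sketch of that gate is shown below; the function name `outside_business_window` and the example window times are assumptions, and the patent does not say how the window is configured.

```python
from datetime import time


def outside_business_window(now: time, start: time, end: time) -> bool:
    """Return True when `now` falls outside the window [start, end)."""
    if start <= end:
        in_window = start <= now < end
    else:
        # The window wraps past midnight, for example 22:00 to 06:00.
        in_window = now >= start or now < end
    return not in_window


# Hypothetical business window from 08:00 to 20:00.
may_clean_at_night = outside_business_window(time(23, 0), time(8, 0), time(20, 0))
may_clean_at_noon = outside_business_window(time(12, 0), time(8, 0), time(20, 0))
```

A loop module could call such a check before each scheduled pass and skip the pass while the business window is open.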
Priority Applications (1)
Application Number | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|
CN201911374951.4A | CN111143343B (en) | 2019-12-27 | 2019-12-27 | Efficient data deleting method and system based on source terminal deduplication |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111143343A (en) | 2020-05-12 |
CN111143343B (en) | 2023-12-15 |
Family
ID=70520911
Family Applications (1)
Application Number | Status | Publication | Title | Priority Date | Filing Date |
---|---|---|---|---|---|
CN201911374951.4A | Active | CN111143343B (en) | Efficient data deleting method and system based on source terminal deduplication | 2019-12-27 | 2019-12-27 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111143343B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102663086A (en) * | 2012-04-09 | 2012-09-12 | 华中科技大学 | Method for retrieving data block indexes |
CN102982180A (en) * | 2012-12-18 | 2013-03-20 | 华为技术有限公司 | Method and device for storing data |
US20160232177A1 (en) * | 2015-02-06 | 2016-08-11 | Ashish Govind Khurange | Methods and systems of a dedupe file-system garbage collection |
CN107301019A (en) * | 2017-06-22 | 2017-10-27 | 重庆大学 | The rubbish recovering method of time diagram and container position table is quoted in a kind of combination |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112637153A (en) * | 2020-12-14 | 2021-04-09 | 南京壹进制信息科技有限公司 | Method and system for removing duplicate in storage encryption |
CN112637153B (en) * | 2020-12-14 | 2024-02-20 | 航天壹进制(江苏)信息科技有限公司 | Method and system for storing encryption and deduplication |
CN113312002A (en) * | 2021-06-11 | 2021-08-27 | 北京百度网讯科技有限公司 | Data processing method, device, equipment and medium |
Also Published As
Publication number | Publication date |
---|---|
CN111143343B (en) | 2023-12-15 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB02 | Change of applicant information | Address after: Building 1, 6th Floor, Changfeng Building, No.14 Xinghuo Road, Research and Innovation Park, Jiangbei New District, Nanjing City, Jiangsu Province, 210000. Applicant after: Aerospace One System (Jiangsu) Information Technology Co.,Ltd. Address before: 210014 Building C, Building 3, No. 5 Baixia High-tech Park, No. 5 Yongzhi Road, Qinhuai District, Nanjing City, Jiangsu Province. Applicant before: NANJING UNARY INFORMATION TECHNOLOGY Co.,Ltd. |
GR01 | Patent grant | |