Disclosure of Invention
The main object of the invention is to provide a data deleting method and a data deleting device. Conventional deduplication mainly relies on the locality of the data source to improve deduplication performance. In distributed storage, however, locality is weak, so data cannot be deleted effectively and reasonably, redundant data cannot be completely removed, and storage space is wasted. The invention aims to solve these problems.
In order to achieve the above object, the present invention provides a data deleting method, which comprises the steps of:
acquiring data to be processed;
determining objects matched with each other in data to be processed;
comparing the data corresponding to the objects matched with each other, and determining repeated data among the objects matched with each other;
deleting the determined duplicate data.
Preferably, the step of acquiring the data to be processed includes:
determining a mode of data deletion;
when the data deleting mode is a real-time deleting mode, acquiring currently stored data and historically stored data, and taking the currently stored data and the historically stored data as data to be processed;
and when the mode of data deletion is a timed deletion mode, acquiring the currently stored data, and taking the currently stored data as the data to be processed.
Preferably, before the acquiring the data to be processed, the method further includes:
receiving data to be stored, and slicing the data to be stored into slice data blocks of a preset size;
and storing each slice data block as an object, and storing the object and the fingerprint index of each slice data block in a data structure.
Preferably, the step of determining the objects matching each other in the data to be processed includes:
determining an object contained in the data to be stored;
and adding the determined objects into a fingerprint index queue, performing object fingerprint comparison, and determining the objects matched with each other through a Hash algorithm.
Preferably, the step of comparing the data corresponding to the objects matched with each other and determining the repeated data between the objects matched with each other includes:
comparing the data corresponding to the matched objects, and verifying the correctness of the data through an MD5 algorithm to obtain valid data;
and determining repeated data between the matched objects from the valid data.
In addition, to achieve the above object, the present invention also provides a data deleting device, including:
an acquisition module, configured to acquire data to be processed;
a determining module, configured to determine objects matched with each other in the data to be processed;
a comparison module, configured to compare the data corresponding to the objects matched with each other, and to determine repeated data among the objects matched with each other;
a deletion module, configured to delete the determined duplicate data.
Preferably, the acquisition module comprises:
a determination unit, configured to determine a mode of data deletion;
an acquisition unit, configured to acquire currently stored data and historically stored data when the data deletion mode is a real-time deletion mode, and to take the currently stored data and the historically stored data as the data to be processed; the acquisition unit is further configured to, when the data deletion mode is a timed deletion mode, acquire the currently stored data and take the currently stored data as the data to be processed.
Preferably, the apparatus further comprises:
a processing module, configured to receive data to be stored and slice the data to be stored into slice data blocks of a preset size;
and a storage module, configured to store each slice data block as an object, and to store the object and the fingerprint index of each slice data block in a data structure.
Preferably, the determining module is further configured to determine an object included in the data to be stored; the determining module is further configured to add the determined objects into a fingerprint index queue, perform object fingerprint comparison, and determine the objects matched with each other through a Hash algorithm.
Preferably, the comparison module is further configured to compare the data corresponding to the objects that are matched with each other, and to verify the correctness of the data through an MD5 algorithm to obtain valid data; the comparison module is further configured to determine repeated data between the matched objects from the valid data.
The invention provides an object-based data deduplication approach that effectively overcomes the problem of weak locality: objects are used to find the duplicate data among objects matched with each other, and the duplicate data is deleted. This realizes high-performance parallel deduplication within a single node and increases deduplication throughput. Data is deleted effectively and reasonably, and the utilization of storage space is improved.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a data deleting method.
Referring to fig. 1, fig. 1 is a flowchart illustrating a data deleting method according to an embodiment of the present invention.
In one embodiment, the data deleting method includes:
step S10, acquiring data to be processed;
in this embodiment, the data to be processed is the data on which deletion is to be performed. For example, it may be the data currently to be deleted alone, or it may be the historical data together with the data currently to be deleted. Referring to fig. 2, acquiring the data to be processed includes:
step S11, determining a data deletion mode;
step S12, when the data deleting mode is a real-time deleting mode, acquiring the currently stored data and the historically stored data, and taking the currently stored data and the historically stored data as the data to be processed;
step S13, when the mode of data deletion is the timed deletion mode, acquiring the currently stored data, and using the currently stored data as the data to be processed.
The data deletion mode is either a timed deletion mode or a real-time deletion mode. The mode is set by the user after the system starts; otherwise the system runs in a default deletion mode. When data needs to be deleted, the data deletion mode is first determined. In the real-time deletion mode, the historical data together with the current data is taken as the data to be processed, or the current data alone is taken as the data to be processed. In the timed deletion mode, all currently stored data is taken as the data to be processed, or only the data stored between the previous time node and the current time node is taken as the data to be processed. Which definition of the data to be processed applies in each mode is set by the user or by system default. When data needs to be deleted, the data to be processed is acquired according to the configured definition.
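The mode-dependent selection described above can be sketched as follows. This is a minimal illustration only: the names `DeletionMode`, `StoredObject`, `select_data_to_process` and the `last_run` parameter are hypothetical and do not appear in the specification, and the sketch picks one of the permitted definitions for each mode (history plus current data for the real-time mode; all current data, or data stored since the previous time node, for the timed mode).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class DeletionMode(Enum):
    REAL_TIME = "real_time"
    TIMED = "timed"


@dataclass
class StoredObject:
    object_id: str
    stored_at: float  # time at which the object was stored (hypothetical field)
    data: bytes


def select_data_to_process(
    mode: DeletionMode,
    current: List[StoredObject],
    history: List[StoredObject],
    last_run: Optional[float] = None,
) -> List[StoredObject]:
    """Return the data to be processed for the given deletion mode."""
    if mode is DeletionMode.REAL_TIME:
        # Real-time mode: historical data together with the current data.
        return list(history) + list(current)
    if mode is DeletionMode.TIMED:
        if last_run is None:
            # Timed mode, first definition: all currently stored data.
            return list(current)
        # Timed mode, second definition: data stored since the previous time node.
        return [obj for obj in current if obj.stored_at > last_run]
    raise ValueError(f"unknown deletion mode: {mode!r}")
```

Passing `last_run=None` selects all currently stored data; passing a timestamp restricts the timed mode to data stored after the previous time node.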
Step S20, determining objects matched with each other in the data to be processed;
during data storage, data is stored in an object manner: the data received from the client is classified by object, and the data corresponding to each object is stored separately in the distributed storage system. Referring to fig. 3, the process of data storage includes:
step S21, receiving data to be stored, and slicing the data to be stored into slice data blocks of a preset size;
step S22, storing each slice data block as an object, and storing the object and the fingerprint index of each slice data block in a data structure.
The data sent by the client is received and sliced, that is, the data transmitted by the client is divided into a plurality of small data blocks, which are then sent to the data storage system. In our distributed storage system each data block is 128 MB; the block size may instead be set according to requirements, system performance, or the total size of the transmitted data, for example 64 MB or 256 MB. Slicing the data improves the stability and reliability of data transmission. If the data is not divided into a plurality of data blocks and a large amount of data is transmitted at one time over a poor network, the data may have to be verified many times during transmission, the transmission time is prolonged, and data loss or congestion may occur, which affects the stability and reliability of the data.
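The slicing step can be illustrated with a short sketch. The function name `slice_data` and the constant `DEFAULT_BLOCK_SIZE` are assumptions made for illustration; only the 128 MB default and the idea of fixed-size blocks come from the description above.

```python
from typing import Iterator

# 128 MB default, as in the embodiment; 64 MB or 256 MB are equally possible.
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024


def slice_data(data: bytes, block_size: int = DEFAULT_BLOCK_SIZE) -> Iterator[bytes]:
    """Yield consecutive slice data blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for offset in range(0, len(data), block_size):
        # The final block may be shorter than block_size.
        yield data[offset:offset + block_size]
```

For example, `list(slice_data(b"abcdefghij", 4))` yields three blocks, the last of which holds the remaining two bytes.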
The valid data and the fingerprint index are stored in a cache in the form of a data structure, and the whole object is then transmitted to the storage system over the network. This establishes the relationship between the data and the fingerprint index and facilitates subsequent operations such as searching for and retrieving the data.
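One possible shape for such a data structure is sketched below. The specification does not name the fingerprint function or the layout of the structure, so SHA-256 and the names `SliceObject` and `FingerprintIndex` are assumptions chosen purely for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SliceObject:
    object_id: str
    fingerprint: str  # hex digest of the slice data (SHA-256 is an assumption)
    data: bytes


class FingerprintIndex:
    """In-memory cache that keeps each object together with its fingerprint."""

    def __init__(self) -> None:
        self._objects: Dict[str, SliceObject] = {}
        self._by_fingerprint: Dict[str, List[str]] = {}

    def put(self, object_id: str, data: bytes) -> SliceObject:
        fingerprint = hashlib.sha256(data).hexdigest()
        obj = SliceObject(object_id, fingerprint, data)
        self._objects[object_id] = obj
        self._by_fingerprint.setdefault(fingerprint, []).append(object_id)
        return obj

    def get(self, object_id: str) -> SliceObject:
        return self._objects[object_id]

    def ids_for(self, fingerprint: str) -> List[str]:
        # All object ids that share the given fingerprint, in insertion order.
        return list(self._by_fingerprint.get(fingerprint, []))
```

Keeping a fingerprint-to-object mapping alongside the objects themselves is what makes the later lookup of matching objects a dictionary access rather than a scan.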
Step S30, comparing the data corresponding to the matched objects and determining the repeated data between the matched objects;
in the real-time deletion mode, when data is received, the currently received data is checked and compared with the historical data to find the data that is similar between the two. In the timed deletion mode, when the scheduled time arrives, similar data is determined in the same way as in the real-time deletion mode, except that the comparison is made over the data to be processed: for example, over all of the stored data, or by comparing the data stored between the previous time node and the current time with the data stored before the previous time node.
Specifically, the process of determining duplicate data includes:
step S31, determining an object included in the data to be stored;
step S32, adding the determined objects into a fingerprint index queue, performing object fingerprint comparison, and determining the objects matched with each other through a Hash algorithm;
step S33, comparing the data corresponding to the matched objects, and verifying the correctness of the data through an MD5 algorithm to obtain valid data;
step S34, determining duplicate data between the objects that match each other from the valid data.
During real-time or timed deletion, the different objects are added to the fingerprint index queue and their fingerprints are compared first. Objects with high similarity (similarity greater than a preset threshold, for example 80% or 70%) are extracted through a Hash algorithm. The correctness of the data is then verified with an MD5 algorithm; if the data is found to be unmodified and an identical copy of it exists, duplicate data has been found, and the found data is taken as the duplicate data between the objects that match each other.
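The two-stage check described above (fingerprint similarity first, MD5 verification second) might look roughly like the sketch below. The specification does not state which Hash algorithm or similarity measure is used, so this example substitutes Jaccard similarity over per-chunk SHA-256 fingerprints as a stand-in; the chunk size and all function names are hypothetical.

```python
import hashlib
from itertools import combinations
from typing import Dict, List, Set, Tuple

CHUNK = 4  # tiny chunk size, for illustration only


def chunks(data: bytes, size: int = CHUNK) -> List[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint_set(data: bytes) -> Set[str]:
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}


def similarity(a: bytes, b: bytes) -> float:
    """Jaccard similarity of the chunk fingerprints (an assumed measure)."""
    fa, fb = fingerprint_set(a), fingerprint_set(b)
    union = fa | fb
    return len(fa & fb) / len(union) if union else 1.0


def matched_objects(objects: Dict[str, bytes], threshold: float = 0.8) -> List[Tuple[str, str]]:
    """Pairs of object ids whose similarity exceeds the preset threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(objects), 2)
        if similarity(objects[x], objects[y]) > threshold
    ]


def duplicate_chunks(a: bytes, b: bytes) -> List[bytes]:
    """Chunks of b that also occur in a, confirmed by MD5 and byte comparison."""
    md5_of_a = {hashlib.md5(c).hexdigest(): c for c in chunks(a)}
    duplicates = []
    for c in chunks(b):
        original = md5_of_a.get(hashlib.md5(c).hexdigest())
        if original is not None and original == c:
            duplicates.append(c)
    return duplicates
```

The final byte-for-byte comparison is an extra safeguard added in this sketch, since an MD5 match alone does not strictly guarantee identical content.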
Step S40, deleting the determined duplicate data.
Deleting the duplicate data removes the redundant data, which frees a large amount of storage space, increases the usable capacity of the disk, and thereby improves storage efficiency.
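As a rough illustration of the deletion step, the sketch below keeps the first copy of each piece of data and removes later identical copies from an in-memory store. The function name `delete_duplicates`, the dictionary-based store, and the returned redirect map are assumptions added for illustration; the specification only states that the determined duplicate data is deleted.

```python
import hashlib
from typing import Dict, Tuple


def delete_duplicates(store: Dict[str, bytes]) -> Tuple[Dict[str, str], int]:
    """Delete duplicate objects in place.

    Returns a map from each deleted object id to the id of the retained copy,
    and the number of bytes reclaimed.
    """
    kept: Dict[str, str] = {}       # MD5 digest -> id of the retained object
    redirects: Dict[str, str] = {}  # deleted id -> retained id (hypothetical)
    reclaimed = 0
    for object_id in sorted(store):  # sorted() copies the keys, so deleting is safe
        data = store[object_id]
        digest = hashlib.md5(data).hexdigest()
        keeper = kept.get(digest)
        if keeper is not None and store[keeper] == data:
            redirects[object_id] = keeper
            reclaimed += len(data)
            del store[object_id]
        else:
            kept.setdefault(digest, object_id)
    return redirects, reclaimed
```

The reclaimed byte count gives a direct measure of the storage space saved by the deletion.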
To better describe the embodiment of the present invention, refer to fig. 5, which is an architecture diagram of data deduplication according to the present invention, and to fig. 6, which is a schematic diagram of data deduplication.
This embodiment provides an object-based data deduplication approach that effectively overcomes the problem of weak locality: objects are used to find the duplicate data among objects matched with each other, and the duplicate data is deleted. This realizes high-performance parallel deduplication within a single node and increases deduplication throughput. Data is deleted effectively and reasonably, and the utilization of storage space is improved.
The invention further provides a data deleting device.
Referring to fig. 7, fig. 7 is a functional module diagram of a data deleting device according to an embodiment of the present invention.
In one embodiment, the data deleting device includes: the device comprises an acquisition module 10, a determination module 20, a processing module 30, a storage module 40, a comparison module 50 and a deletion module 60.
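The cooperation of the six modules listed above can be pictured with a toy, in-memory sketch. It is not the device of the embodiment: the class and method names are hypothetical, matching is simplified to exact fingerprint equality (SHA-256, an assumption), and a tiny block size is used so the behaviour is easy to follow.

```python
import hashlib
from typing import Dict, List


class DataDeletingDevice:
    """Toy composition of modules 10 to 60 (all names are illustrative)."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.store: Dict[str, bytes] = {}  # object id -> slice data
        self.index: Dict[str, str] = {}    # object id -> fingerprint

    def process(self, data: bytes) -> List[bytes]:
        # Processing module 30: slice the data into blocks of a preset size.
        size = self.block_size
        return [data[i:i + size] for i in range(0, len(data), size)]

    def save(self, name: str, data: bytes) -> None:
        # Storage module 40: store each block as an object with its fingerprint.
        for n, block in enumerate(self.process(data)):
            object_id = f"{name}/{n}"
            self.store[object_id] = block
            self.index[object_id] = hashlib.sha256(block).hexdigest()

    def acquire(self) -> Dict[str, bytes]:
        # Acquisition module 10: here, all currently stored data.
        return dict(self.store)

    def determine(self, pending: Dict[str, bytes]) -> List[List[str]]:
        # Determining module 20: group objects whose fingerprints match.
        groups: Dict[str, List[str]] = {}
        for object_id in sorted(pending):
            groups.setdefault(self.index[object_id], []).append(object_id)
        return [ids for ids in groups.values() if len(ids) > 1]

    def compare(self, pending: Dict[str, bytes], groups: List[List[str]]) -> List[str]:
        # Comparison module 50: confirm duplicates with MD5 and byte equality.
        duplicates = []
        for ids in groups:
            keeper = pending[ids[0]]
            keeper_md5 = hashlib.md5(keeper).hexdigest()
            for other in ids[1:]:
                data = pending[other]
                if hashlib.md5(data).hexdigest() == keeper_md5 and data == keeper:
                    duplicates.append(other)
        return duplicates

    def delete(self, object_ids: List[str]) -> None:
        # Deletion module 60: remove the confirmed duplicate objects.
        for object_id in object_ids:
            del self.store[object_id]
            del self.index[object_id]

    def run(self) -> List[str]:
        pending = self.acquire()
        duplicates = self.compare(pending, self.determine(pending))
        self.delete(duplicates)
        return duplicates
```

Saving two pieces of data that share a block and then calling `run()` removes the second copy of the shared block and leaves the remaining objects untouched.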
The acquiring module 10 is configured to acquire data to be processed;
in this embodiment, the data to be processed is the data on which deletion is to be performed. For example, it may be the data currently to be deleted alone, or it may be the historical data together with the data currently to be deleted. Referring to fig. 8, the acquisition module 10 includes: a determination unit 11 and an acquisition unit 12,
the determining unit 11 is configured to determine a mode of data deletion;
the acquiring unit 12 is configured to acquire currently stored data and historically stored data when the data deletion mode is the real-time deletion mode, and to use the currently stored data and the historically stored data as the data to be processed; the acquiring unit 12 is further configured to, when the data deletion mode is the timed deletion mode, acquire the currently stored data and use the currently stored data as the data to be processed.
The data deletion mode is either a timed deletion mode or a real-time deletion mode. The mode is set by the user after the system starts; otherwise the system runs in a default deletion mode. When data needs to be deleted, the data deletion mode is first determined. In the real-time deletion mode, the historical data together with the current data is taken as the data to be processed, or the current data alone is taken as the data to be processed. In the timed deletion mode, all currently stored data is taken as the data to be processed, or only the data stored between the previous time node and the current time node is taken as the data to be processed. Which definition of the data to be processed applies in each mode is set by the user or by system default. When data needs to be deleted, the data to be processed is acquired according to the configured definition.
The determining module 20 is configured to determine objects matching with each other in the data to be processed;
during data storage, data is stored in an object manner: the data received from the client is classified by object, and the data corresponding to each object is stored separately in the distributed storage system.
The processing module 30 is configured to receive data to be stored, slice the data to be stored, and cut the data into slice data blocks of a preset size;
the storage module 40 is configured to store each slice data block as an object, and to store a data structure formed by the object and the fingerprint index of each slice data block.
The data sent by the client is received and sliced, that is, the data transmitted by the client is divided into a plurality of small data blocks, which are then sent to the data storage system. In our distributed storage system each data block is 128 MB; the block size may instead be set according to requirements, system performance, or the total size of the transmitted data, for example 64 MB or 256 MB. Slicing the data improves the stability and reliability of data transmission. If the data is not divided into a plurality of data blocks and a large amount of data is transmitted at one time over a poor network, the data may have to be verified many times during transmission, the transmission time is prolonged, and data loss or congestion may occur, which affects the stability and reliability of the data.
The valid data and the fingerprint index are stored in a cache in the form of a data structure, and the whole object is then transmitted to the storage system over the network. This establishes the relationship between the data and the fingerprint index and facilitates subsequent operations such as searching for and retrieving the data.
The comparing module 50 is configured to compare data corresponding to the objects that are matched with each other, and determine repeated data between the objects that are matched with each other;
in the real-time deletion mode, when data is received, the currently received data is checked and compared with the historical data to find the data that is similar between the two. In the timed deletion mode, when the scheduled time arrives, similar data is determined in the same way as in the real-time deletion mode, except that the comparison is made over the data to be processed: for example, over all of the stored data, or by comparing the data stored between the previous time node and the current time with the data stored before the previous time node.
The determining module 20 is further configured to determine an object included in the data to be stored; the determining module 20 is further configured to add the determined objects into a fingerprint index queue, perform object fingerprint comparison, and determine the objects matched with each other through a Hash algorithm;
the comparison module 50 is further configured to compare the data corresponding to the objects that are matched with each other, and to verify the correctness of the data through an MD5 algorithm to obtain valid data; the comparison module 50 is further configured to determine repeated data between the matched objects from the valid data.
During real-time or timed deletion, the different objects are added to the fingerprint index queue and their fingerprints are compared first. Objects with high similarity (similarity greater than a preset threshold, for example 80% or 70%) are extracted through a Hash algorithm. The correctness of the data is then verified with an MD5 algorithm; if the data is found to be unmodified and an identical copy of it exists, duplicate data has been found, and the found data is taken as the duplicate data between the objects that match each other.
The deleting module 60 is configured to delete the determined duplicate data.
Deleting the duplicate data removes the redundant data, which frees a large amount of storage space, increases the usable capacity of the disk, and thereby improves storage efficiency.
To better describe the embodiment of the present invention, refer to fig. 5, which is an architecture diagram of data deduplication according to the present invention, and to fig. 6, which is a schematic diagram of data deduplication.
This embodiment provides an object-based data deduplication approach that effectively overcomes the problem of weak locality: objects are used to find the duplicate data among objects matched with each other, and the duplicate data is deleted. This realizes high-performance parallel deduplication within a single node and increases deduplication throughput. Data is deleted effectively and reasonably, and the utilization of storage space is improved.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.