Summary of the Invention
A primary object of the present invention is to provide a data deletion method and device, intended to solve the following problem: current data deduplication relies mainly on the locality characteristics of the data source to improve deduplication performance, but locality is not pronounced in distributed storage, so data cannot be deleted effectively and reasonably, redundant data is not deleted thoroughly, and storage space is wasted.
To achieve the above object, the present invention provides a data deletion method, including the steps of:
obtaining data to be processed;
determining mutually matching objects in the data to be processed;
comparing the data corresponding to the mutually matching objects, and determining the data duplicated between the mutually matching objects;
deleting the determined duplicate data.
Preferably, the step of obtaining the data to be processed includes:
determining the data deletion mode;
when the data deletion mode is the real-time deletion mode, obtaining the data currently being stored and the historically stored data, and taking both as the data to be processed;
when the data deletion mode is the timed deletion mode, obtaining the data currently already stored, and taking that data as the data to be processed.
Preferably, before the data to be processed is obtained, the method further includes:
receiving data to be stored, and slicing the data to be stored into sliced data blocks of a preset size;
storing each sliced data block as an object, the object of each sliced data block and a fingerprint index forming a data structure that is stored.
Preferably, the step of object being mutually matched in the determination pending data, includes:
Determine the object included in the data to be stored;
Identified object is added into fingerprint index queue, object fingerprint comparison is carried out, is determined by hash algorithm mutual
The object of matching.
Preferably, the step of comparing the data corresponding to the mutually matching objects and determining the data duplicated between the mutually matching objects includes:
comparing the data corresponding to the mutually matching objects, and verifying the correctness of the data with the MD5 algorithm to obtain valid data;
determining, from the valid data, the data duplicated between the mutually matching objects.
In addition, to achieve the above object, the present invention further provides a data deletion device, including:
an acquisition module, configured to obtain data to be processed;
a determining module, configured to determine mutually matching objects in the data to be processed;
a comparison module, configured to compare the data corresponding to the mutually matching objects and determine the data duplicated between the mutually matching objects;
a deletion module, configured to delete the determined duplicate data.
Preferably, the acquisition module includes:
a determining unit, configured to determine the data deletion mode;
an acquiring unit, configured to, when the data deletion mode is the real-time deletion mode, obtain the data currently being stored and the historically stored data, and take both as the data to be processed; the acquiring unit is further configured to, when the data deletion mode is the timed deletion mode, obtain the data currently already stored and take that data as the data to be processed.
Preferably, the device further includes:
a processing module, configured to receive data to be stored and slice the data to be stored into sliced data blocks of a preset size;
a storage module, configured to store each sliced data block as an object, the object of each sliced data block and a fingerprint index forming a data structure that is stored.
Preferably, the determining module is further configured to determine the objects contained in the data to be stored; the determining module is further configured to add the determined objects to a fingerprint index queue, compare object fingerprints, and determine the mutually matching objects by a hash algorithm.
Preferably, the comparison module is further configured to compare the data corresponding to the mutually matching objects and verify the correctness of the data with the MD5 algorithm to obtain valid data; the comparison module is further configured to determine, from the valid data, the data duplicated between the mutually matching objects.
The present invention proposes an object-based data deduplication approach that effectively overcomes the shortcoming of insufficient locality characteristics: objects are used to find the data duplicated between mutually matching objects, and the duplicate data is deleted. This achieves high-performance parallel data deletion within a single node and extends the throughput of data deduplication. Data is deleted effectively and reasonably, which improves the use of storage space.
Detailed Description of the Embodiments
It should be understood that the specific embodiments described herein merely illustrate the present invention and are not intended to limit it.
The present invention provides a data deletion method.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of an embodiment of the data deletion method of the present invention.
In one embodiment, the data deletion method includes:
Step S10, obtaining data to be processed;
In this embodiment, when data needs to be processed, the data being processed is the data to be processed. For example, when data needs to be deleted, the data subject to deletion processing is the data to be processed; as another example, when data needs to be deleted, the historical data together with the data currently subject to deletion processing is the data to be processed. Referring to Fig. 2, obtaining the data to be processed includes:
Step S11, determining the data deletion mode;
Step S12, when the data deletion mode is the real-time deletion mode, obtaining the data currently being stored and the historically stored data, and taking both as the data to be processed;
Step S13, when the data deletion mode is the timed deletion mode, obtaining the data currently already stored, and taking that data as the data to be processed.
The data deletion modes include a timed deletion mode and a real-time deletion mode. The data deletion mode is set by the user after the system starts, or the system runs with a default deletion mode. When data needs to be deleted, the data deletion mode is determined, for example as the timed deletion mode or as the real-time deletion mode. When the real-time deletion mode is determined, the historical data and the current data are taken as the data to be processed, or the current data alone may be taken directly as the data to be processed. When the timed deletion mode is determined, all currently stored data is taken as the data to be processed, or the data stored between the previous time node and the current time node is taken as the data to be processed. How the data to be processed is defined under the real-time and timed deletion modes is set by the user or by system default, and each definition above is one of the available options. When data needs to be deleted, the data to be processed is obtained in the configured manner.
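Steps S11 to S13 can be sketched as a small selection function. This is a minimal illustration, not the implementation of the invention: the names `DeletionMode`, `get_pending_data`, `incoming` and `stored` are hypothetical, and the stores are modelled as plain Python lists.

```python
from enum import Enum


class DeletionMode(Enum):
    """The two deletion modes named in the text (identifiers are hypothetical)."""
    REAL_TIME = "real_time"
    TIMED = "timed"


def get_pending_data(mode, incoming, stored):
    """Return the data to be processed for the given deletion mode.

    incoming: data currently being stored (assumed to be a list of items)
    stored:   data already stored, i.e. the historical data (assumed list)
    """
    if mode is DeletionMode.REAL_TIME:
        # Step S12: current data plus historical data.
        return list(incoming) + list(stored)
    if mode is DeletionMode.TIMED:
        # Step S13: only data that is already stored.
        return list(stored)
    raise ValueError(f"unknown deletion mode: {mode!r}")
```

A variant that restricts the timed mode to data stored since the previous time node would need timestamps on the stored items, which this sketch does not model.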
Step S20, determining mutually matching objects in the data to be processed;
When data is stored, it is stored in the form of objects; that is, the data is classified by object and the data of each object is then stored separately. In other words, data received from a client is classified according to the objects it belongs to and stored in the distributed storage system accordingly. Referring to Fig. 3, the data storage process includes:
Step S21, receiving data to be stored, and slicing the data to be stored into sliced data blocks of a preset size;
Step S22, storing each sliced data block as an object, the object of each sliced data block and a fingerprint index forming a data structure that is stored.
Data sent by a client is received and sliced; that is, the data transmitted by the client is divided into several small data blocks. In our distributed storage system each data block may be cut to 128M (the size may be set according to requirements, system performance, or the total size of the transmitted data; for example, it may also be set to 64M or 256M) and is then sent to the data storage system. Slicing the data improves the stability and reliability of data transmission. Under poor network conditions, if the data is not cut into several blocks and a large amount of data is transmitted at once, the data undergoes multiple checks during transmission, which increases the transmission time and may also cause data loss or data congestion, harming the stability and reliability of the data.
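The slicing in step S21 amounts to fixed-size chunking. The following is a minimal sketch under stated assumptions: `slice_data` is a hypothetical name, the data is held in memory as `bytes`, and "128M" is read as 128 MiB, which the text does not specify.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # assumed reading of "128M" as 128 MiB


def slice_data(data: bytes, block_size: int = DEFAULT_BLOCK_SIZE):
    """Cut data into consecutive blocks of block_size bytes.

    The last block may be shorter when len(data) is not a multiple
    of block_size; empty input yields no blocks.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

For example, `slice_data(b"abcdefgh", 3)` yields three blocks whose concatenation is the original input. A production system would more likely stream blocks from the network than hold the whole payload in memory.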
The sliced data is stored in the distributed system as objects. The characteristic of object storage is that the valid data and the fingerprint index form a data structure that is stored in a cache, and the whole object is then transmitted over the network to the storage system. This establishes the association between the data and the fingerprint index and facilitates subsequent operations such as looking up and retrieving data.
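One way to picture the data structure of step S22 is an object that carries its own fingerprint, plus an index from fingerprint to objects. This is a sketch only: the text does not name the fingerprint function, so SHA-256 is an assumption here, and `StoredObject`, `make_object` and `build_index` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredObject:
    """A sliced data block stored together with its fingerprint."""
    fingerprint: str
    payload: bytes


def make_object(block: bytes) -> StoredObject:
    # Assumption: the fingerprint is the SHA-256 hex digest of the block.
    return StoredObject(hashlib.sha256(block).hexdigest(), block)


def build_index(blocks):
    """Map each fingerprint to the list of objects that carry it."""
    index = {}
    for block in blocks:
        obj = make_object(block)
        index.setdefault(obj.fingerprint, []).append(obj)
    return index
```

Because the index is keyed by fingerprint, looking up every object that shares a given fingerprint is a single dictionary access, which is the kind of association between data and fingerprint index the paragraph above describes.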
Step S30, comparing the data corresponding to the mutually matching objects, and determining the data duplicated between the mutually matching objects;
In the real-time deletion mode, the currently received data is checked as it is received: the historical data is compared with the currently received data to find the similarity between them, that is, to find similar data. In the timed deletion mode, when the scheduled time arrives, similar data is determined in the same way as in real-time deletion, but the data compared is the data to be processed. For example, the data compared is all stored data, or the data stored from the previous time node up to now is compared with the data stored before the previous time node, to find similar data.
Specifically, the process of determining duplicate data includes:
Step S31, determining the objects contained in the data to be stored;
Step S32, adding the determined objects to a fingerprint index queue, comparing object fingerprints, and determining the mutually matching objects by a hash algorithm;
Step S33, comparing the data corresponding to the mutually matching objects, and verifying the correctness of the data with the MD5 algorithm to obtain valid data;
Step S34, determining, from the valid data, the data duplicated between the mutually matching objects.
During real-time or timed data deletion, the different objects are added to one fingerprint index queue. Object fingerprints are compared first, and objects with high similarity (similarity greater than a preset threshold, for example 80% or 70%) are extracted by the hash algorithm. The correctness of the data is then verified with the MD5 algorithm. If the data is found to be unmodified and to have an identical copy, duplicate data has been found, and the duplicate data found is taken as the data duplicated between the mutually matching objects.
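The two-stage check in steps S32 to S34 can be sketched as a similarity filter followed by an MD5 comparison. The text names only "a hash algorithm", a similarity threshold and MD5, so the fingerprint scheme below (a set of short hashes of fixed-size pieces, compared by Jaccard similarity) is an assumption chosen for illustration, and `fingerprint`, `similarity` and `find_duplicates` are hypothetical names.

```python
import hashlib
from itertools import combinations


def fingerprint(data: bytes, piece: int = 4) -> frozenset:
    """Assumed fingerprint: the set of short hashes of fixed-size pieces."""
    return frozenset(
        hashlib.sha1(data[i:i + piece]).hexdigest()[:8]
        for i in range(0, len(data), piece)
    )


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity of two fingerprints (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_duplicates(objects: dict, threshold: float = 0.8):
    """Return pairs of object names whose payloads have equal MD5 digests.

    objects maps an object name to its payload bytes. Pairs below the
    similarity threshold are skipped before any MD5 digest is computed.
    """
    prints = {name: fingerprint(data) for name, data in objects.items()}
    duplicates = []
    for a, b in combinations(sorted(objects), 2):
        if similarity(prints[a], prints[b]) < threshold:
            continue  # not similar enough to be a candidate
        if hashlib.md5(objects[a]).digest() == hashlib.md5(objects[b]).digest():
            duplicates.append((a, b))
    return duplicates
```

The pairwise loop is quadratic in the number of objects and is only meant to show the order of the checks; an index keyed by fingerprint would avoid comparing every pair. Where MD5 collisions are a concern, a byte-for-byte comparison could follow the digest check.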
Step S40, deleting the determined duplicate data.
Deleting duplicate data, that is, deleting redundant data, saves a large amount of storage space and increases the available disk storage capacity; because the available capacity increases, storage efficiency improves.
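Step S40 can be illustrated by keeping one copy of each payload and recording where the removed copies now point. This is a sketch under assumptions: the store is an in-memory dictionary, identity is decided by MD5 digest as in the preceding steps, and `delete_duplicates` and the redirect map are hypothetical, since the text does not say how references to deleted data are handled.

```python
import hashlib


def delete_duplicates(store: dict) -> dict:
    """Delete redundant copies from store in place.

    store maps an object name to its payload bytes. The first name
    (in sorted order) holding a given payload is kept. The return value
    maps each deleted name to the name of the copy that was kept.
    """
    kept_by_digest = {}
    redirects = {}
    for name in sorted(store):  # sorted() copies the keys, so deleting is safe
        digest = hashlib.md5(store[name]).hexdigest()
        if digest in kept_by_digest:
            redirects[name] = kept_by_digest[digest]
            del store[name]
        else:
            kept_by_digest[digest] = name
    return redirects
```

Returning the redirect map is a design choice of this sketch: it lets a caller that still holds the name of a deleted object find the surviving copy.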
To better describe the embodiments of the present invention, refer to Fig. 5, which is an architecture diagram of data deduplication according to the present invention, and to Fig. 6, which is a schematic diagram of data deletion.
This embodiment proposes an object-based data deduplication approach that effectively overcomes the shortcoming of insufficient locality characteristics: objects are used to find the data duplicated between mutually matching objects, and the duplicate data is deleted. This achieves high-performance parallel data deletion within a single node and extends the throughput of data deduplication. Data is deleted effectively and reasonably, which improves the use of storage space.
The present invention further provides a data deletion device.
Referring to Fig. 7, Fig. 7 is a schematic functional block diagram of an embodiment of the data deletion device of the present invention.
In one embodiment, the data deletion device includes: an acquisition module 10, a determining module 20, a processing module 30, a storage module 40, a comparison module 50 and a deletion module 60.
The acquisition module 10 is configured to obtain data to be processed;
In this embodiment, when data needs to be processed, the data being processed is the data to be processed. For example, when data needs to be deleted, the data subject to deletion processing is the data to be processed; as another example, when data needs to be deleted, the historical data together with the data currently subject to deletion processing is the data to be processed. Referring to Fig. 8, the acquisition module 10 includes a determining unit 11 and an acquiring unit 12.
The determining unit 11 is configured to determine the data deletion mode;
The acquiring unit 12 is configured to, when the data deletion mode is the real-time deletion mode, obtain the data currently being stored and the historically stored data, and take both as the data to be processed; the acquiring unit 12 is further configured to, when the data deletion mode is the timed deletion mode, obtain the data currently already stored and take that data as the data to be processed.
The data deletion modes include a timed deletion mode and a real-time deletion mode. The data deletion mode is set by the user after the system starts, or the system runs with a default deletion mode. When data needs to be deleted, the data deletion mode is determined, for example as the timed deletion mode or as the real-time deletion mode. When the real-time deletion mode is determined, the historical data and the current data are taken as the data to be processed, or the current data alone may be taken directly as the data to be processed. When the timed deletion mode is determined, all currently stored data is taken as the data to be processed, or the data stored between the previous time node and the current time node is taken as the data to be processed. How the data to be processed is defined under the real-time and timed deletion modes is set by the user or by system default, and each definition above is one of the available options. When data needs to be deleted, the data to be processed is obtained in the configured manner.
The determining module 20 is configured to determine mutually matching objects in the data to be processed;
When data is stored, it is stored in the form of objects; that is, the data is classified by object and the data of each object is then stored separately. In other words, data received from a client is classified according to the objects it belongs to and stored in the distributed storage system accordingly.
The processing module 30 is configured to receive data to be stored and slice the data to be stored into sliced data blocks of a preset size;
The storage module 40 is configured to store each sliced data block as an object, the object of each sliced data block and a fingerprint index forming a data structure that is stored.
Data sent by a client is received and sliced; that is, the data transmitted by the client is divided into several small data blocks. In our distributed storage system each data block may be cut to 128M (the size may be set according to requirements, system performance, or the total size of the transmitted data; for example, it may also be set to 64M or 256M) and is then sent to the data storage system. Slicing the data improves the stability and reliability of data transmission. Under poor network conditions, if the data is not cut into several blocks and a large amount of data is transmitted at once, the data undergoes multiple checks during transmission, which increases the transmission time and may also cause data loss or data congestion, harming the stability and reliability of the data.
The sliced data is stored in the distributed system as objects. The characteristic of object storage is that the valid data and the fingerprint index form a data structure that is stored in a cache, and the whole object is then transmitted over the network to the storage system. This establishes the association between the data and the fingerprint index and facilitates subsequent operations such as looking up and retrieving data.
The comparison module 50 is configured to compare the data corresponding to the mutually matching objects and determine the data duplicated between the mutually matching objects;
In the real-time deletion mode, the currently received data is checked as it is received: the historical data is compared with the currently received data to find the similarity between them, that is, to find similar data. In the timed deletion mode, when the scheduled time arrives, similar data is determined in the same way as in real-time deletion, but the data compared is the data to be processed. For example, the data compared is all stored data, or the data stored from the previous time node up to now is compared with the data stored before the previous time node, to find similar data.
The determining module 20 is further configured to determine the objects contained in the data to be stored; the determining module 20 is further configured to add the determined objects to a fingerprint index queue, compare object fingerprints, and determine the mutually matching objects by a hash algorithm;
The comparison module 50 is further configured to compare the data corresponding to the mutually matching objects and verify the correctness of the data with the MD5 algorithm to obtain valid data; the comparison module 50 is further configured to determine, from the valid data, the data duplicated between the mutually matching objects.
During real-time or timed data deletion, the different objects are added to one fingerprint index queue. Object fingerprints are compared first, and objects with high similarity (similarity greater than a preset threshold, for example 80% or 70%) are extracted by the hash algorithm. The correctness of the data is then verified with the MD5 algorithm. If the data is found to be unmodified and to have an identical copy, duplicate data has been found, and the duplicate data found is taken as the data duplicated between the mutually matching objects.
The deletion module 60 is configured to delete the determined duplicate data.
Deleting duplicate data, that is, deleting redundant data, saves a large amount of storage space and increases the available disk storage capacity; because the available capacity increases, storage efficiency improves.
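The cooperation of modules 10 to 60 can be pictured with a toy in-memory class in which each method stands in for one module. Everything here is an assumption made for illustration: the class and method names are hypothetical, the fingerprint is taken to be a SHA-256 digest, only the timed deletion mode is modelled, and exact fingerprint equality replaces the similarity threshold.

```python
import hashlib


class DataDeletionDevice:
    """Toy stand-in for modules 10 to 60; all names are hypothetical."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.store = {}   # object id -> payload bytes
        self.index = {}   # fingerprint -> list of object ids
        self._next_id = 0

    def process(self, data: bytes):
        # Processing module 30: cut incoming data into fixed-size blocks.
        step = self.block_size
        return [data[i:i + step] for i in range(0, len(data), step)]

    def save(self, blocks):
        # Storage module 40: store each block as an object with its fingerprint.
        for block in blocks:
            object_id = self._next_id
            self._next_id += 1
            self.store[object_id] = block
            fingerprint = hashlib.sha256(block).hexdigest()
            self.index.setdefault(fingerprint, []).append(object_id)

    def acquire(self):
        # Acquisition module 10: timed mode, so every stored object is pending.
        return set(self.store)

    def match(self, pending):
        # Determining module 20: group pending objects sharing a fingerprint.
        groups = ([i for i in ids if i in pending] for ids in self.index.values())
        return [group for group in groups if len(group) > 1]

    def compare(self, groups):
        # Comparison module 50: confirm with MD5, keeping each group's first copy.
        duplicates = []
        for group in groups:
            keeper = hashlib.md5(self.store[group[0]]).digest()
            duplicates += [i for i in group[1:]
                           if hashlib.md5(self.store[i]).digest() == keeper]
        return duplicates

    def remove(self, duplicates):
        # Deletion module 60: drop redundant objects and their index entries.
        for object_id in duplicates:
            fingerprint = hashlib.sha256(self.store.pop(object_id)).hexdigest()
            self.index[fingerprint].remove(object_id)

    def deduplicate(self):
        duplicates = self.compare(self.match(self.acquire()))
        self.remove(duplicates)
        return duplicates
```

A caller would feed data through `save(process(data))` and then invoke `deduplicate()` at the scheduled time; the returned list holds the ids of the objects that were removed.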
To better describe the embodiments of the present invention, refer to Fig. 5, which is an architecture diagram of data deduplication according to the present invention, and to Fig. 6, which is a schematic diagram of data deletion.
This embodiment proposes an object-based data deduplication approach that effectively overcomes the shortcoming of insufficient locality characteristics: objects are used to find the data duplicated between mutually matching objects, and the duplicate data is deleted. This achieves high-performance parallel data deletion within a single node and extends the throughput of data deduplication. Data is deleted effectively and reasonably, which improves the use of storage space.
The above are only preferred embodiments of the present invention and do not limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.