Disclosure of Invention
The invention aims to provide a method and a device for data deduplication in a storage system that greatly reduce the computational overhead while ensuring the reliability of the stored data and the deduplication ratio, thereby at least solving the problem that prior-art deduplication systems must compute hash values and therefore incur a large computational overhead.
To achieve the above object, one aspect of the present invention provides a method for data deduplication in a storage system, including: reading non-stored data blocks, each non-stored data block comprising data bits and multiple segments of check bits; extracting the multiple segments of check bits of a non-stored data block; comparing, one by one, the multiple segments of check bits of the non-stored data block with the multiple segments of check bits of the stored data blocks; if every segment of check bits of the non-stored data block is consistent with the corresponding segment of check bits of a stored data block, marking the non-stored data block as a duplicate data block, deleting the duplicate data block, and storing index information of the duplicate data block, the index information being used for data reading; otherwise, marking the non-stored data block as a non-duplicate data block and storing the non-duplicate data block.
Further, the check bits consist of multiple segments of data check bits and one segment of error correction check bits.
Further, the data check bits are computed with a check code, and the error correction check bits are computed with an error correction code.
Further, the one-by-one comparison of the multiple segments of check bits of the non-stored data block with those of the stored data block is performed by parallel processing or serial processing.
Further, storing the non-duplicate data block includes: storing the data bits of the non-stored data block in the data bit area and storing the check bits of the non-stored data block in the check bit area.
Further, the data reading includes: reading a stored data block; performing error correction decoding on the read data block; performing a segment check on the multiple segments of check bits of the decoded data block; judging whether every segment of check bits of the data block passes the check; if every segment passes, outputting the data bits of the data block to all index positions according to the index information; and if any segment fails, outputting error information at the index positions corresponding to the data block.
Another aspect of the present invention provides a data deduplication apparatus for a storage system, including: a data input module, an extraction module, a retrieval comparison module, a marking module, a duplicate data deletion module, a first data storage module, and a second data storage module; the data input module is used for reading non-stored data blocks; the extraction module is used for extracting the multiple segments of check bits of a non-stored data block; the retrieval comparison module is used for comparing, one by one, the multiple segments of check bits of the non-stored data block with the multiple segments of check bits of the stored data blocks, so as to judge whether every segment of check bits of the non-stored data block is consistent with the corresponding segment of check bits of a stored data block; the marking module is used for marking the non-stored data block as a duplicate data block or a non-duplicate data block; the duplicate data deletion module is used for deleting duplicate data blocks; the first data storage module is used for storing index information of the duplicate data blocks; the second data storage module is used for storing non-duplicate data blocks.
Further, the storage system deduplication apparatus further includes a data reading apparatus including: a data reading module, an error correction decoding module, a segment check module, a judging module, an output module, and an error reporting module; the data reading module is used for reading stored data blocks; the error correction decoding module is used for performing error correction decoding on the read data block; the segment check module is used for performing a segment check on the multiple segments of check bits of the decoded data block; the judging module is used for judging whether every segment of check bits of the data block passes the check; the output module is used for outputting the data bits of the checked data block to all index positions; the error reporting module is used for outputting error information at the index positions corresponding to the data block.
The technical scheme of the invention provides a method and a device for data deduplication in a storage system. The method includes: reading non-stored data blocks, each non-stored data block comprising data bits and multiple segments of check bits; extracting the multiple segments of check bits of a non-stored data block; comparing, one by one, the multiple segments of check bits of the non-stored data block with the multiple segments of check bits of the stored data blocks; if every segment of check bits of the non-stored data block is consistent with the corresponding segment of check bits of a stored data block, marking the non-stored data block as a duplicate data block, deleting the duplicate data block, and storing index information of the duplicate data block, the index information being used for data reading; otherwise, marking the non-stored data block as a non-duplicate data block and storing the non-duplicate data block. Compared with conventional deduplication methods, the method provided by the invention avoids computing hash values and thus greatly reduces the computational overhead. In addition, it ensures the reliability of the stored data and the deduplication ratio, making it well suited to deduplication in practical storage systems (particularly mobile data storage systems such as flash memory).
Description of the embodiments
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The invention will be described in detail below with reference to the drawings in connection with embodiments.
FIG. 1 is a flow chart of an optional storage system deduplication method according to an embodiment of the present invention. The invention provides a method for data deduplication in a storage system, which comprises the following steps:
Step S102: reading non-stored data blocks, each non-stored data block comprising data bits and multiple segments of check bits;
Step S104: extracting the multiple segments of check bits of a non-stored data block;
Step S106: comparing, one by one, the multiple segments of check bits of the non-stored data block with the multiple segments of check bits of the stored data blocks;
Step S108: if every segment of check bits of the non-stored data block is consistent with the corresponding segment of check bits of a stored data block, marking the non-stored data block as a duplicate data block, deleting the duplicate data block, and storing index information of the duplicate data block; the index information is used for data reading;
Step S110: otherwise, marking the non-stored data block as a non-duplicate data block and storing the non-duplicate data block. Steps S108 and S110 are alternative branches: depending on the comparison result of step S106, the flow proceeds to either step S108 or step S110.
Compared with conventional deduplication methods, this method avoids computing hash values and thus greatly reduces the computational overhead while ensuring the reliability of the stored data and the deduplication ratio, making it well suited to deduplication in practical storage systems (particularly mobile data storage systems such as flash memory).
As an optimization of the invention, the one-by-one comparison of the multiple segments of check bits of the non-stored data block with those of the stored data block may be performed in parallel or serially. Parallel processing gives a higher comparison speed but requires greater parallel processing capability.
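The two comparison modes can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names `match_serial` and `match_parallel` are hypothetical, check-bit segments are modeled as plain Python values, and a thread pool merely stands in for whatever parallel comparison hardware or software a real system would use.

```python
import operator
from concurrent.futures import ThreadPoolExecutor


def match_serial(new_checks, stored_checks):
    """Compare segments one after another and stop at the first mismatch."""
    if len(new_checks) != len(stored_checks):
        return False
    for new_seg, stored_seg in zip(new_checks, stored_checks):
        if new_seg != stored_seg:
            return False
    return True


def match_parallel(new_checks, stored_checks, workers=4):
    """Compare all segments concurrently; the block matches only if all agree."""
    if len(new_checks) != len(stored_checks):
        return False
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map applies operator.eq to each pair of corresponding segments.
        return all(pool.map(operator.eq, new_checks, stored_checks))
```

The serial variant needs no extra resources and can exit early on a mismatch, whereas the parallel variant trades additional processing capability for speed, which mirrors the trade-off stated above.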
FIG. 2 is a schematic diagram of an optional storage system data storage method according to an embodiment of the invention. As can be seen from fig. 2, stored data bits are typically protected by check bits to improve the reliability of the stored data. The check bits are used to verify the integrity of the data, and when errors occur in the stored data, an error correction mechanism may be activated to correct the erroneous bits. In the storage system 140, the data bits and the check bits of a data block are typically stored separately: the data bits are stored in the data bit area 141 of the storage system, and the check bits are stored in the check bit area 142 of the storage system. Typically, the length of the check bits is much smaller than the length of the data bits.
The method of calculating the check bits depends on the check code and the error correction code adopted. In general, the check code may be a cyclic redundancy check (CRC), a Hamming check, etc., and the error correction code may be a BCH code, a Reed-Solomon (RS) code, a low-density parity-check (LDPC) code, etc.
Fig. 3 is a schematic diagram of the codeword structure of an optional encoded storage system data block according to an embodiment of the present invention. As can be seen from fig. 3, the original data bits 201 are divided into several segments, and a check is computed for each segment to form the corresponding check bit sequence 202, which serves as the indicator of whether that segment's data sequence is valid. Segment checking is adopted because, when the data bits are long, it improves checking performance and reduces the probability of check bit collisions during deduplication.
In order to more clearly illustrate the above procedure, a method of calculating the check bits will be described below by taking CRC check as an example. The CRC has the advantage that the bit length of the input information can be arbitrarily selected, and the CRC has higher flexibility.
Let the data be divided into L segments, each containing k data bits. Assume that the first segment of data is $[m_0, m_1, \ldots, m_{k-1}]$, with corresponding polynomial $m(x) = m_0 + m_1 x + \cdots + m_{k-1} x^{k-1}$. Let the degree of the CRC generator polynomial $g(x)$ be r. The polynomial $x^r m(x)$ is divided by $g(x)$ and the remainder is taken:
$$p(x) = x^r m(x) \bmod g(x),$$
which yields a polynomial $p(x) = p_0 + p_1 x + \cdots + p_{r-1} x^{r-1}$ of degree at most r-1. Its coefficients $[p_0, p_1, \ldots, p_{r-1}]$ form a sequence of length r, which is the check bit sequence 202.
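The remainder computation $p(x) = x^r m(x) \bmod g(x)$ can be sketched in a few lines. This is a minimal illustration under stated assumptions: bits are held in lists whose index i is the coefficient of $x^i$, the function names are hypothetical, and the generator $g(x) = 1 + x + x^3$ used in the usage lines is chosen only to keep the example small, not because the source specifies it.

```python
def crc_check_bits(m, g):
    """Return [p_0, ..., p_{r-1}], the coefficients of x^r * m(x) mod g(x) over GF(2).

    m: data bits [m_0, ..., m_{k-1}], where m[i] is the coefficient of x^i.
    g: generator coefficients [g_0, ..., g_r], with g[r] == 1.
    """
    r = len(g) - 1
    # Multiplying m(x) by x^r shifts every coefficient up by r positions.
    rem = [0] * r + list(m)
    # Polynomial long division: cancel the highest-degree term each round.
    for i in range(len(rem) - 1, r - 1, -1):
        if rem[i]:
            for j, g_j in enumerate(g):
                rem[i - r + j] ^= g_j
    return rem[:r]


def segment_check_bits(data, k, g):
    """Split data into segments of k bits and compute the check bits of each."""
    return [crc_check_bits(data[i:i + k], g) for i in range(0, len(data), k)]


# Illustrative generator g(x) = 1 + x + x^3 (r = 3), an assumption of this sketch.
g = [1, 1, 0, 1]
checks = segment_check_bits([1, 0, 1, 1, 0, 0, 0, 0], k=4, g=g)
```

Appending the r check bits to each k-bit segment gives the k+r bits per segment that the encoded block carries.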
Once the check bit sequences 202 of all L data segments have been computed, a sequence of length L(k+r) is obtained. This sequence is then error correction coded to obtain the error correction check bit sequence 203. As described above, the error correction code may be a BCH code, an RS code, an LDPC code, or the like.
To illustrate the above procedure more clearly, the calculation of the error correction check bit sequence 203 is described below, taking an LDPC code as an example. LDPC codes are a class of linear codes defined by a sparse check matrix. Let D be the information sequence to be encoded and P the check bit sequence. For a given D, the error correction check sequence P is calculated such that the vector $C = [D, P]$ formed by D and P satisfies the check equation
$$C \cdot H^{T} = 0,$$
where H is the check matrix of the LDPC code, T denotes the matrix transpose, and $\cdot$ denotes multiplication modulo 2. The check matrix of an LDPC code is typically a sparse matrix; its construction is not described in detail here. Assuming that the length of the information sequence D is M and the length of the encoded codeword is N, the length of P is N-M.
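The check equation can be illustrated with a toy systematic code. The sketch below assumes a check matrix of the form H = [A | I], for which P = D · A^T (modulo 2) satisfies C · H^T = 0. The 3x4 matrix `A` and the function names are hypothetical; a real LDPC check matrix would be far larger and sparse, so this shows only the algebra, not a practical LDPC encoder.

```python
def mat_vec_gf2(rows, vec):
    """Multiply a binary matrix (list of rows) by a binary vector, modulo 2."""
    return [sum(a & b for a, b in zip(row, vec)) % 2 for row in rows]


# Hypothetical toy matrix A with N - M = 3 rows and M = 4 columns; H = [A | I].
A = [
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 1, 1],
]


def encode(D):
    """Return the codeword C = [D, P], with P chosen so that C . H^T = 0."""
    P = mat_vec_gf2(A, D)
    return list(D) + P


def syndrome(C):
    """Return C . H^T modulo 2; an all-zero result means the check equation holds."""
    n_check = len(A)
    H = [row + [1 if i == j else 0 for j in range(n_check)]
         for i, row in enumerate(A)]
    return mat_vec_gf2(H, C)
```

With M = 4 and N = 7, the check part P has N - M = 3 bits, matching the length relation stated above; flipping any single bit of a codeword makes the syndrome nonzero.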
FIG. 4 is a schematic diagram of an optional storage system deduplication principle in accordance with an embodiment of the present invention. As shown, the data of the first data stream 301 and the second data stream 302 need to be stored in a storage system. Each data stream contains 4 data blocks, so storing them directly would require storage space for 8 data blocks. If, however, the duplicate data blocks are identified by the deduplication technique, each duplicate block (for example, data block A) is stored only once in the system and index information is established for the other copies. In the end only 4 data blocks need to be stored to obtain the final data stream 303, which saves storage space in the system.
FIG. 5 is a flowchart of an optional storage system deduplication method according to an embodiment of the present invention. As can be seen from fig. 5, the method comprises the steps of:
S401: Read data.
S402: Extract the check bits.
S403: Retrieve the check bits.
S404: Judge whether the check bits already exist; if so, go to step S405; otherwise, go to step S406.
S405: Delete the data block and store index information.
S406: Store the data block.
S407: Start the next deduplication pass.
To illustrate the deduplication method proposed by the present invention more clearly, its execution is described in detail below.
Step S401: Read data. Data is read block by block; each block is an encoded data block as shown in fig. 3 and comprises data bits and multiple segments of check bits.
Step S402: Extract the check bits. The check bits in the data block read in step S401 are extracted; they typically comprise multiple segments.
Step S403: Retrieve the check bits. The check bits extracted in step S402 are compared with the stored check bits by retrieval. For the segments of check bits of each data block, the retrieval may be performed in parallel or serially.
Step S404: Judge whether the check bits of the data block to be stored already exist. The check bits are considered to exist if and only if every segment of check bits is consistent with the corresponding segment of check bits of some stored data block; otherwise they are considered not to exist.
For example, let the number of segments per data block in the storage system be L = 8. Assume that the check bit sequences of the 8 data segments are, in order, $p_0, p_1, \ldots, p_7$, and that the error correction check bit sequence is $P$. If the check bit sequences of a stored data block are, in order, $p'_0, p'_1, \ldots, p'_7$, its error correction check bit sequence is $P'$, and
$$p_0 = p'_0,\; p_1 = p'_1,\; \ldots,\; p_7 = p'_7,\; P = P',$$
then every segment of check bits is considered consistent with the corresponding segment of check bits of that stored data block.
If the check bits exist, go to step S405; otherwise, go to step S406.
Step S405: Delete the data block and store index information. The data block is a duplicate data block, so only its index information is saved for data reading.
Step S406: Store the data block. Specifically, the data bits of the data block are stored in the data bit area, and the check bits are stored in the check bit area.
Step S407: Start the next deduplication pass.
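Steps S401 to S407 can be sketched as a small in-memory model. This is an illustration under stated assumptions rather than the implementation of the invention: the class `DedupStore` and its attribute names are hypothetical, blocks are assumed to arrive already encoded (so the segment check bits and the error correction check bits are passed in directly), and the retrieval in step S403 is modeled as a simple linear scan of the check bit area.

```python
class DedupStore:
    """Toy in-memory model of steps S401 to S407 (all names are hypothetical)."""

    def __init__(self):
        self.data_area = []   # data bit area: one entry per unique block
        self.check_area = []  # check bit area: entry i belongs to data_area[i]
        self.index = []       # index information: logical position -> slot

    def write(self, data_bits, seg_checks, ecc_check):
        """Process one encoded block; return True if it was a duplicate."""
        # S402: extract the check bits carried by the encoded block.
        new_checks = (tuple(seg_checks), ecc_check)
        # S403/S404: the check bits "exist" only if every segment and the
        # error correction check bits equal those of some stored block.
        for slot, stored_checks in enumerate(self.check_area):
            if stored_checks == new_checks:
                # S405: drop the block and keep only index information.
                self.index.append(slot)
                return True
        # S406: store data bits and check bits in their separate areas.
        self.data_area.append(data_bits)
        self.check_area.append(new_checks)
        self.index.append(len(self.data_area) - 1)
        return False


# Usage with made-up check values: the third block repeats the first.
store = DedupStore()
flags = [
    store.write(b"AAAA", [1, 2], 9),
    store.write(b"BBBB", [3, 4], 7),
    store.write(b"AAAA", [1, 2], 9),
]
```

Only the check bits are compared, so no hash of the data bits is computed on the write path; a block that differs from a stored block in even one segment of check bits is stored as a new block.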
FIG. 6 is a flowchart of an optional method of reading data from a storage system according to an embodiment of the invention. As can be seen from fig. 6, the method comprises the steps of:
S501: Read data.
S502: Perform error correction decoding.
S503: Check the data.
S504: Judge whether every segment of data passes the check; if so, go to step S505; otherwise, go to step S506.
S505: Output the data according to the index.
S506: Output error information according to the index.
S507: Start the next data reading pass.
To illustrate the method for reading data after deduplication in the storage system more clearly, its execution is described in detail below.
Step S501: Read data. Data is read block by block, and each block corresponds to the encoded data block shown in fig. 3. Because of noise interference during reading and similar effects, the read data block may contain bit errors.
Step S502: Perform error correction decoding. The decoding method depends on the coding scheme employed and aims to correct errors introduced during data reading.
Step S503: Check the data. A segment-by-segment data check is performed on the decoded sequence to judge whether each segment passes.
To illustrate the above procedure more clearly, the data check is described below taking the CRC check as an example. For example, let the number of segments per data block in the storage system be L = 8. Assume that the 8 data segments read are, in order, $d_0, d_1, \ldots, d_7$, and that the check bit sequences read are, in order, $p_0, p_1, \ldots, p_7$. Let $p'_0, p'_1, \ldots, p'_7$ be the check bit sequences calculated from $d_0, d_1, \ldots, d_7$. If
$$p_i = p'_i,$$
the check bits read for the i-th segment are considered consistent with the check bits calculated from that segment of data ($0 \le i < 8$).
Step S504: Judge whether every segment of data passes the check; if so, go to step S505; otherwise, go to step S506.
Step S505: Output the data according to the index. If every segment of data passes the check, the data bit sequence is output to all index positions.
Step S506: Output error information according to the index. If any segment of data fails the check, error information is output at the corresponding index positions.
Step S507: Start the next data reading pass.
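The reading flow of steps S501 to S507 can be sketched as follows. Several things here are assumptions of the sketch and not taken from the source: the function names, the `ERROR` marker, the use of CRC-32 from Python's standard `zlib` module as the segment check code, and an identity placeholder in place of a real error correction decoder.

```python
import zlib

ERROR = "ERROR"  # hypothetical marker for the error information of step S506


def segment_check(segments, read_checks):
    """S503: recompute each segment's check value and compare with the one read."""
    return [zlib.crc32(d) == p for d, p in zip(segments, read_checks)]


def read_block(segments, read_checks, index_positions,
               ecc_decode=lambda segs, checks: (segs, checks)):
    """Return a mapping from each index position to the data or to ERROR."""
    # S502: error correction decoding (identity placeholder in this sketch).
    segments, read_checks = ecc_decode(segments, read_checks)
    # S503/S504: the block is accepted only if every segment passes.
    all_pass = all(segment_check(segments, read_checks))
    # S505: output the data bits to all index positions, or
    # S506: output error information at those positions instead.
    payload = b"".join(segments) if all_pass else ERROR
    return {pos: payload for pos in index_positions}


# Usage: one stored block referenced from logical positions 0 and 3.
segs = [b"ab", b"cd"]
checks = [zlib.crc32(s) for s in segs]
good = read_block(segs, checks, [0, 3])
bad = read_block([b"ab", b"cX"], checks, [0, 3])
```

Because a deduplicated block may be referenced from several index positions, a single successful check serves all of them, and a single failed segment marks all of them as erroneous.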
FIG. 7 is a schematic diagram of an optional storage system deduplication apparatus in accordance with an embodiment of the present invention. As can be seen from fig. 7, the storage system deduplication apparatus includes: a data input module 10, an extraction module 20, a retrieval comparison module 30, a marking module 40, a duplicate data deletion module 50, a first data storage module 60, and a second data storage module 70; the data input module 10 is used for reading non-stored data blocks; the extraction module 20 is configured to extract the multiple segments of check bits of a non-stored data block; the retrieval comparison module 30 is configured to compare, one by one, the multiple segments of check bits of the non-stored data block with the multiple segments of check bits of the stored data blocks to determine whether every segment of check bits of the non-stored data block is consistent with the corresponding segment of check bits of a stored data block; the marking module 40 is configured to mark the non-stored data block as a duplicate data block or a non-duplicate data block; the duplicate data deletion module 50 is configured to delete duplicate data blocks; the first data storage module 60 is configured to store index information of the duplicate data blocks; the second data storage module 70 is used to store non-duplicate data blocks.
FIG. 8 is a schematic diagram of an optional data reading apparatus for a deduplicated storage system according to an embodiment of the present invention. As can be seen from fig. 8, the data reading apparatus includes: a data reading module 80, an error correction decoding module 90, a segment check module 100, a judging module 110, an output module 120, and an error reporting module 130; the data reading module 80 is used for reading stored data blocks; the error correction decoding module 90 is configured to perform error correction decoding on the read data block; the segment check module 100 is configured to perform a segment check on the multiple segments of check bits of the decoded data block; the judging module 110 is configured to judge whether every segment of check bits of the data block passes the check; the output module 120 is configured to output the data bits of the checked data block to all index positions; the error reporting module 130 is configured to output error information at the index positions corresponding to the data block.
The units and method steps of the examples described in the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.