WO2023147842A1 - Data storage and methods of deduplication of data storage, storing new file and deleting file - Google Patents

Data storage and methods of deduplication of data storage, storing new file and deleting file

Info

Publication number
WO2023147842A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
prior
data storage
new
size
Prior art date
Application number
PCT/EP2022/052334
Other languages
French (fr)
Inventor
Assaf Natanzon
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/052334 priority Critical patent/WO2023147842A1/en
Publication of WO2023147842A1 publication Critical patent/WO2023147842A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Definitions

  • the present disclosure relates generally to the field of data protection and backup; and more specifically, to a data storage, a method of deduplication of the data storage, a method of storing a new file in the data storage and a method of deleting a file in the data storage.
  • data deduplication is a technique used to reduce the amount of data, which is either passed over a network or stored in a conventional data storage system.
  • the typical data deduplication technique works by identifying duplicates of data at some granularity and avoiding explicit communication or storage of such duplicates.
  • in a conventional deduplicated system, when a data file is deduplicated, the data of the file is chunked into variable-size chunks and, instead of keeping identical chunks twice, a pointer to the already-stored chunk is kept.
  • the metadata for the pointer of a chunk takes approximately 100 bytes, which means the size of the metadata is typically about 1% of the data.
  • the metadata is used to reduce the overhead of the data structure required for the pointers and allows fast access and search.
  • when the deduplication ratios are small (e.g., 5 times), the metadata is insignificant.
  • when the deduplication ratios are large (e.g., 50 times or 100 times), the metadata becomes highly significant.
  • the present disclosure provides a data storage, a method of deduplication of the data storage, a method of storing a new file in the data storage and a method of deleting a file in the data storage.
  • the present disclosure provides a solution to the existing problem of inefficiently reducing the size of the meta data.
  • An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides an improved data storage, an improved method of deduplication of the data storage, an improved method of storing a new file in the data storage and an improved method of deleting a file in the data storage.
  • the present disclosure provides a data storage including a prior file that has been subjected to a first level of deduplication using a first division of the prior file into prior chunks of a first size range.
  • the file is additionally divided into prior portions of a second size range greater than the first size range.
  • Each prior portion is stored in a prior portion file.
  • Each prior portion file is named using a prior strong hash calculated for the prior portion file as at least part of the name.
  • the data storage further comprises a metadata file associated with the prior file and includes the prior strong hashes of all the prior portions.
  • the disclosed data storage efficiently reduces the size of the meta data when high deduplication ratios are used, due to multi-level deduplication.
  • the prior file is deduplicated at the first level by dividing into the prior chunks. Additionally, the prior file is further deduplicated by dividing into the prior portions.
  • the data storage manifests a hierarchy of deduplication having large chunks deduplication at a file system level and small chunks deduplication at a lower layer. Thus, the disclosed data storage obtains a reduced amount of meta data.
  • the present disclosure provides a method of deduplication of a data storage.
  • the method comprises providing a prior file arranged to be divided into a number of chunks having a size within a first size range.
  • the method further comprises dividing the prior file into portions of a second size range greater than the first size range, each portion comprising a number of complete chunks.
  • the method further comprises for each portion calculating a prior strong hash and storing data of the portion in a prior portion file specific to the portion.
  • the method further comprises for each portion naming the prior portion file, where the name includes the prior strong hash of the respective portion.
  • the method further comprises creating a metadata file, where the metadata file includes the prior strong hashes of all the prior portions of the prior file.
  • the method further comprises deduplicating each portion based on the chunks in that portion.
  • the disclosed method efficiently deduplicates the prior file by dividing the prior file into the prior portions (i.e., the large size chunks), without any requirement to maintain a data structure for determining whether a chunk already exists.
  • the method enables checking whether a chunk already exists by using the name of the prior portion file specific to that chunk. Additionally, the method manifests simplicity. Where the deduplication ratios are high, the method yields a much smaller metadata size than conventional methods.
  • the second size is a multiple of the first size.
  • the present disclosure provides a method of storing a new file in a data storage.
  • the data storage has already stored a prior file divided into prior portion files and associated with a metadata file including the strong hashes of the prior portion files.
  • the method comprises the steps of dividing the new file into new portions of a second size range greater than the first size range.
  • the method further comprises, for each new portion, calculating a new strong hash and determining whether a file whose name includes the hash value exists. If a file whose name prefix starts with the new strong hash exists, a hard link to that file is created, with the name of the hard link including the new strong hash.
  • the method further comprises creating a metadata file for the stored file.
  • the disclosed method provides an efficient way of storing the new file in the data storage by having a reduced metadata. Additionally, the disclosed method provides an efficient deduplication of the new file.
  • the present disclosure provides a method of deleting a file in a data storage.
  • the method comprises searching for the metadata file; if there are one or more hard links associated with the file, deleting one of the hard links; and deleting the metadata of the file.
  • the disclosed method provides the simplest way of deleting the file in the data storage.
  • FIG. 1 is a block diagram that illustrates various exemplary components of a data storage, in accordance with an embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method of deduplication of a data storage, in accordance with an embodiment of the present disclosure
  • FIG. 3 is a flowchart of a method of storing a new file in a data storage, in accordance with an embodiment of the present disclosure
  • FIG. 4 is a flowchart of a method of deleting a file in a data storage, in accordance with an embodiment of the present disclosure
  • FIG. 5 illustrates an implementation scenario of deduplication of a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 6 illustrates an implementation scenario of storing a new file in a data storage, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the non-underlined number to the item.
  • the non-underlined number is used to identify a general item at which the arrow is pointing.
  • FIG. 1 is a block diagram that illustrates various exemplary components of a data storage, in accordance with an embodiment of the present disclosure.
  • a block diagram 100 of a data storage 102 that includes a memory 104, a network interface 106 and a control unit 108.
  • the memory 104 is configured to store a prior file 104A.
  • the data storage 102 may include suitable logic, circuitry, interfaces, or code that is configured to store data as data blocks having either fixed or variable sizes.
  • the data storage 102 may also be referred to as a backup storage in which data can be stored in form of data blocks for any defined duration.
  • the data storage 102 may be a non-volatile mass storage, such as a cloud-based storage.
  • the data storage 102 includes the memory 104, the control unit 108 and the network interface 106 in order to store, process and share the data to other computing components, such as a user device.
  • the data storage 102 may also be referred to as a data storage system. Examples of the data storage 102 may include, but are not limited to, a Dorado file system, an OceanProtect file system, and the like.
  • the memory 104 may include suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the control unit 108.
  • the memory 104 is configured to store the prior file 104A. Examples of implementation of the memory 104 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory.
  • the memory 104 may store an operating system or other program products (including one or more operation algorithms) to operate the data storage 102.
  • the network interface 106 may include suitable logic, circuitry, interfaces, or code that is communicatively coupled with the memory 104 and the control unit 108.
  • Examples of the network interface 106 include, but are not limited to, a data terminal, a transceiver, a facsimile machine, a virtual server, and the like.
  • the control unit 108 may include suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 104. In an example, the control unit 108 may be a general-purpose processor.
  • the control unit 108 may include, but is not limited to, a processor, a central processing unit (CPU), a digital signal processor (DSP), a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a state machine, a data processing unit, a graphics processing unit (GPU), and other processors or control circuitry.
  • the control unit 108 may refer to one or more individual processors, processing devices, a processing unit that is part of a machine, such as the data storage 102.
  • in operation, the data storage 102 includes the prior file 104A that has been subjected to a first level of deduplication using a first division of the prior file 104A into prior chunks of a first size range, where the file is additionally divided into prior portions of a second size range greater than the first size range.
  • Each prior portion is stored in a prior portion file.
  • Each prior portion file is named using a prior strong hash calculated for the prior portion file as at least part of the name.
  • the memory 104 of the data storage 102 is configured to store the prior file 104A that is arranged to be divided into the prior chunks of the first size range.
  • the prior chunks may also be referred to as small size chunks which may have either a fixed size or a variable size.
  • the prior file 104A is subjected to deduplication by division into the prior portions of the second size range.
  • the prior portions may also be referred to as large size chunks, which have a larger size compared to the prior chunks.
  • the prior portions may also have either a fixed size or a variable size.
  • the prior file 104A is divided into the prior portions independently of the division into the prior chunks.
  • each of the prior portions is stored in the prior portion file, which contains the data of the respective prior portion.
  • the prior strong hash is calculated for each of the prior portions, which is used as at least part of the name of each prior portion file.
  • each prior portion file for each of the prior portions is named using the prior strong hash calculated for each of the prior portions, as at least part of the name.
  • the data storage 102 further comprises a metadata file associated with the prior file 104A and including the prior strong hashes of all the prior portions. After storing each of the prior portions in the respective prior portion file, the metadata file is generated. The metadata file is generated in order to avoid storing the prior portion files in the data storage 102 as a flat file.
  • the metadata file includes the prior strong hashes calculated for each of the prior portions, as well as the length of the data of each prior portion (i.e., large size chunk).
  • by virtue of deduplicating the prior file 104A by dividing the prior file 104A into the prior portions of the second size range and further dividing the prior file 104A into the prior chunks of the first size range, the data storage 102 obtains a reduced size of metadata. This way, the data storage 102 provides a reduced size of the metadata even in cases of high deduplication ratios, due to multi-level deduplication.
  • FIG. 2 is a flowchart of a method of deduplication of a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements from FIG. 1.
  • a method 200 of deduplication of the data storage 102 (of FIG. 1).
  • the method 200 includes steps 202-to-214.
  • the control unit 108 of the data storage 102 is configured to execute the method 200.
  • the method 200 is used for deduplication of the data storage 102 in the following steps.
  • the method 200 comprises providing a prior file arranged to be divided into a number of chunks having a size within a first size range.
  • the prior file e.g., the prior file 104A, of FIG. 1
  • the prior file 104A is arranged to be divided into small size chunks for deduplication at a lower level.
  • the prior file 104A is stored on a data storage (e.g., the data storage 102, of FIG. 1) which supports deduplication at the lower level.
  • the deduplication of the prior file 104A at the lower level may also be referred to as a lower layer deduplication.
  • all chunks have a constant, first size.
  • Each of the small size chunks which, may be obtained through division of the prior file 104A, has the constant, first size.
  • the size of each chunk is between 2 kB and 20 kB.
  • the small size chunks may have any size between 2 kB and 20 kB.
  • the method 200 further comprises dividing the prior file 104A into portions of a second size range greater than the first size range, each portion comprising a number of complete chunks.
  • the prior file 104A is deduplicated by dividing the prior file 104A into the portions (i.e., prior portions) of the second size range.
  • the portions (i.e., prior portions) correspond to large size chunks because they have a larger size compared to the small size chunks, which may be obtained through further division of the prior file 104A.
  • the deduplication of the prior file 104A by dividing the file into the portions of the second size range may also be referred to as an upper layer deduplication.
  • Each portion (i.e., large size chunk) consists of several small size chunks.
  • all the file portions have a constant, second size.
  • Each of the portions of the prior file 104A has the constant, second size.
  • the size of each portion is between 2 MB and 6 MB.
  • the portions of the prior file 104A may have any size between 2 MB and 6 MB.
  • the size is any value between 2 MB and 6 MB and consists of, for example, 512 small chunks on average, but it always consists of complete small chunks.
  • the second size is a multiple of the first size.
  • the second size range attained by the portions is larger than the first size range attained by the small size chunks.
  • the second size range may be a multiple of the first size range. For example, if the first size is variable, the second size range consists of several such variable-size chunks.
  • the chunks and/or the portions are of variable size.
  • each of the chunks (i.e., the small size chunks) and the portions (i.e., large size chunks) may have a variable size.
  • the division into chunks and/or the division into portions is achieved by content defined chunking, CDC, algorithm.
  • the division of the prior file 104A into the chunks (i.e., the small size chunks) and/or into the portions (i.e., the large size chunks) is obtained by use of the content defined chunking (CDC) algorithm.
  • the CDC algorithm may be defined as a method to split a file (e.g., the prior file 104A) into variable size chunks, where the split points are defined by a few internal features of the file (i.e., the prior file 104A).
  • variable size chunks are more resistant to byte shifting. Therefore, division of the file (i.e., the prior file 104A) into the variable size chunks increases the probability of finding duplicate chunks within the file (i.e., the prior file 104A) or between the files.
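The disclosure does not prescribe a particular CDC algorithm, so the following is only a minimal sketch of the idea in Python. The byte-wise rolling fingerprint, the 13-bit cut mask, and the reuse of the 2 kB/20 kB bounds from the embodiment above are illustrative assumptions, not values fixed by the disclosure.

    MIN_SIZE = 2 * 1024     # lower bound, per the 2 kB-20 kB chunk range above
    MAX_SIZE = 20 * 1024    # upper bound of the same range
    CUT_MASK = 0x1FFF       # 13 bits set: expected cut every ~8 kB past MIN_SIZE

    def cdc_chunks(data: bytes):
        """Yield variable-size chunks whose cut points depend only on content."""
        start = pos = 0
        fp = 0
        while pos < len(data):
            fp = ((fp << 1) + data[pos]) & 0xFFFFFFFF  # cheap rolling fingerprint
            pos += 1
            size = pos - start
            if size >= MAX_SIZE or (size >= MIN_SIZE and (fp & CUT_MASK) == CUT_MASK):
                yield data[start:pos]       # content-defined cut point
                start, fp = pos, 0
        if start < len(data):
            yield data[start:]              # final partial chunk

Because the cut points are derived from the bytes themselves rather than from fixed offsets, inserting a few bytes near the start of a file disturbs only the neighbouring chunks, which is why duplicates are still found after byte shifting.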
  • each portion contains a consecutive set of smaller chunks.
  • each portion (i.e., the large size chunk) of the prior file 104A may contain the consecutive set of smaller chunks by virtue of having the larger size in comparison to the small size chunks.
  • the method 200 further comprises for each portion calculating a prior strong hash. Thereafter, the prior strong hash is calculated for each portion.
  • the method 200 further comprises, for each portion, storing the data of the portion (i.e., the large size chunk) in a prior portion file specific to the portion.
  • the method 200 further comprises for each portion naming the prior portion file, where the name includes the prior strong hash of the respective portion.
  • the prior portion file of the portion is named using the prior strong hash calculated for that portion, as at least part of the name.
  • name of the prior portion file includes a prefix and a suffix.
  • the prefix is the prior strong hash calculated for that portion, and the suffix is the name of the stored file (i.e., the prior file 104A).
  • the method 200 further comprises creating a metadata file, said metadata file including the prior strong hashes of all the prior portions of the prior file 104A and the length of each corresponding portion file.
  • the metadata file is created.
  • the metadata file includes all the prior strong hashes calculated for all the prior portions of the prior file 104A.
  • the metadata file includes the length of data of each prior portion file corresponding to the respective prior portion of the prior file 104A.
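As a concrete illustration only: FIG. 5 (described later) mentions 20-byte strong hashes and a 32-bit length per prior portion, so one hypothetical fixed-size record layout for the metadata file could look as follows. The disclosure does not mandate this particular encoding.

    import struct

    # Hypothetical metadata file layout: one 24-byte record per portion,
    # a 20-byte strong hash (e.g., SHA-1) followed by a 32-bit portion length.
    RECORD = struct.Struct(">20sI")

    def write_metadata(path, portions):
        """portions: iterable of (20-byte digest, length in bytes) tuples."""
        with open(path, "wb") as f:
            for digest, length in portions:
                f.write(RECORD.pack(digest, length))

    def read_metadata(path):
        with open(path, "rb") as f:
            blob = f.read()
        return [RECORD.unpack_from(blob, off)
                for off in range(0, len(blob), RECORD.size)]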
  • the method 200 further comprises deduplicating each portion based on the chunks in that portion.
  • Each portion (i.e., the large size chunk) includes a number of complete small size chunks, which may be obtained by use of the CDC algorithm.
  • the deduplication of each portion is performed based on the small size chunks comprised by the respective portion of the prior file 104A.
  • the method 200 efficiently deduplicates the prior file 104A by dividing the prior file 104A into the prior portions (i.e., the large size chunks), without any requirement to maintain a data structure for determining whether a chunk (i.e., the prior portion) already exists.
  • the method 200 enables checking whether a chunk already exists by using the name of the prior portion file specific to that chunk (i.e., the prior portion). If the data storage 102 has a cache memory, the existence of such a file may be checked somewhat faster. Additionally, the method 200 manifests simplicity. Where the deduplication ratios are high, the method 200 yields a much smaller metadata size than conventional methods.
  • steps 202-to-214 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
  • FIG. 3 is a flowchart of a method of storing a new file in a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is described in conjunction with elements from FIGs. 1 and 2.
  • a method 300 of storing a new file in the data storage 102 (of FIG. 1).
  • the method 300 includes steps 302-to-312.
  • the control unit 108 of the data storage 102 is configured to execute the method 300.
  • the method 300 is used for storing a new file in the data storage 102, having already stored therein a prior file 104A divided into prior portion files and associated with a metadata file including the strong hashes of the prior portion files.
  • the method 300 is used for storing the new file in the data storage 102, which has already stored the prior file 104A.
  • the prior file 104A is subjected to deduplication by dividing the prior file 104A into the prior portions of a second size range (i.e., a large size range) and each prior portion is stored in a prior portion file which stores the data of the respective prior portion.
  • the name of the prior portion file includes a prior strong hash calculated for the respective prior portion, as a part of the name.
  • the prior strong hashes calculated for each of the prior portions are stored in the metadata file.
  • the prior file 104A is further subjected to deduplication at the lower layer by dividing each prior portion of the prior file 104A into prior chunks of a first size range (i.e., a small size range). Similar to the prior file 104A, the new file may be subjected to deduplication at the lower level, by dividing the new file into small chunks of the first size range.
  • the new file is processed in the following steps of the method 300, which are described in the following way.
  • the method 300 comprises dividing the new file into new portions of a second size range greater than the first size range.
  • the new file is divided into the new portions of the second size range (e.g., 2 MB to 6 MB), which is greater than the first size range (e.g., 2 kB to 20 kB).
  • the method 300 comprises for each new portion calculating a new strong hash. After division of the new file into the new portions, the new strong hash is calculated for each new portion.
  • the method 300 further comprises, for each new portion, determining whether a file whose name includes the hash value exists. Thereafter, it is determined, for each new portion, whether a new portion file whose name includes the calculated new strong hash already exists in the data storage 102.
  • the method 300 further comprises, for each new portion, if a file whose name prefix starts with the new strong hash exists, creating a hard link to the file whose name starts with this new strong hash, with the name of the hard link including the new strong hash. If there exists a portion file whose name starts with the new strong hash, there is no need to keep the new portion, because that file already contains the same data as the chunk (i.e., the new portion) that is going to be stored. Instead, a hard link to the portion file whose name starts with the new strong hash is created. The name of the created hard link may have the new strong hash as a prefix, with a suffix which is usually the name of the stored file, like an index.
  • the method 300 further comprises for each new portion if no file whose name starts with the new strong hash is found, storing the new portion file in the data storage 102, and naming the new portion file using the strong hash of the new portion file as at least part of the name.
  • Each new portion file is named using the new strong hash calculated for each new portion, as at least part of the name. For example, the new strong hash calculated for each new portion is used as the prefix and name of the stored file (i.e., the new file) is used as the suffix, while naming each new portion file.
  • the method 300 further comprises creating a metadata file for the stored file. The metadata file created for the stored file (i.e., the new file) consists of a list of the new strong hashes of each new portion file, as well as the length of the data of each new portion (i.e., large size chunk) of the new file.
  • An implementation scenario of the method 300 is described in more detail, for example, in FIG. 6.
  • the method 300 achieves all the advantages and technical effects of the method 200 (of FIG. 2).
  • the method 300 provides an efficient way of storing the new file in the data storage 102 in addition to an efficient deduplication of the new file.
  • steps 302-to-312 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
  • FIG. 4 is a flowchart of a method of deleting a file in a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3.
  • a method 400 of deleting a file in the data storage 102 (of FIG. 1).
  • the method 400 includes steps 402-to-406.
  • the control unit 108 of the data storage 102 is configured to execute the method 400.
  • the method 400 is used to delete a file in the data storage 102 in the following steps.
  • the method 400 comprises searching for the metadata file. For each file, such as the prior file 104A (of FIG. 1) or the new file (of FIG. 3), there is a separate metadata file with a name identical to the file name. Therefore, the metadata file associated with the file to be deleted is searched out for deletion.
  • the method 400 further comprises, if there are one or more hard links associated with the file, deleting one of the hard links. If the metadata file is found, the hard link associated with each portion file (i.e., the prior portion file or the new portion file) of the file is deleted. For each file (more specifically, for each portion file), there is a specific hard link whose name includes the strong hash as a prefix and a suffix that is unique to that hard link. Moreover, the usage of hard links avoids garbage collection: when all the hard links to a portion file have been deleted, the portion file also gets deleted. Alternatively stated, the data of a portion file is deleted when all its hard links are deleted.
  • the method 400 comprises deleting the metadata of the file. After deleting the hard links of the file, the metadata file is also deleted.
  • the method 400 provides the simplest way of deleting the file in the data storage 102.
  • the steps 402-to-406 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
  • the present disclosure provides a computer program product comprising computer readable code means which, when executed in the control unit 108 of the data storage 102 (of FIG. 1), cause the control unit 108 to perform all of the methods (i.e., the method 200 of FIG. 2, the method 300 of FIG. 3 and the method 400 of FIG. 4).
  • the present disclosure provides a computer program product comprising a non-transitory memory (e.g., the memory 104) having stored there on the computer readable code means.
  • the control unit 108 of the data storage 102 comprising a program memory holding the aforementioned computer program product.
  • FIG. 5 illustrates an implementation scenario of deduplication of a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4.
  • an implementation scenario 500 of deduplication of the data storage 102 (of FIG. 1).
  • the implementation scenario 500 includes the prior file 104A (of FIG. 1).
  • a metadata file 504 is further shown.
  • the prior file 104A (also represented as File 1, F1) is subjected to deduplication by dividing the prior file 104A (i.e., F1) into a plurality of prior portions 502 of a second size range between 2 MB and 6 MB.
  • the plurality of prior portions 502 includes five prior portions, such as the first prior portion 502A, the second prior portion 502B, the third prior portion 502C, the fourth prior portion 502D and the fifth prior portion 502E.
  • Each prior portion of the plurality of prior portions 502 is of 4 MB (i.e., a fixed size) in the implementation scenario 500. In practice, however, each prior portion of the plurality of prior portions 502 may also have a variable size.
  • a prior strong hash is calculated for each prior portion; for example, the first prior portion 502A has a first prior strong hash, H1, the second prior portion 502B has a second prior strong hash, H2, the third prior portion 502C has a third prior strong hash, H3, the fourth prior portion 502D has a fourth prior strong hash, H4, and the fifth prior portion 502E has a fifth prior strong hash, H5.
  • Each prior strong hash is 20 bytes; hence, the five strong hashes amount to 100 bytes of data. Additionally, the length (e.g., 32 bits) of each prior portion (i.e., 502A-502E) is required to be stored.
  • each prior portion of the plurality of prior portions 502 is stored in a separate prior portion file and name of the prior portion file is “H F”, where “H” is the strong hash of a 4MB chunk and “F” is the original file name.
  • the first prior portion 502A is stored in a first prior portion file, and the name of the first prior portion file is H1 F1, where H1 is the first prior strong hash associated with the first prior portion 502A and F1 is the original file name.
  • the second prior portion 502B, the third prior portion 502C, the fourth prior portion 502D and the fifth prior portion 502E are stored in a second prior portion file named H2 F1, a third prior portion file named H3 F1, a fourth prior portion file named H4 F1, and a fifth prior portion file named H5 F1, respectively.
  • the metadata file 504 is created for the prior file 104A and includes all five prior strong hashes of the plurality of prior portions 502. Therefore, the metadata file 504 includes H1, H2, H3, H4 and H5.
  • the metadata file 504 may also be referred to as a header file.
  • If a prior portion file whose name prefix starts with the prior strong hash already exists in the data storage 102, either a hard link to the prior portion file whose name starts with the prior strong hash is created, or a reference count of the prior portion is increased in the internal file system structure.
  • the names of the prior portion files may also be used for deduplication by keeping a key-value store of the strong hashes in memory (e.g., the memory 104 of the data storage 102), resulting in improved (e.g., higher) performance.
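A minimal sketch of that idea, assuming portion files are named "<hex hash> <file name>" as in FIG. 5 and a 20-byte (40 hex character) strong hash: the hash prefixes of all existing portion files are loaded into an in-memory set once, so the existence test becomes a constant-time lookup instead of a directory scan.

    import os

    def load_known_hashes(store_dir: str, hash_hex_len: int = 40) -> set:
        """Build the in-memory key-value view from the portion file names."""
        return {name[:hash_hex_len] for name in os.listdir(store_dir)}

    def portion_exists(known_hashes: set, hash_hex: str) -> bool:
        return hash_hex in known_hashes  # no filesystem access needed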
  • the metadata file 504 and each of the prior portion files associated with each prior portion of the plurality of prior portions 502 may also be referred to as file system (e.g., S3FS file system) layer.
  • each prior portion file for example, the first prior portion file named as H1 F1 is divided into variable size chunks at a lower level and such division may also be referred to as a variable size chunking or variable length deduplication at granularity of an average 16KB.
  • the division of each prior portion into variable size chunks may also be referred to as a pool layer.
  • FIG. 6 illustrates an implementation scenario of storing a new file in a data storage, in accordance with an embodiment of the present disclosure.
  • FIG. 6 is described in conjunction with elements from FIGs. 1, 2, 3, 4, and 5.
  • FIG. 6 there is shown an implementation scenario 600 of storing a new file in the data storage 102 (of FIG. 1).
  • the implementation scenario 600 includes two files, such as a first file 602 and a second file 604.
  • the first file 602 (also represented as File 1) corresponds to the prior file 104A (of FIG. 1).
  • the second file 604 (also represented as File 2) corresponds to the new file (of FIG. 3).
  • Each of the first file 602 (i.e., File 1) and the second file 604 (i.e., File 2) is deduplicated by division into a first plurality of prior portions 606 and a second plurality of prior portions 608, respectively.
  • the first metadata file 610 stores all the strong hashes (i.e., H1, H2, H3, H4, and H5) calculated for each prior portion of the first plurality of prior portions 606.
  • the second metadata file 612 stores all the strong hashes (i.e., H1, H6, H3, H4, and H5) calculated for each prior portion of the second plurality of prior portions 608.
  • each prior portion of the first plurality of prior portions 606 and the second plurality of prior portions 608 is of 4 MB, and many of them are unchanged.
  • the files (i.e., the first file 602 and the second file 604) are large and very similar; therefore, the metadata is also deduplicated.
  • a command such as “find” (e.g., “ls H1 *”) may be performed on a local S3 file system.
  • each prior portion of the first file 602 is checked against each prior portion of the second file 604. It is clearly visible from FIG. 6 that many of the strong hashes stored in the first metadata file 610 and the second metadata file 612 are unchanged. Only the second metadata file 612 stores a different strong hash (i.e., H6). Alternatively stated, the second file 604 has one different prior portion of 4 MB.
  • for the strong hashes (i.e., H1, H3, H4, and H5) which are the same in the first metadata file 610 and the second metadata file 612, a hard link is created and stored for each of them in the second metadata file 612.
  • hard links named H1 F2, H3 F2, H4 F2, and H5 F2 are created for H1, H3, H4, and H5, respectively.
  • alternatively, a reference counter is increased at the lower file system (FS) level for the same strong hashes.
  • the hard links can be replaced by counters, garbage collection (GC), or a key-value database (KVDB) for better performance.
  • the different strong hash (i.e., H6) is stored as it is in the second metadata file 612.
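To make the comparison concrete, a small sketch using the hash lists of FIG. 6: hashes present in both metadata files become hard-link candidates, and only the differing portion (H6) carries new data.

    f1_hashes = ["H1", "H2", "H3", "H4", "H5"]  # from the first metadata file 610
    f2_hashes = ["H1", "H6", "H3", "H4", "H5"]  # from the second metadata file 612

    known = set(f1_hashes)
    link_candidates = [h for h in f2_hashes if h in known]   # hard links suffice
    must_store = [h for h in f2_hashes if h not in known]    # new data to keep

    print(link_candidates)  # ['H1', 'H3', 'H4', 'H5']
    print(must_store)       # ['H6']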
  • For example, a virtual machine of size 40 gigabytes (GB) is considered, whose backup is to be maintained, with a prior portion size of 4 MB. It is assumed that there are a hundred generations of backup.
  • First, a conventional backup of the virtual machine disk (VMDK) to a conventional file system, such as OceanProtect, is considered.
  • the deduplication ratio of the data is very high, as all blocks in the 40 GB are identical except one block per generation; the overall deduplication ratio is therefore close to 100, which means 100 copies of almost the same data are stored.
  • the total data stored in the system before deduplication is 4 TB, and the metadata is 1% of 4 TB, which is 40 GB. The data will get deduplicated by almost 100x, so it will also take 40 GB; the total data stored will therefore be 80 GB, which is a factor of only 50, because the metadata is now a significant portion of the data after deduplication.
  • the data storage 102 with multi-level deduplication is considered for said virtual machine with aforementioned assumptions.
  • a metadata file and one block of 4 MB are generated in each generation of data backup. If the size of the metadata file is 80 KB for a single generation of data backup, then, for 100 generations of backup, the total size of the metadata files is 8 MB.
  • each metadata file differs from another metadata file by 20 bytes. Therefore, the total amount of data kept in the OceanProtect file system is 40 GB.
  • the total amount of data kept in the data storage 102 with multi-level deduplication is approximately equal to 40 GB (the original file divided into 4 MB files) + 100 × 4 MB (the changes between the files) + 8 MB (the metadata files), which is overall less than 40.5 GB.
  • the metadata size will be ~0.01% of the data, which means the deduplication ratio will remain close to 100 after the data is deduplicated.
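The arithmetic of this example can be checked directly; the numbers below are taken from the passage above (a 40 GB image, 100 backup generations, one changed 4 MB block per generation, and an assumed 80 KB metadata file per generation).

    GB, MB, KB = 1024**3, 1024**2, 1024

    # Conventional system: ~100x deduplication leaves one 40 GB copy of the
    # data, but the pointer metadata stays at 1% of the 4 TB logical data.
    conventional = 40 * GB + 40 * GB

    # Multi-level system: the 40 GB of base portions, plus one changed 4 MB
    # portion per generation, plus 100 metadata files of ~80 KB each.
    multi_level = 40 * GB + 100 * 4 * MB + 100 * 80 * KB

    print(f"conventional: {conventional / GB:.1f} GB")  # 80.0 GB, ratio ~50
    print(f"multi-level : {multi_level / GB:.2f} GB")   # ~40.40 GB, ratio ~100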

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A data storage including a prior file that has been subjected to a first level of deduplication using a first division of the prior file into prior chunks of a first size range. The prior file is additionally divided into prior portions of a second size range greater than the first size range. Each prior portion is stored in a prior portion file. Each prior portion file is named using a prior strong hash calculated for the prior portion file as at least part of the name. The data storage further includes a metadata file associated with the prior file that includes the prior strong hashes of all the prior portions. The disclosed data storage efficiently reduces the size of the meta data due to multi-level deduplication.

Description

DATA STORAGE AND METHODS OF DEDUPLICATION OF DATA STORAGE, STORING NEW FILE AND DELETING FILE
TECHNICAL FIELD
The present disclosure relates generally to the field of data protection and backup; and more specifically, to a data storage, a method of deduplication of the data storage, a method of storing a new file in the data storage and a method of deleting a file in the data storage.
BACKGROUND
Generally, data deduplication is a technique used to reduce the amount of data that is either passed over a network or stored in a conventional data storage system. The typical data deduplication technique works by identifying duplicates of data at some granularity and avoiding explicit communication or storage of such duplicates. In a conventional deduplicated system, when a data file is deduplicated, the data of the data file is chunked into variable-size chunks and, instead of keeping identical chunks twice, a pointer to the already-stored chunk is kept. In the conventional deduplicated system, the metadata for the pointer of a chunk takes approximately 100 bytes, which means the size of the metadata is typically about 1% of the data. The metadata is used to reduce the overhead of the data structure required for the pointers and allows fast access and search. When the deduplication ratios are small (e.g., 5 times), the metadata is insignificant. In another case (e.g., a data backup system), when the deduplication ratios are large (e.g., 50 times or 100 times), the metadata becomes highly significant.
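To see why the metadata becomes significant, a small back-of-the-envelope calculation can be made, assuming (as stated above) that the pointer metadata amounts to about 1% of the logical data while the unique data shrinks with the deduplication ratio:

    def stored_parts_gb(logical_gb: float, dedup_ratio: float,
                        metadata_fraction: float = 0.01):
        """Return (unique data, metadata) in GB after deduplication."""
        return logical_gb / dedup_ratio, logical_gb * metadata_fraction

    for ratio in (5, 50, 100):
        data_gb, meta_gb = stored_parts_gb(4000, ratio)
        share = meta_gb / (data_gb + meta_gb)
        print(f"ratio {ratio:>3}x: {data_gb:6.0f} GB data + {meta_gb:.0f} GB "
              f"metadata ({share:.0%} of stored bytes)")

At a ratio of 5 the metadata is roughly 5% of what is stored; at a ratio of 100 it is half of it, which is the problem the disclosure addresses.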
Currently, certain methods have been proposed to reduce the size of metadata, such as the Merkle tree approach. In the Merkle tree approach, the data is divided in a hierarchical order, and a chunk at each hierarchical level is addressed by using a hash. Thereafter, several strong hashes are grouped as an object; thus, if large chunks of an object are found to already exist, only a pointer to the hash of the group is kept, and this way the overall metadata becomes small. However, a limitation is associated with the Merkle tree approach: it is not suitable for a primary storage, as each access to the data requires traversing through many elements. Thus, there exists a technical problem of inefficiently reducing the size of the metadata.
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with the conventional methods of reducing the size of the meta data.
SUMMARY
The present disclosure provides a data storage, a method of deduplication of the data storage, a method of storing a new file in the data storage and a method of deleting a file in the data storage. The present disclosure provides a solution to the existing problem of inefficiently reducing the size of the meta data. An aim of the present disclosure is to provide a solution that overcomes at least partially the problems encountered in prior art, and provides an improved data storage, an improved method of deduplication of the data storage, an improved method of storing a new file in the data storage and an improved method of deleting a file in the data storage.
The object of the present disclosure is achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In an aspect, the present disclosure provides a data storage including a prior file that has been subjected to a first level of deduplication using a first division of the prior file into prior chunks of a first size range. The file is additionally divided into prior portions of a second size range greater than the first size range. Each prior portion is stored in a prior portion file. Each prior portion file is named using a prior strong hash calculated for the prior portion file as at least part of the name. The data storage further comprises a metadata file associated with the prior file and includes the prior strong hashes of all the prior portions.
The disclosed data storage efficiently reduces the size of the meta data when high deduplication ratios are used, due to multi-level deduplication. The prior file is deduplicated at the first level by dividing into the prior chunks. Additionally, the prior file is further deduplicated by dividing into the prior portions. Alternatively stated, the data storage manifests a hierarchy of deduplication having large chunks deduplication at a file system level and small chunks deduplication at a lower layer. Thus, the disclosed data storage obtains a reduced amount of meta data.
In another aspect, the present disclosure provides a method of deduplication of a data storage. The method comprises providing a prior file arranged to be divided into a number of chunks having a size within a first size range. The method further comprises dividing the prior file into portions of a second size range greater than the first size range, each portion comprising a number of complete chunks. The method further comprises for each portion calculating a prior strong hash and storing data of the portion in a prior portion file specific to the portion. The method further comprises for each portion naming the prior portion file, where the name includes the prior strong hash of the respective portion. The method further comprises creating a metadata file, where the metadata file includes the prior strong hashes of all the prior portions of the prior file. The method further comprises deduplicating each portion based on the chunks in that portion.
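A minimal sketch of the upper-layer part of this method in Python follows. Fixed 4 MB portions, SHA-1 as the strong hash, portion files named "<hash> <file name>" (the naming scheme of FIG. 5), and a plain-text metadata format with a ".meta" suffix are all illustrative assumptions; variable-size (CDC) portions and other encodings are equally possible.

    import hashlib
    import os

    PORTION_SIZE = 4 * 1024 * 1024  # illustrative fixed portion size (4 MB)

    def deduplicate_upper_layer(path: str, store_dir: str) -> None:
        file_name = os.path.basename(path)
        records = []
        with open(path, "rb") as f:
            while portion := f.read(PORTION_SIZE):
                digest = hashlib.sha1(portion).hexdigest()  # prior strong hash
                portion_file = os.path.join(store_dir, f"{digest} {file_name}")
                with open(portion_file, "wb") as out:  # hash-named portion file
                    out.write(portion)
                records.append((digest, len(portion)))
        # Metadata file: the strong hashes and lengths of all portions.
        with open(os.path.join(store_dir, file_name + ".meta"), "w") as meta:
            for digest, length in records:
                meta.write(f"{digest} {length}\n")

Each portion file written here would then be handed to the lower-layer deduplication, which splits it further into the small chunks.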
The disclosed method efficiently deduplicates the prior file by dividing the prior file into the prior portions (i.e., the large size chunks), without any requirement to maintain a data structure for determining whether a chunk already exists. The method enables checking whether a chunk already exists by using the name of the prior portion file specific to that chunk. Additionally, the method manifests simplicity. Where the deduplication ratios are high, the method yields a much smaller metadata size than conventional methods.
In a further implementation form, the second size is a multiple of the first size. By virtue of having the second size as the multiple of the first size, large chunks deduplication of the prior file can be obtained.
In yet another aspect, the present disclosure provides a method of storing a new file in a data storage. The data storage has already stored a prior file divided into prior portion files and associated with a metadata file including the strong hashes of the prior portion files. The method comprises the steps of dividing the new file into new portions of a second size range greater than the first size range. The method further comprises, for each new portion, calculating a new strong hash and determining whether a file whose name includes the hash value exists. If a file whose name prefix starts with the new strong hash exists, a hard link to that file is created, with the name of the hard link including the new strong hash. If no file whose name starts with the new strong hash is found, the new portion file is stored in the data storage and named using the strong hash of the new portion file as at least part of the name. The method further comprises creating a metadata file for the stored file.
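Under the same illustrative assumptions as before (fixed 4 MB portions, SHA-1, "<hash> <file name>" naming, a ".meta" metadata file), a minimal sketch of storing a new file: a portion whose hash already appears as a file name prefix is recorded with a hard link only, otherwise its data is written.

    import glob
    import hashlib
    import os

    PORTION_SIZE = 4 * 1024 * 1024

    def store_new_file(path: str, store_dir: str) -> None:
        file_name = os.path.basename(path)
        records = []
        with open(path, "rb") as f:
            while portion := f.read(PORTION_SIZE):
                digest = hashlib.sha1(portion).hexdigest()  # new strong hash
                target = os.path.join(store_dir, f"{digest} {file_name}")
                existing = glob.glob(os.path.join(store_dir, digest + " *"))
                if os.path.exists(target):
                    pass  # this portion of this file is already stored
                elif existing:
                    os.link(existing[0], target)  # duplicate data: hard link only
                else:
                    with open(target, "wb") as out:
                        out.write(portion)  # first occurrence: store the data
                records.append((digest, len(portion)))
        with open(os.path.join(store_dir, file_name + ".meta"), "w") as meta:
            for digest, length in records:
                meta.write(f"{digest} {length}\n")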
The disclosed method provides an efficient way of storing the new file in the data storage by having a reduced metadata. Additionally, the disclosed method provides an efficient deduplication of the new file.
In yet another aspect, the present disclosure provides a method of deleting a file in a data storage. The method comprises searching for the metadata file; if there are one or more hard links associated with the file, deleting one of the hard links; and deleting the metadata of the file.
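Continuing the same naming assumptions, a minimal sketch of deletion: the hard links carrying the file's name are removed, then the file's metadata file. Because the data of a portion file disappears once its last hard link is gone, no separate garbage collection pass is needed.

    import glob
    import os

    def delete_file(file_name: str, store_dir: str) -> None:
        for link in glob.glob(os.path.join(store_dir, "* " + file_name)):
            os.unlink(link)  # remove this file's hard link to the portion data
        meta_path = os.path.join(store_dir, file_name + ".meta")
        if os.path.exists(meta_path):
            os.unlink(meta_path)  # remove the metadata file itself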
The disclosed method provides the simplest way of deleting the file in the data storage.
It is to be appreciated that all the aforementioned implementation forms can be combined.
It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.

BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1 is a block diagram that illustrates various exemplary components of a data storage, in accordance with an embodiment of the present disclosure;
FIG. 2 is a flowchart of a method of deduplication of a data storage, in accordance with an embodiment of the present disclosure;
FIG. 3 is a flowchart of a method of storing a new file in a data storage, in accordance with an embodiment of the present disclosure;
FIG. 4 is a flowchart of a method of deleting a file in a data storage, in accordance with an embodiment of the present disclosure;
FIG. 5 illustrates an implementation scenario of deduplication of a data storage, in accordance with an embodiment of the present disclosure; and
FIG. 6 illustrates an implementation scenario of storing a new file in a data storage, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the non-underlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.

DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
FIG. 1 is a block diagram that illustrates various exemplary components of a data storage, in accordance with an embodiment of the present disclosure. With reference to FIG. 1, there is shown a block diagram 100 of a data storage 102 that includes a memory 104, a network interface 106 and a control unit 108. The memory 104 is configured to store a prior file 104A.
The data storage 102 may include suitable logic, circuitry, interfaces, or code that is configured to store data as data blocks having either fixed or variable sizes. The data storage 102 may also be referred to as a backup storage in which data can be stored in form of data blocks for any defined duration. For example, the data storage 102 may be a non-volatile mass storage, such as a cloud-based storage. The data storage 102 includes the memory 104, the control unit 108 and the network interface 106 in order to store, process and share the data to other computing components, such as a user device. The data storage 102 may also be referred to as a data storage system. Examples of the data storage 102 may include, but are not limited to, a Dorado file system, an OceanProtect file system, and the like.
The memory 104 may include suitable logic, circuitry, interfaces, or code that is configured to store data and the instructions executable by the control unit 108. The memory 104 is configured to store the prior file 104A. Examples of implementation of the memory 104 may include, but are not limited to, an Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), or CPU cache memory. The memory 104 may store an operating system or other program products (including one or more operation algorithms) to operate the data storage 102.
The network interface 106 may include suitable logic, circuitry, interfaces, or code that is communicatively coupled with the memory 104 and the control unit 108. Examples of the network interface 106 include, but are not limited to, a data terminal, a transceiver, a facsimile machine, a virtual server, and the like. The control unit 108 may include suitable logic, circuitry, interfaces, or code that is configured to execute the instructions stored in the memory 104. In an example, the control unit 108 may be a general-purpose processor. Other examples of the control unit 108 may include, but are not limited to, a processor, a central processing unit (CPU), a digital signal processor (DSP), a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a state machine, a data processing unit, a graphics processing unit (GPU), and other processors or control circuitry. Moreover, the control unit 108 may refer to one or more individual processors, processing devices, or a processing unit that is part of a machine, such as the data storage 102.
In operation, the data storage 102 includes the prior file 104A that has been subjected to a first level of deduplication using a first division of the prior file 104A into prior chunks of a first size range, where the file is additionally divided into prior portions of a second size range greater than the first size range. Each prior portion is stored in a prior portion file. Each prior portion file is named using a prior strong hash calculated for the prior portion file as at least part of the name. The memory 104 of the data storage 102 is configured to store the prior file 104A, which is arranged to be divided into the prior chunks of the first size range. The prior chunks may also be referred to as small size chunks, which may have either a fixed size or a variable size. The prior file 104A is subjected to deduplication by division into the prior portions of the second size range. The prior portions may also be referred to as large size chunks, which have a larger size compared to the prior chunks. The prior portions may also have either a fixed size or a variable size. The prior file 104A is divided into the prior portions independently of the division into the prior chunks. Thereafter, each of the prior portions is stored in the prior portion file, which contains the data of the respective prior portion. Furthermore, the prior strong hash is calculated for each of the prior portions and is used as at least part of the name of each prior portion file. Alternatively stated, each prior portion file is named using the prior strong hash calculated for the respective prior portion, as at least part of the name.
The data storage 102 further comprises a metadata file associated with the prior file 104A and including the prior strong hashes of all the prior portions. After storing each of the prior portions in the respective prior portion file, the metadata file is generated. The metadata file is generated in order to avoid storing the prior portion files in the data storage 102 as a flat file. The metadata file includes the prior strong hashes calculated for each of the prior portions, as well as the length of the data of each prior portion (i.e., large size chunk).
By virtue of deduplicating the prior file 104A by dividing the prior file 104A into the prior portions of the second size range and further dividing the prior file 104A into the prior chunks of the first size range, the data storage 102 obtains a reduced size of metadata. This way, the data storage 102 provides a reduced size of the metadata even in cases of high deduplication ratios, due to the multi-level deduplication.
FIG. 2 is a flowchart of a method of deduplication of a data storage, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements from FIG. 1. With reference to FIG. 2, there is shown a method 200 of deduplication of the data storage 102 (of FIG. 1). The method 200 includes steps 202 to 214. The control unit 108 of the data storage 102 is configured to execute the method 200.
The method 200 is used for deduplication of the data storage 102 in the following steps.
At step 202, the method 200 comprises providing a prior file arranged to be divided into a number of chunks having a size within a first size range. The prior file (e.g., the prior file 104A of FIG. 1) is arranged to be divided into small size chunks for deduplication at a lower level. Alternatively stated, the prior file 104A is stored on a data storage (e.g., the data storage 102 of FIG. 1) that supports deduplication at the lower level. The deduplication of the prior file 104A at the lower level may also be referred to as a lower layer deduplication.
In accordance with an embodiment, all chunks have a constant, first size. Each of the small size chunks, which may be obtained through division of the prior file 104A, has the constant, first size.
In accordance with an embodiment, the size of each chunk is between 2 kB and 20 kB. For example, each small size chunk may have any size between 2 kB and 20 kB.
At step 204, the method 200 further comprises dividing the prior file 104A into portions of a second size range greater than the first size range, each portion comprising a number of complete chunks. The prior file 104A is deduplicated by dividing the prior file 104A into the portions (i.e., prior portions) of the second size range. The portions (i.e., prior portions) correspond to large size chunks because they are larger than the small size chunks, which may be obtained through further division of the prior file 104A. The deduplication of the prior file 104A by dividing the file into the portions of the second size range may also be referred to as an upper layer deduplication. Each portion (i.e., each large size chunk) consists of several small size chunks.
In accordance with an embodiment, all the file portions have a constant, second size. Each of the portions of the prior file 104A has the constant, second size.
In accordance with an embodiment, the size of each portion is between 2 MB and 6 MB. For example, the portions of the prior file 104A may have any size between 2 MB and 6 MB. Each portion has some size between 2 MB and 6 MB and consists of, on average, for example, 512 small chunks, but it always consists of complete small chunks.
In accordance with an embodiment, the second size is a multiple of the first size. The second size range attained by the portions is larger than the first size range attained by the small size chunks. In an implementation, the second size range may be a multiple of the first size range. For example, if the chunks are of variable size, each portion of the second size range consists of several such variable size chunks.
In accordance with an embodiment, the chunks and/or the portions are of variable size. In an implementation, each of the chunks (i.e., the small size chunks) and the portions (i.e., large size chunks) may have a variable size.
In accordance with an embodiment, the division into chunks and/or the division into portions is achieved by a content defined chunking, CDC, algorithm. The division of the prior file 104A into the chunks (i.e., the small size chunks) and/or into the portions (i.e., the large size chunks) is obtained by use of the content defined chunking (CDC) algorithm. Generally, the CDC algorithm may be defined as a method to split a file (e.g., the prior file 104A) into variable size chunks, where the split points are defined by internal features of the file (i.e., the prior file 104A). Unlike fixed size chunks, variable size chunks are more resistant to byte shifting. Therefore, division of the file (i.e., the prior file 104A) into variable size chunks increases the probability of finding duplicate chunks within the file (i.e., the prior file 104A) or between files.
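The following Python sketch is illustrative only and does not form part of the claimed method; it shows one possible CDC routine based on a rolling Gear-style hash, where the random byte table, the 2 kB/20 kB bounds and the mask value are assumptions chosen for the example.

    import random

    MIN_CHUNK, MAX_CHUNK = 2 * 1024, 20 * 1024   # size bounds from the embodiment
    MASK = 0x1FFF                                # cut point on average every ~8 kB

    random.seed(0)                               # fixed table so cut points are stable
    GEAR = [random.getrandbits(64) for _ in range(256)]

    def cdc_chunks(data: bytes) -> list:
        """Split data into variable size chunks at content defined cut points."""
        chunks, start, h = [], 0, 0
        for i in range(len(data)):
            h = ((h << 1) + GEAR[data[i]]) & 0xFFFFFFFFFFFFFFFF  # rolling hash
            length = i - start + 1
            # Split on content (hash value) or when the maximum size is reached,
            # so chunk boundaries survive byte insertions and deletions.
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])          # last chunk may be shorter
        return chunks

Because the cut decision depends only on the bytes themselves, inserting a byte early in the file shifts at most one or two chunk boundaries rather than all of them.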
In accordance with an embodiment, each portion contains a consecutive set of smaller chunks. In an implementation, each portion (i.e., each large size chunk) of the prior file 104A may contain a consecutive set of smaller chunks, since each portion is larger than the small size chunks.
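As a non-limiting sketch, grouping consecutive complete small chunks into portions within the 2 MB to 6 MB window could be done as follows; the window bounds are taken from the embodiments above, while everything else is an assumption of the example.

    def group_into_portions(chunks, min_size=2 * 2**20, max_size=6 * 2**20):
        """Group consecutive complete small chunks into 2-6 MB portions (step 204)."""
        portions, current, size = [], [], 0
        for chunk in chunks:
            if size >= min_size and size + len(chunk) > max_size:
                portions.append(b"".join(current))   # portion ends on a chunk boundary
                current, size = [], 0
            current.append(chunk)
            size += len(chunk)
        if current:
            portions.append(b"".join(current))       # final portion may be smaller
        return portions

Note that a portion is always closed on a chunk boundary, so every portion consists of complete small chunks, as required by step 204.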
At step 206, the method 200 further comprises, for each portion, calculating a prior strong hash. Accordingly, a prior strong hash is calculated for each portion.
At step 208, the method 200 further comprises, for each portion, storing data of the portion in a prior portion file specific to the portion. The data of the portion (i.e., the large size chunk) is stored in the prior portion file which is specific to that portion.
At step 210, the method 200 further comprises, for each portion, naming the prior portion file, where the name includes the prior strong hash of the respective portion. The prior portion file of the portion is named using the prior strong hash calculated for that portion as at least part of the name. For example, the name of the prior portion file includes a prefix and a suffix, where the prefix is the prior strong hash calculated for that portion and the suffix is the name of the stored file (i.e., the prior file 104A).
At step 212, the method 200 further comprises creating a metadata file, said metadata file including the prior strong hashes of all the prior portions of the prior file 104A and the length of each corresponding portion file. After the data of each portion has been stored in its prior portion file and the prior portion file has been named, the metadata file is created. The metadata file includes all the prior strong hashes calculated for all the prior portions of the prior file 104A. In addition to the prior strong hashes, the metadata file includes the length of each prior portion file corresponding to the respective prior portion of the prior file 104A. An implementation scenario of the method 200 is described in more detail, for example, in FIG. 5.
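A minimal Python sketch of steps 206 to 212 follows; the choice of SHA-1 as the strong hash, the underscore separator in the file names and the JSON metadata layout are assumptions made purely for illustration.

    import hashlib
    import json
    import os

    def store_prior_file(file_name, portions, store_dir):
        """Steps 206-212: hash each portion, store it under a hash-based name,
        then write a metadata file listing every hash and portion length."""
        entries = []
        for portion in portions:
            strong_hash = hashlib.sha1(portion).hexdigest()  # step 206 (assumed hash)
            path = os.path.join(store_dir, strong_hash + "_" + file_name)  # step 210
            with open(path, "wb") as f:                      # step 208
                f.write(portion)
            entries.append({"hash": strong_hash, "length": len(portion)})
        with open(os.path.join(store_dir, file_name + ".meta"), "w") as f:
            json.dump(entries, f)                            # step 212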
At step 214, the method 200 further comprises deduplicating each portion based on the chunks in that portion. Each portion (i.e., each large size chunk) includes a number of complete small size chunks, which may be obtained by use of the CDC algorithm. Thereafter, the deduplication of each portion (i.e., each large size chunk) is performed based on the small size chunks comprised by the respective portion of the prior file 104A.
Thus, the method 200 efficiently deduplicates the prior file 104A by dividing the prior file 104A into the prior portions (i.e., the large size chunks), without any requirement to maintain a data structure for determining whether a chunk (i.e., a prior portion) already exists. The method 200 enables checking whether a chunk already exists by using the name of the prior portion file specific to that chunk (i.e., that prior portion). If the data storage 102 has a cache memory, then the existence of such a file may be checked somewhat faster. Additionally, the method 200 manifests simplicity. In a case where the deduplication ratios are high, the use of the method 200 provides a much smaller size of the metadata in contrast to conventional methods.
The steps 202 to 214 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
FIG. 3 is a flowchart of a method of storing a new file in a data storage, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements from FIGs. 1 and 2. With reference to FIG. 3, there is shown a method 300 of storing a new file in the data storage 102 (of FIG. 1). The method 300 includes steps 302 to 312. The control unit 108 of the data storage 102 is configured to execute the method 300.
The method 300 is used for storing a new file in the data storage 102, having already stored therein a prior file 104A divided into prior portion files and associated with a metadata file including the strong hashes of the prior portion files. That is, the method 300 is used for storing the new file in the data storage 102 in which the prior file 104A is already stored. The prior file 104A is subjected to deduplication by dividing the prior file 104A into the prior portions of a second size range (i.e., a large size range), and each prior portion is stored in a prior portion file, which stores the data of the respective prior portion. The name of the prior portion file includes a prior strong hash calculated for the respective prior portion as a part of the name. Also, the prior strong hashes calculated for each of the prior portions are stored in the metadata file. The prior file 104A is further subjected to deduplication at the lower layer by dividing each prior portion of the prior file 104A into prior chunks of a first size range (i.e., a small size range). Similar to the prior file 104A, the new file may be subjected to deduplication at the lower level by dividing the new file into small chunks of the first size range. For deduplication of the new file, the new file is processed by the following steps of the method 300.
At step 302, the method 300 comprises dividing the new file into new portions of a second size range greater than the first size range. For deduplication, the new file is divided into the new portions of the second size range. The second size range (e.g., 2 MB to 6 MB) is larger than the first size range (e.g., 2 kB to 20 kB).
At step 304, the method 300 comprises, for each new portion, calculating a new strong hash. After division of the new file into the new portions, a new strong hash is calculated for each new portion.
At step 306, the method 300 further comprises, for each new portion, determining if a file whose name includes the hash value exists. Thereafter, it is determined for each new portion whether a file whose name includes the calculated new strong hash already exists in the data storage 102.
At step 308, the method 300 further comprises, for each new portion, if a file with a name whose prefix starts with the new strong hash exists, creating a hard link to the file whose name starts with this new strong hash, the name of the hard link including the new strong hash. If there exists a file whose name starts with the new strong hash, then there is no need to keep the data of the new portion, because the file whose name starts with the new strong hash already contains the same data as the chunk (i.e., the new portion) that is going to be stored. Instead, a hard link to the file whose name starts with the new strong hash is created. Also, the name of the created hard link may have the new strong hash as a prefix with some suffix, which is usually the name of the stored file, like an index.
At step 310, the method 300 further comprises, for each new portion, if no file whose name starts with the new strong hash is found, storing the new portion file in the data storage 102 and naming the new portion file using the strong hash of the new portion file as at least part of the name. Each new portion file is named using the new strong hash calculated for the respective new portion as at least part of the name. For example, the new strong hash calculated for each new portion is used as the prefix, and the name of the stored file (i.e., the new file) is used as the suffix, while naming each new portion file.
At step 312, the method 300 further comprises creating a metadata file for the stored file. The metadata file created for the stored file (i.e., the new file) consists of a list of the new strong hashes of each new portion file as well as the length of each new portion (i.e., each large size chunk) of the new file. An implementation scenario of the method 300 is described in more detail, for example, in FIG. 6. The method 300 achieves all the advantages and technical effects of the method 200 (of FIG. 2). Thus, the method 300 provides an efficient way of storing the new file in the data storage 102 in addition to an efficient deduplication of the new file.
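Under the same illustrative assumptions as before (SHA-1, underscore-separated names, JSON metadata), steps 302 to 312 could be sketched in Python as follows; glob is used to test whether any file whose name starts with the new strong hash exists, and os.link creates the hard link.

    import glob
    import hashlib
    import json
    import os

    def store_new_file(new_name, new_portions, store_dir):
        """Deduplicate a new file against portion files already in the store."""
        entries = []
        for portion in new_portions:
            h = hashlib.sha1(portion).hexdigest()                   # step 304
            link_path = os.path.join(store_dir, h + "_" + new_name)
            matches = glob.glob(os.path.join(store_dir, h + "_*"))  # step 306
            if matches:
                os.link(matches[0], link_path)                      # step 308: hard link
            else:
                with open(link_path, "wb") as f:                    # step 310: new portion
                    f.write(portion)
            entries.append({"hash": h, "length": len(portion)})
        with open(os.path.join(store_dir, new_name + ".meta"), "w") as f:
            json.dump(entries, f)                                   # step 312

In this sketch the existence check is simply a file name lookup, so no separate index of strong hashes needs to be maintained.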
The steps 302 to 312 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
FIG. 4 is a flowchart of a method of deleting a file in a data storage, in accordance with an embodiment of the present disclosure. FIG. 4 is described in conjunction with elements from FIGs. 1, 2, and 3. With reference to FIG. 4, there is shown a method 400 of deleting a file in the data storage 102 (of FIG. 1). The method 400 includes steps 402 to 406. The control unit 108 of the data storage 102 is configured to execute the method 400.
The method 400 is used to delete a file in the data storage 102 in the following steps.
At step 402, the method 400 comprises searching for the metadata file. For each file, such as the prior file 104A (of FIG. 1) or the new file (of FIG. 3), there is a separate metadata file with a name identical to the file name. Therefore, the metadata file associated with the file to be deleted is searched for.
At step 404, the method 400 further comprises, if there are one or more hard links associated with the file, deleting one of the hard links. If the metadata file is found, then for each portion file (i.e., each prior portion file or new portion file) referenced by the file to be deleted, the hard link belonging to that file is deleted. For each file (more specifically, for each portion file), there is a specific hard link whose name includes the strong hash as a prefix and a suffix unique to that hard link. Moreover, the usage of hard links avoids the need for garbage collection: the data of a portion file is only deleted by the file system once all hard links to it have been deleted.
At step 406, the method 400 comprises deleting the metadata of the file. After the hard links of the file have been deleted, the metadata file is also deleted.
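A corresponding Python sketch of steps 402 to 406, under the same naming assumptions, is given below; the file system frees a portion's data only when its last hard link is removed, so no separate garbage collection is required.

    import json
    import os

    def delete_file(name, store_dir):
        """Steps 402-406: find the metadata, unlink this file's hard links,
        then delete the metadata file itself."""
        meta_path = os.path.join(store_dir, name + ".meta")  # step 402
        with open(meta_path) as f:
            entries = json.load(f)
        for entry in entries:
            link = os.path.join(store_dir, entry["hash"] + "_" + name)
            if os.path.exists(link):
                os.unlink(link)   # step 404: data freed only at link count zero
        os.remove(meta_path)      # step 406: delete the metadata file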
Thus, the method 400 provides a simple way of deleting the file in the data storage 102. The steps 402 to 406 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
In one aspect, the present disclosure provides a computer program product comprising computer readable code means which, when executed in the control unit 108 of the data storage 102 (of FIG. 1), cause the control unit 108 to perform the methods described above (i.e., the method 200 of FIG. 2, the method 300 of FIG. 3 and the method 400 of FIG. 4). In yet another aspect, the present disclosure provides a computer program product comprising a non-transitory memory (e.g., the memory 104) having stored thereon the computer readable code means. Furthermore, the control unit 108 of the data storage 102 comprises a program memory holding the aforementioned computer program product.
FIG. 5 illustrates an implementation scenario of deduplication of a data storage, in accordance with an embodiment of the present disclosure. FIG. 5 is described in conjunction with elements from FIGs. 1, 2, 3, and 4. With reference to FIG. 5, there is shown an implementation scenario 500 of deduplication of the data storage 102 (of FIG. 1). The implementation scenario 500 includes the prior file 104A (of FIG. 1). There is further shown a plurality of prior portions 502, such as a first prior portion 502A, a second prior portion 502B, a third prior portion 502C, a fourth prior portion 502D and a fifth prior portion 502E. There is further shown a metadata file 504.
Initially, the prior file 104A (also represented as File 1, F1) is subjected to deduplication by dividing the prior file 104A (i.e., F1) into the plurality of prior portions 502 of a second size range between 2 MB and 6 MB. The plurality of prior portions 502 includes five prior portions, namely the first prior portion 502A, the second prior portion 502B, the third prior portion 502C, the fourth prior portion 502D and the fifth prior portion 502E. Each prior portion of the plurality of prior portions 502 is of 4 MB (i.e., a fixed size) in the implementation scenario 500. However, in practice, each prior portion of the plurality of prior portions 502 may also have a variable size. Thereafter, a prior strong hash is calculated for each prior portion: the first prior portion 502A has a first prior strong hash, H1, the second prior portion 502B has a second prior strong hash, H2, the third prior portion 502C has a third prior strong hash, H3, the fourth prior portion 502D has a fourth prior strong hash, H4, and the fifth prior portion 502E has a fifth prior strong hash, H5. Each prior strong hash is of 20 bytes; hence, the five strong hashes amount to 100 bytes of data. Additionally, the length (e.g., 32 bits) of each prior portion (i.e., 502A-502E) is required to be stored. Furthermore, each prior portion of the plurality of prior portions 502 is stored in a separate prior portion file, and the name of the prior portion file is “H F”, where “H” is the strong hash of a 4 MB chunk and “F” is the original file name. For example, the first prior portion 502A is stored in a first prior portion file, and the name of the first prior portion file is H1 F1, where H1 is the first prior strong hash associated with the first prior portion 502A and F1 is the original file name. Similarly, the second prior portion 502B, the third prior portion 502C, the fourth prior portion 502D and the fifth prior portion 502E are stored in a second prior portion file named H2 F1, a third prior portion file named H3 F1, a fourth prior portion file named H4 F1, and a fifth prior portion file named H5 F1, respectively. Thereafter, the metadata file 504 is created for the prior file 104A and includes all five prior strong hashes of the plurality of prior portions 502. Therefore, the metadata file 504 includes H1, H2, H3, H4 and H5 of the plurality of prior portions 502. The metadata file 504 may also be referred to as a header file. If a prior portion file with a name whose prefix starts with the prior strong hash already exists in the data storage 102, then either a hard link to the prior portion file whose name starts with the prior strong hash is created, or a reference count of the prior portion is increased in the internal file system structure. Additionally, the names of the prior portion files may also be used for deduplication in order to have a key value storage which stores the strong hashes in memory (e.g., the memory 104 of the data storage 102), resulting in an improved (e.g., higher) performance. The metadata file 504 and each of the prior portion files associated with the plurality of prior portions 502 may also be referred to as a file system (e.g., S3FS file system) layer. Moreover, each prior portion file, for example, the first prior portion file named H1 F1, is divided into variable size chunks at a lower level, and such division may also be referred to as variable size chunking or variable length deduplication at a granularity of, on average, 16 kB.
The division of each prior portion into variable size chunks may also be referred to as a pool layer.
FIG. 6 illustrates an implementation scenario of storing a new file in a data storage, in accordance with an embodiment of the present disclosure. FIG. 6 is described in conjunction with elements from FIGs. 1, 2, 3, 4, and 5. With reference to FIG. 6, there is shown an implementation scenario 600 of storing a new file in the data storage 102 (of FIG. 1). The implementation scenario 600 includes two files, such as a first file 602 and a second file 604.
The first file 602 (also represented as File 1) corresponds to the prior file 104A (of FIG. 1). The second file 604 (also represented as File 2) corresponds to the new file (of FIG. 3). Each of the first file 602 (i.e., File 1) and the second file 604 (i.e., File 2) is deduplicated by division into a first plurality of prior portions 606 and a second plurality of prior portions 608, respectively. There is further shown a first metadata file 610 and a second metadata file 612, associated with the first file 602 and the second file 604, respectively. The first metadata file 610 stores all the strong hashes (i.e., H1, H2, H3, H4, and H5) calculated for each prior portion of the first plurality of prior portions 606. Similarly, the second metadata file 612 stores all the strong hashes (i.e., H1, H6, H3, H4, and H5) calculated for each prior portion of the second plurality of prior portions 608.
In the implementation scenario 600 of the data storage 102, it is assumed that the deduplication ratio is very high (otherwise the size of the metadata is insignificant). Secondly, it is assumed that each prior portion of the first plurality of prior portions 606 and the second plurality of prior portions 608 is of 4 MB, and that many of them are unchanged. In the implementation scenario 600, the files (i.e., the first file 602 and the second file 604) are large and very similar; therefore, the metadata is also deduplicated.
In order to determine whether a first prior portion of the first file 602 is identical to a first prior portion of the second file 604, a command, such as a find (e.g., ls) on "H1*", may be performed on a local S3 file system. Similarly, each prior portion of the first file 602 is checked against each prior portion of the second file 604. It is clearly visible from FIG. 6 that many of the strong hashes stored in the first metadata file 610 and the second metadata file 612 are unchanged. Only the second metadata file 612 stores a different strong hash (e.g., H6). Alternatively stated, the second file 604 has a different prior portion of 4 MB. Therefore, for the strong hashes (i.e., H1, H3, H4, and H5) that are the same in the first metadata file 610 and the second metadata file 612, a hard link is created and stored for each of them in the second metadata file 612. For example, hard links named H1 F2, H3 F2, H4 F2, and H5 F2 are created for H1, H3, H4, and H5, respectively. Alternatively, a reference counter is increased at the lower file system (FS) level for the same strong hashes. Alternatively stated, the hard links can be replaced by either counters, GC (i.e., garbage collection) or a KVDB (i.e., a key-value database) for better performance. The different strong hash (i.e., H6) is stored as it is in the second metadata file 612.
In an exemplary scenario, a virtual machine of size 40 gigabytes (GB) is considered, whose backup is to be maintained. In each generation of backup, there is a change in only one block of data (i.e., one prior portion of 4 MB). It is assumed that there are one hundred generations of backup. Also, for the sake of simplicity, it is assumed that there is no internal deduplication. For said virtual machine with the aforementioned assumptions, in one case, backing up a conventional virtual machine disk (VMDK) to a conventional deduplication system (e.g., OceanProtect) is considered. In such a case, the deduplication ratio of the data is very high, as all blocks in the 40 GB are identical except one block, so the overall deduplication ratio is close to 100, which means 100 copies of almost the same data. Moreover, the total data stored in the system before deduplication is 4 TB, and the metadata is 1% of 4 TB, which is 40 GB; the data will get deduplicated by almost 100X, so it will also take 40 GB. The total data stored will therefore be 80 GB, which is a factor of only 50, meaning the metadata has become a significant portion of the data after deduplication.
In another case, the data storage 102 with multi-level deduplication is considered for said virtual machine with the aforementioned assumptions. In such a case, in each generation of data backup, a metadata file and one block of 4 MB are generated. If the size of the metadata file is 80 kB for a single generation of data backup, then, for 100 generations of backup, the total size of the metadata files is 8 MB. Also, due to the multi-level deduplication, each metadata file differs from another metadata file by 20 bytes. While the total amount of deduplicated data kept in the conventional system is 40 GB (plus 40 GB of metadata), the total amount of data kept in the data storage 102 with multi-level deduplication is approximately equal to 40 GB (the original file divided into 4 MB files) + 100 × 4 MB (changes between the files) + 8 MB (metadata files), which is overall less than 40.5 GB. Thus, the metadata size will be approximately 0.01% of the data, which means the deduplication ratio will be close to 100.
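The comparison can be reproduced with the short Python calculation below; the 1% metadata overhead of the conventional system and the 80 kB metadata file per generation are the assumptions stated above.

    GB, MB, KB = 2**30, 2**20, 2**10

    # Conventional system: ~100x dedup of 4 TB of backups plus ~1% metadata.
    data_after_dedup = 40 * GB
    metadata = 4 * 1024 * GB // 100           # 1% of 4 TB, roughly 40 GB
    conventional_total = data_after_dedup + metadata   # ~80 GB, ratio only ~50

    # Multi-level scheme: base copy, plus one changed 4 MB portion and one
    # 80 kB metadata file per generation, for 100 generations.
    multi_level_total = 40 * GB + 100 * 4 * MB + 100 * 80 * KB  # < 40.5 GB

    print(conventional_total / GB)   # ~81
    print(multi_level_total / GB)    # ~40.4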
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as "including", "comprising", "incorporating", "have", "is" used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance or illustration". Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word "optionally" is used herein to mean "is provided in some embodiments and not provided in other embodiments". It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A data storage (102) including a prior file (104A) that has been subjected to a first level of deduplication using a first division of the prior file (104A) into prior chunks of a first size range, wherein the prior file (104A) is additionally divided into prior portions of a second size range greater than the first size range, each prior portion being stored in a prior portion file, each prior portion file being named using a prior strong hash calculated for the prior portion file as at least part of the name, the data storage (102) further comprising a metadata file associated with the prior file and including the prior strong hashes of all the prior portions.
2. A method (200) of deduplication of a data storage (102) according to claim 1, the method (200) comprising:
providing a prior file (104A) arranged to be divided into a number of chunks having a size within a first size range,
dividing the prior file (104A) into portions of a second size range greater than the first size range, each portion comprising a number of complete chunks,
for each portion:
calculating a prior strong hash,
storing data of the portion in a prior portion file specific to the portion,
naming the prior portion file, where the name includes the prior strong hash of the respective portion,
creating a metadata file, said metadata file including the prior strong hashes of all the prior portions of the prior file (104A),
deduplicating each portion based on the chunks in that portion.
3. A method (300) of storing a new file in a data storage (102) according to claim 1, having already stored therein a prior file (104A) divided into prior portion files and associated with a metadata file including the strong hashes of the prior portion files, wherein the new file is arranged to be subjected to deduplication on the first level, the method (300) comprising the steps of:
dividing the new file into new portions of a second size range greater than the first size range,
for each new portion:
calculating a new strong hash,
determining if a file whose name includes the hash value exists,
if a file with a name whose prefix starts with the new strong hash exists, creating a hard link to the file whose name starts with this new strong hash, the name of the hard link including the new strong hash,
if no file whose name starts with the new strong hash is found, storing the new portion file in the data storage (102), and naming the new portion file using the strong hash of the new portion file as at least part of the name,
creating a metadata file for the stored file.
4. A method (400) of deleting a file in a data storage (102) according to claim 1, comprising the steps of:
searching for the metadata file,
if there are one or more hard links associated with the file, deleting one of the hard links,
deleting the metadata of the file.
5. The method (200, 300, 400) according to any one of the claims 2 - 4, wherein all chunks have a constant, first size.
6. The method (200, 300, 400) according to any one of the claims 2 - 5, wherein all the file portions have a constant, second size.
7. The method (200, 300, 400) according to any one of the claims 2 - 6, wherein the size of each portion is between 2 MB and 6 MB.
8. The method (200, 300, 400) according to any one of the claims 5 - 7, wherein the second size is a multiple of the first size.
9. The method (200, 300, 400) according to any one of the claims 2 - 8, wherein the size of each chunk is between 2 and 20 kB.
10. The method (200, 300, 400) according to any one of the claims 2 - 9, wherein the chunks and/or the portions are of variable size.
11. The method (200, 300, 400) according to claim 10, wherein the division into chunks and/or the division into portions is achieved by CDC algorithm.
12. The method (200, 300, 400) according to any one of the claims 2 - 10, where each portion contains a consecutive set of smaller chunks.
13. A computer program product comprising computer readable code means which, when executed in a control unit (108) of a data storage (102) according to claim 1, will cause the control unit (108) to perform the method (200, 300, 400) according to any one of the claims 2 - 12.
14. A computer program product according to claim 13, comprising a non-transitory memory having stored thereon the computer readable code means.
15. A control unit (108) for a data storage (102) according to claim 1, said control unit (108) comprising a program memory holding a computer program product according to claim 13 or 14.