CN111124259A - Data compression method and system based on full flash memory array - Google Patents

Data compression method and system based on full flash memory array

Publication number
CN111124259A
Authority
CN
China
Prior art keywords
data block
data
fingerprint
compressed
storage unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811287652.2A
Other languages
Chinese (zh)
Inventor
夏文
古亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201811287652.2A
Publication of CN111124259A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0688 Non-volatile semiconductor memory arrays
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose a data compression method and system based on a full flash memory array, which improve the space utilization of data storage and avoid wasted space. The method comprises: acquiring data to be compressed from the performance layer; dividing the data into first data blocks of a preset length and calculating a hash value for each first data block; matching the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists; and, if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing it, and writing the compressed first data block back to the capacity layer by log-append writing, where log-append writing is an out-of-place update scheme that improves the IO performance of the capacity layer.

Description

Data compression method and system based on full flash memory array
Technical Field
The present application relates to the field of data storage technologies, and in particular, to a data compression method and system based on a full flash memory array.
Background
Full flash memory array: flash-based solid state disks (SSDs) are widely used as caches in front of mechanical hard disks, for example in Ceph and ZFS, mainly because SSDs offer good random IO performance while conventional mechanical hard disks handle random IO poorly. Deploying all-flash devices to raise the overall performance of a storage system is therefore a general trend. Because SSDs cost far more than mechanical hard disks, and because storage systems in current cloud computing and virtualization environments hold large amounts of duplicate and redundant data, data deduplication and compression can expand the logical storage space of an SSD storage system, improve SSD utilization, and thereby reduce SSD cost.
Generally, the physical architecture of a full flash memory array is divided into a capacity layer (read cache) and a performance layer (write cache). When data is written back from the performance layer to the capacity layer under existing data storage methods, an update to compressed data in the capacity layer is generally performed in place. After an in-place update, the compressed size of the updated data block often no longer matches the storage space of the original compressed data block, which produces many space fragments and wastes storage space.
Disclosure of Invention
The embodiment of the application provides a data compression method and system based on a full flash memory array, which are used for improving the space utilization rate of data storage and avoiding the problem of space waste.
A first aspect of an embodiment of the present application provides a data compression method based on a full flash memory array, including:
acquiring data to be compressed in the performance layer;
segmenting the data to be compressed into first data blocks of a preset length, and calculating hash values of the first data blocks;
matching the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to the capacity layer by log-append writing, and updating the fingerprint of the first data block into the deduplication fingerprint library, wherein log-append writing is an out-of-place update scheme used to improve the IO performance of the capacity layer.
Preferably, the method further comprises:
if a matching fingerprint exists, determining that the first data block is duplicate data, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information comprises the correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
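The duplicate-block metadata described above can be sketched as a small record. This is only an illustration under assumptions: the names `DuplicateRef`, `metadata_area`, and `record_duplicate`, and the dictionary used as a metadata area, are hypothetical and do not come from the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DuplicateRef:
    logical_address: int   # offset of the block within the data to be compressed
    fingerprint: str       # matching fingerprint found in the library
    physical_address: int  # where the already stored block lives

# Hypothetical metadata area: logical address -> reference record.
metadata_area = {}

def record_duplicate(logical_address, fingerprint, fingerprint_to_physical):
    """Store only a pointer record for a duplicate block; no payload is written."""
    ref = DuplicateRef(logical_address, fingerprint,
                       fingerprint_to_physical[fingerprint])
    metadata_area[logical_address] = ref
    return ref

# Usage: the library already maps fingerprint "ab12" to physical address 8192.
library = {"ab12": 8192}
ref = record_duplicate(4096, "ab12", library)
```

The point of the record is that a duplicate block costs one small metadata entry instead of a second copy of the payload.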
Preferably, the method further comprises:
performing count management on the number of references to the fingerprints in the deduplication fingerprint library.
Preferably, performing count management on the number of references to the fingerprints in the deduplication fingerprint library includes:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when a first data block that references the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
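The two counting operations above can be sketched as follows. This is a minimal illustration, assuming counts start at zero and never go negative; the function names are hypothetical, and the application does not specify an initial count or what happens when a count reaches zero.

```python
from collections import Counter

# Hypothetical in-memory reference counts, keyed by fingerprint.
ref_counts = Counter()

def on_duplicate_match(fingerprint):
    """A new block matched an existing fingerprint: increment its count."""
    ref_counts[fingerprint] += 1
    return ref_counts[fingerprint]

def on_referencing_block_updated(fingerprint):
    """A block that referenced this fingerprint was updated: decrement."""
    if ref_counts[fingerprint] > 0:  # assumption: never drop below zero
        ref_counts[fingerprint] -= 1
    return ref_counts[fingerprint]

# Usage: two blocks match the fingerprint, then one of them is updated.
on_duplicate_match("ab12")
on_duplicate_match("ab12")
remaining = on_referencing_block_updated("ab12")
```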
Preferably, writing the compressed first data block back to the capacity layer by log-append writing includes:
writing the compressed first data block to a log storage unit by log-append writing, and writing the log storage unit back to the capacity layer once the log storage unit is full, wherein the storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer.
Preferably, after writing the compressed first data block back to the capacity layer by log-append writing, the method further includes:
updating metadata information of the first data block to a file metadata area of the capacity layer or to the deduplication fingerprint library, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block, so that the first data block can later be decompressed according to the metadata information.
A second aspect of the present application provides a data compression system based on a full flash memory array, including:
an acquisition unit, configured to acquire data to be compressed in the performance layer;
a segmentation calculation unit, configured to segment the data to be compressed into first data blocks of a preset length and to calculate hash values of the first data blocks;
a matching unit, configured to match the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
and a compression unit, configured to determine, when no matching fingerprint exists, that the first data block is a non-duplicate data block, compress the first data block, write the compressed first data block back to the capacity layer by log-append writing, and update the fingerprint of the first data block into the deduplication fingerprint library, wherein log-append writing is an out-of-place update scheme used to improve the IO performance of the capacity layer.
Preferably, the system further comprises:
a deduplication unit, configured to determine, when the matching fingerprint exists, that the first data block is duplicate data, and to write metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information includes the correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
Preferably, the system further comprises:
a counting unit, configured to perform count management on the number of references to the fingerprints in the deduplication fingerprint library.
Preferably, the counting unit includes:
a first counting module, configured to increment the reference count of the matching fingerprint when a matching fingerprint of the first data block exists in the deduplication fingerprint library;
and,
a second counting module, configured to decrement the reference count of the matching fingerprint when a first data block that references the matching fingerprint in the deduplication fingerprint library is updated.
Preferably, the compression unit comprises:
a compression module, configured to determine, when no matching fingerprint exists, that the first data block is a non-duplicate data block, compress the first data block, write the compressed first data block to a log storage unit by log-append writing, and write the log storage unit back to the capacity layer once the log storage unit is full, wherein the storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer.
Preferably, the system further comprises:
an updating unit, configured to update metadata information of the first data block to a file metadata area of the capacity layer or to the deduplication fingerprint library, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block, so that the first data block can later be decompressed according to the metadata information.
The embodiment of the present application further provides a data compression system based on a full flash memory array, which includes a processor, and when the processor executes a computer program stored in a memory, the processor is configured to implement the data compression method based on the full flash memory array provided in the first aspect of the embodiment of the present application.
An embodiment of the present application further provides a readable storage medium, on which a computer program is stored, where the computer program is used to implement the method for data compression based on a full flash memory array provided in the first aspect of the embodiment of the present application when the computer program is executed by a processor.
According to the technical scheme, the embodiment of the application has the following advantages:
in the embodiments of the present application, data to be compressed is acquired from the performance layer of a full flash memory array and divided into first data blocks of a preset length. The hash value of each first data block is calculated and matched against the deduplication fingerprint library of the capacity layer to determine whether a matching fingerprint exists. When no matching fingerprint exists, the first data block is determined to be a non-duplicate data block, is compressed, and is stored in the capacity layer by log-append writing. Log-append writing is sequential and out of place: when the file corresponding to the first data block is updated, the new data block in that file is compressed and appended in chronological order, so it is stored at a new address in the storage medium rather than at the address of the original first data block. This avoids the mismatch between the compressed length of the new data block and the storage space of the original compressed first data block after a file update, which in turn avoids wasted storage space and small space fragments in the storage medium and improves its storage space utilization.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a full flash memory array based data compression method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a physical architecture of a full flash memory array according to an embodiment of the present application;
FIG. 3 is a schematic diagram of another embodiment of a full flash memory array based data compression method according to an embodiment of the present application;
FIG. 4A is a diagram illustrating logical addresses and physical addresses before and after data deduplication in an embodiment of the present application;
FIG. 4B is a schematic diagram of a data logical organization relationship of metadata information in a metadata area of a capacity layer in the embodiment of the present application;
FIG. 5 is a schematic diagram of another embodiment of a full flash memory array based data compression method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a first de-duplication fingerprint library and a second de-duplication fingerprint library in an embodiment of the present application;
FIG. 7A is a schematic diagram illustrating an out-of-place update of the compressed first data block by log-append writing in the embodiment of the present application;
FIG. 7B is a diagram of a data bit map in an embodiment of the present application;
FIG. 8 is a schematic diagram of another embodiment of a full flash memory array based data compression method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an embodiment of a full flash array based data compression system according to an embodiment of the present application;
FIG. 10 is a schematic diagram of another embodiment of a full flash array based data compression system according to an embodiment of the present application;
FIG. 11 is a schematic diagram of another embodiment of a full flash array based data compression system according to an embodiment of the present application;
FIG. 12 is a schematic diagram of another embodiment of a full flash array based data compression system according to an embodiment of the present application.
Detailed Description
The embodiment of the application provides a data compression method and system based on a full flash memory array, which are used for improving the space utilization rate of data storage and avoiding the problem of space waste.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Generally, to save storage space, the data in a file is compressed when the file is stored, which reduces the space the data occupies.
Deduplication uniquely identifies a data block by calculating a secure hash digest of the block (such as a SHA1 fingerprint), which avoids byte-by-byte comparison of data. The storage system can identify duplicate data quickly just by maintaining an index table of secure hash digests, and it saves storage space by recording only the corresponding data pointer information for duplicate content.
Data compression is also a redundancy elimination technique. It removes redundant information by encoding: without losing any of the original information, the original content is transformed so that repeated byte sequences are represented by shorter codes, which eliminates part of the redundant data and saves storage space.
In existing compression and deduplication techniques, the hash value of a data block to be compressed is generally compared with the fingerprints in the deduplication fingerprint library. If a matching fingerprint exists, the block is determined to be duplicate data and only its pointer information needs to be recorded. If no matching fingerprint exists, the block is determined to be a non-duplicate data block and is compressed and stored to save storage space. With this storage scheme, an update to the file data (a deletion or a change) is generally performed in place: the original compressed data is deleted or overwritten within the address space that holds it. The compressed data block of the new file may then not fit the address space of the original compressed data block. For example, assume a compressed non-duplicate data block of the original file occupies 2K. If the file data is deleted, a 2K space fragment appears. If the file data is changed and the updated block compresses to 1K, a 1K space fragment appears; if it compresses to 3K, it cannot be stored in the original 2K address space. In addition, the storage space of deleted non-duplicate data blocks cannot be reclaimed in a targeted way.
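The 2K example above can be restated as a small calculation. This is only an illustrative sketch; the function name `in_place_update` and its return convention are assumptions made for the example.

```python
K = 1024

def in_place_update(slot_size, new_size):
    """Return (fits, fragment_bytes) when an updated block reuses its old slot.

    fits is False when the recompressed block is larger than the slot;
    fragment_bytes is the unused space left inside the slot when it does fit.
    """
    if new_size <= slot_size:
        return True, slot_size - new_size
    return False, 0

deleted = in_place_update(2 * K, 0)      # block deleted: whole slot is a fragment
shrunk = in_place_update(2 * K, 1 * K)   # block now compresses to 1K
grown = in_place_update(2 * K, 3 * K)    # block now compresses to 3K
```

Here `deleted` is `(True, 2048)`, `shrunk` is `(True, 1024)`, and `grown` is `(False, 0)`, matching the three cases in the text.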
In view of the above problem, an embodiment of the present application provides a data compression method based on a full flash memory array, which is used to improve a space utilization rate of data storage and avoid a problem of space waste.
For ease of understanding, the data compression method based on a full flash memory array in the present application is described below. Referring to FIG. 1, an embodiment of the method includes:
101. Acquiring data to be compressed in the performance layer;
Generally, for a device with a processor, the IO performance of the storage system is a main factor affecting the performance of the device. When the external storage of the device is deployed as a full flash memory array, the physical architecture of the array is divided into a capacity layer and a performance layer: the capacity layer refers to SSDs with slower IO response or ordinary hard disks, and the performance layer refers to SSDs with faster IO response. See the physical architecture of the full flash memory array shown in FIG. 2, where the performance layer is also called the write cache and the capacity layer is also called the read cache. The technical problem addressed by the present application is how to avoid, when data is updated while being written back from the performance layer to the capacity layer, a mismatch between the storage space of the updated data and that of the original data, that is, wasted space.
When data in the performance layer is written back to the capacity layer, the written-back data generally needs to be compressed to save storage space. Before compression, the data to be compressed must be acquired from the performance layer. The data to be compressed may be file data of various kinds or message data in application software, and is not specifically limited here.
It should be noted that the application scenarios of the data compression method in the present application include, but are not limited to, writing data in the performance layer back to the capacity layer; they may also include deduplication and compression in the SSD cache layer of a full flash array, online deduplication and compression in an SSD file system, and the like, which are not specifically limited here.
102. Dividing the data into first data blocks of a preset length, and calculating hash values of the first data blocks;
In the process of compressing the data, the data is divided into first data blocks of a preset length, where the segmentation granularity may be 2K, 4K, 8K, or another size. The hash value of each first data block is calculated after, or during, segmentation.
Specifically, a hash algorithm transforms an input of arbitrary length (also called a pre-image) into an output of fixed length, which is the hash value. All hash functions share a basic property: if two hash values produced by the same function differ, the original inputs of the two hash values also differ. This property follows from the hash function being deterministic. The hash value of each data block can therefore generally be regarded as the fingerprint of that data block.
It should be noted that the preset length used when segmenting the data may be chosen according to the actual requirements of the specific application, and is not specifically limited here.
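Step 102 can be sketched in a few lines. This is a minimal illustration, assuming a 4K block size and SHA-1 (the digest the description names as an example); the function names are hypothetical, and the final block may be shorter than the preset length when the input is not a multiple of it.

```python
import hashlib

def split_into_blocks(data: bytes, block_size: int = 4096):
    """Cut data into fixed-length blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def fingerprint(block: bytes) -> str:
    """Use the SHA-1 digest of a block as its fingerprint."""
    return hashlib.sha1(block).hexdigest()

# Usage: three 4K blocks, the first and third with identical content.
data = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
blocks = split_into_blocks(data)
fingerprints = [fingerprint(b) for b in blocks]
```

Identical blocks yield identical fingerprints, which is what lets the later matching step detect duplicates without comparing block contents byte by byte.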
103. Matching the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists; if so, executing step 105, and if not, executing step 104;
After the hash value of the first data block, that is, its fingerprint, is obtained, the fingerprint is matched against the deduplication fingerprint library pre-stored in the capacity layer to determine whether a matching fingerprint exists in the library. If not, step 104 is executed; if so, step 105 is executed.
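The branch in step 103 amounts to a membership test on the fingerprint library. The sketch below assumes the library is an in-memory set of SHA-1 hex digests; the name `next_step` is hypothetical, and a real fingerprint library in the capacity layer would be a persistent index.

```python
import hashlib

def next_step(block: bytes, fingerprint_library: set) -> int:
    """Return the step to execute next: 105 for a duplicate, 104 otherwise."""
    fp = hashlib.sha1(block).hexdigest()
    return 105 if fp in fingerprint_library else 104

# Usage: the library already holds the fingerprint of one block.
library = {hashlib.sha1(b"seen before").hexdigest()}
step_for_known = next_step(b"seen before", library)
step_for_new = next_step(b"never seen", library)
```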
104. Determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to the capacity layer by log-append writing, updating the fingerprint of the first data block into the deduplication fingerprint library, and updating metadata information of the first data block to a file metadata area in the capacity layer, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block;
When no matching fingerprint for the first data block exists in the deduplication fingerprint library, the first data block is a non-duplicate data block and is compressed, preferably with the LZ4 compression algorithm. After compression, the compressed first data block is written back to the capacity layer by log-append writing, the hash value (fingerprint) of the first data block is updated into the deduplication fingerprint library, and the metadata information of the first data block is updated to the file metadata area. The metadata information includes the physical storage address and the length of the compressed first data block, so that the first data block can be treated as duplicate data the next time it appears and can later be decompressed and recovered according to its metadata information.
It should be noted that the hash value (fingerprint) of the first data block and the metadata information of the first data block may instead be updated together into the deduplication fingerprint library. The location to which the metadata information is updated is not specifically limited, as long as the first data block can later be decompressed and recovered according to its metadata information.
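The non-duplicate path of step 104 can be sketched as follows, using the variant in which the metadata is kept alongside the fingerprint. Several things here are assumptions: `zlib` from the Python standard library stands in for the preferred LZ4 algorithm, a `bytearray` stands in for the capacity layer's log, and the names `store_non_duplicate` and `read_block` are hypothetical.

```python
import hashlib
import zlib

log = bytearray()          # stand-in for the append-only log in the capacity layer
fingerprint_library = {}   # fingerprint -> (physical_address, compressed_length)

def store_non_duplicate(block: bytes) -> str:
    """Compress a block, append it to the log, and record where it landed."""
    fp = hashlib.sha1(block).hexdigest()
    compressed = zlib.compress(block)   # zlib used here in place of LZ4
    address = len(log)                  # append position is the physical address
    log.extend(compressed)
    fingerprint_library[fp] = (address, len(compressed))
    return fp

def read_block(fp: str) -> bytes:
    """Recover the original block using only its recorded address and length."""
    address, length = fingerprint_library[fp]
    return zlib.decompress(bytes(log[address:address + length]))

# Usage: store one compressible block, then recover it from the metadata.
original = b"flash array " * 300
fp = store_non_duplicate(original)
restored = read_block(fp)
```

The address and compressed length are the two pieces of metadata the description names; together they are sufficient to locate and decompress the block later.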
In addition, log-append writing proceeds in chronological order. When the file corresponding to the first data block is updated, the new data block in that file is compressed and then appended in chronological order; it is stored at a new address in the storage medium (an out-of-place update) rather than at the storage address of the original first data block. This avoids the mismatch between the compressed length of the new data block and the storage space of the original compressed first data block after a file update, which in turn avoids wasted storage space and small space fragments in the storage medium and improves its storage space utilization. Moreover, an out-of-place update requires only a write operation, whereas an in-place update requires a read operation followed by a write operation, so out-of-place updating by log-append writing further improves the IO performance of the capacity layer.
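The out-of-place behavior described above can be sketched with a small append-only log. This is an illustration under assumptions: the class `AppendLog`, its per-block index, and the `stale_bytes` counter are hypothetical, `zlib` again stands in for the compression algorithm, and reclaiming the space of superseded blocks is outside this sketch.

```python
import zlib

class AppendLog:
    """Append-only store: an update never overwrites the old compressed block."""

    def __init__(self):
        self.data = bytearray()
        self.index = {}        # logical block id -> (address, compressed_length)
        self.stale_bytes = 0   # bytes occupied by superseded versions

    def write(self, block_id, payload: bytes):
        compressed = zlib.compress(payload)
        if block_id in self.index:
            # The old version stays where it is and is only marked stale.
            self.stale_bytes += self.index[block_id][1]
        self.index[block_id] = (len(self.data), len(compressed))
        self.data.extend(compressed)

    def read(self, block_id) -> bytes:
        address, length = self.index[block_id]
        return zlib.decompress(bytes(self.data[address:address + length]))

# Usage: write a block, then update it with content that compresses differently.
log = AppendLog()
log.write(0, b"a" * 4096)
first_address, first_length = log.index[0]
updated = bytes(range(256)) * 16
log.write(0, updated)
```

Because the new version is appended at the end of the log, its compressed length never has to fit the slot of the old version, and the update itself is a single write with no preceding read.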
Furthermore, the minimum write unit of an SSD is 4K. When a minimum write unit is not full and a later write targets the same unit, the erase-before-write behavior of SSDs requires reading the data already stored in that unit, erasing it, and then rewriting the new data together with the data that was read. Therefore, when performing an out-of-place update of file data by log-append writing, the present application may first store the compressed data in a log storage unit and write the log storage unit back to the capacity layer only after it is full. The storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer, that is, an integer multiple of 4K, such as 8K, 12K, or 16K. This matches the 4K minimum write unit of the SSD and avoids random small writes (writes shorter than the 4K minimum write unit), which further avoids wasted storage space in the storage medium.
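The buffering described above can be sketched as a unit that is flushed only when full. The class `LogStorageUnit` and its `flushed` list are hypothetical, and one detail is an assumption the text does not settle: here a compressed block that does not fit in the remaining space is split across two units.

```python
MIN_WRITE_UNIT = 4096  # 4K minimum write unit, as stated in the description

class LogStorageUnit:
    """Buffer compressed blocks and flush only in whole, full units."""

    def __init__(self, multiple: int = 2):
        if multiple < 1:
            raise ValueError("multiple must be a positive integer")
        self.capacity = multiple * MIN_WRITE_UNIT  # e.g. 8K, 12K, 16K
        self.buffer = bytearray()
        self.flushed = []  # each entry models one write to the capacity layer

    def append(self, compressed: bytes):
        self.buffer.extend(compressed)
        # Assumption: a block may straddle two units; flush every full unit.
        while len(self.buffer) >= self.capacity:
            self.flushed.append(bytes(self.buffer[:self.capacity]))
            del self.buffer[:self.capacity]

# Usage: five 3000-byte compressed blocks go into an 8K unit.
unit = LogStorageUnit(multiple=2)
for _ in range(5):
    unit.append(b"x" * 3000)
```

After the five appends, one full 8192-byte unit has been written and 6808 bytes are still buffered, so every write that reaches the capacity layer is a whole multiple of 4K.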
105. Other processes are performed.
When the first data block has a matching fingerprint in the deduplication fingerprint library, other procedures are executed, and no specific limitation is made here.
In this embodiment of the application, data to be compressed is acquired from the performance layer of the full flash memory array and divided into first data blocks of a preset length. The hash value of each first data block is calculated and matched against the deduplication fingerprint library of the capacity layer to determine whether a matching fingerprint exists. When no matching fingerprint exists, the first data block is determined to be a non-duplicate data block, is compressed, and is stored in the capacity layer by log append writing. Log append writing is characterized by sequential writes and out-of-place updates: when the file corresponding to the first data block is updated, the new data block in that file is compressed and appended in chronological order, so it is stored at a new storage address in the storage medium rather than at the storage address of the original first data block. This avoids the mismatch, after the file data is updated, between the compressed length of the new data block and the storage space of the original compressed first data block, which in turn avoids wasted storage space and small space fragments in the storage medium and improves its storage space utilization.
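The flow summarised above (split, hash, look up, compress, append) can be sketched as follows. The sketch makes several assumptions that are not in the text: SHA-1 is used as the hash function, `zlib` from the Python standard library stands in for the compression algorithm, a `dict` stands in for the deduplication fingerprint library, a `bytearray` stands in for the capacity-layer log, and the 4096-byte block length is an assumed value.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # preset block length; an assumed value


def dedup_and_compress(data: bytes, fingerprints: dict, log: bytearray) -> list:
    """Split data into fixed-size blocks, deduplicate by hash, compress, append.

    fingerprints maps fingerprint -> (offset in log, compressed length).
    Returns the fingerprint of each block in logical order.
    """
    recipe = []
    for start in range(0, len(data), BLOCK_SIZE):
        block = data[start:start + BLOCK_SIZE]
        fp = hashlib.sha1(block).digest()      # assumed hash function
        if fp not in fingerprints:             # no matching fingerprint: non-duplicate
            compressed = zlib.compress(block)  # zlib is a stand-in compressor
            fingerprints[fp] = (len(log), len(compressed))
            log += compressed                  # append only, never overwrite in place
        recipe.append(fp)
    return recipe


fingerprints, log = {}, bytearray()
recipe = dedup_and_compress(b"A" * 4096 + b"B" * 4096 + b"A" * 4096, fingerprints, log)
```

Here the third block repeats the first, so only two compressed blocks are appended to the log while the returned list still records three logical blocks.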
Referring to fig. 3, and building on the embodiment described in fig. 1, the following describes the case in which the fingerprint of the first data block matches a fingerprint in the deduplication fingerprint library of the capacity layer. Another embodiment of the data compression method based on the full flash memory array in the embodiment of the present application includes:
106. If the matching fingerprint exists, determining that the first data block is duplicate data, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information comprises the correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint;
When the fingerprint of the first data block matches a fingerprint in the deduplication fingerprint library, the first data block is determined to be duplicate data, and its metadata information is written back to the metadata area of the capacity layer so that the first data block can be restored from this metadata information during data decompression.
Specifically, the metadata information of the first data block comprises the correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint. The logical address refers to the logical order of the first data block within the data to be compressed (as in fig. 4A, where data block B5 is the first data block in file 1). The physical address of the matching fingerprint refers to the specific physical storage address of the matching fingerprint in the capacity layer, which is used later to decompress and restore the first data block. Fig. 4A is a schematic diagram of the logical and physical addresses before and after data deduplication in this embodiment, and fig. 4B is a diagram of the logical organization of the metadata information in the metadata area of the capacity layer in an embodiment of the present application. As is readily understood, multiple data blocks may correspond to the same fingerprint, that is, multiple (N) logical addresses may correspond to one fingerprint, while one fingerprint corresponds to only one physical storage address, so that a data block can later be decompressed and restored from the physical storage location of the data block corresponding to its fingerprint.
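The N-to-one and one-to-one relationships described above can be sketched with two mappings. The dictionary names, the tuple form of the logical address, the string fingerprint `"fp_B5"`, and the address `0x1000` are all hypothetical values chosen for illustration.

```python
# Logical address (file, block index) -> fingerprint; many logical addresses
# may share one fingerprint.
logical_to_fp = {}
# Fingerprint -> physical address of the single stored copy.
fp_to_physical = {}


def record_duplicate(logical_addr, fingerprint):
    """Record a duplicate block by metadata only; no block data is written."""
    if fingerprint not in fp_to_physical:
        raise KeyError("fingerprint has no stored block")
    logical_to_fp[logical_addr] = fingerprint


def resolve(logical_addr):
    """Return the physical address to read when restoring a logical block."""
    return fp_to_physical[logical_to_fp[logical_addr]]


fp_to_physical["fp_B5"] = 0x1000         # hypothetical physical address
record_duplicate(("file1", 0), "fp_B5")  # block B5 as the first block of file 1
record_duplicate(("file2", 3), "fp_B5")  # the same content referenced elsewhere
```

Both logical addresses resolve to the same physical address, so restoring either file reads the one stored copy.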
107. Count management is performed on the number of references to fingerprints in the deduplication fingerprint library.
To keep track of space fragments in the capacity layer, that is, the fragments left behind after data stored in the capacity layer is deleted, the reference counts of the fingerprints in the deduplication fingerprint library may be managed. Specifically, count management covers the following two aspects:
First, when a matching fingerprint of the first data block exists in the deduplication fingerprint library, an increasing operation is performed on the reference count of the matching fingerprint;
If a first data block in the data to be compressed has a matching fingerprint in the deduplication fingerprint library, an increasing operation, preferably an addition, is performed on the reference count of the matching fingerprint; that is, the reference count is incremented by 1. The increasing operation may also be a multiplication or a mixed operation, as long as it is positively correlated, and is not specifically limited here.
Second, when a first data block that references a matching fingerprint in the deduplication fingerprint library is updated, a decreasing operation is performed on the reference count of the matching fingerprint.
Specifically, if the file data corresponding to a matching fingerprint in the deduplication fingerprint library is deleted or changed, a decreasing operation, preferably a subtraction, is performed on the reference count of the matching fingerprint; that is, when the first data block corresponding to the matching fingerprint is deleted or updated, the reference count is decremented by 1. The decreasing operation may also be a division or a mixed operation, as long as it is negatively correlated, and is not specifically limited here.
In this way, managing the reference counts of the fingerprints in the deduplication fingerprint library makes the reference status of every fingerprint clear. When the reference count of a fingerprint reaches 0, the fingerprint can be deleted according to its physical address to free storage space in the capacity layer. When the deletion is performed, the address information of the resulting space fragment is recorded, which makes the space fragment information in the capacity layer explicit and facilitates management of the capacity layer's storage space.
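The two counting rules and the deletion at zero can be sketched as below, using the preferred "+1" and "-1" operations. The class and method names are hypothetical, and the sketch assumes that a newly stored fingerprint starts with a reference count of 1, which the text does not specify.

```python
class FingerprintLibrary:
    """Hypothetical deduplication fingerprint library with reference counts."""

    def __init__(self):
        self.entries = {}    # fingerprint -> [physical address, reference count]
        self.fragments = []  # addresses freed when a count reaches 0

    def add(self, fp, physical_addr):
        # Assumption: the first writer of a block counts as one reference.
        self.entries[fp] = [physical_addr, 1]

    def reference(self, fp):
        """A duplicate block matched this fingerprint: count + 1."""
        self.entries[fp][1] += 1

    def release(self, fp):
        """A block referencing this fingerprint was updated or deleted: count - 1."""
        entry = self.entries[fp]
        entry[1] -= 1
        if entry[1] == 0:
            self.fragments.append(entry[0])  # record the freed space fragment
            del self.entries[fp]


lib = FingerprintLibrary()
lib.add("fp1", 4096)
lib.reference("fp1")  # second block with the same content
lib.release("fp1")    # one referencing block is updated
lib.release("fp1")    # the last referencing block is deleted
```

After the final release the fingerprint is removed and its address is recorded as a reclaimable fragment.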
Building on the above embodiments, the data compression method based on the full flash memory array in the embodiment of the present application is described further below. Referring to fig. 5, another embodiment of the data compression method based on the full flash memory array in the present application includes:
501. Acquiring data to be compressed in a performance layer;
502. Segmenting the data to be compressed into first data blocks of a preset length, and calculating hash values of the first data blocks;
It should be noted that steps 501 and 502 in this embodiment are similar to steps 101 and 102 in the embodiment described in fig. 1 and are not described again here.
503. Matching the hash value of the first data block against a first deduplication fingerprint library stored in memory to determine whether a first matching fingerprint exists, wherein each fingerprint in the first deduplication fingerprint library is a part of the corresponding fingerprint in a second deduplication fingerprint library stored in the capacity layer; if not, executing step 504, and if so, executing step 505;
Generally, in the prior art, only one deduplication fingerprint library is provided, and it resides in the capacity layer. It must store the fingerprints of a large number of data blocks in order to raise the proportion of duplicate data detected during file data compression. As the number of fingerprints grows, on the one hand, the space occupied by the deduplication fingerprint library in the capacity layer increases; on the other hand, the fingerprint matching process must compare against each fingerprint in turn, which increases the IO overhead of data matching.
To solve this problem, a first deduplication fingerprint library is established in memory, in which each fingerprint is a part of the corresponding fingerprint in the second deduplication fingerprint library in the capacity layer, for example its first two bytes. Because hash values are uniformly distributed, the partial fingerprint may equally be taken from the middle or trailing bytes of the corresponding fingerprint; this is not specifically limited here. As a result, on the one hand, memory is read faster than the capacity layer, which improves the deduplication efficiency of the file data; on the other hand, because the fingerprints in the first deduplication fingerprint library are only parts of those in the second deduplication fingerprint library, the first library stores less data, which improves fingerprint matching efficiency and reduces the IO overhead of data matching.
For example, as shown in fig. 6, assume that each fingerprint in the first deduplication fingerprint library is the first two bytes of the corresponding fingerprint in the second deduplication fingerprint library. The probability that two different data blocks are judged identical on the basis of the first deduplication fingerprint library (a collision) is then about 0.00001526 (that is, 1/65536), which means that more than 99.99% of duplicate determinations can be made quickly. If the data to be compressed is 4TB, then at a granularity of 4KB per data block there are 1G data blocks, and the total size of the second deduplication fingerprint library is about 20GB. After the first deduplication fingerprint library is established in memory, it occupies only 2GB of memory, a 90% reduction compared with the second deduplication fingerprint library in the capacity layer, while still allowing 99.99% of queries to be resolved in memory.
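The figures in this example can be reproduced with a few lines of arithmetic. The sketch assumes a 20-byte full fingerprint, which is implied by the 20GB library size for 1G blocks but is not stated explicitly in the text, and it uses binary units (1TB = 2^40 bytes).

```python
PARTIAL_BYTES = 2  # partial fingerprint kept in memory (from the text)
FULL_BYTES = 20    # assumed full fingerprint size, implied by 20GB / 1G blocks

# Chance that an unrelated block shares a given 2-byte prefix,
# assuming uniformly distributed hash values.
collision = 1 / 2 ** (8 * PARTIAL_BYTES)

blocks = (4 * 2 ** 40) // (4 * 2 ** 10)  # 4TB of data in 4KB blocks
second_library = blocks * FULL_BYTES     # full fingerprints in the capacity layer
first_library = blocks * PARTIAL_BYTES   # partial fingerprints in memory
saving = 1 - first_library / second_library
```

With these inputs, `blocks` is 2^30 (1G), the two libraries come to 20GB and 2GB respectively, and the memory saving is 90%, matching the example.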
504. Determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to a log storage unit by log append writing with a preset length as the storage unit, and writing the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer; updating the fingerprint and the partial fingerprint of the first data block into the second deduplication fingerprint library and the first deduplication fingerprint library respectively; and updating metadata information of the first data block into a file metadata area of the capacity layer, wherein the metadata information comprises the physical storage address of the compressed first data block and the length of the compressed first data block;
When the deduplication fingerprint library contains no first matching fingerprint for the first data block, the first data block is a non-duplicate data block and is compressed; the preferred compression algorithm is LZ4. After compression, the compressed first data block is written back to the log storage unit by log append writing with the preset length as the storage unit, and the log storage unit is written back to the capacity layer after it is full. The storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer. The fingerprint (the complete hash value) and the partial fingerprint of the first data block are updated into the second deduplication fingerprint library and the first deduplication fingerprint library respectively, where each fingerprint in the first deduplication fingerprint library is a part of the corresponding fingerprint in the second deduplication fingerprint library.
Further, the metadata information of the first data block must also be updated into the file metadata area. The metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block, so that the first data block can be treated as duplicate data the next time it appears and can later be decompressed and restored according to its metadata information.
It should be noted that the hash value (fingerprint) and the metadata information of the first data block may also be updated into the deduplication fingerprint library together, as long as the first data block can later be decompressed and restored according to its metadata information; the location to which the metadata information is updated is not specifically limited.
Specifically, suppose a non-duplicate data block in the original file occupies 6K after compression. It is written back to the log storage unit with the preset length (for example, 1K) as the storage unit, so it occupies 6 storage units in the log storage unit. Suppose the log storage unit is 200K; it is written back to the capacity layer once the 200K is full. Because the capacity of the log storage unit is an integer multiple of the minimum write unit (4K) of the capacity layer, writing the full log storage unit back never leaves a minimum write unit of the capacity layer partially filled, which avoids wasting storage space in the capacity layer. In addition, an out-of-place update only requires a write operation, whereas an in-place update requires a read operation followed by a write operation, so the out-of-place update of log append writing further improves the IO performance of the capacity layer.
As is readily understood, the storage unit is a unit of space much smaller than the log storage unit and is mainly used to construct the data bitmap in step 507. The smaller the storage unit, the more precisely the space occupation state of the data can be identified, but the larger the data bitmap becomes. The size of the storage unit can therefore be set according to the processor configuration in the actual application and is not specifically limited here. For example, the preset length of the storage unit may also be 2K, 3K, or another value, but it is generally smaller than the minimum write unit (4K) of the capacity layer.
In addition, the beneficial effects of the out-of-place update performed by log append writing in the capacity layer are described in detail in step 104 of the embodiment described in fig. 1 and are not repeated here.
505. If a first matching fingerprint corresponding to the first data block exists in the first deduplication fingerprint library, matching the first data block against the second deduplication fingerprint library stored in the capacity layer to determine whether a second matching fingerprint exists; if so, executing step 506, and if not, executing step 504;
When the hash value of the first data block has a first matching fingerprint in the first deduplication fingerprint library, two different data blocks may still share the same partial hash value, because each fingerprint in the first deduplication fingerprint library is only a part of the corresponding fingerprint in the second deduplication fingerprint library. In other words, a collision can occur in which one fingerprint in the first deduplication fingerprint library corresponds to multiple data blocks. To rule this out, when the first deduplication fingerprint library contains a first matching fingerprint for the first data block, the hash fingerprint of the first data block is further matched against the fingerprints in the second deduplication fingerprint library. Because each fingerprint in the second deduplication fingerprint library is the complete hash value of a data block, one fingerprint corresponds to one data block. That is, the first data block can be determined to be a duplicate data block only if a second matching fingerprint for it exists in the second deduplication fingerprint library.
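The two-stage lookup of steps 503 and 505 can be sketched as follows. The sketch assumes SHA-1 as the hash function and uses Python sets as stand-ins for the in-memory first library and the capacity-layer second library; the function names are hypothetical.

```python
import hashlib

PREFIX_LEN = 2  # the first library keeps the first two bytes of each fingerprint


def register(block: bytes, first_library: set, second_library: set) -> None:
    """Store the full and partial fingerprints of a non-duplicate block (step 504)."""
    fp = hashlib.sha1(block).digest()  # assumed hash function
    second_library.add(fp)
    first_library.add(fp[:PREFIX_LEN])


def is_duplicate(block: bytes, first_library: set, second_library: set) -> bool:
    """Check the partial fingerprint first, then confirm with the full one."""
    fp = hashlib.sha1(block).digest()
    if fp[:PREFIX_LEN] not in first_library:
        return False             # step 503 miss: certainly a new block
    return fp in second_library  # step 505: rule out a prefix collision


first_library, second_library = set(), set()
register(b"hello", first_library, second_library)
```

A miss in the first library is final, so most non-duplicate blocks never touch the second library; a hit in the first library is only a candidate until the full fingerprint confirms it.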
506. Determining that the first data block is duplicate data, and writing metadata information of the first data block back to the metadata area of the capacity layer, wherein the metadata information includes the correspondence among the logical address of the first data block in the data to be compressed, the second matching fingerprint, and the physical address of the second matching fingerprint;
It should be noted that step 506 in this embodiment is similar to step 106 in the embodiment described in fig. 3 and is not described again here.
507. Constructing a data bitmap, wherein the data bitmap records the space occupation state of each storage unit or of multiple storage units, the space occupation state comprises a first state and a second state, the first state is invalid occupation, and the second state is valid occupation;
After the compressed first data block is stored in storage units of the preset length, a data bitmap may further be constructed. The data bitmap records the space occupation state of each storage unit (for example, 1K) or of multiple storage units. The space occupation state of each storage unit is either a first state or a second state: the first state indicates invalid occupation, that is, the data stored in the storage unit is invalid, and the second state indicates valid occupation, that is, the data in the storage unit is valid.
When the compressed first data block is written back to the capacity layer by log append writing with the preset length as the storage unit, the data bitmap is maintained as follows: if the first data block in a storage unit is changed or deleted, the space occupation state recorded for that storage unit changes from the second state to the first state; otherwise, it remains the second state. For convenience, the data bitmap may use the number "0" to represent the first state and the number "1" to represent the second state.
When the first data block is updated (deleted or changed), the compressed first data block is updated out of place by log append writing, that is, it is stored at a new storage address (storage unit), and the data at its original storage address becomes invalid. Therefore, when the first data block in a storage unit is changed or deleted, the space occupation state recorded in the data bitmap for the storage units it occupied before the update is the first state; while the first data block exists normally (has not been updated), the state recorded for its storage units is the second state. Fig. 7A is a schematic diagram of updating the compressed first data block out of place by log append writing; it also shows the correspondence among the logical address of the data block, the matching fingerprint, and the physical address corresponding to the matching fingerprint. Fig. 7B is a schematic diagram of the data bitmap.
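The bitmap maintenance described above can be sketched with a plain list, using 0 for the first state and 1 for the second state and the 1K storage unit from the earlier example. The function names and the 5K size of the updated block are assumptions. As a simplification, units that have never been written are also 0 in this sketch.

```python
UNIT = 1024  # preset storage-unit length (1K, as in the example in the text)


def units_needed(compressed_len: int) -> int:
    """Number of storage units a compressed block occupies (ceiling division)."""
    return -(-compressed_len // UNIT)


def set_state(bitmap: list, first_unit: int, compressed_len: int, state: int) -> None:
    """Mark every storage unit covered by a block as valid (1) or invalid (0)."""
    for i in range(first_unit, first_unit + units_needed(compressed_len)):
        bitmap[i] = state


bitmap = [0] * 200                 # one 200K log storage unit of 1K storage units
set_state(bitmap, 0, 6 * 1024, 1)  # a 6K compressed block occupies units 0 to 5
# The block is updated: the new 5K version is appended at units 6 to 10,
# and the units holding the old version become invalid.
set_state(bitmap, 0, 6 * 1024, 0)
set_state(bitmap, 6, 5 * 1024, 1)
```

After the update, units 0 to 5 read 0 (invalid occupation) and units 6 to 10 read 1 (valid occupation).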
508. Scanning the data bitmap, obtaining the number of storage units in the first state in each log storage unit, and determining whether the proportion of space occupied by the storage units in the first state is greater than a preset space occupation threshold; if so, executing step 509, and if not, executing step 511;
While data blocks are being updated, the space state of many storage units may become invalid occupation, that is, the data stored in them is invalid. To reclaim the space held by invalid data in a timely manner, the data bitmap may be scanned in real time or periodically. During the scan, the number of storage units in the first state in each log storage unit is obtained, and it is determined whether the proportion of space they occupy exceeds the preset space occupation threshold. For example, in a log storage unit of 200K, suppose the number of storage units in the first state (invalid occupation) is 100 and each storage unit is 1K; the storage units in the first state then occupy 50% of the log storage unit. If the preset space occupation threshold is 40%, the 50% occupied by storage units in the first state exceeds the threshold, that is, more than 40% of the data in the log storage unit is invalid, and step 509 is executed; otherwise, step 511 is executed.
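The scan in step 508 can be sketched as follows, with 0 representing the first state and 1 the second state. The function name and the flat-list layout of the bitmap are assumptions; the 200 storage units per log storage unit and the 40% threshold come from the example above.

```python
def log_units_to_reclaim(bitmap, units_per_log=200, threshold=0.40):
    """Return the indices of log storage units whose invalid share exceeds the threshold."""
    result = []
    for start in range(0, len(bitmap), units_per_log):
        chunk = bitmap[start:start + units_per_log]
        invalid_ratio = chunk.count(0) / len(chunk)  # 0 = first state (invalid)
        if invalid_ratio > threshold:
            result.append(start // units_per_log)
    return result


# Two log storage units: the first is 50% invalid, the second 30% invalid.
bitmap = [0] * 100 + [1] * 100 + [0] * 60 + [1] * 140
candidates = log_units_to_reclaim(bitmap)
```

Only the first log storage unit exceeds the 40% threshold, so only it would proceed to step 509.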
509. Deleting the updated data blocks in the storage units in the first state in the log storage unit, obtaining the non-updated data blocks in the storage units in the second state in the log storage unit, migrating the non-updated data blocks to a new storage unit, and updating metadata information of the non-updated data blocks into the file metadata area, wherein the metadata information of a non-updated data block comprises the physical storage address of the non-updated data block and the compressed length of the non-updated data block;
If the proportion of space occupied by storage units in the first state in the log storage unit is greater than the preset space occupation threshold, the log storage unit contains too much invalid data, and a space reclamation operation may be performed on it. That is, the updated data blocks in the storage units in the first state are deleted, the non-updated data blocks in the storage units in the second state are obtained and migrated to a new storage unit (a new storage address), and the metadata information of the non-updated data blocks is updated into the file metadata area. The metadata information of a non-updated data block includes its physical storage address and its compressed length, so that the data can later be decompressed and restored according to this metadata information.
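The reclamation in step 509 can be sketched as below. The data layout is hypothetical: a log storage unit is modelled as a list of `(block_id, data, valid)` tuples, the destination log as a `bytearray`, and the file metadata area as a `dict` from block identifier to `(physical offset, compressed length)`.

```python
def reclaim(log_unit: list, new_log: bytearray, metadata: dict) -> int:
    """Migrate valid blocks out of a log storage unit and drop invalid ones.

    Returns the number of bytes of invalid data that were discarded.
    """
    discarded = 0
    for block_id, data, valid in log_unit:
        if valid:
            # Non-updated block: record its new address and compressed length,
            # then append it to the new location.
            metadata[block_id] = (len(new_log), len(data))
            new_log += data
        else:
            discarded += len(data)  # updated block: its old copy is deleted
    log_unit.clear()                # the whole log storage unit is now free
    return discarded


log_unit = [("a", b"x" * 10, True), ("b", b"y" * 20, False), ("c", b"z" * 5, True)]
new_log = bytearray(b"q" * 3)       # the destination already holds some data
metadata = {}
discarded = reclaim(log_unit, new_log, metadata)
```

After the call, blocks "a" and "c" are found at their new offsets through `metadata`, and the stale copy of "b" is gone.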
510. Classifying the non-updated data blocks into a frequent migration type and a non-frequent migration type, and storing the non-updated data blocks of the non-frequent migration type at fixed locations to avoid migrating them frequently;
When non-updated data blocks are migrated to new storage units, other data blocks in the same log storage unit may be updated repeatedly, which would force the storage address of a non-updated data block to be migrated again and again. To avoid this, the non-updated data blocks can be classified into a frequent migration type and a non-frequent migration type, and the non-updated data blocks of the non-frequent migration type can be stored at fixed locations so that they are not migrated frequently.
Specifically, the non-updated data blocks may be classified along different dimensions:
First, classifying the non-updated data blocks into the frequent migration type and the non-frequent migration type according to the average time interval between consecutive migrations of the non-updated data block within a preset time period;
Specifically, when a non-updated data block is migrated, it may be classified as the frequent or non-frequent migration type according to the average time interval between its consecutive migrations within a preset time period (for example, from April 1 to May 1, 2018). Suppose that space reclamation within the preset time period migrates the non-updated data block 5 times, with 20 minutes between the 1st and 2nd migrations, 10 minutes between the 2nd and 3rd, 30 minutes between the 3rd and 4th, and 40 minutes between the 4th and 5th. The average interval between consecutive migrations is then (20 + 10 + 30 + 40) / 4 = 25 minutes. If the preset time threshold is 10 minutes, the average interval of 25 minutes is greater than the threshold; since a longer average interval means the block is migrated less often, the non-updated data block is of the non-frequent migration type.
It should be noted that the average of the intervals between consecutive migrations within the preset time period may also be replaced by other forms, such as a weighted average, which is not limited here.
Second, classifying the non-updated data blocks into the frequent migration type and the non-frequent migration type according to the number of migrations of the non-updated data block within a preset time period;
Further, the non-updated data blocks may be classified as the frequent or non-frequent migration type according to the number of times they are migrated within a preset time period. For example, suppose that within the preset time period (for example, from April 1 to May 1, 2018) a non-updated data block is migrated 20 times and the preset threshold is 30 times. Because the number of migrations is less than the threshold, the non-updated data block is of the non-frequent migration type.
It should be noted that the above exemplary contents are only an exemplary illustration of the dividing manner, and do not limit the specific dividing manner.
Third, classifying the non-updated data blocks into the frequent migration type and the non-frequent migration type according to the reference count of the fingerprint corresponding to the non-updated data block.
Further, a non-updated data block may be classified as the frequent or non-frequent migration type according to the reference count of its fingerprint within a preset time period. For example, if the fingerprint of the non-updated data block has been referenced 200 times, the block is one that is frequently referenced and not updated. With a preset reference count threshold of 50, the reference count of 200 is greater than the threshold, so the non-updated data block is of the non-frequent migration type.
It should be noted that the above exemplary contents are only an exemplary illustration of the dividing manner, and do not limit the specific dividing manner.
In addition, the non-updated data blocks may be classified along other dimensions, such as the creation time or the compression ratio of the non-updated data block; the basis for classifying the non-updated data blocks is not specifically limited here.
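The three classification dimensions above can be sketched as simple predicates. The function names are hypothetical. The sketch assumes that a longer average interval means less frequent migration, that a block whose fingerprint is referenced no more than the threshold counts as frequently migrated, and that values equal to a threshold fall on the non-frequent side for the first two rules; the text does not fix these boundary cases.

```python
def mean_interval(timestamps):
    """Average gap between consecutive migrations (timestamps in minutes)."""
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return sum(gaps) / len(gaps)


def is_frequent_by_interval(timestamps, threshold_minutes):
    # Short average gaps between migrations mean the block moves often.
    return mean_interval(timestamps) < threshold_minutes


def is_frequent_by_count(migrations, threshold):
    # Many migrations within the preset time period mean the block moves often.
    return migrations > threshold


def is_frequent_by_references(reference_count, threshold):
    # A heavily referenced fingerprint marks a stable, rarely updated block.
    return reference_count <= threshold


# Five migrations with gaps of 20, 10, 30 and 40 minutes.
migration_times = [0, 20, 30, 60, 100]
average = mean_interval(migration_times)
```

With the example values from the text (a 10-minute interval threshold, 20 migrations against a threshold of 30, and 200 references against a threshold of 50), all three predicates report the non-frequent migration type.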
511. Executing other processes;
If the proportion of space occupied by the storage units in the first state in a log storage unit is not greater than the preset space occupation threshold, other processes are executed, which are not specifically limited here.
512. Count management is performed on the number of references to fingerprints in the first and second de-duplication fingerprint repositories.
In order to effectively manage the first duplicate removal fingerprint library and the second duplicate removal fingerprint library of the memory and capacity layer and delete the invalid fingerprints in the first duplicate removal fingerprint library and the second duplicate removal fingerprint library in real time, counting management can be further performed on the number of times of reference of the fingerprints in the first duplicate removal fingerprint library and the second duplicate removal fingerprint library, and specifically, the counting management can be performed from the following two aspects:
when the matching fingerprints of the first data block exist in both a first duplicate removal fingerprint library and a second duplicate removal fingerprint library, performing incremental operation on the reference times of the matching fingerprints;
if the first data block in the compressed data has the first matching fingerprint and the second matching fingerprint in the first duplicate removal fingerprint library and the second duplicate removal fingerprint library respectively, then performing incremental operation, preferably cumulative operation, on the number of times of reference of the first matching fingerprint and the second matching fingerprint respectively, that is, when the first data block has the first matching fingerprint and the second matching fingerprint in the first duplicate removal fingerprint library and the second duplicate removal fingerprint library respectively, performing "+ 1" operation on the number of times of reference of the first matching fingerprint and the second matching fingerprint respectively, of course, the incremental operation may also be multiplicative operation or mixed operation, as long as the incremental operation is positive correlation operation, and no specific limitation is imposed here.
Secondly, when a first data block that references the corresponding first matching fingerprint and second matching fingerprint in the first deduplication fingerprint library and the second deduplication fingerprint library is updated, a decrement operation is performed on the reference counts of the first matching fingerprint and the second matching fingerprint respectively.
Specifically, if the file data corresponding to the first matching fingerprint and the second matching fingerprint in the first deduplication fingerprint library and the second deduplication fingerprint library is deleted or changed, a decrement operation, preferably a subtraction, is performed on the reference count of each of the two fingerprints; that is, when the first data block corresponding to the first matching fingerprint and the second matching fingerprint is deleted or updated, a "-1" operation is performed on the reference counts of the first matching fingerprint and the second matching fingerprint. Of course, the decrement operation may also be a division or a mixed operation, as long as it is a negatively correlated operation, and no specific limitation is imposed here.
In this way, managing the reference counts of the fingerprints in the first deduplication fingerprint library and the second deduplication fingerprint library makes the reference status of every fingerprint clear. When the reference count of a fingerprint reaches 0, the corresponding fingerprint can be deleted from the first and second deduplication fingerprint libraries, and the first data block corresponding to the fingerprint can further be deleted according to the physical address of the fingerprint, so as to free storage space in the capacity layer. When the deletion is performed, the address information of the resulting space fragment is recorded, so that the space fragment information in the capacity layer is known and the storage space of the capacity layer is easier to manage.
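By way of illustration only, the count management described above might be sketched as follows. The class name, the dictionary layout, the 4-byte partial-fingerprint length and the choice to start a newly stored fingerprint at a count of 1 are all assumptions made for the example and are not specified by the application.

```python
class RefCountedFingerprints:
    """Hypothetical model of count management over the two fingerprint libraries."""

    def __init__(self, prefix_len=4):
        self.prefix_len = prefix_len  # assumed length of the partial fingerprint
        self.first = {}    # partial fingerprint -> reference count (memory)
        self.second = {}   # full fingerprint -> [physical address, reference count]
        self.freed = []    # recorded addresses of space fragments

    def add_reference(self, fingerprint: bytes, physical_address: int) -> None:
        # "+1" on both the first and the second matching fingerprint.
        prefix = fingerprint[:self.prefix_len]
        self.first[prefix] = self.first.get(prefix, 0) + 1
        entry = self.second.setdefault(fingerprint, [physical_address, 0])
        entry[1] += 1

    def drop_reference(self, fingerprint: bytes) -> None:
        # "-1" on both fingerprints when the referencing block is deleted or updated.
        prefix = fingerprint[:self.prefix_len]
        entry = self.second[fingerprint]
        entry[1] -= 1
        self.first[prefix] -= 1
        if entry[1] == 0:
            # Count reached 0: delete the fingerprint and record the freed fragment.
            self.freed.append(entry[0])
            del self.second[fingerprint]
        if self.first[prefix] == 0:
            del self.first[prefix]
```

In this sketch the count of a partial fingerprint is the sum of the counts of all full fingerprints that share it, so the partial fingerprint is removed only when none of them is referenced any longer.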
In the embodiment of the application, a first deduplication fingerprint library is established in the memory, and each fingerprint in the first deduplication fingerprint library is part of the content of the corresponding fingerprint in the second deduplication fingerprint library in the capacity layer. After the compressed data is obtained, it is segmented into first data blocks with preset lengths and the hash value of each first data block is calculated; the hash value is then matched against the first deduplication fingerprint library, and when no first matching fingerprint exists in the first deduplication fingerprint library, the first data block is compressed and stored. Because hash values are uniformly distributed, setting the fingerprints in the first deduplication fingerprint library to be part of the content of the corresponding fingerprints in the second deduplication fingerprint library reduces the storage space occupied by the fingerprint library; and because the first deduplication fingerprint library is stored in the memory, fingerprints can be read faster, which improves the IO performance of the storage system.
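As a minimal sketch of the two-level matching summarized above, the following example keeps only a partial fingerprint in memory and consults the full-fingerprint library only when the partial fingerprint matches. The use of SHA-1 as the hash, the 4-byte partial length, and the names `TwoLevelIndex`, `insert` and `lookup` are assumptions made for illustration.

```python
import hashlib

PREFIX_LEN = 4  # assumed length of the partial fingerprint held in memory


def fingerprint(block: bytes) -> bytes:
    # Assumption: SHA-1 stands in for the unspecified hash of the first data block.
    return hashlib.sha1(block).digest()


class TwoLevelIndex:
    def __init__(self):
        self.first_library = set()   # partial fingerprints (memory)
        self.second_library = {}     # full fingerprint -> physical address (capacity layer)
        self.capacity_lookups = 0    # how often the capacity layer had to be consulted

    def insert(self, block: bytes, physical_address: int) -> None:
        fp = fingerprint(block)
        self.first_library.add(fp[:PREFIX_LEN])
        self.second_library[fp] = physical_address

    def lookup(self, block: bytes):
        """Return the physical address of a duplicate block, or None if the block is new."""
        fp = fingerprint(block)
        if fp[:PREFIX_LEN] not in self.first_library:
            return None  # no first matching fingerprint: non-duplicate, no capacity-layer read
        self.capacity_lookups += 1
        return self.second_library.get(fp)  # second matching fingerprint, if any
```

A miss in the in-memory library settles the question without touching the capacity layer, which is where the reduction in fingerprint reads comes from in this sketch.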
Secondly, after the non-duplicate data blocks are compressed, they are written back to the log storage unit by log append-writing, with a preset length as the storage unit, and the log storage unit is written back to the capacity layer once it is full. A data bitmap table is constructed to record the space occupation state of each storage unit, so that when the first data block in a storage unit is updated, the entry for that storage unit in the data bitmap table changes from the second state to the first state. When space is reclaimed later, the space occupation state of each log storage unit in the capacity layer can be obtained by scanning the data bitmap table. When the space occupied by storage units in the first state in a log storage unit is greater than a preset space occupation threshold, the updated data blocks in the first state are deleted and the non-updated data blocks in the second state are migrated to a new storage unit. During migration, the non-updated data blocks are divided into frequent-migration and infrequent-migration types, and the non-updated data blocks that are not frequently migrated are stored in a fixed location, which avoids migrating them repeatedly and improves the IO performance of the full flash memory array.
Continuing with the description of the full flash array based data compression method in the present application, referring to fig. 8, another embodiment of the full flash array based data compression method in the present application includes:
801. acquiring compressed data in the performance layer;
It should be noted that step 801 in this embodiment is similar to step 101 in the embodiment described in fig. 1, and is not described again here.
802. Segmenting the compressed data into first data blocks with preset lengths, and calculating weak hash values of the first data blocks;
When data is compressed, it is generally segmented into first data blocks with a preset length, where the segmentation granularity may be 2K, 4K, 8K or another size, and the weak hash value of each first data block is calculated after or during segmentation.
Unlike step 502 in the embodiment illustrated in fig. 5, what is calculated after the first data block is segmented is its weak hash value. A weak hash generally refers to a hash algorithm that is fast to compute and has a short hash length (generally 4-8 bytes); typical weak hashes include crc32 and xxhash, and they are mainly used for fast data checking and retrieval. A strong hash value, by contrast, is generally long (greater than 20 bytes) and takes longer to check and retrieve.
It should be noted that, when the data blocks are segmented, the preset length may be set according to the actual requirements of the specific application, and is not specifically limited here.
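Step 802 might be sketched as follows, assuming fixed-length segmentation at a 4K granularity and using crc32 (one of the weak hashes named above) through Python's standard `zlib` module. The function names `split_blocks` and `weak_hash` are hypothetical.

```python
import zlib


def split_blocks(data: bytes, block_size: int = 4096) -> list:
    """Segment data into first data blocks of a preset length (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def weak_hash(block: bytes) -> bytes:
    """Return the 4-byte crc32 of a block, used here as its weak hash value."""
    return zlib.crc32(block).to_bytes(4, "big")
```

For example, `split_blocks(b"x" * 10000)` yields blocks of 4096, 4096 and 1808 bytes, and `weak_hash` returns a 4-byte value for each of them.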
803. Matching the weak hash value of the first data block with a duplicate removal fingerprint database in the capacity layer to determine whether a matched fingerprint exists, if not, executing step 804, and if so, executing step 805;
unlike the prior art, in the present application, a weak hash value of a first data chunk is calculated, and then the weak hash value of the first data chunk is matched with a deduplication fingerprint library in a capacity layer to determine whether a matching fingerprint exists.
A weak hash generally refers to a hash algorithm that is fast to compute and has a short hash length (generally 4-8 bytes); typical weak hashes include crc32 and xxhash, and they are mainly used for fast data checking and retrieval. Calculating the weak hash value of the first data block can therefore speed up the search for the first data block in the deduplication fingerprint library, which improves the deduplication efficiency of the data blocks and, in turn, the deduplication and compression efficiency of the data.
804. Determining that a first data block is a non-duplicated data block, compressing the first data block, writing the compressed first data block back to a log storage unit in a log additional writing mode by taking a preset length as a storage unit, writing the log storage unit back to a capacity layer after the log storage unit is full, wherein a storage space of the log storage unit is an integral multiple of a minimum writing unit of the capacity layer, updating a fingerprint of the first data block into a deduplication fingerprint library, and updating metadata information of the first data block to a file metadata area in the capacity layer, wherein the metadata information comprises: the physical storage address of the compressed first data block and the length of the compressed first data block;
It should be noted that step 804 in this embodiment differs from step 503 in the embodiment shown in fig. 5 in that only one deduplication fingerprint library exists in the capacity layer in this embodiment, so only the fingerprint of the first data block needs to be updated into that library, whereas step 503 updates the fingerprint and the partial fingerprint of the first data block into the second deduplication fingerprint library and the first deduplication fingerprint library respectively. The rest of the description is similar to step 503 and is not repeated here.
In addition, the metadata information of the first data block may also be updated to the duplicate removal fingerprint library in the capacity layer, as long as decompression and recovery of the first data block can be achieved, and the update address of the metadata information of the first data block is not particularly limited herein.
805. Reading an original data block corresponding to the matched fingerprint, and matching the first data block with the original data block to determine whether the first data block and the original data block are completely the same, if so, executing step 806, and if not, executing step 804;
When a matching fingerprint corresponding to the weak hash value of the first data block exists in the deduplication fingerprint library, the original data block corresponding to the matching fingerprint is read according to the physical address of the matching fingerprint, and the first data block is compared with the original data block to determine whether the two are completely the same.
Because weak hash values have a certain collision rate, one weak hash value may correspond to multiple data blocks. To ensure that a fingerprint match really identifies one and the same data block, the original data block corresponding to the matching fingerprint is read according to the physical address of the matching fingerprint and compared with the first data block. If the two are completely the same, the first data block is determined to be a duplicate data block and step 806 is executed; otherwise, the first data block is determined to be a non-duplicate data block and step 804 is executed.
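The decision made in steps 803 to 806 might be sketched as follows. This is a simplified illustration under several assumptions: compression and metadata write-back are omitted, a Python list stands in for the capacity layer, the class name `WeakHashDedup` and the injectable `hash_fn` parameter are hypothetical, and on a collision the sketch simply keeps the existing library entry, a point the description does not settle.

```python
import zlib


class WeakHashDedup:
    def __init__(self, hash_fn=zlib.crc32):
        self.hash_fn = hash_fn   # weak hash; crc32 by default
        self.library = {}        # weak hash value -> physical address of the stored block
        self.storage = []        # stand-in for the capacity layer; index = physical address

    def write(self, block: bytes):
        """Return (physical_address, is_duplicate) for one first data block."""
        key = self.hash_fn(block)
        address = self.library.get(key)
        # Step 805: a fingerprint match must be confirmed by comparing the blocks themselves.
        if address is not None and self.storage[address] == block:
            return address, True            # step 806: duplicate data block
        # Step 804: no match, or a weak-hash collision; store as a non-duplicate block.
        self.storage.append(block)
        new_address = len(self.storage) - 1
        self.library.setdefault(key, new_address)  # assumption: first entry is kept on collision
        return new_address, False
```

Passing a deliberately poor hash such as `lambda block: 0` forces every block to collide and shows that the byte-level comparison still keeps distinct blocks apart.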
806. Determining that the first data block is a repeated data block, and writing back metadata information of the first data block to a metadata area of a capacity layer, wherein the metadata information comprises a corresponding relation between a logical address of the first data block in compressed data, a matching fingerprint and a physical address of the matching fingerprint;
it should be noted that step 806 in this embodiment is similar to step 106 in the embodiment described in fig. 3, and is not described here again.
807. Constructing a data bitmap table, wherein the data bitmap table is used for recording the space occupation state corresponding to each storage unit or to a plurality of storage units, the space occupation states comprise a first state and a second state, the first state is invalid occupation, and the second state is valid occupation;
808. scanning the data bitmap table, acquiring the number of storage units in the first state in each log storage unit, and judging whether the space occupied by the storage units in the first state is larger than a preset space occupation threshold; if so, executing step 809, and if not, executing step 811;
809. deleting an updated data block in a storage unit in a first state in a log storage unit, acquiring an un-updated data block in the storage unit in a second state in the log storage unit, migrating the un-updated data block to a new storage unit, and updating metadata information of the un-updated data block to a file metadata area, wherein the metadata information of the un-updated data block comprises a physical storage address of the un-updated data block and a compressed length of the un-updated data block;
810. dividing the non-updated data blocks into frequent migration and non-frequent migration types, and fixedly storing the non-updated data blocks which are not frequently migrated to avoid frequent migration of the non-updated data blocks which are not frequently migrated;
811. executing other processes;
it should be noted that steps 807 to 811 in this embodiment are similar to steps 507 to 511 in the embodiment described in fig. 5, and are not described again here.
812. Count management is performed on the number of references to fingerprints in the deduplication fingerprint library.
In order to clarify space occupation information of invalid data in the capacity layer, that is, invalid data information generated after the storage data in the original storage space in the capacity layer is updated, counting management may be performed on the number of references of the fingerprints in the deduplication fingerprint library, and specifically, counting management may be performed through the following two aspects:
firstly, when the matched fingerprint of the first data block exists in the duplicate removal fingerprint library, performing incremental operation on the reference times of the matched fingerprint;
If the weak hash value of the first data block in the compressed data has a matching fingerprint in the deduplication fingerprint library, and the first data block is identical to the original data block corresponding to the matching fingerprint, an increment operation, preferably an addition, is performed on the reference count of the matching fingerprint; that is, a "+1" operation is performed on the reference count of the matching fingerprint. Of course, the increment operation may also be a multiplication or a mixed operation, as long as it is a positively correlated operation, and no specific limitation is imposed here.
And secondly, when the first data block which refers to the matched fingerprint in the duplicate removal fingerprint database is updated, performing decreasing operation on the reference times of the matched fingerprint.
Specifically, if the file data corresponding to a matching fingerprint in the deduplication fingerprint library is deleted or changed, a decrement operation, preferably a subtraction, is performed on the reference count of the matching fingerprint; that is, when the first data block corresponding to the matching fingerprint is deleted or updated, a "-1" operation is performed on the reference count of the matching fingerprint. Of course, the decrement operation may also be a division or a mixed operation, as long as it is a negatively correlated operation, and no specific limitation is imposed here.
In this way, managing the reference counts of the fingerprints in the deduplication fingerprint library makes the reference status of every fingerprint clear. When the reference count of a fingerprint reaches 0, the original data block corresponding to the fingerprint can be deleted according to the physical address of the fingerprint, so as to free storage space in the capacity layer. When the deletion is performed, the address information of the resulting space fragment is recorded, so that the space fragment information in the capacity layer is known and the storage space of the capacity layer is easier to manage.
In the embodiment of the application, after the compressed data is obtained from the performance layer, it is segmented into first data blocks with preset lengths, the weak hash value of each first data block is calculated, and the weak hash value is matched against the deduplication fingerprint library. When no fingerprint matching the weak hash value of the first data block exists in the deduplication fingerprint library, the first data block is determined to be a non-duplicate data block and is then compressed and stored. Because a weak hash generally refers to a hash algorithm that is fast to compute and has a short hash length (generally 4-8 bytes), and is mainly used for fast data checking and retrieval, calculating the weak hash value of the first data block improves the efficiency of retrieving the first data block in the deduplication fingerprint library and thereby improves the deduplication efficiency of the data blocks.
Secondly, after the non-duplicate data blocks are compressed, they are written back to the log storage unit by log append-writing, with a preset length as the storage unit, and the log storage unit is written back to the capacity layer once it is full. A data bitmap table is constructed to record the space occupation state of each storage unit, so that when the first data block in a storage unit is updated, the entry for that storage unit in the data bitmap table changes from the second state to the first state. When space is reclaimed later, the space occupation state of each log storage unit in the capacity layer can be obtained by scanning the data bitmap table. When the space occupied by storage units in the first state in a log storage unit is greater than a preset space occupation threshold, the updated data blocks in the first state are deleted and the non-updated data blocks in the second state are migrated to a new storage unit. During migration, the non-updated data blocks are divided into frequent-migration and infrequent-migration types, and the non-updated data blocks that are not frequently migrated are stored in a fixed location, which avoids migrating them repeatedly and improves the IO performance of the full flash memory array.
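The space reclamation of steps 807 to 809 might be sketched as follows. The sketch assumes one bitmap entry per storage unit, expresses the threshold as a fraction of the log storage unit, and leaves out the metadata update and the frequent/infrequent classification; the names `LogStorageUnit`, `invalid_ratio` and `reclaim` are hypothetical.

```python
from dataclasses import dataclass

INVALID, VALID = 0, 1  # first state (invalid occupation), second state (valid occupation)


@dataclass
class LogStorageUnit:
    slots: list    # payload held by each storage unit
    bitmap: list   # INVALID or VALID for each storage unit


def invalid_ratio(unit: LogStorageUnit) -> float:
    """Fraction of storage units in the first state, obtained by scanning the bitmap."""
    if not unit.bitmap:
        return 0.0
    return unit.bitmap.count(INVALID) / len(unit.bitmap)


def reclaim(unit: LogStorageUnit, threshold: float, new_unit: LogStorageUnit) -> bool:
    """Migrate still-valid blocks to new_unit and empty unit if it exceeds the threshold."""
    if invalid_ratio(unit) <= threshold:
        return False  # step 811: nothing to reclaim here
    for payload, state in zip(unit.slots, unit.bitmap):
        if state == VALID:
            # Step 809: non-updated blocks move to the new storage unit.
            new_unit.slots.append(payload)
            new_unit.bitmap.append(VALID)
    # Updated (invalid) blocks are dropped together with the old unit's contents.
    unit.slots.clear()
    unit.bitmap.clear()
    return True
```

With three of four storage units invalid, for instance, a threshold of 0.5 triggers reclamation while a threshold of 0.9 does not.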
The data compression method based on a full flash memory array in the embodiment of the present application is described above. With reference to fig. 9, the data compression system based on a full flash memory array in the embodiment of the present application is described below. An embodiment of the system includes:
an obtaining unit 901 configured to obtain compressed data in the performance layer;
a segmentation calculating unit 902, configured to segment the compressed data into a first data block with a preset length, and calculate a hash value of the first data block;
a matching unit 903, configured to match the hash value of the first data chunk with a duplicate removal fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
a compressing unit 904, configured to, when the matching fingerprint does not exist, determine that the first data block is a non-duplicate data block, compress the first data block, write back the compressed first data block to the capacity layer in a manner of adding a log, update the fingerprint of the first data block into the deduplication fingerprint library, and update metadata information of the first data block into a file metadata area in the capacity layer, where the metadata information includes: the physical storage address of the first data block after compression and the length of the first data block after compression.
In this embodiment of the application, the obtaining unit 901 obtains compressed data from a performance layer in a full flash memory array, the segmentation calculating unit 902 segments the data into first data blocks with preset lengths, calculates hash values of the first data blocks, and the matching unit 903 matches the hash values of the first data blocks with a duplicate removal fingerprint library of a capacity layer to determine whether a matching fingerprint exists, and when the matching fingerprint does not exist, determines that the first data blocks are non-duplicate data blocks, compresses the first data blocks, and stores the compressed first data blocks in the capacity layer in a manner of additionally writing a log.
Log append-writing has the characteristics of sequential writing and out-of-place updating. That is, when the file corresponding to the first data block is updated, the new data block in the file is compressed and then appended in time order: the new data block is stored at a new storage address in the storage medium (an out-of-place update) rather than overwriting the storage address of the original first data block. This avoids the problem that, after the file data is updated, the length of the compressed new data block does not match the storage space of the compressed first data block, which in turn avoids wasting storage space in the storage medium, avoids generating small space fragments in the storage medium, and improves the utilization of the storage space in the storage medium.
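The out-of-place update described above might be illustrated with the following sketch, in which a `bytearray` stands in for the storage medium and `zlib` stands in for the unspecified compression algorithm. The class name `AppendLog` and its methods are hypothetical.

```python
import zlib


class AppendLog:
    def __init__(self):
        self.log = bytearray()   # stand-in for the storage medium, written sequentially
        self.metadata = {}       # logical address -> (physical offset, compressed length)

    def write(self, logical_address: int, block: bytes):
        """Append the compressed block; return the stale (offset, length) it supersedes, if any."""
        compressed = zlib.compress(block)
        offset = len(self.log)           # always a new address at the tail of the log
        self.log += compressed
        stale = self.metadata.get(logical_address)
        self.metadata[logical_address] = (offset, len(compressed))
        return stale

    def read(self, logical_address: int) -> bytes:
        offset, length = self.metadata[logical_address]
        return zlib.decompress(bytes(self.log[offset:offset + length]))
```

Because the updated block is never written over the old one, it does not matter that its compressed length differs from that of the block it replaces; the stale extent returned by `write` is what later space reclamation would mark as invalid.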
Referring to fig. 10, the data compression system based on a full flash memory array in the embodiment of the present application is described in detail below. Another embodiment of the system includes:
an acquisition unit 1001 configured to acquire compressed data in the performance layer;
a segmentation calculating unit 1002, configured to segment the compressed data into a first data block with a preset length, and calculate a hash value of the first data block;
a matching unit 1003, configured to match the hash value of the first data chunk with a duplicate removal fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
a compressing unit 1004, configured to, when the matching fingerprint does not exist, determine that the first data block is a non-duplicate data block, compress the first data block, write back the compressed first data block to the capacity layer in a manner of adding a log, update the fingerprint of the first data block into the deduplication fingerprint library, and update metadata information of the first data block to a file metadata area of the capacity layer, where the metadata information includes: the physical storage address of the first data block after compression and the length of the first data block after compression.
Preferably, the system further comprises:
a deduplication unit 1005, configured to determine that the first data chunk is duplicate data when the matching fingerprint exists, and write back metadata information of the first data chunk to a metadata area of the capacity layer, where the metadata information includes a correspondence relationship between a logical address of the first data chunk in the compressed data, the matching fingerprint, and a physical address of the matching fingerprint.
Preferably, the system further comprises:
a counting unit 1006, configured to perform counting management on the number of references of the fingerprints in the deduplication fingerprint library.
Preferably, the counting unit 1006 includes:
a first counting module 10061, configured to, when a matching fingerprint of the first data block exists in the deduplication fingerprint library, perform an increment operation on the number of references of the matching fingerprint;
and
a second counting module 10062, configured to perform a decrement operation on the number of references of the matching fingerprint when the first data block referencing the matching fingerprint in the deduplication fingerprint library is updated.
Preferably, the compressing unit 1004 includes:
a compression module 10041, configured to determine that the first data block is a non-duplicate data block when the matching fingerprint does not exist, compress the first data block, write back the compressed first data block to a log storage unit in a manner of additionally writing a log, and write back the log storage unit to the capacity layer after the log storage unit is full, where a storage space of the log storage unit is an integer multiple of a minimum write-in unit of the capacity layer, update the fingerprint of the first data block to the duplicate removal fingerprint database, and update metadata information of the first data block to a file metadata area of the capacity layer, where the metadata information includes: the physical storage address of the first data block after compression and the length of the first data block after compression.
In this embodiment of the application, the obtaining unit 1001 obtains compressed data from a performance layer in a full flash memory array, the segmentation calculating unit 1002 segments the data into first data blocks with preset lengths, calculates hash values of the first data blocks, and the matching unit 1003 matches the hash values of the first data blocks with a duplicate removal fingerprint library of a capacity layer to determine whether a matching fingerprint exists, and when the matching fingerprint does not exist, determines that the first data blocks are non-duplicate data blocks, the first data blocks are compressed, and the compressed first data blocks are stored in the capacity layer in a manner of additionally writing a log.
Log append-writing has the characteristics of sequential writing and out-of-place updating. That is, when the file corresponding to the first data block is updated, the new data block in the file is compressed and then appended in time order: the new data block is stored at a new storage address in the storage medium (an out-of-place update) rather than overwriting the storage address of the original first data block. This avoids the problem that, after the file data is updated, the length of the compressed new data block does not match the storage space of the compressed first data block, which in turn avoids wasting storage space in the storage medium, avoids generating small space fragments in the storage medium, and improves the utilization of the storage space in the storage medium.
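The buffering performed by the compression module 10041, in which compressed blocks accumulate in a log storage unit whose size is an integer multiple of the minimum write unit of the capacity layer, might be sketched as follows. Zero-padding a partly filled unit and flushing early when the next block does not fit are assumptions of this example, as are the name `LogStorageUnitBuffer` and its parameters.

```python
class LogStorageUnitBuffer:
    def __init__(self, min_write_unit: int = 4096, multiple: int = 4):
        # The log storage unit is an integer multiple of the minimum write unit.
        self.capacity = min_write_unit * multiple
        self.buffer = bytearray()
        self.flushed = []   # stand-in for the capacity layer: one entry per written unit

    def append(self, compressed: bytes) -> None:
        """Append one compressed block, flushing first if it would overflow the unit."""
        if len(compressed) > self.capacity:
            raise ValueError("compressed block is larger than the log storage unit")
        if len(self.buffer) + len(compressed) > self.capacity:
            self.flush()
        self.buffer += compressed

    def flush(self) -> None:
        """Write the unit back to the capacity layer as one full-size, aligned write."""
        padded = bytes(self.buffer).ljust(self.capacity, b"\0")  # assumption: zero padding
        self.flushed.append(padded)
        self.buffer.clear()
```

Every write that reaches the capacity layer in this sketch has exactly the size of the log storage unit, so it always covers a whole number of minimum write units.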
Referring to fig. 11, the data compression system based on a full flash memory array in the embodiment of the present application is described in detail below. Another embodiment of the system includes:
an acquisition unit 1101 configured to acquire data compressed in the performance layer;
a first segmentation calculating unit 1102, configured to segment the compressed data into first data blocks with preset lengths, and calculate hash values of the first data blocks;
a first matching unit 1103, configured to match the hash value of the first data block with a first duplicate removal fingerprint library stored in the memory, so as to determine whether a first matching fingerprint exists, where a fingerprint of the first duplicate removal fingerprint library is a partial content of a corresponding fingerprint stored in a second duplicate removal fingerprint library in the capacity layer;
a compressing unit 1104, configured to determine that a first data block is a non-duplicate data block when a first matching fingerprint does not exist in a first deduplication fingerprint library, compress the first data block, write back the compressed first data block to a log storage unit in a log additional writing manner with a preset length as a storage unit, write back the log storage unit to a capacity layer after the log storage unit is full, where a storage space of the log storage unit is an integer multiple of a minimum writing unit of the capacity layer, correspondingly update fingerprints and partial fingerprints of the first data block to a second deduplication fingerprint library and the first deduplication fingerprint library, respectively, and update metadata information of the first data block to a file metadata area of the capacity layer, where the metadata information includes: the physical storage address and the length of the first data block after compression;
a second matching unit 1105, configured to match the first data block with a second duplicate removal fingerprint library stored in the capacity layer to determine whether a second matching fingerprint exists when the first matching fingerprint exists in the first duplicate removal fingerprint library;
a deduplication unit 1106, configured to determine, when a second matching fingerprint exists in a second deduplication fingerprint library, that the first data block is duplicate data, and write back metadata information of the first data block to a metadata area of a capacity layer, where the metadata information includes a correspondence relationship between a logical address of the first data block in compressed data, the second matching fingerprint, and a physical address of the second matching fingerprint;
a constructing unit 1107, configured to construct a data bitmap table, where the data bitmap table is used to record a space occupation state corresponding to each storage unit or multiple storage units, and the space occupation state includes a first state and a second state, where the first state is invalid occupation and the second state is valid occupation;
the scanning unit 1108 is configured to scan the data bit map, obtain the number of the storage units in the first state in each log storage unit, and determine whether the space occupied by the storage unit in the first state is greater than a preset space occupation threshold;
a deletion migration unit 1109, configured to delete the updated data block in the storage unit in the first state in the log storage unit when the space occupied by the storage unit in the first state is greater than a preset space occupation threshold, acquire an un-updated data block in the storage unit in the second state in the log storage unit, migrate the un-updated data block to a new storage unit, and update metadata information of the un-updated data block to the file metadata area at the same time, where the metadata information of the un-updated data block includes a physical storage address of the un-updated data block and a length of the un-updated data block after compression;
the type dividing unit 1110 is configured to divide the non-updated data blocks into frequent migration types and non-frequent migration types, and perform fixed storage on the non-updated data blocks that are not frequently migrated, so as to avoid frequent migration of the non-updated data blocks that are not frequently migrated;
an execution unit 1111, configured to execute other processes when a space occupied by the storage unit in the first state is not greater than a preset space occupation threshold;
a first counting unit 1112, configured to perform count management on the number of references of fingerprints in the first and second de-duplication fingerprint libraries.
Preferably, the type dividing unit 1110 includes:
a first dividing module 11101, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the average time interval between successive migrations of the non-updated data block within a preset time period;
a second dividing module 11102, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the migration frequency of the non-updated data block within a preset time period;
a third dividing module 11103, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the reference count of the fingerprint corresponding to the non-updated data block.
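The first two classification criteria can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function names, the thresholds, and the representation of a block's history as a list of migration timestamps are all assumptions. The third criterion (fingerprint reference count) is omitted because the text does not state which direction of the count indicates frequent migration.

```python
from statistics import mean

def classify_by_mean_interval(migration_times, interval_threshold):
    """Frequent if the mean gap between successive migrations is below the threshold.

    migration_times: sorted timestamps of past migrations (hypothetical bookkeeping).
    """
    if len(migration_times) < 2:
        return "infrequent"  # assumption: too little history counts as infrequent
    gaps = [later - earlier
            for earlier, later in zip(migration_times, migration_times[1:])]
    return "frequent" if mean(gaps) < interval_threshold else "infrequent"

def classify_by_count(migration_times, window_start, window_end, count_threshold):
    """Frequent if the block migrated at least count_threshold times in the window."""
    migrations = sum(window_start <= t < window_end for t in migration_times)
    return "frequent" if migrations >= count_threshold else "infrequent"

# Hypothetical histories: one block migrated every 10 time units, one migrated rarely.
hot_block = [0, 10, 20, 30]
cold_block = [0, 500]

hot_by_interval = classify_by_mean_interval(hot_block, interval_threshold=60)
cold_by_interval = classify_by_mean_interval(cold_block, interval_threshold=60)
hot_by_count = classify_by_count(hot_block, 0, 100, count_threshold=3)
cold_by_count = classify_by_count(cold_block, 0, 100, count_threshold=3)
```

Either function yields a label that a reclamation routine could consult before deciding where to place a block.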
Preferably, the first counting unit 1112 includes:
a first counting module 11121, configured to increment the reference count of the matching fingerprints when matching fingerprints of the first data block exist in both the first deduplication fingerprint library and the second deduplication fingerprint library;
a second counting module 11122, configured to decrement the reference counts of the first matching fingerprint and the second matching fingerprint respectively when the first data block that references the corresponding first matching fingerprint and second matching fingerprint in the first deduplication fingerprint library and the second deduplication fingerprint library is updated.
In this embodiment of the application, a first deduplication fingerprint library is established in memory, and each fingerprint in the first deduplication fingerprint library is part of the content of the corresponding fingerprint in a second deduplication fingerprint library in the capacity layer. After the compressed data is obtained, it is divided into first data blocks of a preset length, the hash value of each first data block is calculated, and the hash value is matched against the first deduplication fingerprint library; when no first matching fingerprint exists in the first deduplication fingerprint library, the first data block is compressed and stored. Because hash values are uniformly distributed, each fingerprint in the first deduplication fingerprint library can be set to part of the content of the corresponding fingerprint in the second deduplication fingerprint library. This reduces the storage space occupied by the deduplication fingerprint library, and because the first deduplication fingerprint library is kept in memory, it also increases the speed at which fingerprints are read, thereby improving the IO performance of the storage system.
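The two-level lookup described above can be sketched as follows. This is a minimal sketch under stated assumptions: the class name, the 4-byte prefix length, the use of SHA-1 as the fingerprint, and the in-process dictionary standing in for the capacity-layer library are all hypothetical. Treating a block as non-duplicate when the partial fingerprint matches but the full fingerprint does not is also an assumption.

```python
import hashlib

PREFIX_LEN = 4  # assumed number of fingerprint bytes kept in memory

def fingerprint_of(block: bytes) -> bytes:
    """Full fingerprint of a block (SHA-1 is an illustrative choice)."""
    return hashlib.sha1(block).digest()

class TwoTierFingerprintIndex:
    def __init__(self):
        self.memory_prefixes = set()  # first library: partial fingerprints, in memory
        self.capacity_index = {}      # second library: full fingerprint -> physical address
        self.capacity_lookups = 0     # counts simulated reads of the capacity layer

    def add(self, fingerprint: bytes, physical_address: int) -> None:
        """Register a new block in both libraries."""
        self.capacity_index[fingerprint] = physical_address
        self.memory_prefixes.add(fingerprint[:PREFIX_LEN])

    def lookup(self, fingerprint: bytes):
        """Return the physical address of a duplicate, or None for a new block."""
        if fingerprint[:PREFIX_LEN] not in self.memory_prefixes:
            return None  # no first matching fingerprint: skip the capacity layer
        self.capacity_lookups += 1
        return self.capacity_index.get(fingerprint)

index = TwoTierFingerprintIndex()
index.add(fingerprint_of(b"A" * 4096), physical_address=0)
hit = index.lookup(fingerprint_of(b"A" * 4096))
miss = index.lookup(fingerprint_of(b"B" * 4096))
```

Most non-duplicate blocks are rejected by the in-memory set alone, so the slower second library is consulted only for likely duplicates.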
Secondly, after a non-duplicate data block is compressed, it is written back to the log storage unit in a log-append manner, with a preset length as the storage unit, and the log storage unit is written back to the capacity layer once it is full. A data bitmap table is constructed to record the space occupation state of each storage unit, so that when the first data block in a storage unit is updated, the entry for that storage unit in the data bitmap table changes from the second state to the first state. When space is later reclaimed, the space occupation state of each log storage unit in the capacity layer can be obtained by scanning the data bitmap table. When the space occupied by the storage units in the first state in a log storage unit is greater than a preset space occupation threshold, the updated data blocks in the storage units in the first state are deleted, and the non-updated data blocks in the storage units in the second state are migrated to a new storage unit. During migration, the non-updated data blocks are classified as frequently migrated or infrequently migrated, and the non-updated data blocks that are not frequently migrated are stored at a fixed location, which avoids migrating them frequently and improves the IO performance of the full flash memory array.
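The bitmap-driven space reclamation described above can be sketched as follows. This is an illustrative sketch only: the function name, the list-based bitmap, and expressing the threshold as a fraction of invalid storage units are assumptions, and the migration itself is reduced to returning the blocks that would be moved.

```python
INVALID, VALID = 0, 1  # first state (invalid occupation), second state (valid occupation)

def reclaim(log_unit, bitmap, threshold):
    """Return the still-valid blocks to migrate, or None if the unit is left alone.

    log_unit:  list of blocks, one per storage unit inside the log storage unit
    bitmap:    list of INVALID/VALID flags, one per storage unit
    threshold: assumed fraction of invalid storage units that triggers reclamation
    """
    invalid_units = bitmap.count(INVALID)
    if invalid_units / len(bitmap) <= threshold:
        return None  # below the threshold: execute other processes
    # Updated (invalid) blocks are dropped; non-updated (valid) blocks are migrated.
    return [block for block, state in zip(log_unit, bitmap) if state == VALID]

log_unit = [b"blk0", b"blk1", b"blk2", b"blk3"]
migrated = reclaim(log_unit, [VALID, INVALID, INVALID, INVALID], threshold=0.5)
untouched = reclaim(log_unit, [VALID, VALID, VALID, INVALID], threshold=0.5)
```

A fuller version would also write the migrated blocks to a new storage unit and update their physical address and compressed length in the file metadata area.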
Continuing with the description of the data compression system based on the full flash memory array in the embodiment of the present application, referring to fig. 12, another embodiment of the data compression system based on the full flash memory array in the embodiment of the present application includes:
an obtaining unit 1201, configured to obtain compressed data in a performance layer;
a second segmentation calculating unit 1202, configured to segment the compressed data into a first data block with a preset length, and calculate a weak hash value of the first data block;
a matching unit 1203, configured to match the weak hash value of the first data block with a duplicate removal fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
a compressing unit 1204, configured to determine that the first data block is a non-duplicate data block when there is no matching fingerprint, compress the first data block, write back the compressed first data block to the log storage unit in a manner of additionally writing a log with a preset length as a storage unit, and write back the log storage unit to the capacity layer after the log storage unit is full, where a storage space of the log storage unit is an integer multiple of a minimum write-in unit of the capacity layer, update the fingerprint of the first data block to a duplicate removal fingerprint library, and update metadata information of the first data block to a file metadata area of the capacity layer, where the metadata information includes: the physical storage address and the length of the first data block after compression;
a reading matching unit 1205, configured to read an original data block corresponding to a matching fingerprint when the matching fingerprint exists, and match the first data block with the original data block to determine whether the first data block is identical to the original data block;
a deduplication unit 1206, configured to determine, when the first data block is identical to the original data block, that the first data block is a duplicate data block, and write back metadata information of the first data block to a metadata area of a capacity layer, where the metadata information includes a correspondence relationship between a logical address of the first data block in the compressed data, a matching fingerprint, and a physical address of the matching fingerprint;
a constructing unit 1207, configured to construct a data bitmap table, where the data bitmap table is used to record a space occupation state corresponding to each storage unit or multiple storage units, and the space occupation state includes a first state and a second state, where the first state is invalid occupation and the second state is valid occupation;
a scanning unit 1208, configured to scan the data bitmap table, obtain the number of storage units in the first state in each log storage unit, and determine whether the space occupied by the storage units in the first state is greater than a preset space occupation threshold;
a deletion migration unit 1209, configured to delete an updated data block in a storage unit in the first state in the log storage unit when a space occupied by the storage unit in the first state is greater than a preset space occupation threshold, acquire an un-updated data block in a storage unit in the second state in the log storage unit, migrate the un-updated data block to a new storage unit, and update metadata information of the un-updated data block to the file metadata area, where the metadata information of the un-updated data block includes a physical storage address of the un-updated data block and a length of the un-updated data block after compression;
a type dividing unit 1210, configured to classify the non-updated data blocks as frequently migrated or infrequently migrated, and to store the infrequently migrated non-updated data blocks at a fixed location so as to avoid migrating them frequently;
an execution unit 1211, configured to execute other processes when a space occupied by the storage unit in the first state is not greater than a preset space occupation threshold;
and a second counting unit 1212, configured to perform counting management on the number of references of the fingerprint in the deduplication fingerprint library.
Preferably, the type division unit 1210 includes:
a first type division module 12101, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the average time interval between successive migrations of the non-updated data block within a preset time period;
a second type division module 12102, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the migration frequency of the non-updated data block within a preset time period;
a third type division module 12103, configured to classify the non-updated data block as frequently migrated or infrequently migrated according to the reference count of the fingerprint corresponding to the non-updated data block.
Preferably, the second counting unit 1212 includes:
a first counting module 12121, configured to increment the reference count of the matching fingerprint when a matching fingerprint of the first data block exists in the deduplication fingerprint library;
a second counting module 12122, configured to decrement the reference count of the matching fingerprint when the first data block that references the matching fingerprint in the deduplication fingerprint library is updated.
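The reference counting performed by the two counting modules can be sketched as follows. This is a minimal sketch: the class and method names are hypothetical, and leaving a fingerprint in place when its count reaches zero is an assumption, since the text does not say what happens at that point.

```python
from collections import Counter

class FingerprintRefCounts:
    """Tracks how many data blocks currently reference each fingerprint."""

    def __init__(self):
        self.refs = Counter()

    def on_duplicate(self, fingerprint: bytes) -> None:
        # A new block matched an existing fingerprint: one more reference.
        self.refs[fingerprint] += 1

    def on_update(self, fingerprint: bytes) -> None:
        # A block that referenced this fingerprint was overwritten: one fewer reference.
        if self.refs[fingerprint] > 0:
            self.refs[fingerprint] -= 1

counts = FingerprintRefCounts()
counts.on_duplicate(b"fp-1")
counts.on_duplicate(b"fp-1")
counts.on_update(b"fp-1")
remaining = counts.refs[b"fp-1"]
```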
In this embodiment of the application, after the compressed data is obtained from the performance layer, it is segmented into first data blocks of a preset length, the weak hash value of each first data block is calculated, and the weak hash value is matched against the deduplication fingerprint library. When no fingerprint matching the weak hash value of the first data block exists in the deduplication fingerprint library, the first data block is determined to be a non-duplicate data block and is then compressed and stored. A weak hash generally refers to a hash algorithm that is fast to compute and produces a short hash (generally 4 to 8 bytes), and is mainly used for fast data checking and retrieval. Calculating the weak hash value of the first data block therefore improves the efficiency of looking up the first data block in the deduplication fingerprint library, and thus the deduplication efficiency for data blocks.
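Because a short weak hash can collide, this embodiment confirms a match by comparing the block with the stored original. The idea can be sketched as follows, with several assumptions: CRC-32 from `zlib` stands in for the unspecified weak hash, the class name is hypothetical, original blocks are kept in memory instead of being read from the capacity layer, and keeping a list of candidates per hash value is one possible way to handle collisions that the text does not prescribe.

```python
import zlib

def weak_hash(block: bytes) -> int:
    """4-byte checksum used as an illustrative weak hash."""
    return zlib.crc32(block)

class WeakHashDedup:
    def __init__(self):
        self.library = {}  # weak hash -> list of stored original blocks

    def is_duplicate(self, block: bytes) -> bool:
        """Return True for a confirmed duplicate; otherwise record the block."""
        key = weak_hash(block)
        candidates = self.library.get(key, [])
        # A weak-hash match is only a hint: confirm with a full comparison.
        if any(block == original for original in candidates):
            return True
        self.library.setdefault(key, []).append(block)
        return False

dedup = WeakHashDedup()
first_seen = dedup.is_duplicate(b"A" * 4096)
second_seen = dedup.is_duplicate(b"A" * 4096)
other_block = dedup.is_duplicate(b"B" * 4096)
```

The full comparison costs one read of the original block, which is the price paid for the cheaper hash computation and the smaller fingerprint library.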
Secondly, after a non-duplicate data block is compressed, it is written back to the log storage unit in a log-append manner, with a preset length as the storage unit, and the log storage unit is written back to the capacity layer once it is full. A data bitmap table is constructed to record the space occupation state of each storage unit, so that when the first data block in a storage unit is updated, the entry for that storage unit in the data bitmap table changes from the second state to the first state. When space is later reclaimed, the space occupation state of each log storage unit in the capacity layer can be obtained by scanning the data bitmap table. When the space occupied by the storage units in the first state in a log storage unit is greater than a preset space occupation threshold, the updated data blocks in the storage units in the first state are deleted, and the non-updated data blocks in the storage units in the second state are migrated to a new storage unit. During migration, the non-updated data blocks are classified as frequently migrated or infrequently migrated, and the non-updated data blocks that are not frequently migrated are stored at a fixed location, which avoids migrating them frequently and improves the IO performance of the full flash memory array.
The data compression system based on the full flash memory array in the embodiment of the present application is described above from the perspective of the modular functional entity, and the data compression system based on the full flash memory array in the embodiment of the present application is described below from the perspective of hardware processing:
one embodiment of a data compression system of a full flash memory array in the embodiment of the present application includes:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
acquiring compressed data in the performance layer;
segmenting the compressed data into first data blocks with preset lengths, and calculating hash values of the first data blocks;
matching the hash value of the first data block with a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to the capacity layer in a log-append manner, updating the fingerprint of the first data block into the deduplication fingerprint library, and updating the metadata information of the first data block into a file metadata area, wherein the metadata information includes: the physical storage address of the compressed first data block and the length of the compressed first data block.
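The steps above can be sketched end to end as follows. This is a simplified illustration, not the claimed implementation: the 4096-byte block size, SHA-256 as the fingerprint, `zlib` as the compressor, and the in-memory `bytearray` and dictionaries standing in for the capacity layer, fingerprint library, and file metadata area are all assumptions.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed preset block length

def dedup_and_compress(data, fingerprints, log, metadata):
    """Split data into blocks, store each unique block once in compressed form.

    fingerprints: fingerprint -> (physical address, compressed length)
    log:          bytearray standing in for the append-only capacity-layer log
    metadata:     logical address -> (fingerprint, physical address, compressed length)
    """
    for logical_addr in range(0, len(data), BLOCK_SIZE):
        block = data[logical_addr:logical_addr + BLOCK_SIZE]
        fp = hashlib.sha256(block).digest()
        if fp not in fingerprints:
            # Non-duplicate block: compress it and append it to the log.
            compressed = zlib.compress(block)
            fingerprints[fp] = (len(log), len(compressed))
            log.extend(compressed)
        physical_addr, length = fingerprints[fp]
        metadata[logical_addr] = (fp, physical_addr, length)

fingerprints, log, metadata = {}, bytearray(), {}
data = b"A" * BLOCK_SIZE * 2 + b"B" * BLOCK_SIZE
dedup_and_compress(data, fingerprints, log, metadata)

# Read back the third block through its metadata entry.
_, addr, length = metadata[2 * BLOCK_SIZE]
restored = zlib.decompress(bytes(log[addr:addr + length]))
```

The two identical `A` blocks share one log entry, so only two compressed blocks are stored for three logical blocks.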
In some embodiments of the present application, the processor may be further configured to:
if a matching fingerprint exists, determining that the first data block is duplicate data, and writing back metadata information of the first data block to a metadata area of the capacity layer, wherein the metadata information comprises a correspondence between a logical address of the first data block in the compressed data, the matching fingerprint, and a physical address of the matching fingerprint.
In some embodiments of the present application, the processor may be further configured to:
performing count management on the reference counts of the fingerprints in the deduplication fingerprint library.
In some embodiments of the present application, the processor may be further configured to:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when the first data block that references the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
In some embodiments of the present application, the processor may be further configured to:
writing the compressed first data block back to a log storage unit in a log-append manner, and writing the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integer multiple of the minimum write unit of the capacity layer.
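The log storage unit can be sketched as a buffer that is flushed to the capacity layer when it fills. In this sketch the 4096-byte minimum write unit, the multiple of four, the class name, and the zero padding of a partly filled unit are all assumptions; blocks larger than one log storage unit are not handled.

```python
MIN_WRITE_UNIT = 4096               # assumed minimum write unit of the capacity layer
LOG_UNIT_SIZE = 4 * MIN_WRITE_UNIT  # integer multiple of the minimum write unit

class LogStorageUnit:
    def __init__(self, capacity_layer):
        self.buffer = bytearray()
        self.capacity_layer = capacity_layer  # list of flushed log storage units

    def append(self, compressed_block: bytes) -> None:
        """Append a compressed block, flushing first if it would not fit."""
        if len(self.buffer) + len(compressed_block) > LOG_UNIT_SIZE:
            self.flush()
        self.buffer.extend(compressed_block)

    def flush(self) -> None:
        """Write the whole unit to the capacity layer in one aligned write."""
        padded = bytes(self.buffer).ljust(LOG_UNIT_SIZE, b"\0")
        self.capacity_layer.append(padded)
        self.buffer.clear()

capacity_layer = []
unit = LogStorageUnit(capacity_layer)
for _ in range(5):
    unit.append(b"x" * 6000)  # hypothetical compressed block of 6000 bytes
```

Sizing the unit as a multiple of the minimum write unit means every flush is an aligned, full-sized write, which avoids partial writes to the flash.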
The following continues to describe the data compression system of the full flash memory array in the embodiment of the present application from the perspective of hardware processing: another embodiment of the data compression system of the full flash memory array in the embodiment of the present application includes:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
acquiring compressed data in a performance layer;
segmenting the compressed data into first data blocks with preset lengths, and calculating hash values of the first data blocks;
matching the hash value of the first data block with a first duplicate removal fingerprint library stored in a memory to determine whether a first matched fingerprint exists, wherein the fingerprint of the first duplicate removal fingerprint library is part of the content of the corresponding fingerprint in a second duplicate removal fingerprint library stored in the capacity layer;
if the first duplicate removal fingerprint library does not have the first matching fingerprint, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to a log storage unit in a log additional writing mode by taking a preset length as a storage unit, writing the log storage unit back to a capacity layer after the log storage unit is full, wherein a storage space of the log storage unit is an integral multiple of a minimum writing unit of the capacity layer, correspondingly updating the fingerprint and partial fingerprint of the first data block to a second duplicate removal fingerprint library and the first duplicate removal fingerprint library, and updating the metadata information of the first data block to a file metadata area, wherein the metadata information comprises: the physical storage address of the compressed first data block and the length of the compressed first data block;
in some embodiments of the present application, the processor may be further configured to:
if a first matching fingerprint exists in a first duplicate removal fingerprint library, matching the first data block with the second duplicate removal fingerprint library stored in the capacity layer to determine whether a second matching fingerprint exists;
in some embodiments of the present application, the processor may be further configured to:
if a second matching fingerprint exists in a second duplicate removal fingerprint database, determining that the first data block is duplicate data, and writing back metadata information of the first data block to a metadata area of the capacity layer, wherein the metadata information includes a corresponding relation between a logical address of the first data block in the compressed data, the second matching fingerprint and a physical address of the second matching fingerprint.
In some embodiments of the present application, the processor may be further configured to:
constructing a data bitmap table, wherein the data bitmap table is used for recording the space occupation state corresponding to each storage unit or to a plurality of storage units, the space occupation states comprise a first state and a second state, the first state is invalid occupation, and the second state is valid occupation;
In some embodiments of the present application, the processor may be further configured to:
scanning the data bitmap table, acquiring the number of storage units in the first state in each log storage unit, and judging whether the space occupied by the storage units in the first state is greater than a preset space occupation threshold;
in some embodiments of the present application, the processor may be further configured to:
deleting the updated data block in the storage unit in the first state in the log storage unit, acquiring the non-updated data block in the storage unit in the second state in the log storage unit, migrating the non-updated data block to a new storage unit, and updating the metadata information of the non-updated data block to a file metadata area, wherein the metadata information of the non-updated data block comprises a physical storage address of the non-updated data block and the compressed length of the non-updated data block;
in some embodiments of the present application, the processor may be further configured to:
dividing the non-updated data blocks into frequent migration and non-frequent migration types, and fixedly storing the non-updated data blocks which are not frequently migrated to avoid frequent migration of the non-updated data blocks which are not frequently migrated;
in some embodiments of the present application, the processor may be further configured to:
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the average time interval between successive migrations of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the migration frequency of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the reference counts of the fingerprints corresponding to the non-updated data blocks.
In some embodiments of the present application, the processor may be further configured to:
when the space occupied by the storage unit in the first state in each log storage unit is not more than a preset space occupation threshold, executing other processes;
in some embodiments of the present application, the processor may be further configured to:
count management is performed on the number of references to fingerprints in the first and second de-duplication fingerprint repositories.
In some embodiments of the present application, the processor may be further configured to:
when matching fingerprints of the first data block exist in both the first deduplication fingerprint library and the second deduplication fingerprint library, incrementing the reference count of the matching fingerprints;
and,
when the first data block that references the corresponding first matching fingerprint and second matching fingerprint in the first deduplication fingerprint library and the second deduplication fingerprint library is updated, decrementing the reference counts of the first matching fingerprint and the second matching fingerprint respectively.
The following continues to describe the data compression system of the full flash memory array in the embodiment of the present application from the perspective of hardware processing: another embodiment of the data compression system of the full flash memory array in the embodiment of the present application includes:
a processor and a memory;
the memory is used for storing the computer program, and the processor is used for realizing the following steps when executing the computer program stored in the memory:
acquiring compressed data in the performance layer;
segmenting the compressed data into first data blocks with preset lengths, and calculating weak hash values of the first data blocks;
matching the weak hash value of the first data block with a duplicate removal fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicated data block, compressing the first data block, writing the compressed first data block back to a log storage unit in a log additional writing mode by taking a preset length as a storage unit, writing the log storage unit back to a capacity layer after the log storage unit is full, updating the fingerprint of the first data block into a duplicate removal fingerprint library, and updating the metadata information of the first data block to a file metadata area, wherein the metadata information comprises: the physical storage address of the compressed first data block and the length of the compressed first data block;
in some embodiments of the present application, the processor may be further configured to:
reading an original data block corresponding to the matched fingerprint, and matching the first data block with the original data block to determine whether the first data block is completely the same as the original data block;
in some embodiments of the present application, the processor may be further configured to:
when the first data block is completely the same as the original data block, determining that the first data block is a repeated data block, and writing back metadata information of the first data block to a metadata area of a capacity layer, wherein the metadata information comprises a corresponding relation among a logical address, a matching fingerprint and a physical address of the matching fingerprint of the first data block in compressed data;
in some embodiments of the present application, the processor may be further configured to:
constructing a data bitmap table, wherein the data bitmap table is used for recording the space occupation state corresponding to each storage unit or to a plurality of storage units, the space occupation states comprise a first state and a second state, the first state is invalid occupation, and the second state is valid occupation;
In some embodiments of the present application, the processor may be further configured to:
scanning the data bitmap table, acquiring the number of storage units in the first state in each log storage unit, and judging whether the space occupied by the storage units in the first state is greater than a preset space occupation threshold;
in some embodiments of the present application, the processor may be further configured to:
when the space occupied by the storage units in the first state is greater than a preset space occupation threshold, deleting the updated data block in the storage unit in the first state in the log storage unit, acquiring the non-updated data block in the storage unit in the second state in the log storage unit, migrating the non-updated data block to a new storage unit, and updating the metadata information of the non-updated data block into the file metadata area, wherein the metadata information of the non-updated data block comprises the physical storage address of the non-updated data block and the compressed length of the non-updated data block;
in some embodiments of the present application, the processor may be further configured to:
dividing the non-updated data blocks into frequent migration and non-frequent migration types, and fixedly storing the non-updated data blocks which are not frequently migrated to avoid frequent migration of the non-updated data blocks which are not frequently migrated;
in some embodiments of the present application, the processor may be further configured to:
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the average time interval between successive migrations of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the migration frequency of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks as frequently migrated or infrequently migrated according to the reference counts of the fingerprints corresponding to the non-updated data blocks.
In some embodiments of the present application, the processor may be further configured to:
when the space occupied by the storage unit in the first state is not more than a preset space occupation threshold, executing other processes;
in some embodiments of the present application, the processor may be further configured to:
performing counting management on the number of times of reference of the fingerprints in the duplicate fingerprint removal library;
in some embodiments of the present application, the processor may be further configured to:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when the first data block that references the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
It is to be understood that, when the processor in the data compression system based on the full flash memory array described above executes the computer program, the functions of the units in the corresponding device embodiments may also be implemented, and are not described herein again. Illustratively, the computer program may be partitioned into one or more modules/units that are stored in the memory and executed by the processor to accomplish the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program in the full flash memory array based data compression system. For example, the computer program may be partitioned into units in the full flash array based data compression system described above, and each unit may implement specific functions as described above in relation to the full flash array based data compression system.
The data compression system based on the full flash memory array may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The system may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the processor and the memory are merely examples of components of a computer apparatus and are not limiting; the apparatus may include more or fewer components, combine some components, or use different components. For example, the full flash array based data compression system may further include input/output devices, network access devices, buses, etc.
The processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor may implement various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash memory card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The present application further provides a computer-readable storage medium for implementing the functions of a full flash array based data compression system. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the processor is operable to perform the following steps:
acquiring data to be compressed in the performance layer;
segmenting the data to be compressed into first data blocks of a preset length, and calculating a hash value of each first data block;
matching the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to the capacity layer in a log-append write mode, updating the fingerprint of the first data block into the deduplication fingerprint library, and updating the metadata information of the first data block into a file metadata area, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
if a matching fingerprint exists, determining that the first data block is duplicate data, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information includes a correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
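As a non-limiting illustration, the steps above (segmenting, hashing, fingerprint matching, compression, log-append writing, and metadata update) may be sketched as follows. The class `CapacityLayer`, the functions `dedup_and_compress` and `read_block`, the 4096-byte block length, and the choice of SHA-1 and zlib are all assumptions made for illustration; the application does not specify them.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # assumed preset block length; not fixed by the application


class CapacityLayer:
    """Hypothetical in-memory stand-in for the capacity layer."""

    def __init__(self):
        self.log = bytearray()    # append-only log of compressed blocks
        self.fingerprints = {}    # fingerprint -> (physical address, compressed length)
        self.file_metadata = {}   # logical address -> (fingerprint, address, length)


def dedup_and_compress(data, layer):
    """Split data into fixed-length blocks, deduplicate, compress, and append."""
    for logical in range(0, len(data), BLOCK_SIZE):
        block = data[logical:logical + BLOCK_SIZE]
        fp = hashlib.sha1(block).digest()  # SHA-1 is an assumed hash choice
        if fp not in layer.fingerprints:
            # No matching fingerprint: compress and append to the log.
            compressed = zlib.compress(block)
            address = len(layer.log)
            layer.log.extend(compressed)
            layer.fingerprints[fp] = (address, len(compressed))
        # Both branches record the logical-to-physical correspondence.
        address, length = layer.fingerprints[fp]
        layer.file_metadata[logical] = (fp, address, length)


def read_block(layer, logical):
    """Locate a block through its metadata and decompress it."""
    _, address, length = layer.file_metadata[logical]
    return zlib.decompress(bytes(layer.log[address:address + length]))
```

In this sketch a duplicate block consumes no additional log space; only its metadata entry is written, and the recorded address and compressed length are what a later read uses to locate and decompress the block.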
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
performing count management on the reference counts of the fingerprints in the deduplication fingerprint library.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when a first data block that refers to the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
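The count management described above may be sketched, purely as an illustration, as follows. The class `FingerprintLibrary` and its method names are hypothetical, and two behaviors are assumptions not stated in the application: the count starts at 1 when a fingerprint is first stored, and the entry is dropped when the count reaches zero.

```python
class FingerprintLibrary:
    """Hypothetical reference-count table for deduplication fingerprints."""

    def __init__(self):
        self.ref_counts = {}  # fingerprint -> number of blocks referring to it

    def reference(self, fp):
        """Increment the count when a block matches (or first stores) fp."""
        self.ref_counts[fp] = self.ref_counts.get(fp, 0) + 1
        return self.ref_counts[fp]

    def release(self, fp):
        """Decrement the count when a block that referred to fp is updated."""
        remaining = self.ref_counts[fp] - 1
        if remaining == 0:
            # Assumption: an unreferenced fingerprint is removed, so the
            # stored block it describes becomes eligible for reclamation.
            del self.ref_counts[fp]
        else:
            self.ref_counts[fp] = remaining
        return remaining
```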
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
writing the compressed first data block to a log storage unit in a log-append write mode, and writing the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integral multiple of the minimum write unit of the capacity layer.
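A minimal sketch of such a log storage unit is given below for illustration only. The 4096-byte minimum write unit, the multiple of 4, the class `LogStorageUnit`, and the `flush` callback are assumptions; the sketch also assumes that a compressed block may straddle two log storage units, which the application does not address.

```python
MIN_WRITE_UNIT = 4096               # assumed minimum write unit of the capacity layer
LOG_UNIT_SIZE = 4 * MIN_WRITE_UNIT  # an integral multiple of the minimum write unit


class LogStorageUnit:
    """Hypothetical buffer that is flushed to the capacity layer only when full."""

    def __init__(self, flush):
        self.buffer = bytearray()
        self.flush = flush  # callback that writes one full unit to the capacity layer

    def append(self, compressed):
        """Append a compressed block; flush each time a full unit has accumulated."""
        self.buffer.extend(compressed)
        while len(self.buffer) >= LOG_UNIT_SIZE:
            self.flush(bytes(self.buffer[:LOG_UNIT_SIZE]))
            del self.buffer[:LOG_UNIT_SIZE]
```

Because every flush is exactly `LOG_UNIT_SIZE` bytes, each write reaching the capacity layer is aligned to whole write units, which is the property the integral-multiple requirement is intended to provide.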
The present application also provides another computer-readable storage medium for implementing the functions of a full flash array based data compression system. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the processor is operable to perform the following steps:
acquiring data to be compressed in a performance layer;
segmenting the data to be compressed into first data blocks of a preset length, and calculating a hash value of each first data block;
matching the hash value of the first data block against a first deduplication fingerprint library stored in a memory to determine whether a first matching fingerprint exists, wherein each fingerprint in the first deduplication fingerprint library is a portion of the corresponding fingerprint in a second deduplication fingerprint library stored in the capacity layer;
if no first matching fingerprint exists in the first deduplication fingerprint library, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block to a log storage unit in a log-append write mode with a preset length as the storage unit, and writing the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integral multiple of the minimum write unit of the capacity layer; updating the fingerprint of the first data block into the second deduplication fingerprint library and the corresponding partial fingerprint into the first deduplication fingerprint library; and updating the metadata information of the first data block into a file metadata area, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
if a first matching fingerprint exists in the first deduplication fingerprint library, matching the first data block against the second deduplication fingerprint library stored in the capacity layer to determine whether a second matching fingerprint exists.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
if a second matching fingerprint exists in the second deduplication fingerprint library, determining that the first data block is duplicate data, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information includes a correspondence among the logical address of the first data block in the data to be compressed, the second matching fingerprint, and the physical address of the second matching fingerprint.
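The two-level matching described above may be illustrated by the following sketch. The class `TwoLevelFingerprintLibrary`, the 4-byte partial fingerprint taken as a prefix of a SHA-1 digest, and the `second_lookups` counter are assumptions for illustration. The sketch also assumes that a block with a first match but no second match is treated as non-duplicate, a case the text above does not spell out.

```python
import hashlib

PREFIX_LEN = 4  # assumed length in bytes of the partial fingerprint held in memory


class TwoLevelFingerprintLibrary:
    """Hypothetical pair of libraries: partial fingerprints in memory,
    full fingerprints in the capacity layer."""

    def __init__(self):
        self.first = set()       # in-memory library of partial fingerprints
        self.second = {}         # capacity-layer library: full fingerprint -> address
        self.second_lookups = 0  # counts how often the capacity layer is consulted

    def lookup(self, block):
        """Return the physical address of a duplicate block, or None."""
        fp = hashlib.sha1(block).digest()
        if fp[:PREFIX_LEN] not in self.first:
            return None              # no first match: the block is non-duplicate
        self.second_lookups += 1     # only now is the second library read
        return self.second.get(fp)

    def insert(self, block, address):
        """Record the full fingerprint and its partial form in both libraries."""
        fp = hashlib.sha1(block).digest()
        self.second[fp] = address
        self.first.add(fp[:PREFIX_LEN])
```

The point of the design is visible in `second_lookups`: new blocks are rejected by the in-memory library alone, so the capacity layer is read only for blocks that are likely duplicates.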
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
constructing a data bitmap table, wherein the data bitmap table records the space occupation state of each storage unit or of a plurality of storage units, the space occupation state being either a first state, which indicates invalid occupation, or a second state, which indicates valid occupation.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
scanning the data bitmap table, obtaining the number of storage units in the first state in each log storage unit, and determining whether the space occupied by the storage units in the first state is greater than a preset space occupation threshold.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
deleting the updated data blocks held in storage units in the first state in a log storage unit, obtaining the non-updated data blocks held in storage units in the second state in the log storage unit, migrating the non-updated data blocks to a new storage unit, and updating the metadata information of the non-updated data blocks into a file metadata area, wherein the metadata information of a non-updated data block includes the physical storage address of the non-updated data block and the compressed length of the non-updated data block.
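The bitmap scan and migration steps above may be sketched as follows, for illustration only. The threshold of 0.5 (expressed as a fraction of the storage units in one log storage unit), the functions `needs_reclaim` and `reclaim`, and the list-based representation of a log storage unit are assumptions, not details given in the application.

```python
INVALID, VALID = 0, 1  # first state (invalid occupation), second state (valid occupation)
THRESHOLD = 0.5        # assumed preset space occupation threshold (fraction of a unit)


def needs_reclaim(bitmap):
    """bitmap holds one state per storage unit of a single log storage unit."""
    return bitmap.count(INVALID) / len(bitmap) > THRESHOLD


def reclaim(units, bitmap, new_log, metadata):
    """Migrate non-updated (valid) blocks to new_log and record their new metadata.

    units is a list of (block_id, compressed_bytes) pairs parallel to bitmap;
    metadata maps block_id -> (physical address, compressed length).
    """
    for (block_id, payload), state in zip(units, bitmap):
        if state == VALID:
            metadata[block_id] = (len(new_log), len(payload))
            new_log.extend(payload)
    # Updated (invalid) blocks are simply dropped; the old unit is now free.
    units.clear()
    bitmap.clear()
```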
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
classifying the non-updated data blocks into a frequently migrated type and an infrequently migrated type, and storing the infrequently migrated non-updated data blocks in a fixed location so that those data blocks are not migrated frequently.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the average time interval between consecutive migrations of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the migration frequency of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the reference counts of the fingerprints corresponding to the non-updated data blocks.
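The first two classification criteria above may be illustrated by the sketch below. The function names and the threshold parameters `max_mean_interval` and `min_count` are hypothetical. The third criterion (fingerprint reference count) is omitted because the text does not state whether a high reference count indicates frequent or infrequent migration.

```python
from statistics import mean


def frequent_by_interval(migration_times, max_mean_interval):
    """Criterion 1: mean interval between consecutive migrations in the period.

    migration_times is a sorted list of timestamps within the preset period.
    """
    if len(migration_times) < 2:
        return False  # assumption: fewer than two migrations is never "frequent"
    intervals = [b - a for a, b in zip(migration_times, migration_times[1:])]
    return mean(intervals) < max_mean_interval


def frequent_by_count(migration_times, min_count):
    """Criterion 2: number of migrations within the preset period."""
    return len(migration_times) >= min_count
```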
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
when the space occupied by the storage units in the first state in each log storage unit is not greater than the preset space occupation threshold, executing other processes.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
performing count management on the reference counts of the fingerprints in the first deduplication fingerprint library and the second deduplication fingerprint library.
In some embodiments of the present application, the computer program stored on the computer-readable storage medium, when executed by the processor, may be specifically configured to perform the following steps:
when matching fingerprints of the first data block exist in both the first deduplication fingerprint library and the second deduplication fingerprint library, incrementing the reference counts of the matching fingerprints;
and,
when a first data block that refers to the corresponding first matching fingerprint in the first deduplication fingerprint library and second matching fingerprint in the second deduplication fingerprint library is updated, decrementing the reference counts of the first matching fingerprint and the second matching fingerprint respectively.
The present application also provides another computer-readable storage medium for implementing the functions of a full flash array based data compression system. A computer program is stored on the storage medium, and when the computer program is executed by a processor, the processor is operable to perform the following steps:
acquiring data to be compressed in the performance layer;
segmenting the data to be compressed into first data blocks of a preset length, and calculating a weak hash value of each first data block;
matching the weak hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block to a log storage unit in a log-append write mode with a preset length as the storage unit, writing the log storage unit back to the capacity layer after the log storage unit is full, updating the fingerprint of the first data block into the deduplication fingerprint library, and updating the metadata information of the first data block into a file metadata area, wherein the metadata information includes the physical storage address of the compressed first data block and the length of the compressed first data block.
In some embodiments of the present application, the processor may be further configured to:
reading the original data block corresponding to the matching fingerprint, and comparing the first data block with the original data block to determine whether the first data block is identical to the original data block.
In some embodiments of the present application, the processor may be further configured to:
when the first data block is identical to the original data block, determining that the first data block is a duplicate data block, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information includes a correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
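The weak-hash variant above may be sketched as follows, for illustration only. CRC-32 is used as an assumed weak hash, and the class `WeakHashDedup` with its dictionary `store` is a hypothetical stand-in for reading the original block from the capacity layer; in practice the stored block would be compressed and would need to be decompressed before comparison.

```python
import zlib


class WeakHashDedup:
    """Hypothetical weak-hash lookup followed by a byte-for-byte comparison."""

    def __init__(self):
        self.library = {}  # weak hash -> physical address of the original block
        self.store = {}    # physical address -> original block bytes

    def add(self, block, address):
        self.library[zlib.crc32(block)] = address
        self.store[address] = block

    def is_duplicate(self, block):
        """A weak-hash match is only a candidate; the block contents decide."""
        address = self.library.get(zlib.crc32(block))
        if address is None:
            return False
        return self.store[address] == block  # rules out weak-hash collisions
```

A weak hash is cheaper to compute than a cryptographic one but collides more often, which is why the comparison against the original block is needed before a block is treated as a duplicate.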
In some embodiments of the present application, the processor may be further configured to:
constructing a data bitmap table, wherein the data bitmap table records the space occupation state of each storage unit or of a plurality of storage units, the space occupation state being either a first state, which indicates invalid occupation, or a second state, which indicates valid occupation.
In some embodiments of the present application, the processor may be further configured to:
scanning the data bitmap table, obtaining the number of storage units in the first state in each log storage unit, and determining whether the space occupied by the storage units in the first state is greater than a preset space occupation threshold.
In some embodiments of the present application, the processor may be further configured to:
when the space occupied by the storage units in the first state is greater than the preset space occupation threshold, deleting the updated data blocks held in storage units in the first state in the log storage unit, obtaining the non-updated data blocks held in storage units in the second state in the log storage unit, migrating the non-updated data blocks to a new storage unit, and updating the metadata information of the non-updated data blocks into a file metadata area, wherein the metadata information of a non-updated data block includes the physical storage address of the non-updated data block and the compressed length of the non-updated data block.
In some embodiments of the present application, the processor may be further configured to:
classifying the non-updated data blocks into a frequently migrated type and an infrequently migrated type, and storing the infrequently migrated non-updated data blocks in a fixed location so that those data blocks are not migrated frequently.
In some embodiments of the present application, the processor may be further configured to:
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the average time interval between consecutive migrations of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the migration frequency of the non-updated data blocks within a preset time period;
or,
classifying the non-updated data blocks into the frequently migrated type and the infrequently migrated type according to the reference counts of the fingerprints corresponding to the non-updated data blocks.
In some embodiments of the present application, the processor may be further configured to:
when the space occupied by the storage units in the first state is not greater than the preset space occupation threshold, executing other processes.
In some embodiments of the present application, the processor may be further configured to:
performing count management on the reference counts of the fingerprints in the deduplication fingerprint library.
In some embodiments of the present application, the processor may be further configured to:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when a first data block that refers to the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
It will be appreciated that the integrated units, if implemented as software functional units and sold or used as a stand-alone product, may be stored in a corresponding computer-readable storage medium. Based on such understanding, all or part of the flow of the methods according to the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content of the computer-readable medium may be expanded or restricted as appropriate according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, legislation and patent practice provide that computer-readable media do not include electrical carrier signals and telecommunications signals.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division, and other divisions are possible in practice; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or in another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A data compression method based on a full flash memory array, wherein the full flash memory array comprises a performance layer and a capacity layer, and the method comprises the following steps:
acquiring data to be compressed in the performance layer;
segmenting the data to be compressed into first data blocks of a preset length, and calculating a hash value of each first data block;
matching the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
if no matching fingerprint exists, determining that the first data block is a non-duplicate data block, compressing the first data block, writing the compressed first data block back to the capacity layer in a log-append write mode, and updating the fingerprint of the first data block into the deduplication fingerprint library, wherein the log-append write serves as an out-of-place update mode that improves the I/O performance of the capacity layer.
2. The method of claim 1, further comprising:
if a matching fingerprint exists, determining that the first data block is duplicate data, and writing metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information comprises a correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
3. The method of claim 2, further comprising:
performing count management on the reference counts of the fingerprints in the deduplication fingerprint library.
4. The method of claim 3, wherein performing count management on the reference counts of the fingerprints in the deduplication fingerprint library comprises:
when a matching fingerprint of the first data block exists in the deduplication fingerprint library, incrementing the reference count of the matching fingerprint;
and,
when a first data block that refers to the matching fingerprint in the deduplication fingerprint library is updated, decrementing the reference count of the matching fingerprint.
5. The method according to any one of claims 1 to 4, wherein writing the compressed first data block back to the capacity layer in a log-append write mode comprises:
writing the compressed first data block to a log storage unit in a log-append write mode, and writing the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integral multiple of the minimum write unit of the capacity layer.
6. The method of claim 5, wherein after writing the compressed first data block back to the capacity layer in a log-append write mode, the method further comprises:
updating metadata information of the first data block into a file metadata area of the capacity layer or into the deduplication fingerprint library, wherein the metadata information comprises the physical storage address of the compressed first data block and the length of the compressed first data block, and is used for subsequently decompressing the first data block.
7. A data compression system based on a full flash array, the full flash array including a performance layer and a capacity layer, the system comprising:
an acquisition unit, configured to acquire data to be compressed in the performance layer;
a segmentation calculation unit, configured to segment the data to be compressed into first data blocks of a preset length and to calculate a hash value of each first data block;
a matching unit, configured to match the hash value of the first data block against a deduplication fingerprint library in the capacity layer to determine whether a matching fingerprint exists;
and a compression unit, configured to, when no matching fingerprint exists, determine that the first data block is a non-duplicate data block, compress the first data block, write the compressed first data block back to the capacity layer in a log-append write mode, and update the fingerprint of the first data block into the deduplication fingerprint library, wherein the log-append write serves as an out-of-place update mode that improves the I/O performance of the capacity layer.
8. The system of claim 7, further comprising:
a deduplication unit, configured to, when a matching fingerprint exists, determine that the first data block is duplicate data and write metadata information of the first data block back to a metadata area of the capacity layer, wherein the metadata information comprises a correspondence among the logical address of the first data block in the data to be compressed, the matching fingerprint, and the physical address of the matching fingerprint.
9. The system of claim 8, further comprising:
a counting unit, configured to perform count management on the reference counts of the fingerprints in the deduplication fingerprint library.
10. The system of claim 9, wherein the counting unit comprises:
a first counting module, configured to increment the reference count of the matching fingerprint when a matching fingerprint of the first data block exists in the deduplication fingerprint library;
and,
a second counting module, configured to decrement the reference count of the matching fingerprint when a first data block that refers to the matching fingerprint in the deduplication fingerprint library is updated.
11. The system according to any one of claims 7 to 10, wherein the compression unit comprises:
a compression module, configured to, when no matching fingerprint exists, determine that the first data block is a non-duplicate data block, compress the first data block, write the compressed first data block to a log storage unit in a log-append write mode, and write the log storage unit back to the capacity layer after the log storage unit is full, wherein the storage space of the log storage unit is an integral multiple of the minimum write unit of the capacity layer.
12. The system of claim 11, further comprising:
an updating unit, configured to update metadata information of the first data block into a file metadata area of the capacity layer or into the deduplication fingerprint library, wherein the metadata information comprises the physical storage address of the compressed first data block and the length of the compressed first data block, and is used for subsequently decompressing the first data block.
13. A full flash array based data compression system comprising a processor, wherein the processor, when executing a computer program stored on a memory, is configured to implement the full flash array based data compression method of any of claims 1 to 6.
14. A readable storage medium, on which a computer program is stored, which, when being executed by a processor, is configured to implement the full flash memory array based data compression method according to any one of claims 1 to 6.
CN201811287652.2A 2018-10-31 2018-10-31 Data compression method and system based on full flash memory array Pending CN111124259A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811287652.2A CN111124259A (en) 2018-10-31 2018-10-31 Data compression method and system based on full flash memory array

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811287652.2A CN111124259A (en) 2018-10-31 2018-10-31 Data compression method and system based on full flash memory array

Publications (1)

Publication Number Publication Date
CN111124259A true CN111124259A (en) 2020-05-08

Family

ID=70485542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811287652.2A Pending CN111124259A (en) 2018-10-31 2018-10-31 Data compression method and system based on full flash memory array

Country Status (1)

Country Link
CN (1) CN111124259A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999605A (en) * 2012-11-21 2013-03-27 重庆大学 Method and device for optimizing data placement to reduce data fragments
CN104156400A (en) * 2014-07-22 2014-11-19 中国科学院信息工程研究所 Storage method and device of mass network flow data
CN106681659A (en) * 2016-12-16 2017-05-17 郑州云海信息技术有限公司 Data compression method and device
CN108427538A (en) * 2018-03-15 2018-08-21 深信服科技股份有限公司 Storage data compression method, device and the readable storage medium storing program for executing of full flash array

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625186A (en) * 2020-05-13 2020-09-04 深信服科技股份有限公司 Data processing method and device, electronic equipment and storage medium
CN111625186B (en) * 2020-05-13 2023-11-07 深信服科技股份有限公司 Data processing method, device, electronic equipment and storage medium
WO2021248863A1 (en) * 2020-06-11 2021-12-16 华为技术有限公司 Data processing method and storage device
US12001703B2 (en) 2020-06-11 2024-06-04 Huawei Technologies Co., Ltd. Data processing method and storage device
CN111538465A (en) * 2020-07-07 2020-08-14 南京云信达科技有限公司 Linux-based high-performance data deduplication method
CN111538465B (en) * 2020-07-07 2020-10-23 南京云信达科技有限公司 Linux-based high-performance data deduplication method

Similar Documents

Publication Publication Date Title
CN108427539B (en) Offline de-duplication compression method and device for cache device data and readable storage medium
US10466932B2 (en) Cache data placement for compression in data storage systems
CN111125033B (en) Space recycling method and system based on full flash memory array
CN108427538B (en) Storage data compression method and device of full flash memory array and readable storage medium
USRE49148E1 (en) Reclaiming space occupied by duplicated data in a storage system
US9965394B2 (en) Selective compression in data storage systems
US9880746B1 (en) Method to increase random I/O performance with low memory overheads
CN103098035B (en) Storage system
CN107506153B (en) Data compression method, data decompression method and related system
US10102150B1 (en) Adaptive smart data cache eviction
US8214620B2 (en) Computer-readable recording medium storing data storage program, computer, and method thereof
US20120159098A1 (en) Garbage collection and hotspots relief for a data deduplication chunk store
KR20170054299A (en) Reference block aggregating into a reference set for deduplication in memory management
CN111381779B (en) Data processing method, device, equipment and storage medium
CN107682016B (en) Data compression method, data decompression method and related system
CN111124940B (en) Space recovery method and system based on full flash memory array
US11455122B2 (en) Storage system and data compression method for storage system
US11436102B2 (en) Log-structured formats for managing archived storage of objects
CN111124259A (en) Data compression method and system based on full flash memory array
US20200341667A1 (en) Detecting data deduplication opportunities using entropy-based distance
CN105493080B (en) Method and apparatus for context-aware data deduplication
CN111124939A (en) Data compression method and system based on full flash memory array
CN111198857A (en) Data compression method and system based on full flash memory array
CN111625186B (en) Data processing method, device, electronic equipment and storage medium
US20230367477A1 (en) Storage system, data management program, and data management method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination