CN113806341A - Data processing method and storage device - Google Patents

Info

Publication number
CN113806341A
Authority
CN
China
Prior art keywords
data
granularity
storage device
block
compression
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010784929.3A
Other languages
Chinese (zh)
Inventor
任仁
刘中全
刘宏伟
朱芳芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to EP20939891.6A (EP4030310A4)
Priority to PCT/CN2020/136106 (WO2021248863A1)
Publication of CN113806341A
Priority to US17/741,079 (US20220269431A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Human Computer Interaction (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a data processing method and a storage device, and belongs to the technical field of storage. The storage device uses different granularities for deduplication and compression: it deduplicates data at a small granularity and compresses data at a large granularity. This removes the restriction that the deduplication granularity and the compression granularity must be the same, avoids to a certain extent both the drop in deduplication rate caused by an overly large granularity and the drop in compression ratio caused by an overly small granularity, and improves the overall data reduction rate of deduplication and compression.

Description

Data processing method and storage device
The present application claims priority to Chinese patent application No. 202010526840.7, entitled "A storage system, storage node, and data storage method", filed on June 11, 2020, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of storage technologies, and in particular, to a data processing method and a storage device.
Background
Deduplication and compression are key technologies in the storage industry. The storage device can reduce the data scale of actual storage by performing deduplication and compression, save the storage space occupied by the data in the storage device, and improve the storage efficiency of the storage device.
Currently, a storage device presets a fixed granularity, performs deduplication based on the granularity, and performs compression based on the granularity. For example, if the granularity is set to 8 Kilobytes (KB), the storage device determines whether each 8KB data block is a duplicate block when performing deduplication, and deletes a certain 8KB data block if the 8KB data block is a duplicate block; when the storage device compresses, the data block of 8KB is compressed each time.
When the method is adopted to process data, the deduplication granularity and the compression granularity must be the same, and the method has strong limitation.
Disclosure of Invention
The embodiments of the present application provide a data processing method and a storage device that remove, to a certain extent, the limitation of the existing data processing method. The technical solutions are as follows:
In a first aspect, a data processing method is provided. The method is performed by a storage device and includes: acquiring data; performing deduplication processing on the data based on a first granularity; performing compression processing on the data based on a second granularity, where the size of the second granularity is larger than the size of the first granularity; and storing the data subjected to the deduplication processing and the compression processing in a hard disk of the storage device.
In the method provided by the first aspect, the storage device uses different granularities for deduplication processing and compression processing: deduplication uses the smaller granularity and compression uses the larger granularity. This removes the restriction that the deduplication granularity and the compression granularity must be the same, avoids to a certain extent both the drop in deduplication rate caused by an overly large granularity and the drop in compression ratio caused by an overly small granularity, and helps improve the overall reduction rate of deduplication and compression.
In the first aspect, the present application does not limit the order of the deduplication processing and the compression processing. In some scenarios, compression is performed first and deduplication second; in other scenarios, deduplication is performed first and compression second. When deduplication is performed first, it yields duplicate blocks and non-duplicate blocks, and compression may be applied only to the non-duplicate blocks. When compression is performed first, the data is compressed into compressed blocks, and deduplication is then performed on the compressed blocks.
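The deduplicate-first order described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: SHA-256 as the fingerprint, zlib as the compressor, and the names `FIRST_GRANULARITY`, `SECOND_GRANULARITY`, and `dedup_then_compress` are all hypothetical choices; the 4KB and 32KB sizes are the example values given for the two granularities.

```python
import hashlib
import zlib

FIRST_GRANULARITY = 4 * 1024    # deduplication granularity (example value: 4KB)
SECOND_GRANULARITY = 32 * 1024  # compression granularity (example value: 32KB)


def dedup_then_compress(data: bytes):
    """Deduplicate at the first granularity, then compress only the
    non-duplicate blocks at the larger second granularity."""
    seen = set()          # fingerprints of blocks that were already kept
    unique = bytearray()  # non-duplicate blocks, concatenated in arrival order
    refs = []             # one fingerprint per input block, in input order

    for off in range(0, len(data), FIRST_GRANULARITY):
        block = data[off:off + FIRST_GRANULARITY]
        fp = hashlib.sha256(block).digest()  # assumed fingerprint function
        refs.append(fp)
        if fp not in seen:                   # non-duplicate block: keep its data
            seen.add(fp)
            unique += block

    # Compress the surviving data in units of the second granularity.
    compressed = [
        zlib.compress(bytes(unique[off:off + SECOND_GRANULARITY]))
        for off in range(0, len(unique), SECOND_GRANULARITY)
    ]
    return refs, compressed
```

In this sketch, sixteen 4KB input blocks of which eight are distinct produce sixteen references but a single 32KB compression unit, which is where the two granularities complement each other.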
Optionally, the storage device stores therein metadata, the metadata being managed based on a metadata management granularity, a size of the metadata management granularity being smaller than or equal to a set maximum value and larger than or equal to a set minimum value, a size of the first granularity being equal to an integer multiple of the minimum value.
In this way, the granularity used for deduplication processing is derived from the minimum value of the metadata management granularity, which helps obtain a better deduplication granularity, thereby improving the deduplication rate and saving storage resources.
Optionally, the size of the second granularity is a product of the minimum value and a compression ratio.
In this way, the granularity used for compression processing is no longer a fixed value but is selected dynamically according to the compression ratio, which improves the compression ratio without degrading data read performance.
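As a worked illustration of this product, the small Python helper below computes the second granularity. The function name and the ratio of 8 are assumptions made only for the example; with a 4KB minimum metadata management granularity, a ratio of 8 happens to give the 32KB value used as an example elsewhere in this description.

```python
def second_granularity(min_metadata_granularity: int, compression_ratio: int) -> int:
    """Compression granularity = minimum metadata management granularity
    multiplied by the compression ratio (both sizes in bytes)."""
    if min_metadata_granularity <= 0:
        raise ValueError("minimum metadata management granularity must be positive")
    if compression_ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    return min_metadata_granularity * compression_ratio


# Hypothetical values: 4KB minimum granularity, compression ratio of 8.
granularity_bytes = second_granularity(4 * 1024, 8)
```

One way to read the design choice, offered here as an interpretation rather than a statement from the source: a unit of this size shrinks to roughly the minimum metadata management granularity after compression, so reading it back costs about as much I/O as reading one uncompressed minimum-size unit.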
Optionally, the deduplication processing of the data based on the first granularity includes: dividing the data into a plurality of data blocks; acquiring a fingerprint of each data block; and determining duplicate blocks and non-duplicate blocks among the plurality of data blocks according to the fingerprints.
Optionally, the compressing the data based on the second granularity includes: compressing the non-duplicate blocks based on the second granularity to obtain compressed blocks, where the data subjected to the deduplication processing and the compression processing includes the compressed blocks.
Optionally, the method further includes: recording metadata of the compressed blocks.
Optionally, the recording the metadata of the compressed blocks includes: if there are a plurality of compressed blocks and the addresses of the compressed blocks are consecutive, recording one piece of metadata for the plurality of compressed blocks.
Recording one piece of metadata for a plurality of compressed blocks with consecutive addresses reduces the number of metadata records, thereby saving the storage resources that metadata occupies in the storage device.
Optionally, the addresses of the plurality of compressed blocks being consecutive means that both the physical addresses and the logical addresses of the plurality of compressed blocks are consecutive.
Optionally, the piece of metadata includes the address of the first compressed block of the plurality of compressed blocks and the length of each compressed block.
Recording the metadata in this way keeps the space occupied by the metadata small while still allowing the data to be read through the metadata.
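The merged-metadata idea can be sketched in Python as follows. This is an illustrative sketch only: the record layout, the names `MergedMetadata` and `merge_metadata`, and the byte-based definition of "consecutive" (each block starts where the previous one ends, both logically and physically) are assumptions, not details given in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MergedMetadata:
    first_logical: int    # logical address of the first compressed block
    first_physical: int   # physical address of the first compressed block
    lengths: List[int]    # compressed length of each block, in order
    next_logical: int     # bookkeeping: logical address expected next
    next_physical: int    # bookkeeping: physical address expected next


def merge_metadata(blocks: List[Tuple[int, int, int, int]]) -> List[MergedMetadata]:
    """blocks: (logical_addr, logical_len, physical_addr, compressed_len) tuples.
    Blocks whose logical AND physical addresses are consecutive share one record."""
    records: List[MergedMetadata] = []
    for logical, logical_len, physical, comp_len in blocks:
        last = records[-1] if records else None
        if (last is not None
                and last.next_logical == logical
                and last.next_physical == physical):
            # Consecutive in both address spaces: extend the existing record.
            last.lengths.append(comp_len)
            last.next_logical += logical_len
            last.next_physical += comp_len
        else:
            # Not consecutive: start a new metadata record.
            records.append(MergedMetadata(logical, physical, [comp_len],
                                          logical + logical_len,
                                          physical + comp_len))
    return records
```

Because each record keeps the first address plus every block's length, the physical position of the k-th block in a record can be recovered by adding the first k lengths to the first physical address.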
Optionally, before the deduplication processing and the compression processing, the data has already been compressed based on a third granularity, where the size of the third granularity is smaller than the size of the second granularity.
In this way, when the storage device finds that the original compression granularity of a compressed block (the third granularity) is not the better compression granularity (the second granularity), the storage device compresses the data again at the better compression granularity, so that the compression granularity of the compressed block is optimized and the compression ratio is improved.
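A minimal Python sketch of this recompression step follows, assuming zlib as the compressor and a hypothetical helper name `recompress`. It simply restores the data that was compressed at the smaller third granularity and compresses it again in units of the second granularity.

```python
import zlib
from typing import List


def recompress(small_blocks: List[bytes], second_granularity: int) -> List[bytes]:
    """Decompress blocks that were compressed at a smaller (third) granularity
    and compress the restored data again at the larger second granularity."""
    raw = b"".join(zlib.decompress(block) for block in small_blocks)
    return [
        zlib.compress(raw[off:off + second_granularity])
        for off in range(0, len(raw), second_granularity)
    ]
```

For example, four blocks compressed at 8KB each become a single compressed block when the second granularity is 32KB; the compressor then sees a longer window of data, which is generally what allows a better compression ratio.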
Optionally, the storage device is a storage array.
Optionally, the storage device is a storage node in a distributed storage system.
Optionally, the first granularity is 4KB and the second granularity is 32 KB.
In a second aspect, a data processing method is provided. The method is performed by a storage device and includes: acquiring data; determining a first granularity based on a metadata management granularity, the metadata management granularity being a granularity for managing metadata stored by the storage device, a size of the metadata management granularity being less than or equal to a set maximum value and greater than or equal to a set minimum value, the size of the first granularity being equal to an integer multiple of the minimum value; performing deduplication processing on the data based on the first granularity; and storing the data subjected to the deduplication processing in a hard disk of the storage device.
In the method provided by the second aspect, the granularity used for deduplication processing is determined from the minimum value of the metadata management granularity, which helps obtain a better deduplication granularity, thereby improving the deduplication rate and saving storage resources.
Optionally, the deduplication processing of the data based on the first granularity includes: dividing the data into a plurality of data blocks; acquiring a fingerprint of each data block; determining a duplicate block and a non-duplicate block from the plurality of data blocks according to the fingerprint.
Optionally, the first granularity is 4 KB.
In a third aspect, a data processing method is provided. The method is performed by a storage device and includes: acquiring data; determining a second granularity based on a metadata management granularity, the metadata management granularity being a granularity for managing metadata stored by the storage device, a size of the metadata management granularity being less than or equal to a set maximum value and greater than or equal to a set minimum value, a size of the second granularity being a product of the minimum value and a set compression ratio; performing compression processing on the data based on the second granularity; and storing the data subjected to the compression processing in a hard disk of the storage device.
In the method provided in the third aspect, the granularity used for compression processing is no longer a fixed value but is determined dynamically from the metadata management granularity and the compression ratio, which improves the compression ratio without degrading data read performance.
Optionally, the size of the second granularity is a product of the minimum value and a compression ratio.
Optionally, the compressing the data based on the second granularity includes: compressing the non-duplicate blocks based on the second granularity to obtain compressed blocks, where the data subjected to the deduplication processing and the compression processing includes the compressed blocks.
Optionally, the method further includes: recording metadata of the compressed blocks.
Optionally, the recording the metadata of the compressed blocks includes: if there are a plurality of compressed blocks and the addresses of the compressed blocks are consecutive, recording one piece of metadata for the plurality of compressed blocks.
Optionally, the piece of metadata includes the address of the first compressed block of the plurality of compressed blocks and the length of each compressed block.
Optionally, before the deduplication processing and the compression processing, the data has already been compressed based on a third granularity, where the size of the third granularity is smaller than the size of the second granularity.
Optionally, the method further includes: storing the fingerprints of the duplicate blocks.
Optionally, the method further includes: recording metadata of the non-duplicate blocks, and storing the fingerprints of the non-duplicate blocks in a fingerprint table.
Optionally, the second granularity is 32 KB.
In a fourth aspect, a storage device is provided. The storage device includes at least one processor and a hard disk, where the at least one processor is configured to execute instructions to cause the storage device to perform the data processing method provided by at least one of the first aspect, any optional manner of the first aspect, the second aspect, any optional manner of the second aspect, the third aspect, and any optional manner of the third aspect, and the hard disk is configured to store data. For specific details of the storage device provided in the fourth aspect, reference may be made to at least one of the first aspect, any optional manner of the first aspect, the second aspect, any optional manner of the second aspect, the third aspect, and any optional manner of the third aspect; details are not repeated here.
In some embodiments, the at least one processor includes a first processor, a second processor, and a third processor;
the first processor is configured to acquire data;
the second processor is configured to perform deduplication processing on the data based on a first granularity;
the third processor is configured to perform compression processing on the data based on a second granularity, where a size of the second granularity is larger than a size of the first granularity;
the first processor is further configured to store the data subjected to the deduplication processing and the compression processing in the hard disk.
In a fifth aspect, a storage device is provided, where the storage device has a function of implementing data processing in at least one of the first aspect, any of the options in the first aspect, the second aspect, any of the options in the second aspect, the third aspect, and any of the options in the third aspect. The storage device comprises at least one module, and the at least one module is configured to implement the data processing method provided by at least one of the first aspect, any one of the options of the first aspect, the second aspect, any one of the options of the second aspect, the third aspect, and any one of the options of the third aspect.
In some embodiments, the modules in the storage devices are implemented in software, and the modules in the storage devices are program modules. In other embodiments, the modules in the storage device are implemented in hardware or firmware. For specific details of the storage device provided by the fifth aspect, reference may be made to any one of the first aspect, any one of the optional manners of the first aspect, the second aspect, any one of the optional manners of the second aspect, the third aspect, or any one of the optional manners of the third aspect, and details are not repeated here.
In a sixth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and the instruction is read by a processor to cause a storage device to execute the data processing method provided by at least one of the first aspect, any optional manner of the first aspect, the second aspect, any optional manner of the second aspect, the third aspect, and any optional manner of the third aspect.
In a seventh aspect, a computer program product is provided that includes computer instructions stored in a computer readable storage medium. The processor of the storage device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the storage device to perform the data processing method provided by at least one of the first aspect, any of the options of the first aspect, the second aspect, any of the options of the second aspect, the third aspect, and any of the options of the third aspect.
In an eighth aspect, a chip is provided. When the chip runs on a storage device, it causes the storage device to execute the data processing method provided by at least one of the first aspect, any optional manner of the first aspect, the second aspect, any optional manner of the second aspect, the third aspect, and any optional manner of the third aspect.
Drawings
FIG. 1 is a schematic diagram of the system architecture of a distributed storage system according to an embodiment of the present application;
FIG. 2 is a schematic diagram of stored metadata according to an embodiment of the present application;
FIG. 3 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of pre-deduplication according to an embodiment of the present application;
FIG. 5 is a flowchart of a data processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of post-deduplication according to an embodiment of the present application;
FIG. 7 is a schematic diagram of recording one piece of metadata for a plurality of compressed blocks according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a storage device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Since the embodiments of the present application involve deduplication and compression technology, the terms and related concepts of that technology are described first for ease of understanding.
(1) Deduplication (deduplicate)
Deduplication is short for data deduplication, a data reduction technology. A storage system often holds many copies of the same data, and these copies occupy a large amount of hard disk space. Deduplication deletes the redundant copies and keeps only one copy of identical data, thereby saving storage space. The technical principle of deduplication is as follows. First, chunking is performed: the data to be written is divided into a plurality of data blocks. Next, fingerprint calculation is performed: a fingerprint is calculated for each data block. Then, fingerprint lookup is performed: the fingerprint is used as an index to search the fingerprint table. If the same fingerprint exists in the fingerprint table, the data block is a duplicate block; the data block is not saved, and only its fingerprint index is saved. If the same fingerprint does not exist in the fingerprint table, the data block is a non-duplicate block (also called a unique block); the data block is saved and metadata for it is created. According to execution timing, deduplication is divided into pre-deduplication and post-deduplication, described in (2) and (3) below. According to the duplicate-detection mode, deduplication is divided into fixed-length deduplication and similar deduplication, described in (4) and (5) below.
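The chunking, fingerprint calculation, and fingerprint lookup steps above can be sketched in Python as follows. The class name `DedupStore`, the use of SHA-256 as the fingerprint, and the in-memory dictionary standing in for the fingerprint table are all assumptions for illustration; hash collisions are ignored in this sketch.

```python
import hashlib
from typing import Dict, List


class DedupStore:
    """Toy fixed-length deduplicating store (illustrative only)."""

    def __init__(self, granularity: int = 8 * 1024):
        self.granularity = granularity
        # Fingerprint table: fingerprint -> the single saved copy of the block.
        self.fingerprint_table: Dict[bytes, bytes] = {}

    def write(self, data: bytes) -> List[bytes]:
        """Return one fingerprint index per block of `data`."""
        indexes = []
        for off in range(0, len(data), self.granularity):    # 1. chunking
            block = data[off:off + self.granularity]
            fp = hashlib.sha256(block).digest()               # 2. fingerprint calculation
            if fp not in self.fingerprint_table:              # 3. fingerprint lookup
                self.fingerprint_table[fp] = block            # non-duplicate block: save it
            indexes.append(fp)                                # duplicate or not: keep the index
        return indexes

    def read(self, indexes: List[bytes]) -> bytes:
        """Rebuild the original data from the saved fingerprint indexes."""
        return b"".join(self.fingerprint_table[fp] for fp in indexes)
```

Writing `b"aaaabbbbaaaa"` with a 4-byte granularity, for instance, yields three indexes but only two saved blocks, and reading the indexes back reproduces the original bytes.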
(2) Pre-deduplication
Pre-deduplication refers to performing deduplication before data is written to the hard disk. Pre-deduplication is also called online deduplication.
(3) Post-deduplication
Post-deduplication refers to performing deduplication after data is written to the hard disk. Post-deduplication is also called offline deduplication. Post-deduplication can be implemented in several ways. In some embodiments, after data is written to the hard disk, the data is read from the hard disk into the cache; fingerprints are calculated for the data in the cache, and duplicate data is identified by comparing whether fingerprints are the same; if duplicate data is found, it is deduplicated, and the deduplicated data is written back to the hard disk. In other embodiments, when data to be stored is acquired, a fingerprint of the data is calculated; the data is written to the hard disk and the fingerprint is saved to the opportunity table. When deduplication is needed, the fingerprints are read from the opportunity table and compared with one another to identify duplicate data; if duplicate data is found, deduplication is performed.
(4) Fixed-length deduplication
In fixed-length deduplication, two data blocks must be exactly identical for one of them to be treated as a duplicate block. During chunking, the data is divided at a preset granularity, and during fingerprint lookup, the data is aligned to that granularity.
(5) Similar deduplication
In similar deduplication, data blocks need not match exactly; a data block can be judged a duplicate block if two data blocks are similar. During chunking, the data is likewise divided at a preset granularity.
(6) Fingerprint (fingerprint, FP)
A fingerprint is an essential feature of a data block. Data blocks themselves tend to be large, so the purpose of a fingerprint is to distinguish different data blocks using a small representation (e.g., 16, 32, 64, or 128 bytes). In some embodiments, a hash algorithm is used to compute the fingerprint of a data block, and the fingerprint is the hash value of the data block. Ideally, each data block has a unique fingerprint, and different data blocks have different fingerprints. Of course, when hash collisions occur, different data blocks may have the same fingerprint.
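To make the fingerprint sizes concrete, the Python snippet below computes fingerprints of 16, 32, and 64 bytes for the same 8KB data block. The specific hash functions (BLAKE2b truncated to 16 bytes, SHA-256, SHA-512) are assumptions chosen for illustration; the source does not name a hash algorithm.

```python
import hashlib

block = bytes(8 * 1024)  # an 8KB data block of zero bytes, used only as sample input

fingerprints = {
    "blake2b-16": hashlib.blake2b(block, digest_size=16).digest(),
    "sha256": hashlib.sha256(block).digest(),
    "sha512": hashlib.sha512(block).digest(),
}

# Fingerprint length in bytes for each algorithm.
sizes = {name: len(fp) for name, fp in fingerprints.items()}

# Changing a single byte of the block changes the fingerprint.
other_block = b"\x01" + block[1:]
differs = hashlib.sha256(other_block).digest() != fingerprints["sha256"]
```

Whatever the algorithm, the fingerprint is far smaller than the block it represents, which is what makes table lookups on fingerprints practical.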
(7) Compression
Compression is a byte-level data reduction technique, the idea being to use coding techniques to represent longer data in a shorter, coded format, thereby achieving a reduction in data size.
(8) Compression ratio
The compression ratio is a positive integer greater than or equal to 1. It indicates the ratio of the amount of data before compression to the amount of data after compression. For example, compressing 32KB of data to 8KB gives a compression ratio of 4:1.
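The ratio can be computed directly from the two sizes, as in the short Python sketch below. The helper name `compression_ratio` and the use of zlib for the measured case are assumptions; the first calculation reproduces the 32KB to 8KB example from the text, and the second measures whatever ratio zlib actually achieves on a repetitive sample buffer.

```python
import zlib


def compression_ratio(bytes_before: int, bytes_after: int) -> float:
    """Ratio of the data amount before compression to the amount after it."""
    if bytes_before <= 0 or bytes_after <= 0:
        raise ValueError("sizes must be positive")
    return bytes_before / bytes_after


# The example from the text: 32KB compressed to 8KB.
example_ratio = compression_ratio(32 * 1024, 8 * 1024)

# Ratio measured on a highly repetitive buffer of about 32KB (sample data only).
raw = b"storage" * (32 * 1024 // 7)
measured_ratio = compression_ratio(len(raw), len(zlib.compress(raw)))
```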
(9) Deduplication compression
Deduplication compression refers to a data reduction technique that combines deduplication with compression. Under a deduplication compression scheme, after the duplicate blocks and the non-duplicate blocks are identified, the non-duplicate blocks are compressed and the resulting compressed blocks are stored. When data is read, the compressed blocks are decompressed. Because the non-duplicate blocks are also compressed, the data reduction effect is the combination of the deduplication effect and the compression effect, so more data is reduced.
(10) Read amplification
Read amplification refers to the situation in which the granularity of data actually read from the hard disk is larger than the granularity of data corresponding to the read request. Read amplification consumes network bandwidth resources and degrades data read performance. For example, the granularity of data stored on a hard disk is 8KB. The storage device receives a read request that instructs it to read 4KB of data; the storage device reads from the hard disk the 8KB data block in which the 4KB of data is located, extracts the requested 4KB from the 8KB data block, and returns the 4KB of data to the initiator of the read request. In this example, the read request corresponds to a data granularity of 4KB while the actual read granularity is 8KB; the additional 4KB read occupies extra bandwidth and affects data read performance.
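The arithmetic behind this example can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that every stored block touched by a request must be read in full.

```python
def bytes_read_from_disk(offset: int, length: int, stored_granularity: int) -> int:
    """Bytes actually read when every stored block touched by the request
    [offset, offset + length) has to be read in full."""
    if length <= 0 or stored_granularity <= 0 or offset < 0:
        raise ValueError("offset must be >= 0; length and granularity must be > 0")
    first_block = offset // stored_granularity
    last_block = (offset + length - 1) // stored_granularity
    return (last_block - first_block + 1) * stored_granularity


def read_amplification(offset: int, length: int, stored_granularity: int) -> float:
    """Bytes read from disk divided by bytes requested."""
    return bytes_read_from_disk(offset, length, stored_granularity) / length


# The example from the text: a 4KB read against data stored at 8KB granularity.
amplification = read_amplification(0, 4 * 1024, 8 * 1024)
```

A request that straddles a block boundary is amplified further, since two full blocks have to be read to serve it.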
(11) Fingerprint index (fingerprint index, FPI)
A fingerprint index is an index of the fingerprint of a data block. The fingerprint index is used to look up the corresponding fingerprint. For brevity, where it does not hinder understanding, the embodiments of the present application use the form "FPI" followed by a number to denote a specific FPI, where the number identifies the corresponding data block; for example, FPI4 denotes the fingerprint index of data block 4.
(12) Storage unit
A storage unit is the smallest unit in the storage device on which a storage operation can be performed; storage operations include data write operations and data read operations. For example, a storage unit is a sector of the storage device.
(13) Physical address
The physical address refers to the actual address at which data is stored in the hard disk. Specifically, each region on the hard disk is determined by the head, the cylinder (that is, the track), and the sector in which the region is located. The physical address therefore includes three parameters: a head parameter, a cylinder parameter, and a sector parameter. The head parameter identifies the head where the data is located, the cylinder parameter identifies the cylinder where the data is located, and the sector parameter identifies the sector where the data is located. The physical address of the data indicates which head the hard disk should use and from which cylinder and sector the data should be read.
(14) Logical address
A logical address is different from a physical address. The full name of the logical address is logical block address (LBA). Using the LBA as the address of data converts the three-dimensional addressing based on head, cylinder, and sector (that is, the physical address) into one-dimensional linear addressing, which improves addressing efficiency.
The logical address is an address in the logical space that the storage device presents to the host. When sending a write request or a read request to the storage device, the host carries the logical address in the request. When receiving a write request or a read request, the storage device obtains the logical address carried in the request, determines a physical address by performing one or more address translations on the logical address, and writes data to or reads data from the physical address.
Consecutive logical addresses are, for example, consecutive LBAs. For example, if the LBA of data block 1 is 201, the LBA of data block 2 is 202, and the LBA of data block 3 is 203, the logical addresses of data block 1, data block 2, and data block 3 are consecutive.
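The conversion from three-dimensional to linear addressing can be illustrated with the conventional cylinder-head-sector (CHS) to LBA formula, LBA = (C × HPC + H) × SPT + (S − 1). The Python sketch below applies this standard formula; it is not taken from the source, and the geometry values (16 heads per cylinder, 63 sectors per track) as well as the function names are hypothetical. A second helper checks whether a list of LBAs is consecutive, as in the data block example above.

```python
from typing import List


def chs_to_lba(cylinder: int, head: int, sector: int,
               heads_per_cylinder: int, sectors_per_track: int) -> int:
    """Conventional CHS -> LBA conversion; sectors are numbered from 1."""
    if cylinder < 0 or not 0 <= head < heads_per_cylinder:
        raise ValueError("cylinder or head out of range")
    if not 1 <= sector <= sectors_per_track:
        raise ValueError("sector out of range")
    return (cylinder * heads_per_cylinder + head) * sectors_per_track + (sector - 1)


def are_consecutive(lbas: List[int]) -> bool:
    """True if each LBA is exactly one greater than the previous one."""
    return all(b - a == 1 for a, b in zip(lbas, lbas[1:]))


# Hypothetical disk geometry: 16 heads per cylinder, 63 sectors per track.
lba_of_second_cylinder = chs_to_lba(1, 0, 1, 16, 63)

# Data blocks 1, 2, and 3 from the example above.
blocks_consecutive = are_consecutive([201, 202, 203])
```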
(15) Metadata
Metadata is data that describes attributes of service data. For example, metadata describes fingerprints, logical addresses, physical addresses, mappings between logical addresses and physical addresses, mappings between fingerprints and logical addresses, and so on. Metadata is stored differently from service data: it is typically stored in a data structure such as a binary tree or a B+ tree, and these data structures require the metadata to be managed at a certain granularity. In some embodiments of the present application, the granularity at which metadata is managed is referred to as the metadata management granularity.
(16) Fingerprint table
The fingerprint table is used for storing the fingerprint of each data block stored by the storage device.
(17) Opportunity table
The opportunity table is used to keep the fingerprints of the data blocks written to the storage device during the most recent period of time. The opportunity table is distinct from the fingerprint table and can be understood as a temporary window for finding data blocks that offer a deduplication opportunity. Specifically, the storage device stores the fingerprints of data blocks generated in a recent period in the opportunity table. When a deduplication trigger condition (e.g., the load falling below a threshold) is satisfied, the storage device finds duplicate blocks according to the fingerprints in the opportunity table and, after deduplicating the duplicate blocks, stores the fingerprints of the duplicate blocks in the fingerprint table.
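The interplay between the opportunity table and the fingerprint table can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the class `PostDedup`, the dictionary standing in for the hard disk, the `remap` table, and SHA-256 fingerprints are all hypothetical, each address is assumed to be written only once, and the trigger condition is left to the caller.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


class PostDedup:
    """Toy post-deduplication flow driven by an opportunity table."""

    def __init__(self):
        self.disk: Dict[int, bytes] = {}                       # address -> data block
        self.opportunity_table: List[Tuple[bytes, int]] = []   # recent (fingerprint, address)
        self.fingerprint_table: Dict[bytes, int] = {}          # fingerprint -> kept address
        self.remap: Dict[int, int] = {}                        # removed address -> kept address

    def write(self, address: int, block: bytes) -> None:
        """Write first; only remember the fingerprint for later deduplication."""
        self.disk[address] = block
        self.opportunity_table.append((hashlib.sha256(block).digest(), address))

    def run_dedup(self) -> None:
        """Call when the trigger condition holds (e.g. load below a threshold)."""
        groups = defaultdict(list)
        for fp, address in self.opportunity_table:
            groups[fp].append(address)
        for fp, addresses in groups.items():
            kept = self.fingerprint_table.get(fp)
            if kept is None:
                if len(addresses) < 2:
                    continue            # no duplicate found in this window
                kept = addresses[0]     # keep the first copy
            for address in addresses:
                if address != kept:
                    self.disk.pop(address, None)   # delete the duplicate copy
                    self.remap[address] = kept
            self.fingerprint_table[fp] = kept      # record the deduplicated fingerprint
        self.opportunity_table.clear()

    def read(self, address: int) -> bytes:
        return self.disk[self.remap.get(address, address)]
```

After three writes of which two carry identical data, one call to `run_dedup` leaves two blocks on the simulated disk, one entry in the fingerprint table, and an empty opportunity table, while reads of the removed address are still served from the kept copy.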
The following describes an application scenario provided by an embodiment of the present application.
The method provided by the embodiment can be applied to a distributed storage system or a centralized storage device, and the two application scenarios are respectively described below.
Application scenario one: distributed storage system
Referring to fig. 1, the present embodiment provides a distributed storage system 100. As shown in fig. 1, the system 100 includes a plurality of storage nodes 101 and at least one host 102. Each host 102 establishes a communication connection with the storage nodes 101 through a wired or wireless network, for example through an Internet Protocol (IP) network or another network.
Each storage node 101 includes a network card 1011, one or more hard disks 1012, a processor 1013, a processor 1014, and a memory 1015.
The network card 1011 is also called a Network Interface Card (NIC). Network card 1011 is used to communicate with host 102.
The hard disk 1012 is, for example, a Solid State Disk (SSD) or a Hard Disk Drive (HDD). The present embodiment does not limit the positional relationship between the storage node 101 and the hard disk 1012. In some embodiments, as shown in FIG. 1, hard disk 1012 is located internal to storage node 101. For example, the storage node 101 is a server in which a plurality of hard disks are installed. In other embodiments, hard disk 1012 is not internal to storage node 101, but is instead located within a hard disk enclosure coupled to storage node 101, which includes a plurality of hard disks 1012.
Processor 1013 is, for example, a Central Processing Unit (CPU). The storage node 101 has, for example, one or more processors 1013.
Processor 1014 undertakes the compression and/or deduplication functions, thereby reducing the computational burden on processor 1013. In some embodiments, processor 1014 has the same physical form as processor 1013. In other embodiments, processor 1014 has a different physical form from processor 1013. Optionally, the processor 1014 is a processing chip with computing capabilities. For example, the processor 1014 may be an accelerator card, a coprocessor, a Graphics Processing Unit (GPU), a Neural-Network Processing Unit (NPU), or the like. The storage node 101 has, for example, one or more processors 1014.
Where storage node 101 includes both processor 1014 and processor 1013, processor 1014 and processor 1013 optionally cooperate to perform data processing operations. For example, processor 1013 is configured to receive data from a host, send the data to processor 1014, and instruct processor 1014 to compress and/or deduplicate the data. Processor 1014 performs compression and/or deduplication upon receiving the instruction from processor 1013.
In some embodiments, where storage node 101 has multiple processors 1014, processor 1013 is configured to schedule multiple processors 1014. For example, the processor 1013 divides the compression task and/or the deduplication task into a plurality of subtasks, and allocates each subtask to the corresponding processor 1014.
In some embodiments, storage node 101 further comprises a communication bus (not shown in fig. 1), and processor 1014 and processor 1013 each access memory 1015, e.g., via the communication bus, to obtain instructions or code cached by memory 1015.
It is noted that processor 1014 is an optional component of storage node 101. In other embodiments, storage node 101 does not have processor 1014 and has only processor 1013. For example, the operations of acquiring data and compressing and/or deduplicating the data may be performed independently by the processor 1013.
Host 102 includes application 1031 and client 1032.
Storage node 101 is capable of providing data storage services to host 102. For example, when the host 102 needs to store data to the storage node 101, the application 1031 (also referred to as an upper layer application) in the host 102 generates a write request, and sends the write request to the storage node 101. The storage node receives the write request through the network card 1011, writes data indicated by the write request to the hard disk 1012, and stores metadata of the data.
Storage node 101 is capable of providing data access services to host 102. For example, when the host 102 needs to access data stored in the storage node 101, the application 1031 in the host 102 generates a read request and sends the read request to the storage node 101. The storage node receives a read request through the network card 1011. The storage node determines the address of the data in the hard disk 1012 according to the data indicated by the read request and the stored metadata, reads the data from the corresponding address in the hard disk 1012, and sends the data to the host 102 through the network card 1011. The host 102 receives the data, thereby obtaining the data stored by the storage node 101.
Application scenario two: centralized storage device
The centralized storage device is, for example, a storage array. The storage array includes one or more controllers, also referred to as storage controllers, and one or more hard disks. The centralized storage device may also be a storage node, such as the storage node 101 shown in FIG. 1. The centralized storage device is connected to the host through a wired network or a wireless network.
In the application scenarios described above, as data grows explosively, the data storage demand of the host increases day by day, and the space occupied by the data in the storage system also increases. To alleviate the space growth problem of the storage system, deduplication and compression technology has become a subject of intense research in the field. Deduplication and compression reduce the size of data, thereby effectively reducing the overhead of a storage system.
In existing deduplication and compression schemes, since the metadata management granularity is fixed, the granularity used for compression and the granularity used for deduplication are also kept consistent.
However, research shows that under the condition that the same granularity is adopted for the deduplication and the compression, if the granularity is too large, the compression rate can be improved, and the deduplication rate can be reduced; if the granularity is too small, the rate of deduplication is improved, and the compression rate is reduced. Therefore, if the same granularity is used for the deduplication and the compression, either the deduplication rate or the compression rate is reduced, and the deduplication rate and the compression rate cannot be better at the same time.
In view of this, in the embodiment of the present application, the storage device may adopt different granularities when performing deduplication and compression, respectively. That is, the deduplication granularity and the compression granularity are different, so as to avoid a case where the deduplication rate decreases because the granularity is too large and a case where the compression rate decreases because the granularity is too small. In the following, the granularities referred to in the present application and the relationships between them are described in (a) to (g).
(a) Granularity
Granularity refers to the size of data, or the length of data. The larger the granularity, the larger the size of one piece of data. The units of granularity include, but are not limited to, kilobytes (KB), megabytes (MB, or M for short), and the like. For example, a granularity of 4KB means that the size of a piece of data is 4KB. Granularity is an important parameter for a storage device. Granularity can affect many services of a storage device, including, but not limited to, data reading, data storage, deduplication, compression, metadata management, and so on. Some embodiments of the present application focus on the granularities involved in services such as deduplication, compression, and metadata management, the relationships between the different granularities, and the impact of granularity on storage devices.
(b) Granularity of deduplication
The deduplication granularity is used to indicate the granularity used when the storage device queries for duplicate data. The deduplication granularity is equal to the granularity of the blocks in the deduplication process. For example, where the size of the deduplication granularity is 4KB, the storage device divides the data into multiple 4KB data blocks and determines whether each 4KB data block is a duplicate block; if a 4KB data block is a duplicate block, the storage device deletes that 4KB data block. In some embodiments, if fixed-length deduplication is used, the size of the deduplication granularity is 4KB. If similarity-based deduplication is used, the size of the deduplication granularity is 8KB.
(c) Compression granularity
The compression granularity is used to indicate the granularity at which the storage device performs data compression. The storage device determines how much data to compress at one time based on the compression granularity. For example, where the size of the compression granularity is 32KB, the storage device compresses 32KB of data at a time. The size of the compression granularity affects the compression rate. It has been found experimentally that the compression rate is positively correlated with the size of the compression granularity up to a point: when the size of the compression granularity is less than 32KB, the larger the compression granularity, the larger the compression rate; when the size of the compression granularity exceeds 32KB, the compression rate tends to be stable. In some embodiments, the size of the compression granularity is determined according to the compression rate. In some embodiments, the size of the compression granularity is set to 32KB.
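The effect of the compression granularity on the compression rate can be explored with a small experiment. The sketch below is an illustration only: zlib from the Python standard library stands in for the compression algorithm, which the application does not name, and the sample data is a hypothetical, highly repetitive byte pattern. The exact numbers depend on the data and the algorithm and are not results from the application.

```python
import zlib

KB = 1024


def compressed_size(data: bytes, granularity: int) -> int:
    """Compress `data` in independent chunks of `granularity` bytes and
    return the total size of all compressed chunks."""
    return sum(
        len(zlib.compress(data[i:i + granularity]))
        for i in range(0, len(data), granularity)
    )


# Hypothetical sample data: 128 KB of a repeating 256-byte pattern.
sample = bytes(range(256)) * 512

for granularity in (4 * KB, 8 * KB, 16 * KB, 32 * KB):
    total = compressed_size(sample, granularity)
    print(f"{granularity // KB:>2} KB chunks: {total} bytes, "
          f"ratio {len(sample) / total:.1f}:1")
```

Because each chunk is compressed independently, smaller chunks pay the per-chunk overhead more often and see less redundancy, which is consistent with the trend described above.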
(d) Metadata management granularity
The metadata management granularity is used to indicate the granularity of metadata recorded by the storage device. For example, in some embodiments, the value of the metadata management granularity is not a fixed value, but can change dynamically within a certain range. Specifically, the metadata management granularity is an interval with a minimum value and a maximum value. The granularity of metadata recorded by the storage device is, for example, the minimum value of the interval, the maximum value of the interval, or a value between the minimum value and the maximum value. For example, if the metadata management granularity is [4KB, 1M], the storage device optionally records a piece of 4KB metadata, a piece of 1M metadata, or a piece of metadata with a granularity between 4KB and 1M.
In some embodiments, the size of the metadata management granularity is an integer multiple of the size of the storage unit. For example, the metadata is stored in at least one storage unit. The correspondence between a piece of metadata and storage units is one-to-one or one-to-many. When the correspondence is one-to-one, one storage unit stores one piece of metadata. When the correspondence is one-to-many, a plurality of storage units collectively store one piece of metadata; for example, a plurality of storage units that are continuous in physical address and continuous in logical address collectively store one piece of metadata. When metadata is stored in this manner, the minimum value of the metadata management granularity is the size of one storage unit.
For example, referring to fig. 2, fig. 2 shows 8 storage units in the storage device: storage unit 201, storage unit 202, storage unit 203, storage unit 204, storage unit 205, storage unit 206, storage unit 207, and storage unit 208. The minimum value of the metadata management granularity is the size of one of the 8 storage units, which is represented by one cell in fig. 2. The maximum value of the metadata management granularity is the sum of the sizes of the 8 storage units, which is represented by 8 cells in fig. 2.
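The metadata management granularity described above, an interval whose values are integer multiples of the storage unit size, can be modeled as follows. This is a minimal sketch; the class name `MetadataGranularity`, its fields, and the 4KB unit size are assumptions made for illustration.

```python
from dataclasses import dataclass

KB = 1024


@dataclass(frozen=True)
class MetadataGranularity:
    """Hypothetical model of the interval [minimum, maximum]."""
    unit_size: int   # size of one storage unit, in bytes
    max_units: int   # number of storage units one piece of metadata may span

    @property
    def minimum(self) -> int:
        return self.unit_size

    @property
    def maximum(self) -> int:
        return self.unit_size * self.max_units

    def allows(self, size: int) -> bool:
        """A size is allowed if it lies in the interval and is a whole
        number of storage units."""
        return self.minimum <= size <= self.maximum and size % self.unit_size == 0


# Eight storage units as in fig. 2, assuming each unit is 4 KB.
granularity = MetadataGranularity(unit_size=4 * KB, max_units=8)
print(granularity.minimum, granularity.maximum)  # 4096 32768
print(granularity.allows(12 * KB))               # True
print(granularity.allows(6 * KB))                # False
```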
(e) Relationship between deduplication granularity and compression granularity
In some embodiments, the size of the deduplication granularity is inversely proportional to the size of the compression granularity. The size of the deduplication granularity is smaller than the size of the compression granularity. For example, the size of the deduplication granularity is 4KB and the size of the compression granularity is 32KB. The smaller the deduplication granularity, the better the deduplication rate, and the larger the compression granularity, the better the compression rate. Therefore, by compressing at a large granularity and deduplicating at a small granularity, the storage device improves both the deduplication rate and the compression rate and achieves a better overall data reduction rate.
In some embodiments, both the deduplication granularity and the compression granularity are determined according to the metadata management granularity, see (f) and (g) below. The storage device respectively selects the granularity for the deduplication and the compression according to the metadata management granularity, so that the deduplication and the compression are both facilitated to operate according to the respective better granularity, and the deduplication rate and the compression rate are both better.
(f) Relationship between deduplication granularity and metadata management granularity
In some embodiments, the deduplication granularity is determined from the metadata management granularity. In some embodiments, the size of the deduplication granularity is equal to the minimum value of the metadata management granularity. For example, in the case where the metadata management granularity is [4KB, 1M], the size of the deduplication granularity is equal to 4KB. For example, in the case where the metadata management granularity is [8KB, 2M], the size of the deduplication granularity is equal to 8KB. In the case where the minimum value of the metadata management granularity is the size of one storage unit, the size of the deduplication granularity is, for example, the size of one storage unit. For example, referring to fig. 2, the size of the deduplication granularity is, for example, the size of the storage unit 201.
By taking the minimum value of the metadata management granularity as the deduplication granularity, the deduplication granularity is favorably selected to be the better granularity, so that the deduplication rate is improved, and storage resources are saved. The technical principle of this technical effect is discussed below.
On the one hand, the deduplication granularity affects the deduplication rate. If the deduplication granularity is too large, the deduplication rate may decrease. For example, if the deduplication granularity is 32KB, the storage device deletes 32KB of data as a duplicate block only when the entire 32KB is duplicate data. If only a part of the 32KB data is duplicated, for example, 24KB is duplicated and the other 8KB is not, the storage device does not deduplicate any of it. As can be seen from this example, a deduplication granularity that is too large results in poor deduplication performance.
On the other hand, the deduplication granularity may impact the storage overhead of the metadata. Since the storage device records one piece of metadata for each repeated block, the smaller the deduplication granularity is, the more metadata the storage device records. Therefore, if the deduplication granularity is too small, an excessive number of duplicate blocks are generated, and the storage device needs to record too much metadata, so that the metadata occupies too much storage resources.
In contrast, in the present embodiment, the minimum value of the metadata management granularity is used as the deduplication granularity, so the deduplication granularity is sufficiently small, which helps improve the deduplication rate. Moreover, when the storage device performs deduplication, it does not need to record metadata for data smaller than the minimum value of the metadata management granularity, so the resource waste caused by recording too much metadata is avoided.
In other embodiments, the size of the deduplication granularity is not the minimum value of the metadata management granularity, but an integer multiple of that minimum value. For example, in the case where the metadata management granularity is [4KB, 1M], the size of the deduplication granularity is any multiple of 4KB between 4KB and 1M. For example, the size of the deduplication granularity is 2 times or 3 times the minimum value of the metadata management granularity. Here, in the case where the minimum value of the metadata management granularity is the size of one storage unit, the size of the deduplication granularity is, for example, an integer multiple of the size of the storage unit. For example, referring to fig. 2, the size of the deduplication granularity is, for example, an integer multiple of the size of the storage unit 201.
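The example in which only 24KB of a 32KB range is duplicated can be checked with a short sketch. The helper names below are hypothetical, SHA-256 stands in for the unspecified fingerprint function, and the byte patterns are made up so that every 4KB block has distinct content.

```python
import hashlib

KB = 1024


def split(data: bytes, granularity: int):
    """Divide data into fixed-length blocks of `granularity` bytes."""
    return [data[i:i + granularity] for i in range(0, len(data), granularity)]


def duplicate_bytes(stored: bytes, incoming: bytes, granularity: int) -> int:
    """Count how many incoming bytes are removed by fixed-length deduplication
    at the given granularity."""
    known = {hashlib.sha256(b).digest() for b in split(stored, granularity)}
    return sum(
        len(b) for b in split(incoming, granularity)
        if hashlib.sha256(b).digest() in known
    )


# 32 KB already stored: eight 4 KB blocks, each filled with a different byte.
stored = b"".join(bytes([i]) * (4 * KB) for i in range(8))
# 32 KB to be written: the first 24 KB repeat stored data, the last 8 KB are new.
incoming = stored[:24 * KB] + b"".join(bytes([0xF0 + i]) * (4 * KB) for i in range(2))

print(duplicate_bytes(stored, incoming, 32 * KB) // KB)  # 0
print(duplicate_bytes(stored, incoming, 4 * KB) // KB)   # 24
```

At 32KB granularity the single incoming block does not match, so nothing is deduplicated; at 4KB granularity the six repeated blocks are all found.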
(g) Relationship between compression granularity and metadata management granularity
In some embodiments, the compression granularity is determined from the metadata management granularity. In some embodiments, the compression granularity is determined jointly from the metadata management granularity and the compression rate. In some embodiments, the compression granularity is determined according to the minimum value of the metadata management granularity and the compression rate. For example, the size of the compression granularity is the product of the minimum value of the metadata management granularity and the compression rate. For example, if the compression rate is N:1, the size of the compression granularity is N times the minimum value of the metadata management granularity, where N is a positive integer. For example, when the metadata management granularity is [8KB, 2M] and the compression rate is 4:1, the size of the compression granularity is 8KB × 4 = 32KB.
In the case where the minimum value of the metadata management granularity is the size of one storage unit, the size of the compression granularity is, for example, the product of the size of one storage unit and the compression rate. For example, referring to fig. 2, the compression granularity is, for example, the product of the size of the storage unit 201 and the compression rate. For example, in the case where the compression rate is 4:1, the compression granularity is 4 times the size of the storage unit 201, and the compression granularity corresponds to four cells in fig. 2. The size of the compression granularity is no longer a fixed value but is selected dynamically according to the compression rate, so that the compression rate is better while the data reading performance is not reduced.
The granularities and granularity relationships described above apply, for example, to the flow in which a storage device writes data. The flow of writing data involves a pre-deduplication flow and a post-deduplication flow. How the storage device uses the above granularities in the pre-deduplication flow is illustrated by the method 300, and how the storage device uses them in the post-deduplication flow is illustrated by the method 400.
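The rule that the compression granularity is the product of the minimum metadata management granularity and the compression rate can be written out as a small calculation. The function name and parameter names below are hypothetical.

```python
KB = 1024


def compression_granularity(min_metadata_granularity: int, compression_ratio: int) -> int:
    """For a compression rate of N:1, return N times the minimum value of the
    metadata management granularity. N must be a positive integer."""
    if compression_ratio < 1:
        raise ValueError("compression_ratio must be a positive integer")
    return min_metadata_granularity * compression_ratio


# Metadata management granularity [8KB, 2M] with a compression rate of 4:1.
print(compression_granularity(8 * KB, 4) // KB)  # 32
```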
The method 300 and the method 400 described below are performed by a storage device.
In some embodiments, the method 300 or the method 400 is applied in a distributed storage system, and the storage device executing the method 300 or the method 400 is one or more storage nodes in the distributed storage system. For example, the storage device executing the method 300 or the method 400 is the storage node 101 in the system 100 shown in fig. 1, and the data processed by the method 300 or the method 400 is the data of the host 102 in the system 100.
In other embodiments, the method 300 or the method 400 is applied in a centralized storage device, and the storage device performing the method 300 or the method 400 is a storage array.
In some embodiments, method 300 or method 400 is performed by a CPU. In other embodiments, method 300 or method 400 is performed by a CPU in cooperation with a processor dedicated to deduplication and compression, such as a hardware accelerator card. For example, the CPU is the processor 1013 shown in fig. 1, and the dedicated processor is the processor 1014 shown in fig. 1. Specifically, the deduplication and compression process involves tasks such as data blocking, fingerprint calculation, fingerprint lookup, data compression, and data storage. For example, the dedicated processor executes the fingerprint calculation task and the data compression task, and the CPU executes the other tasks of the deduplication and compression process. The fingerprint calculation task and the data compression task are thus offloaded from the CPU to the dedicated processor, which reduces the CPU computing resources occupied by deduplication and compression and accelerates the process.
It should be noted that, for the parts of the method 400 that are the same as the method 300, refer to the method 300; these details are not repeated in the description of the method 400.
Referring to fig. 3, fig. 3 is a flowchart of a data processing method 300 according to an embodiment of the present application.
Illustratively, the method 300 includes S310 to S360.
S310, the storage device acquires a plurality of data blocks.
In some embodiments, S310 specifically includes the following steps S311 to S313.
S311, the storage device receives a write request from the host.
The write request is used for requesting the storage device to store data, and the write request comprises the data to be stored and the logical address of the data.
S312, the storage device acquires the data from the write request.
S313, the storage device divides the data according to the first granularity to obtain a plurality of data blocks, and the size of each data block is equal to the size of the first granularity.
The first granularity is the deduplication granularity described above, that is, the granularity used by the storage device to perform deduplication processing. In some embodiments, the storage device determines the first granularity according to the metadata management granularity. For example, the storage device determines the minimum value of the metadata management granularity, and takes an integer multiple of that minimum value as the first granularity. Optionally, the minimum value of the metadata management granularity itself is taken as the first granularity. For example, in the case where the metadata management granularity is [4KB, 1M], the storage device determines that the minimum value of the metadata management granularity is 4KB, determines that the size of the first granularity is 4KB, and divides the data into a plurality of 4KB data blocks.
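Steps S311 to S313 can be sketched as follows. This is a simplified illustration, not the implementation of the application: the function names are hypothetical, the metadata management granularity is passed as a (minimum, maximum) tuple, and data whose length is not a whole number of blocks is rejected, a case the text does not address.

```python
KB = 1024
MB = 1024 * KB


def first_granularity(metadata_granularity, multiple=1):
    """Take an integer multiple of the minimum value of the metadata
    management granularity (given as a (minimum, maximum) tuple)."""
    minimum, maximum = metadata_granularity
    size = minimum * multiple
    if multiple < 1 or size > maximum:
        raise ValueError("first granularity must lie within the interval")
    return size


def split_into_blocks(data: bytes, granularity: int):
    """Divide the data of a write request into blocks of equal size."""
    # Assumption: the data length is a whole number of blocks.
    if len(data) % granularity:
        raise ValueError("data length is not a multiple of the granularity")
    return [data[i:i + granularity] for i in range(0, len(data), granularity)]


granularity = first_granularity((4 * KB, 1 * MB))
blocks = split_into_blocks(bytes(32 * KB), granularity)
print(granularity, len(blocks))  # 4096 8
```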
S320, the storage device determines the fingerprint of each data block in the plurality of data blocks.
In some embodiments, the storage device performs a fingerprint calculation on each data block to obtain a fingerprint of each data block. In some embodiments, the fingerprint of the data chunk is a hash value of the data chunk, and the storage device performs a hash calculation on each data chunk to obtain the hash value of each data chunk.
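A fingerprint computed as a hash value can be sketched with the Python standard library. The application does not name a hash function; SHA-256 is used here purely as an assumption, and the function name `fingerprint` is hypothetical.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    """Return the hash value of a data block as its fingerprint.
    Assumption: SHA-256; the application does not specify the hash."""
    return hashlib.sha256(block).hexdigest()


blocks = [b"a" * 4096, b"b" * 4096, b"a" * 4096]
fingerprints = [fingerprint(b) for b in blocks]
print(fingerprints[0] == fingerprints[2])  # True: identical data, identical fingerprint
print(fingerprints[0] == fingerprints[1])  # False
```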
S330, the storage device determines a repeated block and a non-repeated block in the plurality of data blocks according to the fingerprint of each data block.
The fingerprint of a duplicate block is the same as the fingerprint of a data block already stored by the storage device. The data contained in the duplicate block is either identical to the data contained in the stored data block, or partially identical to it. If fixed-length deduplication is used, the data contained in the duplicate block is identical to the data contained in the stored data block; if similarity-based deduplication is used, the data contained in the duplicate block only needs to be partially identical to the data contained in the stored data block. A non-duplicate block is a data block, among the plurality of data blocks, other than the duplicate blocks. Non-duplicate blocks are also referred to as data blocks that failed deduplication.
In some embodiments, the storage device determines whether the data block is a duplicate block or a non-duplicate block by performing a step of querying a fingerprint table, and when such an implementation is adopted, step S330 is also referred to as a fingerprint query. The fingerprint table is used for storing fingerprints of data blocks stored in the storage device. Specifically, taking the first data block of the plurality of data blocks as an example, step S330 includes: the storage device inquires a fingerprint table and compares the fingerprint of the first data block with the fingerprint in the fingerprint table; if the fingerprint of the first data block is the same as one of the fingerprints in the fingerprint table (i.e., the first data block hits in the fingerprint table), the storage device determines that the first data block is a duplicate block; if the fingerprint of the first data block is different from each fingerprint in the fingerprint table (i.e., the first data block misses the fingerprint table), the storage device determines that the first data block is a non-duplicate block.
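The fingerprint query of step S330 can be sketched as a set lookup. This is a minimal illustration: the fingerprint table is modeled as a Python set, SHA-256 stands in for the unspecified fingerprint function, and the function name is hypothetical.

```python
import hashlib


def classify_blocks(blocks, fingerprint_table):
    """Return (duplicate indexes, non-duplicate indexes) by looking up the
    fingerprint of each block in the fingerprint table."""
    duplicates, non_duplicates = [], []
    for index, block in enumerate(blocks):
        fp = hashlib.sha256(block).hexdigest()   # assumed fingerprint function
        if fp in fingerprint_table:              # hit: the block is a duplicate block
            duplicates.append(index)
        else:                                    # miss: the block is a non-duplicate block
            non_duplicates.append(index)
    return duplicates, non_duplicates


# The fingerprint table already holds the fingerprint of one stored block.
table = {hashlib.sha256(b"x" * 4096).hexdigest()}
incoming = [b"a" * 4096, b"x" * 4096, b"b" * 4096]
print(classify_blocks(incoming, table))  # ([1], [0, 2])
```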
In some embodiments, S330 is performed locally at the storage device. In other embodiments, S330 is performed by the storage device in cooperation with a dedicated server. The dedicated server is an independent device that is coupled to the storage device or connected to it through a network; it stores the fingerprint table and can be dedicated to fingerprint queries. Specifically, the storage device sends the fingerprint of each data block to the server according to a preset rule, and the server queries the fingerprint table according to the fingerprint of each data block. The fingerprint query task is thus shared with the server, which prevents the computation required by fingerprint queries from becoming a performance bottleneck for the storage device.
S330 may be performed at different times. In some embodiments, step S330 is performed in real time after step S320 is completed. In other embodiments, after performing step S320, the storage device determines whether the deduplication triggering condition is currently satisfied, and performs step S330 only if it is. For example, in the case that the deduplication triggering condition is that the load is lower than a threshold, after executing step S320 the storage device determines whether the load is lower than the threshold; if the load is higher than the threshold, the storage device caches the obtained fingerprints and waits, and if the load is lower than the threshold, the storage device performs S330.
The storage device performs operations on duplicate blocks that are different from operations performed on non-duplicate blocks. Specifically, the storage device may perform a deduplication operation on duplicate blocks and a compression operation on non-duplicate blocks. In the following, how to perform the deduplication operation on the storage device is illustrated through S340, and how to perform the compression operation on the storage device is illustrated through S350 and S360.
It should be noted that the present embodiment does not require that the deduplication operation and the compression operation both be performed. In other embodiments, only one of the deduplication operation and the compression operation is performed. Specifically, pre-deduplication includes a case where pre-deduplication succeeds and a case where pre-deduplication fails; the case where pre-deduplication fails refers to a case where the storage device performs compression without performing deduplication. In the case where pre-deduplication succeeds, if the storage device determines that all the data blocks to be stored are duplicate blocks, the storage device performs S340 without performing S350 and S360; in the case where pre-deduplication succeeds, if the storage device determines that some of the data blocks to be stored are duplicate blocks and the others are non-duplicate blocks, the storage device performs S340 as well as S350 and S360; in the case where pre-deduplication fails, the storage device performs S350 and S360, but does not perform S340.
For example, referring to fig. 4, (a) in fig. 4 is an illustration of a case where the pre-deduplication is successful. Fig. 4 (b) is an illustration of a case where pre-deduplication fails (without deduplication and with compression), and in the description later herein, it will be illustrated by referring to the flow involved in the scenario shown in fig. 4.
The scenario shown in fig. 4 (a) is specifically as follows. The host initiates a write request for 8 data blocks, namely data block 1, data block 2, through data block 8. After the storage device receives the write request, it performs fingerprint calculation on the 8 data blocks to obtain 8 fingerprints: fingerprint FP1 of data block 1, fingerprint FP2 of data block 2, through fingerprint FP8 of data block 8. The storage device queries the fingerprint table with each of the 8 fingerprints and finds that fingerprint FP4 and fingerprint FP7 hit the fingerprint table; that is, the storage device finds in the fingerprint table a fingerprint identical to fingerprint FP4 of data block 4 and a fingerprint identical to fingerprint FP7 of data block 7. Thus, the storage device determines that data block 4 and data block 7 are both duplicate blocks. The storage device does not store data block 4 or data block 7; instead, it records in the storage unit 204 the fingerprint index FPI4 corresponding to fingerprint FP4 of data block 4, and records in the storage unit 207 the fingerprint index FPI7 corresponding to fingerprint FP7 of data block 7. The storage device determines that the data blocks other than data block 4 and data block 7 are non-duplicate blocks. The storage device compresses data block 1, data block 2, and data block 3 to obtain compressed block 1, and uses one piece of metadata to represent compressed block 1, obtaining metadata 1 of compressed block 1. The storage device stores metadata 1 of compressed block 1 in three physically and logically continuous storage units: storage unit 201, storage unit 202, and storage unit 203.
The storage device compresses data block 5, data block 6, and data block 8 to obtain compressed block 2, and uses one piece of metadata to represent compressed block 2, obtaining metadata 2 of compressed block 2. The storage device stores metadata 2 of compressed block 2 in three physically continuous but logically discontinuous storage units: storage unit 205, storage unit 206, and storage unit 208. Metadata 2 comprises two parts, metadata 21 and metadata 22. The storage unit 205 and the storage unit 206 hold metadata 21, and the storage unit 208 holds metadata 22. In addition, the storage device writes compressed block 1 and compressed block 2 to the hard disk.
The scenario shown in fig. 4 (b) is specifically as follows. The host initiates a write request for 8 data blocks, namely data block 1, data block 2, through data block 8. After the storage device receives the write request, it performs fingerprint calculation on the 8 data blocks to obtain 8 fingerprints: fingerprint FP1 of data block 1, fingerprint FP2 of data block 2, through fingerprint FP8 of data block 8. Deduplication fails because the load of the storage device is above the threshold or for other reasons. In this case, the storage device compresses data block 1, data block 2, and data block 3 to obtain compressed block 1, and uses one piece of metadata to represent compressed block 1, obtaining metadata 1 of compressed block 1. The storage device stores metadata 1 of compressed block 1 in three physically and logically continuous storage units: storage unit 201, storage unit 202, and storage unit 203. The storage device compresses data block 4, data block 5, and data block 6 to obtain compressed block 2, and uses one piece of metadata to represent compressed block 2, obtaining metadata 2 of compressed block 2. The storage device stores metadata 2 of compressed block 2 in three physically and logically continuous storage units: storage unit 204, storage unit 205, and storage unit 206. The storage device compresses data block 7 and data block 8 to obtain compressed block 3, and uses one piece of metadata to represent compressed block 3, obtaining metadata 3 of compressed block 3. The storage device stores metadata 3 of compressed block 3 in two physically and logically continuous storage units: storage unit 207 and storage unit 208. In addition, the storage device writes compressed block 1, compressed block 2, and compressed block 3 to the hard disk.
S340, the storage device records metadata for the duplicate blocks.
The storage device records the metadata of the duplicate blocks without storing the duplicate blocks themselves, which prevents the duplicate blocks from occupying hard disk storage space and saves storage resources of the storage device.
In some embodiments, the storage device further records metadata of the non-duplicate blocks and stores the fingerprints of the non-duplicate blocks in a fingerprint table so that when a new block is subsequently deduplicated, the fingerprints of the previously stored non-duplicate blocks can be looked up in the fingerprint table.
In some embodiments, the storage device stores the fingerprints of the duplicate blocks. In some embodiments, the storage device stores the fingerprint indexes of the duplicate blocks. Optionally, the metadata recorded by the storage device for a duplicate block is the fingerprint index (FPI) of the duplicate block. Specifically, the storage device uses the fingerprint index of the duplicate block as the metadata of the duplicate block and writes it to the storage unit for that metadata, so that the fingerprint index is held in the metadata storage unit. For example, referring to fig. 4, after the storage device determines that data block 4 and data block 7 are duplicate blocks, it writes fingerprint index FPI4 of data block 4 to storage unit 204 and writes fingerprint index FPI7 of data block 7 to storage unit 207.
Referring to fig. 4, the size of the storage space occupied by the metadata of a duplicate block is equal to the first granularity. For example, the size of the storage space occupied by the metadata of a duplicate block is equal to the minimum value of the metadata management granularity. Referring to fig. 4, the minimum value of the metadata management granularity corresponds to one cell in fig. 4 and is, for example, the size of one storage unit. For example, fingerprint index FPI4 of data block 4 occupies storage unit 204, so the storage space occupied by fingerprint index FPI4 of data block 4 is equal in size to storage unit 204.
S340 can be implemented in various ways. In some embodiments, the storage device selects, according to the metadata management granularity, a first storage unit and stores the metadata of the duplicate block in it, where the granularity of the first storage unit is the minimum value of the metadata management granularity. For example, if the metadata management granularity is [4 KB, 1 MB], the storage device selects a 4 KB storage unit and stores the metadata of the duplicate block in that 4 KB storage unit.
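The fingerprint lookup and the recording of a fingerprint index for each duplicate block can be illustrated with a minimal sketch. The function names, the use of SHA-256 as the fingerprint, and the choice of the table position as the fingerprint index are assumptions made for illustration; the embodiment does not prescribe them.

```python
import hashlib


def fingerprint(block: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified fingerprint function.
    return hashlib.sha256(block).hexdigest()


def deduplicate(blocks, fingerprint_table):
    """Split blocks into duplicates and non-duplicates.

    fingerprint_table maps fingerprint -> fingerprint index (hypothetical layout).
    Returns (metadata_units, non_duplicates):
      metadata_units[i] is ("FPI", index) for a duplicate block, otherwise None;
      non_duplicates is a list of (position, block) pairs left for compression.
    """
    metadata_units = [None] * len(blocks)
    non_duplicates = []
    for pos, block in enumerate(blocks):
        fp = fingerprint(block)
        if fp in fingerprint_table:
            # Duplicate block: keep only its fingerprint index as metadata.
            metadata_units[pos] = ("FPI", fingerprint_table[fp])
        else:
            # Non-duplicate block: remember its fingerprint for later lookups.
            fingerprint_table[fp] = len(fingerprint_table)
            non_duplicates.append((pos, block))
    return metadata_units, non_duplicates
```

In the fig. 4 (a) scenario, positions 3 and 6 (data block 4 and data block 7) would receive a fingerprint index, and the remaining six blocks would be passed on to compression.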
S350, the storage device compresses the multiple non-repeated blocks to obtain compressed blocks.
A compressed block refers to a non-repeated block after compression. In some embodiments, S350 specifically includes the following steps a to c.
Step a, the storage device obtains the compression rate of the data.
The compression rate can be obtained in various ways. In some embodiments, a prediction model is trained on samples by machine learning; after data is obtained, the data is input into the prediction model, which predicts and outputs the compression rate of the data. In other embodiments, the compression rate is preset by the user based on experience. In still other embodiments, because two consecutive data blocks are highly likely to have the same compression rate, the compression rate of the previous data is used as the compression rate of the data to be compressed.
Step b, the storage device divides the plurality of non-duplicate blocks into at least one data block group according to the predicted compression rate, where the number of non-duplicate blocks contained in each data block group is equal to the compression rate (for a compression rate of N:1, each group contains N non-duplicate blocks).
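As an illustration of the last option, the following sketch reuses the compression rate observed for the previous data as the prediction for the next data. The class name, the default rate, and the rounding to an integer N (for a rate of N:1) are assumptions, not details given in the embodiment.

```python
class LastRatioPredictor:
    """Predict the compression rate N (as in N:1) from the previous data."""

    def __init__(self, default_ratio: int = 1):
        # Assumption: before anything is observed, predict no compression.
        self.ratio = default_ratio

    def predict(self) -> int:
        return self.ratio

    def observe(self, raw_size: int, compressed_size: int) -> None:
        # Record the rate actually achieved, rounded to a whole N, at least 1.
        self.ratio = max(1, round(raw_size / compressed_size))
```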
For example, if the predicted compression rate is 3:1, the storage apparatus divides 3 non-repetitive blocks into one data block group such that one data block group contains 3 non-repetitive blocks, so as to put 3 non-repetitive blocks together for compression; if the predicted compression rate is 4:1, the storage apparatus divides the 4 non-duplicate blocks into one data block group such that one data block group contains 4 non-duplicate blocks, so as to put the 4 non-duplicate blocks together for compression. For example, referring to fig. 4, in the case of a compression rate of 3:1, three lattices in fig. 4 may correspond to one compression block, e.g., referring to (a) in fig. 4, the storage apparatus divides data block 1, data block 2, and data block 3 into one group, and divides data block 5, data block 6, and data block 8 into another group. As another example, referring to fig. 4 (b), the storage device divides data block 1, data block 2, and data block 3 into one group, divides data block 4, data block 5, and data block 6 into another group, and divides data block 7 and data block 8 into one group.
In some embodiments, when grouping, the storage device considers not only the compression rate but also whether the non-duplicate blocks are contiguous. For example, according to the compression rate and the address of each non-duplicate block, the storage device places a plurality of contiguous non-duplicate blocks in the same data block group. Here, contiguous includes, without limitation, at least one of physically contiguous or logically contiguous. Physical contiguity is determined, for example, by whether the physical addresses are contiguous, and logical contiguity is determined, for example, by whether the logical addresses are contiguous.
Step c, the storage device compresses each data block group into one compressed block.
For example, referring to fig. 4 (a), the storage device compresses data block 1, data block 2, and data block 3 into compressed block 1, and compresses data block 5, data block 6, and data block 8 into compressed block 2. As another example, referring to fig. 4 (b), the storage device compresses data block 1, data block 2, and data block 3 into compressed block 1, data block 4, data block 5, and data block 6 into compressed block 2, and data block 7 and data block 8 into compressed block 3.
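Steps b and c can be sketched as follows. This is only an illustration: the function names are hypothetical, blocks are grouped in arrival order without the contiguity check, and zlib is used as a stand-in because the embodiment does not name a compression algorithm.

```python
import zlib


def group_blocks(non_duplicates, ratio):
    """Step b: split the non-duplicate blocks, in order, into groups of `ratio`.

    The last group may be shorter, as with data blocks 7 and 8 in fig. 4 (b).
    """
    return [non_duplicates[i:i + ratio]
            for i in range(0, len(non_duplicates), ratio)]


def compress_groups(groups):
    """Step c: compress each data block group into one compressed block."""
    return [zlib.compress(b"".join(group)) for group in groups]
```

With a compression rate of 3:1, grouping the block numbers [1, 2, 3, 5, 6, 8] yields [[1, 2, 3], [5, 6, 8]], which matches fig. 4 (a), and grouping [1, ..., 8] yields [[1, 2, 3], [4, 5, 6], [7, 8]], which matches fig. 4 (b).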
Through the above compression procedure, one compressed block is obtained by compressing all the non-duplicate blocks in one data block group, so the second granularity is equal to the product of the number of non-duplicate blocks in one data block group and the granularity of a non-duplicate block. Because the number of non-duplicate blocks in one data block group is equal to the compression rate, and the granularity of a non-duplicate block is equal to the minimum value of the metadata management granularity, the second granularity is equal to the product of the compression rate and the minimum value of the metadata management granularity. For example, referring to fig. 4, the compression rate is 3:1, the minimum value of the metadata management granularity corresponds to one cell in fig. 4, and the second granularity corresponds to three cells in fig. 4. For example, the second granularity is equal to the sum of the sizes of three storage units, namely storage unit 201, storage unit 202, and storage unit 203.
It can be seen that the compression granularity (i.e. the second granularity) provided by the present embodiment is no longer the same fixed value as the deduplication granularity (i.e. the first granularity), but is determined according to the minimum value of the metadata management granularity and the compression rate, thus supporting the function of dynamically selecting the compression granularity. The dynamic selection of the compression granularity is helpful for ensuring that the compression rate is better under the condition that the data reading performance is not reduced, and the balance between the maximum compression rate and the read amplification is obtained. The technical principle for achieving this technical effect will be described below with reference to a specific example.
For example, the minimum value of the metadata management granularity is 8 KB, and the storage device obtains 4 data blocks of 8 KB each: data block a, data block b, data block c, and data block d. The storage device predicts that the compression rate of each of the 4 data blocks is 4:1, so it determines that the compression granularity is 32 KB, compresses the 32 KB of data in data block a, data block b, data block c, and data block d together to obtain an 8 KB compressed block e, and stores compressed block e in the hard disk. Thereafter, the storage device receives a read request that instructs it to read data block a. In response to the read request, the storage device reads the 8 KB compressed block e from the hard disk, decompresses compressed block e to obtain data block a, data block b, data block c, and data block d, and returns data block a to the initiator of the read request. In this example, the granularity of the data requested by the read request is 8 KB, and the granularity of the data actually read from the hard disk (that is, the granularity of compressed block e) is also 8 KB. Because the two are the same, no read amplification occurs, which avoids the loss of read performance and the consumption of bandwidth resources that read amplification would cause. In addition, because compression is performed at the larger granularity of 32 KB, the compression rate is better.
S360, the storage device stores the compressed block in the hard disk and records metadata for the compressed block.
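The arithmetic in this example can be checked with a short sketch. The function names are hypothetical, and the sketch assumes that the data compresses exactly at the predicted rate; the 64 KB case in the usage note is an added comparison, not taken from the embodiment.

```python
KB = 1024


def compression_granularity(min_metadata_granularity: int, ratio: int) -> int:
    # Second granularity = minimum metadata management granularity x compression rate.
    return min_metadata_granularity * ratio


def read_amplification(granularity: int, ratio: int, request_size: int) -> float:
    # Bytes read from disk (one whole compressed block) per byte requested,
    # assuming the block compresses exactly at `ratio`:1.
    compressed_size = granularity // ratio
    return compressed_size / request_size
```

With a minimum metadata management granularity of 8 KB and a rate of 4:1, `compression_granularity(8 * KB, 4)` gives 32 KB and `read_amplification(32 * KB, 4, 8 * KB)` gives 1.0, that is, no read amplification. Under the same assumptions, a hypothetical fixed granularity of 64 KB would give 2.0.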
By executing S360, the storage apparatus stores the data (compressed block) subjected to the deduplication processing and the compression processing in the hard disk of the storage apparatus.
The metadata of the compressed block represents, for example, a mapping relationship between the logical address of the data and the physical address of the compressed block.
In some embodiments, the storage device stores the metadata of the compressed block in a storage unit whose size belongs to [minimum value of the metadata management granularity, compression granularity]. In this way, the minimum value of the granularity of the metadata of the compressed block is the minimum value of the metadata management granularity, and the maximum value of the granularity of the metadata of the compressed block is the compression granularity (the second granularity). In some embodiments, the granularity of the metadata of the compressed block is the product of the minimum value of the metadata management granularity and the compression rate. For example, if the compression rate is N:1 and the minimum value of the metadata management granularity is the size of one storage unit, the storage device selects N storage units and stores the metadata of the compressed block in the N storage units. For example, referring to fig. 4, the compression rate is 3:1 and the minimum value of the metadata management granularity is the size of one storage unit, so the storage device selects 3 storage units to record the metadata of one compressed block, and the granularity of the metadata of the compressed block is three times the size of a storage unit. For example, the storage device selects storage unit 201, storage unit 202, and storage unit 203 to record metadata 1 of compressed block 1, and the granularity of metadata 1 of compressed block 1 is the sum of the sizes of storage unit 201, storage unit 202, and storage unit 203.
The storage device records the metadata in at least one of the following ways, Mode A and Mode B.
Mode A, the storage device records one piece of metadata for a plurality of compressed blocks whose addresses are consecutive.
In some embodiments, where the addresses of the plurality of compressed blocks are consecutive, the storage device uses one piece of metadata to represent the plurality of compressed blocks. The addresses of the plurality of compressed blocks being consecutive means, for example, that the physical addresses of the compressed blocks are consecutive and that their logical addresses are consecutive. In some embodiments, the piece of metadata recorded by the storage device includes two parts: one part is the address of the first compressed block among the plurality of compressed blocks with consecutive addresses, and the other part is the length of each of those compressed blocks. For example, referring to fig. 7, one cell in fig. 7 represents the minimum granularity of a data block or of metadata, such as a 4 KB data block or a 4 KB piece of metadata. Fig. 7 illustrates how compressed block 1, compressed block 2, and compressed block 3 are represented using one piece of metadata. As shown in fig. 7, after the storage device performs compression and obtains compressed block 1, compressed block 2, and compressed block 3 with consecutive addresses, it records one piece of metadata that includes metadata 1 of compressed block 1 (the first compressed block) and a jump table. Metadata 1 indicates the address of compressed block 1, and the jump table contains the length of compressed block 1, the length of compressed block 2, and the length of compressed block 3. For example, if the length of compressed block 1 is 9 KB, the length of compressed block 2 is 7 KB, and the length of compressed block 3 is 4 KB, the jump table contains 9 KB, 7 KB, and 4 KB.
Recording the metadata in this way reduces the space occupied by the metadata while still allowing the data to be read by means of the metadata. The technical principle behind this effect is explained below.
In one aspect, because the length of each compressed block is recorded in the metadata, the offset of each compressed block relative to the first compressed block can be determined. For example, the offset of the second compressed block relative to the first compressed block is the length of the first compressed block, and the offset of the third compressed block relative to the first compressed block is the sum of the lengths of the first and second compressed blocks. Therefore, when the second compressed block needs to be read, it can be addressed in the hard disk according to the address of the first compressed block and the offset of the second compressed block relative to the first compressed block; when the third compressed block needs to be read, it can be addressed in the hard disk according to the address of the first compressed block and the offset of the third compressed block relative to the first compressed block. For example, in the scenario of fig. 7, when compressed block 2 needs to be read, the storage device offsets 9 KB from the address of compressed block 1 indicated by metadata 1 to find compressed block 2 in the hard disk. When compressed block 3 needs to be read, the storage device offsets 9 KB + 7 KB = 16 KB from the address of compressed block 1 indicated by metadata 1 to find compressed block 3 in the hard disk. With this way of recording metadata, each compressed block can therefore be located in the hard disk and read individually.
On the other hand, since a piece of metadata is recorded for a plurality of compressed blocks with continuous addresses, the number of recorded metadata is reduced, thereby saving storage resources occupied by the metadata in the storage device. For example, in the scenario of fig. 7, since the storage device records the metadata 1 of the compressed block 1, and does not need to record the metadata 2 of the compressed block 2 and the metadata 3 of the compressed block 3, the storage space occupied by the metadata 2 of the compressed block 2 and the metadata 3 of the compressed block 3 is saved.
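The addressing through the jump table can be sketched as follows. The function name and the representation of the jump table as a list of lengths are assumptions for illustration; the embodiment only states that the address of the first compressed block and the length of each compressed block are recorded.

```python
KB = 1024


def locate(first_address: int, jump_table: list, index: int) -> tuple:
    """Return (address, length) of the compressed block at `index` (0-based).

    first_address is the address of the first compressed block (metadata 1);
    jump_table holds the length of each compressed block, in order.
    """
    # Offset = sum of the lengths of all compressed blocks before `index`.
    offset = sum(jump_table[:index])
    return first_address + offset, jump_table[index]
```

For the jump table [9 KB, 7 KB, 4 KB] of fig. 7, `locate(base, table, 1)` returns an address 9 KB past the first compressed block, and `locate(base, table, 2)` returns an address 16 KB past it.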
Mode B, the storage device records a plurality of pieces of metadata for a plurality of compressed blocks whose addresses are not consecutive.
In some embodiments, where the addresses of the plurality of compressed blocks are not consecutive, the storage device represents the plurality of compressed blocks using a plurality of pieces of metadata, respectively. Optionally, as in Mode A, one piece of metadata is recorded for two or more compressed blocks whose addresses are consecutive. The compressed blocks whose addresses are not consecutive are logically separated, for example, by the fingerprint index of a duplicate block.
For example, referring to (a) of fig. 4, compressed block 2 is obtained by compressing 3 non-duplicate blocks, namely data block 5, data block 6, and data block 8. Data block 5 and data block 6 are logically adjacent, whereas data block 6 and data block 8 are logically separated. Specifically, the logical address of data block 5 is consecutive with the logical address of data block 6, and the logical address of data block 6 is not consecutive with the logical address of data block 8. For example, data block 5 has a logical address of 205 and a length of 8 KB, data block 6 has a logical address of 206 and a length of 8 KB, and data block 8 has a logical address of 208 and a length of 8 KB. Data block 7 originally lay logically between data block 6 and data block 8, but data block 7 was found to be a duplicate block and was deduplicated, so storage unit 207 stores fingerprint index FPI7 of data block 7. The storage device compresses data block 5, data block 6, and data block 8 into compressed block 2, and metadata 2 recorded for compressed block 2 includes two pieces of metadata, metadata 21 and metadata 22. Metadata 21 indicates a starting logical address of 205 and a length of 8 KB × 2 = 16 KB. Metadata 22 indicates a starting logical address of 208 and a length of 8 KB.
In this example, the same compressed block (compressed block 2) corresponds to two pieces of metadata (metadata 21 and metadata 22). After decompression of the compressed block 2, the metadata 21 and the metadata 22 correspond to different parts of the decompressed data block. Specifically, the compressed block 2 is decompressed to obtain a data block 5, a data block 6 and a data block 8, the metadata 21 corresponds to the data block 5 and the data block 6, and the metadata 22 corresponds to the data block 8.
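How the pieces of metadata for one compressed block follow from the logical addresses of its blocks can be sketched as follows. The function name and the (start address, length) tuple layout are hypothetical, and logical addresses are assumed to count in units of one block, as in the example above.

```python
KB = 1024


def build_extents(logical_addresses, block_size):
    """Merge logically consecutive blocks into (start_address, length) pieces."""
    extents = []
    for addr in logical_addresses:
        if extents and extents[-1][0] + extents[-1][1] // block_size == addr:
            # This block directly follows the previous piece: extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + block_size)
        else:
            # Logical gap (for example, a deduplicated block): start a new piece.
            extents.append((addr, block_size))
    return extents
```

For data block 5, data block 6, and data block 8, `build_extents([205, 206, 208], 8 * KB)` returns `[(205, 16 * KB), (208, 8 * KB)]`, corresponding to metadata 21 and metadata 22.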
In the pre-deduplication method provided by this embodiment, the storage device performs deduplication and compression by using different granularities, so that the limitation that the deduplication granularity and the compression granularity must be the same is eliminated, and the situations that the deduplication rate is decreased due to too large granularity and the compression rate is decreased due to too small granularity are avoided to a certain extent, so that the overall reduction rate of deduplication compression is improved. Furthermore, as the deduplication granularity and the compression granularity are determined according to the metadata management granularity, the deduplication and the compression are respectively operated according to the better granularity, which is beneficial to achieving better deduplication rate and compression rate.
Referring to fig. 5, fig. 5 is a flowchart of a data processing method 400 according to an embodiment of the present application.
Illustratively, the method 400 includes S410 to S450.
S410, the storage device stores a plurality of data blocks in the hard disk.
In some embodiments, the storage device also computes a fingerprint for each data chunk, and stores the fingerprint for the data chunk and the physical location of the data chunk in the opportunity table. Where the opportunity table is for example in the form of key value pairs, the keys of the opportunity table are fingerprints of the data blocks. The value of the opportunity table is the physical location of the data block. The opportunity table is used to find the fingerprint of the repeated block.
In some embodiments, the storage device compresses the plurality of data blocks to obtain compressed blocks, and then stores the compressed blocks in the hard disk. The compression process is similar to that described above with respect to method 300. Specifically, the storage device predicts the compression rate of the data during the compression process; the storage device determines a second granularity according to the predicted compression rate and the metadata management granularity, and performs compression according to the second granularity. For example, the storage device determines, as the second granularity, a product of a minimum value of the metadata management granularity and a compression rate.
S420, the storage device determines a repeated block and a non-repeated block in the plurality of data blocks.
For example, the storage device reads the fingerprint of each data block from the opportunity table and compares the fingerprint of the data block with the recorded fingerprints in the fingerprint table. If the fingerprint of the data block is the same as one of the recorded fingerprints in the fingerprint table, the storage device determines that the data block is a duplicate block. If the fingerprint of the data block is different from each recorded fingerprint in the fingerprint table, the storage device determines that the data block is a non-repeated block, and the storage device records the fingerprint of the data block into the fingerprint table.
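The comparison of the opportunity table against the fingerprint table can be sketched as follows. The names are hypothetical, and the opportunity table is modeled as a list of (fingerprint, physical location) pairs, which is a simplification of the key-value form described above.

```python
def classify(opportunity_table, fingerprint_table):
    """Return (duplicate_locations, non_duplicate_locations).

    opportunity_table: iterable of (fingerprint, physical_location) pairs.
    fingerprint_table: set of already recorded fingerprints; updated in place.
    """
    duplicates, non_duplicates = [], []
    for fp, location in opportunity_table:
        if fp in fingerprint_table:
            # Same fingerprint already recorded: the block is a duplicate block.
            duplicates.append(location)
        else:
            # New fingerprint: record it and treat the block as non-duplicate.
            fingerprint_table.add(fp)
            non_duplicates.append(location)
    return duplicates, non_duplicates
```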
S430, the storage device updates the recorded metadata for the repeated blocks.
After the storage device finds a duplicate block that can be deduplicated, it updates the metadata of the duplicate block. In some embodiments, the storage device updates the metadata of the duplicate block to the fingerprint of the duplicate block. For example, the storage device determines a second storage unit, where the second storage unit is used to store the metadata of the duplicate block and the granularity of the second storage unit is the minimum value of the metadata management granularity. The storage device writes the fingerprint of the duplicate block to the second storage unit, so that the metadata held by the second storage unit changes from the previously recorded metadata to the fingerprint of the duplicate block.
For example, referring to fig. 6, after the storage device stores data block 1, data block 2, through data block 8 in the hard disk and determines that data block 4 and data block 7 are both duplicate blocks, it updates the metadata of data block 4 and the metadata of data block 7. Specifically, the storage device uses fingerprint index FPI4 of data block 4 as the metadata of data block 4 and overwrites storage unit 204 with fingerprint index FPI4, so that the data stored in storage unit 204 changes from the metadata of data block 4 to fingerprint index FPI4. In addition, the storage device uses fingerprint index FPI7 of data block 7 as the metadata of data block 7 and overwrites storage unit 207 with fingerprint index FPI7, so that the data stored in storage unit 207 changes from the metadata of data block 7 to fingerprint index FPI7. In fig. 6, (a) illustrates the metadata before post-deduplication, (b) illustrates the metadata to be updated, (c) illustrates the metadata after merging, and (d) illustrates the metadata after defragmentation.
S440, the storage device performs garbage collection on the duplicate blocks.
After the metadata of the duplicate blocks is updated, the duplicate blocks can be regarded as garbage data, and the storage device can release the storage space they occupy by deleting them. For example, referring to fig. 6, after the storage device updates the metadata held by storage unit 204 to fingerprint index FPI4 of data block 4 and updates the metadata held by storage unit 207 to fingerprint index FPI7 of data block 7, it deletes data block 4 and data block 7. In some embodiments, the storage device reads both the duplicate blocks and the non-duplicate blocks from the hard disk, erases the storage units originally occupied by the duplicate blocks and the non-duplicate blocks in the hard disk, and rewrites the non-duplicate blocks to the hard disk, thereby performing garbage collection.
The duplicate blocks include compressed blocks and normal data blocks. Thus, during garbage collection, either compressed blocks or normal data blocks may be rewritten. For a normal data block, the storage device only needs to move the valid data. For a compressed block, the storage device needs to decompress the variable-granularity compressed block and then recompress the valid portion of the decompressed data. Because the granularity of the compressed block becomes smaller, the compression rate decreases at this point; the storage device therefore performs defragmentation in S450 below so as to compress at the largest granularity.
S450, the storage device conducts defragmentation on the non-repeated blocks.
By executing S450, the storage apparatus stores the data subjected to the deduplication processing and the compression processing (defragmented non-duplicate blocks) in the hard disk of the storage apparatus.
Specifically, in the defragmentation process, if the data to be defragmented is a duplicate block, the storage device retains the fingerprint index of the duplicate block. The data to be defragmented includes, for example, data that has not been compressed at the second granularity. If the data to be defragmented is a compressed block, the storage device determines whether the compression granularity corresponding to the compressed block is smaller than the preferred compression granularity (for example, the product of the minimum value of the metadata management granularity and the compression rate). If it is smaller, the storage device recompresses the plurality of compressed blocks according to the preferred compression granularity, thereby improving the compression rate. For example, the storage device decompresses the compressed blocks to obtain non-duplicate blocks, and then determines the compression granularity from the minimum value of the metadata management granularity and the compression rate in the same manner as in S350. The storage device then divides the plurality of non-duplicate blocks into at least one data block group and compresses each data block group into a compressed block, thereby completing the recompression.
For example, the compression rate is 3:1, the minimum value of the metadata management granularity is equal to the size of one storage unit, and the preferable compression granularity is equal to the size of three storage units. Referring to fig. 6 (c), in the defragmentation process, the storage device determines that the compression granularity corresponding to the compression block 2 is the size of 2 storage units, determines that the compression granularity corresponding to the compression block 3 is the size of 1 storage unit, and determines that the compression granularities corresponding to the compression block 2 and the compression block 3 are both smaller than the better compression granularity, then the storage device decompresses the compression block 2 to obtain the data block 5 and the data block 6, and decompresses the compression block 3 to obtain the data block 8. The storage device recompresses the data block 5, the data block 6 and the data block 8 to obtain a new compressed block 2, and writes the new compressed block 2 into the hard disk. In this example, the new compression block 2 corresponds to a compression granularity of three storage units, which is better than the compression granularity corresponding to the compression block 2 and the compression granularity corresponding to the compression block 3, thereby helping to improve the compression rate.
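The recompression step of defragmentation can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, zlib stands in for the unspecified compression algorithm, and the granularity of a compressed block is measured by decompressing it, whereas a real storage device would more likely read it from metadata.

```python
import zlib


def defragment(compressed_blocks, ratio, block_size):
    """Recompress compressed blocks whose granularity is below the preferred one.

    Preferred compression granularity = ratio x block_size, where block_size
    stands for the minimum value of the metadata management granularity.
    """
    preferred = ratio * block_size
    kept, pending = [], []
    for cb in compressed_blocks:
        raw = zlib.decompress(cb)
        if len(raw) < preferred:
            pending.append(raw)  # too small: collect for recompression
        else:
            kept.append(cb)      # already at the preferred granularity
    data = b"".join(pending)
    # Regroup the collected data at the preferred granularity and recompress.
    for i in range(0, len(data), preferred):
        kept.append(zlib.compress(data[i:i + preferred]))
    return kept
```

In the fig. 6 (c) example, compressed block 1 (three blocks) is kept, while compressed block 2 (data blocks 5 and 6) and compressed block 3 (data block 8) are merged into one new compressed block covering three blocks.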
In the post-deduplication method provided by this embodiment, the storage device performs deduplication and compression by using different granularities, so that the limitation that the first granularity and the second granularity must be the same is eliminated, and the situations that the deduplication rate is decreased due to too large granularity and the compression rate is decreased due to too small granularity are avoided to a certain extent, thereby improving the overall reduction rate of deduplication compression. Furthermore, because the first granularity and the second granularity are determined according to the metadata management granularity, the deduplication and the compression are respectively operated according to the better granularity, and the deduplication rate and the compression rate are both better.
The method 300 and the method 400 of the present embodiment are described above; the storage device of the present embodiment is described below from the perspective of logical functions.
Referring to fig. 8, fig. 8 shows a schematic diagram of a possible structure of the storage device according to the above embodiment. The storage device 600 shown in fig. 8, for example, implements the functionality of the storage device in the method 300 or the method 400. The storage device 600 includes an obtaining module 601, a deduplication module 602, a compression module 603, and a storage module 604.
The obtaining module 601 is configured to obtain data. The deduplication module 602 is configured to perform deduplication processing on the data based on a first granularity. The compression module 603 is configured to perform compression processing on the data based on a second granularity, where the size of the second granularity is greater than the size of the first granularity. The storage module 604 is configured to store the data subjected to the deduplication processing and the compression processing in a hard disk of the storage device.
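The cooperation of the four modules can be sketched as a single Python function. This is a minimal sketch under stated assumptions rather than the embodiment itself: SHA-256 is assumed as the fingerprint function, zlib as the compression algorithm, a Python set as the fingerprint index, and a Python list as a stand-in for the hard disk. For simplicity the sketch additionally assumes that the second granularity is an integer multiple of the first, which the embodiment does not require.

```python
import hashlib
import zlib


def process(data, first, second, fingerprint_index, disk):
    """Deduplicate `data` at `first` bytes, then compress the unique blocks
    in groups of `second` bytes and append the results to `disk`."""
    if second <= first or second % first:
        raise ValueError("second must be a larger integer multiple of first")

    # Deduplication: split into blocks of the first granularity and fingerprint each.
    unique = []
    for off in range(0, len(data), first):
        block = data[off:off + first]
        fp = hashlib.sha256(block).digest()
        if fp in fingerprint_index:
            continue                  # duplicate block: not stored again
        fingerprint_index.add(fp)
        unique.append(block)

    # Compression: group the non-duplicate blocks up to the second granularity.
    per_group = second // first
    for i in range(0, len(unique), per_group):
        disk.append(zlib.compress(b"".join(unique[i:i + per_group])))  # storage
    return len(unique)
```

A real storage device would also keep a reference for each duplicate block so that the data can be read back; that bookkeeping is omitted here.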
In some embodiments, the storage device further comprises a recording module, configured to record metadata of the compressed block.
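One possible behavior of the recording module, following claims 7 and 8, is to record a single piece of metadata for several compressed blocks whose addresses are consecutive, holding the address of the first compressed block and the length of each compressed block. The Python sketch below illustrates this; the record layout and the function name are assumptions for illustration only.

```python
def record_metadata(blocks):
    """Build metadata records for compressed blocks.

    `blocks` is a list of (address, length) pairs sorted by address.
    Consecutive blocks share one record: first address plus each length.
    """
    records = []
    for address, length in blocks:
        last = records[-1] if records else None
        if last and last["address"] + sum(last["lengths"]) == address:
            last["lengths"].append(length)   # contiguous: extend the current record
        else:
            records.append({"address": address, "lengths": [length]})
    return records
```

Because only the first address is stored, the address of any later compressed block in a record can be recovered by adding the preceding lengths to that address.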
The division of the modules in this embodiment is schematic, and is only a logic function division, and there may be another division manner in actual implementation.
In some embodiments, at least one module in the storage device 600 is integrated in one processor, one chip, or one board. For example, the obtaining module 601, the deduplication module 602, and the compression module 603 are all integrated in the same processor, and the functions of the obtaining module 601, the deduplication module 602, and the compression module 603 are implemented by the processor.
In other embodiments, different modules of the storage device 600 are implemented by different processors or other different hardware. For example, the obtaining module 601 is implemented by the network card 1011 shown in fig. 1, the functions of the deduplication module 602 and the compression module 603 are implemented by different dedicated processors, respectively, and the function of the storage module 604 is implemented by a central processing unit.
Those of ordinary skill in the art will appreciate that the various method steps and modules described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or combinations of both, and that the steps and elements of the various embodiments have been described above generally in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is merely a logical function division, and other divisions are possible in actual implementation. For example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present application.
In addition, each module in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated module can be realized in a form of hardware or a form of software module.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application essentially, or the part thereof contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The terms "first," "second," and the like, in this application, are used for distinguishing between identical or similar items that have substantially the same function, and it is to be understood that "first" and "second" do not have a logical or temporal dependency, nor do they define a quantity or order of execution. It will be further understood that, although the description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first granularity may be referred to as a second granularity, and similarly, a second granularity may be referred to as a first granularity, without departing from the scope of the various examples. Both the first granularity and the second granularity are granularities, and in some cases may be separate and different granularities.
The term "at least one" in this application means one or more, and the term "plurality" in this application means two or more, for example, a plurality of compressed blocks means two or more compressed blocks. The terms "system" and "network" are often used interchangeably herein.
It is also understood that the term "if" may be interpreted to mean "when" ("where" or "upon"), "in response to a determination", or "in response to a detection". Similarly, the phrase "if it is determined..." or "if [a stated condition or event] is detected" may be interpreted to mean "upon determining...", "in response to determining...", "upon detecting [a stated condition or event]", or "in response to detecting [a stated condition or event]", depending on the context.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device.
The computer program instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire or wirelessly. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, or magnetic tapes), optical media (e.g., digital video disks (DVDs)), or semiconductor media (e.g., solid state disks), among others.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (28)

1. A data processing method, performed by a storage device, comprising:
acquiring data;
performing de-duplication of the data based on a first granularity;
performing compression processing on the data based on a second granularity, wherein the size of the second granularity is larger than that of the first granularity; and
storing the data subjected to the data de-duplication processing and the compression processing in a hard disk of the storage device.
2. The method according to claim 1, wherein the storage device has metadata stored therein, the metadata being managed based on a metadata management granularity having a size less than or equal to a set maximum value and greater than or equal to a set minimum value, the size of the first granularity being equal to an integer multiple of the minimum value.
3. The method of claim 2, wherein the size of the second granularity is a product of the minimum value and a compression ratio.
4. The method of claim 1, wherein the de-duplication of the data based on the first granularity comprises:
dividing the data into a plurality of data blocks;
acquiring a fingerprint of each data block;
determining a duplicate block and a non-duplicate block from the plurality of data blocks according to the fingerprint.
5. The method of claim 4, wherein the compressing the data based on the second granularity comprises:
compressing the non-duplicate blocks based on the second granularity to obtain compressed blocks, wherein the data subjected to the data de-duplication processing and the compression processing comprises the compressed blocks.
6. The method of claim 5, further comprising: recording metadata of the compressed block.
7. The method of claim 6, wherein the recording metadata of the compressed block comprises:
if there are a plurality of compressed blocks and the addresses of the plurality of compressed blocks are consecutive, recording a piece of metadata for the plurality of compressed blocks.
8. The method of claim 7, wherein the piece of metadata comprises an address of a first compressed block of the plurality of compressed blocks and a length of each compressed block.
9. The method of claim 1, wherein the data is further compressed based on a third granularity prior to the deduplication and compression processes, the third granularity being smaller in size than the second granularity.
10. The method of claim 1, wherein the storage device is a storage array.
11. The method of claim 1, wherein the storage device is a storage node in a distributed storage system.
12. A storage device comprising at least one processor and a hard disk;
the at least one processor configured to obtain data; de-duplication of the data based on a first granularity; performing compression processing on the data based on a second granularity, wherein the size of the second granularity is larger than that of the first granularity; and storing the data subjected to the data de-duplication processing and the compression processing in the hard disk.
13. The storage device according to claim 12, wherein the storage device stores therein metadata, the metadata being managed based on a metadata management granularity having a size smaller than or equal to a set maximum value and larger than or equal to a set minimum value, the size of the first granularity being equal to an integer multiple of the minimum value.
14. The storage device of claim 13, wherein the size of the second granularity is a product of the minimum value and a compression ratio.
15. The storage device of claim 12, wherein the at least one processor is configured to: divide the data into a plurality of data blocks; acquire a fingerprint of each data block; and determine a duplicate block and a non-duplicate block from the plurality of data blocks according to the fingerprint.
16. The storage device according to claim 15, wherein the at least one processor is configured to perform a compression process on the non-duplicate blocks based on the second granularity to obtain compressed blocks, and the deduplicated and compressed data comprises the compressed blocks.
17. The storage device of claim 16, wherein the at least one processor is further configured to record metadata of the compressed block.
18. The storage device according to claim 17, wherein the at least one processor is configured to record a piece of metadata for the plurality of compressed blocks if the number of the compressed blocks is plural and addresses of the plurality of compressed blocks are consecutive.
19. The storage device of claim 18, wherein the piece of metadata comprises an address of a first compressed block of the plurality of compressed blocks and a length of each compressed block.
20. The storage device of claim 12, wherein the data is further compressed based on a third granularity prior to the deduplication and compression processes, the third granularity being smaller in size than the second granularity.
21. The storage device of claim 12, wherein the storage device is a storage array.
22. The storage device of claim 12, wherein the storage device is a storage node in a distributed storage system.
23. A computer-readable storage medium having stored therein at least one instruction that is read by a processor to cause a storage device to perform the method of any one of claims 1-11.
24. A storage device, the storage device comprising:
an obtaining module, configured to obtain data;
a deduplication module, configured to deduplicate the data based on a first granularity;
a compression module, configured to compress the data based on a second granularity, wherein the size of the second granularity is larger than that of the first granularity; and
a storage module, configured to store the data subjected to the data de-duplication processing and the compression processing in a hard disk of the storage device.
25. The storage device of claim 24, wherein the deduplication module is configured to: divide the data into a plurality of data blocks; acquire a fingerprint of each data block; and determine a duplicate block and a non-duplicate block from the plurality of data blocks according to the fingerprint.
26. The storage device according to claim 25, wherein the compression module is configured to perform compression processing on the non-duplicate blocks based on the second granularity to obtain compressed blocks, and the data subjected to the de-duplication processing and the compression processing includes the compressed blocks.
27. The storage device of claim 26, wherein the storage device further comprises a recording module, configured to record metadata of the compressed block.
28. The storage device according to claim 27, wherein the recording module is configured to record a piece of metadata for a plurality of the compressed blocks if the number of the compressed blocks is multiple and addresses of the plurality of the compressed blocks are consecutive.
CN202010784929.3A 2020-06-11 2020-08-06 Data processing method and storage device Pending CN113806341A (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP20939891.6A EP4030310A4 (en) 2020-06-11 2020-12-14 Data processing method and storage device
PCT/CN2020/136106 WO2021248863A1 (en) 2020-06-11 2020-12-14 Data processing method and storage device
US17/741,079 US20220269431A1 (en) 2020-06-11 2022-05-10 Data processing method and storage device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010526840 2020-06-11
CN2020105268407 2020-06-11

Publications (1)

Publication Number Publication Date
CN113806341A true CN113806341A (en) 2021-12-17

Family

ID=78943462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010784929.3A Pending CN113806341A (en) 2020-06-11 2020-08-06 Data processing method and storage device

Country Status (1)

Country Link
CN (1) CN113806341A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114491145A (en) * 2022-01-27 2022-05-13 北京中电兴发科技有限公司 Metadata design method based on stream storage
CN114491145B (en) * 2022-01-27 2022-10-21 北京中电兴发科技有限公司 Metadata design method based on stream storage
CN116331044A (en) * 2023-05-31 2023-06-27 山东芯演欣电子科技发展有限公司 Charging data storage system for direct-current charging pile
CN116331044B (en) * 2023-05-31 2023-08-04 山东芯演欣电子科技发展有限公司 Charging data storage system for direct-current charging pile
CN116975032A (en) * 2023-07-14 2023-10-31 南京领行科技股份有限公司 Data alignment method, system, electronic device and storage medium
CN116975032B (en) * 2023-07-14 2024-04-12 南京领行科技股份有限公司 Data alignment method, system, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US10466932B2 (en) Cache data placement for compression in data storage systems
US9965394B2 (en) Selective compression in data storage systems
KR102007070B1 (en) Reference block aggregating into a reference set for deduplication in memory management
CN113806341A (en) Data processing method and storage device
US11531482B2 (en) Data deduplication method and apparatus
CN106662981B (en) Storage device, program, and information processing method
CN103098035B (en) Storage system
EP3059677B1 (en) Multi-level deduplication
US11068405B2 (en) Compression of host I/O data in a storage processor of a data storage system with selection of data compression components based on a current fullness level of a persistent cache
JP3597945B2 (en) Method of holding embedded directory information for data compression in direct access storage device and system including directory record
US11232073B2 (en) Method and apparatus for file compaction in key-value store system
US10936228B2 (en) Providing data deduplication in a data storage system with parallelized computation of crypto-digests for blocks of host I/O data
US11455122B2 (en) Storage system and data compression method for storage system
US11321229B2 (en) System controller and system garbage collection method
US11461239B2 (en) Method and apparatus for buffering data blocks, computer device, and computer-readable storage medium
CN105493080B (en) The method and apparatus of data de-duplication based on context-aware
CN111611250A (en) Data storage device, data query method, data query device, server and storage medium
US20090198883A1 (en) Data copy management for faster reads
CN107506466B (en) Small file storage method and system
CN117312256B (en) File system, operating system and electronic equipment
US20220269431A1 (en) Data processing method and storage device
US20220164146A1 (en) Storage system and control method for storage system
US11249666B2 (en) Storage control apparatus
CN117312260B (en) File compression storage method and device, storage medium and electronic equipment
CN117312261B (en) File compression encoding method, device storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination