WO2020135384A1 - Method and apparatus for data compression - Google Patents

Method and apparatus for data compression

Info

Publication number
WO2020135384A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
hot
compression
preset threshold
determined
Prior art date
Application number
PCT/CN2019/127736
Other languages
English (en)
French (fr)
Inventor
牛进保
全绍晖
谈晓东
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP19902311.0A (EP3883133A4)
Publication of WO2020135384A1
Priority to US17/358,240 (US20210318836A1)

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6064 Selection of Compressor
    • H03M7/607 Selection between different types of compressors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6064 Selection of Compressor
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6064 Selection of Compressor
    • H03M7/6082 Selection strategies
    • H03M7/6094 Selection strategies according to reasons other than compression rate or data type
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/40 Specific encoding of data in memory or cache
    • G06F2212/401 Compressed data

Definitions

  • The present application relates to the field of storage technology, and in particular to a method and device for data compression.
  • A compression algorithm is used when storing data, and the same compression algorithm is applied to all data to be written, which may result in poor compression performance or a low compression rate.
  • The embodiments of the present application provide a data compression method and device.
  • The technical solution is as follows:
  • In a first aspect, a method for data compression is provided. The method includes:
  • Receiving first data, determining whether the first data is hot-write data, and compressing the first data when the first data is not hot-write data.
  • The storage device serves as the storage device of a system. When data is written to the storage device, the storage device receives the written data (hereinafter referred to as the first data) and judges whether the first data is being written for the first time. If it is not, the storage device determines whether the first data is hot-write data. If the first data is not hot-write data, the storage device may obtain a compression algorithm and compress the first data.
  • Compressing the first data includes: when the first data is cold-read data, compressing the first data with a first compression algorithm; when the first data is hot-read data, compressing the first data with a second compression algorithm; wherein the compression rate of the first compression algorithm is greater than that of the second compression algorithm.
  • After the storage device determines that the first data is not hot-write data, it may determine whether the first data is cold-read data.
  • If the data is cold-read data, a compression algorithm with a high compression rate (the first compression algorithm) may be used to compress it. If the data is hot-read data and not hot-write data, the data may be read frequently but is rarely modified, so a compression algorithm with high decompression performance (the second compression algorithm) may be used.
  • The method further includes: storing the compressed first data in a first storage area.
  • The first storage area is used to store non-hot-write data.
  • The storage device may control the compressed first data to be stored in the first storage area.
  • The non-hot-write data is stored in the first storage area, which is isolated from the hot-write data.
  • The method further includes: receiving second data, determining whether the second data is hot-write data, and storing the second data in a second storage area when the second data is hot-write data.
  • The storage device serves as the storage device of a system. When data is written to the storage device, the storage device receives the written data (hereinafter referred to as the second data).
  • The storage device determines whether the second data is hot-write data. If it is, the second data may be stored directly in the second storage area without being compressed. In this way, hot-write data and non-hot-write data are stored in isolation from each other.
  • Judging whether the first data is hot-write data includes: if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, determining that the first data is not hot-write data; if that difference is less than or equal to the first preset threshold, determining that the first data is hot-write data.
  • When the storage device receives the first data, it can determine the current time point and obtain the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the storage address of the data to be updated by the first data was written. It can then determine the difference between the two time points and compare it with the first preset threshold. If the difference is greater than the first preset threshold, the first data is not hot-write data; if the difference is less than or equal to the first preset threshold, the first data is hot-write data.
  • When the difference is greater than the first preset threshold, the first data is determined not to be hot-write data because the interval between its updates is relatively long and it will not be rewritten frequently.
  • When the difference is less than or equal to the first preset threshold, the first data is determined to be hot-write data because the interval between its updates is relatively short and it will often be rewritten. Therefore, it can be accurately determined whether the data is hot-write data.
  • The method further includes: if the number of times the data at the storage address corresponding to the first data was read within a preset duration before the current time point is less than or equal to a second preset threshold, determining that the first data is cold-read data; if that number of reads is greater than the second preset threshold, determining that the first data is hot-read data.
  • The preset duration and the second preset threshold can each be preset and stored in the storage device.
  • The storage device records the number of times the data at each storage address is read; each read increases the count by one.
  • The storage device may obtain the number of times the data at the storage address corresponding to the first data was read within the preset duration before the current time point and compare it with the second preset threshold. If the number of reads is greater than the second preset threshold, the first data is determined to be hot-read data; if it is less than or equal to the second preset threshold, the first data is determined to be cold-read data.
  • A number of reads greater than the second preset threshold means that the first data is read frequently, so the first data is hot-read data; otherwise it is cold-read data. Therefore, it can be accurately determined whether the data is hot-read data.
  • In a second aspect, a storage device is provided. The storage device includes a processor and an interface.
  • The processor executes instructions to implement the data compression method provided in the first aspect.
  • An apparatus for performing data compression includes one or more modules.
  • The one or more modules execute instructions to implement the data compression method provided in the first aspect.
  • A computer-readable storage medium stores instructions that, when run on a storage device, cause the storage device to perform the data compression method provided in the first aspect.
  • A computer program product contains instructions that, when executed on a storage device, cause the storage device to perform the data compression method provided in the first aspect.
  • The storage device determines whether the first data is hot-write data and, if it is not, compresses the first data. Because a different compression algorithm can be selected for each judgment result, the compression performance or the compression rate can be improved.
  • FIG. 1 is a schematic diagram of a compression ratio and compression performance provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a storage device provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data compression method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a container provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of storing data provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a data compression apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an apparatus for performing data compression according to an embodiment of the present application.
  • The embodiments of the present application may be applied to storage devices in the field of all-flash storage, such as servers, server clusters, or storage arrays, and may also be applied to storage devices in the field of non-all-flash storage.
  • Compression ratio: the ratio of the data size after compression to the data size before compression.
  • The embodiments of the present application provide a method for performing data compression, and the execution subject of the method may be a storage device.
  • FIG. 2 shows a structural block diagram of a storage device in an embodiment of the present application.
  • The storage device may include at least an interface 201 and a processor 202.
  • The interface 201 can be used to receive data.
  • The interface 201 can be a hardware interface, such as a network interface card (NIC) or a host bus adapter (HBA), or a program interface module.
  • The processor 202 may be a combination of a central processing unit (CPU) and memory, or may be a field-programmable gate array (FPGA) or other hardware.
  • The processor 202 is the control center of the storage device and uses various interfaces and lines to connect the parts of the entire storage device.
  • Step 301: Receive first data.
  • The storage device serves as the storage device of the system.
  • The storage device may receive the written data (hereinafter referred to as the first data).
  • Step 302: Determine whether the first data is hot-write data.
  • The storage device may determine whether the first data is being written for the first time and, if it is not, determine whether the first data is hot-write data.
  • If the first data is written for the first time, the storage address of the first data is determined, and the first data is written to the storage area corresponding to that address.
  • Alternatively, if the first data is written for the first time, the storage address of the first data is determined, and the first data is compressed using a preset compression algorithm and stored in the storage area corresponding to that address.
  • If the first data is not written for the first time, the first data updates data that has already been written to the storage device.
  • Method 1: If the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than the first preset threshold, it is determined that the first data is not hot-write data; if that difference is less than or equal to the first preset threshold, it is determined that the first data is hot-write data.
  • The first preset threshold can be preset and stored in the storage device, for example 3 hours.
  • When the storage device receives the first data, it can determine the current time point and obtain the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the storage address of the data to be updated by the first data was written. It can then determine the difference between the two time points and compare it with the first preset threshold. If the difference is greater than the first preset threshold, the first data is not hot-write data; if the difference is less than or equal to the first preset threshold, the first data is hot-write data.
  • For example, if the first preset threshold is 15 minutes, the storage address of the first data was last written at 10:15, and the current time point is 10:50, then the difference between the two time points is 35 minutes, which is greater than the first preset threshold, so it can be determined that the first data is not hot-write data.
  • When the difference is greater than the first preset threshold, the first data is determined not to be hot-write data because the interval between its updates is relatively long and it will not be rewritten frequently.
  • When the difference is less than or equal to the first preset threshold, the first data is determined to be hot-write data because the interval between its updates is relatively short and it will often be rewritten.
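  • The time-difference test of Method 1 can be sketched as follows. This is a minimal sketch, not the patented implementation: the function name `is_hot_write` and the calendar date are hypothetical, and the 15-minute threshold and the 10:15 / 10:50 time points are taken from the worked example above.

```python
from datetime import datetime, timedelta

# Threshold from the worked example in the text (assumed configurable).
FIRST_PRESET_THRESHOLD = timedelta(minutes=15)


def is_hot_write(last_write: datetime, now: datetime,
                 threshold: timedelta = FIRST_PRESET_THRESHOLD) -> bool:
    """Return True when the address was last written within `threshold` of `now`.

    A difference less than or equal to the threshold means hot-write data;
    a larger difference means the data is not hot-write data.
    """
    return (now - last_write) <= threshold


# The date is arbitrary; only the times of day come from the text.
last_write = datetime(2019, 12, 24, 10, 15)
now = datetime(2019, 12, 24, 10, 50)
gap = now - last_write                      # 35 minutes
hot = is_hot_write(last_write, now)         # 35 min > 15 min, so not hot-write
```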
  • Method 2: Determine the container to which the first data is to be written and the data blocks already written in the container. Subtract the time point at which the storage address corresponding to each data block in the container was last written from the current time point to obtain the write-time difference of each data block. Determine the weight corresponding to the difference range to which each difference belongs, and multiply each difference by its weight to obtain the first product of each data block. Add the first products of all data blocks to obtain a first weighted value. If the first weighted value is greater than the third preset threshold, it is determined that the first data is not hot-write data; if the first weighted value is less than or equal to the third preset threshold, it is determined that the first data is hot-write data.
  • A container is a logical storage unit in a storage device and can store multiple data blocks.
  • The third preset threshold can be preset and stored in the storage device.
  • When the storage device receives the first data, it can first determine the container to be written (for example, a container can be randomly selected from the writable containers) and the data blocks already written in it, and then determine when the storage address corresponding to each of those data blocks was last written. Subtracting that time point from the current time point gives the write-time difference of each data block. The storage device then obtains a pre-stored correspondence between difference ranges and weights and, from it, determines the weight corresponding to the difference range to which each data block's difference belongs.
  • For any data block, its difference is multiplied by the corresponding weight to obtain the first product of that data block. The first products of all data blocks in the container are added to obtain the first weighted value, which is compared with the third preset threshold. If the first weighted value is greater than the third preset threshold, it can be determined that the first data is not hot-write data. If the first weighted value is less than or equal to the third preset threshold, it can be determined that the first data is hot-write data.
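  • The container-level weighting of Method 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the text does not give the correspondence between difference ranges and weights, so the range bounds, the weights, and the thresholds below are hypothetical values chosen only for illustration.

```python
from bisect import bisect_right

# Hypothetical correspondence between difference ranges (in seconds) and
# weights: [0, 600) -> 0.5, [600, 3600) -> 1.0, [3600, inf) -> 2.0.
RANGE_UPPER_BOUNDS = [600, 3600]
RANGE_WEIGHTS = [0.5, 1.0, 2.0]


def weight_for(diff_seconds: float) -> float:
    """Look up the weight of the range that contains `diff_seconds`."""
    return RANGE_WEIGHTS[bisect_right(RANGE_UPPER_BOUNDS, diff_seconds)]


def first_weighted_value(diffs_seconds) -> float:
    """Sum of (difference * weight) over every data block in the container."""
    return sum(d * weight_for(d) for d in diffs_seconds)


def is_hot_write_container(diffs_seconds, third_threshold: float) -> bool:
    """Hot-write when the weighted value is at or below the threshold."""
    return first_weighted_value(diffs_seconds) <= third_threshold


# Write-time differences of three blocks already in the container (seconds).
diffs = [300, 1200, 7200]
value = first_weighted_value(diffs)  # 300*0.5 + 1200*1.0 + 7200*2.0
```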
  • Method 3: If the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than the first preset threshold, it is determined that the first data is not hot-write data; if that difference is less than or equal to the fifth preset threshold, it is determined that the first data is hot-write data.
  • The fifth preset threshold is smaller than the first preset threshold and may also be preset and stored in the storage device.
  • When the storage device receives the first data, it can determine the current time point and obtain the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the storage address of the data to be updated by the first data was written. It can then determine the difference between the two time points. If the difference is greater than the first preset threshold, the first data is not hot-write data. The difference can also be compared with the fifth preset threshold: if it is less than or equal to the fifth preset threshold, the first data is hot-write data.
  • If the difference is greater than the fifth preset threshold and less than or equal to the first preset threshold, the first data is warm-write data.
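  • The two-threshold classification of Method 3 can be sketched as follows. This is a minimal sketch: the enum `WriteHeat`, the function `classify_write`, and the 5-minute and 15-minute values are hypothetical, and the sketch assumes that any difference falling between the fifth and first preset thresholds is treated as warm-write data.

```python
from datetime import timedelta
from enum import Enum


class WriteHeat(Enum):
    HOT = "hot"
    WARM = "warm"
    NOT_HOT = "not hot"


def classify_write(diff: timedelta, fifth_threshold: timedelta,
                   first_threshold: timedelta) -> WriteHeat:
    """Classify a write by the time since the address was last written."""
    if fifth_threshold >= first_threshold:
        raise ValueError("fifth threshold must be smaller than first threshold")
    if diff > first_threshold:
        return WriteHeat.NOT_HOT
    if diff <= fifth_threshold:
        return WriteHeat.HOT
    return WriteHeat.WARM  # between the two thresholds (assumed)


# Hypothetical thresholds for illustration only.
FIFTH = timedelta(minutes=5)
FIRST = timedelta(minutes=15)
```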
  • Step 303: When the first data is not hot-write data, compress the first data.
  • The storage device may acquire a compression algorithm and compress the first data.
  • When the first data is hot-write data, it will be modified frequently, so compressing it has little benefit and no compression is performed.
  • The first data may also be compressed based on its data type, and the corresponding processing may be as follows:
  • When the first data is not hot-write data and is non-image data or non-video data, the first data is compressed.
  • The storage device may determine whether the first data is non-image data and, if it is, compress the first data.
  • If the first data is image data, it is not compressed. This is because image data is already compressed, and existing lossless compression algorithms cannot compress it further.
  • The storage device may determine whether the first data is non-video data and, if it is, compress the first data.
  • If the first data is video data, it is not compressed. This is because video data is already compressed, and existing lossless compression algorithms cannot compress it further.
  • If the first data is not hot-write data and is a log file, the fixed format can be stored as a template, because each log file has a fixed format. Other data is compressed with this template as a reference, and the compressed data stores only the differences from the template.
  • The first data may also be compressed based on its readout information, and the corresponding processing may be as follows:
  • When the first data is cold-read data, the first compression algorithm is used to compress the first data.
  • When the first data is hot-read data, the second compression algorithm is used to compress the first data.
  • The compression rate of the first compression algorithm is greater than that of the second compression algorithm, but the compression performance of the first compression algorithm is lower than that of the second compression algorithm, and the decompression performance of the second compression algorithm is higher than that of the first compression algorithm.
  • The first compression algorithm may be a high-compression lossless algorithm such as Zstandard (ZSTD) or GNU zip (GZIP).
  • The second compression algorithm may be a high-compression-rate algorithm that uses the same compression format as LZ4, such as LZ4HC (LZ4 High Compression).
  • After the storage device determines that the first data is not hot-write data, it may determine whether the first data is cold-read data.
  • If the data is cold-read data, a compression algorithm with a high compression rate (the first compression algorithm) may be used to compress it. If the data is hot-read data and not hot-write data, the data may be read frequently but is rarely modified, so a compression algorithm with high decompression performance (the second compression algorithm) may be used.
  • When acquiring the compression algorithm, the storage device may obtain it from a pre-stored correspondence between the readout type and the compression algorithm, where the readout type includes hot-read data and cold-read data, as shown in Table 1.
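  • The lookup of a compression algorithm by readout type can be sketched as follows. This is a minimal sketch under stated assumptions: the dictionary is a hypothetical stand-in for the correspondence and is not the content of Table 1, and Python's standard-library `lzma` and `zlib` (level 1) modules stand in for the higher-ratio first algorithm and the faster second algorithm only to keep the example free of third-party dependencies. They are not ZSTD, GZIP, or LZ4HC.

```python
import lzma
import zlib

# Hypothetical correspondence between readout type and (compress, decompress).
# lzma stands in for the first (high-ratio) algorithm; zlib at level 1 stands
# in for the second (fast) algorithm. Both are assumptions for illustration.
COMPRESSORS = {
    "cold_read": (lzma.compress, lzma.decompress),
    "hot_read": (lambda data: zlib.compress(data, 1), zlib.decompress),
}


def compress_by_read_type(data: bytes, read_type: str) -> bytes:
    """Compress `data` with the algorithm mapped to `read_type`."""
    compress, _ = COMPRESSORS[read_type]
    return compress(data)


def decompress_by_read_type(blob: bytes, read_type: str) -> bytes:
    """Reverse `compress_by_read_type` for the same `read_type`."""
    _, decompress = COMPRESSORS[read_type]
    return decompress(blob)


payload = b"storage block " * 200
cold_blob = compress_by_read_type(payload, "cold_read")
hot_blob = compress_by_read_type(payload, "hot_read")
```

  • In a real device the readout type would have to be recorded alongside the stored block so that the matching decompressor can be chosen on read; the sketch leaves that bookkeeping to the caller.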
  • Processing for determining whether the first data is cold-read data is also provided in this application, and the corresponding three methods may be as follows:
  • Method 1: If the number of times the data at the storage address corresponding to the first data was read within a preset duration before the current time point is less than or equal to the second preset threshold, the first data is determined to be cold-read data; if that number of reads is greater than the second preset threshold, the first data is determined to be hot-read data.
  • The preset duration and the second preset threshold can each be preset and stored in the storage device.
  • The storage device records the number of times the data at each storage address is read; each read increases the count by one.
  • The storage device may obtain the number of times the data at the storage address corresponding to the first data was read within the preset duration before the current time point and compare it with the second preset threshold. If the number of reads is greater than the second preset threshold, the first data is determined to be hot-read data; if it is less than or equal to the second preset threshold, the first data is determined to be cold-read data.
  • For example, if the second preset threshold is 20, the preset duration is 2 hours, the current time is 10:50, and the data at the storage address corresponding to the first data was read 30 times within the preset duration before the current time, then the number of reads is greater than the second preset threshold, so it can be determined that the first data is hot-read data.
  • A number of reads greater than the second preset threshold means that the first data is read frequently, so the first data is hot-read data; otherwise it is cold-read data.
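  • The read-count test of Method 1 can be sketched as follows. This is a minimal sketch: the text describes a per-address read counter, whereas this example assumes a hypothetical log of read timestamps so that the count within the preset duration can be derived. The threshold of 20, the 2-hour duration, the 10:50 time point, and the 30 reads come from the worked example above; the date and the individual timestamps are invented.

```python
from datetime import datetime, timedelta


def reads_in_window(read_times, now: datetime, window: timedelta) -> int:
    """Count the reads that fall within `window` before `now`."""
    start = now - window
    return sum(1 for t in read_times if start <= t <= now)


def is_hot_read(read_times, now: datetime, window: timedelta,
                second_threshold: int) -> bool:
    """Hot-read when the windowed read count exceeds the threshold."""
    return reads_in_window(read_times, now, window) > second_threshold


now = datetime(2019, 12, 24, 10, 50)           # date is arbitrary
# 30 reads spread over the last 88 minutes, plus 5 reads outside the window.
recent = [now - timedelta(minutes=3 * i + 1) for i in range(30)]
stale = [datetime(2019, 12, 24, 7, 0)] * 5
read_log = recent + stale

count = reads_in_window(read_log, now, timedelta(hours=2))
hot = is_hot_read(read_log, now, timedelta(hours=2), second_threshold=20)
```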
  • Method 2: Determine the container to which the first data is to be written and the data blocks already written in the container. Determine the number of reads of the storage address corresponding to each data block within a preset duration before the current time point. Determine the weight corresponding to the read-count range to which each number of reads belongs, and multiply each number of reads by its weight to obtain the second product of each data block. Add the second products of all data blocks to obtain a second weighted value. If the second weighted value is greater than the fourth preset threshold, the first data is determined to be hot-read data; if the second weighted value is less than or equal to the fourth preset threshold, the first data is determined to be cold-read data.
  • The container is a logical storage unit in the storage device and can store multiple data blocks.
  • The preset duration may be the same as the preset duration mentioned above, and the fourth preset threshold may be preset and stored in the storage device.
  • When the storage device receives the first data, it can first determine the container to be written and the data blocks already written in it, and then determine the number of reads of the storage address corresponding to each data block within the preset duration before the current time point. It then obtains the pre-stored correspondence between read-count ranges and weights and, from it, determines the weight corresponding to the read-count range to which each data block's number of reads belongs. For any data block, the weight is multiplied by the number of reads to obtain the second product of that data block. The second products of all data blocks in the container are added to obtain the second weighted value.
  • If the second weighted value is greater than the fourth preset threshold, the first data is determined to be hot-read data; if the second weighted value is less than or equal to the fourth preset threshold, the first data is determined to be cold-read data.
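  • The weighted read count of Method 2 can be sketched as follows. This is a minimal sketch under stated assumptions: the correspondence between read-count ranges and weights is not given in the text, so the range bounds, the weights, and the thresholds below are hypothetical.

```python
from bisect import bisect_right

# Hypothetical correspondence between read-count ranges and weights:
# [0, 10) -> 0.5, [10, 100) -> 1.0, [100, inf) -> 2.0.
COUNT_UPPER_BOUNDS = [10, 100]
COUNT_WEIGHTS = [0.5, 1.0, 2.0]


def weight_for(read_count: int) -> float:
    """Look up the weight of the range that contains `read_count`."""
    return COUNT_WEIGHTS[bisect_right(COUNT_UPPER_BOUNDS, read_count)]


def second_weighted_value(read_counts) -> float:
    """Sum of (read count * weight) over every data block in the container."""
    return sum(c * weight_for(c) for c in read_counts)


def is_hot_read_container(read_counts, fourth_threshold: float) -> bool:
    """Hot-read when the weighted value exceeds the threshold."""
    return second_weighted_value(read_counts) > fourth_threshold


# Reads per block within the preset duration, for three blocks.
counts = [4, 20, 150]
value = second_weighted_value(counts)  # 4*0.5 + 20*1.0 + 150*2.0
```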
  • Method 3: If the number of times the data at the storage address corresponding to the first data was read within a preset duration before the current time point is less than or equal to the second preset threshold, the first data is determined to be cold-read data; if that number of reads is greater than the sixth preset threshold, the first data is determined to be hot-read data.
  • The sixth preset threshold is greater than the second preset threshold and may also be preset and stored in the storage device.
  • The storage device may obtain the number of times the data at the storage address corresponding to the first data was read within the preset duration before the current time point and compare it with the second preset threshold. If the number of reads is less than or equal to the second preset threshold, the first data is determined to be cold-read data. The number of reads can also be compared with the sixth preset threshold: if it is greater than the sixth preset threshold, the first data is determined to be hot-read data.
  • If the number of reads is greater than the second preset threshold and less than or equal to the sixth preset threshold, the first data is determined to be warm-read data.
  • Optionally, the embodiment of the present application further provides a process of storing the compressed first data in a storage area, and the corresponding processing may be as follows:
  • Store the compressed first data in the first storage area. The first storage area is used to store non-hot-write data.
  • In implementation, the storage device may store the compressed first data in the first storage area.
  • Optionally, the embodiment of the present application further provides a storage process for hot write data, and the corresponding processing may be as follows:
  • Receive second data. Determine whether the second data is hot write data. When the second data is hot write data, store the second data in the second storage area.
  • In implementation, the storage device serves as the storage of a system. When data in the system is written to the storage device, the storage device receives the written data (referred to below as the second data).
  • The storage device then determines whether the second data is hot write data. If it is, the second data may be stored directly in the second storage area without being compressed.
  • It should be noted that the second storage area is different from the first storage area mentioned above: the first storage area stores compressed data, and the second storage area stores uncompressed data.
  • For example, as shown in FIG. 5, the first storage area is a non-hot-write container and the second storage area is a hot-write container, which store non-hot-write data and hot write data, respectively.
  • In this way, hot write data and non-hot-write data are stored separately. Hot write data is rewritten often and can be garbage collected (Garbage Collection, GC), while non-hot-write data is rarely rewritten and generally does not need garbage collection, so the efficiency of garbage collection improves.
  • In addition, for the warm write data and warm read data mentioned above, if a piece of data is both warm write data and warm read data, the compression algorithm may be the dictionary-based LZ4 algorithm (a Lempel-Ziv-family algorithm).
  • With this application, the end-to-end data reduction rate can be raised to more than 20%, while the impact on the overall performance of the storage device is less than 5%.
  • In the embodiments of the present application, after receiving the first data to be stored, the storage device determines whether the first data is hot write data, and compresses the first data if it is not hot write data. In this way, whether data is hot write data is determined and a different compression algorithm is selected for each result, which can improve compression performance or increase the compression ratio.
  • FIG. 6 is a structural diagram of an apparatus for data compression provided by an embodiment of the present application.
  • The apparatus may be implemented as part or all of a device through software, hardware, or a combination of both.
  • The apparatus provided in an embodiment of the present application may implement the process described in FIG. 3 of the embodiment of the present application.
  • The apparatus includes a receiving module 610, an identification module 620, and a compression module 630, where:
  • the receiving module 610 is configured to receive the first data, and may specifically be used to perform step 301 and the implicit steps contained therein;
  • the identification module 620 is configured to determine whether the first data is hot write data, and may specifically be used to perform step 302 and the implicit steps contained therein;
  • the compression module 630 is configured to compress the first data when the first data is not hot write data, and may specifically be used to perform step 303 and the implicit steps contained therein.
  • Optionally, the compression module 630 is configured to:
  • when the first data is cold read data, compress the first data with the first compression algorithm;
  • when the first data is hot read data, compress the first data with the second compression algorithm, where the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
  • Optionally, as shown in FIG. 7, the apparatus further includes:
  • a storage module 640, configured to store the compressed first data in the first storage area.
  • Optionally, the receiving module 610 is further configured to receive second data;
  • the identification module is further configured to determine whether the second data is hot write data;
  • the apparatus further includes:
  • a storage module 640, configured to store the second data in the second storage area when the second data is hot write data.
  • Optionally, the identification module is configured to:
  • determine that the first data is not hot write data if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than the first preset threshold, and determine that the first data is hot write data if that difference is less than or equal to the first preset threshold.
  • Optionally, the identification module is further configured to:
  • determine that the first data is cold read data if the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is less than or equal to the second preset threshold;
  • determine that the first data is hot read data if that number of reads is greater than the second preset threshold.
  • In the embodiments of the present application, after receiving the first data to be stored, the storage device determines whether the first data is hot write data, and compresses the first data if it is not hot write data. In this way, whether data is hot write data is determined and a different compression algorithm is selected for each result, which can improve compression performance or increase the compression ratio.
  • It should be noted that the apparatus for data compression provided in the above embodiments is described using the above division into functional modules only as an example. In practice, the functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above.
  • The apparatus for data compression and the method for data compression provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (such as infrared, radio, or microwave).
  • The computer-readable storage medium may be any usable medium that a server or a terminal can access, or a data storage device, such as a server or a data center, that integrates one or more usable media.
  • The usable medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a Digital Video Disk (DVD)), or a semiconductor medium (such as a solid-state drive).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

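This application provides a method and an apparatus for data compression, and relates to the field of storage technologies. The method includes: after receiving first data to be stored, a storage device may determine whether the first data is hot write data, and compresses the first data if it is not hot write data. This application can improve compression performance.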

Description

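Method and Apparatus for Data Compression
Technical Field
This application relates to the field of storage technologies, and in particular to a method and an apparatus for data compression.
Background
In the field of storage technologies, data is compressed before being stored to save storage space. Many general-purpose compression algorithms exist, and they differ in compression ratio and compression performance. In general, the higher the compression ratio, the worse the compression performance; conversely, the lower the compression ratio, the better the compression performance.
In the related art, a single compression algorithm is applied to all data to be written, which may result in poor compression performance or a low compression ratio.
Summary
To solve the problems in the related art, embodiments of this application provide a method and an apparatus for data compression. The technical solutions are as follows: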
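According to a first aspect, a method for data compression is provided. The method includes:
Receiving first data; determining whether the first data is hot write data; and compressing the first data when the first data is not hot write data.
In the solution shown in this embodiment, the storage device serves as the storage of a system. When data in the system is written to the storage device, the storage device receives the written data (referred to below as the first data) and determines whether the first data is being written for the first time. If it is not, the storage device determines whether the first data is hot write data. If the first data is not hot write data, the storage device obtains a compression algorithm and compresses the first data.
In a possible implementation, compressing the first data includes: when the first data is cold read data, compressing the first data with a first compression algorithm; and when the first data is hot read data, compressing the first data with a second compression algorithm, where the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
In the solution shown in this embodiment, after determining that the first data is not hot write data, the storage device determines whether the first data is cold read data. If it is, the storage device obtains the pre-stored first compression algorithm and uses it to compress the first data. If it is not, the storage device obtains the pre-stored second compression algorithm and uses it to compress the first data. In other words, data that is not hot write data and is cold read data is compressed with the first compression algorithm, and data that is not hot write data and is hot read data is compressed with the second compression algorithm. Data that is cold read and not hot write is neither accessed nor modified frequently, so an algorithm with a higher compression ratio (the first compression algorithm) can be used. Data that is hot read and not hot write may be read frequently but is rarely modified, so an algorithm with high decompression performance (the second compression algorithm) can be used.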
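In a possible implementation, the method further includes: storing the compressed first data in a first storage area.
The first storage area is used to store non-hot-write data.
In the solution shown in this embodiment, the storage device stores the compressed first data in the first storage area. In this way, non-hot-write data is stored in the first storage area and is isolated from hot write data.
In a possible implementation, the method further includes: receiving second data; determining whether the second data is hot write data; and storing the second data in a second storage area when the second data is hot write data.
In the solution shown in this embodiment, the storage device serves as the storage of a system. When data in the system is written to the storage device, the storage device receives the written data (referred to below as the second data).
The storage device then determines whether the second data is hot write data. If it is, the second data is not compressed and is stored directly in the second storage area. In this way, hot write data and non-hot-write data are stored in isolation from each other.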
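In a possible implementation, determining whether the first data is hot write data includes: if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, determining that the first data is not hot write data; and if that difference is less than or equal to the first preset threshold, determining that the first data is hot write data.
In the solution shown in this embodiment, when the storage device receives the first data, it determines the current time point and obtains the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the data that the first data is about to update was written. It then computes the difference between that time point and the current time point and compares the difference with the first preset threshold. If the difference is greater than the first preset threshold, the first data is not hot write data; if the difference is less than or equal to the first preset threshold, the first data is hot write data. When the difference is greater than the first preset threshold, the interval between updates is long and the data is not rewritten often, so the first data is not hot write data. When the difference is less than or equal to the first preset threshold, the interval between updates is short and the data is rewritten often, so the first data is hot write data. Whether data is hot write data can therefore be determined accurately.
In a possible implementation, the method further includes: if the number of reads of the data at the storage address corresponding to the first data within a preset duration before the current time point is less than or equal to a second preset threshold, determining that the first data is cold read data; and if that number of reads is greater than the second preset threshold, determining that the first data is hot read data.
The preset duration may be preset and stored in the storage device, and the second preset threshold may be preset and stored in the storage device.
In the solution shown in this embodiment, the storage device records the number of reads of the data at each storage address, incrementing the count by one on each read. The storage device obtains the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point and compares it with the second preset threshold. If the number of reads is greater than the second preset threshold, the first data is hot read data; if the number of reads is less than or equal to the second preset threshold, the first data is cold read data. A number of reads greater than the second preset threshold means that the first data is read frequently, so it is hot read data; otherwise it is cold read data. Whether data is hot read data can therefore be determined accurately.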
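According to a second aspect, a storage device is provided. The storage device includes a processor and an interface, and the processor implements the method for data compression provided in the first aspect by executing instructions.
According to a third aspect, an apparatus for data compression is provided. The apparatus includes one or more modules, and the one or more modules implement the method for data compression provided in the first aspect by executing instructions.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions, and when the instructions are run on a storage device, the storage device performs the method for data compression provided in the first aspect.
According to a fifth aspect, a computer program product containing instructions is provided. When the computer program product is run on a storage device, the storage device performs the method for data compression provided in the first aspect.
The beneficial effects of the technical solutions provided in the embodiments of this application include at least the following:
In the embodiments of this application, after receiving the first data to be stored, the storage device determines whether the first data is hot write data, and compresses the first data if it is not hot write data. In this way, whether data is hot write data is determined and a different compression algorithm is selected for each result, which can improve compression performance or increase the compression ratio.
In addition, because the embodiments of this application can improve compression performance, decompression performance improves correspondingly.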
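Brief Description of the Drawings
Figure 1 is a schematic diagram of compression ratio and compression performance provided by an embodiment of this application;
Figure 2 is a schematic structural diagram of a storage device provided by an embodiment of this application;
Figure 3 is a schematic flowchart of a method for data compression provided by an embodiment of this application;
Figure 4 is a schematic diagram of a container provided by an embodiment of this application;
Figure 5 is a schematic diagram of data storage provided by an embodiment of this application;
Figure 6 is a schematic structural diagram of an apparatus for data compression provided by an embodiment of this application;
Figure 7 is a schematic structural diagram of an apparatus for data compression provided by an embodiment of this application.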
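Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
To facilitate understanding of the embodiments, the system architecture and the terms involved are introduced first.
The embodiments of this application are applicable to storage devices in the all-flash storage field, where a storage device may be a server, a server cluster, a storage array, or the like. They are also applicable to storage devices outside the all-flash field.
Compression ratio: the ratio of the size of the data after compression to its size before compression.
Relationship between compression performance and compression ratio: as shown in Figure 1, in general, the higher the compression ratio, the worse the compression and decompression performance; the lower the compression ratio, the better the compression and decompression performance.
The embodiments of this application provide a method for data compression, which may be performed by a storage device.
Figure 2 shows a structural block diagram of the storage device in the embodiments of this application. The storage device includes at least an interface 201 and a processor 202. The interface 201 is used to receive data. In a specific implementation, the interface 201 may be a hardware interface, such as a network interface card (Network Interface Card, NIC) or a host bus adaptor (Host Bus Adaptor, HBA), or may be a program interface module. The processor 202 may be a combination of a central processing unit (Central Processing Unit, CPU) and a memory, or may be a field programmable gate array (Field Programmable Gate Array, FPGA) or other hardware. The processor 202 is the control center of the storage device and connects all parts of the storage device through various interfaces and lines.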
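The processing flow shown in Figure 3 is described in detail below with reference to specific implementations:
Step 301: Receive first data.
In implementation, the storage device serves as the storage of a system. When data in the system is written to the storage device, the storage device receives the written data (referred to below as the first data).
Step 302: Determine whether the first data is hot write data.
In implementation, after receiving the first data, the storage device determines whether the first data is being written for the first time. If it is not, the storage device determines whether the first data is hot write data.
In addition, if the first data is being written for the first time, the storage device determines the storage address of the first data and writes the first data to the storage area corresponding to that address. Alternatively, if the first data is being written for the first time, the storage device determines the storage address of the first data, compresses the first data with a preset compression algorithm, and stores it in the storage area corresponding to that address.
It should be noted that if the first data is not being written for the first time, the first data is an update to data already written in the storage device.
Optionally, this application provides several ways to determine whether the first data is hot write data. Three feasible ways are given below: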
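Method 1: If the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than the first preset threshold, the first data is determined not to be hot write data; if that difference is less than or equal to the first preset threshold, the first data is determined to be hot write data.
The first preset threshold may be preset and stored in the storage device, for example 3 hours.
In implementation, when the storage device receives the first data, it determines the current time point and obtains the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the data that the first data is about to update was written. It then computes the difference between that time point and the current time point and compares the difference with the first preset threshold. If the difference is greater than the first preset threshold, the first data is not hot write data; if the difference is less than or equal to the first preset threshold, the first data is hot write data.
For example, suppose the first preset threshold is 15 minutes, the storage address of the first data was last written at 10:15, and the current time point is 10:50. The difference between the two time points is 35 minutes, which is greater than the first preset threshold, so the first data is not hot write data.
It should be noted that when the difference is greater than the first preset threshold, the interval between updates is long and the data is not rewritten often, so the first data is not hot write data. When the difference is less than or equal to the first preset threshold, the interval between updates is short and the data is rewritten often, so the first data is hot write data.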
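The comparison in Method 1 can be sketched in a few lines of Python. This is a minimal illustration, not the implementation described in the application: the function name `is_hot_write` and the calendar date are assumptions introduced here, while the times and the 15-minute threshold come from the worked example above.

```python
from datetime import datetime, timedelta


def is_hot_write(now: datetime, last_write: datetime, first_threshold: timedelta) -> bool:
    """Return True when the address was last written within `first_threshold`.

    Hypothetical helper: a difference less than or equal to the threshold
    means hot write data, a larger difference means not hot write data.
    """
    return (now - last_write) <= first_threshold


# Worked example from the text: last write at 10:15, current time 10:50,
# first preset threshold 15 minutes. The date itself is arbitrary.
last_write = datetime(2019, 12, 24, 10, 15)
now = datetime(2019, 12, 24, 10, 50)
print(is_hot_write(now, last_write, timedelta(minutes=15)))  # False: 35 min > 15 min
```

A real storage device would keep the last-write time per storage address in its metadata; the sketch passes it in directly to keep the comparison visible.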
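Method 2: Determine the container to which the first data is to be written and the data blocks written in that container. Subtract the time point at which the storage address corresponding to each data block in the container was last written from the current time point to obtain the write-time difference for each data block. Determine the weight of the difference range to which each difference belongs, and multiply the difference by that weight to obtain the first product for each data block. Add the first products of all data blocks to obtain a first weighted value. If the first weighted value is greater than a third preset threshold, the first data is determined not to be hot write data; if the first weighted value is less than or equal to the third preset threshold, the first data is determined to be hot write data.
As shown in Figure 4, a container (Container) is a logical storage unit in the storage device and can hold multiple data blocks. The third preset threshold may be preset and stored in the storage device.
In implementation, when the storage device receives the first data, it may first determine the container to be written (for example, by randomly selecting one of the writable containers) and the data blocks written in it. It then determines the time point at which the storage address corresponding to each data block currently in the container was last written, and subtracts that time point from the current time point to obtain the write-time difference for each data block. Next, it obtains the pre-stored correspondence between difference ranges and weights and, from this correspondence, determines the weight of the range to which each data block's difference belongs. For any data block, the difference is multiplied by its weight to obtain the first product for that block. The first products of all data blocks in the container are added to obtain the first weighted value, which is compared with the third preset threshold. If the first weighted value is greater than the third preset threshold, the first data is not hot write data; if the first weighted value is less than or equal to the third preset threshold, the first data is hot write data.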
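The weighted sum in Method 2 can be sketched as follows. The text does not give the difference ranges, the weights, or the third preset threshold, so the range boundaries, the weight values, and all names below are hypothetical and chosen only to make the arithmetic visible.

```python
from bisect import bisect_right

# Hypothetical correspondence between difference ranges (in minutes) and weights:
# [0, 10) -> 0.5, [10, 60) -> 1.0, [60, infinity) -> 2.0
RANGE_BOUNDS = [10, 60]
RANGE_WEIGHTS = [0.5, 1.0, 2.0]


def weight_for(diff_minutes: float) -> float:
    """Look up the weight of the range that contains `diff_minutes`."""
    return RANGE_WEIGHTS[bisect_right(RANGE_BOUNDS, diff_minutes)]


def first_weighted_value(diffs_minutes) -> float:
    """Sum of (difference x weight) over all data blocks in the container."""
    return sum(d * weight_for(d) for d in diffs_minutes)


def container_is_hot_write(diffs_minutes, third_threshold: float) -> bool:
    """Hot write when the weighted value is less than or equal to the threshold."""
    return first_weighted_value(diffs_minutes) <= third_threshold


# Three data blocks last written 5, 20 and 120 minutes ago:
# 5 * 0.5 + 20 * 1.0 + 120 * 2.0 = 262.5
print(first_weighted_value([5, 20, 120]))  # 262.5
```

The same shape applies to the read-side variant, with read counts in place of time differences and the comparison direction reversed.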
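Method 3: If the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than the first preset threshold, the first data is determined not to be hot write data; if that difference is less than or equal to a fifth preset threshold, the first data is determined to be hot write data.
The fifth preset threshold is smaller than the first preset threshold, and may also be preset and stored in the storage device.
In implementation, when the storage device receives the first data, it determines the current time point and obtains the time point at which the storage address corresponding to the first data was last written, that is, the time point at which the data that the first data is about to update was written. It then computes the difference between that time point and the current time point and compares the difference with the first preset threshold. If the difference is greater than the first preset threshold, the first data is not hot write data. The storage device may also compare the difference with the fifth preset threshold; if the difference is less than or equal to the fifth preset threshold, the first data is hot write data.
In addition, if the difference is greater than the fifth preset threshold and less than the first preset threshold, the first data is determined to be warm write data.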
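The three-way split in Method 3 can be sketched with two thresholds. The enum, the function name, and the sample threshold values are assumptions made for illustration. The text does not say how a difference exactly equal to the first preset threshold is classified, so the sketch treats it as warm write data; that boundary choice is also an assumption.

```python
from enum import Enum


class WriteHeat(Enum):
    HOT = "hot write"
    WARM = "warm write"
    NOT_HOT = "not hot write"


def classify_write(diff_minutes: float, fifth_threshold: float, first_threshold: float) -> WriteHeat:
    """Classify a write by the time since the address was last written."""
    if fifth_threshold >= first_threshold:
        raise ValueError("the fifth preset threshold must be smaller than the first")
    if diff_minutes <= fifth_threshold:
        return WriteHeat.HOT
    if diff_minutes > first_threshold:
        return WriteHeat.NOT_HOT
    # Assumption: a difference equal to the first threshold falls here as well.
    return WriteHeat.WARM


# Hypothetical thresholds: fifth = 5 minutes, first = 15 minutes.
print(classify_write(10, 5, 15))  # WriteHeat.WARM
```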
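Step 303: When the first data is not hot write data, compress the first data.
In implementation, if the storage device determines that the first data is not hot write data, it obtains a compression algorithm and compresses the first data.
In addition, when the first data is hot write data, the first data will be modified frequently, so compressing it brings little benefit and no compression is performed.
Optionally, the first data may be compressed based on its data type, and the corresponding processing may be as follows:
When the first data is not hot write data and is non-image data or non-video data, compress the first data.
In implementation, if the storage device determines that the first data is not hot write data, it may determine whether the first data is non-image data. If it is non-image data, the first data may be compressed.
In addition, if the first data is image data, it is not compressed. Image data has already been compressed, and existing lossless compression algorithms cannot compress it further, so image data is not compressed again.
If the storage device determines that the first data is not hot write data, it may determine whether the first data is non-video data. If it is non-video data, the first data may be compressed.
In addition, if the first data is video data, it is not compressed. Video data has already been compressed, and existing lossless compression algorithms cannot compress it further, so video data is not compressed again.
In addition, if the first data is not hot write data but is a log file, each log entry has a fixed format. The fixed format can be stored as a template, other data can be compressed with this template as a reference, and the compressed data stores only the differences from the template.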
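One way to get a feel for template-based log compression is DEFLATE with a preset dictionary, which Python's standard `zlib` module exposes through the `zdict` argument. This is a swapped-in, analogous technique and not the template scheme of the application, whose encoding of differences is not specified. The template line, the log format, and the helper names are all hypothetical.

```python
import zlib

# Hypothetical log template: a representative entry in the fixed format.
TEMPLATE = b"2019-12-24 10:50:00 INFO storage.device write address=0x0000 size=4096 status=OK\n"


def compress_log(entry: bytes, template: bytes = TEMPLATE) -> bytes:
    """Compress one log entry, using the template as a preset dictionary."""
    compressor = zlib.compressobj(zdict=template)
    return compressor.compress(entry) + compressor.flush()


def decompress_log(blob: bytes, template: bytes = TEMPLATE) -> bytes:
    """Restore a log entry; the same template must be supplied."""
    decompressor = zlib.decompressobj(zdict=template)
    return decompressor.decompress(blob) + decompressor.flush()


entry = b"2019-12-24 10:51:07 INFO storage.device write address=0x1f40 size=4096 status=OK\n"
blob = compress_log(entry)
assert decompress_log(blob) == entry
```

Because the entry shares most of its bytes with the template, the compressor can encode it largely as references into the dictionary, which is the same intuition as storing only the differences from the template.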
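Optionally, the first data may also be compressed based on its read information, and the corresponding processing may be as follows:
When the first data is cold read data, compress the first data with the first compression algorithm. When the first data is hot read data, compress the first data with the second compression algorithm.
The compression ratio of the first compression algorithm is greater than that of the second compression algorithm, but the compression performance of the first compression algorithm is lower than that of the second, and the decompression performance of the second compression algorithm is higher than that of the first. For example, the first compression algorithm may be a lossless algorithm with a high compression ratio such as Zstandard (ZSTD) or GNU zip (GZIP), and the second compression algorithm may be LZ4HC, the high-compression variant of LZ4 (a Lempel-Ziv-family algorithm), which produces the same compressed format as LZ4.
In implementation, after determining that the first data is not hot write data, the storage device determines whether the first data is cold read data. If it is, the storage device obtains the pre-stored first compression algorithm and uses it to compress the first data. If it is not, the storage device obtains the pre-stored second compression algorithm and uses it to compress the first data. In other words, data that is not hot write data and is cold read data is compressed with the first compression algorithm, and data that is not hot write data and is hot read data is compressed with the second compression algorithm. Data that is cold read and not hot write is neither accessed nor modified frequently, so an algorithm with a higher compression ratio (the first compression algorithm) can be used. Data that is hot read and not hot write may be read frequently but is rarely modified, so an algorithm with high decompression performance (the second compression algorithm) can be used.
It should be noted that the compression algorithm may be obtained from a pre-stored correspondence between read types and compression algorithms, where the read types include hot read data and cold read data, as shown in Table 1: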
Table 1
Read type        Compression algorithm
Cold read data   First compression algorithm
Hot read data    Second compression algorithm
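The lookup in Table 1 amounts to a small dispatch table. The sketch below uses only the Python standard library: `gzip` at its highest level stands in for the first compression algorithm (GZIP is one of the examples named in the text), and `zlib` at its fastest level is a stand-in for the second, since LZ4HC is not in the standard library. The dictionary keys and function names are assumptions made for illustration.

```python
import gzip
import zlib


def compress_first(data: bytes) -> bytes:
    """First algorithm: favors compression ratio (gzip, level 9)."""
    return gzip.compress(data, compresslevel=9)


def compress_second(data: bytes) -> bytes:
    """Second algorithm: favors speed (zlib level 1 as a stand-in for LZ4HC)."""
    return zlib.compress(data, 1)


# Pre-stored correspondence between read type and compression algorithm.
ALGORITHM_BY_READ_TYPE = {
    "cold_read": compress_first,
    "hot_read": compress_second,
}


def compress_by_read_type(data: bytes, read_type: str) -> bytes:
    """Look up the algorithm for the read type and apply it."""
    return ALGORITHM_BY_READ_TYPE[read_type](data)


payload = b"storage block " * 200
cold_blob = compress_by_read_type(payload, "cold_read")
hot_blob = compress_by_read_type(payload, "hot_read")
```

Keeping the mapping in a table rather than in branching code mirrors the pre-stored correspondence and makes it easy to swap in a different algorithm per read type.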
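Optionally, this application also provides processing for determining whether the first data is cold read data. Three corresponding methods may be as follows:
Method 1: If the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is less than or equal to the second preset threshold, the first data is determined to be cold read data; if that number of reads is greater than the second preset threshold, the first data is determined to be hot read data.
The preset duration may be preset and stored in the storage device, and the second preset threshold may be preset and stored in the storage device.
In implementation, the storage device records the number of reads of the data at each storage address, incrementing the count by one on each read.
The storage device obtains the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point and compares it with the second preset threshold. If the number of reads is greater than the second preset threshold, the first data is hot read data; if the number of reads is less than or equal to the second preset threshold, the first data is cold read data.
For example, suppose the second preset threshold is 20 reads, the preset duration is 2 hours, and the current time point is 10:50. If the data at the storage address corresponding to the first data was read 30 times within the preset duration before the current time point, which is greater than the second preset threshold, the first data is hot read data.
It should be noted that a number of reads greater than the second preset threshold means that the first data is read frequently, so it is hot read data; otherwise it is cold read data.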
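The read counting in Method 1 can be sketched with a per-address queue of read timestamps. The text only says that the device records read counts, so the `ReadCounter` class, its sliding-window bookkeeping, and the address label are assumptions; the threshold of 20 reads, the 2-hour duration, and the 10:50 time point come from the example above.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta


class ReadCounter:
    """Hypothetical per-address read log; reads must be recorded in time order."""

    def __init__(self, window: timedelta):
        self.window = window
        self._reads = defaultdict(deque)

    def record_read(self, address: str, when: datetime) -> None:
        self._reads[address].append(when)

    def reads_in_window(self, address: str, now: datetime) -> int:
        reads = self._reads[address]
        # Drop reads that happened before the start of the preset duration.
        while reads and reads[0] < now - self.window:
            reads.popleft()
        return len(reads)


def is_hot_read(read_count: int, second_threshold: int) -> bool:
    """Hot read when the count is greater than the second preset threshold."""
    return read_count > second_threshold


now = datetime(2019, 12, 24, 10, 50)  # the date is arbitrary
counter = ReadCounter(window=timedelta(hours=2))
counter.record_read("addr-1", now - timedelta(hours=3))  # outside the window
for minutes_ago in range(30, 0, -1):
    counter.record_read("addr-1", now - timedelta(minutes=minutes_ago))
count = counter.reads_in_window("addr-1", now)
print(count, is_hot_read(count, second_threshold=20))  # 30 True
```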
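Method 2: Determine the container to which the first data is to be written and the data blocks written in that container. Determine the number of reads of the storage address corresponding to each data block within the preset duration before the current time point. Determine the weight of the read-count range to which each number of reads belongs, and multiply the number of reads by that weight to obtain the second product for each data block. Add the second products of all data blocks to obtain a second weighted value. If the second weighted value is greater than a fourth preset threshold, the first data is determined to be hot read data; if the second weighted value is less than or equal to the fourth preset threshold, the first data is determined to be cold read data.
A container (Container) is a logical storage unit in the storage device and can hold multiple data blocks. The preset duration may be the same as the preset duration mentioned above, and the fourth preset threshold may be preset and stored in the storage device.
In implementation, when the storage device receives the first data, it may first determine the container to be written and the data blocks written in it, and then determine the number of reads of the storage address corresponding to each data block within the preset duration before the current time point. It then obtains the pre-stored correspondence between read-count ranges and weights and, from this correspondence, determines the weight of the range to which each data block's number of reads belongs. For any data block, the weight is multiplied by that block's number of reads to obtain the second product for the block. The second products of all data blocks in the container are added to obtain the second weighted value.
The second weighted value is compared with the fourth preset threshold. If the second weighted value is greater than the fourth preset threshold, the first data is hot read data; if the second weighted value is less than or equal to the fourth preset threshold, the first data is cold read data.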
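Method 3: If the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is less than or equal to the second preset threshold, the first data is determined to be cold read data; if that number of reads is greater than a sixth preset threshold, the first data is determined to be hot read data.
The sixth preset threshold is greater than the second preset threshold, and may also be preset and stored in the storage device.
In implementation, the storage device obtains the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point and compares it with the second preset threshold. If the number of reads is less than or equal to the second preset threshold, the first data is cold read data. The storage device may also compare the number of reads with the sixth preset threshold; if the number of reads is greater than the sixth preset threshold, the first data is hot read data.
In addition, if the number of reads is greater than the second preset threshold and less than or equal to the sixth preset threshold, the first data is determined to be warm read data.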
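Optionally, the embodiments of this application also provide processing for storing the compressed first data in a storage area, and the corresponding processing may be as follows:
Store the compressed first data in a first storage area.
The first storage area is used to store non-hot-write data.
In implementation, the storage device stores the compressed first data in the first storage area.
Optionally, the embodiments of this application also provide a storage process for hot write data, and the corresponding processing may be as follows:
Receive second data. Determine whether the second data is hot write data. When the second data is hot write data, store the second data in a second storage area.
In implementation, the storage device serves as the storage of a system. When data in the system is written to the storage device, the storage device receives the written data (referred to below as the second data).
The storage device then determines whether the second data is hot write data. If it is, the second data is not compressed and is stored directly in the second storage area.
It should be noted that the second storage area is different from the first storage area mentioned above: the first storage area stores compressed data, and the second storage area stores uncompressed data.
For example, as shown in Figure 5, the first storage area is a non-hot-write container and the second storage area is a hot-write container, which store non-hot-write data and hot write data, respectively.
In this way, hot write data and non-hot-write data are stored separately. Hot write data is rewritten often and can be garbage collected (Garbage Collection, GC), while non-hot-write data is rarely rewritten and generally does not need garbage collection, so the efficiency of garbage collection improves.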
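The separation of the two storage areas can be sketched as a toy write path. Everything here is a simplification for illustration: the `StorageDevice` class, the use of dictionaries as storage areas, and `zlib` as the compression algorithm are assumptions, and the hot-write decision is passed in as a flag instead of being computed from timestamps.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class StorageDevice:
    """Toy model: two areas keyed by storage address."""
    first_area: dict = field(default_factory=dict)   # non-hot-write data, compressed
    second_area: dict = field(default_factory=dict)  # hot write data, uncompressed

    def write(self, address: int, data: bytes, hot_write: bool) -> None:
        if hot_write:
            # Hot write data is stored as-is in the second storage area.
            self.first_area.pop(address, None)
            self.second_area[address] = data
        else:
            # Non-hot-write data is compressed into the first storage area.
            self.second_area.pop(address, None)
            self.first_area[address] = zlib.compress(data)

    def read(self, address: int) -> bytes:
        if address in self.second_area:
            return self.second_area[address]
        return zlib.decompress(self.first_area[address])


device = StorageDevice()
device.write(1, b"rarely updated block " * 50, hot_write=False)
device.write(2, b"frequently updated block", hot_write=True)
```

Keeping frequently rewritten data in its own area means that stale copies cluster there, which is the reason the text gives for more efficient garbage collection.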
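In addition, for the warm write data and warm read data mentioned above, if a piece of data is both warm write data and warm read data, the compression algorithm may be the dictionary-based LZ4 algorithm (a Lempel-Ziv-family algorithm).
With this application, the end-to-end data reduction rate can be raised to more than 20%, while the impact on the overall performance of the storage device is less than 5%.
In the embodiments of this application, after receiving the first data to be stored, the storage device determines whether the first data is hot write data, and compresses the first data if it is not hot write data. In this way, whether data is hot write data is determined and a different compression algorithm is selected for each result, which can improve compression performance or increase the compression ratio.
In addition, because the embodiments of this application can improve compression performance, decompression performance improves correspondingly.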
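Taken together, the write and read classifications select an algorithm as in the sketch below. The string labels and the function name are hypothetical. The text covers only four combinations (hot write; not hot write with cold read; not hot write with hot read; warm write with warm read), so the sketch raises an error for any other combination instead of guessing.

```python
from typing import Optional


def choose_algorithm(write_heat: str, read_heat: str) -> Optional[str]:
    """Return a label for the algorithm to use, or None to store uncompressed."""
    if write_heat == "hot":
        return None  # hot write data is not compressed
    if write_heat == "warm" and read_heat == "warm":
        return "LZ4"
    if write_heat == "not_hot" and read_heat == "cold":
        return "first"   # higher compression ratio, e.g. ZSTD or GZIP
    if write_heat == "not_hot" and read_heat == "hot":
        return "second"  # higher decompression performance, e.g. LZ4HC
    raise ValueError("combination not covered by the text")


print(choose_algorithm("warm", "warm"))  # LZ4
```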
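Figure 6 is a structural diagram of an apparatus for data compression provided by an embodiment of this application. The apparatus may be implemented as part or all of a device through software, hardware, or a combination of both. The apparatus can implement the process shown in Figure 3 of the embodiments of this application, and includes a receiving module 610, an identification module 620, and a compression module 630, where:
the receiving module 610 is configured to receive first data, and may specifically be used to perform step 301 and the implicit steps it contains;
the identification module 620 is configured to determine whether the first data is hot write data, and may specifically be used to perform step 302 and the implicit steps it contains;
the compression module 630 is configured to compress the first data when the first data is not hot write data, and may specifically be used to perform step 303 and the implicit steps it contains.
Optionally, the compression module 630 is configured to:
when the first data is cold read data, compress the first data with a first compression algorithm;
when the first data is hot read data, compress the first data with a second compression algorithm, where the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
Optionally, as shown in Figure 7, the apparatus further includes:
a storage module 640, configured to store the compressed first data in a first storage area.
Optionally, the receiving module 610 is further configured to receive second data;
the identification module is further configured to determine whether the second data is hot write data;
the apparatus further includes:
a storage module 640, configured to store the second data in a second storage area when the second data is hot write data.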
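Optionally, the identification module is configured to:
determine that the first data is not hot write data if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, and determine that the first data is hot write data if that difference is less than or equal to the first preset threshold.
Optionally, the identification module is further configured to:
determine that the first data is cold read data if the number of reads of the data at the storage address corresponding to the first data within a preset duration before the current time point is less than or equal to a second preset threshold, and determine that the first data is hot read data if that number of reads is greater than the second preset threshold.
In the embodiments of this application, after receiving the first data to be stored, the storage device determines whether the first data is hot write data, and compresses the first data if it is not hot write data. In this way, whether data is hot write data is determined and a different compression algorithm is selected for each result, which can improve compression performance or increase the compression ratio.
In addition, because the embodiments of this application can improve compression performance, decompression performance improves correspondingly.
For the specific implementation of the apparatus for data compression shown in Figure 6, refer to the storage device described in Figure 2.
It should be noted that the apparatus for data compression provided in the above embodiments is described using the above division into functional modules only as an example. In practice, the functions can be allocated to different functional modules as needed; that is, the internal structure of the apparatus can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus for data compression and the method for data compression provided in the above embodiments belong to the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a server or a terminal, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (such as coaxial cable, optical fiber, or digital subscriber line) or a wireless manner (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a server or a terminal can access, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a Digital Video Disk (DVD)), or a semiconductor medium (such as a solid-state drive).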

Claims (20)

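  1. A method for data compression, wherein the method comprises:
    receiving first data;
    determining whether the first data is hot write data; and
    compressing the first data when the first data is not hot write data.
  2. The method according to claim 1, wherein compressing the first data comprises:
    when the first data is cold read data, compressing the first data with a first compression algorithm; and
    when the first data is hot read data, compressing the first data with a second compression algorithm, wherein the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
  3. The method according to claim 1 or 2, wherein the method further comprises:
    storing the compressed first data in a first storage area.
  4. The method according to any one of claims 1 to 3, wherein the method further comprises:
    receiving second data;
    determining whether the second data is hot write data; and
    storing the second data in a second storage area when the second data is hot write data.
  5. The method according to any one of claims 1 to 4, wherein determining whether the first data is hot write data comprises:
    if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, determining that the first data is not hot write data; and if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is less than or equal to the first preset threshold, determining that the first data is hot write data.
  6. The method according to claim 2, wherein the method further comprises:
    if the number of reads of the data at the storage address corresponding to the first data within a preset duration before the current time point is less than or equal to a second preset threshold, determining that the first data is cold read data; and if the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is greater than the second preset threshold, determining that the first data is hot read data.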
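  7. A storage device for data compression, wherein the storage device comprises an interface and a processor, wherein:
    the interface is configured to receive first data; and
    the processor is configured to:
    determine whether the first data is hot write data; and
    compress the first data when the first data is not hot write data.
  8. The storage device according to claim 7, wherein the processor is configured to:
    when the first data is cold read data, compress the first data with a first compression algorithm; and
    when the first data is hot read data, compress the first data with a second compression algorithm, wherein the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
  9. The storage device according to claim 7 or 8, wherein the processor is further configured to:
    store the compressed first data in a first storage area.
  10. The storage device according to any one of claims 7 to 9, wherein the interface is further configured to receive second data; and
    the processor is further configured to:
    determine whether the second data is hot write data; and
    store the second data in a second storage area when the second data is hot write data.
  11. The storage device according to any one of claims 7 to 10, wherein the processor is configured to:
    if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, determine that the first data is not hot write data; and if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is less than or equal to the first preset threshold, determine that the first data is hot write data.
  12. The storage device according to claim 8, wherein the processor is further configured to:
    if the number of reads of the data at the storage address corresponding to the first data within a preset duration before the current time point is less than or equal to a second preset threshold, determine that the first data is cold read data; and if the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is greater than the second preset threshold, determine that the first data is hot read data.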
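  13. An apparatus for data compression, wherein the apparatus comprises:
    a receiving module, configured to receive first data;
    an identification module, configured to determine whether the first data is hot write data; and
    a compression module, configured to compress the first data when the first data is not hot write data.
  14. The apparatus according to claim 13, wherein the compression module is configured to:
    when the first data is cold read data, compress the first data with a first compression algorithm; and
    when the first data is hot read data, compress the first data with a second compression algorithm, wherein the compression ratio of the first compression algorithm is greater than that of the second compression algorithm.
  15. The apparatus according to claim 13 or 14, wherein the apparatus further comprises:
    a storage module, configured to store the compressed first data in a first storage area.
  16. The apparatus according to any one of claims 13 to 15, wherein the receiving module is further configured to receive second data;
    the identification module is further configured to determine whether the second data is hot write data; and
    the apparatus further comprises:
    a storage module, configured to store the second data in a second storage area when the second data is hot write data.
  17. The apparatus according to any one of claims 13 to 16, wherein the identification module is configured to:
    if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is greater than a first preset threshold, determine that the first data is not hot write data; and if the difference between the current time point and the time point at which the storage address corresponding to the first data was last written is less than or equal to the first preset threshold, determine that the first data is hot write data.
  18. The apparatus according to claim 14, wherein the identification module is further configured to:
    if the number of reads of the data at the storage address corresponding to the first data within a preset duration before the current time point is less than or equal to a second preset threshold, determine that the first data is cold read data; and if the number of reads of the data at the storage address corresponding to the first data within the preset duration before the current time point is greater than the second preset threshold, determine that the first data is hot read data.
  19. A computer-readable storage medium, wherein the computer-readable storage medium stores instructions, and when the instructions stored on the computer-readable storage medium are run on a storage device, the storage device is caused to perform the method according to any one of claims 1 to 6.
  20. A computer program product containing instructions, wherein when the computer program product is run on a storage device, the storage device is caused to perform the method according to any one of claims 1 to 6.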
PCT/CN2019/127736 2018-12-26 2019-12-24 进行数据压缩的方法和装置 WO2020135384A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19902311.0A EP3883133A4 (en) 2018-12-26 2019-12-24 DATA COMPRESSION METHOD AND APPARATUS
US17/358,240 US20210318836A1 (en) 2018-12-26 2021-06-25 Data compression method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811604685.5A CN109802684B (zh) 2018-12-26 2018-12-26 进行数据压缩的方法和装置
CN201811604685.5 2018-12-26

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/358,240 Continuation US20210318836A1 (en) 2018-12-26 2021-06-25 Data compression method and apparatus

Publications (1)

Publication Number Publication Date
WO2020135384A1 true WO2020135384A1 (zh) 2020-07-02

Family

ID=66557690

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127736 WO2020135384A1 (zh) 2018-12-26 2019-12-24 进行数据压缩的方法和装置

Country Status (4)

Country Link
US (1) US20210318836A1 (zh)
EP (1) EP3883133A4 (zh)
CN (1) CN109802684B (zh)
WO (1) WO2020135384A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109802684B (zh) * 2018-12-26 2022-03-25 华为技术有限公司 进行数据压缩的方法和装置
CN110543281A (zh) * 2019-07-19 2019-12-06 苏州浪潮智能科技有限公司 一种存储压缩实现方法、装置、设备及存储介质
CN111277274A (zh) * 2020-01-13 2020-06-12 平安国际智慧城市科技股份有限公司 数据压缩方法、装置、设备及存储介质
CN112965664A (zh) * 2021-03-08 2021-06-15 北京金山云网络技术有限公司 一种数据压缩的方法和相关装置
US11681456B2 (en) * 2021-05-19 2023-06-20 Huawei Cloud Computing Technologies Co., Ltd. Compaction policies for append-only stores
CN114356225A (zh) * 2021-12-17 2022-04-15 得一微电子股份有限公司 存储器的数据存储方法、装置、终端设备以及存储介质
CN115905168B (zh) * 2022-11-15 2023-11-07 本原数据(北京)信息技术有限公司 基于数据库的自适应压缩方法和装置、设备、存储介质
CN116303409B (zh) * 2023-05-24 2023-08-08 北京庚顿数据科技有限公司 超高压缩比的工业生产时序数据透明压缩方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101526923A (zh) * 2009-04-02 2009-09-09 成都市华为赛门铁克科技有限公司 一种数据处理方法、装置和闪存存储系统
US20140208007A1 (en) * 2013-01-22 2014-07-24 Lsi Corporation Management of and region selection for writes to non-volatile memory
CN104516824A (zh) * 2013-10-01 2015-04-15 国际商业机器公司 数据存储系统中的存储管理方法和系统
CN109802684A (zh) * 2018-12-26 2019-05-24 华为技术有限公司 进行数据压缩的方法和装置

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102609360B (zh) * 2012-01-12 2015-03-25 华为技术有限公司 一种数据处理方法、装置及系统
US9355112B1 (en) * 2012-12-31 2016-05-31 Emc Corporation Optimizing compression based on data activity
CN104125458B (zh) * 2013-04-27 2017-08-08 展讯通信(上海)有限公司 内存数据无损压缩方法及装置
CN103516369B (zh) * 2013-06-20 2016-12-28 易乐天 一种自适应数据压缩和解压缩的方法和系统及存储装置
CN104199784B (zh) * 2014-08-20 2017-12-08 浪潮(北京)电子信息产业有限公司 一种基于分级存储的数据迁移方法及装置
US10101938B2 (en) * 2014-12-30 2018-10-16 International Business Machines Corporation Data storage system selectively employing multiple data compression techniques
US9990308B2 (en) * 2015-08-31 2018-06-05 Oracle International Corporation Selective data compression for in-memory databases
US9985649B1 (en) * 2016-06-29 2018-05-29 EMC IP Holding Company LLC Combining hardware and software approaches for inline data compression
US10503443B2 (en) * 2016-09-13 2019-12-10 Netapp, Inc. Systems and methods for allocating data compression activities in a storage system
US10116329B1 (en) * 2016-09-16 2018-10-30 EMC IP Holding Company LLC Method and system for compression based tiering
CN106775461B (zh) * 2016-11-30 2020-01-21 华为技术有限公司 热点数据确定方法、设备及装置
CN107463606B (zh) * 2017-06-22 2020-11-13 浙江力石科技股份有限公司 一种用于大数据存储系统的数据压缩引擎及方法
US10115437B1 (en) * 2017-06-26 2018-10-30 Western Digital Technologies, Inc. Storage system and method for die-based data retention recycling
CN107465413B (zh) * 2017-07-07 2020-11-17 南京城市职业学院 一种自适应数据压缩系统及其方法
CN108829344A (zh) * 2018-05-24 2018-11-16 北京百度网讯科技有限公司 数据存储方法、装置及存储介质
CN108932738B (zh) * 2018-07-03 2022-08-16 南开大学 一种基于字典的位片索引压缩方法
CN108920107B (zh) * 2018-07-13 2022-02-01 深圳忆联信息系统有限公司 筛选冷数据的方法、装置、计算机设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3883133A4 *

Also Published As

Publication number Publication date
CN109802684B (zh) 2022-03-25
EP3883133A4 (en) 2022-01-19
CN109802684A (zh) 2019-05-24
US20210318836A1 (en) 2021-10-14
EP3883133A1 (en) 2021-09-22

Similar Documents

Publication Publication Date Title
WO2020135384A1 (zh) 进行数据压缩的方法和装置
US20200150890A1 (en) Data Deduplication Method and Apparatus
US9927998B2 (en) Flash memory compression
US10277248B2 (en) Compression engine with consistent throughput
US20220147255A1 (en) Method and apparatus for compressing data of storage system, device, and readable storage medium
US20220164316A1 (en) Deduplication method and apparatus
US20190014016A1 (en) Data acquisition device, data acquisition method and storage medium
US20170068458A1 (en) Hardware-accelerated storage compression
CN106170757A (zh) 一种数据存储方法及装置
CN107329904A (zh) 数据读取方法及装置
CN113873255A (zh) 一种视频数据传输方法、视频数据解码方法及相关装置
CN106681659A (zh) 数据压缩的方法及装置
CN109086008A (zh) 固态硬盘的数据处理方法以及固态硬盘
US11327929B2 (en) Method and system for reduced data movement compression using in-storage computing and a customized file system
CN107577549A (zh) 一种存储重删功能的测试方法
JP2015158910A (ja) ラップ読出しから連続読出しを行うメモリサブシステム
US20140081589A1 (en) Method for measuring performance of an appliance
JP2022528284A (ja) 圧縮データの記憶及び取得の最適化
US10489350B2 (en) Data compression with inline compression metadata
CN111857574A (zh) 一种写请求数据压缩方法、系统、终端及存储介质
US20220253238A1 (en) Method and apparatus for accessing solid state disk
US9600415B1 (en) Method, apparatus, and computer program stored in computer readable medium for managing storage server in database system
CN113806389A (zh) 一种数据处理方法、装置、计算设备与存储介质
CN111930510A (zh) 电子设备和数据处理方法
WO2021212337A1 (zh) 一种数据访问方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19902311

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2019902311

Country of ref document: EP

Effective date: 20210616