WO2019228098A1 - Data compression method and device - Google Patents

Data compression method and device

Info

Publication number
WO2019228098A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
block
compressed
compression
blocks
Prior art date
Application number
PCT/CN2019/083589
Other languages
English (en)
French (fr)
Inventor
高翔
杜维
汪宁
胡天军
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2019228098A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0673 Single storage device
    • G06F3/0674 Disk device
    • G06F3/0676 Magnetic disk device
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/40 Conversion to or from variable length codes, e.g. Shannon-Fano code, Huffman code, Morse code

Definitions

  • In the existing approach, the input data to be compressed is a fixed-size data block, and compressed blocks of different sizes are output after compression.
  • As shown in FIG. 1, when lossless data compression is applied to a file system, a file can be divided into data blocks 1 to n, each 4 KB in size. The n 4 KB data blocks are then compressed in sequence, yielding compressed blocks 1 to n of different sizes.
  • Because the size of the compressed blocks output by the compression process is unpredictable, the locations at which the compressed blocks are stored in the storage space of the storage device are also irregular. For example, the storage space of a disk can be divided into several equal-sized disk blocks.
  • Disk block B stores the second half of compressed block 2 and the first half of compressed block 3.
  • Disk block C stores the second half of compressed block 3 and the first half of compressed block 4. Because stored content is read from the disk in units of disk blocks, reading compressed block 3 requires reading the full contents of both disk block B and disk block C. Moreover, the second half of compressed block 2 and the first half of compressed block 4 that are read along with it are not complete compressed blocks and cannot be decompressed, so this extra content is wasted. This phenomenon is called random read amplification. Consequently, when compression is implemented in the existing manner, devices are likely to read a large amount of invalid content when reading compressed blocks, which places an additional burden on the device.
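  • The read amplification described above can be illustrated with a minimal sketch. It assumes that variable-size compressed blocks are packed back to back on disk; the function names and the sample sizes are hypothetical and do not come from the application.

```python
def disk_blocks_touched(sizes, index, disk_block_size):
    """Return (first, last) 0-based disk block numbers that hold compressed
    block `index` when variable-size blocks are packed back to back."""
    start = sum(sizes[:index])        # byte offset where the block begins
    end = start + sizes[index] - 1    # byte offset of its last byte
    return start // disk_block_size, end // disk_block_size


def read_amplification(sizes, index, disk_block_size):
    """Bytes fetched from disk divided by bytes actually wanted."""
    first, last = disk_blocks_touched(sizes, index, disk_block_size)
    fetched = (last - first + 1) * disk_block_size
    return fetched / sizes[index]


# Hypothetical compressed sizes (bytes) for compressed blocks 1 to 4.
sizes = [3000, 2500, 3500, 2800]
print(disk_blocks_touched(sizes, 2, 4096))   # compressed block 3
print(read_amplification(sizes, 2, 4096))
```

    With these sample sizes, compressed block 3 straddles the second and third disk blocks (B and C), so two full disk blocks are fetched to obtain 3500 useful bytes.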
  • The present application provides a data compression method and device to solve problems such as the random read amplification caused by existing data compression processing methods.
  • In a first aspect, the present application provides a data compression method.
  • The method may be executed by any device capable of executing the data compression method provided in the present application, such as an image generation server, a personal computer, or a mobile terminal.
  • The original data can be input to a compression module for compression processing, and n compressed blocks containing the compressed data are then output in order.
  • Each compressed block has the same first capacity, where the first capacity is the number of bytes of compressed data that the compressed block can contain.
  • The n compressed blocks are stored in a storage medium that includes m disk blocks. Each disk block has the same second capacity, where the second capacity is the number of bytes of data that the disk block can store.
  • The compressed blocks, all of the same first capacity, may be stored in the storage medium according to either of the following two designs:
  • In the first design, the second capacity is p times the first capacity,
  • and the n compressed blocks are stored in the storage medium such that p complete compressed blocks are stored in one complete disk block.
  • In the second design, the first capacity is q times the second capacity,
  • and the n compressed blocks are stored in the storage medium such that one complete compressed block is stored in q complete disk blocks.
  • Here n, m, p, and q are positive integers, p is less than or equal to n, and q is less than or equal to m.
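  • The two storage designs reduce to simple integer arithmetic. The following is a minimal sketch under the assumption that blocks are numbered from 0 and laid out contiguously; the function name is hypothetical.

```python
def disk_blocks_for(k, first_capacity, second_capacity):
    """Map 0-based compressed block k to the range of 0-based disk blocks
    that hold it, for equal-capacity compressed blocks."""
    if second_capacity % first_capacity == 0:
        # Design 1: p complete compressed blocks per disk block.
        p = second_capacity // first_capacity
        return range(k // p, k // p + 1)
    if first_capacity % second_capacity == 0:
        # Design 2: one complete compressed block spans q disk blocks.
        q = first_capacity // second_capacity
        return range(k * q, (k + 1) * q)
    raise ValueError("capacities must be integer multiples of each other")


# Third compressed block, two compressed blocks per disk block.
print(list(disk_blocks_for(2, 2048, 4096)))
# Second compressed block, five disk blocks per compressed block.
print(list(disk_blocks_for(1, 20480, 4096)))
```

    In both designs a compressed block never shares a disk block with a partial neighbor, so every byte read belongs to a complete compressed block.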
  • The original data is compressed with the size of each output compressed block constrained, so that every output compressed block has the same first capacity.
  • The compressed blocks are stored in the storage medium such that at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block.
  • As a result, when compressed blocks are subsequently read and decompressed, more of the data read is valid and every compressed block read can be decompressed successfully, which avoids the random read amplification that occurs with fixed-size input.
  • In addition, fixing the size of the output compressed block makes it easier to keep the compression rate low than the conventional fixed-input method does.
  • If at least two of the n compressed blocks contain the same compressed data, those compressed blocks share the same storage location in the storage medium.
  • In this way, identical compressed blocks are deduplicated, which saves storage space.
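  • A minimal sketch of such deduplication follows. The application does not say how identical compressed blocks are detected; using a SHA-256 digest as the identity of a block, and the class and attribute names, are assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Stores each distinct compressed block once (hypothetical helper)."""

    def __init__(self):
        self.slots = []          # one stored payload per storage location
        self.by_digest = {}      # content digest -> storage location
        self.block_to_slot = []  # compressed block number -> storage location

    def put(self, block: bytes) -> int:
        digest = hashlib.sha256(block).digest()
        slot = self.by_digest.get(digest)
        if slot is None:
            # First time this content is seen: allocate a new location.
            slot = len(self.slots)
            self.slots.append(block)
            self.by_digest[digest] = slot
        self.block_to_slot.append(slot)
        return slot


store = DedupStore()
a = store.put(b"x" * 4096)
b = store.put(b"y" * 4096)
c = store.put(b"x" * 4096)  # same content as the first block
print(a, b, c, len(store.slots))
```

    Because all compressed blocks have the same capacity, identical blocks can share a location without any change to the block layout.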
  • An index may also be created to make it easier to find and access the compressed block in which given original data is located.
  • Method 1: The original data can be divided into i data blocks, each containing the same number of bytes as the first capacity, where the j-th of the i data blocks contains decompressed data from at most two compressed blocks. A first index is then established for the j-th data block, and the correspondence between the first index and the j-th data block is recorded. The first index identifies the storage location, in the storage medium, of the data contained in the j-th data block. Here i is a positive integer and j is any positive integer less than or equal to i.
  • The first index corresponding to the j-th data block may contain: a first or second identification bit, the block number of the next compressed block, and the intra-block offset of the j-th data block. The first identification bit indicates that the data in the next compressed block is original data; the second identification bit indicates that the data in the next compressed block is compressed data.
  • The intra-block offset of the j-th data block is the position, within the j-th data block, of the head of the data obtained by decompressing the next compressed block.
  • Alternatively, the first index corresponding to the j-th data block may contain: a third identification bit, which indicates that the data in the current compressed block is data obtained after compression processing; the block distance to a first data block located before the j-th data block, where the first data block contains the decompressed data of the current compressed block, or contains the decompressed data of the current compressed block and of the previous compressed block; the block distance to a second data block located after the j-th data block, where the second data block contains the decompressed data of the current compressed block and of the next compressed block; and the intra-block offset of the first data block, which is the position, within the first data block, of the head of the decompressed data of the current compressed block.
  • The first index corresponding to the j-th data block further includes the block number of the current compressed block.
  • Alternatively, an index may be created per compressed block. Specifically, a second index is established for the k-th of the n compressed blocks, and the correspondence between the second index and the k-th compressed block is recorded. The second index identifies the byte range of the original data that was input to the compression module to produce the k-th compressed block.
  • Here k is a positive integer less than or equal to n.
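  • A per-compressed-block index of this kind can be sketched as a list of byte ranges with a binary search for lookups. The representation as half-open (start, end) pairs and the function names are assumptions; the application only states that the second index identifies a byte range of the original data.

```python
from bisect import bisect_right


def build_second_index(raw_lengths):
    """raw_lengths[k] is the number of original bytes consumed to produce
    compressed block k; returns half-open (start, end) byte ranges."""
    index, start = [], 0
    for n in raw_lengths:
        index.append((start, start + n))
        start += n
    return index


def find_compressed_block(index, offset):
    """Return the 0-based compressed block holding original byte `offset`."""
    starts = [s for s, _ in index]
    k = bisect_right(starts, offset) - 1
    if k < 0 or offset >= index[k][1]:
        raise IndexError("offset is outside the original data")
    return k


# Hypothetical input lengths for three compressed blocks.
idx = build_second_index([6000, 4096, 9000])
print(idx)
print(find_compressed_block(idx, 6000))
```

    A reader that wants original byte 6000 only needs the one compressed block returned by the lookup, which it can then read and decompress in full.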
  • The process of inputting the original data to a compression module for compression processing and sequentially outputting n compressed blocks containing the compressed data is as follows.
  • The original data is input to the compression module byte by byte, in order, for compression processing, and the following processing is repeated until all bytes of the original data have been input to the compression module:
  • In this way, the size of the output compressed block can be fixed, and the worst case for the compression effect is that a compressed block is filled with original data. Compared with the existing fixed-input method, this makes it easier to guarantee a low compression rate.
  • The preset value is the maximum number of bytes of original data that may be input to the compression module in the current round of compression processing.
  • The present application provides a data compression apparatus.
  • The data compression apparatus may be any device capable of executing the data compression method provided in the present application, such as an image generation server, a personal computer, or a mobile terminal.
  • The data compression device includes functions, modules, or means that execute the method described in the first aspect or in any implementation or design of the first aspect. Optionally, the data compression device may also include functions, modules, or means that read and decompress the compressed blocks.
  • The functions, modules, units, or means may be implemented in software, in hardware, or by hardware executing corresponding software.
  • The data compression device includes a transceiver module and a processing module, which may correspond to the first aspect and to any implementation or design of the first aspect; details are not repeated here.
  • The data compression device includes a processor and may further include a communication interface. The communication interface is used to send and receive signals, and the processor executes program instructions to carry out the method of the first aspect and of any possible implementation or design of the first aspect.
  • The data compression device may further include one or more memories coupled to the processor, which store the computer program instructions and/or data necessary to implement the functions of the first aspect and of any possible implementation or design of the first aspect.
  • The memory may be integrated with the processor or provided separately from it; this application does not limit this.
  • The processor may execute computer program instructions stored in the memory to carry out the method of the first aspect and of any possible implementation or design of the first aspect.
  • The present application provides a chip that can communicate with a memory, or that includes a memory. The chip executes program instructions stored in the memory to implement the corresponding functions of the first aspect and of any possible implementation or design of the first aspect.
  • The present application provides a computer storage medium that stores computer-readable instructions. When the computer-readable instructions are executed, the corresponding functions of the first aspect and of any possible implementation or design of the first aspect are implemented.
  • The present application further provides a computer program product including a software program which, when run on a computer, causes the corresponding functions of the first aspect and of any possible implementation or design of the first aspect to be implemented.
  • FIG. 1 is a schematic diagram of a compression process in the prior art
  • FIG. 2 is a schematic diagram of a storage form of a compressed block output in a conventional compression processing mode on a disk;
  • FIG. 3 is a schematic diagram of a compressed file system applicable to the present application.
  • FIG. 4A is a schematic diagram of a first situation that occurs when a compressed block output in a conventional compression processing mode is stored in a storage medium;
  • FIG. 4B is a schematic diagram of the second situation that occurs when the compressed block output in the existing compression processing mode is stored in a storage medium
  • FIG. 5 is a schematic flowchart of a data compression method according to an embodiment of the present application.
  • FIG. 6A is a first schematic diagram of a storage form, in a magnetic disk, of compressed blocks with the same first capacity according to an embodiment of the present application;
  • FIG. 6B is a second schematic diagram of a storage form, in a magnetic disk, of compressed blocks with the same first capacity according to an embodiment of the present application;
  • FIG. 6C is a third schematic diagram of a storage form, in a magnetic disk, of compressed blocks with the same first capacity according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a compression processing flow in an embodiment of the present application.
  • FIG. 8 is a first schematic scenario diagram of a compression processing flow in a special case in an embodiment of the present application.
  • FIG. 9A is a second schematic scenario diagram of a compression processing flow in a special case in an embodiment of the present application.
  • FIG. 9B is a schematic diagram of an improved compression processing flow in a special case in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a situation that cannot occur in a correspondence between a compressed block and a data block in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a correspondence relationship between a data block and a compressed block in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a second index corresponding to a compressed block according to an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a sequence format in an LZ4 compression algorithm according to an embodiment of the present application.
  • FIG. 15 is a second schematic diagram of an original character sequence input in an embodiment of the present application.
  • FIG. 16 is a third schematic diagram of an original character sequence input in an embodiment of the present application.
  • FIG. 17 is a first schematic diagram of a data compression apparatus according to an embodiment of the present application.
  • FIG. 18 is a second schematic diagram of a data compression apparatus according to an embodiment of the present application.
  • This application can be applied to a compressed file system.
  • This type of file system may include a metadata area and a data area, and the metadata area includes a super block and an inode area.
  • The super block of the metadata area can include control information and data structures of the file system.
  • The inode area of the metadata area can include file description information, such as file length and file type. The file type is, for example, a regular file, a directory file (directory inode), a soft link (symbol link inode), or a special file (special inode).
  • The data stored in the data area may be data obtained by performing file-level compression processing based on lossless compression technology.
  • The compressed data in the data area is stored as a set of disk blocks in the physical storage space of a storage medium (for example, a disk or a flash memory).
  • The data of one file can be stored in consecutive disk blocks, or it can be interleaved with other data across non-contiguous disk blocks.
  • For example, disk blocks A1 to An store data of the same file,
  • while disk blocks B1 to Bx+1 and disk blocks C1 and C2 may store interleaved data of different files.
  • A disk block is a small unit of physical storage space obtained by dividing the physical storage space of the storage medium.
  • The compression algorithm used by a compressed file system generally compresses a fixed number of bytes of raw data at a time, in sequence. Because data differs in content and type, the actual compression rate (compressed size relative to original size) also differs. Therefore, the size in bytes of a fixed number of bytes of raw data after compression processing is not fixed.
  • Case 1: When the fixed amount of raw data compressed each time is small, each input contains few bytes and therefore few repeating patterns. The compressed block obtained is small, but the compression rate is actually high. For example, for the string abcdabcdefabcdabcdef..., if the fixed input is 4 characters each time, then abcd contains no repetition and cannot be compressed, so the compressed block output is still 4 characters. Although the compressed block is small, the compression rate is high.
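  • The effect of input size on compressibility can be observed with any general-purpose compressor. The sketch below uses Python's standard zlib module purely as a stand-in; it is not the algorithm of the application, and zlib's fixed header and checksum overhead exaggerates the penalty for very small inputs.

```python
import zlib

small = b"abcd"                # one 4-character input, no repetition
large = b"abcdabcdef" * 100    # the same pattern seen over a long input

small_out = zlib.compress(small)
large_out = zlib.compress(large)

# Compare input and output lengths for each case.
print(len(small), len(small_out))
print(len(large), len(large_out))
```

    The 4-byte input comes out larger than it went in, while the long input shrinks to a small fraction of its size because the compressor can refer back to earlier repetitions.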
  • When such compressed blocks are stored on the storage medium, the storage form shown in FIG. 4A is likely to occur: disk block A stores the complete compressed block 1 and the first half of compressed block 2; disk block B stores the second half of compressed block 2 and the first half of compressed block 3; disk block C stores the second half of compressed block 3 and the first half of compressed block 4; and so on, until disk block X stores the second half of compressed block n-1 and the complete compressed block n.
  • Random read amplification arises because stored content is read from the storage medium in units of disk blocks. If compressed block 3 needs to be read, all the contents of disk block B and disk block C are read out, which causes some redundant data to be read. Moreover, because the redundant data is not a complete compressed block, it cannot be decompressed. To decompress it successfully, the disk block preceding disk block B and the disk block following disk block C would also have to be read, which places an extra burden on the device.
  • IO: input/output
  • Case 2: When the fixed amount of original data compressed each time is large, the compression rate can be lower than in Case 1, but the compressed block obtained is large.
  • When such a compressed block is stored on the storage medium, the storage form shown in FIG. 4B is likely to occur: compressed block i is stored in disk block A to disk block X, where the head of disk block A is not aligned with the head of compressed block i and the tail of disk block X is not aligned with the tail of compressed block i. That is, the head of disk block A also stores content of the previous compressed block i-1, and the tail of disk block X also stores content of the next compressed block i+1.
  • In addition, decompressing the later data in compressed block i depends on the data that precedes it in the compressed block.
  • The earlier data therefore has to be kept in memory while the compressed block is decompressed, which may also consume a large amount of memory during decompression.
  • To address this, this application proposes a data compression method and device. By controlling the size of the compressed blocks produced by compression so that it is fixed, the problems that may arise with the fixed-input approach can be effectively avoided.
  • As shown in FIG. 5, a data compression method according to an embodiment of the present application includes the following steps:
  • Step 501: Obtain raw data to be compressed.
  • the execution subject of the data compression process may be any device capable of executing the data compression method provided in the present application, such as an image generation server, a personal computer, or a mobile terminal.
  • the obtained raw data to be compressed is, for example, source code and resource files of some operating systems.
  • There are many specific ways to obtain the original data such as copying or downloading the original data from other servers or storage devices, which is not limited in this application.
  • Step 502: Input the original data into the compression module for compression processing, and output, in sequence, n compressed blocks containing the compressed data, where each compressed block has the same first capacity.
  • a compression algorithm is configured in the compression module.
  • the compressed data of the original data can be output as compressed blocks of a fixed size.
  • the number of compressed blocks is n, and n is a positive integer.
  • a fixed-size compressed block can be understood as a fixed capacity of the compressed block.
  • the capacity of the compressed block is called the first capacity, and the first capacity is the number of bytes of the compressed data that the compressed block can contain.
  • Step 503: Store the n compressed blocks in m disk blocks in the storage medium.
  • the storage medium includes m disk blocks, where m is a positive integer, and each disk block has the same size, that is, each disk block has the same capacity.
  • the capacity of the disk block is called the second capacity, and the second capacity is the number of bytes of data stored in the disk block.
  • Case 1: When the second capacity is p times the first capacity, p complete compressed blocks can be stored in one complete disk block.
  • Here p is a positive integer less than or equal to n.
  • For example, compressed blocks 1 to 10 can be stored in disk blocks A to E, with each complete disk block storing two complete compressed blocks.
  • If the content of compressed block 3 is to be read, the complete compressed block 3 can be read from disk block B, so compressed block 3 can be decompressed successfully.
  • The complete compressed block 4 is also read from disk block B, so compressed block 4 can also be decompressed successfully and its decompressed data can be put to use.
  • Case 2: When the first capacity is q times the second capacity, one complete compressed block can be stored in q complete disk blocks, where q is a positive integer less than or equal to m.
  • For example, compressed block 1 can be stored in disk blocks A to E, and compressed block 2 can be stored in disk blocks F to J.
  • The contents of compressed block 1 can then be read from disk blocks A to E, and the contents of compressed block 2 can be read from disk blocks F to J.
  • Because the compressed blocks read are all complete, all of them can be decompressed successfully and the decompressed data can be put to use.
  • For example, compressed blocks 1 to 5 may be stored in disk blocks A to E, one compressed block per disk block.
  • In this case there is a one-to-one correspondence between disk blocks and compressed blocks, so whichever compressed block needs to be read, the disk block corresponding to it can be read directly, and no redundant data is read from the disk block.
  • The data read can also be decompressed successfully.
  • The original data is compressed with the size of each output compressed block constrained, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored in a disk,
  • they are stored such that at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block.
  • Because the size of the output compressed block is fixed, it is easier to achieve a low compression rate than with the conventional fixed-input method.
  • For example, if the output compressed block is 4 KB,
  • the original data input to the compression module for compression processing is at least 4 KB.
  • In the worst case, the compressed block is simply filled with 4 KB of raw data.
  • With the existing method, by contrast, if the fixed input is small, the input is likely to contain few repeating patterns and is hard to compress. For example, 3 KB of raw data may be input
  • and a 4 KB compressed block output instead, resulting in a high compression rate.
  • The first capacity of the compressed block can be made as small as practical. Then, when a compressed block is read from the disk blocks and decompressed, even if the first half of the compressed block must stay in memory while the second half is decompressed, the amount held in memory is small because the first capacity itself is small, which keeps memory usage low.
  • the content contained in the at least two compressed blocks will be the same.
  • The storage locations of the at least two compressed blocks are the same; that is, the compressed blocks share one storage location and are thereby deduplicated.
  • the original data may be sequentially input to the compression module in bytes for compression processing.
  • the processing flow shown in FIG. 7 may be repeatedly performed until all the bytes included in the original data are input.
  • Step 701: Determine that the number of bytes of the compressed data has reached the first capacity.
  • Step 702: Determine whether s, the number of bytes of original data input to the compression module in this round of compression processing, is greater than the first capacity.
  • If the determination result is yes, step 703 is performed; if the determination result is no, step 704 is performed.
  • The judgment in step 702 determines whether compressing the original data input to the compression module yields a compression gain. If the number of bytes of original data input in this round is greater than the number of bytes of the compressed data, there is a compression gain; if it is less than or equal to the number of bytes of the compressed data, there is no compression gain. If there is a compression gain, step 703 is performed; otherwise, step 704 is performed.
  • Step 703: Include the compressed data in a compressed block and output it.
  • In this case, the data contained in the output compressed block is compressed data.
  • Step 704: Continue to input t bytes of raw data to the compression module, then include the s bytes and the t bytes of raw data in a compressed block and output it.
  • Here s and t are positive integers, and t is the difference between the first capacity and s.
  • In this case, the data contained in the output compressed block is the original data input to the compression module in this round, whose length is (s + t) bytes.
  • After a compressed block is output, the number of bytes of compressed data and the number of bytes of original data input to the compression module are counted afresh. For example, one counter can be configured to count the bytes of compressed data and another to count the bytes of raw data input to the compression module, and both counters are cleared whenever a compressed block is output. With this design, it is easy to count the bytes of compressed data and the bytes of input original data in each round of compression processing. In practical applications other counting methods may also be used, which is not limited in this application.
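  • Steps 701 to 704 can be sketched as follows. This is a rough approximation under several assumptions, not the application's implementation: zlib stands in for the compression module; instead of streaming bytes and watching a counter, the sketch recompresses a growing prefix in increments of a hypothetical `step` parameter; compressed payloads are zero-padded to the block capacity; and the RAW/COMPRESSED flag travels with each block rather than in an index.

```python
import zlib

RAW, COMPRESSED = 0, 1  # hypothetical flags marking each block's content


def fixed_output_compress(data, capacity, step=64):
    """Split `data` into (flag, payload) blocks of at most `capacity` bytes."""
    blocks = []
    pos = 0
    while pos < len(data):
        # Step 701 (approximated): grow the input until its compressed form
        # would no longer fit in one block of `capacity` bytes.
        s, packed = 0, b""
        while pos + s < len(data):
            trial = min(s + step, len(data) - pos)
            out = zlib.compress(data[pos:pos + trial])
            if len(out) > capacity:
                break
            s, packed = trial, out
        if s > capacity:
            # Steps 702/703: more input bytes than output bytes, so there is
            # a compression gain; emit the compressed data, zero-padded.
            blocks.append((COMPRESSED, packed.ljust(capacity, b"\0")))
            pos += s
        else:
            # Step 704: no gain; top the input up to `capacity` raw bytes
            # (fewer only at the end of the data) and store them unmodified.
            take = min(capacity, len(data) - pos)
            blocks.append((RAW, data[pos:pos + take]))
            pos += take
    return blocks


def fixed_output_decompress(blocks):
    """Inverse of fixed_output_compress."""
    out = bytearray()
    for flag, payload in blocks:
        if flag == COMPRESSED:
            # decompressobj stops at the end of the zlib stream and leaves
            # the zero padding in its unused_data attribute.
            out += zlib.decompressobj().decompress(payload)
        else:
            out += payload
    return bytes(out)
```

    Every block except possibly the last has exactly `capacity` bytes, so blocks can be laid out on disk blocks without straddling. Recompressing each prefix is quadratic and is used here only to keep the sketch short.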
  • the compression ratio of this compression processing is 133%, that is, there is no compression benefit in this compression processing.
  • the 3KB of raw data already input to the compression module is placed in the compression block, and a further 1KB of raw data continues to be input to the compression module.
  • the 4KB of raw data is not compressed and is placed directly in the compression block by the compression module, and the compression block, which contains 4KB of raw data, is then output.
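  • A minimal sketch of the decision in steps 702 to 704, assuming a first capacity of 4 KB as in the example above. The function name `choose_block_payload` and its parameters are hypothetical and are not part of this application; the compressor itself is left outside the sketch.

```python
from typing import Tuple

FIRST_CAPACITY = 4 * 1024  # assumed first capacity (size of one compressed block), in bytes


def choose_block_payload(raw: bytes, s: int, compressed: bytes,
                         capacity: int = FIRST_CAPACITY) -> Tuple[str, bytes, int]:
    """Return (mode, payload, raw_bytes_consumed) for one output block.

    raw        -- original data that has not yet been written to any block
    s          -- number of bytes of raw already fed to the compressor this round
    compressed -- compressor output for raw[:s] (hypothetical input to this sketch)
    """
    if s > len(compressed):
        # Step 703: there is a compression gain, so the block holds compressed data.
        return "compressed", compressed, s
    # Step 704: no gain. Take t more raw bytes so that s + t equals the capacity,
    # and store the raw bytes unchanged.
    t = capacity - s
    payload = raw[:s + t]
    return "raw", payload, len(payload)
```

  • With 3 KB of input that "compresses" to 4 KB, the sketch falls back to storing 3 KB + 1 KB of raw data, matching the 133% example above.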
  • the processing method of the last compressed block output: if the number of bytes of the remaining data after compression processing is less than or equal to the number of bytes of the remaining data, the compressed data of the remaining data is included in the last compressed block and output; if the number of bytes of the compressed data of the remaining data is greater than the number of bytes of the remaining data, the remaining data itself is included in the last compressed block and output.
  • the last compressed block should contain the remaining 3KB of original data. If the remaining 3KB of original data is 3.5KB after compression processing, the last compressed block may also contain the remaining 3KB of original data. If the remaining 3KB of original data is 2KB after compression processing, the last compressed block may contain 2KB of data after compression processing.
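  • The rule for the last compressed block can be sketched as follows. This is only an illustration: `zlib` stands in for the compression module (this application discusses an LZ4-style format), and the name `last_block_payload` is hypothetical.

```python
import zlib
from typing import Tuple


def last_block_payload(remaining: bytes) -> Tuple[str, bytes]:
    """Pick what goes into the last compressed block.

    zlib is only a stand-in compressor for this sketch.
    """
    compressed = zlib.compress(remaining)
    if len(compressed) <= len(remaining):
        # Compressed size does not exceed the raw size: store compressed data.
        return "compressed", compressed
    # Compression made the data larger: store the remaining raw data as is.
    return "raw", remaining
```

  • Highly repetitive remaining data ends up stored in compressed form, while very short or incompressible remaining data is stored raw.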
  • an upper limit is set for the number of bytes of raw data input to the compression module for each round of compression processing. The upper limit may be a preset value that represents the maximum number of bytes of raw data input to the compression module for the current round of compression processing.
  • when the number of bytes of the input original data is equal to the preset value, no further original data is input; instead, the compressed data is included in a compressed block and output.
  • the original data can be logically divided into several data blocks.
  • the number of bytes included in each divided data block in the embodiment of the present application can be equal to the first capacity, that is, the data block and the compressed block are the same size.
  • the first half of the original data from data block s + 1 to data block k is sequentially input to the compression module for compression processing, and the compressed data is included in compression block c + 2. If the number of bytes of the data input from data block s + 1 to data block k has reached the preset value but the compressed data filled in compression block c + 2 has not reached the first capacity, compression block c + 2 can be output directly, and the remaining capacity of compression block c + 2 (shown as a shaded portion in FIG. 8) is no longer filled with other data.
  • Case 3: When a compression block output during compression processing reaches the first capacity, if the number of bytes of original data input from the last data block corresponding to that compression block is less than the set threshold, the original data input to the compression module this time is recompressed. The original data other than that input from the last data block is still compressed, filled in the current compression block, and output, while the original data input from the last data block is compressed, filled in the next compression block, and output.
  • the original data can be logically divided into several data blocks.
  • the original data is compressed.
  • the following data can be filled in the compressed block c + 2.
  • the input original data already corresponds to the first half of the data block s + 3 (as shown by dashed line 2 in FIG. 9A).
  • the data block s + 3 is the last data block corresponding to the current output compression block c + 2, and the original data input into the compression module for compression processing in the data block s + 3 is shown in the shaded part in FIG. 9A.
  • the compression-processed data previously filled in the compression block c + 2 can be discarded, and then the original data is re-compressed from the second half of the data block s + 1 again.
  • the original data corresponding to the second half of the data block s + 1 to the data block s + 2 can be filled in the compression block c + 2 after compression processing.
  • the compression block c + 2 still has remaining capacity (shown in the shaded part in FIG. 9B). The remaining capacity is not used for the time being, and the next compressed block is generated directly.
  • the original data from data block s + 3 is compressed and filled in compressed block c + 3.
  • the size of the differential packet in the file system can be reduced as much as possible. When some changes are made to the original data, it can minimize the impact on the storage form of the compressed blocks stored in the current disk.
  • an index may be created for the compressed file system. The index can be created either with the data block as the object or with the compressed block as the object.
  • the created index can be built into the metadata area of the compressed file system or the data area of the compressed file system. The method of creating an index is described in detail below.
  • Method 1 Create an index with the data block as the object
  • the original data can be logically divided into several data blocks; in the embodiments of the present application, the original data can be divided into i data blocks.
  • the number of bytes included in each data block is the same as the first capacity. For example, if the first capacity is 4KB, then for 64KB of raw data, it can be divided into 16 4KB data blocks, and each data block contains 4KB bytes.
  • a first index can be established for the jth data block, and the correspondence between the established first index and the jth data block can be recorded, where the first index is used to identify the storage location on the disk of the data contained in the jth data block.
  • i is a positive integer
  • j is any positive integer less than or equal to i.
  • the data filled in the compressed block falls into only two cases: one is the compressed data filled when there is a compression benefit, and the other is the raw data filled when there is no compression benefit.
  • the data block s in FIG. 10 includes the decompressed data in the compression block c to the compression block c + 2.
  • the compression block c + 1 corresponds to the data in the shaded part of the data block s. It is obvious that the data in the shaded part is smaller than the compression.
  • the j-th data block includes at most two pieces of compressed data after being decompressed.
  • the content contained in the first index corresponding to the j-th data block is:
  • a first identification bit or a second identification bit; a block number of the next compressed block; an intra-block offset of the j-th data block.
  • the first identification bit is used to identify that the data in the next compressed block is original data
  • the second identification bit is used to identify that the data in the next compressed block is compressed data
  • the intra-block offset of the jth data block is the position, within the j-th data block, of the head of the data of the next compressed block after decompression processing.
  • the compressed block stored in the disk can be numbered according to the actual division of the physical storage space of the disk to mark the storage position of the compressed block in the disk.
  • these data blocks all contain the head of the decompressed data of a newly generated compressed block.
  • the head of the data of the newly generated compressed block after decompression processing is connected to the tail of the data of the current compression block after decompression processing.
  • the header of the decompressed data of a newly generated compressed block may not be connected to the tail of the decompressed data of the current compressed block, which is not limited in this application.
  • the starting position of the data block s corresponds to the starting position of the compressed block c.
  • the compressed block c can be regarded as a newly generated next compressed block, and the data block s contains the head of the decompressed data of the compressed block c.
  • the starting position of the data block s + 1 corresponds to the starting position of the compressed block c + 1.
  • the compressed block c + 1 can also be regarded as the next generated compressed block.
  • the data block s + 1 contains the header of the compressed block c + 1 after the decompression process.
  • the position shown by the dashed line in the data block s + 3 corresponds to the starting position of the compression block c + 3, so compared to the compression block c + 2, the compression block c + 3 can also be regarded as a newly generated next compressed block, and the data block s + 3 contains the head of the decompressed data of the compressed block c + 3.
  • the above-mentioned data blocks can be referred to as the first block, and the first block can be defined as the data block containing the first part of the data after the next compressed block is decompressed.
  • the first index given in the above first example may be adopted.
  • the first block can be further divided into an uncompressed mode and a compressed mode. In the following, the contents of the first index are described in detail by combining these two modes.
  • the data in the compression block c contained in the data block s is the original data.
  • the data in the compressed block c + 2 contained in the data block s + 2 is the original data. Therefore, the data block s and the data block s + 2 can be regarded as first blocks, and these first blocks are in the uncompressed mode.
  • the first index corresponding to the data block s is shown in Table 1:
  • the intra-block offset of the data block s is zero.
  • the first identification bit can identify that the data in the compressed block c is the original data, that is, it identifies that the data block s is the first block, and the first block is the uncompressed mode.
  • the first index corresponding to the data block s + 2 is, for example, shown in Table 2:
  • the position shown by the dashed line in the data block s + 2 corresponds to the head of the decompressed data of the compression block c + 2, so the intra-block offset of the data block s + 2 is not zero.
  • after the original data following the position indicated by the dotted line in the data block s + 2 is input to the compression module, the module still outputs original data, and the output original data is filled in the compression block c + 2.
  • the first identification bit can identify that the data in the compressed block c + 2 is the original data, that is, the data block s + 2 is identified as the first block, and the first block is in the non-compressed mode.
  • the data in the next compressed block included in the data block is the compressed data, that is, the data block is the first block, and the first block is in the compression mode.
  • compressing the data of such data blocks yields a compression benefit, so the data in the compressed block c + 1 contained in the data block s + 1 is compressed data.
  • the data in the compressed block c + 3 contained in the data block s + 3 is also compressed data.
  • the data block s + 1 and the data block s + 3 can be regarded as the first block, and the first block is in a compressed mode.
  • the first index corresponding to the data block s + 1 is shown in Table 3:
  • the intra-block offset of the data block s + 1 is zero.
  • the second identification bit can identify that the data in the compressed block c + 1 is the compressed data, that is, the data block s + 1 is identified as the first block, and the first block is the compressed mode.
  • the first index corresponding to the data block s + 3 is as shown in Table 4:
  • the position shown by the dashed line in the data block s + 3 corresponds to the head of the data of the compression block c + 3 after the decompression process, so the intra-block offset of the data block s + 3 is not zero.
  • the compressed data is output from the compression module, and the compressed data is filled in the compression block c + 3.
  • the second identification bit can identify that the data in the compressed block c + 3 is the compressed data, that is, it identifies that the data block s + 3 is the first block, and the first block is the compression mode.
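  • The fields of the first index for a first block (first example) can be pictured as a small record. The sketch below is an illustration only: the class names, the block numbers 100 and 103, and the 1 KB offset are hypothetical values, and the actual bit layout of Tables 1 to 4 is not reproduced here.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    RAW = 0         # first identification bit: next compressed block holds original data
    COMPRESSED = 1  # second identification bit: next compressed block holds compressed data


@dataclass(frozen=True)
class FirstBlockIndex:
    mode: Mode               # which identification bit is set
    next_block_no: int       # block number of the next compressed block on disk
    intra_block_offset: int  # bytes from the start of the data block to the head
                             # of that compressed block's decompressed data


# Hypothetical entries in the spirit of data block s (uncompressed mode, offset 0)
# and data block s + 3 (compressed mode, non-zero offset).
index_s = FirstBlockIndex(Mode.RAW, next_block_no=100, intra_block_offset=0)
index_s3 = FirstBlockIndex(Mode.COMPRESSED, next_block_no=103, intra_block_offset=1024)
```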
  • the content contained in the first index corresponding to the j-th data block is:
  • a third identification bit; a block distance relative to a first data block before the j-th data block; a block distance relative to a second data block after the j-th data block; an intra-block offset of the first data block.
  • the third identification bit is used to identify that the data in the current compressed block is data obtained after compression processing.
  • the first data block contains the decompressed data in the current compression block, or the first data block contains the decompressed data in the current compression block and the decompressed data in the previous compression block.
  • the intra-block offset of the first data block is the position of the head of the data after the decompressing processing of the current compressed block in the first data block.
  • the second data block contains the decompressed data in the current compressed block and the decompressed data in the next compressed block.
  • the first data block and the second data block can also be understood as the first blocks described in the first example.
  • this type of data block contains only decompressed data of the current compression block, and does not contain the head of the decompressed data of the next newly generated compressed block. For example, data block s + 4 and data block s + 5 both contain decompressed data of the current compressed block c + 3, but do not contain the head of the decompressed data of the next new compressed block c + 4.
  • non-first blocks may be defined as data blocks that include the data of the current compressed block after decompression processing, but do not include the head of the data of the next compressed block after decompression processing.
  • the non-first block actually contains part of the data of the current compressed block after decompression processing.
  • the first first block located before the non-first block also contains decompressed data of that compressed block.
  • if a non-first block is followed by an adjacent non-first block, the adjacent non-first block also contains decompressed data of that compressed block, and the first first block after the non-first block also contains decompressed data of that compressed block. Non-first blocks therefore exist only in the compression mode.
  • the first index given in the above second example may be used, and the content of the first index is described in detail below.
  • the first index corresponding to the data block s + 4 is as shown in Table 5:
  • Intra-block offset of the first data block | Third identification bit | Block distance from the first data block | Block distance from the second data block
  • the intra-block offset of the first data block is the position shown by the dashed line in the data block s + 3 shown in FIG. 11, and the position shown by the dashed line in the data block s + 3 corresponds to the head of the decompressed data of the compressed block c + 2.
  • the block distance from the first data block is the block distance from the data block s + 3.
  • the block distance to the second data block is the block distance to the data block s + 6.
  • the data block s + 3 contains decompressed data of the compressed block c + 2 and decompressed data of the compressed block c + 3, so the data block s + 3 is the first first block located before the data block s + 4.
  • in addition to decompressed data of the compressed block c + 3, the data block s + 6 also contains decompressed data of the compressed block c + 4, so the data block s + 6 is the first first block located after the data block s + 4.
  • the block distance to a first block can be expressed in units of data blocks. In this case, the block distance relative to data block s + 3 is equal to 1 data block, and the block distance relative to data block s + 6 is equal to 2 data blocks.
  • the block distance can also be expressed in units of bytes. For example, if a data block is 4KB, the block distance relative to data block s + 3 is equal to 4KB, and the block distance relative to data block s + 6 is equal to 8KB.
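  • The two ways of expressing the block distance can be illustrated with a short sketch, assuming 4 KB data blocks and integer data block numbers. The function name `block_distance` and the choice s = 5 are hypothetical.

```python
BLOCK_SIZE = 4 * 1024  # assumed data block size in bytes


def block_distance(from_block: int, to_block: int, in_bytes: bool = False) -> int:
    """Distance between two data blocks, in data blocks or in bytes."""
    distance = abs(to_block - from_block)
    return distance * BLOCK_SIZE if in_bytes else distance


s = 5  # hypothetical number of data block s
to_first = block_distance(s + 4, s + 3)                        # distance in data blocks
to_second = block_distance(s + 4, s + 6)                       # distance in data blocks
to_first_bytes = block_distance(s + 4, s + 3, in_bytes=True)   # distance in bytes
to_second_bytes = block_distance(s + 4, s + 6, in_bytes=True)  # distance in bytes
```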
  • the third identification bit may identify that the data in the compressed block c + 3 is compressed data, that is, it identifies that the data block s + 4 is a non-first block, and that the non-first block is in the compression mode.
  • the above block distance expression is only an exemplary description. In actual application, other ways may be used to represent the block distance. For example, on the basis of the first way, the block distance relative to the first data block is reduced by one, and the block distance relative to the second data block is increased by one.
  • the first index corresponding to the data block s + 5 can also refer to the first index shown in Table 5, which will not be introduced one by one here.
  • the first index corresponding to the j-th data block may further include a block number of the current compressed block and a fourth identification bit.
  • the fourth identification bit is used to identify whether the data in the current compressed block is data after compression processing, that is, to identify whether the j-th data block is in the compression mode. Whether the j-th data block is a first block is determined according to the block distance from the first data block.
  • when the value of the block distance from the first data block is zero, the j-th data block may be determined to be a first block, and when the value of the block distance from the first data block is not zero, the j-th data block may be determined not to be a first block.
  • the first index corresponding to the data block s + 4 is continued to be listed.
  • the first index corresponding to the data block s + 4 is as shown in Table 6:
  • the block distance relative to the first data block can be split into two parts, one part representing the high x bits and the other part representing the low y bits.
  • the block distance with respect to the first data block is composed of 8 bits
  • the upper x bits may represent the upper 4 bits
  • the lower y bits may represent the lower 4 bits.
  • the above examples are only for illustrative purposes, and the specific implementation is not limited to them. The values of x and y can be configured according to actual requirements, provided that the total number of bits occupied by the three fields (the block distance relative to the first data block, the intra-block offset of the first data block, and the block distance relative to the second data block) is unchanged.
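  • Splitting the block distance into a high part and a low part is ordinary bit manipulation. The sketch below assumes the 8-bit example with x = 4 and y = 4; the function names are hypothetical and the placement of the two parts inside the index entry is not shown.

```python
from typing import Tuple


def split_distance(distance: int, x: int = 4, y: int = 4) -> Tuple[int, int]:
    """Split a block distance of x + y bits into (high x bits, low y bits)."""
    if not 0 <= distance < (1 << (x + y)):
        raise ValueError("distance does not fit in x + y bits")
    high = distance >> y
    low = distance & ((1 << y) - 1)
    return high, low


def join_distance(high: int, low: int, y: int = 4) -> int:
    """Rebuild the block distance from its high and low parts."""
    return (high << y) | low
```

  • Changing x and y only changes where the split falls; the round trip through `split_distance` and `join_distance` returns the original value as long as the distance fits in x + y bits.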
  • the block number of the compressed block c + 3 is added to the first index, so that the position of the corresponding compressed block is determined directly from the content of the first index corresponding to the data block s + 4, without first looking up the first index corresponding to the data block s + 3 to determine the position of the corresponding compressed block. This makes searching easier and more efficient.
  • the first index may also be represented in the manner provided in the third example.
  • the role of the block distance with respect to the second data block involved in the second and third examples described above is mainly to determine which data block corresponds to the end of the original data after the compressed block is decompressed.
  • the block distance relative to the second data block may be an optional content
  • the first index may include the block distance relative to the second data block or may not include the block distance relative to the second data block.
  • the data block corresponding to the specified byte range can be determined first; the first index corresponding to that data block is then looked up, and the first index is used to find the position of the compressed block to be read.
  • the first data block corresponds to the 1st to 4th KB of original data
  • data block s corresponds to the 17th to 20th KB of original data
  • data block s + 1 corresponds to the 21st to 24th KB of original data
  • data block s + 2 corresponds to the 25th to 28th KB of original data
  • data block s + 3 corresponds to the 29th to 32nd KB of original data
  • data block s + 4 corresponds to the 33rd to 36th KB of original data ... and so on.
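  • The first step of a lookup, finding which data blocks cover a requested range, can be sketched as below. The sketch assumes 4 KB data blocks numbered from 1 (so that the first data block covers the 1st to 4th KB); under that assumption, data block s in the list above would be number 5. The function name is hypothetical.

```python
KB = 1024
BLOCK_SIZE = 4 * KB  # assumed: data block size equals the first capacity (4 KB)


def data_blocks_for_range(first_kb: int, last_kb: int) -> range:
    """1-based data block numbers covering the first_kb-th to last_kb-th KB."""
    first_block = ((first_kb - 1) * KB) // BLOCK_SIZE + 1
    last_block = ((last_kb - 1) * KB) // BLOCK_SIZE + 1
    return range(first_block, last_block + 1)
```

  • Once the data block numbers are known, the first index of each of those data blocks is consulted as described in the scenarios that follow.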
  • a few search scenarios are listed below for specific description.
  • Scenario 1: To find the 17th to 20th KB of original data, look up the first index corresponding to the data block s (for example, as shown in Table 1 above). From the block number of the compressed block c included in the first index, it can be determined that the data block s corresponds to the data in the compressed block c, and the storage location of the compressed block c on the disk can be determined. Further, the intra-block offset of the data block s is zero, which means that, starting from the beginning of the compressed block c, all the data in the compressed block c corresponds to the data block s. Since the data in the compressed block c is original data and the intra-block offset of the data block s is 0, the data in the compressed block c can be read directly to obtain the 17th to 20th KB of original data.
  • Scenario 2: To find the 29th to 32nd KB of original data, look up the first index corresponding to the data block s + 3 (for example, as shown in Table 4 above). From the block number of the compressed block c + 3 contained in the first index, it can be determined that the data block s + 3 corresponds to the data in the compressed block c + 3, and the storage location of the compressed block c + 3 on the disk can be determined. Further, from the intra-block offset of the data block s + 3, it can be determined that the part after the position of the dotted line in the data block s + 3 corresponds to the data in the compressed block c + 3, and the part before the position of the dotted line corresponds to the data in the previous compression block c + 2.
  • the data block s + 2 before the data block s + 3 is a first block, so the first index corresponding to the data block s + 2 can be looked up directly.
  • the first index corresponding to the data block s + 2 records the block number of the compressed block c + 2 and the intra-block offset of the data block s + 2.
  • from the block number of the compressed block c + 2, the storage location of the compressed block c + 2 can be determined. From the intra-block offset of the data block s + 2, it can be determined that the part before the dotted line in the data block s + 3 corresponds to the data in the compressed block c + 2. Further, the compressed block c + 2 is read.
  • the compressed block c + 2 can be read first.
  • the data in the compressed block c + 2 that corresponds to the part before the dashed line in the data block s + 3 can then be obtained by copying.
  • the copy process here can also be regarded as a special decompression process, and the original data copied from the compressed block can also be understood as the decompressed data.
  • the data in the compressed block c + 3 is compressed data
  • the data in the compressed block c + 3 can therefore be read and decompressed.
  • from the decompressed data of the compressed block c + 3, the part corresponding to the portion after the dotted line in the data block s + 3 can be obtained, and the data obtained in this way is the 29th to 32nd KB of original data.
  • from the block distance from the first data block included in the first index corresponding to the data block s + 2, the first first block before the data block s + 2 can be determined.
  • after the first first block before the data block s + 2 is found, the above process can be followed to read and decompress the data in the compressed block.
  • combining the block distance relative to the first data block and the intra-block offset of the first data block recorded in the first index corresponding to the data block s + 2, the required original data is obtained from the decompressed data.
  • Scenario 3 If you need to find the 33KB to 36KB raw data, you can find the first index corresponding to the data block s + 4.
  • the block distance from the first data block can be used to find that the first data block is data block s + 3, and then to find the first index corresponding to data block s + 3 (for example, shown in Table 4 above).
  • the compressed block c + 3 can be determined.
  • the first 3KB of the decompressed data of the compressed block c + 3 is the decompressed data contained after the position shown by the dashed line in the data block s + 3.
  • the 4th to 7th KB of the decompressed data of the compressed block c + 3 is the decompressed data contained in the data block s + 4, and thus the original data to be searched can be obtained.
  • the decompression processing on the compressed block may either decompress all the data in the compressed block, or check during decompression whether the required original data has been found and abort the decompression process once it has been found.
  • if the first index corresponding to the data block s + 4 is as shown in Table 6 above, the first index already contains the block number of the compressed block c + 3, so it is unnecessary to look up the first index corresponding to the data block s + 3.
  • the storage location of the compressed block c + 3 can be obtained directly.
  • the compressed block c + 3 can be decompressed by referring to the method shown in the above-mentioned A to obtain the 33rd to 36th KB of original data.
  • Scenario 4: To find the 30th to 31st KB of original data, look up the first index corresponding to the data block s + 3 (for example, as shown in Table 4 above). From the block number of the compressed block c + 3 contained in the first index, it can be determined that the data block s + 3 corresponds to the data in the compressed block c + 3, and the storage location of the compressed block c + 3 on the disk can be determined. The data in the compressed block c + 3 is read and decompressed. Assume here that the intra-block offset of data block s + 3 in this scenario is 1KB, that is, the data contained between the starting position of data block s + 3 and the position shown by the dashed line is 1KB. It can then be determined that the first 2KB of the decompressed data of the compressed block c + 3 is the 30th to 31st KB of original data.
  • Scenario 5: To find the 29th KB of original data, first look up the first index corresponding to the data block s + 3 (for example, Table 4 above). Continuing with the example given in Scenario 4, because the intra-block offset of the data block s + 3 is 1KB, it is determined that the decompressed data of the compressed block c + 3 does not include the 29th KB of original data, so it is necessary to further find the first index corresponding to the first first block before the data block s + 3.
  • in this scenario, that index is the first index corresponding to the data block s + 2 (for example, shown in Table 2 above).
  • from the block number of the compressed block c + 2 in that first index, it can be determined that the data block s + 2 corresponds to the data in the compressed block c + 2, and the storage location of the compressed block c + 2 on the disk can be determined.
  • the data in the compressed block c + 2 is original data.
  • as for the intra-block offset of the data block s + 2, assume here that it is 1KB in this scenario, that is, the data contained between the start position of the data block s + 2 and the position shown by the dashed line is 1KB.
  • it can then be determined that the first 3KB of the data read from the compressed block c + 2 is the data contained in the data block s + 2, and the 4th KB of the data read from the compressed block c + 2 is the 29th KB of original data, which is contained in the data block s + 3.
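  • The offset arithmetic used in Scenarios 3 to 5 can be sketched as follows. The sketch works in byte offsets into the original data (so the "29th KB" is the range [28 KB, 29 KB)), assumes 4 KB data blocks with data block s + 3 starting at offset 28 KB, and assumes the 1 KB intra-block offset used in those scenarios. The function name is hypothetical, and the sketch does not check where the compressed block's data ends.

```python
from typing import Optional, Tuple

KB = 1024


def slice_in_decompressed(req_start: int, req_end: int,
                          head_pos: int) -> Optional[Tuple[int, int]]:
    """Map a requested range of original data onto a compressed block's
    decompressed data.

    req_start, req_end -- requested byte range in the original data, end exclusive
    head_pos           -- offset in the original data of the head of the block's
                          decompressed data (data block start + intra-block offset)
    Returns the (start, end) slice of the decompressed data, or None when the
    request starts before the head, i.e. an earlier compressed block is needed.
    """
    if req_start < head_pos:
        return None
    return req_start - head_pos, req_end - head_pos


# Head of compressed block c + 3: data block s + 3 starts at 28 KB, offset 1 KB.
head_c3 = 28 * KB + 1 * KB
scenario4 = slice_in_decompressed(29 * KB, 31 * KB, head_c3)  # 30th to 31st KB
scenario3 = slice_in_decompressed(32 * KB, 36 * KB, head_c3)  # 33rd to 36th KB
scenario5 = slice_in_decompressed(28 * KB, 29 * KB, head_c3)  # 29th KB
```

  • Under these assumptions, Scenario 4 maps to the first 2 KB of the decompressed data, Scenario 3 to its 4th to 7th KB, and Scenario 5 falls before the head of compressed block c + 3, so the previous first block has to be consulted.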
  • Method 2 Create an index using the compressed block as the object
  • k is a positive integer less than or equal to n.
  • the byte range of the original data that is input to the compression module for compression processing and corresponds to each compressed block can be represented by an offset within the file. For example, assume that the first capacity of each of compressed block 1 to compressed block n is 4 KB. If the in-file offset 0 corresponding to compressed block 1 indicates a byte range of 0 to 8 KB, it characterizes that the original data of 0 to 8 KB is compressed and filled in compressed block 1. If the in-file offset 1 corresponding to compressed block 2 indicates a byte range of 8 to 20 KB, it characterizes that the original data of 8 to 20 KB is compressed and filled in compressed block 2. If the in-file offset n corresponding to compressed block n indicates a byte range of 128 to 136 KB, it characterizes that the original data of 128 to 136 KB is compressed and filled in compressed block n.
  • the indexing method shown in the second method may use a binary search method to find a compressed block corresponding to the original data.
  • the second method is more efficient in the scenario where there are fewer compressed blocks, and because there are fewer compressed blocks, the number of indexes to be established is relatively small, which can also save storage space.
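  • A binary search over an index keyed by compressed block can be sketched with Python's standard `bisect` module. The sketch assumes the index stores, in ascending order, the in-file start offset of the original data covered by each compressed block; the 0 to 8 KB and 8 to 20 KB ranges follow the example above, while the third entry and all names are hypothetical.

```python
from bisect import bisect_right
from typing import Sequence

KB = 1024


def find_compressed_block(starts: Sequence[int], block_numbers: Sequence[int],
                          offset: int) -> int:
    """Return the number of the compressed block whose original-data range
    contains the given in-file byte offset.

    starts        -- sorted in-file start offsets, one per compressed block
    block_numbers -- compressed block number for each entry in starts
    """
    i = bisect_right(starts, offset) - 1
    if i < 0:
        raise ValueError("offset precedes the first compressed block")
    return block_numbers[i]


starts = [0, 8 * KB, 20 * KB]  # third start offset is a hypothetical value
block_numbers = [1, 2, 3]
```

  • Each lookup takes a number of comparisons logarithmic in the number of compressed blocks, and the index holds one entry per compressed block.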
  • LZ4 is a byte-oriented compression format.
  • a compression block may be composed of several sequences after compression processing, which is hereinafter simply referred to as a compression sequence. Each compressed sequence can record a certain byte length literal and a sliding window match. The format is as shown in FIG. 13.
  • a compressed sequence contains a token, a literal, and an offset. Optionally, it may also include a literal variable length (linear small-integer code, lsic_literal) and a sliding window matching variable length (lsic_match).
  • literal represents the part of the original data stored directly in the compressed sequence, that is, the original data that cannot be compressed.
  • match represents the part of the compressed sequence that can be matched.
  • offset represents the offset between the currently input original data and the same original data that has been input before. The offset is represented by a fixed 2 bytes. If offset is 0, the byte length of match is zero, that is, there is no sliding window match. The maximum offset (MAXDISTANCE) is 65535.
  • the token is represented by 1 byte, that is, 8 bits. The upper 4 bits can be used to identify the byte length of literal (token_literal), and the lower 4 bits can be used to identify the byte length of match (token_match).
  • token_literal can represent a maximum of 15 bytes in literal. If the actual byte length is greater than or equal to 15, there is lsic_literal in the compressed sequence, which is used to identify the remaining byte length of the literal other than 15 bytes.
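  • How a length is split between a 4-bit token field and the optional lsic bytes can be sketched as follows. The extension rule used here (each extra byte adds up to 255, and a byte below 255 ends the run) is taken from the publicly documented LZ4 block format and is an assumption with respect to this application; the function names are hypothetical.

```python
from typing import Tuple


def encode_length(n: int) -> Tuple[int, bytes]:
    """Return (4-bit token field, lsic extension bytes) for a length n."""
    if n < 15:
        return n, b""
    rest = n - 15
    extra = bytearray()
    while rest >= 255:
        extra.append(255)   # 255 means "more length bytes follow"
        rest -= 255
    extra.append(rest)      # final byte, always below 255
    return 15, bytes(extra)


def make_token(literal_len: int, match_len_code: int) -> int:
    """Pack the two 4-bit fields: upper 4 bits literal, lower 4 bits match."""
    literal_field, _ = encode_length(literal_len)
    match_field, _ = encode_length(match_len_code)
    return (literal_field << 4) | match_field
```

  • In this sketch a literal of exactly 15 bytes already needs one extension byte (with value 0), which agrees with the "greater than or equal to 15" condition above.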
  • MIMMATCH minimum byte length
  • the compression algorithm in the embodiment of the present application is based on the principle of dynamic programming. There are two specific implementation methods:
  • the schematic diagram of the input original sequence (the original sequence can be understood as the original data corresponding to the compressed sequence) is shown in FIG. 14.
  • the naive cost [i] cost function is:
  • i is the end position of the original sequence input this time
  • j is the start position of the original sequence input this time, that is, the start position of the literal part of the original sequence input this time
  • k is the start position of the part of the original sequence input this time that serves as the match. i, j, and k satisfy the relationship j ≤ k ≤ i, j < i.
  • the original sequence input this time is the original data composed of the jth to i-1 bytes.
  • cost[i] represents the minimum byte length of the compressed sequence when the original data is input in sequence until the last input original sequence ends at i (that is, until the input original data is the suffix ending with i), where the suffix ending with i can be understood as the original data composed of bytes 0 to i-1.
  • cost[j] represents the minimum byte length of the compressed sequence when the original data is input in sequence until the last input original sequence ends at j (that is, until the input original data is the suffix ending with j), where the suffix ending with j can be understood as the original data composed of bytes 0 to j-1.
  • k-j is the byte length of the part of the original sequence input as a literal.
  • the lsic function represents the byte length of lsic_literal or lsic_match, specifically:
  • lsic [i] represents the byte length of lsic_literal when no match part exists in the input original sequence
  • lsic [k-j] represents the byte length of lsic_literal when a match part exists in the input original sequence
  • ml is the effective match byte length, which cannot exceed the length of the original sequence input this time, that is, ml ≤ i - j. longestmatch is the maximum suffix length that can be obtained by continuously repeating a prefix of the suffix ending with i (it can be calculated through the mismatch function of a modified Knuth-Morris-Pratt (KMP) algorithm). For example, if the original sequence is ABABAB, longestmatch is 4; if the original sequence is ABCABCABCABC, longestmatch is 9; if the original sequence is ABCDABC, longestmatch is 3; and if the original sequence is BABCDABC, longestmatch is also 3.
  • m is the starting position of the template string (template), which can also be obtained by the modified KMP algorithm mismatch function.
  • the position of the template string (template) in the input original sequence is shown in FIG. 15; the template string is located within the input original sequence.
  • the template string may contain some or all of the original data in the literal part of the input original sequence, or some or all of the original data in the match part of the input original sequence.
  • k-m represents the value of offset. Since offset has a maximum of two bytes, its maximum value MAXDISTANCE is 65535.
  • alternatively, the transition can be expressed using a variant in which the input original sequence ends with a literal.
  • cost function of cost [i] is:
  • i is the end position of the part of the original sequence input as a literal.
  • ml represents the effective byte length of the match. Unlike in (1), ml here is a loop variable whose value ranges from MINMATCH to longestmatch.
  • MINMATCH and longestmatch can refer to the explanation in (1).
  • lsicdelta_lit is the increase in the byte length of lsic_literal when the literal length corresponding to cost[i-1] is increased by one. Its value is 0 or 1: for example, when the literal length corresponding to cost[i-1] is 14, 15 + 254, 15 + 255 + 254, ..., lsicdelta_lit is 1; otherwise it is 0.
  • cost[i] represents the minimum byte length of the compressed sequence when the original data is input in sequence up to the input original sequence whose literal part ends at i. This can be understood as inputting the original data composed of bytes 0 to i-1, where byte i-1 is not necessarily part of a literal; rather, the original data after byte i-1 is first treated as a literal during compression, although its actual byte length may be zero.
  • cost[i-1] represents the minimum byte length of the compressed sequence when the original data is input in sequence up to the input original sequence whose literal part ends at i-1. This can be understood as inputting the original data composed of bytes 0 to i-2, where byte i-2 is not necessarily part of a literal; rather, the original data after byte i-2 is first treated as a literal during compression, although its actual byte length may be zero.
  • cost[i-ml] represents the minimum byte length of the compressed sequence when the original data is input in sequence up to the input original sequence whose literal part ends at i-ml. This can be understood as inputting the original data composed of bytes 0 to i-ml-1, where byte i-ml-1 is not necessarily part of a literal; rather, the original data after byte i-ml-1 is first treated as a literal during compression, although its actual byte length may be zero.
  • lsic [ml-MINMATCH] indicates the byte length of lsic_match when a match part exists in the input original sequence.
  • An embodiment of the present application provides a data compression apparatus, and the data compression apparatus may be any device capable of executing the methods involved in the foregoing method embodiments of the present application, such as an image generation server, a personal computer, or a mobile terminal.
  • the data compression device includes functions or modules or means for performing the method involved in the foregoing method embodiments of the present application, and optionally, the data compression device may also include functions or modules or means for performing the reading and decompression of compressed blocks.
  • the functions or modules or units or means may be implemented by software, or by hardware, and may also be implemented by hardware executing corresponding software.
  • FIG. 17 shows a first schematic structural diagram of a data compression apparatus according to an embodiment of the present application.
  • the data compression device 1700 includes a transceiver module 1701, a processing module 1702, and a compression module 1703.
  • the transceiver module 1701 may be configured to obtain raw data to be compressed.
  • the processing module 1702 may be used to input the original data to the compression module 1703 for compression processing, and to output, in sequence, n compressed blocks containing the compressed data, wherein each output compressed block has the same first capacity, and the first capacity represents the number of bytes of compressed data that a compressed block can contain.
  • the processing module 1702 may also be configured to store the n compressed blocks in a storage medium, wherein the storage medium includes m disk blocks, each disk block has the same second capacity, and the second capacity represents the number of bytes of data stored by a disk block.
  • compressed blocks with the same first capacity may be stored in the storage medium according to either of the following two designs:
  • in one design, the second capacity is p times the first capacity, and the n compressed blocks are stored in the storage medium such that p complete compressed blocks are stored in one complete disk block.
  • in the other design, the first capacity is q times the second capacity, and the n compressed blocks are stored in the storage medium such that one complete compressed block is stored in q complete disk blocks.
  • n, m, p, and q are positive integers, p is less than or equal to n, and q is less than or equal to m.
  • if the n compressed blocks include at least two compressed blocks whose compressed data is identical, then when the processing module 1702 stores the at least two compressed blocks in the storage medium, it may also deduplicate them, so that the at least two compressed blocks have the same storage location in the storage medium.
  • the processing module 1702 may also be used to create an index.
  • the processing module 1702 may be configured to divide the original data into i data blocks, each containing the same number of bytes as the first capacity, wherein the j-th of the i data blocks contains decompressed data from at most two compressed blocks.
  • the processing module 1702 may be further configured to establish a first index for the j-th data block and to record the correspondence between the first index and the j-th data block. The first index is used to identify the storage location, in the storage medium, of the data contained in the j-th data block. i is a positive integer, and j is any positive integer less than or equal to i.
  • when the j-th data block contains decompressed data of the next compressed block, the first index corresponding to the j-th data block contains: the first identification bit or the second identification bit, the block number of the next compressed block, and the intra-block offset of the j-th data block. The first identification bit identifies that the data in the next compressed block is original data; the second identification bit identifies that the data in the next compressed block is compressed data; and the intra-block offset of the j-th data block is the position, within the j-th data block, of the head of the decompressed data of the next compressed block.
  • when the j-th data block contains only decompressed data of the current compressed block, the first index corresponding to the j-th data block contains: (a) the third identification bit, which identifies that the data in the current compressed block is data obtained after compression processing; (b) the block distance relative to the first data block located before the j-th data block, where the first data block contains decompressed data of the current compressed block, or contains decompressed data of both the current compressed block and the previous compressed block; (c) the block distance relative to the second data block located after the j-th data block, where the second data block contains decompressed data of both the current compressed block and the next compressed block; and (d) the intra-block offset of the first data block, which is the position, within the first data block, of the head of the decompressed data of the current compressed block.
  • optionally, the first index corresponding to the j-th data block may further include the block number of the current compressed block.
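  • the compression processing performed by the processing module 1702 may specifically include: sequentially inputting the original data, byte by byte, to the compression module 1703 for compression processing, and repeatedly performing the following processing until all bytes of the original data have been input to the compression module:
  • each time it is determined that the number of bytes of the data compressed this time has reached the first capacity, determining whether the number of bytes s of the original data input to the compression module 1703 this time is greater than the first capacity; if so, including the data compressed this time in one compressed block and outputting it; if not, continuing to input t bytes of the original data to the compression module 1703, and including the s bytes and the t bytes of original data in one compressed block and outputting it, where s and t are positive integers and t is the difference between the first capacity and s.
  • the processing module 1702 may be further configured to: when the number of bytes of the original data input to the compression module 1703 this time is equal to a preset value, if the number of bytes of the data compressed this time has still not reached the first capacity, include the compressed data in one compressed block and output it. The preset value is the maximum number of bytes of original data input to the compression module 1703 for a single round of compression processing.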
  • FIG. 18 shows a second schematic diagram of a structure of a data compression device according to an embodiment of the present application.
  • the data compression device 1800 may include a processor 1801 and a communication interface 1802.
  • the processor 1801 may be a central processing unit (CPU) or a network processor (NP).
  • the processor 1801 may also be other types of chips, such as a baseband circuit, a radio frequency circuit, an application specific integrated circuit (ASIC), a programmable logic device (PLD), or any combination thereof.
  • the data compression device 1800 may further include a memory 1803 for storing a program executed by the processor 1801 and data required for processing.
  • the memory 1803 may be integrated in the processor 1801, or may be separately provided from the processor 1801.
  • the processor 1801 may be configured to use the communication interface 1802 to obtain raw data to be compressed.
  • the processor 1801 may be configured to input the original data to the compression module for compression processing, and to output, in sequence, n compressed blocks containing the compressed data, where each output compressed block has the same first capacity, and the first capacity indicates the number of bytes of compressed data that a compressed block can contain.
  • the processor 1801 may also be configured to store the n compressed blocks in a storage medium, where the storage medium includes m disk blocks, each disk block has the same second capacity, and the second capacity represents the number of bytes of data stored by a disk block.
  • for specific functions of the processor 1801, the communication interface 1802, and the memory 1803, refer to the corresponding descriptions in the foregoing method embodiments of the present application; details are not described herein again.
  • the present application further provides a chip that can communicate with a memory, or that includes a memory, and the chip executes program instructions stored in the memory to implement the functions corresponding to the methods involved in the foregoing method embodiments.
  • the present application also provides a computer storage medium, where the computer storage medium stores computer-readable instructions, and when the computer-readable instructions are executed, the functions corresponding to the methods involved in the foregoing method embodiments are implemented.
  • the present application also provides a computer program product containing a software program, which when run on a computer, enables the functions corresponding to the methods involved in the above method embodiments to be implemented.
  • this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
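The length-coding quantities described for the LZ4 sequence format (the token nibbles, lsic, lsicdelta_lit, and longestmatch) can be illustrated with a minimal Python sketch. All function names here are hypothetical, and `longest_match` is a brute-force reading of the definition that reproduces the four examples in the text; it is not the modified KMP routine the application refers to.

```python
TOKEN_MAX = 15  # largest value that fits in one 4-bit half of the token


def token_byte(literal_len: int, match_code: int) -> int:
    """Pack the token: upper 4 bits for the literal, lower 4 bits for the match."""
    return (min(literal_len, TOKEN_MAX) << 4) | min(match_code, TOKEN_MAX)


def lsic_len(value: int) -> int:
    """Number of extra lsic bytes needed when a token half must encode `value`."""
    if value < TOKEN_MAX:
        return 0
    return 1 + (value - TOKEN_MAX) // 255


def lsicdelta_lit(literal_len: int) -> int:
    """Growth of lsic_literal (0 or 1) when the literal becomes one byte longer."""
    return lsic_len(literal_len + 1) - lsic_len(literal_len)


def longest_match(seq: str) -> int:
    """Longest suffix of `seq` that also occurs starting at an earlier position.

    Overlap between the earlier occurrence and the suffix is allowed,
    which is what lets ABABAB yield 4.
    """
    n = len(seq)
    for length in range(n - 1, 0, -1):
        suffix = seq[n - length:]
        if any(seq[start:start + length] == suffix for start in range(n - length)):
            return length
    return 0


if __name__ == "__main__":
    for text in ("ABABAB", "ABCABCABCABC", "ABCDABC", "BABCDABC"):
        print(text, longest_match(text))
```

The brute-force search is quadratic or worse and is only meant to make the definition concrete; a production encoder would use a failure-function or hash-based search.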

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing For Digital Recording And Reproducing (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

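A data compression method and apparatus (1800). In the method, original data is compressed while the size of each output compressed block is constrained, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored on a disk, they are stored such that at least one complete compressed block is stored in one complete disk block, or such that one complete compressed block is stored in at least one complete disk block. In this way, when compressed blocks are subsequently read and decompressed, as much valid data as possible is read, and all data read from a compressed block can be decompressed successfully, which avoids the random read amplification that occurs with the fixed-input approach.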

Description

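Data compression method and apparatus
This application claims priority to Chinese Patent Application No. 201810542482.1, filed with the China Patent Office on May 30, 2018 and entitled "Data compression method and apparatus", which is incorporated herein by reference in its entirety.
Background
With the rapid development of information technology, the volume of data is growing explosively, which poses a new challenge to the capacity of storage devices. In response, lossless compression techniques have been proposed in the prior art: by optimizing how data is stored and using an algorithm to represent regular data compactly, data is compressed without affecting its original content, thereby reducing the capacity occupied when the data is stored.
At present, the most common approach in lossless data compression is as follows: the data to be compressed is input as fixed-size data blocks, and compression outputs compressed blocks of varying sizes. For example, referring to FIG. 1, when lossless data compression is applied to a file system, a file can be divided into data blocks 1 to n, each 4 KB in size; compressing the n 4 KB data blocks in turn yields compressed blocks 1 to n of varying sizes. Because the size of the compressed blocks output by the compression process is random, the storage locations of the compressed blocks in the storage space of the storage device are also irregular. For example, the storage space of a disk can be divided into a number of equally sized disk blocks. When compressed blocks are stored on the disk, the situation shown in FIG. 2 may arise: disk block A stores the complete compressed block 1 and the first half of compressed block 2; disk block B stores the second half of compressed block 2 and the first half of compressed block 3; disk block C stores the second half of compressed block 3 and the first half of compressed block 4; and so on. Because content is read from the disk in units of disk blocks, reading compressed block 3 in this situation requires reading everything stored in disk block B and disk block C. Moreover, the additionally read second half of compressed block 2 and first half of compressed block 4 are not complete compressed blocks and therefore cannot be decompressed, so the additionally read content is useless. This phenomenon is also called random read amplification. Consequently, when compression is implemented in the existing manner described above, a device that later reads compressed blocks is likely to read a large amount of useless content, which places an additional burden on the device.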
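Summary
This application provides a data compression method and apparatus to solve problems such as random read amplification caused by existing data compression approaches.
According to a first aspect, this application provides a data compression method. The method may be performed by any device capable of executing the data compression method provided in this application, for example, an image generation server, a personal computer, or a mobile terminal. In the method, after original data to be compressed is obtained, the original data may be input to a compression module for compression processing, and n compressed blocks containing the compressed data are output in sequence, where each compressed block has the same first capacity, and the first capacity represents the number of bytes of compressed data that a compressed block can contain. Further, the n compressed blocks are stored in a storage medium, where the storage medium includes m disk blocks, each disk block has the same second capacity, and the second capacity represents the number of bytes of data stored by a disk block.
In the method, compressed blocks with the same first capacity may be stored in the storage medium according to either of the following two designs:
In one possible design, the second capacity is p times the first capacity, and the n compressed blocks are stored in the storage medium such that p complete compressed blocks are stored in one complete disk block.
In another possible design, the first capacity is q times the second capacity, and the n compressed blocks are stored in the storage medium such that one complete compressed block is stored in q complete disk blocks.
Here n, m, p, and q are positive integers, p is less than or equal to n, and q is less than or equal to m.
In the above method, the original data is compressed while the size of each output compressed block is constrained, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored in the storage medium, at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block. As a result, when compressed blocks are subsequently read and decompressed, as much valid data as possible is read, and all data read from a compressed block can be decompressed successfully, which avoids the random read amplification that occurs with the fixed-input approach. In addition, by fixing the size of the output compressed blocks, it is easier to maintain a low compression ratio than with the existing fixed-input approach.
In a possible implementation, if the n compressed blocks include at least two compressed blocks whose compressed data is identical, then when the at least two compressed blocks are stored in the storage medium, they have the same storage location in the storage medium. In this way, two identical compressed blocks can be deduplicated to save storage space.
In a possible implementation, after the compressed blocks are stored in the storage medium, an index may also be created to make it easier to find and access the compressed block in which given original data resides.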
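Manner 1: the original data may be divided into i data blocks, each containing the same number of bytes as the first capacity, where the j-th of the i data blocks contains decompressed data from at most two compressed blocks. Further, a first index is established for the j-th data block, and the correspondence between the first index and the j-th data block is recorded. The first index is used to identify the storage location, in the storage medium, of the data contained in the j-th data block. Here i is a positive integer, and j is any positive integer less than or equal to i.
When the j-th data block contains decompressed data of the next compressed block, the first index corresponding to the j-th data block contains: a first flag bit or a second flag bit, the block number of the next compressed block, and the intra-block offset of the j-th data block. The first flag bit identifies that the data in the next compressed block is original data; the second flag bit identifies that the data in the next compressed block is compressed data; and the intra-block offset of the j-th data block is the position, within the j-th data block, of the head of the decompressed data of the next compressed block.
When the j-th data block contains only decompressed data of the current compressed block, the first index corresponding to the j-th data block contains: a third flag bit, which identifies that the data in the current compressed block is data obtained after compression processing; the block distance relative to a first data block located before the j-th data block, where the first data block contains decompressed data of the current compressed block, or contains decompressed data of both the current compressed block and the previous compressed block; the block distance relative to a second data block located after the j-th data block, where the second data block contains decompressed data of both the current compressed block and the next compressed block; and the intra-block offset of the first data block, which is the position, within the first data block, of the head of the decompressed data of the current compressed block.
Optionally, when the j-th data block contains decompressed data of the current compressed block, the first index corresponding to the j-th data block further contains the block number of the current compressed block.
Manner 2: an index may also be created per compressed block. Specifically, a second index is established for the k-th of the n compressed blocks, and the correspondence between the second index and the k-th compressed block is recorded, where the second index identifies the byte range of the original data that was input to the compression module to produce the k-th compressed block. k is a positive integer less than or equal to n.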
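A per-compressed-block index of the kind described in Manner 2 can be sketched as a sorted list of start offsets that is searched by bisection. This is only an illustration: the function names and the list-of-lengths input are assumptions, and the application does not specify how the second index is laid out.

```python
from bisect import bisect_right


def build_second_index(input_lengths):
    """Start offset, in the original data, of the byte range behind each compressed block."""
    starts, offset = [], 0
    for length in input_lengths:
        starts.append(offset)
        offset += length
    return starts


def find_compressed_block(starts, byte_offset):
    """Index k of the compressed block whose input byte range contains byte_offset."""
    return bisect_right(starts, byte_offset) - 1


# Three compressed blocks that consumed 5000, 4096 and 9000 bytes of original data.
starts = build_second_index([5000, 4096, 9000])
print(starts, find_compressed_block(starts, 5000))
```

Because each compressed block may cover a different amount of original data, a lookup by byte offset needs a search rather than a simple division, which is why the sketch uses `bisect_right`.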
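In a possible implementation, the process of inputting the original data to the compression module for compression processing and outputting, in sequence, n compressed blocks containing the compressed data is as follows.
The original data is input to the compression module byte by byte for compression processing, and the following processing is repeated until all bytes of the original data have been input to the compression module:
Each time it is determined that the number of bytes of the data compressed this time has reached the first capacity, determine whether the number of bytes s of the original data input to the compression module this time is greater than the first capacity. If so, include the data compressed this time in one compressed block and output it. If not, continue to input t bytes of the original data to the compression module, and include the s bytes and the t bytes of original data in one compressed block and output it. Here s and t are positive integers, and t is the difference between the first capacity and s.
In this way, the output compressed blocks have a fixed size, and the worst compression outcome is that original data is placed in the compressed block unchanged. Compared with the existing fixed-input approach, this makes it easier to keep the compression ratio low.
In a possible implementation, after the original data is input to the compression module byte by byte for compression processing, when the number of bytes of original data input to the compression module this time equals a preset value, if the number of bytes of the data compressed this time has still not reached the first capacity, the data compressed this time is included in one compressed block and output. The preset value is the maximum number of bytes of original data input to the compression module for a single round of compression processing. This effectively prevents too much redundant data from being packed into the same compressed block, which reduces the processing load when the compressed block is later read and decompressed.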
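According to a second aspect, this application provides a data compression apparatus. The data compression apparatus may be any device capable of executing the data compression method provided in this application, for example, an image generation server, a personal computer, or a mobile terminal. The data compression apparatus includes functions, modules, or means for performing the method involved in the first aspect or any implementation or design of the first aspect, and optionally may also include functions, modules, or means for performing the reading and decompression of compressed blocks. The functions, modules, units, or means may be implemented by software, by hardware, or by hardware executing corresponding software.
In a possible design, the data compression apparatus includes a transceiver module and a processing module, which may correspond to the method involved in the first aspect or any implementation or design of the first aspect; details are not repeated here.
In another possible design, the data compression apparatus includes a processor and may further include a communication interface. The communication interface is used to send and receive signals, and the processor executes program instructions to complete the method involved in the first aspect or any possible implementation or design of the first aspect. The data compression apparatus may further include one or more memories coupled to the processor, which store the computer program instructions and/or data necessary to implement the functions of the method involved in the first aspect or any possible implementation or design of the first aspect. The one or more memories may be integrated with the processor or provided separately from it; this application does not limit this. The processor may execute the computer program instructions stored in the memory to complete the method involved in the first aspect or any possible implementation or design of the first aspect.
According to a third aspect, this application provides a chip. The chip can communicate with a memory, or the chip includes a memory, and the chip executes program instructions stored in the memory to implement the corresponding functions involved in the first aspect or any possible implementation or design of the first aspect.
According to a fourth aspect, this application provides a computer storage medium that stores computer-readable instructions. When the computer-readable instructions are executed, the corresponding functions involved in the first aspect or any possible implementation or design of the first aspect are implemented.
According to a fifth aspect, this application further provides a computer program product containing a software program. When the program runs on a computer, the corresponding functions involved in the first aspect or any possible implementation or design of the first aspect are implemented.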
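Brief Description of Drawings
FIG. 1 is a schematic diagram of compression processing in the prior art;
FIG. 2 is a schematic diagram of how compressed blocks output by the existing compression approach are stored on a disk;
FIG. 3 is a schematic diagram of a compressed file system to which this application is applicable;
FIG. 4A is a schematic diagram of case 1 that arises when compressed blocks output by the existing compression approach are stored in a storage medium;
FIG. 4B is a schematic diagram of case 2 that arises when compressed blocks output by the existing compression approach are stored in a storage medium;
FIG. 5 is a schematic flowchart of the data compression method provided in an embodiment of this application;
FIG. 6A is a first schematic diagram of how compressed blocks with the same first capacity are stored on a disk in an embodiment of this application;
FIG. 6B is a second schematic diagram of how compressed blocks with the same first capacity are stored on a disk in an embodiment of this application;
FIG. 6C is a third schematic diagram of how compressed blocks with the same first capacity are stored on a disk in an embodiment of this application;
FIG. 7 is a schematic diagram of the compression processing flow in an embodiment of this application;
FIG. 8 is a first scenario diagram of the compression processing flow in a special case in an embodiment of this application;
FIG. 9A is a second scenario diagram of the compression processing flow in a special case in an embodiment of this application;
FIG. 9B is a scenario diagram of the improved compression processing flow in a special case in an embodiment of this application;
FIG. 10 is a schematic diagram of a case that cannot occur in the correspondence between compressed blocks and data blocks in an embodiment of this application;
FIG. 11 is a schematic diagram of the correspondence between data blocks and compressed blocks in an embodiment of this application;
FIG. 12 is a schematic diagram of the second index corresponding to a compressed block in an embodiment of this application;
FIG. 13 is a schematic diagram of the format of a sequence in the LZ4 compression algorithm in an embodiment of this application;
FIG. 14 is a first schematic diagram of the input original character sequence in an embodiment of this application;
FIG. 15 is a second schematic diagram of the input original character sequence in an embodiment of this application;
FIG. 16 is a third schematic diagram of the input original character sequence in an embodiment of this application;
FIG. 17 is a first schematic diagram of the data compression apparatus provided in an embodiment of this application;
FIG. 18 is a second schematic diagram of the data compression apparatus provided in an embodiment of this application.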
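Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, this application is described in further detail below with reference to the accompanying drawings.
First, the application scenarios to which this application is applicable are described.
This application is applicable to compressed file systems. For example, referring to FIG. 3, such a file system may consist of a metadata area and a data area, and the metadata area includes a superblock and an index node (inode) area. The superblock of the metadata area may contain the control information and data structures of the file system. The inode area of the metadata area may contain file description information such as file length and file type, where the file type is, for example, a regular file (regular inode), a directory file (directory inode), a soft link (symbol link inode), or a special file (special inode). The data stored in the data area may be data obtained by file-level compression based on lossless compression techniques. The compressed data in the data area is stored, in the physical storage space of the storage medium (for example, a disk or flash memory), as a set of disk blocks. The data of one file may be stored in consecutive disk blocks or interleaved across non-consecutive disk blocks. For example, in FIG. 3, disk blocks A1 to An store the data of one file, while disk blocks B1 to Bx+1 and disk blocks C1 and C2 may store the data of different files in an interleaved manner. It should be understood that introducing the concept of a disk block in this application does not mean that the storage medium is limited to a disk; a disk block denotes a small unit of physical storage space obtained by dividing the physical storage space of the storage medium.
At present, the compression algorithms used by compressed file systems generally compress a fixed number of bytes of original data at a time. Because data differs in content, type, and other attributes, the actual compression ratio also varies, so the size of the data obtained by compressing a fixed number of bytes of original data is not fixed. The compression ratio is the ratio of the size of the compressed data to the size of the original data before compression. For example, if 100 MB of original data is compressed to 90 MB, the compression ratio is 90/100 × 100% = 90%. On this basis, the compressed data in the data area can logically be regarded as consisting of a number of compressed blocks of varying sizes. When these compressed blocks are stored in disk blocks in the physical storage space, the variable block size causes the compressed blocks to be laid out irregularly on the disk blocks. For example, the two cases shown in FIG. 4A and FIG. 4B may occur:
Case 1: when the fixed number of bytes of original data compressed at a time is small, few bytes are input each time, so each input contains few regular bytes. Therefore, although the compressed blocks obtained are small, the compression ratio is actually high. For example, for the string abcdabcdefabcdabcdef..., if 4 characters of original data are input each time, abcd contains no regular data and cannot be compressed, so the output compressed block is also 4 characters. Although the compressed block is small, the compression ratio is actually high.
In this case, when the compressed blocks are stored on the storage medium, the storage layout shown in FIG. 4A is likely to occur: disk block A stores the complete compressed block 1 and the first half of compressed block 2; disk block B stores the second half of compressed block 2 and the first half of compressed block 3; disk block C stores the second half of compressed block 3 and the first half of compressed block 4; ...; disk block X stores the second half of compressed block n-1 and the complete compressed block n; and so on.
With the storage layout of case 1, random read amplification, also called input/output (IO) amplification, occurs easily. As an example of random read amplification: because content is read from the storage medium in units of disk blocks, reading compressed block 3 requires reading everything stored in disk block B and disk block C, so some redundant data is read in addition. Furthermore, the additionally read redundant data does not form complete compressed blocks and therefore cannot be decompressed; to decompress it successfully, the content stored in the disk block before disk block B and in the disk block after disk block C would also have to be read, which places an additional burden on the device.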
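The read amplification in case 1 can be made concrete with a small calculation. The sketch below is illustrative only: the compressed-block sizes are invented so that the layout matches FIG. 4A as described (blocks straddling disk-block boundaries), and `disk_blocks_touched` is a hypothetical helper name.

```python
def disk_blocks_touched(sizes, index, disk_block_size):
    """Disk blocks that must be read to fetch compressed block `index` (0-based),

    assuming the compressed blocks are stored back to back from offset 0.
    """
    start = sum(sizes[:index])
    end = start + sizes[index]                 # exclusive end offset
    first = start // disk_block_size
    last = (end - 1) // disk_block_size
    return list(range(first, last + 1))


DISK_BLOCK = 4096
sizes = [3000, 2500, 3200, 4000]               # variable-size compressed blocks 1..4

touched = disk_blocks_touched(sizes, 2, DISK_BLOCK)   # compressed block 3
bytes_read = len(touched) * DISK_BLOCK
print(touched, bytes_read, bytes_read / sizes[2])
```

With these sizes, compressed block 3 spans the second and third disk blocks (B and C), so 8192 bytes are read to obtain 3200 useful bytes.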
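Case 2: when the fixed number of bytes of original data compressed at a time is large, the compression ratio can be lower than in case 1, but the compressed blocks obtained are large. When the compressed blocks are stored on the storage medium, the storage layout shown in FIG. 4B is likely to occur: compressed block i is stored in disk block A through disk block X, where disk block A is not aligned with the front of compressed block i and disk block X is not aligned with the end of compressed block i. That is, the front of disk block A also stores content of the previous compressed block i-1, and the end of disk block X also stores content of the next compressed block i+1.
With the storage layout of case 2, random read amplification may also occur. Because content is read from the storage medium in units of disk blocks, reading compressed block i requires reading everything stored in disk block A through disk block X, so some redundant data is read in addition. Furthermore, the additionally read redundant data does not form complete compressed blocks and therefore cannot be decompressed; to decompress it successfully, the content stored in the disk block before disk block A and in the disk block after disk block X would also have to be read, which places an additional burden on the device. In addition, when compressed block i is decompressed, decompressing the later data in compressed block i depends on the earlier data of the block (which begins in disk block A). To decompress the later data correctly, all of that earlier data must be kept available, which may cause high memory usage during decompression.
As these two cases show, with the fixed-input approach, random read amplification may occur whether the fixed number of input bytes is large or small: some redundant data is read in addition, and the additionally read data cannot be decompressed. To solve the problems that may arise in case 1 and case 2, this application proposes a data compression method and apparatus that control the size of the compressed blocks produced by compression so that the compressed block size is fixed, thereby effectively avoiding the problems that may arise with the fixed-input approach.
The technical solutions provided in this application are described in detail below with reference to specific embodiments.
FIG. 5 is a schematic flowchart of the data compression method provided in an embodiment of this application, which includes the following steps:
Step 501: obtain the original data to be compressed.
In this embodiment of this application, the data compression flow may be performed by any device capable of executing the data compression method provided in this application, for example, an image generation server, a personal computer, or a mobile terminal. For example, the original data to be compressed may be the source code and resource files of an operating system. The data may be obtained in many ways, for example, by copying or downloading it from another server or storage device; this application does not limit this.
Step 502: input the original data to the compression module for compression processing, and output, in sequence, n compressed blocks containing the compressed data, where each compressed block has the same first capacity.
In this embodiment of this application, a compression algorithm is configured in the compression module, and the data obtained by compressing the original data can be output as fixed-size compressed blocks. The number of compressed blocks is n, where n is a positive integer. A fixed-size compressed block means that the capacity of the compressed block is fixed. For ease of distinction, the capacity of a compressed block is called the first capacity, that is, the number of bytes of compressed data that a compressed block can contain.
Step 503: store the n compressed blocks in m disk blocks of the storage medium.
The storage medium includes m disk blocks, where m is a positive integer, and every disk block has the same size, that is, the same capacity. For ease of distinction, the capacity of a disk block is called the second capacity, that is, the number of bytes of data stored by a disk block. For ease of description, the following uses a disk as an example of the storage medium, but the storage medium in the embodiments of this application is not limited to a disk and may be flash memory or any other storage medium that supports writing and reading.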
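In this embodiment of this application, compressed blocks with the same first capacity can be stored on the disk in two ways:
Case 1: when the second capacity is p times the first capacity, one complete disk block can store p complete compressed blocks, where p is a positive integer and p is less than or equal to n.
For example, referring to FIG. 6A, assume n = 10, m = 5, and p = 2. Compressed blocks 1 to 10 can then be stored in disk blocks A to E, and each complete disk block stores two complete compressed blocks. In this case, to read the content of compressed block 3, the complete content of compressed block 3 can be read from disk block B, so compressed block 3 can be decompressed successfully. In addition, the complete content of compressed block 4 can also be read from disk block B, so compressed block 4 can also be decompressed successfully, and the decompressed data can be put to use afterward.
Case 2: when the first capacity is q times the second capacity, q complete disk blocks can store one complete compressed block, where q is a positive integer and q is less than or equal to m.
For example, referring to FIG. 6B, assume n = 2, m = 10, and q = 5. Compressed block 1 can then be stored in disk blocks A to E, and compressed block 2 in disk blocks F to J. In this case, the content of compressed block 1 can be read from disk blocks A to E, and the content of compressed block 2 from disk blocks F to J. Because every compressed block read is complete, all of them can be decompressed successfully, and the decompressed data can be put to use afterward.
For example, there is a storage layout to which both case 1 and case 2 apply: referring to FIG. 6C, assume n = 5, m = 5, and p = 1 (or q = 1). Compressed blocks 1 to 5 can then be stored in disk blocks A to E respectively. In this case, disk blocks and compressed blocks correspond one to one, so whichever compressed block is needed, the corresponding disk block can be read directly; no redundant data is read from the disk block, and the data read can be decompressed successfully.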
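The three aligned layouts in FIG. 6A to FIG. 6C can be checked with one mapping from a compressed-block number to the disk blocks that hold it. This is a sketch under the assumption that compressed blocks are stored back to back starting at disk block 0; the function name and the 0-based numbering are illustrative choices, not part of the application.

```python
def disk_blocks_for(block_index, first_capacity, second_capacity):
    """Disk blocks (0-based) holding fixed-size compressed block `block_index` (0-based)."""
    start = block_index * first_capacity
    end = start + first_capacity               # exclusive end offset
    return list(range(start // second_capacity, (end - 1) // second_capacity + 1))


# FIG. 6A: p = 2, two 2 KB compressed blocks per 4 KB disk block.
print(disk_blocks_for(2, 2048, 4096))          # compressed block 3 -> disk block B
# FIG. 6B: q = 5, one 20 KB compressed block per five 4 KB disk blocks.
print(disk_blocks_for(1, 5 * 4096, 4096))      # compressed block 2 -> disk blocks F..J
# FIG. 6C: p = q = 1, one compressed block per disk block.
print(disk_blocks_for(4, 4096, 4096))          # compressed block 5 -> disk block E
```

Because one capacity is an integer multiple of the other, a compressed block never shares a disk block with a fragment of another compressed block, so every byte read belongs to a complete compressed block.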
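With the approach provided in this application, the original data is compressed while the size of each output compressed block is constrained, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored on the disk, at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block. As the analysis of the above examples shows, with this storage layout, when compressed blocks are subsequently read and decompressed, as much valid data as possible is read, and all data read from a compressed block can be decompressed successfully, which avoids the random read amplification that occurs with the fixed-input approach.
Moreover, in this embodiment of this application, fixing the size of the output compressed blocks makes it easier to achieve a low compression ratio than with the existing fixed-input approach. For example, assume the output compressed block is 4 KB. The original data input to the compression module is then at least 4 KB, so in the worst case 4 KB of original data is simply placed in the compressed block. In the existing approach, if the fixed amount of input original data is small, the input is likely to contain few regular bytes and is hard to compress. For example, 3 KB of original data may be input and a 4 KB compressed block output after compression, resulting in a high compression ratio. If the fixed amount of input original data is large, a lower compression ratio can be achieved, but because the output compressed blocks are relatively large, random read amplification occurs easily later. The comparison shows that the approach provided in this application both achieves a low compression ratio and avoids random read amplification.
In addition, to avoid high memory usage during decompression, the first capacity of the compressed blocks can be kept as small as possible in this embodiment of this application. Then, when a compressed block is read from disk blocks and decompressed, even if decompressing the second half of the compressed block requires the data of the first half to be kept available, the first capacity of the compressed block is not large, so the first-half data held in memory is small, which keeps memory usage as low as possible.
Furthermore, to save storage space, if the n output compressed blocks include at least two compressed blocks whose compressed data is identical, then because the at least two compressed blocks have the same content, they are stored at the same storage location on the disk. That is, the at least two compressed blocks share one storage location, which deduplicates the compressed blocks.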
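Sharing one storage location among identical compressed blocks can be sketched with a content-hash lookup. The use of SHA-256 and the names below are assumptions made for illustration; the application does not say how identical blocks are detected.

```python
import hashlib


def store_with_dedup(blocks):
    """Return (stored, locations): identical payloads share a single storage slot."""
    slot_by_digest = {}
    stored, locations = [], []
    for payload in blocks:
        digest = hashlib.sha256(payload).digest()
        if digest not in slot_by_digest:
            # First time this content is seen: allocate a new slot for it.
            slot_by_digest[digest] = len(stored)
            stored.append(payload)
        locations.append(slot_by_digest[digest])
    return stored, locations


stored, locations = store_with_dedup([b"A" * 8, b"B" * 8, b"A" * 8])
print(locations, len(stored))
```

Fixed-size compressed blocks make this kind of block-level deduplication straightforward, because every candidate for sharing occupies the same amount of space.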
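The following describes the process of inputting the original data to the compression module for compression processing and outputting n compressed blocks.
In one implementation, the original data can be input to the compression module byte by byte for compression processing. During compression, the processing flow shown in FIG. 7 can be repeated until all bytes of the original data have been input to the compression module:
Step 701: determine that the number of bytes of the data compressed this time has reached the first capacity.
Step 702: determine whether the number of bytes s of the original data input to the compression module this time is greater than the first capacity.
If so, perform step 703; if not, perform step 704.
The determination in step 702 mainly checks whether compressing the original data input this time yields a compression gain. When the number of bytes of original data input to the compression module this time is greater than the number of bytes of the data compressed this time, there is a compression gain; when it is less than or equal to the number of bytes of the data compressed this time, there is no compression gain. If there is a compression gain, step 703 is performed; otherwise, step 704 is performed.
Step 703: include the data compressed this time in one compressed block and output it.
When there is a compression gain, the data contained in the output compressed block is the data compressed this time.
Step 704: continue to input t bytes of original data to the compression module, and include the s bytes and the t bytes of original data in one compressed block and output it. Here s and t are positive integers, and t is the difference between the first capacity and s.
When there is no compression gain, the data contained in the output compressed block is the original data input to the compression module this time, and the number of bytes of original data input this time is (s + t).
Each time a compressed block is output, the count of compressed bytes and the count of original bytes input to the compression module can be restarted. For example, one counter can be configured to count the compressed bytes and another to count the original bytes input to the compression module, and both counters are reset to zero when a compressed block is output. This design makes it easy to count, for each round of compression, the compressed bytes and the input original bytes. Of course, other counting methods can also be used in practice; this application does not limit this.
A simple example illustrates the above flow. Assume the first capacity of a compressed block is 4 KB, that is, a compressed block can contain at most 4 KB. When the data compressed this time reaches 4 KB, if 5 KB of original data has been input to the compression module this time, the compression ratio of this round is 80%, so there is a compression gain; in this case, the 4 KB of compressed data is included in one compressed block and output. If 3 KB of original data has been input to the compression module this time, the compression ratio of this round is 133%, so there is no compression gain; in this case, the 3 KB of original data input this time is placed in the compressed block, a further 1 KB of original data is input to the compression module, and the 4 KB of original data is placed directly in the compressed block by the compression module without compression. The compressed block is then output, containing 4 KB of original data.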
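The fixed-output flow of FIG. 7 can be approximated with a short Python sketch. Several things here are assumptions rather than the application's method: zlib stands in for the compression module (the application discusses LZ4), the input window grows in `step`-byte increments and is recompressed from scratch instead of being streamed byte by byte, and a payload shorter than the capacity is returned unpadded. The raw fallback mirrors step 704: when no window yields a gain, up to `capacity` bytes of original data are emitted as the block.

```python
import zlib


def compress_fixed_output(data: bytes, capacity: int, step: int = 16):
    """Split `data` into blocks whose payloads never exceed `capacity` bytes.

    Returns a list of (is_compressed, payload) tuples.
    """
    blocks, pos = [], 0
    while pos < len(data):
        end = min(pos + capacity, len(data))
        # Fallback with no compression gain: store up to `capacity` raw bytes.
        best_end, payload, is_compressed = end, data[pos:end], False
        while True:
            packed = zlib.compress(data[pos:end])
            if len(packed) > capacity:
                break                          # output would overflow the block
            if len(packed) < end - pos:        # this window has a compression gain
                best_end, payload, is_compressed = end, packed, True
            if end == len(data):
                break
            end = min(end + step, len(data))
        blocks.append((is_compressed, payload))
        pos = best_end
    return blocks


def decompress_blocks(blocks) -> bytes:
    """Rebuild the original data from (is_compressed, payload) tuples."""
    return b"".join(zlib.decompress(p) if c else p for c, p in blocks)


sample = bytes(range(256)) + b"abcd" * 500
blocks = compress_fixed_output(sample, capacity=64)
print(len(blocks), decompress_blocks(blocks) == sample)
```

The `is_compressed` flag plays the role of the flag bits described for the index: a reader needs to know whether a block holds original or compressed data before decoding it.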
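In a specific implementation, after the original data is input to the compression module byte by byte for compression processing, several special cases may arise in addition to the processing flow shown in FIG. 7:
Case 1: when the remaining original data not yet input to the compression module is less than or equal to the first capacity, the last compressed block is handled as follows. If the number of bytes of the remaining data after compression is less than or equal to the number of bytes of the remaining data, the compressed remaining data is included in the last compressed block and output. If the number of bytes of the remaining data after compression is greater than the number of bytes of the remaining data, the remaining data itself is included in the last compressed block and output.
For example, assume the first capacity is 4 KB and the remaining original data not yet input to the compression module is 3 KB. If the remaining 3 KB of original data compresses to 5 KB, the last compressed block should contain the remaining 3 KB of original data. If the remaining 3 KB of original data compresses to 3.5 KB, the last compressed block can likewise contain the remaining 3 KB of original data. If the remaining 3 KB of original data compresses to 2 KB, the last compressed block can contain the 2 KB of compressed data.
Case 2: when the number of bytes of original data input to the compression module this time equals a preset value, if the number of bytes of the data compressed this time has still not reached the first capacity, the data compressed this time is included in one compressed block and output.
In case 2, an upper limit is set on the number of bytes of original data input to the compression module for each round of compression. The upper limit may be a preset value that represents the maximum number of bytes of original data input to the compression module in the current round. When the number of input bytes equals the preset value, no more original data is input; instead, the data compressed in this round is included in one compressed block and output. For example, referring to FIG. 8, the original data can be logically divided into data blocks. In this embodiment of this application, the number of bytes in each data block may equal the first capacity, that is, data blocks and compressed blocks are the same size. The original data from data block s+1 through the first half of data block k is input to the compression module in turn, and the compressed data is placed in compressed block c+2. If the number of input bytes from data block s+1 through that part of data block k has reached the preset value but the compressed data filled into compressed block c+2 has not yet reached the first capacity, compressed block c+2 can be output directly, and its remaining capacity (the shaded part in FIG. 8) is not filled with other data.
Case 3: when the compressed block output in the current round reaches the first capacity, if the number of bytes of original data input from the last data block corresponding to this compressed block is less than a set threshold, the original data input to the compression module this time can be compressed again. The original data other than that input from the last data block is compressed and still placed in the current compressed block and output, while the original data input from the last data block is compressed and placed in the next compressed block and output.
For example, referring to FIG. 9A, the original data can be logically divided into data blocks. Starting from the second half of data block s+1 (dashed line 1 in FIG. 9A), the compressed original data is placed in compressed block c+2. When the compressed data placed in compressed block c+2 has reached the first capacity, the input original data has reached the first half of data block s+3 (dashed line 2 in FIG. 9A). Data block s+3 is the last data block corresponding to the currently output compressed block c+2, and the original data from data block s+3 that has been input to the compression module (the shaded part in FIG. 9A) is less than the set threshold. In this case, the compressed data previously placed in compressed block c+2 can be discarded, and the original data is compressed again starting from the second half of data block s+1. Further, referring to FIG. 9B, the original data from the second half of data block s+1 through data block s+2 is compressed and placed in compressed block c+2. Compressed block c+2 then still has remaining capacity (the shaded part in FIG. 9B), which is left unused for now; the next compressed block is generated directly, and the original data from data block s+3 onward is compressed and placed in compressed block c+3. Of course, in practice, when the situation of case 3 occurs, the handling is not limited to recompressing the original data; other approaches can also be used, and this application does not limit this.
The design described in case 3 keeps the differential packages in the file system as small as possible: when the original data is modified, the effect on the storage layout of the compressed blocks currently stored on the disk is minimized.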
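In this embodiment of this application, after the n compressed blocks are stored on the disk, an index can be created for the compressed file system to make it easier to find and access the compressed block in which given original data resides. The index can be created per data block or per compressed block, and the created index can be placed either in the metadata area or in the data area of the compressed file system. The ways of creating the index are described below.
Manner 1: creating the index per data block
Because the original data can be logically divided into data blocks, the original data can be divided into i data blocks in this embodiment of this application, where each data block contains the same number of bytes as the first capacity. For example, if the first capacity is 4 KB, 64 KB of original data can be divided into sixteen 4 KB data blocks, each containing 4 KB.
For the j-th of the i data blocks, a first index can be established, and the correspondence between the first index and the j-th data block is recorded. The first index identifies the storage location on the disk of the data contained in the j-th data block. Here i is a positive integer, and j is any positive integer less than or equal to i.
As the compression flow shown in FIG. 7 indicates, the data placed in a compressed block is of only two kinds: compressed data, placed when there is a compression gain, and original data, placed when there is no compression gain. Given this property, the correspondence between compressed blocks and data blocks cannot look like the situation shown in FIG. 10. In FIG. 10, data block s contains decompressed data from compressed blocks c through c+2, and compressed block c+1 corresponds to the shaded part of data block s. The shaded data is clearly smaller than the data placed in compressed block c+1, which would mean that the data actually placed in compressed block c+1 is neither compressed data with a compression gain nor original data, so this situation does not occur in the embodiments of this application. Therefore, in this embodiment of this application, the j-th data block contains decompressed data from at most two compressed blocks.
In a first example of this application, when the j-th data block contains decompressed data of the next compressed block, the first index corresponding to the j-th data block contains:
a first flag bit or a second flag bit; the block number of the next compressed block; and the intra-block offset of the j-th data block.
The first flag bit identifies that the data in the next compressed block is original data, the second flag bit identifies that the data in the next compressed block is compressed data, and the intra-block offset of the j-th data block is the position, within the j-th data block, of the head of the decompressed data of the next compressed block. As for block numbers, the compressed blocks stored on the disk can be numbered according to the actual division of the disk's physical storage space, so as to mark where each compressed block is stored on the disk.
Refer to the schematic diagram in FIG. 11 of the relationship between data blocks and compressed blocks. From the perspective of the data stream, data blocks such as data block s, data block s+1, data block s+2, data block s+3, and data block s+6 each contain the head of the decompressed data of a newly generated compressed block. In the example shown in FIG. 11, the head of the decompressed data of the newly generated compressed block immediately follows the tail of the decompressed data of the current compressed block. In practice, the head of the decompressed data of a newly generated compressed block need not immediately follow the tail of the decompressed data of the current compressed block; this application does not limit this.
For example, the start of data block s corresponds to the start of compressed block c. Relative to compressed block c-1, compressed block c can be regarded as the newly generated next compressed block, and data block s contains the head of the decompressed data of compressed block c. As another example, the start of data block s+1 corresponds to the start of compressed block c+1. Relative to compressed block c, compressed block c+1 can also be regarded as the newly generated next compressed block, and data block s+1 contains the head of the decompressed data of compressed block c+1. As a further example, the position marked by the dashed line in data block s+3 corresponds to the start of compressed block c+3. Relative to compressed block c+2, compressed block c+3 can also be regarded as the newly generated next compressed block, and data block s+3 contains the head of the decompressed data of compressed block c+3.
Data blocks of this kind can be called head blocks. A head block can be defined as a data block that contains the head of the decompressed data of the next compressed block. For head blocks, the first index given in the first example above can be used. Head blocks can further be divided into non-compressed mode and compressed mode. The content of the first index is described in detail below for these two modes.
Case 1: when there is no compression gain, the data in the next compressed block contained in the data block is original data; that is, the data block is a head block in non-compressed mode.
For example, for data blocks such as data block s and data block s+2 shown in FIG. 11, compressing the data yields no compression gain, so the data in compressed block c contained in data block s is original data, and the data in compressed block c+2 contained in data block s+2 is original data. Data block s and data block s+2 can therefore be regarded as head blocks in non-compressed mode.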
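For example, the first index corresponding to data block s is as shown in Table 1:
Table 1 (provided as an image in the source: Figure PCTCN2019083589-appb-000001)
Because the start of data block s corresponds to the head of the decompressed data of compressed block c, the intra-block offset of data block s is zero. The first flag bit identifies that the data in compressed block c is original data, which also identifies data block s as a head block in non-compressed mode.
For example, the first index corresponding to data block s+2 is as shown in Table 2:
Table 2 (provided as an image in the source: Figure PCTCN2019083589-appb-000002)
The position marked by the dashed line in data block s+2 corresponds to the head of the decompressed data of compressed block c+2, so the intra-block offset of data block s+2 is nonzero. After the original data following the dashed-line position in data block s+2 is input to the compression module, the output is still original data, and the output original data is placed in compressed block c+2. The first flag bit identifies that the data in compressed block c+2 is original data, which also identifies data block s+2 as a head block in non-compressed mode.
Case 2: when there is a compression gain, the data in the next compressed block contained in the data block is compressed data; that is, the data block is a head block in compressed mode.
For example, for data blocks such as data block s+1 and data block s+3 shown in FIG. 11, compressing the data yields a compression gain, so the data in compressed block c+1 contained in data block s+1 is compressed data, and the data in compressed block c+3 contained in data block s+3 is also compressed data. Given this, data block s+1 and data block s+3 can be regarded as head blocks in compressed mode.
For example, the first index corresponding to data block s+1 is as shown in Table 3:
Table 3 (provided as an image in the source: Figure PCTCN2019083589-appb-000003)
Because the start of data block s+1 corresponds to the start of the decompressed data of compressed block c+1, the intra-block offset of data block s+1 is zero. The second flag bit identifies that the data in compressed block c+1 is compressed data, which also identifies data block s+1 as a head block in compressed mode.
For example, the first index corresponding to data block s+3 is as shown in Table 4:
Table 4 (provided as an image in the source: Figure PCTCN2019083589-appb-000004)
The position marked by the dashed line in data block s+3 corresponds to the head of the decompressed data of compressed block c+3, so the intra-block offset of data block s+3 is nonzero. After the original data following the dashed-line position in data block s+3 is input to the compression module, the compression module outputs compressed data, and the compressed data is placed in compressed block c+3. The second flag bit identifies that the data in compressed block c+3 is compressed data, which also identifies data block s+3 as a head block in compressed mode.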
本申请的第二示例中,当第j个数据块仅包含当前压缩块中经解压处理后的数据时,第j个数据块对应的第一索引中包含的内容为:
第三标识位;相对位于第j个数据块之前的第一数据块的块距离;相对位于第j个数据块之后的第二数据块的块距离;第一数据块的块内偏移。
其中,第三标识位用于标识当前压缩块中的数据为经压缩处理后得到的数据。第一数据块中包含当前压缩块中经解压处理后的数据,或者,第一数据块中包含当前压缩块中经解压处理后的数据、以及上一个压缩块中经解压处理后的数据。第一数据块的块内偏移为所述当前压缩块经解压处理后的数据的首部在第一数据块中的位置。第二数据块中包含当前压缩块中经解压处理后的数据、以及下一个压缩块中经解压处理后的数据。实际上,第一数据块和第二数据块也可以理解成是上述第一示例中所述的首块。
继续参照图11所示的数据块与压缩块之间的关系的示意图。对于数据块s+4、数据块s+5这类数据块,从数据流的角度上看,这类数据块中仅包含着当前压缩块中经解压处理后的数据,而没有包含下一个新生成的压缩块中经解压处理后的数据的首部。比如,数据块s+4、数据块s+5均包含着当前的压缩块c+3中经解压处理后的数据,而没有包括下一个新的压缩块c+4中经解压处理后的数据的首部。
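Data blocks of this kind are called non-head blocks: a non-head block contains decompressed data of the current compressed block but not the head of the decompressed data of the next compressed block. A non-head block holds only part of the decompressed data of the current compressed block. The rest is held by the first head block before it, by any adjacent non-head blocks after it, and by the first head block after it. Since that compressed block therefore decompresses to more data than one data block can hold, a non-head block is always in compressed mode. Non-head blocks can use the first index given in the second example above; its content is described in detail below.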
示例性的,数据块s+4对应的第一索引例如为表5所示:
表5
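Intra-block offset of the first data block | Third flag bit
Block distance to the first data block | Block distance to the second data block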
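Here, the intra-block offset of the first data block is the position of the dashed line in data block s+3 of FIG. 11, which corresponds to the head of the decompressed data of compressed block c+3. The block distance to the first data block is the distance to data block s+3, and the block distance to the second data block is the distance to data block s+6. Data block s+3 contains decompressed data of compressed block c+2 as well as of compressed block c+3, so it is the head block before data block s+4. Data block s+6 contains decompressed data of compressed block c+3 as well as of compressed block c+4, so it is the head block after data block s+4. In one approach the block distance is expressed in data blocks: the distance to data block s+3 is 1 data block and the distance to data block s+6 is 2 data blocks. In another approach it is expressed in bytes: with 4KB data blocks, the distance to data block s+3 is 4KB and the distance to data block s+6 is 8KB. The third flag bit indicates that the data in compressed block c+3 is compressed data, i.e. that data block s+4 is a non-head block in compressed mode. These representations are only examples; others may be used in practice, for instance taking the first approach and subtracting one from every distance to the first data block and adding one to every distance to the second data block.
The first index of data block s+5 follows Table 5 in the same way and is not described again.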
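Optionally, in a third example of this application, the first index of the second example is extended so that the first index of the j-th data block also contains the block number of the current compressed block and a fourth flag bit. The fourth flag bit indicates that the data in the current compressed block is compressed data, i.e. that the j-th data block is in compressed mode. Whether the j-th data block is a head block is then determined from the block distance to the first data block: a distance of zero means the j-th data block is a head block, and a non-zero distance means it is a non-head block.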
示例性的,继续列举数据块s+4对应的第一索引,在第三示例中数据块s+4对应的第一索引例如为表6所示:
表6
Figure PCTCN2019083589-appb-000005
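In this example, the block distance to the first data block can be split into two parts, one holding the high x bits and the other the low y bits. For example, if the distance occupies 8 bits, the high x bits can be the upper 4 bits and the low y bits the lower 4 bits. This is only an illustration. The values of x and y can be configured as needed, provided the total number of bits occupied by the block distance to the first data block, the intra-block offset of the first data block, and the block distance to the second data block stays unchanged. In addition, the block number of compressed block c+3 is added to the first index, so that the location of the compressed block can be determined directly from the first index of data block s+4 without looking up the first index of data block s+3, which makes lookups simpler and faster. The representation of the third example can also be used for case 2 of the first example and for the second example.
Note that in the second and third examples the block distance to the second data block mainly serves to determine which data block the tail of the decompressed original data of the compressed block falls in. In practice it is optional: the first index may or may not contain it.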
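The index contents of the first to third examples can be pictured as two small record types. The sketch below is illustrative only: the class names, field names, and the use of Python objects instead of packed bit fields are assumptions, and the bit widths are not specified here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HeadIndex:
    """First index of a head block (first example). Names are hypothetical."""
    compressed: bool   # False = first flag bit (original data), True = second flag bit
    cblock_no: int     # block number of the next compressed block
    offset: int        # where that block's decompressed data starts in this data block


@dataclass
class NonHeadIndex:
    """First index of a non-head block (second and third examples)."""
    dist_prev: int                    # block distance to the first data block
    prev_offset: int                  # intra-block offset of the first data block
    dist_next: Optional[int] = None   # block distance to the second data block (optional)
    cblock_no: Optional[int] = None   # third example only: current compressed block number


def is_head(entry) -> bool:
    """A head block carries the start of a new compressed block."""
    return isinstance(entry, HeadIndex)


# Data block s+4 as in Table 5, with distances in data blocks and an
# assumed 1KB intra-block offset for data block s+3.
s_plus_4 = NonHeadIndex(dist_prev=1, prev_offset=1024, dist_next=2)
```

Leaving `dist_next` and `cblock_no` optional mirrors the text: the distance to the second data block may be omitted, and the compressed block number appears only in the third example.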
在为数据块建立起如上述第一示例至第三示例所示的第一索引后,若后续需要在压缩文件系统中读取出指定字节范围内的原始数据时,可以首先确定出指定字节范围对应的数据块,进而查找该数据块对应的第一索引,利用第一索引查找到需要读取出的压缩块的位置。
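For example, still referring to FIG. 11, assume the original data is 128KB and is divided into 32 data blocks of 4KB each. The first data block holds the 1st through 4th KB of the original data, and so on: data block s holds the 17th through 20th KB, data block s+1 the 21st through 24th KB, data block s+2 the 25th through 28th KB, data block s+3 the 29th through 32nd KB, data block s+4 the 33rd through 36th KB, and so forth. Several lookup scenarios are described below.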
场景一、若需要查找第17KB至第20KB的原始数据,则可以查找到数据块s对应的第一索引(例如上述表1所示)。通过第一索引中包含的压缩块c的块号,可以确定出数据块s对应着压缩块c中的数据,且可以确定出压缩块c在磁盘中的存储位置,进一步,数据块s的块内偏移为零,则说明从压缩块c首部起,该压缩块c中的所有数据均与数据块s相对应,由于压缩块c中的数据为原始数据、且数据块s的块内偏移为0,所以可以直接读取出压缩块c中的数据,即可以得到第17KB至20KB的原始数据。
场景二、若需要查找第29KB至第32KB的原始数据,则可以查找到数据块s+3对应的第一索引(例如上述表4所示)。通过第一索引中包含的压缩块c+3的块号,可以确定出数据块s+3对应着压缩块c+3中的数据,且可以确定出压缩块c+3在磁盘中的存储位置,进一步,根据数据块s+3的块内偏移,可以确定出数据块s+3中的虚线所在位置之后的部分对应着压缩块c+3中的数据,数据块s+3中的虚线所在位置之前的部分对应着上一个压缩块c+2中的数据。
这种情况下,需要查找位于数据块s+3之前的第一个首块对应的第一索引,本场景中位于数据块s+3之前的数据块s+2即为首块,所以可以直接查找数据块s+2对应的第一索引,数据块s+2对应的第一索引中记录着压缩块c+2的块号、以及数据块s+2的块内偏移,由压缩块c+2的块号,可以确定出压缩块c+2的存储位置,由数据块s+2的块内偏移,可以确定出数据块s+3中虚线所在位置之前的部分对应着压缩块c+2中的数据。进一步,读取出压缩块c+2,因压缩块c+2中的数据为原始数据、且数据块s+2的块内偏移不为零,所以可以先读取出压缩块c+2中的数据,进而可以通过拷贝得到数据块s+3中虚线所在位置之前的部分对应着压缩块c+2中的数据。需要理解的是,这里拷贝过程也可以看做是一种特殊的解压过程,从压缩块中经拷贝原始数据也可以理解为是经解压处理后的数据。读取压缩块c+3中的数据,因压缩块c+3中的数据为经压缩处理后的数据,所以可以对压缩块c+3 中的数据进行解压处理,从压缩块c+3经解压处理后的数据中可以得到与数据块s+3中的虚线所在位置之后的部分对应的压缩块c+3中的数据,得到的数据即为第29KB至第32KB的原始数据。
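Of course, in practice, if data block s+2 preceding data block s+3 is not a head block, the first head block before data block s+2 can be found from the block distance to the first data block recorded in the first index of data block s+2. Once that head block is found, the compressed block is read and decompressed following the procedure above, and the required original data is taken from the decompressed data using the intra-block offset of the first data block and the block distance to the first data block recorded in the first index of data block s+2.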
场景三、若需要查找第33KB至第36KB的原始数据,则可以查找到数据块s+4对应的第一索引。
A、当数据块s+4对应的第一索引如上述表5所示时,通过相对第一数据块的块距离,可以查找到第一数据块为数据块s+3,进一步查找数据块s+3对应的第一索引(例如上述表4所示),通过压缩块c+3的块号,可以确定出数据块s+4对应着压缩块c+3中的数据、且可以确定出压缩块c+3在磁盘中的存储位置。对压缩块c+3进行解压处理,根据数据块s+3的块内偏移,这里假设本场景下数据块s+3的块内偏移取1KB,那么,可以确定出压缩块c+3经解压处理后的原始数据中前3KB的原始数据为数据块s+3中虚线所示位置之后包含的解压处理后的数据,压缩块c+3经解压处理后的原始数据中第4KB至第7KB这4KB的原始数据,即为数据块s+4包含的解压处理后的数据,由此可以得到需要查找的原始数据。这里,需要说明的是,对压缩块进行解压处理,既可以解压出压缩块中全部的数据,也可以在解压过程中判断是否查找到所需的原始数据,当查找到所需的原始数据后中止解压过程。
B、当数据块s+4对应的第一索引如上述表6所示时,由于第一索引中包含有压缩块c+3的块号,所以可以无需再去查找数据块s+3对应的第一索引,便可以获知压缩块c+3的存储位置,后续可以参照上述A示出的方式,对压缩块c+3进行解压处理后,得到第33KB至36KB的原始数据。
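Scenario 4: to read the 30th through 31st KB of the original data, look up the first index of data block s+3 (for example Table 4 above). The block number of compressed block c+3 in the first index shows that data block s+3 corresponds to data in compressed block c+3 and gives the storage location of compressed block c+3 on disk. Read and decompress compressed block c+3. Assume here that the intra-block offset of data block s+3 is 1KB, i.e. 1KB of data lies between the start of data block s+3 and the dashed line. The first 2KB of the decompressed data of compressed block c+3 are then the 30th through 31st KB of the original data.
Scenario 5: to read the 29th KB of the original data, first look up the first index of data block s+3 (for example Table 4 above). Continuing the example of scenario 4, the intra-block offset of data block s+3 is 1KB, so the decompressed data of compressed block c+3 does not include the 29th KB. The first index of the first head block before data block s+3 must therefore be looked up; in this scenario that is the first index of data block s+2 (for example Table 2 above). The block number of compressed block c+2 in that index shows that data block s+2 corresponds to data in compressed block c+2 and gives the storage location of compressed block c+2 on disk. Read compressed block c+2, whose data is original data. Assume here that the intra-block offset of data block s+2 is 1KB, i.e. 1KB of data lies between the start of data block s+2 and its dashed line. The first 3KB read from compressed block c+2 then belong to data block s+2, and the 4th KB read from compressed block c+2 is the 29th KB of the original data, which belongs to data block s+3.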
当然,实际应用时,还可以支持查找第29KB至第31KB的原始数据,这种情况下可以将场景四和场景五所示的方式相结合,本申请实施例中不再一一列举说明。
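The lookup scenarios can be checked with a small model of the first index. The sketch below works in whole KB, assumes that data block s has the 0-based number 4 (it holds the 17th through 20th KB), and uses the 1KB intra-block offsets assumed in scenarios 3 to 5. The names `Head`, `NonHead`, and `locate` are hypothetical, and the non-head entry here reads the intra-block offset from the head block's entry rather than storing its own copy.

```python
from dataclasses import dataclass

BLOCK_KB = 4  # data block size in KB, as in the 128KB example


@dataclass
class Head:
    cblock: str  # compressed block whose decompressed data starts in this data block
    offset: int  # KB offset of that start (the dashed line) inside the data block


@dataclass
class NonHead:
    dist_prev: int  # distance, in data blocks, back to the preceding head block


def locate(index, kb):
    """Map a 0-based KB position in the original data to
    (compressed block, 0-based KB position inside its decompressed data)."""
    j = kb // BLOCK_KB
    entry = index[j]
    if isinstance(entry, Head) and kb % BLOCK_KB >= entry.offset:
        head = j                          # at or after the dashed line
    elif isinstance(entry, NonHead):
        head = j - entry.dist_prev        # walk back to the head block
    else:                                 # head block, but before the dashed line
        prev = index[j - 1]
        head = j - 1 if isinstance(prev, Head) else j - 1 - prev.dist_prev
    start = head * BLOCK_KB + index[head].offset
    return index[head].cblock, kb - start


S = 4  # assumed 0-based number of data block s
index = {
    S:     Head("c", 0),
    S + 1: Head("c+1", 0),
    S + 2: Head("c+2", 1),
    S + 3: Head("c+3", 1),
    S + 4: NonHead(1),
    S + 5: NonHead(2),
}
```

With this model, the 29th KB (position 28) resolves to the 4th KB of compressed block c+2, the 30th and 31st KB to the first 2KB of compressed block c+3, and the 33rd through 36th KB to the 4th through 7th KB of compressed block c+3, which agrees with scenarios 3 to 5.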
方式二:以压缩块为对象创建索引
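Build a second index for the k-th of the n compressed blocks and record the correspondence between the second index and the k-th compressed block. The second index identifies the byte range of the original data that was input to the compression module to produce the k-th compressed block. k is a positive integer less than or equal to n.
Refer to FIG. 12, which illustrates the second indexes of the compressed blocks. The byte range of the original data input to the compression module for each compressed block can be expressed as an in-file offset. For example, assume compressed blocks 1 through n all have a first capacity of 4KB. If in-file offset 0 of compressed block 1 indicates the byte range 0~8KB, then the original data in 0~8KB was compressed into compressed block 1. If in-file offset 1 of compressed block 2 indicates the byte range 8~20KB, then the original data in 8~20KB was compressed into compressed block 2. If in-file offset n of compressed block n indicates the byte range 128~136KB, then the original data in 128~136KB was compressed into compressed block n.
With the indexing of approach 2, the compressed block holding given original data can be found by binary search. Given this lookup method, approach 2 is more efficient when there are few compressed blocks; and because few indexes are built, it also saves storage space.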
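A minimal sketch of the binary search over the second index, using Python's standard `bisect` module. The start offsets 0KB and 8KB come from the example above and 20KB follows from compressed block 2 ending there; the function name and the list layout are assumptions made for illustration.

```python
from bisect import bisect_right

# Start offset (KB) of the original-data range covered by compressed blocks 1, 2, 3.
starts_kb = [0, 8, 20]


def find_compressed_block(starts, pos_kb):
    """Return the 1-based number of the compressed block whose range contains pos_kb.

    The end of the last range is not checked here; a real index would also
    record where the final block's range stops.
    """
    if pos_kb < starts[0]:
        raise ValueError("position precedes the first compressed block")
    # Number of ranges that start at or before pos_kb == 1-based block number.
    return bisect_right(starts, pos_kb)
```

Because the ranges are contiguous and sorted, only the start offsets are needed for the search, and each lookup costs a logarithmic number of comparisons in the number of compressed blocks.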
以上为对本申请实施例提供的数据压缩方法的具体介绍。由于本申请实施例中需要产生固定大小的压缩块,故压缩算法的设计相应地也会有所调整。下面,以LZ4算法为例,对本申请实施例中压缩算法的设计方式进行简要说明。
在介绍压缩算法的设计方式之前,首先对LZ4的压缩格式进行解释,以便于理解。
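In LZ4, if the original data currently being input repeats original data that was input earlier, the current data can be matched against the earlier identical data so that the repeated part need not be represented again, which achieves compression. LZ4 is a byte-oriented compressed format. A compressed block consists of a number of compressed sequences. Each compressed sequence records a run of literal bytes and one sliding-window match, in the format shown for example in FIG. 13. A compressed sequence contains a token, literals, and an offset. Optionally it also contains a literal-length extension (lsic_literal, where lsic stands for linear small-integer code) and a match-length extension (lsic_match).
literal is the part of the original data stored directly in the compressed sequence, i.e. original data that could not be compressed. match is the part of the compressed sequence that could be matched. offset is the distance between the original data currently being input and the identical original data input earlier, and always occupies 2 bytes. An offset of 0 means the match length is zero, i.e. there is no sliding-window match. The maximum offset (MAXDISTANCE) is 65535.
The token occupies 1 byte, i.e. 8 bits. The high 4 bits give the literal length in bytes (token_literal) and the low 4 bits give the match length in bytes (token_match).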
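token_literal ranges from 0 to 15, so on its own it can express a literal length of at most 15 bytes. If the actual literal length is greater than or equal to 15, the compressed sequence contains lsic_literal, which encodes the literal length beyond those 15 bytes.
token_match also ranges from 0 to 15. When offset is non-zero the minimum match length (MINMATCH) is 4 bytes, so token_match stores the match length minus MINMATCH: token_match=0 means a match length of 4 bytes and token_match=15 means a match length of 19 bytes. If the match length minus MINMATCH is greater than or equal to 15, the compressed sequence contains lsic_match, which encodes the remaining match length.
LZ4 applies some special rules to the last two special compressed sequences it outputs, for example that the last 5 bytes of the last compressed sequence are literals and that it has no offset field, i.e. no match. The embodiments of this application do not consider these details, so they are not elaborated here.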
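The token and length-extension rules above can be sketched as follows. This follows the LZ4 block format, in which each extension byte adds up to 255 and a byte below 255 ends the extension; the function names are hypothetical, and the special end-of-block rules are ignored as in the text.

```python
MINMATCH = 4


def length_extension(value):
    """Extension bytes that follow a token nibble of 15."""
    out = []
    value -= 15                 # the nibble already accounts for 15
    while value >= 255:
        out.append(255)
        value -= 255
    out.append(value)           # final byte is < 255, possibly 0
    return out


def encode_lengths(literal_len, match_len):
    """Return (token, lsic_literal bytes, lsic_match bytes) for one sequence."""
    if match_len < MINMATCH:
        raise ValueError("a match must be at least MINMATCH bytes long")
    m = match_len - MINMATCH    # the token stores the match length minus MINMATCH
    token = (min(literal_len, 15) << 4) | min(m, 15)
    lit_ext = length_extension(literal_len) if literal_len >= 15 else []
    match_ext = length_extension(m) if m >= 15 else []
    return token, lit_ext, match_ext
```

For instance, 3 literal bytes followed by a 4-byte match need only the token 0x30, while 15 literal bytes and a 19-byte match give the token 0xFF plus one zero extension byte for each length.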
结合上述概念对压缩算法的部分设计调整进行简要说明,本申请实施例中的压缩算法基于动态规划原理,具体实现方式有两种:
(1)以sequence为单位转移的动态规划转移方程
不考虑输出特殊压缩序列的情况时,输入的原始序列(原始序列可以理解为压缩序列对应的原始数据)的示意图参照图14所示,这种情况下朴素的cost[i]代价函数为:
Figure PCTCN2019083589-appb-000006
其中,i为本次输入的原始序列的结尾位置,j为本次输入的原始序列的起始位置,也即本次输入的原始序列中作为literal的部分的起始位置,k为本次输入的原始序列中作为match的部分的起始位置,i、j、k之间满足的关系为:j≤k≤i,j<i。其中,本次输入的原始序列即为第j至i-1个字节组成的原始数据。
cost[i]代价函数表示依次输入原始数据直到输入的最后一个原始序列在i处结尾(或直到输入的原始数据为以i为结尾的后缀)时,输出的压缩序列的最小字节长度,其中,以i为结尾的后缀可以理解为第0至第i-1个字节组成的原始数据。
cost[j]代价函数表示依次输入原始数据直到输入的最后一个原始序列在j处结尾(或直到输入的原始数据以j为结尾的后缀)时,输出的压缩序列的最小字节长度,其中,以j为结尾的后缀可以理解为输入第0至第j-1个字节组成的原始数据。
k-j为本次输入的原始序列中作为literal的部分的字节长度。
lsic函数表示lsic_literal、或lsic_match的字节长度,具体为:
Figure PCTCN2019083589-appb-000007
当x=i时,lsic[i]表示本次输入的原始序列中不存在作为match的部分时lsic_literal的字节长度;
当x=k-j时,lsic[k-j]表示本次输入的原始序列中存在作为match的部分时lsic_literal的字节长度;
当x=i-k-MINMATCH时,lsic[i-k-MINMATCH]表示本次输入的原始序列中存在作为match的部分时lsic_match的字节长度,MINMATCH=4。
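The lsic function can be sketched from the LZ4 length encoding and from the values this description later gives for the literal-length increment (14, 15+254, 15+255+254, and so on). The function below is a reconstruction under that assumption and is not a copy of the formula in the figure.

```python
def lsic(x):
    """Number of extension bytes needed to encode a length value x.

    Values below 15 fit in the 4-bit token field; from 15 upward, one
    extension byte is needed, plus one more for every further 255.
    """
    return 0 if x < 15 else (x - 15) // 255 + 1
```

Under this reading, `lsic(k - j)` is the size of lsic_literal for a literal run of `k - j` bytes, and `lsic(i - k - MINMATCH)` is the size of lsic_match for a match of `i - k` bytes.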
增加约束条件,并进一步化简得:
Figure PCTCN2019083589-appb-000008
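Here ml is the length in bytes of a valid match. It cannot exceed the length of the original sequence input this time, i.e. ml≤i-j. longestmatch is the maximum length of a suffix of the data ending at i that repeats data appearing earlier, where the repeated copy may overlap the suffix. It can be computed with a modified Knuth-Morris-Pratt (KMP) failure function. For example, if the original sequence is ABABAB, longestmatch is 4; if it is ABCABCABCABC, longestmatch is 9; if it is ABCDABC, longestmatch is 3; and if it is BABCDABC, longestmatch is also 3.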
m为模板串(template)的起始位置,也可以通过修改后的KMP算法失配函数求出,template在输入的原始序列中的位置如图15所示,模板串位于本次输入的原始序列中作为match的部分之前,模板串可以包含本次输入的原始序列中作为literal的部分的原始数据中的部分原始数据或全部原始数据,也可以包含之前输入的原始序列中作为literal的部分的原始数据中的部分原始数据或全部原始数据。
k-m代表offset的取值,由于offset最多两个字节,所以其最大值MAXDISTANCE为65535。
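The longestmatch examples can be reproduced with a direct search over candidate offsets. The sketch below is a brute-force substitute for the modified KMP failure function mentioned above, chosen only because it is easy to check by hand; the function name and the `max_distance` parameter are assumptions.

```python
def longest_match(data, max_distance=65535):
    """Length of the longest suffix of data that also appears earlier.

    For each offset d, count how many bytes at the end of data equal the
    bytes d positions before them. The earlier copy may overlap the suffix,
    as an LZ4 match may overlap the data it produces.
    """
    n = len(data)
    best = 0
    for d in range(1, min(n - 1, max_distance) + 1):
        length = 0
        while n - 1 - length - d >= 0 and data[n - 1 - length] == data[n - 1 - length - d]:
            length += 1
        best = max(best, length)
    return best
```

This runs in quadratic time in the worst case, which is fine for checking small examples but is the reason a failure-function approach would be preferred in practice.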
(2)以literal结尾来转移的动态规划转移方程
输入的原始序列也可以通过以literal为结尾的变形来进行转移,例如参照图16所示,这种情况下cost[i]代价函数为:
Figure PCTCN2019083589-appb-000009
其中,i为本次输入的原始序列中作为literal的部分的结尾位置。
ml表示有效的match的字节长度,不同于(1),这里ml为循环自变量,取值范围为MINMATCH~longestmatch。其中,MINMATCH、longestmatch的定义可参照(1)中的解释。
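lsicdelta_lit is the number of bytes by which lsic_literal grows when the literal length associated with cost[i-1] increases by one. Its value is 0 or 1: it is 1 when that literal length is 14, 15+254, 15+255+254, and so on, and 0 otherwise.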
cost[i]代价函数表示依次输入原始数据直到输入的原始序列以i为literal的部分的结尾位置时,输出的压缩序列的最小字节长度,输入的原始序列以i为literal的部分的结尾位置可以理解为输入第0至i-1个字节组成的原始数据,其中,第i-1个字节的原始数据不一定是作为literal的部分,而是从第i-1个字节之后的原始数据起,进行压缩处理时首先是作为literal处理,但实际literal的字节长度可能为零。
cost[i-1]代价函数表示依次输入原始数据直到输入的原始序列以i-1为literal的部分的结尾位置时,输出的压缩序列的最小字节长度,输入的原始序列以i-1为literal的部分的结尾位置可以理解为输入第0至i-2个字节组成的原始数据,其中,第i-2个字节的原始数据不一定是作为literal的部分,而是从第i-2个字节之后的原始数据起,进行压缩处理时首先是作为literal处理,但实际literal的字节长度可能为零。
cost[i-ml]代价函数表示依次输入原始数据直到输入的原始序列以i-ml为literal的部分的结尾位置时,输出的压缩序列的最小字节长度,输入的原始序列以i-ml为literal的部分的结尾位置可以理解为输入第0至i-ml-1个字节组成的原始数据,其中,第i-ml-1个字节的原始数据不一定是作为literal的部分,而是从第i-ml-1个字节之后的原始数据起,进行压缩处理时首先是作为literal处理,但实际literal的字节长度可能为零。
lsic函数的定义可参照(1)的解释,当x=ml-MINMATCH时,lsic[ml-MINMATCH]表示本次输入的原始序列中存在作为match的部分时lsic_match的字节长度。
下面,基于相同的技术构思,结合附图对本申请实施例提供的数据压缩装置进行介绍。
本申请实施例提供一种数据压缩装置,所述数据压缩装置可以是任意能够执行本申请上述方法实施例所涉及的方法的设备,例如为镜像生成服务器、个人计算机或移动终端等。所述数据压缩装置包括执行本申请上述方法实施例所涉及的方法的功能或模块或手段(means),并且,可选的,所述数据压缩装置也可以包括执行压缩块的读取和解压过程的功能或模块或手段(means)。所述功能或模块或单元或手段(means)可以通过软件实现,或者通过硬件实现,也可通过硬件执行相应的软件实现。
图17示出了本申请实施例提供的一种数据压缩装置的结构示意图一。其中,所述数据压缩装置1700包括收发模块1701、处理模块1702、以及压缩模块1703。其中,收发模块1701可以用于获取待压缩的原始数据。处理模块1702可以用于将所述原始数据输入至压缩模块1703进行压缩处理后依次输出n个包含有经压缩处理后的数据的压缩块,其中,输出的每个压缩块的第一容量相同,所述第一容量表征所述压缩块能够包含的经压缩处理后的数据的字节数。处理模块1702还可以用于将n个所述压缩块存储在存储介质中,其中,所述存储介质包括m个磁盘块,每个磁盘块的第二容量相同,所述第二容量表征所述磁盘块存储的数据的字节数。
具体的,第一容量相同的压缩块在存储介质中的存储形式可以有以下两种设计:
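In one possible design, the second capacity is p times the first capacity, and the n compressed blocks are stored in the storage medium so that one complete disk block holds p complete compressed blocks.
In another possible design, the first capacity is q times the second capacity, and the n compressed blocks are stored in the storage medium so that q complete disk blocks hold one complete compressed block.
Here n, m, p, and q are positive integers, p is less than or equal to n, and q is less than or equal to m.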
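Because the two capacities are integer multiples of each other, the disk position of a compressed block follows from simple arithmetic. The sketch below assumes, for illustration only, that compressed blocks are laid out contiguously starting at disk block 0 and are numbered from 0; the function name is hypothetical.

```python
def disk_location(k, first_capacity, second_capacity):
    """Locate compressed block k (0-based).

    Returns (first disk block, byte offset inside it, number of disk blocks used).
    """
    if second_capacity % first_capacity == 0:
        p = second_capacity // first_capacity   # p compressed blocks per disk block
        return k // p, (k % p) * first_capacity, 1
    if first_capacity % second_capacity == 0:
        q = first_capacity // second_capacity   # q disk blocks per compressed block
        return k * q, 0, q
    raise ValueError("capacities must be integer multiples of each other")
```

In both designs a compressed block never straddles a disk block boundary partway, so reading one compressed block touches only whole disk blocks that belong to it or that it fully shares with its neighbours.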
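In one implementation, if at least two of the n compressed blocks contain identical compressed data, the processing module 1702 may also deduplicate them when storing them in the storage medium, so that the at least two compressed blocks share the same storage location in the storage medium.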
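Deduplication of fixed-size compressed blocks can be sketched with a content-keyed table. The use of SHA-256 as the key and the list-based storage are assumptions made for this illustration; the text does not say how identical blocks are detected.

```python
import hashlib


def store_with_dedup(blocks):
    """Store each distinct compressed block once.

    Returns (stored, locations): stored holds the unique blocks in order of
    first appearance, and locations[i] is the position in stored that
    compressed block i maps to.
    """
    stored = []
    seen = {}        # content digest -> position in stored
    locations = []
    for block in blocks:
        key = hashlib.sha256(block).digest()
        if key not in seen:
            seen[key] = len(stored)
            stored.append(block)
        locations.append(seen[key])
    return stored, locations
```

Fixed-size compressed blocks make this straightforward, since identical blocks are byte-for-byte equal units that can share one storage location.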
所述处理模块1702还可以用于创建索引。一种实现方式中,处理模块1702可以用于将原始数据划分为i个数据块,每个数据块包含的数据的字节数与所述第一容量相同,其中,i个数据块中第j个数据块最多包含两个压缩块中经解压处理后的数据。之后,处理模块1702还可以用于为第j个数据块建立第一索引,并记录建立的第一索引与第j个数据块之间的对应关系,第一索引用于标识第j个数据块包含的数据在所述存储介质中的存储位置。i为正整数,j为小于或等于i的任一正整数。
第一种可能的设计中,当第j个数据块包含下一个压缩块中经解压处理后的数据时,第j个数据块对应的第一索引中包含的内容为:第一标识位或第二标识位、所述下一个压缩块的块号、以及第j个数据块的块内偏移。其中,第一标识位用于标识所述下一个压缩块中的数据为原始数据,第二标识位用于标识所述下一个压缩块中的数据为经压缩处理后的数据,第j个数据块的块内偏移为所述下一个压缩块经解压处理后的数据的首部在第j个数据块中的位置。
第二种可能的设计中,当第j个数据块仅包含当前压缩块中经解压处理后的数据时,第j个数据块对应的第一索引中包含的内容为:第三标识位,第三标识位用于标识所述当前压缩块中的数据为经压缩处理后得到的数据;相对位于第j个数据块之前的第一数据块的块距离,所述第一数据块中包含当前压缩块中经解压处理后的数据,或者,第一数据块中包含当前压缩块中经解压处理后的数据、以及上一个压缩块中经解压处理后的数据;相对位于第j个数据块之后的第二数据块的块距离,第二数据块中包含当前压缩块中经解压处理后的数据、以及下一个压缩块中经解压处理后的数据;第一数据块的块内偏移,第一数据块的块内偏移为所述当前压缩块经解压处理后的数据的首部在第一数据块中的位置。
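In a third possible design, building on the second possible design, the first index of the j-th data block may also contain the block number of the current compressed block.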
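In one possible implementation, the compression performed by the processing module 1702 may be as follows: input the original data byte by byte into the compression module 1703 for compression, and repeat the following until all bytes of the original data have been input to the compression module:
Each time the compressed output of the current round reaches the first capacity, determine whether the number of bytes s of original data input to the compression module 1703 in this round is greater than the first capacity. If so, place the compressed data of this round in one compressed block and output it. If not, input a further t bytes of original data to the compression module 1703, and place the s bytes of original data together with the t bytes of original data in one compressed block and output it. Here s and t are positive integers, and t is the difference between the first capacity and s.
In one possible implementation, the processing module 1702 may further be configured so that, when the number of bytes of original data input to the compression module 1703 in this round equals a preset value but the compressed output of this round has still not reached the first capacity, the compressed data of this round is placed in one compressed block and output. The preset value is the maximum number of bytes of original data that may be input to the compression module 1703 in one round.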
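The fixed-capacity loop can be sketched with Python's standard `zlib` module standing in for the compression module. This is a rough illustration under stated assumptions and not the compressor described here: zlib is used instead of LZ4, each round recompresses a growing prefix (quadratic time), the output is padded with zero bytes up to the capacity, and the names `pack_blocks` and `unpack_blocks` are hypothetical.

```python
import zlib


def pack_blocks(data, capacity, max_input):
    """Split data into fixed-capacity blocks: (is_raw, payload, consumed) tuples."""
    blocks = []
    pos = 0
    while pos < len(data):
        s = 0
        # Grow this round's input while the compressed output still fits,
        # up to the preset maximum or the end of the data.
        while (pos + s < len(data) and s < max_input
               and len(zlib.compress(data[pos:pos + s + 1])) <= capacity):
            s += 1
        if s > capacity:
            # Compression gain: more input than the block could hold raw.
            payload, is_raw = zlib.compress(data[pos:pos + s]), False
        else:
            # No gain: store s plus t further bytes of original data.
            s = min(capacity, len(data) - pos)
            payload, is_raw = data[pos:pos + s], True
        blocks.append((is_raw, payload.ljust(capacity, b"\0"), s))
        pos += s
    return blocks


def unpack_blocks(blocks):
    """Rebuild the original data from the blocks produced by pack_blocks."""
    out = bytearray()
    for is_raw, payload, consumed in blocks:
        if is_raw:
            out += payload[:consumed]
        else:
            # decompressobj leaves the zero padding in unused_data.
            out += zlib.decompressobj().decompress(payload)
    return bytes(out)
```

Every block produced this way has the same size, and a raw block is simply copied back on read, which corresponds to the uncompressed mode of a head block.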
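FIG. 18 is a second schematic structural diagram of a data compression apparatus provided by an embodiment of this application. The data compression apparatus 1800 may include a processor 1801 and a communication interface 1802. The processor 1801 may be a central processing unit (CPU) or a network processor (NP). The processor 1801 may also be another type of chip, such as a baseband circuit, a radio-frequency circuit, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or any combination thereof. The data compression apparatus 1800 may further include a memory 1803 for storing the program executed by the processor 1801 and the data to be processed. The memory 1803 may be integrated in the processor 1801 or provided separately from it.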
所述处理器1801可以用于用通信接口1802获取待压缩的原始数据。处理器1801可以用于将所述原始数据输入至压缩模块进行压缩处理后依次输出n个包含有经压缩处理后的数据的压缩块,其中,输出的每个压缩块的第一容量相同,所述第一容量表征所述压缩块能够包含的经压缩处理后的数据的字节数。处理器1801还可以用于将n个压缩块存储在存 储介质中,其中,所述存储介质包括m个磁盘块,每个磁盘块的第二容量相同,所述第二容量表征所述磁盘块存储的数据的字节数。
关于上述处理器1801、通信接口1802、存储器1803的具体功能,可参见本申请上述方法实施例中相对应的描述,在此不再赘述。
基于相同的技术构思,本申请还提供一种芯片,所述芯片可以与存储器相通信,或者所述芯片中包括存储器,所述芯片执行所述存储器中存储的程序指令,以实现上述方法实施例中涉及的方法所对应的功能。
基于相同的技术构思,本申请还提供一种计算机存储介质,所述计算机存储介质存储有计算机可读指令,当所述计算机可读指令被执行时,使得实现上述方法实施例中涉及的方法所对应的功能。
基于相同的技术构思,本申请还提供一种包含软件程序的计算机程序产品,当其在计算机上运行时,使得实现上述方法实施例中涉及的方法所对应的功能。
本领域内的技术人员应明白,本申请的实施例可提供为方法、系统、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。
本申请是参照根据本申请的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。
显然,本领域的技术人员可以对本申请进行各种改动和变型而不脱离本申请的精神和范围。这样,倘若本申请的这些修改和变型属于本申请权利要求及其等同技术的范围之内,则本申请也意图包含这些改动和变型在内。

Claims (17)

  1. 一种数据压缩方法,其特征在于,包括:
    获取待压缩的原始数据;
    将所述原始数据输入至压缩模块进行压缩处理后依次输出n个包含有经压缩处理后的数据的压缩块,其中,每个压缩块的第一容量相同,所述第一容量表征所述压缩块能够包含的经压缩处理后的数据的字节数;
    将n个所述压缩块存储在存储介质中,所述存储介质包括m个磁盘块,每个磁盘块的第二容量相同,所述第二容量表征所述磁盘块存储的数据的字节数;
    其中,所述第二容量为所述第一容量的p倍,n个所述压缩块在所述存储介质中的存储形式为:一个完整的所述磁盘块中存储p个完整的所述压缩块;或者,所述第一容量为所述第二容量的q倍,n个所述压缩块在所述存储介质中的存储形式为:q个完整的所述磁盘块中存储一个完整的所述压缩块;n、m、p、q为正整数,且p小于或等于n,q小于或等于m。
  2. 如权利要求1所述的方法,其特征在于,所述方法还包括:
    若所述n个压缩块中存在包含的经压缩处理后的数据均相同的至少两个压缩块,则将所述至少两个压缩块存储在所述存储介质中时,所述至少两个压缩块在所述存储介质中的存储位置相同。
  3. 如权利要求1所述的方法,其特征在于,还包括:
    将所述原始数据划分为i个数据块,每个数据块包含的数据的字节数与所述第一容量相同,其中,所述i个数据块中第j个数据块最多包含两个压缩块中经解压处理后的数据;
    为所述第j个数据块建立第一索引,并记录建立的第一索引与所述第j个数据块之间的对应关系,所述第一索引用于标识所述第j个数据块包含的数据在所述存储介质中的存储位置;
    i为正整数,j为小于或等于i的任一正整数。
  4. 如权利要求3所述的方法,其特征在于,当所述第j个数据块包含下一个压缩块中经解压处理后的数据时,所述第j个数据块对应的第一索引中包含的内容为:
    第一标识位或第二标识位、所述下一个压缩块的块号、以及所述第j个数据块的块内偏移;
    其中,所述第一标识位用于标识所述下一个压缩块中的数据为原始数据;所述第二标识位用于标识所述下一个压缩块中的数据为经压缩处理后的数据;所述第j个数据块的块内偏移为所述下一个压缩块经解压处理后的数据的首部在所述第j个数据块中的位置。
  5. 如权利要求3所述的方法,其特征在于,当所述第j个数据块仅包含当前压缩块中经解压处理后的数据时,所述第j个数据块对应的第一索引中包含的内容为:
    第三标识位,所述第三标识位用于标识所述当前压缩块中的数据为经压缩处理后得到的数据;
    相对位于所述第j个数据块之前的第一数据块的块距离,所述第一数据块中包含当前压缩块中经解压处理后的数据,或者,所述第一数据块中包含当前压缩块中经解压处理后的数据、以及上一个压缩块中经解压处理后的数据;
    相对位于所述第j个数据块之后的第二数据块的块距离,所述第二数据块中包含当前压缩块中经解压处理后的数据、以及下一个压缩块中经解压处理后的数据;
    所述第一数据块的块内偏移,所述第一数据块的块内偏移为所述当前压缩块经解压处理 后的数据的首部在所述第一数据块中的位置。
  6. 如权利要求5所述的方法,其特征在于,所述第j个数据块对应的第一索引中还包含所述当前压缩块的块号。
  7. 如权利要求1至6任一所述的方法,其特征在于,将所述原始数据输入至压缩模块进行压缩处理后依次输出n个包含有经压缩处理后的数据的压缩块,包括:
    将所述原始数据以字节为单位依次输入至所述压缩模块进行压缩处理;
    重复执行如下处理,直至将所述原始数据包含的所有字节数均输入至所述压缩模块为止:
    在每次确定出本次经压缩处理后的数据的字节数达到所述第一容量后,判断本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数s是否大于所述第一容量;
    若判断结果为是,则将本次经压缩处理后的数据包含在一个所述压缩块中并输出;
    若判断结果为否,则继续输入t个字节数的所述原始数据至所述压缩模块,将s个字节数的所述原始数据以及所述t个字节数的所述原始数据包含在一个所述压缩块中并输出,s,t为正整数,t为所述第一容量与s之间的差值。
  8. 如权利要求7所述的方法,其特征在于,将所述原始数据以字节为单位依次输入至所述压缩模块进行压缩处理之后,还包括:
    当本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数等于预设值时,若本次经压缩处理后的数据的字节数仍没有达到所述第一容量,则将本次经压缩处理后的数据包含在一个所述压缩块并输出,所述预设值为本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数的最大值。
  9. 一种数据压缩装置,其特征在于,包括:
    收发模块,用于获取待压缩的原始数据;
    处理模块,用于将所述原始数据输入至压缩模块进行压缩处理后依次输出n个包含有经压缩处理后的数据的压缩块,其中,每个压缩块的第一容量相同,所述第一容量表征所述压缩块能够包含的经压缩处理后的数据的字节数;将n个所述压缩块存储在存储介质中,所述存储介质包括m个磁盘块,每个磁盘块的第二容量相同,所述第二容量表征所述磁盘块存储的数据的字节数;
    其中,所述第二容量为所述第一容量的p倍,n个所述压缩块在所述存储介质中的存储形式为:一个完整的所述磁盘块中存储p个完整的所述压缩块;或者,所述第一容量为所述第二容量的q倍,n个所述压缩块在所述存储介质中的存储形式为:q个完整的所述磁盘块中存储一个完整的所述压缩块;n、m、p、q为正整数,且p小于或等于n,q小于或等于m。
  10. 如权利要求9所述的装置,其特征在于,所述处理模块,还用于:
    若所述n个压缩块中存在包含的经压缩处理后的数据均相同的至少两个压缩块,则将所述至少两个压缩块存储在所述存储介质中时,所述至少两个压缩块在所述存储介质中的存储位置相同。
  11. 如权利要求9所述的装置,其特征在于,所述处理模块,还用于:
    将所述原始数据划分为i个数据块,每个数据块包含的数据的字节数与所述第一容量相同,其中,所述i个数据块中第j个数据块最多包含两个压缩块中经解压处理后的数据;
    为所述第j个数据块建立第一索引,并记录建立的第一索引与所述第j个数据块之间的对应关系,所述第一索引用于标识所述第j个数据块包含的数据在所述存储介质中的存储位置;
    i为正整数,j为小于或等于i的任一正整数。
  12. 如权利要求11所述的装置,其特征在于,当所述第j个数据块包含下一个压缩块中经解压处理后的数据时,所述第j个数据块对应的第一索引中包含的内容为:
    第一标识位或第二标识位、所述下一个压缩块的块号、以及所述第j个数据块的块内偏移;
    其中,所述第一标识位用于标识所述下一个压缩块中的数据为原始数据;所述第二标识位用于标识所述下一个压缩块中的数据为经压缩处理后的数据;所述第j个数据块的块内偏移为所述下一个压缩块经解压处理后的数据的首部在所述第j个数据块中的位置。
  13. 如权利要求11所述的装置,其特征在于,当所述第j个数据块仅包含当前压缩块中经解压处理后的数据时,所述第j个数据块对应的第一索引中包含的内容为:
    第三标识位,所述第三标识位用于标识所述当前压缩块中的数据为经压缩处理后得到的数据;
    相对位于所述第j个数据块之前的第一数据块的块距离,所述第一数据块中包含当前压缩块中经解压处理后的数据,或者,所述第一数据块中包含当前压缩块中经解压处理后的数据、以及上一个压缩块中经解压处理后的数据;
    相对位于所述第j个数据块之后的第二数据块的块距离,所述第二数据块中包含当前压缩块中经解压处理后的数据、以及下一个压缩块中经解压处理后的数据;
    所述第一数据块的块内偏移,所述第一数据块的块内偏移为所述当前压缩块经解压处理后的数据的首部在所述第一数据块中的位置。
  14. 如权利要求13所述的装置,其特征在于,所述第j个数据块对应的第一索引中还包含所述当前压缩块的块号。
  15. 如权利要求9至14任一所述的装置,其特征在于,所述处理模块,具体用于:
    将所述原始数据以字节为单位依次输入至所述压缩模块进行压缩处理;
    重复执行如下处理,直至将所述原始数据包含的所有字节数均输入至所述压缩模块为止:
    在每次确定出本次经压缩处理后的数据的字节数达到所述第一容量后,判断本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数s是否大于所述第一容量;
    若判断结果为是,则将本次经压缩处理后的数据包含在一个所述压缩块中并输出;
    若判断结果为否,则继续输入t个字节数的所述原始数据至所述压缩模块,将s个字节数的所述原始数据以及所述t个字节数的所述原始数据包含在一个所述压缩块中并输出,s,t为正整数,t为所述第一容量与s之间的差值。
  16. 如权利要求15所述的装置,其特征在于,所述处理模块,还用于:
    当本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数等于预设值时,若本次经压缩处理后的数据的字节数仍没有达到所述第一容量,则将本次经压缩处理后的数据包含在一个所述压缩块并输出,所述预设值为本次输入至所述压缩模块进行压缩处理的所述原始数据的字节数的最大值。
  17. 一种计算机存储介质,其特征在于,包括计算机可读指令,当所述计算机可读指令被执行时,实现如权利要求1至8任一项所述的方法。
PCT/CN2019/083589 2018-05-30 2019-04-22 一种数据压缩方法及装置 WO2019228098A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810542482.1A CN110557124B (zh) 2018-05-30 2018-05-30 一种数据压缩方法及装置
CN201810542482.1 2018-05-30

Publications (1)

Publication Number Publication Date
WO2019228098A1 true WO2019228098A1 (zh) 2019-12-05

Family

ID=68697805

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/083589 WO2019228098A1 (zh) 2018-05-30 2019-04-22 一种数据压缩方法及装置

Country Status (2)

Country Link
CN (1) CN110557124B (zh)
WO (1) WO2019228098A1 (zh)

Also Published As

Publication number Publication date
CN110557124A (zh) 2019-12-10
CN110557124B (zh) 2021-06-22


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19810188

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19810188

Country of ref document: EP

Kind code of ref document: A1