CN110557124A

CN110557124A - Data compression method and device

Info

Publication number: CN110557124A
Application number: CN201810542482.1A
Authority: CN
Inventors: 高翔; 杜维; 汪宁; 胡天军
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2018-05-30
Filing date: 2018-05-30
Publication date: 2019-12-10
Anticipated expiration: 2038-05-30
Also published as: WO2019228098A1; CN110557124B

Abstract

The application provides a data compression method and device. In the method, original data is compressed while limiting the size of each output compressed block, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored on a disk, either at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block. As a result, when compressed blocks are subsequently read and decompressed, as much of the data read as possible is valid data, and the data read from the compressed blocks can be decompressed successfully, which avoids the random read amplification that occurs in the fixed-input mode.

Description

Data compression method and device

Technical Field

The present application relates to the field of data compression technologies, and in particular, to a data compression method and apparatus.

Background

With the rapid development of information technology, the amount of data has grown explosively, which poses new challenges for the capacity of storage devices. To address this, the prior art proposes lossless compression technology: by optimizing how data is stored, data that exhibits regularity is represented in a simplified form by means of an algorithm, so that the data is compressed without affecting the original data content and the capacity occupied when the data is stored is reduced.

At present, lossless data compression mostly works as follows: the input data to be compressed is divided into data blocks of a fixed size, and compressed blocks of varying sizes are output. For example, referring to FIG. 1, when the lossless data compression technique is applied to a file system, a file can be divided into data blocks 1 to n, each 4 KB in size, and the n 4 KB data blocks are then compressed in sequence to obtain compressed blocks 1 to n of different sizes. Because the size of each compressed block output by the compression process is random, the position at which a compressed block is stored in the storage space of the storage device is also random. For example, the storage space of a magnetic disk may be divided into a number of equal-sized disk blocks, and when the compressed blocks are stored on the disk, the situation shown in FIG. 2 may occur: disk block A stores the complete compressed block 1 and the first half of compressed block 2, disk block B stores the second half of compressed block 2 and the first half of compressed block 3, and disk block C stores the second half of compressed block 3 and the first half of compressed block 4. In this case, reading compressed block 3 requires reading the contents of both disk block B and disk block C. Moreover, the second half of compressed block 2 and the first half of compressed block 4, which are read along with it, are not complete compressed blocks and therefore cannot be decompressed, so this additionally read content is invalid. This phenomenon is referred to as random read amplification.
In view of this, when compression is implemented in the existing manner described above, a device that subsequently reads a compressed block is likely to read a large amount of invalid content, which places an additional burden on the device.

Disclosure of Invention

The application provides a data compression method and a data compression device, which are used to solve problems such as random read amplification caused by the conventional data compression processing mode.

In a first aspect, the present application provides a data compression method, and an execution subject of the method may be any device capable of executing the data compression method provided by the present application, such as an image generation server, a personal computer, or a mobile terminal. In the method, after the original data to be compressed is acquired, the original data can be input to a compression module for compression processing, and then n compression blocks containing the compressed data are sequentially output, wherein the first capacity of each compression block is the same, and the first capacity represents the number of bytes of the compressed data which can be contained in the compression block. Further, n of the compressed blocks are stored in a storage medium, the storage medium includes m disk blocks, and a second capacity of each disk block is the same, and the second capacity represents the number of bytes of data stored in the disk block.

In the method, compressed blocks having the same first capacity can be stored in the storage medium in either of the following forms:

In one possible design, the second capacity is p times the first capacity, and the n compressed blocks are stored in the storage medium in the form of: p complete compressed blocks are stored in one complete disk block.

In another possible design, the first capacity is q times the second capacity, and the n compressed blocks are stored in the storage medium in the form of: one complete compressed block is stored across q complete disk blocks.

Wherein n, m, p and q are positive integers, p is less than or equal to n, and q is less than or equal to m.

In the method, the original data is compressed while limiting the size of each output compressed block, so that every output compressed block has the same first capacity; when these compressed blocks are stored in the storage medium, either at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block. In addition, compared with the conventional fixed-input approach, fixing the size of the output compressed block makes it easier to keep the compression ratio low.

In a possible implementation manner, if there are at least two compressed blocks in the n compressed blocks, where the compressed data contained in the at least two compressed blocks are all the same, when the at least two compressed blocks are stored in the storage medium, the storage locations of the at least two compressed blocks in the storage medium are the same. In this way, deduplication processing of two identical compressed blocks can be achieved in order to save storage space.
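The deduplication described above can be illustrated with a minimal sketch. The application does not say how identical compressed blocks are detected; the content hash (SHA-256), the class name `DedupStore`, and the slot numbering below are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Toy block store: identical compressed blocks share one storage location."""

    def __init__(self):
        self.slots = []       # slot number -> compressed block bytes
        self.by_digest = {}   # SHA-256 digest -> slot number (hypothetical scheme)

    def put(self, block: bytes) -> int:
        """Store a compressed block and return the slot (storage location) used."""
        digest = hashlib.sha256(block).digest()
        slot = self.by_digest.get(digest)
        # Compare the bytes as well, so a hash collision never merges
        # two different blocks.
        if slot is not None and self.slots[slot] == block:
            return slot
        self.slots.append(block)
        self.by_digest[digest] = len(self.slots) - 1
        return len(self.slots) - 1
```

Because every compressed block has the same first capacity, two blocks with identical content occupy identical storage, so one stored copy can serve both references.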

In one possible implementation, after the compressed block is stored in the storage medium, an index may be created to facilitate searching and accessing the compressed block in which the original data is located.

In a first mode, the original data may be divided into i data blocks, and each data block includes data with the same byte number as the first capacity, where a jth data block of the i data blocks includes decompressed data in at most two compressed blocks. Further, a first index is established for the jth data block, and the corresponding relationship between the established first index and the jth data block is recorded, where the first index is used to identify a storage location of data included in the jth data block in the storage medium. Wherein i is a positive integer, and j is any positive integer less than or equal to i.

When the jth data block contains data decompressed in the next compressed block, the content contained in the first index corresponding to the jth data block is: a first identification bit or a second identification bit, a block number of the next compressed block, and an intra-block offset of a jth data block; the first identification bit is used for identifying the data in the next compressed block as original data; the second identification bit is used for identifying the data in the next compressed block as the data after compression processing; the intra-block offset of the jth data block is the position of the header of the decompressed data of the next compressed block in the jth data block.

When the jth data block contains only data decompressed from the current compressed block, the first index corresponding to the jth data block contains: a third identification bit, the block distance to a first data block located before the jth data block, the block distance to a second data block located after the jth data block, and the intra-block offset of the first data block. The third identification bit identifies the data in the current compressed block as data obtained by compression processing. The first data block contains either data decompressed from the current compressed block only, or data decompressed from the current compressed block together with data decompressed from the previous compressed block. The second data block contains data decompressed from the current compressed block together with data decompressed from the next compressed block. The intra-block offset of the first data block is the position, within the first data block, of the head of the decompressed data of the current compressed block.

Optionally, when the jth data block includes data decompressed in the current compressed block, the first index corresponding to the jth data block further includes a block number of the current compressed block.
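The claim that a data block contains decompressed data from at most two compressed blocks can be checked with a small sketch. It assumes that every compressed block except possibly the last represents at least one first capacity's worth of original bytes; the function name and the example lengths are hypothetical and are not taken from the application.

```python
def compressed_blocks_per_data_block(input_lengths, capacity):
    """For each data block of `capacity` original bytes, list the compressed
    blocks (0-based) whose decompressed data overlaps it.

    input_lengths[k] is the number of original bytes held by compressed block k.
    """
    # Byte range [start, end) of the original data covered by each compressed block.
    ranges, start = [], 0
    for length in input_lengths:
        ranges.append((start, start + length))
        start += length
    total = start

    result = []
    for lo in range(0, total, capacity):
        hi = min(lo + capacity, total)
        result.append([k for k, (s, e) in enumerate(ranges) if s < hi and e > lo])
    return result
```

With a 4096-byte first capacity and compressed blocks that hold 10000, 4096, and 6000 original bytes, the five resulting data blocks map to compressed blocks `[0]`, `[0]`, `[0, 1]`, `[1, 2]`, and `[2]`. None spans more than two, because a window of one first capacity cannot fully contain a range that is at least as long and still touch a neighbor on both sides.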

In a second mode, the index is created with the compressed block as the object. Specifically, a second index is established for the kth compressed block of the n compressed blocks, and the correspondence between the second index and the kth compressed block is recorded, where the second index identifies the byte range of the original data that was input to the compression module and compressed into the kth compressed block. k is a positive integer less than or equal to n.
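A minimal sketch of how such a per-compressed-block index could be used for lookup follows. It stores only the starting byte offset of each compressed block's range and searches with `bisect`; the function names and this particular layout are assumptions, since the application only states that the second index identifies a byte range.

```python
from bisect import bisect_right


def build_second_index(input_lengths):
    """Return (starts, total): the starting offset in the original data of each
    compressed block's byte range, and the total original size."""
    starts, offset = [], 0
    for length in input_lengths:
        starts.append(offset)
        offset += length
    return starts, offset


def find_compressed_block(starts, total, byte_offset):
    """Return the 0-based compressed block whose byte range holds `byte_offset`."""
    if not 0 <= byte_offset < total:
        raise IndexError("offset outside the original data")
    return bisect_right(starts, byte_offset) - 1
```

Since the ranges are contiguous and sorted, a binary search over the start offsets is enough to find the compressed block for any byte of the original data.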

In a possible implementation manner, the process of sequentially outputting n compressed blocks containing compressed data after the original data is input to the compression module for compression processing is as follows.

Sequentially inputting the original data to the compression module by taking bytes as units for compression processing, and repeatedly executing the following processing until all bytes contained in the original data are input to the compression module:

Each time the number of bytes of compressed data is determined to have reached the first capacity, it is judged whether the number of bytes s of original data input to the compression module for this round of compression is greater than the first capacity. If so, the compressed data is placed in one compressed block and output. If not, a further t bytes of original data are input, and the s bytes of original data together with the t bytes of original data are placed, uncompressed, in one compressed block and output. Here s and t are positive integers, and t is the difference between the first capacity and s.

In this way, the size of the output compressed block is fixed, and in the worst case for compression the compressed block is simply filled with original data, so that a low compression ratio is easier to guarantee than with the existing fixed-input approach.

In a possible implementation manner, after the original data is sequentially input to the compression module in units of bytes for compression, when the number of bytes of original data input in the current round of compression equals a preset value and the number of bytes of compressed data has not yet reached the first capacity, the data compressed in this round is placed in one compressed block and output. The preset value is the maximum number of bytes of original data that may be input to the compression module in one round of compression. This prevents too much redundant data from being packed into a single compressed block, which relieves the processing load when the compressed block is later read and decompressed.
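The fixed-output loop described above, including the preset cap on input bytes, can be sketched as follows. This is an illustration under stated assumptions rather than the application's implementation: `zlib` stands in for the compression module, the input is re-compressed from scratch as it grows instead of being streamed, the `RAW`/`COMPRESSED` flags and the zero padding of short payloads are hypothetical, and the names `compress_fixed_output` and `decompress_blocks` are invented here.

```python
import zlib

RAW, COMPRESSED = 0, 1  # hypothetical flag values, not taken from the application


def compress_fixed_output(data: bytes, capacity: int, max_input: int):
    """Split `data` into blocks whose payload is always `capacity` bytes.

    Returns a list of (flag, input_len, payload) tuples.
    """
    blocks, pos = [], 0
    while pos < len(data):
        limit = min(max_input, len(data) - pos)  # preset cap on input bytes
        best_len, best = 0, b""
        # Grow the input one byte at a time until the compressed form would
        # no longer fit in one block (or the cap is reached).
        for length in range(1, limit + 1):
            candidate = zlib.compress(data[pos:pos + length])
            if len(candidate) > capacity:
                break
            best_len, best = length, candidate
        if best_len > capacity:
            # s > capacity: compression paid off, emit the compressed bytes
            # (zero padding up to `capacity` is an assumption of this sketch).
            blocks.append((COMPRESSED, best_len, best.ljust(capacity, b"\0")))
            pos += best_len
        else:
            # s <= capacity: fill the block with `capacity` raw bytes instead.
            raw = data[pos:pos + capacity]
            blocks.append((RAW, len(raw), raw.ljust(capacity, b"\0")))
            pos += len(raw)
    return blocks


def decompress_blocks(blocks) -> bytes:
    """Rebuild the original data from the fixed-size blocks."""
    out = bytearray()
    for flag, input_len, payload in blocks:
        if flag == COMPRESSED:
            # decompressobj stops at the end of the zlib stream; the padding
            # is left in its unused_data attribute.
            out += zlib.decompressobj().decompress(payload)
        else:
            out += payload[:input_len]
    return bytes(out)
```

Re-compressing the growing prefix is quadratic and only suitable for a demonstration; a streaming compressor that can report its output size as input is fed would avoid the repeated work.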

In a second aspect, the present application provides a data compression apparatus, which may be any device capable of executing the data compression method provided by the present application, such as an image generation server, a personal computer, or a mobile terminal. The data compression apparatus comprises functions, modules, units, or means for performing the method of the first aspect or of any implementation or design of the first aspect; optionally, it may also comprise functions, modules, units, or means for reading and decompressing compressed blocks. The functions, modules, units, or means may be implemented by software, or by hardware executing corresponding software.

In a possible design, the data compression apparatus includes a transceiver module and a processing module, and the transceiver module and the processing module may correspond to the first aspect or to any implementation manner or design of the first aspect, and are not described herein again.

In another possible design, the data compression apparatus includes a processor and may further include a communication interface. The communication interface is configured to send and receive signals, and the processor executes program instructions to implement the method of the first aspect or of any possible implementation manner or design of the first aspect. The data compression apparatus may further comprise one or more memories coupled to the processor, which store the computer program instructions and/or data necessary to implement the functions of that method. The one or more memories may be integrated with the processor or separate from the processor, which is not limited in the present application. The processor may execute the computer program instructions stored in the memory to perform that method.

In a third aspect, the present application provides a chip, where the chip may communicate with a memory, or where the chip includes a memory, and the chip executes program instructions stored in the memory to implement the corresponding functions involved in the first aspect or any possible implementation manner or design of the first aspect.

In a fourth aspect, the present application provides a computer storage medium having stored thereon computer-readable instructions which, when executed, implement the corresponding functions involved in the first aspect or any possible implementation manner or design of the first aspect.

In a fifth aspect, the present application further provides a computer program product comprising a software program which, when run on a computer, implements the corresponding functions involved in the first aspect or any possible implementation manner or design of the first aspect.

Drawings

FIG. 1 is a schematic diagram of a prior art compression process;

FIG. 2 is a schematic diagram of a storage form of compressed blocks output in a conventional compression processing manner in a magnetic disk;

FIG. 3 is a schematic diagram of a compressed file system to which the present application is applicable;

FIG. 4A is a diagram illustrating a first situation that occurs when compressed blocks output in a conventional compression processing manner are stored in a storage medium;

FIG. 4B is a diagram illustrating a second situation that occurs when compressed blocks output in a conventional compression processing manner are stored in a storage medium;

FIG. 5 is a schematic flowchart of a data compression method according to an embodiment of the present application;

FIG. 6A is a first schematic diagram of a storage form, in a disk, of compressed blocks with the same first capacity in an embodiment of the present application;

FIG. 6B is a second schematic diagram of a storage form, in a disk, of compressed blocks with the same first capacity in an embodiment of the present application;

FIG. 6C is a third schematic diagram of a storage form, in a disk, of compressed blocks with the same first capacity in an embodiment of the present application;

FIG. 7 is a schematic diagram of a compression process flow in an embodiment of the present application;

FIG. 8 is a first scenario diagram illustrating a compression processing flow under special circumstances in the embodiment of the present application;

FIG. 9A is a diagram illustrating a second scenario of a compression process flow under special circumstances in the embodiment of the present application;

FIG. 9B is a schematic diagram of a modified compression process flow under a special situation in the embodiment of the present application;

FIG. 10 is a diagram illustrating an example of a situation that is unlikely to occur in the correspondence relationship between compressed blocks and data blocks in the embodiment of the present application;

FIG. 11 is a diagram illustrating a correspondence relationship between data blocks and compressed blocks in an embodiment of the present application;

FIG. 12 is a diagram illustrating a second index corresponding to a compressed block in an embodiment of the present application;

FIG. 13 is a diagram showing the format of a sequence in an LZ4 compression algorithm in an embodiment of the present application;

FIG. 14 is a first diagram illustrating an original character sequence input in an embodiment of the present application;

FIG. 15 is a second diagram of an original character sequence input in the embodiment of the present application;

FIG. 16 is a third diagram illustrating an original character sequence inputted in the embodiment of the present application;

FIG. 17 is a first schematic diagram of a data compression apparatus according to an embodiment of the present application;

FIG. 18 is a second schematic diagram of a data compression apparatus according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application clearer, the present application will be further described in detail with reference to the accompanying drawings.

First, an application scenario to which the present application is applicable will be described.

The present application is applicable to a compressed file system. For example, as shown in FIG. 3, the file system may include a metadata area and a data area, and the metadata area includes a superblock and an index node (inode) area. The superblock may contain the control information, data structures, and the like of the file system, and the inode area may contain file description information such as file length and file type; the file type may be, for example, a regular file (regular inode), a directory file (directory inode), a symbolic link (symlink inode), or a special file (special inode). The data stored in the data area may be data obtained by performing file-level compression based on a lossless compression technique. The compressed data in the data area is stored in the physical storage space of a storage medium (e.g., a magnetic disk or a flash memory) as a set of disk blocks. The data of the same file may be stored in consecutive disk blocks, or may be stored across discontinuous disk blocks. For example, in FIG. 3, disk blocks A1 to An store data of the same file, while disk blocks B1 to Bx+1 and disk blocks C1 and C2 may store data of different files in a spanning manner. It should be understood that the introduction of the concept of disk blocks in this application does not mean that the storage medium is limited to disks; a disk block denotes one of the small units obtained by dividing the physical storage space of the storage medium.

At present, the compression algorithms used by compressed file systems generally compress original data a fixed number of bytes at a time, in sequence. Because data differs in attributes such as content and type, the actual compression ratio also differs, so the byte size of the data obtained by compressing a fixed number of bytes of original data is not fixed. The compression ratio is the ratio of the size of the data after compression to the size of the original data before compression; for example, if compressing 100 MB of original data yields 90 MB of data, the compression ratio is 90/100 × 100% = 90%. On this basis, the compressed data in the data area can be logically regarded as a sequence of compressed blocks of different sizes. Because the size of the compressed blocks is not fixed, the compressed blocks are laid out arbitrarily when they are stored in disk blocks in the physical storage space. For example, the two cases shown in FIG. 4A and FIG. 4B may occur:

In the first case, the fixed number of bytes of original data compressed each time is small. Because few bytes are input each time, the input contains few regular (repeating) bytes, so although each compressed block obtained is small, the compression ratio is actually high. For example, for the character string abcdabcdefabcdeff, if the input is fixed at 4 characters each time, abcd contains no regularity and cannot be compressed, so the output compressed block is still 4 characters: the compressed block is small, but the compression ratio is high.
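The effect of a small fixed input can be made concrete with a short sketch that compresses independent fixed-size chunks and reports the overall compression ratio as defined above. `zlib` is used only as a readily available stand-in compressor, and its fixed per-stream overhead (header and checksum) exaggerates the penalty for tiny inputs; the function name `fixed_input_ratio` is hypothetical.

```python
import zlib


def fixed_input_ratio(data: bytes, chunk_size: int) -> float:
    """Compress `data` in independent chunks of `chunk_size` input bytes and
    return compressed size / original size."""
    compressed = sum(
        len(zlib.compress(data[i:i + chunk_size]))
        for i in range(0, len(data), chunk_size)
    )
    return compressed / len(data)


# The example string from the text, repeated so that regularity exists
# beyond any single 4-byte chunk.
sample = b"abcdabcdefabcdeff" * 64
small = fixed_input_ratio(sample, 4)            # 4 bytes per input chunk
large = fixed_input_ratio(sample, len(sample))  # whole input at once
```

With 4-byte inputs each chunk holds no repetition the compressor can exploit, so the ratio exceeds 100%, while compressing the whole input at once yields a ratio far below 100%.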

In this case, the compressed blocks are likely to take the storage form shown in FIG. 4A when stored on the storage medium: disk block A stores the complete compressed block 1 and the first half of compressed block 2, disk block B stores the second half of compressed block 2 and the first half of compressed block 3, and disk block C stores the second half of compressed block 3 and the first half of compressed block 4.

In the storage form shown in case one, random read amplification, which may also be referred to as input/output (IO) amplification, easily occurs. For example, because stored content is read from the storage medium in units of disk blocks, reading compressed block 3 requires reading the contents of both disk block B and disk block C, so some redundant data is read as well. Moreover, the additionally read redundant data cannot be decompressed because it does not form a complete compressed block; to decompress it successfully, the contents of the disk block preceding disk block B and of the disk block following disk block C would also have to be read, which places an additional burden on the device.
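The amount of redundant data read in such a layout can be estimated with a small sketch. It assumes that compressed blocks are packed back to back from offset 0 and that reads happen in whole disk blocks; the function name and the example sizes are hypothetical.

```python
def disk_blocks_for_read(sizes, index, disk_block_size):
    """Return (first_disk_block, last_disk_block, wasted_bytes) for reading
    compressed block `index` (0-based) when blocks are packed back to back.

    `sizes` lists the byte size of every compressed block in storage order.
    """
    start = sum(sizes[:index])
    end = start + sizes[index]                 # exclusive end offset
    first = start // disk_block_size
    last = (end - 1) // disk_block_size
    bytes_read = (last - first + 1) * disk_block_size
    return first, last, bytes_read - sizes[index]


# Variable-size blocks, similar in spirit to FIG. 4A (sizes are made up).
variable = disk_blocks_for_read([3000, 2500, 3500, 4000], 2, 4096)
# Fixed-size blocks equal to the disk block size.
aligned = disk_blocks_for_read([4096] * 4, 2, 4096)
```

In the variable-size layout, reading the third compressed block touches disk blocks 1 and 2 (B and C) and drags in 4692 bytes that belong to its neighbors; in the aligned layout, exactly one disk block is read and nothing is wasted.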

In case two, the fixed number of bytes of original data compressed each time is large. The compression ratio can be lower than in case one, but each compressed block obtained is large. When stored on the storage medium, a compressed block is likely to take the storage form shown in FIG. 4B: compressed block i is stored in disk block A through disk block X, where the start of compressed block i is not aligned with the start of disk block A and the end of compressed block i is not aligned with the end of disk block X; that is, the front of disk block A also stores content of the previous compressed block i-1, and the tail of disk block X also stores content of the next compressed block i+1.

In the storage form shown in case two, random read amplification may also occur. Because stored content is read from the storage medium in units of disk blocks, reading compressed block i requires reading the contents of disk block A through disk block X, so some redundant data is read as well. Moreover, the additionally read redundant data cannot be decompressed because it does not form a complete compressed block; to decompress it successfully, the contents of the disk block preceding disk block A and of the disk block following disk block X would also have to be read, which places an additional burden on the device. In addition, when compressed block i is decompressed, decompressing the data toward its end depends on the data at its front (stored starting in disk block A). To decompress the later data correctly, the earlier data must remain resident in memory, so the memory usage during decompression may also be large.

It can be seen from the above two cases that, with the fixed-input approach, random read amplification may occur regardless of whether the fixed number of input bytes is large or small: some redundant data is additionally read, and the additionally read data cannot be decompressed. To solve the problems that may arise in cases one and two, the present application provides a data compression method and apparatus that control the size of the compressed blocks produced by compression so that it is fixed, which effectively avoids the problems of the fixed-input approach.

The technical solution provided by the present application is described in detail below with reference to specific embodiments.

Referring to FIG. 5, a schematic flow chart of a data compression method provided in the embodiment of the present application is shown, including the following steps:

Step 501, obtaining original data to be compressed.

In the embodiment of the present application, the execution subject of the data compression process may be any device capable of executing the data compression method provided by the present application, for example, an image generation server, a personal computer, or a mobile terminal. Illustratively, the obtained original data to be compressed is, for example, source code and resource files of some operating systems. The specific obtaining manner may be various manners, such as copying or downloading the obtained original data from other servers or storage devices, which is not limited in this application.

Step 502, inputting the original data into a compression module for compression, and then sequentially outputting n compression blocks containing the compressed data, wherein the first capacity of each compression block is the same.

In the embodiment of the application, a compression algorithm is configured in the compression module, and the data obtained by compressing the original data is output as n compressed blocks of fixed size, where n is a positive integer. A fixed size means that the capacity of each compressed block is fixed. For the sake of distinction, the capacity of a compressed block is referred to as the first capacity, which is the number of bytes of compressed data that the compressed block can contain.

Step 503, storing the n compressed blocks in m disk blocks in the storage medium.

The storage medium comprises m disk blocks, where m is a positive integer, and every disk block has the same size, that is, the same capacity. For the sake of distinction, the capacity of a disk block is referred to as the second capacity, which is the number of bytes of data stored by the disk block. Hereinafter, for convenience of description, the embodiments of the present application are described by taking a magnetic disk as the storage medium, but the storage medium in the embodiments of the present application is not limited to a magnetic disk and may be any other storage medium that supports writing and reading, such as a flash memory.

In the embodiment of the present application, compressed blocks with the same first capacity can be stored in the disk in two forms:

Case one, where the second capacity is p times the first capacity, p complete compressed blocks can be stored in one complete disk block. Wherein p is a positive integer and p is less than or equal to n.

For example, as shown in FIG. 6A, assuming that n is 10, m is 5, and p is 2, compressed blocks 1 to 10 may be stored in disk blocks A to E, with two complete compressed blocks stored in each complete disk block. In this case, to read compressed block 3, its entire content can be read from disk block B, so compressed block 3 can be decompressed successfully. In addition, the complete content of compressed block 4 is also read from disk block B, so compressed block 4 can be decompressed successfully as well, and the decompressed data can be put to use.

Case two, where the first capacity is q times the second capacity, one complete compressed block may be stored across q complete disk blocks. Wherein q is a positive integer and q is less than or equal to m.

For example, as shown in FIG. 6B, assuming that n is 2, m is 10, and q is 5, compressed block 1 may be stored in disk blocks A to E, and compressed block 2 may be stored in disk blocks F to J. In this case, the content of compressed block 1 can be read from disk blocks A to E, and the content of compressed block 2 can be read from disk blocks F to J. Because every compressed block read is complete, it can be decompressed successfully, and the decompressed data can be put to use.

Illustratively, one storage form fits both case one and case two: referring to FIG. 6C, assuming that n is 5, m is 5, and p is 1 (or q is 1), compressed blocks 1 to 5 may be stored in disk blocks A to E, respectively. In this case, because disk blocks and compressed blocks correspond one to one, whichever compressed block needs to be read, the corresponding disk block can be read directly; no redundant data is read from the disk block, and the data read can be decompressed successfully.
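The address arithmetic implied by the two storage forms can be sketched as below. The function name `locate`, the 0-based numbering, and the concrete capacities in the examples (2048, 4096, and 20480 bytes) are assumptions chosen to mirror the p = 2, q = 5, and one-to-one examples; the application does not fix these values.

```python
def locate(block_index, first_capacity, second_capacity):
    """Return (first_disk_block, last_disk_block, offset_in_first_disk_block)
    for compressed block `block_index`; all numbering is 0-based."""
    if second_capacity % first_capacity == 0:
        # Case one: p compressed blocks per disk block.
        p = second_capacity // first_capacity
        disk = block_index // p
        return disk, disk, (block_index % p) * first_capacity
    if first_capacity % second_capacity == 0:
        # Case two: each compressed block spans q whole disk blocks.
        q = first_capacity // second_capacity
        return block_index * q, block_index * q + q - 1, 0
    raise ValueError("capacities must be integer multiples of each other")


# p = 2: compressed blocks 3 and 4 (indices 2 and 3) both sit in disk block B (index 1).
block3 = locate(2, 2048, 4096)
block4 = locate(3, 2048, 4096)
# q = 5: compressed block 2 (index 1) occupies disk blocks F to J (indices 5 to 9).
block2 = locate(1, 20480, 4096)
```

Because both capacities are fixed and one divides the other, the location of any compressed block follows from its number alone, and every read begins and ends on a disk block boundary or on a compressed block boundary inside a single disk block.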

In the manner provided by the present application, the original data is compressed with the size of each output compressed block limited, so that every output compressed block has the same first capacity. When compressed blocks with the same first capacity are stored in the disk, either at least one complete compressed block is stored in one complete disk block, or one complete compressed block is stored in at least one complete disk block.

In addition, in the embodiments of the present application, fixing the size of the output compressed block makes it easier to achieve a lower compression ratio than the existing fixed-input mode. For example, assuming that the output compressed block is 4KB, the original data input to the compression module for one compressed block is at least 4KB; that is, in the worst case the compressed block is filled with 4KB of original data. In the conventional fixed-input mode, if the amount of input original data is small, it is likely to contain few regular bytes and is therefore difficult to compress. For example, 3KB of original data may be input and a 4KB compressed block output after compression, resulting in a high compression ratio. If the amount of input original data is large, a lower compression ratio can be achieved, but the output compressed block is relatively large, and random read amplification easily occurs in subsequent reads. By comparison, the manner provided by the present application can both achieve a lower compression ratio and avoid random read amplification.

In addition, to avoid a large memory footprint during decompression, the first capacity of the compressed block may be kept as small as possible in the embodiments of the present application. When a compressed block is read from a disk block and decompressed, decompressing the data in the second half of the compressed block depends on the data in the first half, so the first half must also be held in memory. Because the first capacity of the compressed block is small, the first-half data held in memory is also small, and the memory footprint is reduced as far as possible.

In addition, to save storage space, if at least two of the n output compressed blocks contain the same compressed data, the contents of those compressed blocks are identical, so when they are stored in the disk they may be given the same storage location. That is, the at least two compressed blocks share one storage location, which deduplicates the compressed blocks.
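The sharing of one storage location by identical compressed blocks can be sketched as follows. This is only an illustrative sketch: the source does not say how identical blocks are detected, so the use of a SHA-256 digest as the lookup key, the class name `DedupStore`, and the list that stands in for disk storage locations are all assumptions.

```python
import hashlib


class DedupStore:
    """Toy store in which identical compressed blocks share one location."""

    def __init__(self):
        self._location_by_digest = {}  # content digest -> storage location
        self.slots = []                # stands in for locations on the disk

    def store(self, block: bytes) -> int:
        """Store a compressed block and return its storage location."""
        digest = hashlib.sha256(block).digest()
        location = self._location_by_digest.get(digest)
        if location is None:
            # First time this content is seen: allocate a new location.
            location = len(self.slots)
            self.slots.append(block)
            self._location_by_digest[digest] = location
        return location
```

Storing the same compressed block twice returns the same location and occupies only one slot. A production design would likely also compare the bytes on a digest match rather than rely on the digest alone.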

Next, the process of inputting the original data to the compression module for compression processing and outputting n compressed blocks is described.

In one implementation, the original data may be sequentially input to the compression module in units of bytes to be compressed, and the process flow shown in fig. 7 may be repeatedly executed during the compression process until all bytes included in the original data are input to the compression module:

Step 701: determining that the number of bytes of the data produced by the current compression processing has reached the first capacity.

Step 702: determining whether the number of bytes s of the original data input to the compression module for the current compression processing is greater than the first capacity.

If so, step 703 is executed; otherwise, step 704 is executed.

The determination in step 702 is mainly used to decide whether compressing the current original data yields a compression benefit. When the number of bytes of the original data input to the compression module is greater than the number of bytes of the data after compression, there is a compression benefit; when it is less than or equal to the number of bytes of the data after compression, there is no compression benefit. If there is a compression benefit, step 703 is performed; otherwise, step 704 is performed.

Step 703: including the data produced by the current compression processing in one compressed block and outputting the compressed block.

When there is a compression benefit, the data contained in the output compressed block is the data produced by the current compression processing.

Step 704: continuing to input t bytes of original data to the compression module, and including the s bytes of original data and the t bytes of original data in one compressed block and outputting the compressed block, where s and t are positive integers and t is the difference between the first capacity and s.

When there is no compression benefit, the data contained in the output compressed block is the original data input to the compression module this time, and the number of bytes of that original data is (s + t).

When outputting a compressed block, the byte number of the compressed data and the byte number of the original data input to the compression module for compression can be counted again. For example, one counter may be configured to count the number of bytes of compressed data, another counter may be configured to count the number of bytes of original data input to the compression module, and the two counters are cleared when a compressed block is output. Through the design, the number of bytes of the compressed data and the number of bytes of the input original data in each compression processing process can be counted conveniently. Of course, in practical application, statistics may be performed in other manners, which is not limited in the present application.

The above process flow is illustrated with a simple example. Assume that the first capacity of the compressed block is 4KB, that is, a compressed block can contain at most 4KB. When the data produced by the current compression processing reaches 4KB, if the original data input to the compression module for this compression processing is 5KB, the compression rate of this compression processing is 80%, so there is a compression benefit; in this case, the 4KB of compressed data can be included in one compressed block and output. If the original data input to the compression module for this compression processing is 3KB, the compression rate is 133%, so there is no compression benefit. In this case, the 3KB of original data already input to the compression module is placed in the compressed block, and a further 1KB of original data is input to the compression module. The resulting 4KB of original data is placed directly in the compressed block without being compressed, and the compressed block, which contains 4KB of original data, is then output.
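The decision made in steps 702 to 704 can be sketched as a small function. This is a minimal sketch under stated assumptions: the name `close_block`, the returned tuple, and the mode strings are illustrative and do not appear in the source; the compression rate is computed as the compressed size divided by the original size, which is consistent with the 80% and 133% figures in the example.

```python
KB = 1024


def close_block(raw_in, first_capacity):
    """Decide how to fill a block once the compressed output reaches first_capacity.

    raw_in is s, the number of original bytes consumed for this block so far.
    Returns (mode, t, rate): the fill mode, the number of additional original
    bytes to take in raw mode, and the compression rate of this pass.
    """
    rate = first_capacity / raw_in
    if raw_in > first_capacity:
        # Step 703: compression benefit, output the compressed data.
        return "compressed", 0, rate
    # Step 704: no benefit, top up with t more original bytes and store raw.
    return "raw", first_capacity - raw_in, rate


print(close_block(5 * KB, 4 * KB))  # benefit: rate 0.8
print(close_block(3 * KB, 4 * KB))  # no benefit: t is 1 KB, rate about 1.33
```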

In specific implementation, after the original data is sequentially input to the compression module in units of bytes for compression, in addition to the processing flow shown in fig. 7, there may be other special cases:

Case 1, when the remaining data that is not input to the compression module for compression processing in the original data is less than or equal to the first capacity, the processing mode of the last compression block that is output: if the byte number of the compressed data of the residual data is less than or equal to the byte number of the residual data, the compressed data of the residual data is contained in the last compressed block and output; and if the byte number of the compressed data of the residual data is larger than that of the residual data, the residual data is contained in the last compressed block and output.

For example, assume that the first capacity is 4KB and that 3KB of the original data has not yet been input to the compression module for compression. If the remaining 3KB of original data is 5KB after compression, the last compressed block should contain the remaining 3KB of original data. If the remaining 3KB of original data is 3.5KB after compression, the last compressed block should likewise contain the remaining 3KB of original data. If the remaining 3KB of original data is 2KB after compression, the last compressed block may contain the 2KB of compressed data.

Case 2: when the number of bytes of the original data input to the compression module for compression equals a preset value, if the number of bytes of the compressed data has not reached the first capacity, the compressed data is included in one compressed block and output.

In case 2, an upper limit is set on the number of bytes of original data input to the compression module for each compression processing. The upper limit may be a preset value that represents the maximum number of bytes of original data input to the compression module for one compression processing. When the number of bytes of input original data equals the preset value, no further original data is input; instead, the compressed data is included in one compressed block and output. For example, referring to fig. 8, the original data may be logically divided into several data blocks, and the number of bytes in each data block in the embodiments of the present application may equal the first capacity, that is, the data blocks and the compressed blocks are the same size. The first half of the original data of data blocks s+1 to k is sequentially input to the compression module for compression, and the compressed data is filled into compressed block c+2. If the number of bytes of the input part of data blocks s+1 to k has already reached the preset value but the compressed data filled into compressed block c+2 has not yet reached the first capacity, compressed block c+2 may be output directly, and its remaining capacity (the shaded portion in fig. 8) is not filled with other data.

Case 3: when the compressed block output in the current compression processing reaches the first capacity, if the number of bytes of original data input from the last data block corresponding to the output compressed block is smaller than a set threshold, the original data input to the compression module this time may be compressed again. In the recompression, the original data other than that input from the last data block is still compressed and filled into the current compressed block, which is then output, and the original data input from the last data block is compressed and filled into the next compressed block, which is then output.

For example, referring to fig. 9A, the original data may be logically divided into several data blocks. Starting from the second half of data block s+1 (indicated by dashed line 1 in fig. 9A), the original data is compressed and filled into compressed block c+2. When the compressed data filled into compressed block c+2 reaches the first capacity, the input original data may have reached the first half of data block s+3 (indicated by dashed line 2 in fig. 9A). Data block s+3 is then the last data block corresponding to the currently output compressed block c+2, and the original data of data block s+3 that has been input to the compression module (the shaded portion in fig. 9A) is smaller than the set threshold. In this case, the compressed data already filled into compressed block c+2 may be discarded, and the original data may be compressed again starting from the second half of data block s+1. Further, as shown in fig. 9B, the original data from the second half of data block s+1 through data block s+2 may be compressed and filled into compressed block c+2. Compressed block c+2 then still has residual capacity (the shaded portion in fig. 9B); this residual capacity is left unused for the time being, a next compressed block is newly generated, and the original data starting from data block s+3 is compressed and filled into compressed block c+3. Of course, in practical applications, when the situation described in case 3 occurs, the processing is not limited to compressing the original data again, and other manners may be adopted, which is not limited in the present application.

By the design described in case 3, the size of the differential packet in the file system can be reduced as much as possible, and when some changes occur to the original data, the influence on the storage form of the compressed block stored in the current disk can be reduced as much as possible.
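Setting case 3 aside, the flow of fig. 7 together with cases 1 and 2 can be approximated in a few lines. The sketch below is not the byte-by-byte procedure described above: it swaps in a binary search over input prefixes using Python's `zlib`, which is simpler to write but is a different technique. The function names, the `(mode, payload)` tuple used to mark a block as compressed or raw, and the default `capacity` and `max_input` values are assumptions made for illustration.

```python
import zlib


def fitting_prefix(data, capacity, max_input):
    """Length of a prefix of data whose zlib output fits in capacity bytes.

    Uses binary search, which assumes the compressed size grows with the input
    size. That holds only approximately, so the returned prefix always fits
    but is not guaranteed to be the longest one that does.
    """
    hi = min(len(data), max_input)  # case 2: cap the input per block
    if len(zlib.compress(data[:hi])) <= capacity:
        return hi
    lo = 0  # the empty prefix compresses to a few bytes and fits
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if len(zlib.compress(data[:mid])) <= capacity:
            lo = mid
        else:
            hi = mid
    return lo


def pack(data, capacity=4096, max_input=64 * 1024):
    """Split data into (mode, payload) blocks with len(payload) <= capacity."""
    blocks = []
    pos = 0
    while pos < len(data):
        rest = data[pos:]
        s = fitting_prefix(rest, capacity, max_input)
        packed = zlib.compress(rest[:s])
        is_tail = s == len(rest)
        if s > capacity or (is_tail and len(packed) <= s):
            # Compression benefit, or a tail that is worth compressing (case 1).
            blocks.append(("compressed", packed))
            pos += s
        else:
            # No benefit: store up to capacity original bytes instead (step 704).
            raw = rest[:capacity]
            blocks.append(("raw", raw))
            pos += len(raw)
    return blocks


def unpack(blocks):
    """Rebuild the original data from the blocks produced by pack."""
    return b"".join(
        zlib.decompress(payload) if mode == "compressed" else payload
        for mode, payload in blocks
    )
```

Every payload produced this way is at most `capacity` bytes, so each block can be placed in a fixed-size slot and decompressed on its own. A block that stops at `max_input` before filling its slot corresponds to the unfilled remaining capacity described in case 2.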

In the embodiments of the present application, after the n compressed blocks are stored in the disk, an index may be created for the compressed file system to make it easier to find and access the compressed block in which given original data is located. The index may be created with data blocks as objects or with compressed blocks as objects. The created index may be embedded in the metadata area of the compressed file system or in the data area of the compressed file system. The manner in which the index is created is described in detail below.

Manner one: creating an index with data blocks as objects

Since the original data may be logically divided into a plurality of data blocks, the original data may be divided into i data blocks in the embodiments of the present application, where the number of bytes in each data block equals the first capacity. For example, if the first capacity is 4KB, 64KB of original data can be divided into 16 data blocks, each containing 4KB of data.

For a jth data block in the i data blocks, a first index may be established for the jth data block, and a corresponding relationship between the established first index and the jth data block is recorded, where the first index is used to identify a storage location of data included in the jth data block in the disk. Wherein i is a positive integer, and j is any positive integer less than or equal to i.

As can be seen from the compression processing flow shown in fig. 7, the data filled into a compressed block falls into only two cases: compressed data, filled when there is a compression benefit, and original data, filled when there is no compression benefit. Consequently, a correspondence between compressed blocks and data blocks like the one shown in fig. 10 does not occur. In fig. 10, data block s contains data decompressed from compressed block c through compressed block c+2, and compressed block c+1 corresponds to the shaded portion of data block s. The data in the shaded portion is clearly smaller than the data filled into compressed block c+1, which means that the data actually filled into compressed block c+1 is neither compressed data with a compression benefit nor original data, so this situation does not occur in the embodiments of the present application. Therefore, the jth data block in the embodiments of the present application contains decompressed data from at most two compressed blocks.

In the first example of the present application, when the jth data block includes data decompressed from the next compressed block, the first index corresponding to the jth data block contains:

A first identification bit or a second identification bit; a block number of the next compressed block; an intra block offset of the jth data block.

The first identification bit identifies that the data in the next compressed block is original data, the second identification bit identifies that the data in the next compressed block is compressed data, and the intra-block offset of the jth data block is the position, within the jth data block, of the header of the decompressed data of the next compressed block. As for the block number of the compressed block, the compressed blocks stored in the disk may be numbered according to the actual division of the physical storage space of the disk, so that the block number marks the storage location of the compressed block in the disk.

Refer to fig. 11 for a schematic diagram of the relationship between data blocks and compressed blocks. From the viewpoint of the data flow, data blocks such as data block s, data block s+1, data block s+2, data block s+3, and data block s+6 each contain the header of the decompressed data of a newly generated compressed block. In the example shown in fig. 11, the head of the decompressed data of the newly generated compressed block is connected to the tail of the decompressed data of the current compressed block. In practical applications, the header of the decompressed data of a newly generated compressed block may not be connected to the tail of the decompressed data of the current compressed block, which is not limited in this application.

For example, the start position of data block s corresponds to the start position of compressed block c. Compressed block c can be regarded as the next compressed block newly generated after compressed block c-1, and data block s contains the header of the data decompressed from compressed block c. As another example, the start position of data block s+1 corresponds to the start position of compressed block c+1. Compressed block c+1 can be regarded as the next compressed block newly generated after compressed block c, and data block s+1 contains the header of the data decompressed from compressed block c+1. As a further example, the position shown by the dotted line in data block s+3 corresponds to the start position of compressed block c+3, so compressed block c+3 can be regarded as the next compressed block newly generated after compressed block c+2, and data block s+3 contains the header of the data decompressed from compressed block c+3.

Such a data block may be referred to as a first block, defined as a data block that contains the header of the data decompressed from a next compressed block. For a first block, the first index given in the first example above may be used. A first block can further be in either a non-compressed mode or a compressed mode. The content of the first index is described in detail below with reference to these two modes.

In case 1, when there is no compression benefit, the data in the next compressed block contained in the data block is original data, that is, the data block is a first block in non-compressed mode.

For example, for data block s, data block s+2, and the like shown in fig. 11, the data obtained by compressing such data blocks has no compression benefit, so the data in compressed block c contained in data block s is original data, and the data in compressed block c+2 contained in data block s+2 is original data. Therefore, data block s and data block s+2 can be regarded as first blocks in non-compressed mode.

Illustratively, the first index corresponding to the data block s is shown in table 1:

TABLE 1

The intra-block offset of data block s is zero because the start position of data block s corresponds to the header of the decompressed data of compressed block c. The first identification bit identifies that the data in compressed block c is original data, that is, that data block s is a first block in non-compressed mode.

Illustratively, the first index corresponding to the data block s +2 is shown in table 2:

TABLE 2

The position shown by the dotted line in data block s+2 corresponds to the header of the decompressed data of compressed block c+2, so the intra-block offset of data block s+2 is not zero. After the original data following the dotted line in data block s+2 is input to the compression module, the output is still original data, and the output original data is filled into compressed block c+2. The first identification bit identifies that the data in compressed block c+2 is original data, that is, that data block s+2 is a first block in non-compressed mode.

In case 2, when there is a compression benefit, the data in the next compression block included in the data block is the data after compression processing, that is: the data block is the first block and the first block is the compressed mode.

For example, as shown in fig. 11, the data block s +1, the data block s +3, and the like, the data of such data block after the compression processing has a compression benefit, so the data in the compression block c +1 included in the data block s +1 is the data after the compression processing, and the data in the compression block c +3 included in the data block s +3 is also the data after the compression processing. In view of the above features, the data blocks s +1 and s +3 can be regarded as the first block, and the first block is the compressed mode.

Illustratively, the first index corresponding to the data block s +1 is shown in table 3:

TABLE 3

Wherein, since the start position of the data block s +1 corresponds to the start position of the decompressed data of the compression block c +1, the intra-block offset of the data block s +1 is zero. The second flag may identify that the data in the compressed block c +1 is the data after compression processing, that is, the data block s +1 is the first block and the first block is the compression mode.

Illustratively, the first index corresponding to the data block s +3 is shown in table 4:

TABLE 4

The position indicated by the dotted line in the data block s +3 corresponds to the header of the data after decompression processing of the compression block c +3, so that the intra-block offset of the data block s +3 is not zero. After the original data after the position indicated by the dotted line in the data block s +3 is input to the compression module, the data after the compression processing is output from the compression module, and the data after the compression processing is filled in the compression block c + 3. The second flag may identify that the data in the compressed block c +3 is the data after compression processing, that is, the data block s +3 is the first block and the first block is the compression mode.

In a second example of the present application, when a jth data block only includes data decompressed in a current compressed block, a content included in a first index corresponding to the jth data block is:

A third identification bit; a block distance with respect to a first data block located before a jth data block; a block distance with respect to a second data block located after the jth data block; an intra block offset of the first data block.

And the third identification bit is used for identifying the data in the current compression block as the data obtained after compression processing. The first data block includes data decompressed in the current compressed block, or the first data block includes data decompressed in the current compressed block and data decompressed in the previous compressed block. The intra-block offset of the first data block is the position of the header of the decompressed data of the current compression block in the first data block. The second data block includes the data decompressed in the current compression block and the data decompressed in the next compression block. In fact, the first data block and the second data block may also be understood as the first blocks described in the above first example.

Continuing with FIG. 11, a schematic diagram of the relationship between data blocks and compressed blocks is shown. For data blocks such as data block s +4 and data block s +5, from the viewpoint of data stream, such data blocks only contain decompressed data in the current compressed block, and do not contain a header of decompressed data in the next newly generated compressed block. For example, the data block s +4 and the data block s +5 both contain the decompressed data of the current compressed block c +3, and do not contain the header of the decompressed data of the next new compressed block c + 4.

Such a data block may be referred to as a non-first block, defined as a data block that contains decompressed data of the current compressed block but does not contain the header of the decompressed data of the next compressed block. A non-first block contains part of the decompressed data of the current compressed block. The first block located before the non-first block also contains decompressed data of that compressed block, as do any adjacent non-first block located after it and the first block located after it. A non-first block can therefore only be in compressed mode. For a non-first block, the first index given in the second example above may be used, and its content is described in detail below.

Illustratively, the first index corresponding to the data block s+4 is shown in table 5:

TABLE 5

Intra-block offset of first data block	Third identification bit	Block distance relative to first data block	Block distance relative to second data block

The intra-block offset of the first data block is the position indicated by the dotted line in data block s+3 shown in fig. 11, which corresponds to the header of the decompressed data of compressed block c+3. The block distance relative to the first data block is the block distance relative to data block s+3, and the block distance relative to the second data block is the block distance relative to data block s+6. Data block s+3 contains decompressed data of compressed block c+3 in addition to decompressed data of compressed block c+2, so data block s+3 is the first block before data block s+4. Data block s+6 contains decompressed data of compressed block c+4 in addition to decompressed data of compressed block c+3, so data block s+6 is the first block after data block s+4. In one manner, the block distance to a first block may be expressed in units of data blocks, in which case the block distance relative to data block s+3 equals 1 data block and the block distance relative to data block s+6 equals 2 data blocks. Alternatively, the block distance may be expressed in bytes; for example, if one data block is 4KB, the block distance relative to data block s+3 equals 4KB and the block distance relative to data block s+6 equals 8KB. The third identification bit identifies that the data in compressed block c+3 is compressed data, that is, that data block s+4 is a non-first block in compressed mode.
Of course, the above-mentioned manner of representing the block distance is merely an exemplary illustration, and in practical applications, the block distance may also be represented in other manners, for example, on the basis of the first manner, the block distance with respect to the first data block is reduced by one, and the block distance with respect to the second data block is increased by one.
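The two ways of expressing the block distance can be illustrated with a short sketch. The function name `block_distances` and the convention of passing data block numbers as plain integers are assumptions made for illustration only.

```python
def block_distances(j, prev_first, next_first, block_size=None):
    """Distances from non-first data block j to the surrounding first blocks.

    prev_first and next_first are the numbers of the first blocks before and
    after j. The result is in data blocks by default, or in bytes when
    block_size is given.
    """
    back, forward = j - prev_first, next_first - j
    if block_size is not None:
        return back * block_size, forward * block_size
    return back, forward


s = 0  # any base number works; only the differences matter
print(block_distances(s + 4, s + 3, s + 6))            # (1, 2) data blocks
print(block_distances(s + 4, s + 3, s + 6, 4 * 1024))  # (4096, 8192) bytes
```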

The first index corresponding to the data block s +5 may also refer to the first index shown in table 5, which is not described here.

Optionally, in a third example of the present application, on the basis of the first index shown in the second example, the first index corresponding to the jth data block may further include the block number of the current compressed block and a fourth identification bit. The fourth identification bit identifies that the data in the current compressed block is compressed data, that is, that the jth data block is in compressed mode. Whether the jth data block is a first block or a non-first block is determined from the block distance relative to the first data block: when that block distance is zero, the jth data block is a first block, and when it is not zero, the jth data block is a non-first block.

Illustratively, continuing to enumerate the first index corresponding to the data block s +4, in the third example, the first index corresponding to the data block s +4 is shown in table 6:

TABLE 6

In this example, the block distance relative to the first data block may be split into two parts, one representing the high x bits and the other representing the low y bits. For example, when the block distance relative to the first data block consists of 8 bits, the high x bits may be the high 4 bits and the low y bits may be the low 4 bits. Of course, the above example is merely illustrative, and the specific implementation is not limited to it. Provided that the total number of bits occupied by the block distance relative to the first data block, the intra-block offset of the first data block, and the block distance relative to the second data block remains unchanged, the values of x and y can be configured according to actual requirements. In addition, because the block number of compressed block c+3 is added to the first index, the position of the corresponding compressed block can be determined directly from the first index corresponding to data block s+4, without looking up the first index corresponding to data block s+3, which makes the lookup simpler and more efficient. Further, for case 2 in the first example and for the second example, the first index may also be represented in the manner provided in the third example.
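Splitting a block distance into high and low bits, and joining the parts back together, can be sketched with ordinary bit operations. The function names and the default widths (an 8-bit distance split into 4 high bits and 4 low bits, as in the example above) are assumptions for illustration; the source leaves x and y configurable.

```python
def split_distance(distance, low_bits=4, total_bits=8):
    """Split a block distance into (high part, low part)."""
    if not 0 <= distance < (1 << total_bits):
        raise ValueError("distance does not fit in total_bits")
    high = distance >> low_bits
    low = distance & ((1 << low_bits) - 1)
    return high, low


def join_distance(high, low, low_bits=4):
    """Rebuild the block distance from its high and low parts."""
    return (high << low_bits) | low


high, low = split_distance(0xA7)
print(high, low)                      # 10 7
print(join_distance(high, low))       # 167
```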

It should be noted that the block distance from the second data block referred to in the second example and the third example is mainly used to determine which data block the tail end of the original data of the compressed block after decompression processing corresponds to. In practical applications, the block distance from the second data block may be an optional content, and the first index may include the block distance from the second data block or may not include the block distance from the second data block.

After the first indexes shown in the first to third examples are established for the data blocks, if the original data in the specified byte range needs to be read in the compressed file system subsequently, the data block corresponding to the specified byte range may be determined first, and then the first index corresponding to the data block is searched, and the position of the compressed block that needs to be read is found by using the first index.

For example, continuing with reference to fig. 11, assume that the original data is 128KB in size and is divided into 32 data blocks of 4KB each. The first data block corresponds to the 1st to 4th KB of the original data, and so on: data block s corresponds to the 17th to 20th KB, data block s+1 to the 21st to 24th KB, data block s+2 to the 25th to 28th KB, data block s+3 to the 29th to 32nd KB, data block s+4 to the 33rd to 36th KB, and so on. Several search scenarios are listed below for specific description.
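The mapping from a requested byte range to the data blocks in this example can be sketched as follows. This is only a minimal illustration that assumes 0-based byte offsets and 0-based block indexes; the names `BLOCK_SIZE` and `data_blocks_for_range` are hypothetical and do not come from the application.

```python
KB = 1024
BLOCK_SIZE = 4 * KB  # size of one data block in this example


def data_blocks_for_range(start: int, end: int) -> range:
    """Return the 0-based indexes of the data blocks covering bytes [start, end)."""
    if not 0 <= start < end:
        raise ValueError("expected 0 <= start < end")
    return range(start // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1)


# The 17th to 20th KB occupy byte offsets [16KB, 20KB), i.e. the fifth data block.
print(list(data_blocks_for_range(16 * KB, 20 * KB)))  # [4]
```

Once the data block is known, the first index recorded for that block is the entry to consult.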

In scenario one, if the original data of 17KB to 20KB needs to be searched, the first index corresponding to the data block s can be found (for example, as shown in table 1 above). By the block number of the compressed block c contained in the first index, it can be determined that the data block s corresponds to the data in the compressed block c, and the storage location of the compressed block c in the disk can be determined, further, the intra-block offset of the data block s is zero, which indicates that all data in the compressed block c corresponds to the data block s from the header of the compressed block c, and since the data in the compressed block c is the original data and the intra-block offset of the data block s is 0, the data in the compressed block c can be directly read, that is, the original data of 17KB to 20KB can be obtained.

In scenario two, if the original data of the 29KB to the 32KB needs to be searched, the first index corresponding to the data block s +3 can be found (for example, as shown in table 4 above). By the block number of the compressed block c +3 contained in the first index, it can be determined that the data block s +3 corresponds to the data in the compressed block c +3, and the storage location of the compressed block c +3 in the disk, and further, according to the intra-block offset of the data block s +3, it can be determined that the part after the position of the dotted line in the data block s +3 corresponds to the data in the compressed block c +3, and the part before the position of the dotted line in the data block s +3 corresponds to the data in the previous compressed block c + 2.

In this case, the first index corresponding to the first head block located before the data block s+3 needs to be searched. In this scenario, the data block s+2 located before the data block s+3 is a head block, so the first index corresponding to the data block s+2 can be searched directly. This first index records the block number of the compressed block c+2 and the intra-block offset of the data block s+2. The storage location of the compressed block c+2 can be determined according to the block number of the compressed block c+2, and, according to the intra-block offset of the data block s+2, it can be determined which data in the compressed block c+2 corresponds to the part before the position of the dotted line in the data block s+3. The compressed block c+2 is then read. Because the data in the compressed block c+2 is original data and the intra-block offset of the data block s+2 is not zero, the data in the compressed block c+2 can be read first, and the data corresponding to the part of the data block s+3 before the position of the dotted line can then be obtained by copying. It should be understood that the copy process can be regarded as a special decompression process, and the original data copied from the compressed block can also be regarded as decompressed data. Next, the data in the compressed block c+3 is read. Because the data in the compressed block c+3 is compressed data, it is decompressed, and the data corresponding to the part after the position of the dotted line in the data block s+3 is obtained from the decompressed data. The data obtained from the two compressed blocks together is the original data of the 29KB to the 32KB.

Of course, in practical applications, if the data block s+2 located before the data block s+3 is not a head block, the head block located before the data block s+2 may be determined according to the block distance relative to the first data block that is included in the first index corresponding to the data block s+2. After this head block is found, the data in the compressed block may be read and decompressed with reference to the above procedure, and the required original data may be obtained from the decompressed data in combination with the intra-block offset of the first data block and the block distance relative to the first data block that are recorded in the first index corresponding to the data block s+2.

In scenario three, if the original data of 33KB to 36KB needs to be searched, the first index corresponding to the data block s+4 can be found.

A. When the first index corresponding to the data block s+4 is as shown in table 5 above, the first data block can be determined to be the data block s+3 from the block distance relative to the first data block. The first index corresponding to the data block s+3 can then be found (for example, as shown in table 4 above), and from the block number of the compressed block c+3 it can be determined that the data block s+4 corresponds to data in the compressed block c+3, as well as the storage location of the compressed block c+3 in the disk. The compressed block c+3 is decompressed. Assuming that the intra-block offset of the data block s+3 in this scenario is 1KB, it can be determined that the first 3KB of the original data decompressed from the compressed block c+3 is the data located after the position shown by the dotted line in the data block s+3, and that the 4th to 7th KB of the original data decompressed from the compressed block c+3 is the data contained in the data block s+4, so the original data to be searched can be obtained. It should be noted that, when the compressed block is decompressed, either all data in the compressed block may be decompressed, or it may be determined during decompression whether the required original data has been obtained, and decompression may be terminated once it has.

B. When the first index corresponding to the data block s+4 is as shown in table 6 above, the first index already includes the block number of the compressed block c+3, so the storage location of the compressed block c+3 can be obtained without searching the first index corresponding to the data block s+3. The original data of 33KB to 36KB can then be obtained after the compressed block c+3 is decompressed with reference to the method described in A above.

In scenario four, if the original data of the 30KB to 31KB needs to be searched, the first index corresponding to the data block s+3 can be found (for example, as shown in table 4 above). By the block number of the compressed block c+3 contained in the first index, it can be determined that the data block s+3 corresponds to the data in the compressed block c+3, and the storage location of the compressed block c+3 in the disk can be determined. The data in the compressed block c+3 is read and decompressed. Assuming that the intra-block offset of the data block s+3 in this scenario is 1KB, that is, the data contained between the start position of the data block s+3 and the position shown by the dotted line is 1KB, it can be determined that the first 2KB of the original data decompressed from the compressed block c+3 is the original data of the 30KB to 31KB.

In scenario five, if the original data of the 29KB needs to be searched, the first index corresponding to the data block s+3 is found first (for example, as shown in table 4). Continuing the example given in scenario four, since the intra-block offset of the data block s+3 is 1KB, it is determined that the data decompressed from the compressed block c+3 does not include the original data of the 29KB. Therefore, the first index corresponding to the first head block located before the data block s+3 needs to be found; in this scenario it is the first index corresponding to the data block s+2 (for example, as shown in table 2). According to the block number of the compressed block c+2 contained in the first index corresponding to the data block s+2, it can be determined that the data block s+2 corresponds to the data in the compressed block c+2, and the storage location of the compressed block c+2 in the disk can be determined. The data in the compressed block c+2 is read; this data is original data. Assuming that the intra-block offset of the data block s+2 in this scenario is 1KB, that is, the data contained between the start position of the data block s+2 and the position shown by the dotted line is 1KB, it may be determined that the first 3KB of the data read from the compressed block c+2 is data contained in the data block s+2, and that the 4th KB of the data read from the compressed block c+2 is the 29KB in the data block s+3.

Of course, in practical applications, searching for the original data of the 29KB to the 31KB may also be supported. In this case, the manners shown in scenario four and scenario five may be combined, and details are not described again in this embodiment.
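The position arithmetic used in scenarios three to five can be sketched as follows. This is an illustration only: the function name and parameters are hypothetical, byte offsets are 0-based, and the sketch does not handle a range that runs past the end of the compressed block into the next one.

```python
KB = 1024
BLOCK_SIZE = 4 * KB  # size of one data block


def position_in_compressed_block(head_offset, block_distance, rel_start, rel_end):
    """Map bytes [rel_start, rel_end) of a data block to positions in the
    decompressed output of its compressed block.

    head_offset: intra-block offset of the head block, i.e. how many bytes of
        the head block still belong to the previous compressed block.
    block_distance: distance from the head block to the requested data block
        (0 if the requested block is itself the head block).
    """
    start = block_distance * BLOCK_SIZE + rel_start - head_offset
    end = block_distance * BLOCK_SIZE + rel_end - head_offset
    if start < 0:
        raise LookupError("range starts in the previous compressed block")
    return start, end


# Scenario three: all of s+4, head block s+3 with a 1KB intra-block offset.
print(position_in_compressed_block(1 * KB, 1, 0, 4 * KB))       # (3072, 7168)
# Scenario four: the 30th to 31st KB inside s+3 itself.
print(position_in_compressed_block(1 * KB, 0, 1 * KB, 3 * KB))  # (0, 2048)
# Scenario five: the 29th KB, resolved through head block s+2 (offset 1KB).
print(position_in_compressed_block(1 * KB, 1, 0, 1 * KB))       # (3072, 4096)
```

A `LookupError` here corresponds to the case in scenario five where the search has to move back to the preceding head block.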

Manner two: establishing an index with compressed blocks as objects

A second index is established for the kth compressed block of the n compressed blocks, and the correspondence between the established second index and the kth compressed block is recorded. The second index identifies the byte range of the original data that was input to the compression module for compression processing and that corresponds to the kth compressed block. k is a positive integer less than or equal to n.

Refer to fig. 12 for an example of the second index corresponding to a compressed block. The byte range of the original data input to the compression module for compression processing that corresponds to each compressed block can be represented by an in-file offset. For example, assume that the first capacities of the compressed blocks 1 to n are all 4KB. If the in-file offset 0 corresponding to the compressed block 1 indicates a byte range of 0 to 8KB, the original data of 0 to 8KB is compressed and filled into the compressed block 1. If the in-file offset 1 corresponding to the compressed block 2 indicates a byte range of 8 to 20KB, the original data of 8 to 20KB is compressed and filled into the compressed block 2. If the in-file offset n corresponding to the compressed block n indicates a byte range of 128 to 136KB, the original data of 128 to 136KB is compressed and filled into the compressed block n.

In the index establishing manner shown in the second manner, a binary search method may be adopted when searching for the compressed block corresponding to the original data. In view of the searching manner, when the second manner is applied to the scene with fewer compressed blocks, the searching efficiency is higher, and because the compressed blocks are fewer, the number of the established indexes is fewer, and the storage space can also be saved.
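The binary search over the second index can be sketched as follows, assuming the index is held as a sorted list of in-file start offsets, one per compressed block. The list `starts` and the function `find_compressed_block` are hypothetical names chosen for illustration; the sample values follow the byte ranges given for compressed blocks 1 and 2 above, and the sketch does not check the upper end of the file.

```python
from bisect import bisect_right

KB = 1024

# Hypothetical second index: starts[k] is the in-file offset at which the
# original data of compressed block k+1 begins.
starts = [0, 8 * KB, 20 * KB]


def find_compressed_block(starts, offset):
    """Return the 1-based number of the compressed block holding byte `offset`."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    # bisect_right counts how many start offsets are <= offset.
    return bisect_right(starts, offset)


print(find_compressed_block(starts, 9 * KB))  # 2
```

Each lookup costs a number of comparisons logarithmic in the number of compressed blocks.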

The above is a specific description of the data compression method provided in the embodiments of the present application. Since the embodiment of the present application requires the generation of a fixed-size compression block, the design of the compression algorithm is adjusted accordingly. Next, a brief description will be given of a design method of a compression algorithm in the embodiment of the present application, taking the LZ4 algorithm as an example.

Before describing the design of the compression algorithm, the compression format of LZ4 is first explained for ease of understanding.

In the LZ4 compression algorithm, if the currently input original data is identical to original data that was input previously, the currently input original data can be matched against the previously input identical original data, so that the identical original data does not need to be represented repeatedly, thereby implementing compression. Specifically, LZ4 is a byte-oriented compression format, and a compressed block may be composed of several sequences, hereinafter referred to as compressed sequences. Each compressed sequence may record a literal of a certain byte length and a sliding window match (match); its format is shown, for example, in fig. 13. One compressed sequence includes a token, a literal, and an offset. Optionally, it may further include a literal length field (lsic_literal, where LSIC stands for linear small-integer code) and a sliding window match length field (lsic_match).

The literal represents the portion of the original data that is stored directly in the compressed sequence, that is, original data that cannot be compressed. The match represents the portion of the compressed sequence that can be matched. The offset represents the distance between the currently input original data and the identical original data that was input previously. The offset is represented by a fixed 2 bytes; if the offset is 0, the byte length of the match is zero, that is, there is no sliding window match. The maximum value of the offset (MAXDISTANCE) is 65535.

The token is represented by 1 byte, that is, 8 bits, where the upper 4 bits can be used to identify the byte length of the literal (token_literal) and the lower 4 bits can be used to identify the byte length of the match (token_match).

The minimum value of token_literal is 0 and the maximum value is 15, so token_literal can indicate a literal of at most 15 bytes. If the byte length of the actual literal is greater than or equal to 15, an lsic_literal is present in the compressed sequence to identify the remaining byte length of the literal beyond 15 bytes.

Likewise, the minimum value of token_match is 0 and the maximum value is 15, so token_match can indicate at most 15 bytes of match length. If the byte length of the actual match is greater than or equal to 15, an lsic_match is present in the compressed sequence to identify the remaining byte length of the match beyond 15 bytes. Note that, because the minimum byte length of a match (MINMATCH) is 4 bytes when the offset is not zero, the byte length of the match is 4 bytes when token_match is 0 and 19 bytes when token_match is 15.
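The field layout described above can be illustrated with a small parser for a single compressed sequence. This is a sketch under stated assumptions, not an implementation from the application: it assumes the field order token, lsic_literal, literal, offset, lsic_match used by the public LZ4 block format, a little-endian 2-byte offset, and it ignores the special rules for the final sequences of a block. The function names are hypothetical.

```python
MINMATCH = 4  # minimum byte length of a match


def read_lsic(buf, pos, nibble):
    """Return (length, new_pos), extending a saturated 4-bit field with LSIC bytes."""
    length = nibble
    if nibble == 15:
        while True:
            extra = buf[pos]
            pos += 1
            length += extra
            if extra != 255:  # a byte below 255 terminates the code
                break
    return length, pos


def parse_sequence(buf, pos=0):
    """Parse one sequence; return (literal, offset, match_len, new_pos)."""
    token = buf[pos]
    pos += 1
    lit_len, pos = read_lsic(buf, pos, token >> 4)      # upper 4 bits
    literal = bytes(buf[pos:pos + lit_len])
    pos += lit_len
    offset = int.from_bytes(buf[pos:pos + 2], "little")
    pos += 2
    match_len, pos = read_lsic(buf, pos, token & 0x0F)  # lower 4 bits
    return literal, offset, match_len + MINMATCH, pos


# Token 0x31: 3 literal bytes, token_match = 1, so the match is 1 + 4 = 5 bytes.
seq = bytes([0x31]) + b"abc" + (3).to_bytes(2, "little")
print(parse_sequence(seq))  # (b'abc', 3, 5, 6)
```

With token_match equal to 15 and a single lsic_match byte of 0, the same parser reports a 19-byte match, which agrees with the note above.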

LZ4 has some special rules for the last compressed sequences that are output. For example, the last 5 bytes of the last compressed sequence are literals and there is no offset field, that is, there is no match. These details are not considered in this embodiment and are therefore not described in detail.

In combination with the above concepts, a part of design adjustment of the compression algorithm is briefly described, and the compression algorithm in the embodiment of the present application is based on a dynamic programming principle, and the specific implementation manner includes two types:

(1) Dynamic programming transition equation that transitions in units of sequences

Without considering the case of outputting a special compressed sequence, a schematic diagram of an input original sequence (the original sequence can be understood as the original data corresponding to a compressed sequence) is shown in fig. 14. In this case, the naive cost function cost[i] is:

Here, i is the end position of the original sequence input this time; j is the start position of the original sequence input this time, that is, the start position of the part of this original sequence that serves as the literal; and k is the start position of the part of this original sequence that serves as the match. The relationship among i, j, and k is: j ≤ k ≤ i and j < i. The original sequence input this time is the original data consisting of the j-th to (i-1)-th bytes.

The cost function cost[i] represents the minimum byte length of the output compressed sequences when the original data is input sequentially until the last input original sequence ends at i (or until the input original data is the suffix ending with i), where the suffix ending with i can be understood as the original data consisting of the 0th to (i-1)-th bytes.

The cost function cost[j] represents the minimum byte length of the output compressed sequences when the original data is input sequentially until the last input original sequence ends at j (or until the input original data is the suffix ending with j), where the suffix ending with j can be understood as the original data consisting of the 0th to (j-1)-th bytes.

k-j is the byte length of the part of the original sequence input this time as the literal.

The lsic function represents the byte length of lsic_literal or lsic_match for an argument x. Specifically:

When x is i, lsic[i] represents the byte length of lsic_literal when no part of the original sequence input this time serves as a match;

When x is k-j, lsic[k-j] represents the byte length of lsic_literal when a part of the original sequence input this time serves as a match;

When x is i-k-MINMATCH, lsic[i-k-MINMATCH] represents the byte length of lsic_match when a part of the original sequence input this time serves as a match, where MINMATCH is 4.
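The lsic function and the byte length of a single compressed sequence can be sketched as follows. The closed form for `lsic` follows from the LSIC encoding (one extra byte once the 4-bit field saturates at 15, plus one more byte for every further 255). The helper `sequence_cost` is a hypothetical illustration, and it assumes that a sequence without a match carries no offset field; the application may count that case differently.

```python
MINMATCH = 4


def lsic(x: int) -> int:
    """Number of extra length bytes needed when a 4-bit length field holds x."""
    return 0 if x < 15 else 1 + (x - 15) // 255


def sequence_cost(lit_len: int, match_len: int) -> int:
    """Byte length of one compressed sequence (match_len == 0 means no match)."""
    cost = 1 + lsic(lit_len) + lit_len          # token + lsic_literal + literal
    if match_len:
        cost += 2 + lsic(match_len - MINMATCH)  # offset + lsic_match
    return cost


print(lsic(14), lsic(15), lsic(269), lsic(270))  # 0 1 1 2
print(sequence_cost(3, 5))                       # 6
```

In terms of the symbols above, `lit_len` plays the role of k-j and `match_len` the role of i-k.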

After constraints are added, the cost function can be further simplified as follows:

Here, ml is the byte length of the valid match, which cannot exceed the length of the original sequence input this time, that is, ml ≤ i-j. longestmatch is the maximum suffix length that can be obtained by consecutively repeating a prefix of the suffix ending with i (it can be found by a modified failure function of the Knuth-Morris-Pratt (KMP) algorithm). For example: if the original sequence is ABABAB, longestmatch is 4; if the original sequence is abccabbc, longestmatch is 9; if the original sequence is abcdabbc, longestmatch is 3; and if the original sequence is babbcdabc, longestmatch is also 3.

m is the start position of the template string (template), which may also be obtained by the modified KMP failure function. The position of the template in the input original sequence is shown in fig. 15. The template string is located before the part serving as the match in the original sequence input this time, and it may include part or all of the original data serving as the literal in the original sequence input this time, or part or all of the original data serving as the literal in a previously input original sequence.

k-m represents the value of the offset. Because the offset occupies at most two bytes, its maximum value is MAXDISTANCE, that is, 65535.

(2) Dynamic programming transition equation that transitions on states ending with a literal

The input original sequence can also be transitioned by a variant in which each state ends with a literal, for example as shown in fig. 16, in which case the cost function cost[i] is:

Here, i is the end position of the part serving as the literal in the original sequence input this time.

ml represents the byte length of a valid match. Unlike in (1), ml here is a loop variable that ranges from MINMATCH to longestmatch. For the definitions of MINMATCH and longestmatch, refer to the explanation in (1).

lsicdelta_lit is the increase in the byte length of lsic_literal caused by extending the total literal length corresponding to cost[i-1] by one byte. Its value is 0 or 1: for example, when the literal length corresponding to cost[i-1] is 14, 15+254, 15+255+254, and so on, lsicdelta_lit is 1; otherwise, it is 0.
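The values at which lsicdelta_lit becomes 1 can be checked with a short sketch. The function names are illustrative; `lsic` uses the closed form implied by the LSIC encoding.

```python
def lsic(x: int) -> int:
    """Number of extra length bytes needed when a 4-bit length field holds x."""
    return 0 if x < 15 else 1 + (x - 15) // 255


def lsicdelta_lit(prev_lit_len: int) -> int:
    """Growth of lsic_literal when one more literal byte is appended."""
    return lsic(prev_lit_len + 1) - lsic(prev_lit_len)


# Literal lengths below 600 at which appending one byte costs an extra length byte.
print([n for n in range(600) if lsicdelta_lit(n) == 1])  # [14, 269, 524]
```

The printed lengths are 14, 15+254 and 15+255+254, the values listed in the text.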

The cost function cost[i] represents the minimum byte length of the output compressed sequences when the original data is input sequentially until the part serving as the literal in the input original sequence ends at i. This can be understood as inputting the original data consisting of the 0th to (i-1)-th bytes, where the (i-1)-th byte is not necessarily part of a literal; rather, when compression continues from the original data after the (i-1)-th byte, that data is first treated as a literal, although the actual byte length of the literal may be zero.

The cost function cost[i-1] represents the minimum byte length of the output compressed sequences when the original data is input sequentially until the part serving as the literal in the input original sequence ends at i-1. This can be understood as inputting the original data consisting of the 0th to (i-2)-th bytes, where the (i-2)-th byte is not necessarily part of a literal; rather, when compression continues from the original data after the (i-2)-th byte, that data is first treated as a literal, although the actual byte length of the literal may be zero.

The cost function cost[i-ml] represents the minimum byte length of the output compressed sequences when the original data is input sequentially until the part serving as the literal in the input original sequence ends at i-ml. This can be understood as inputting the original data consisting of the 0th to (i-ml-1)-th bytes, where the (i-ml-1)-th byte is not necessarily part of a literal; rather, when compression continues from the original data after the (i-ml-1)-th byte, that data is first treated as a literal, although the actual byte length of the literal may be zero.

For the definition of the lsic function, refer to the explanation in (1). When x is ml-MINMATCH, lsic[ml-MINMATCH] represents the byte length of lsic_match when a part of the original sequence input this time serves as a match.

Hereinafter, a data compression apparatus according to an embodiment of the present application will be described with reference to the accompanying drawings based on the same technical concept.

The present application provides a data compression apparatus, which may be any device capable of executing the method according to the above method embodiments of the present application, for example, an image generation server, a personal computer, or a mobile terminal. The data compression apparatus includes functions or modules or means (means) for performing the methods according to the above-described method embodiments of the present application, and optionally, the data compression apparatus may also include functions or modules or means (means) for performing the reading and decompression processes of the compressed blocks. The functions or modules or units or means (means) may be implemented by software, or by hardware executing corresponding software.

Fig. 17 shows a first schematic structural diagram of a data compression apparatus according to an embodiment of the present application. The data compression apparatus 1700 includes a transceiver module 1701, a processing module 1702, and a compression module 1703. The transceiver module 1701 may be configured to obtain raw data to be compressed. The processing module 1702 may be configured to input the original data into the compression module 1703, perform compression processing on the original data, and then sequentially output n compressed blocks including the compressed data, where a first capacity of each output compressed block is the same, and the first capacity represents a number of bytes of the compressed data that the compressed block can include. The processing module 1702 may be further configured to store the n compressed blocks in a storage medium, where the storage medium includes m disk blocks, and a second capacity of each disk block is the same, and the second capacity represents a number of bytes of data stored in the disk block.

Specifically, the storage form of the compressed blocks with the same first capacity in the storage medium can be designed as follows:

In one implementation, if there are at least two compressed blocks in the n compressed blocks whose compressed data is the same, when the processing module 1702 stores the at least two compressed blocks in the storage medium, the at least two compressed blocks may be further subjected to deduplication processing, so that the storage locations of the at least two compressed blocks in the storage medium are the same.
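Deduplication of identical compressed blocks can be sketched as follows. This is an illustration only: the "disk" is modelled as a Python list, block contents are compared directly through a dictionary, and the name `store_with_dedup` is hypothetical.

```python
def store_with_dedup(compressed_blocks):
    """Store each distinct block once; return (disk, location of every block)."""
    location_of = {}  # block content -> index of its storage location
    disk = []         # stands in for the storage medium
    placement = []    # storage location used by each input block, in order
    for block in compressed_blocks:
        if block not in location_of:
            location_of[block] = len(disk)
            disk.append(block)
        placement.append(location_of[block])
    return disk, placement


disk, placement = store_with_dedup([b"AAAA", b"BBBB", b"AAAA"])
print(placement)  # [0, 1, 0]
```

Because every compressed block has the same first capacity, identical blocks can share one storage location without any realignment.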

The processing module 1702 may also be used to create an index. In one implementation, the processing module 1702 may be configured to divide the original data into i data blocks, where each data block includes data whose number of bytes is the same as the first capacity, and a jth data block of the i data blocks includes data decompressed from at most two compressed blocks. Then, the processing module 1702 may be further configured to establish a first index for the jth data block, and record a corresponding relationship between the established first index and the jth data block, where the first index is used to identify a storage location of data included in the jth data block in the storage medium. i is a positive integer, and j is any positive integer less than or equal to i.

In a first possible design, when the jth data block includes data decompressed in the next compressed block, the content included in the first index corresponding to the jth data block is: a first identification bit or a second identification bit, a block number of the next compressed block, and an intra-block offset of a jth data block. The first identification bit is used for identifying that the data in the next compressed block is original data, the second identification bit is used for identifying that the data in the next compressed block is data subjected to compression processing, and the intra-block offset of the jth data block is the position of the header of the data subjected to decompression processing in the jth data block.

In a second possible design, when the jth data block contains only data decompressed from the current compressed block, the first index corresponding to the jth data block contains: a third identification bit, which identifies the data in the current compressed block as data obtained after compression processing; a block distance relative to a first data block located before the jth data block, where the first data block contains data decompressed from the current compressed block, or contains both data decompressed from the current compressed block and data decompressed from the previous compressed block; a block distance relative to a second data block located after the jth data block, where the second data block contains both data decompressed from the current compressed block and data decompressed from the next compressed block; and an intra-block offset of the first data block, which is the position in the first data block of the head of the data decompressed from the current compressed block.

In a third possible design, when the jth data block contains only data decompressed from the current compressed block, the first index corresponding to the jth data block may further include the block number of the current compressed block.

In a possible implementation manner, the compression processing procedure executed by the processing module 1702 may specifically be: the original data is sequentially input to the compression module 1703 in bytes for compression, and the following processing is repeatedly performed until all bytes included in the original data are input to the compression module:

After it is determined each time that the number of bytes of the compressed data has reached the first capacity, it is determined whether the number of bytes s of the original data input to the compression module 1703 for this compression is greater than the first capacity. If so, the compressed data is contained in one compressed block and output. If not, t more bytes of original data continue to be input to the compression module 1703, and the s bytes and the t bytes of original data are contained in one compressed block and output. Here, s and t are positive integers, and t is the difference between the first capacity and s.

In one possible implementation, the processing module 1702 may further be configured to: when the number of bytes of the original data input to the compression module 1703 for this compression is equal to a preset value and the number of bytes of the compressed data still has not reached the first capacity, contain the compressed data in one compressed block and output it, where the preset value is the maximum number of bytes of original data that may be input to the compression module 1703 for one compression.
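The block-filling decisions described above can be sketched as follows. This is not the compression algorithm of the application: `zlib` is swapped in as a stand-in compressor, the input is grown by brute force (recompressing after every added byte), and compressed payloads are padded with zero bytes up to the capacity. The names `pack_fixed_blocks`, `unpack_blocks`, `RAW` and `COMPRESSED` are hypothetical, with the two flags playing the role of the identification bits.

```python
import zlib

RAW, COMPRESSED = 0, 1  # hypothetical flags standing in for the identification bits


def pack_fixed_blocks(data: bytes, capacity: int, max_input: int):
    """Split data into (flag, payload) blocks whose payload is `capacity` bytes
    (only a final raw block may be shorter)."""
    blocks = []
    pos = 0
    while pos < len(data):
        limit = min(max_input, len(data) - pos)  # the preset maximum input
        s = 0
        # Grow the input byte by byte while its compressed form still fits.
        while s < limit and len(zlib.compress(data[pos:pos + s + 1])) <= capacity:
            s += 1
        if s > capacity:
            # Compression paid off: emit the compressed data, padded to capacity.
            payload = zlib.compress(data[pos:pos + s]).ljust(capacity, b"\0")
            blocks.append((COMPRESSED, payload))
        else:
            # Otherwise store `capacity` bytes of original data unchanged.
            s = min(capacity, len(data) - pos)
            blocks.append((RAW, data[pos:pos + s]))
        pos += s
    return blocks


def unpack_blocks(blocks) -> bytes:
    out = bytearray()
    for flag, payload in blocks:
        if flag == COMPRESSED:
            # The zlib stream is self-terminating, so the padding is ignored.
            out += zlib.decompressobj().decompress(payload)
        else:
            out += payload
    return bytes(out)


sample = b"abc" * 500 + bytes(range(256))
packed = pack_fixed_blocks(sample, capacity=64, max_input=512)
assert unpack_blocks(packed) == sample
```

In this sketch, the `limit` check corresponds to the preset maximum input described above, and the raw branch corresponds to the case in which s does not exceed the first capacity.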

Fig. 18 shows a second schematic structural diagram of a data compression apparatus according to an embodiment of the present application. The data compression apparatus 1800 may include a processor 1801 and a communication interface 1802. The processor 1801 may be a Central Processing Unit (CPU) or a Network Processor (NP). The processor 1801 may also be other types of chips such as a baseband circuit, a radio frequency circuit, an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or any combination thereof. The data compression apparatus 1800 may also include a memory 1803 for storing programs executed by the processor 1801 and data required for processing. The memory 1803 may be integrated with the processor 1801, or may be disposed separately from the processor 1801.

the processor 1801 may be configured to obtain raw data to be compressed using the communication interface 1802. The processor 1801 may be configured to input the original data into a compression module, perform compression processing on the original data, and then sequentially output n compressed blocks including the compressed data, where a first capacity of each output compressed block is the same, and the first capacity represents a number of bytes of the compressed data that the compressed block can include. The processor 1801 may further be configured to store the n compressed blocks in a storage medium, where the storage medium includes m disk blocks, and a second capacity of each disk block is the same, where the second capacity represents a number of bytes of data stored by the disk block.

For specific functions of the processor 1801, the communication interface 1802, and the memory 1803, reference may be made to corresponding descriptions in the foregoing method embodiments of the present application, and details are not described herein again.

Based on the same technical concept, the present application further provides a chip, where the chip may communicate with a memory, or the chip includes a memory, and the chip executes program instructions stored in the memory to implement the functions corresponding to the methods involved in the above method embodiments.

Based on the same technical concept, the present application also provides a computer storage medium, which stores computer readable instructions, and when the computer readable instructions are executed, the computer storage medium implements the functions corresponding to the methods involved in the above method embodiments.

Based on the same technical concept, the present application further provides a computer program product including a software program, which when running on a computer, enables the functions corresponding to the methods in the above method embodiments to be implemented.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A method of data compression, comprising:

acquiring original data to be compressed;

Inputting the original data into a compression module for compression processing, and then sequentially outputting n compression blocks containing the compressed data, wherein the first capacity of each compression block is the same, and the first capacity represents the number of bytes of the compressed data which can be contained in the compression block;

storing n of the compressed blocks in a storage medium, wherein the storage medium comprises m disk blocks, and a second capacity of each disk block is the same and represents the number of bytes of data stored in the disk block;

wherein the second capacity is p times the first capacity, and the storage form of the n compressed blocks in the storage medium is: storing p complete compressed blocks in one complete disk block; or, the first capacity is q times the second capacity, and the storage form of the n compressed blocks in the storage medium is: storing one complete compressed block in q complete disk blocks; n, m, p and q are positive integers, p is less than or equal to n, and q is less than or equal to m.
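As a non-limiting illustration of the two storage forms recited in claim 1, the following Python sketch computes where each compressed block would be placed. The function name `place_blocks`, its return format, and the assumption that compressed blocks are written back to back starting at disk block 0 are hypothetical and are not taken from the application.

```python
from typing import List, Tuple


def place_blocks(first_capacity: int, second_capacity: int, n: int) -> List[Tuple[int, int, int]]:
    """Return, for each of n compressed blocks, a tuple
    (first disk block index, byte offset inside that disk block, disk blocks spanned).

    Assumption: blocks are laid out contiguously from disk block 0.
    """
    if second_capacity % first_capacity == 0:
        # Disk block is p times the compressed block: p whole blocks per disk block.
        p = second_capacity // first_capacity
        return [(k // p, (k % p) * first_capacity, 1) for k in range(n)]
    if first_capacity % second_capacity == 0:
        # Compressed block is q times the disk block: one block fills q whole disk blocks.
        q = first_capacity // second_capacity
        return [(k * q, 0, q) for k in range(n)]
    # Neither capacity divides the other, so a block would straddle a disk block boundary.
    raise ValueError("one capacity must be an integer multiple of the other")
```

In both forms no compressed block crosses a disk block boundary part-way, so reading any single compressed block touches only whole disk blocks that belong to it or that it shares with other complete compressed blocks.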

2. The method of claim 1, wherein the method further comprises:

if at least two compressed blocks which contain the same compressed data exist in the n compressed blocks, when the at least two compressed blocks are stored in the storage medium, the storage positions of the at least two compressed blocks in the storage medium are the same.
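The application does not state how identical compressed blocks are detected. One possible approach, shown here only as an assumption, is to key each block by a content digest and to reuse the storage location of an earlier block with the same digest. The class `BlockStore`, the use of SHA-256, and the in-memory list standing in for the storage medium are all hypothetical.

```python
import hashlib
from typing import Dict, List


class BlockStore:
    """Toy storage medium that keeps one copy of each distinct compressed block."""

    def __init__(self) -> None:
        self._blocks: List[bytes] = []                 # stands in for the disk
        self._location_by_digest: Dict[bytes, int] = {}

    def put(self, compressed_block: bytes) -> int:
        """Store the block and return its storage location.

        Blocks with identical content get the same location. Digest collisions
        are ignored in this sketch; a real system would compare the bytes too.
        """
        digest = hashlib.sha256(compressed_block).digest()
        location = self._location_by_digest.get(digest)
        if location is None:
            location = len(self._blocks)
            self._blocks.append(compressed_block)
            self._location_by_digest[digest] = location
        return location

    def get(self, location: int) -> bytes:
        return self._blocks[location]
```

Because every compressed block has the same first capacity, whole-block comparison of this kind is straightforward: two blocks are either byte-for-byte identical or not, with no partial overlap to consider.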

3. The method of claim 1, further comprising:

dividing the original data into i data blocks, wherein the number of bytes of data contained in each data block is the same as the first capacity, and the jth data block in the i data blocks at most contains data subjected to decompression processing in two compressed blocks;

Establishing a first index for the jth data block, and recording a corresponding relationship between the established first index and the jth data block, where the first index is used to identify a storage location of data contained in the jth data block in the storage medium;

i is a positive integer, and j is any positive integer less than or equal to i.

4. The method according to claim 3, wherein when the jth data block contains data decompressed in a next compressed block, the content contained in the first index corresponding to the jth data block is:

A first identification bit or a second identification bit, a block number of the next compressed block, and an intra-block offset of the jth data block;

The first identification bit is used for identifying that the data in the next compressed block is original data; the second identification bit is used for identifying the data in the next compressed block as the data after compression processing; the intra-block offset of the jth data block is the position of the header of the decompressed data of the next compressed block in the jth data block.

5. The method according to claim 3, wherein when the jth data block only contains decompressed data in the current compressed block, the content contained in the first index corresponding to the jth data block is:

A third identification bit, configured to identify that the data in the current compressed block is data obtained after compression processing;

The block distance, relative to the jth data block, of a first data block located before the jth data block, where the first data block includes data decompressed in a current compressed block, or the first data block includes data decompressed in the current compressed block and data decompressed in a previous compressed block;

The block distance, relative to the jth data block, of a second data block located after the jth data block, wherein the second data block comprises data which is decompressed in the current compression block and data which is decompressed in the next compression block;

An intra-block offset of the first data block, wherein the intra-block offset of the first data block is a position of a header of the decompressed data of the current compression block in the first data block.

6. The method of claim 5, wherein the first index corresponding to the jth data block further comprises a block number of the current compressed block.
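The two index layouts recited in claims 4 to 6 can be pictured as two record types. The sketch below is only one possible in-memory representation; the class names, field names, field widths, and the helper `neighbours` are hypothetical and are not specified by the application.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BoundaryIndex:
    """Index of a data block that contains data from the next compressed block (claim 4)."""
    next_block_is_compressed: bool  # False: first identification bit, True: second
    next_block_number: int          # block number of the next compressed block
    intra_block_offset: int         # where the next block's decompressed data starts


@dataclass(frozen=True)
class InteriorIndex:
    """Index of a data block that lies entirely inside the current compressed block (claim 5)."""
    distance_to_first: int          # blocks back to the data block where the current block starts
    distance_to_second: int         # blocks forward to the data block where the next block starts
    first_intra_block_offset: int   # offset of the current block's data in the first data block
    compressed: bool = True         # third identification bit
    current_block_number: Optional[int] = None  # optional field of claim 6


def neighbours(j: int, index: InteriorIndex) -> Tuple[int, int]:
    """Return the positions of the first and second data blocks relative to data block j."""
    return j - index.distance_to_first, j + index.distance_to_second
```

For example, with `distance_to_first=2` and `distance_to_second=3`, data block 10 lies between a first data block at position 8 and a second data block at position 13.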

7. The method as claimed in any one of claims 1 to 6, wherein inputting the original data into a compression module for compression processing and then sequentially outputting n compression blocks containing the compressed data comprises:

sequentially inputting the original data into the compression module, in units of bytes, for compression;

repeatedly executing the following processing until all bytes of the original data have been input to the compression module:

each time it is determined that the number of bytes of the compressed data has reached the first capacity, judging whether the number of bytes s of the original data input to the compression module for this compression is greater than the first capacity;

if the judgment result is yes, containing the compressed data in one compression block and outputting the compression block;

if the judgment result is no, continuing to input t bytes of original data to the compression module, and containing the s bytes of original data and the t bytes of original data in one compression block and outputting the compression block, wherein s and t are positive integers, and t is the difference between the first capacity and s.

8. The method of claim 7, wherein after the raw data is sequentially input to the compression module in units of bytes for compression, the method further comprises:

when the number of bytes of the original data input to the compression module for the current compression equals a preset value, if the number of bytes of the data compressed this time still has not reached the first capacity, containing the data compressed this time in one compression block and outputting the compression block, wherein the preset value is the maximum number of bytes of original data that may be input to the compression module for a single compression.
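The procedure of claims 7 and 8 can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not in the application: `zlib` stands in for the compression module, each prefix is recompressed from scratch (quadratic, chosen for clarity rather than speed), short payloads are zero-padded to the first capacity, and each block is returned as a hypothetical tuple `(kind, block_bytes, used_bytes)`.

```python
import zlib
from typing import List, Tuple

RAW, COMPRESSED = 0, 1
Block = Tuple[int, bytes, int]  # (kind, block padded to first_capacity, bytes actually used)


def compress_fixed_output(data: bytes, first_capacity: int, max_input: int) -> List[Block]:
    """Emit blocks of exactly first_capacity bytes; max_input (>= 1) is the preset value."""
    blocks: List[Block] = []
    pos = 0
    while pos < len(data):
        s, payload = 0, b""
        # Feed one more byte at a time while the compressed output still fits.
        while pos + s < len(data) and s < max_input:
            candidate = zlib.compress(data[pos:pos + s + 1])
            if len(candidate) > first_capacity:
                break
            s, payload = s + 1, candidate
        block_full = pos + s < len(data) and s < max_input
        if block_full and s <= first_capacity:
            # Compression did not shrink the input: store first_capacity
            # original bytes (the s bytes plus t = first_capacity - s more).
            raw = data[pos:pos + first_capacity]
            blocks.append((RAW, raw.ljust(first_capacity, b"\0"), len(raw)))
            pos += len(raw)
        else:
            # Either s > first_capacity, or the preset value or the end of the
            # data was reached first: output what was compressed.
            blocks.append((COMPRESSED, payload.ljust(first_capacity, b"\0"), len(payload)))
            pos += s
    return blocks


def decompress_blocks(blocks: List[Block]) -> bytes:
    """Inverse of compress_fixed_output, used here to check the round trip."""
    out = bytearray()
    for kind, block, used in blocks:
        body = block[:used]
        out += body if kind == RAW else zlib.decompress(body)
    return bytes(out)
```

Every emitted block has the same length regardless of how compressible its input was, which is the property that lets the blocks be aligned with disk blocks as described in claim 1.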

9. A data compression apparatus, comprising:

The receiving and transmitting module is used for acquiring original data to be compressed;

The processing module is used for inputting the original data into the compression module for compression processing and then sequentially outputting n compression blocks containing the compressed data, wherein the first capacity of each compression block is the same, and the first capacity represents the number of bytes of the compressed data which can be contained in the compression block; storing n of the compressed blocks in a storage medium, wherein the storage medium comprises m disk blocks, and a second capacity of each disk block is the same and represents the number of bytes of data stored in the disk block;

10. The apparatus of claim 9, wherein the processing module is further configured to:

11. The apparatus of claim 9, wherein the processing module is further configured to:

i is a positive integer, and j is any positive integer less than or equal to i.

12. The apparatus according to claim 11, wherein when the jth data block contains data decompressed in a next compressed block, the content contained in the first index corresponding to the jth data block is:

13. The apparatus according to claim 11, wherein when the jth data block only contains decompressed data in the current compressed block, the content contained in the first index corresponding to the jth data block is:

14. The apparatus of claim 13, wherein the first index corresponding to the jth data block further comprises a block number of the current compressed block.

15. The apparatus according to any one of claims 9 to 14, wherein the processing module is specifically configured to:

16. The apparatus of claim 15, wherein the processing module is further configured to:

17. A computer storage medium comprising computer readable instructions which, when executed, implement the method of any of claims 1 to 8.