WO2022262381A1

WO2022262381A1 - Data compression method and apparatus

Info

Publication number: WO2022262381A1
Application number: PCT/CN2022/085621
Authority: WO
Inventors: 俞超; 陈宜; 李桂付; 邱歌; 李志鹏; 张代曰; 钱璟
Original assignee: 华为技术有限公司
Priority date: 2021-06-16
Filing date: 2022-04-07
Publication date: 2022-12-22
Also published as: JP2024525170A; US20240283463A1; EP4336336A1; CN115480692A

Abstract

Disclosed in the present application are a data compression method and apparatus, the method comprising: acquiring m data blocks of a data area in a readable and writable file system, using a preset compression algorithm to compress the m data blocks, and obtaining in sequence n compressed data blocks, wherein a first capacity of each compressed data block is the same, the first capacity representing the number of bytes of compressed data that the compressed data block can contain; establishing a first index of each data block amongst j data blocks corresponding to an i-th compressed data block in the n items of compressed data, and recording a mapping relationship between the first index and the j data blocks. The first index is used for identifying the storage location of each data block amongst the j data blocks in a storage medium, and attribute information included in each data block amongst the j data blocks. When the data blocks are read, the reading efficiency can be effectively increased, ensuring that data reading is implemented with a small read amplification factor in a random read scenario.

Description

A data compression method and device

This application claims priority to Chinese patent application No. 202110667882.7, entitled "A Data Compression Method and Device", filed with the China Patent Office on June 16, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the technical field of data compression, in particular to a data compression method and device.

Background

To improve the overall input/output (IO) read and write performance of a storage system, the files held in storage need to be compressed. Current read-write file systems include, on Linux, the flash friendly file system (F2FS), the journalling flash file system version 2 (JFFS2), and the B-tree file system (BTRFS), and, on Windows, NTFS. Because the metadata area accounts for only a small proportion of the entire file system, the data area usually occupies most of the device's storage capacity. Compressing the data in the data area therefore reduces the size of each IO and improves overall IO read and write performance.

Existing data compression methods usually compress the original file data (or source data) in units of a fixed-size smallest compressible unit, and the compressed file data may include header data and compressed data. The header data represents the attribute information of the file data; the compressed data represents the content of the file data. The compressed file data is then saved to the storage medium. However, existing compression schemes for read-write file systems suffer from random read amplification, so read efficiency is low.

Summary of the invention

Embodiments of the present application provide a data compression method and device, which can solve the problem of random read amplification of a read-write file system and improve read efficiency.

In a first aspect, an embodiment of the present application provides a data compression method. The method may be executed by an electronic device or by a component located in the electronic device (for example, a chip, a chip system, or a processor). The following description takes the electronic device as the executing entity by way of example. The method includes: the electronic device acquires m data blocks in the data area of a readable and writable file system, where m is a positive integer greater than or equal to 1. The electronic device compresses the m data blocks using a preset compression algorithm and obtains n compressed data blocks in sequence, where every compressed data block has the same first capacity, the first capacity represents the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1. The electronic device establishes a first index for each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and records a mapping relationship between the first index and the j data blocks, where i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m. The first index identifies the storage location in the storage medium of each of the j data blocks and the attribute information contained in each of the j data blocks.
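The method of the first aspect can be illustrated with a minimal Python sketch. It is not the claimed implementation: `zlib` stands in for the unspecified "preset compression algorithm", the 4096-byte block size and capacity are assumed values, the names `compress_fixed_output` and `_seal` are hypothetical, and, as a simplification, a data block is never split across two compressed blocks.

```python
import zlib

BLOCK_SIZE = 4096  # assumed size of one uncompressed data block
CAPACITY = 4096    # assumed "first capacity" shared by all compressed blocks


def _seal(group, capacity):
    """Compress one group of data blocks and pad it to the fixed capacity."""
    payload = zlib.compress(b"".join(group))
    if len(payload) > capacity:
        # A real system would need a fallback (for example, storing raw data).
        raise ValueError("group does not fit into one compressed block")
    return payload.ljust(capacity, b"\0")


def compress_fixed_output(blocks, capacity=CAPACITY):
    """Return (compressed_blocks, index).

    index[k] = (i, offset): data block k belongs to compressed block i and
    starts at byte `offset` of that block's decompressed content.
    """
    compressed_blocks, index, group = [], [], []
    for block in blocks:
        # Close the current group once adding this block would overflow it.
        # Recompressing on every step is slow but keeps the sketch short.
        if group and len(zlib.compress(b"".join(group + [block]))) > capacity:
            compressed_blocks.append(_seal(group, capacity))
            group = []
        index.append((len(compressed_blocks), sum(len(b) for b in group)))
        group.append(block)
    if group:
        compressed_blocks.append(_seal(group, capacity))
    return compressed_blocks, index
```

Because every compressed block has the same size, compressed block i starts at byte `i * CAPACITY` on the medium, so a random read only has to fetch and decompress the one compressed block named in the index.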

Therefore, the data compression method provided by the embodiment of the present application can effectively improve the reading efficiency when reading a data block, and can ensure that the random reading scenario completes data reading with a small read amplification factor. In addition, the attributes contained in the index of the data block can be modified, so that the compressed file on the storage device can be modified. It can be seen that the embodiment of the present application solves the problem of random read amplification in existing read-write file system compression schemes, and at the same time solves the problem that existing file systems with fixed output compression methods cannot support data and metadata updates.

In a specific implementable manner, compressing the m data blocks using the preset compression algorithm to obtain the n compressed data blocks in sequence specifically includes: allocating each of the m data blocks to a first set. When the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, a compression operation is performed on the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
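As a rough illustration of filling the first set, the following Python sketch groups data blocks until the set reaches its rated capacity. The function name `fill_sets` and the handling of a trailing partial set are assumptions; the text above does not say what happens to leftover blocks.

```python
def fill_sets(blocks, rated_capacity):
    """Yield lists of data blocks whose total size reaches rated_capacity."""
    current, size = [], 0
    for block in blocks:
        current.append(block)
        size += len(block)
        # The text requires equality; ">=" only guards against block sizes
        # that do not divide the rated capacity evenly.
        if size >= rated_capacity:
            yield current
            current, size = [], 0
    if current:
        yield current  # assumed: a trailing partial set is still handed over
```

Each yielded set would then be passed to the compression step.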

In a specific implementable manner, establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks specifically includes: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establishing the first index of each of the j data blocks.
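The condition above is a single inequality, shown here as a small Python check. The function name `should_index` and the byte counts in the usage note are illustrative assumptions, not values from the application.

```python
def should_index(header_len, compressed_len, threshold, raw_len):
    """True when compression saves at least `threshold` bytes overall."""
    return header_len + compressed_len + threshold <= raw_len
```

For example, with four 4096-byte data blocks (16384 bytes), a 16-byte header, and a threshold of 4096 bytes, a 5000-byte compressed payload passes the check (9112 <= 16384), whereas a 13000-byte payload does not (17112 > 16384).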

In a specific implementable manner, the attribute information includes at least one of the following. A first attribute indicates whether the storage location of the compressed data block into which the data block is compressed is pre-allocated. A second attribute indicates whether the data page of the data block is valid. A third attribute indicates whether the data page of the data block is the first compressed page of the compressed data block of the data block. A fourth attribute indicates whether the data page of the data block is contained in the compressed data pages of two compressed blocks. A fifth attribute indicates whether the data page of the data block is a compressed page of the compressed data block obtained after the data block is compressed. A sixth attribute represents the index address of the compressed data block in which the data page of the data block is located. A seventh attribute has one of two meanings: when the data page of the data block is the first compressed page of the compressed data block of the data block, its value is the offset of the data block in the set corresponding to the compressed data block; when the data page of the data block is not the first compressed page of the compressed data block of the data block, its value is the distance between the data page of the data block and the first compressed page of the compressed data block.
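One way to picture the seven attributes is as a small record with five one-bit flags and two integer fields. The Python sketch below is only an illustration: the names `PageFlags` and `FirstIndex`, the bit positions, and the field names are assumptions, since the text does not fix an on-disk layout.

```python
from dataclasses import dataclass
from enum import IntFlag


class PageFlags(IntFlag):
    PREALLOCATED = 1 << 0     # first attribute
    VALID = 1 << 1            # second attribute
    FIRST_PAGE = 1 << 2       # third attribute
    IN_TWO_BLOCKS = 1 << 3    # fourth attribute
    COMPRESSED_PAGE = 1 << 4  # fifth attribute


@dataclass
class FirstIndex:
    flags: PageFlags
    cblock_index_addr: int    # sixth attribute
    offset_or_distance: int   # seventh attribute

    def seventh_meaning(self):
        """Say how the seventh attribute must be interpreted."""
        if PageFlags.FIRST_PAGE in self.flags:
            return "offset in set"
        return "distance to first compressed page"
```

Keeping the flags separate from the two integer fields makes it possible to flip a single attribute, such as validity, without touching the location fields.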

In a specific implementable manner, the attribute information includes the third attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is the first compressed page of the i-th compressed data block, setting the attribute value of the third attribute to 1; when the data page of the data block is not the first compressed page of the i-th compressed data block, setting the attribute value of the third attribute to 0.

In a specific implementable manner, the attribute information includes the seventh attribute, and the method further includes: when the attribute value of the third attribute is 1, updating the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block; when the attribute value of the third attribute is 0, updating the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.

In a specific implementable manner, the attribute information includes the fourth attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is contained in the compressed data pages of two compressed blocks, setting the attribute value of the fourth attribute to 1; when the data page of the data block is not contained in the compressed data pages of two compressed blocks, setting the attribute value of the fourth attribute to 0.

In a specific implementable manner, the attribute information includes the second attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is valid, setting the attribute value of the second attribute to 1; when the data page of the data block is invalid, setting the attribute value of the second attribute to 0.
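The assignment rules for the second, third, fourth, and seventh attributes can be sketched together in Python. The function name `index_entries`, its parameters, and the dictionary keys are hypothetical; the sketch assumes the data pages of one compressed block are numbered consecutively and that the distance is counted in pages.

```python
def index_entries(first_page, page_count, offset_in_set,
                  shared_pages=(), invalid_pages=()):
    """Build index entries for the data pages covered by one compressed block.

    first_page    -- page number of the first compressed page
    offset_in_set -- offset of that first page within its set
    shared_pages  -- pages contained in two compressed blocks
    invalid_pages -- pages whose data is no longer valid
    """
    entries = {}
    for page in range(first_page, first_page + page_count):
        is_first = page == first_page
        entries[page] = {
            "valid": 0 if page in invalid_pages else 1,         # second
            "first_page": 1 if is_first else 0,                 # third
            "in_two_blocks": 1 if page in shared_pages else 0,  # fourth
            # seventh: offset for the first page, distance otherwise
            "offset_or_distance": (offset_in_set if is_first
                                   else page - first_page),
        }
    return entries
```

For a compressed block that covers pages 8 to 10 and starts at offset 512 in its set, page 8 is stored with `first_page = 1` and the offset 512, while pages 9 and 10 are stored with `first_page = 0` and the distances 1 and 2.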

In some implementable manners, before the m data blocks are compressed using the preset compression algorithm to obtain the n compressed data blocks in sequence, the method further includes: obtaining a second set of data to be overwritten, where the second set includes p compressed data blocks and p is a positive integer greater than or equal to 1; obtaining the compressed page of a first target compressed data block among the p compressed data blocks, and the q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1; determining the position offset of a first target data block within the q data blocks; and determining the data page of the first target data block as the data page to be overwritten.
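A loose Python sketch of the overwrite preparation follows. It assumes that `zlib` is the compression algorithm, that pages are 4096 bytes, and that the target compressed block is decompressed, patched at the target position, and then recompressed; the function name `overwrite_page` is hypothetical, and the text does not state that recompression happens in exactly this way.

```python
import zlib

PAGE_SIZE = 4096  # assumed data page size


def overwrite_page(compressed_block, position, new_page, page_size=PAGE_SIZE):
    """Replace the data page at `position` inside one compressed block."""
    if len(new_page) != page_size:
        raise ValueError("new page must be exactly one page long")
    raw = bytearray(zlib.decompress(compressed_block))  # the q data blocks
    start = position * page_size  # position offset of the target data block
    if start + page_size > len(raw):
        raise IndexError("position lies outside the decompressed data")
    raw[start:start + page_size] = new_page
    return zlib.compress(bytes(raw))  # handed back to the compression step
```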

In a specific implementable manner, the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.

In some implementable manners, the method further includes: reading the first index of a first data block to obtain the index address of the first compressed data block corresponding to the first data block, where the first index includes attribute information of the first data block; reading the index of the first compressed data block corresponding to the first data block; decompressing the first compressed data block according to the index of the first compressed data block to obtain multiple data blocks corresponding to the first compressed data block, the multiple data blocks including the first data block; determining the offset of the first data block within the multiple decompressed data blocks; and obtaining the data of the first data block according to that offset.
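The read steps above map onto a short lookup chain, sketched below in Python. The dictionary layouts (`cblock_index_addr`, `position`, `start`, `length`), the function name `read_data_block`, and the use of `zlib` are assumptions chosen for illustration.

```python
import zlib


def read_data_block(block_no, first_index, cblock_index, storage,
                    block_size=4096):
    """Read one data block through the two index levels."""
    entry = first_index[block_no]                      # read the first index
    cblock = cblock_index[entry["cblock_index_addr"]]  # read the block index
    start, length = cblock["start"], cblock["length"]
    data = zlib.decompress(storage[start:start + length])  # decompress
    offset = entry["position"] * block_size            # locate the block
    return data[offset:offset + block_size]            # extract its data
```

Only the one compressed block named by the index is read and decompressed, which is what keeps the read amplification small for random reads.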

In a second aspect, an embodiment of the present application provides a data compression device, which includes: a first acquisition unit, configured to acquire m data blocks in the data area of a readable and writable file system, where m is a positive integer greater than or equal to 1; a compression unit, configured to compress the m data blocks using a preset compression algorithm and obtain n compressed data blocks in sequence, where every compressed data block has the same first capacity, the first capacity represents the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1; and an update unit, configured to establish the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and to record the mapping relationship between the first index and the j data blocks, where i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m. The first index identifies the storage location in the storage medium of each of the j data blocks and the attribute information contained in each of the j data blocks.

Therefore, the data compression method provided by the embodiment of the present application can effectively improve the reading efficiency when reading a data block, and can ensure that the random reading scenario completes data reading with a small read amplification factor. In addition, the attributes contained in the index of the data block can be modified, so that the compressed file on the storage device can be modified. It can be seen that the embodiment of the present application solves the problem of random read amplification in the existing read-write file system compression scheme, and at the same time solves the problem that the existing file system with fixed output compression mode cannot support data and metadata update.

In a specific implementable manner, the compression unit is configured to: sequentially allocate each data block in the m data blocks to the first set in a preset order. When the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, perform a compression operation on the j data blocks according to the set compression threshold, and obtain the i-th compressed data block.

In a specific implementable manner, the update unit is configured to: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establish the first index of each of the j data blocks.

In a specific implementable manner, the attribute information includes at least one of the following: a first attribute, which is used to indicate whether the storage location of the compressed data block where the data block is compressed is pre-allocated. The second attribute is used to indicate whether the data page of the data block is valid. The third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block. The fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks. The fifth attribute is used to indicate whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed. The sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located. The seventh attribute is used to indicate that when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block. When the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.

In a specific implementable manner, the attribute information includes a third attribute, and the update unit is further configured to: when the data page of each data block in the j data blocks is the first compressed page of the i-th compressed data block, The attribute value of the third attribute is assigned a value of 1. When the data page of each data block in the j data blocks is not the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 0.

In a specific implementable manner, the attribute information includes the seventh attribute, and the update unit is further configured to: when the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block; when the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.

In a specific implementable manner, the attribute information includes the fourth attribute, and the update unit is further configured to, for each of the j data blocks: when the data page of the data block is contained in the compressed data pages of two compressed blocks, set the attribute value of the fourth attribute to 1; when the data page of the data block is not contained in the compressed data pages of two compressed blocks, set the attribute value of the fourth attribute to 0.

In a specific implementable manner, the attribute information includes a second attribute, and the updating unit is further configured to: assign a value of 1 to the attribute value of the second attribute when the data page of each data block in the j data blocks is valid. When the data page of each data block in the j data blocks is invalid, the attribute value of the second attribute is assigned a value of 0.

In some implementable manners, the device further includes: a second acquiring unit, configured to acquire a second set of data to be overwritten and written, the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1. The third obtaining unit is used to obtain the compressed page of the first target compressed data in the p compressed data blocks, and the q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1. The first determining unit is configured to determine the position offset of the first target data block among the q data blocks among the q data blocks. The second determining unit is configured to determine that the data page of the first target data block is the data page to be overwritten with data.

In some implementable manners, the device further includes: a first reading unit, configured to read the first index of a first data block and obtain the index address of the first compressed data block corresponding to the first data block, where the first index includes attribute information of the first data block; a second reading unit, configured to read the index of the first compressed data block corresponding to the first data block; a decompression unit, configured to decompress the first compressed data block according to the index of the first compressed data block to obtain multiple data blocks corresponding to the first compressed data block, the multiple data blocks including the first data block; a third determining unit, configured to determine the offset of the first data block within the multiple decompressed data blocks; and a third obtaining unit, configured to obtain the data of the first data block according to the offset of the first data block within the multiple decompressed data blocks.

In a third aspect, an embodiment of the present application provides a device configured to execute the data compression method in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that includes computer instructions; when the computer instructions are run on an electronic device, the electronic device executes the data compression method described in the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer program, and when the program is called by a processor, the data compression method in the first aspect is executed.

In a sixth aspect, an embodiment of the present application provides a chip system, which includes one or more processors; when the one or more processors execute instructions, the one or more processors execute the data compression method of the first aspect.

It should be understood that descriptions of technical features, technical solutions, beneficial effects or similar language in this application do not imply that all features and advantages can be realized in any single embodiment. Rather, a description of a feature or beneficial effect means that the specific technical feature, technical solution or beneficial effect is included in at least one embodiment. Therefore, descriptions of technical features, technical solutions or beneficial effects in this specification do not necessarily refer to the same embodiment. Furthermore, the technical features, technical solutions and beneficial effects described in the embodiments may be combined in any appropriate manner. Those skilled in the art will understand that an embodiment can be implemented without one or more of the specific technical features, technical solutions or beneficial effects of a particular embodiment. In other cases, additional technical features and beneficial effects may be identified in particular embodiments that are not present in all embodiments.

Description of drawings

FIG. 1a is a block diagram of an operating system provided by an embodiment of the present application;

FIG. 1b is a schematic structural diagram of a storage system provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a solid-state hard disk of the storage system in FIG. 1b;

FIG. 3 is a schematic structural diagram of a flash memory chip of the solid-state hard disk in FIG. 2;

FIG. 4 is a schematic diagram of a flash memory translation layer corresponding to the flash memory chip in FIG. 3;

FIG. 5 is a schematic diagram of a fixed input compression mode;

FIG. 6 is a schematic diagram of a fixed output compression mode provided by an embodiment of the present application;

FIG. 7 is a schematic diagram of a data block index provided by an embodiment of the present application;

FIG. 8 is a schematic diagram of a data block index in an existing extendable read-only file system;

FIG. 9 is a schematic flow chart of a data compression method provided by an embodiment of the present application;

FIG. 10 is a schematic flow chart of updating a data block index provided by an embodiment of the present application;

FIG. 11 is a schematic diagram of a data block index relationship during data compression provided by an embodiment of the present application;

FIG. 12 is a schematic flowchart of another data compression method provided by the embodiment of the present application;

FIG. 13 is a schematic diagram of a data block index relationship during an overwrite or read process provided by an embodiment of the present application;

FIG. 14 is a schematic flow chart of data reading provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a data compression device provided by an embodiment of the present application.

Detailed description

The terms "including" and "having" and any variations thereof mentioned in the description of the present application are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes other unlisted steps or units, or optionally also includes other steps or units inherent to the process, method, product or device.

It should be noted that, in the embodiments of the present application, words such as "exemplary" or "for example" are used as examples, illustrations or descriptions. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete manner.

In the description of the present application, unless otherwise specified, "plurality" means two or more. "And/or" in this document describes an association between related objects and covers three cases; for example, "A and/or B" can mean that A exists alone, that A and B exist at the same time, or that B exists alone.

For ease of understanding, the following first introduces related terms and concepts that may be involved in the embodiments of the present application.

FIG. 1a shows a block diagram of an operating system.

An operating system (OS) is a computer program that manages computer hardware and software resources; examples include the Unix, Windows, and Linux operating systems. The operating system handles basic tasks such as managing and configuring memory, prioritizing the supply of and demand for system resources, controlling input and output devices, operating the network, and managing the file system. The operating system also provides an interface for the user to interact with the system.

The operating system kernel refers to the core part of most operating systems. It consists of those parts of the operating system used to manage memory, files, peripherals, and system resources. It is responsible for managing the system's processes, memory, device drivers, files, and network systems, and determines the performance and stability of the operating system. The operating system kernel is a system software that provides functions such as hardware abstraction layer, disk and file system control, and multitasking. It provides secure access to computer hardware for many applications, and it can determine when and how long an application will operate on a certain part of the computer's hardware. Since it is very complicated to directly operate on computer hardware, the operating system kernel can provide a set of hardware abstraction methods to complete these operations.

The file system is a core module, that is, a main component, of the operating system kernel. It is a method of organizing files on a storage device and is responsible for managing and storing file information: it lets users create, store, read, modify, and dump files, controls file access, and removes files when users no longer use them.

The file system provides an abstract representation of files in the kernel, maps files to physical storage devices (such as disks and hard disks), and maps the physical addresses of files on storage devices to user-visible path names and file names, which facilitates fast reading, modification, and persistence of file data.

File systems include read-write file systems and read-only file systems. A read-write file system is a file system that can write files to storage devices and read files from storage devices, such as: file allocation table (file allocation table, FAT), high performance file system (high performance file system, HPFS) , new technology file system (NTFS), fourth extended file system (fourth extended file system, EXT4), flash friendly file system (flash friendly file system, F2FS), etc. A read-only file system is a file system that can only read files from a storage device, but cannot write files to a storage device, such as an extendable read-only file system (EROFS).

In order to make the present application more clear, an application scenario of the present application is described first.

FIG. 1b shows a schematic structural diagram of a storage system.

In the application scenario shown in Figure 1b, users access data through applications. The computers running these applications are called "application servers". The application server 100 may be a physical machine or a virtual machine. Physical application servers include, but are not limited to, desktops, servers, laptops, and mobile devices. The application server accesses the storage system through the optical fiber switch 110 to access data. However, the switch 110 is only an optional device, and the application server 100 can also directly communicate with the storage system 120 through the network. Alternatively, the optical fiber switch 110 can also be replaced with an Ethernet switch, an InfiniBand switch, a RoCE (RDMA over Converged Ethernet) switch, or the like.

The storage system 120 shown in FIG. 1b is a centralized storage system. The so-called centralized storage system refers to a central node composed of one or more master devices, where data is stored centrally, and all data processing services of the entire system are centrally deployed on this central node. In other words, in the centralized storage system, the terminal or client is only responsible for the input and output of data, while the storage and control processing of data is completely handed over to the central node. The characteristic of the centralized storage system is that there is a unified entrance, and all data from external devices must pass through this entrance, and this entrance is the engine 121 of the centralized storage system. The engine 121 is the most core component in the centralized storage system, where many advanced functions of the storage system are implemented.

As shown in FIG. 1b, there are one or more controllers in the engine 121. FIG. 1b takes an engine with two controllers as an example. There is a mirror channel between controller 0 and controller 1, so when controller 0 writes a piece of data into its memory 124, it can send a copy of the data to controller 1 through the mirror channel, and controller 1 stores the copy in its own local memory 124. Controller 0 and controller 1 therefore back each other up: when controller 0 fails, controller 1 can take over the services of controller 0, and when controller 1 fails, controller 0 can take over the services of controller 1, which prevents a hardware failure from making the entire storage system 120 unavailable. When four controllers are deployed in the engine 121, there is a mirror channel between any two controllers, so any two controllers back each other up.

The engine 121 also includes a front-end interface 125 and a back-end interface 126. The front-end interface 125 is used to communicate with the application server 100 to provide storage services for the application server 100. The back-end interface 126 is used to communicate with the hard disk 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can be connected to more hard disks 134, thereby forming a very large storage resource pool.

According to the type of communication protocol between the engine 121 and the disk enclosure 130, the disk enclosure 130 may be a SAS disk enclosure, or an NVMe disk enclosure, an IP disk enclosure, or other types of disk enclosures. The SAS hard disk enclosure adopts the SAS3.0 protocol, and each enclosure supports 25 SAS hard disks. The engine 121 is connected to the hard disk enclosure 130 through an onboard SAS interface or a SAS interface module. The NVMe disk enclosure is more like a complete computer system, and the NVMe disk is inserted into the NVMe disk enclosure. The NVMe disk enclosure is then connected to the engine 121 through the RDMA port.

In terms of hardware, as shown in FIG. 1b, the controller 0 includes at least a processor 123 and a memory 124. The processor 123 is a central processing unit (central processing unit, CPU), used for processing data access requests from outside the storage system (from a server or other storage systems), and also used for processing requests generated inside the storage system. Exemplarily, when the processor 123 receives a write data request sent by the application server 100 through the front-end interface 125, it temporarily saves the data in the write data request in the memory 124. When the total amount of data in the memory 124 reaches a certain threshold, the processor 123 sends the data stored in the memory 124 to the hard disk 134 for persistent storage through the back-end interface 126.

The memory 124 refers to an internal memory that exchanges data directly with the processor. It can read and write data at any time at high speed, and serves as temporary data storage for the operating system or other running programs. The memory includes at least two kinds of memory; for example, the memory can be random access memory. For example, the random access memory is dynamic random access memory (Dynamic Random Access Memory, DRAM) or storage class memory (Storage Class Memory, SCM). DRAM is a semiconductor memory which, like most random access memory (Random Access Memory, RAM), is a volatile memory device. SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory. Storage class memory provides faster read and write speeds than hard disks, but its access speed is slower than that of DRAM, and its cost is lower than that of DRAM. However, DRAM and SCM are only exemplary illustrations in this embodiment, and the memory may also include other random access memories, such as static random access memory (Static Random Access Memory, SRAM). In addition, the memory 124 can also be a dual in-line memory module (Dual In-line Memory Module, DIMM for short), that is, a module composed of dynamic random access memory (DRAM), or a solid-state disk (Solid State Disk, SSD). In practical applications, multiple memories 124 and different types of memories 124 may be configured in the controller 0. This embodiment does not limit the quantity and type of the memory 124. In addition, the memory 124 can be configured to have a power-protection function. The power-protection function means that the data stored in the memory 124 is not lost when the system is powered off and then powered on again. Memory with a power-protection function is called non-volatile memory.

Exemplarily, both the memory 124 and the hard disk 134 may be a solid-state disk (English: Solid-state drive or Solid-state disk, SSD for short), which is a storage device mainly using flash memory (NAND Flash) as a permanent memory. As shown in FIG. 2, the SSD 200 includes a NAND flash memory and a main controller (referred to as main control) 201. NAND flash memory includes multiple flash memory chips 205 for storing data. The main control 201 is the brain center of the SSD, responsible for some complex tasks, such as managing data storage, maintaining SSD performance and service life, and so on. The main control 201 is an embedded microchip, which includes a processor 202, and its function is like a command center, sending out all operation requests of the SSD. For example, the processor 202 can perform functions such as reading/writing data, garbage collection, and wear leveling through firmware in the buffer.

The SSD main control 201 also includes a host interface 204 and several channel controllers. The host interface 204 is used for communicating with the host. The host here can refer to any device such as a server, a personal computer, or an array controller. Through the channel controllers, the main control 201 can operate multiple flash memory chips 205 in parallel, thereby increasing the underlying bandwidth. For example, assuming that there are 8 channels between the main control 201 and the flash memory chips, the main control 201 reads and writes data to 8 flash memory chips 205 in parallel through these 8 channels.

As shown in Figure 3, a die is a package of one or more flash memory chips. A die can contain multiple planes. Multi-plane NAND is a design that can effectively improve performance. As shown in Figure 3, a die is divided into two planes, and the block numbers in the two planes are interleaved (odd and even), so odd and even blocks can be operated on alternately to improve performance. A plane contains multiple blocks (block), and a block consists of several pages (page). Taking a flash memory chip with a capacity of 16Gb (2GB) as an example, every 4314*8=34512 cells logically form a page; each page can store 4KB of content and 218B of ECC check data, and a page is also the smallest unit of IO operations. Every 128 pages form a block, and every 2048 blocks form a plane. A whole flash memory chip is composed of two planes: one plane stores the odd-numbered blocks, the other stores the even-numbered blocks, and the two planes can be operated in parallel. This is just an example; the size of the page, the capacity of the block, and the capacity of the flash memory chip may have different specifications, which are not limited in this embodiment.
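The page, block, and plane figures in this example can be checked with a few lines of arithmetic. The following is only a sketch: the constant names are illustrative, and it assumes one bit per cell, which is what the 4314*8 figure implies.

```python
# Geometry taken from the example in the text; names are illustrative only.
DATA_BYTES_PER_PAGE = 4 * 1024   # 4KB of content per page
ECC_BYTES_PER_PAGE = 218         # ECC check data per page
PAGES_PER_BLOCK = 128
BLOCKS_PER_PLANE = 2048
PLANES_PER_CHIP = 2

# Assumption: one bit per cell, so cells per page = total page bytes * 8.
cells_per_page = (DATA_BYTES_PER_PAGE + ECC_BYTES_PER_PAGE) * 8
block_bytes = DATA_BYTES_PER_PAGE * PAGES_PER_BLOCK
plane_bytes = block_bytes * BLOCKS_PER_PLANE
chip_bytes = plane_bytes * PLANES_PER_CHIP

print(cells_per_page)              # 34512 cells per page
print(block_bytes // 1024)         # 512 (KB of content per block)
print(plane_bytes // 2**30)        # 1 (GB of content per plane)
print(chip_bytes * 8 // 2**30)     # 16 (Gbit of content per chip)
```

With these figures, the content capacity of the chip works out to 2GB, that is, 16Gbit.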

The host writes data into blocks; when a block is full, the SSD main control 201 selects the next block to continue writing. A page is the smallest unit of data writing; in other words, the main control 201 writes data into a block at page granularity. A block is the smallest unit of data erasure: when the main control erases data, it can only erase an entire block at a time.

The host accesses the SSD through logical block addresses (Logical Block Address, LBA). Each LBA represents a sector (512B, for example), while inside the SSD the main control accesses the flash memory in units of pages (4KB, for example). Therefore, every time the application server writes a piece of data, the SSD main control finds a page to write the data into. The address of the page is called the physical block address (Physical Block Address, PBA). The SSD internally records a mapping from LBA to PBA. With such a mapping, the next time the host needs to read the data of a certain LBA, the SSD knows where in the flash memory chip to read the data from. FIG. 4 is a schematic diagram of a flash translation layer (Flash Translation Layer, FTL), and the FTL is located in the firmware of the processor 202. As shown in Figure 4, every time the host writes new data, a new mapping relationship is generated, and this mapping relationship is added to the FTL (first write) or replaces an existing one in the FTL (overwrite). When reading data, the SSD first looks up in the FTL the PBA corresponding to the LBA of the data, and then reads the corresponding data according to the PBA.

The flash memory chip does not support overwriting in place. This means that when the host modifies the data of a certain LBA, the data cannot be changed directly at the PBA corresponding to this LBA; it must be written to a new PBA, and a mapping is added in the FTL. For example, there is a mapping relationship between LBA D and PBA D in the FTL. When the host sends an IO request to modify the data of LBA D, the SSD looks for a new location (PBA E) to write the data, and adds the mapping relationship between LBA D and PBA E to the FTL. This causes the data at PBA D to become invalid data. Invalid data (also called garbage data) refers to data that is not pointed to by any mapping relationship. Users will no longer access the flash space holding this data, because the old mapping has been replaced by a new mapping relationship. As the application server continues to write, the available flash storage space gradually decreases until it is exhausted. If the garbage data is not cleared in time, the host can no longer write. The SSD therefore has an internal garbage collection mechanism. Its basic principle is to move the valid data in several blocks to a new block, and then erase those blocks to produce new usable blocks.
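The LBA-to-PBA mapping and the out-of-place update described above can be illustrated with a toy page-level FTL. This is a minimal sketch under simplifying assumptions, not the firmware of any real SSD: the class name SimpleFTL and its fields are hypothetical, pages are allocated strictly in order, and garbage collection is not implemented.

```python
class SimpleFTL:
    """Toy page-level FTL: an LBA -> PBA map with out-of-place updates."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages   # physical pages, indexed by PBA
        self.mapping = {}                 # LBA -> PBA
        self.invalid = set()              # PBAs that hold garbage data
        self.next_free = 0                # next unwritten PBA

    def write(self, lba, data):
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free pages; garbage collection needed")
        pba = self.next_free
        self.next_free += 1
        self.flash[pba] = data            # always write to a new page
        old_pba = self.mapping.get(lba)
        if old_pba is not None:
            self.invalid.add(old_pba)     # old page is no longer referenced
        self.mapping[lba] = pba           # add or replace the mapping
        return pba

    def read(self, lba):
        # Look up the PBA for this LBA, then read the page at that PBA.
        return self.flash[self.mapping[lba]]


ftl = SimpleFTL(num_pages=8)
ftl.write(13, b"v1")   # first write: LBA 13 -> PBA 0
ftl.write(13, b"v2")   # overwrite: LBA 13 -> PBA 1, PBA 0 becomes invalid
print(ftl.read(13), ftl.mapping[13], sorted(ftl.invalid))   # b'v2' 1 [0]
```

A garbage collector built on this structure would copy the pages still referenced by `mapping` out of a block and then erase the block as a whole.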

In addition, the memory 124 also stores a software program, and the processor 123 runs the software program in the memory 124 to manage the hard disk. For example, hard disks are abstracted into storage resource pools, and then divided into LUNs for use by servers. The LUN here is actually the hard disk seen on the server. Of course, some centralized storage systems are also file servers themselves, which can provide shared file services for servers.

Data stored in memory 124 may be represented by a file system. The file system is a structured form of data file storage and organization. All data in a computer consists of 0s and 1s, and a series of 0/1 combinations stored on a hardware medium cannot by itself be distinguished or managed. Therefore, the concept of a "file" is used to organize these data: data used for the same purpose is organized into different types of files according to the structure required by different applications. Different suffixes are usually used to denote different types, and each file is given a name that is easy to understand and remember. When there are many files, they are grouped according to a certain division method, and each group of files is placed in the same directory (or folder). Besides files, a directory may also contain subdirectories (or subfolders), and all files and directories form a tree structure. This tree structure has a dedicated name: file system (File System). There are many types of file systems; common ones are FAT/FAT32/NTFS on Windows and EXT2/EXT3/EXT4/XFS/BtrFS on Linux. To facilitate lookup, the names of the directories, subdirectories, and file encountered when going from the root node down to the file itself are joined with a special character (such as "\" for Windows/DOS and "/" for Unix-like systems); such a string of characters is called a file path, such as "/etc/systemd/system.conf" in Linux or "C:\Windows\System32\taskmgr.exe" in Windows. A path is a unique identifier for accessing a specific file. For example, D:\data\file.exe under Windows is the path of a file, which represents the file.exe file in the data directory on the D partition.

The file system is built on the block device. The file system not only records the file path, but also records which blocks form a file, and which blocks record directory/subdirectory information. Different file systems have different organizational structures. To facilitate management, a block device such as a hard disk can usually be divided into multiple logical block devices, that is, hard disk partitions (Partition). Conversely, because the capacity and performance of a single medium are limited, multiple physical block devices can be combined into one logical block device through certain technical means, such as the various levels of RAID, JBOD, etc. File systems can also be built on top of these logical block devices. In any case, the application on the application server does not need to care about the specific location on the underlying block device of the file to be accessed. It only needs to send the file name/ID of the file to the file system, and the file system looks up the file path according to the file name/ID.

Common file access protocols are NFS, CIFS, or SMB, which are not limited in this embodiment.

The file system in this application is a read-write file system. A read-write file system is a file system that can write files to storage devices and read files from storage devices, such as FAT, HPFS, NTFS, EXT4, F2FS, etc.

A file system generally includes a metadata area and a data area, and the metadata area includes a super block and an index node (inode) area. The super block of the metadata area can include the control information, data structures, etc. of the file system, and the inode area of the metadata area can include the description information of files, such as file length and file type. The file type is, for example, a regular file (regular inode), a directory file (directory inode), a soft link (symbol link inode), or a special file (special inode). The data stored in the data area may be data obtained after file-level compression based on a lossless compression technology. The data in the data area is stored in the physical storage space of the storage medium (for example, disk or flash memory) as a set of disk blocks. The data of the same file may be stored in contiguous disk blocks, or may be interleaved across non-contiguous disk blocks.

It should be understood that the introduction of the concept of a disk block in the present application does not mean that the storage medium is only limited to a disk, and the disk block can be used to represent a small physical storage space obtained by dividing the physical storage space of the storage medium.

Of course, the storage system in this application may also include a distributed storage system. The so-called distributed storage system refers to a system that stores data dispersedly on multiple independent storage nodes. Traditional network storage systems use centralized storage arrays to store all data. The performance of storage arrays is not only the bottleneck of system performance, but also the focus of reliability and security, which cannot meet the needs of large-scale storage applications.

The above is a brief introduction of an application scenario of the present application.

In the above storage system, when devices are ranked by data read/write capability from strongest to weakest, the order is: central processing unit (central processing unit, CPU) >> double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM) > flash memory chip (flash). It can be seen that the bottleneck of data access in the storage system is the IO (input/output) time overhead of moving data between memory and flash.

In order to improve the overall IO read/write performance of the storage system, the files in the memory need to be compressed. The metadata area accounts for a small proportion of the entire file system, while the data area often occupies a relatively large share of the device storage capacity. Therefore, when data is written to flash, compressing the data and writing the compressed data to flash reduces the flash storage capacity occupied and prolongs the service life of the flash.

At present, read-write file systems on Linux, such as F2FS, the journalling flash file system version 2 (journalling Flash file system version2, JFFS2), and the B-tree file system (B-tree file system, BTRFS), and read-write file systems on Windows, such as NTFS, can use the following data compression method:

The original file data (or source data) that needs to be compressed is divided into fixed-size smallest compressible units (cluster) for compression, and the compressed file data includes header data and compressed data. The header data represents the attribute information of the file data; the compressed data represents the content of the file data. The compressed file data is then saved to flash, aligned to 4KB.

Exemplarily, in the schematic diagram of the fixed-input compression mode shown in Figure 5, 4 data blocks (blocks) with consecutive addresses are compressed as one cluster (cluster0) to obtain compressed file data composed of header data (header) and compressed data (compressed data). If the compressed file data is smaller than 4KB, it is still stored on flash with a size of 4KB.

Assume that the size of the original file data (or source data) shown in Figure 5 is 4 blocks, each block is 4KB in size, one block is one logical page, and the logical pages of the original file data are numbered 0, 1, 2, 3. The original file data is compressed into compressed file data at a compression ratio of 75%, so the size of the compressed file data is 12KB. The compressed file data therefore occupies 3 blocks, that is, 3 actual pages. The number of actual flash pages that need to be read in order to read a single logical page is shown in Table 1.

Table 1

Logical page number                  0    1    2    3
Flash pages that must be read        3    3    3    3

After the compressed file data is saved to flash, reading a target logical page of the original file data requires reading all 3 pages of the compressed file data and decompressing them before the target logical page can be read. For example, in a random read scenario, to read the original file data of logical page 0 from flash, all three pages of the compressed file data must be read and decompressed before the original file data of logical page 0 is obtained. Therefore, the data reading efficiency is:
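Assuming that read efficiency is defined as the size of the requested logical data divided by the size of the flash data that must actually be read (a definition assumed here for illustration, consistent with the surrounding discussion), the example above works out as:

```latex
\text{read efficiency}
  = \frac{\text{size of the requested logical page}}
         {\text{size of the flash data actually read}}
  = \frac{4\,\text{KB}}{3 \times 4\,\text{KB}}
  \approx 33.3\%
```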

It can be seen that the compressed file data obtained through the data compression method shown in FIG. 5 has low reading efficiency in a random reading scenario.

In order to solve the above technical problem, the embodiment of the present application provides a data compression method. The method obtains m data blocks in the data area of the readable and writable file system, and compresses the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, wherein the first capacity of each compressed data block is the same, and the first capacity represents the number of bytes of compressed data that the compressed data block can contain. Here, m and n are both positive integers greater than or equal to 1.

Wherein, the preset compression algorithm may be a compression algorithm corresponding to a fixed-output compression mode, such as the LZ4 (Lempel-Ziv 4) compression algorithm. Certainly, the preset compression algorithm may also be another compression algorithm, which is not specifically limited in this embodiment of the present application.
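The idea of a fixed-output compression mode (variable-size input, fixed-size output) can be sketched as follows. This is an illustration only and not the method of the embodiment: zlib from the Python standard library stands in for LZ4, the 4-byte length header and the names CAPACITY, STEP, compress_fixed_output, and decompress_fixed_output are hypothetical, and the input is grown greedily in small steps until the next step would overflow the fixed capacity.

```python
import random
import struct
import zlib

CAPACITY = 4096                 # fixed capacity of every compressed block
HEADER = struct.Struct("<I")    # hypothetical header: payload length in bytes
STEP = 256                      # granularity used when growing the input


def compress_fixed_output(data, capacity=CAPACITY, step=STEP):
    """Split data into variable-size inputs that each fit one fixed-size block."""
    blocks = []
    pos = 0
    while pos < len(data):
        end = min(pos + step, len(data))
        payload = zlib.compress(data[pos:end])
        if HEADER.size + len(payload) > capacity:
            raise ValueError("step too large for the block capacity")
        # Keep taking more input while header + compressed payload still fit.
        while end < len(data):
            nxt = min(end + step, len(data))
            candidate = zlib.compress(data[pos:nxt])
            if HEADER.size + len(candidate) > capacity:
                break
            end, payload = nxt, candidate
        block = HEADER.pack(len(payload)) + payload
        blocks.append(block.ljust(capacity, b"\0"))   # pad to the fixed size
        pos = end
    return blocks


def decompress_fixed_output(blocks):
    """Recover the original data from the fixed-size blocks."""
    out = bytearray()
    for block in blocks:
        (length,) = HEADER.unpack_from(block)
        out += zlib.decompress(block[HEADER.size:HEADER.size + length])
    return bytes(out)


rng = random.Random(0)
data = bytes(rng.choice(b"abcdefgh") for _ in range(16 * 1024))  # 16KB sample
blocks = compress_fixed_output(data)
print(len(blocks), {len(b) for b in blocks})
```

Every output block has the same size regardless of how much source data it covers, which is the property the fixed-output mode relies on; a production implementation would use a compressor that natively supports a target output size rather than recompressing repeatedly.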

Exemplarily, in an application scenario shown in Figure 6, assume that the size of the source data is 16KB. Taking 4KB of data as one data block, which is also one logical page, the logical pages of the source data are numbered 0, 1, 2, 3, as shown in the first row of Table 2.

Assume that the 16KB of source data in consecutive logical pages is divided into three parts of 6KB, 7KB, and 5KB (the 7KB part also contains blank data pages, which is why the parts total more than 16KB). Each part is compressed using the preset compression algorithm (for example, LZ4) so that each resulting compressed data block is 4KB in size.

The data pages of the compressed data block are three pages, which are numbered respectively: compressed page 4, compressed page 5 and compressed page 6 as shown in FIG. 6 .

Table 2

Logical page number                       0    1    2    3
Compressed pages that must be read        1    2    2    1

It can be seen that the source data of logical page 0 is compressed into compressed page 4, so it is compressed into 1 page. A part of source data of logical page 1 is compressed in compressed page 4, and another part of source data of logical page 1 is compressed in compressed page 5, so it is compressed into 2 pages. A part of source data of logical page 2 is compressed in compressed page 5, and another part of source data of logical page 2 is compressed in compressed page 6, so it is compressed into 2 pages. The source data of logical page 3 is all compressed in compressed page 6, so it is compressed into 1 page.
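The page counts above follow directly from how the logical pages are split across the three parts. The sketch below recomputes them; the piece sizes follow the division described for Figure 6 in this application (6KB, 7KB, and 5KB parts), the 2KB size of the blank piece is inferred from the 7KB total, and all names are illustrative.

```python
HOLE = None  # marker for blank (hole) data that belongs to no logical page

# Each part lists (logical_page, kilobytes) pieces.
parts = [
    [(0, 4), (1, 2)],              # 6KB part -> compressed page 4
    [(1, 2), (2, 3), (HOLE, 2)],   # 7KB part -> compressed page 5
    [(2, 1), (3, 4)],              # 5KB part -> compressed page 6
]
FIRST_COMPRESSED_PAGE = 4

# For each logical page, collect the compressed pages that hold its data.
pages_for = {}
for i, part in enumerate(parts):
    for logical_page, _kb in part:
        if logical_page is not HOLE:
            pages_for.setdefault(logical_page, []).append(FIRST_COMPRESSED_PAGE + i)

for logical_page in sorted(pages_for):
    print(logical_page, pages_for[logical_page], len(pages_for[logical_page]))
# 0 [4] 1
# 1 [4, 5] 2
# 2 [5, 6] 2
# 3 [6] 1
```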

Therefore, in a random read scenario, any one or more of the above logical pages may be read. Exemplarily, when reading logical page 0, only one compressed page is required to be read, as shown in the second row and second column of Table 2. After decompression, all data of logical page 0 can be obtained.

At this time, the read efficiency can be calculated according to the following formula 2:

The read efficiency of reading logical page 3 is the same as that of reading logical page 0.

When logical page 1 is read, two compressed pages need to be read, as shown in the second row and third column of Table 2. All the data of logical page 1 can be obtained after the data of compressed page 4 and compressed page 5 are decompressed.

At this time, the read efficiency can be calculated according to the following formula 3:

The read efficiency of reading logical page 2 is the same as that of reading logical page 1.

In addition, the average reading efficiency of 4 logical pages can be calculated according to the following formula 4:
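Under the assumption that read efficiency is the size of the requested logical page divided by the size of the compressed pages actually read, and that the average is the arithmetic mean of the four per-page values (both are assumptions made for illustration, not definitions stated in the text), the values for the layout of Figure 6 work out as:

```latex
\begin{aligned}
\eta_0 = \eta_3 &= \frac{4\,\text{KB}}{1 \times 4\,\text{KB}} = 100\% \\
\eta_1 = \eta_2 &= \frac{4\,\text{KB}}{2 \times 4\,\text{KB}} = 50\% \\
\bar{\eta} &= \frac{100\% + 50\% + 50\% + 100\%}{4} = 75\%
\end{aligned}
```

If the average is instead taken over total bytes (16KB requested against 24KB read), the result is about 66.7%. Under either reading, the value is well above that of the fixed-input layout of Figure 5, where every read fetches 3 pages (4KB / 12KB, about 33.3% under the same definition).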

From the read efficiency calculated by formula 2, formula 3, and formula 4, it can be seen that in a random read scenario, the read efficiency of the data compression method shown in Figure 6 is much greater than the read efficiency of the data compression method shown in Figure 5.

It can be seen that in the embodiment of the present application, m data blocks in the data area of the readable and writable file system are compressed using a compression algorithm corresponding to a fixed-output compression mode, and n compressed data blocks containing the same number of bytes are obtained in sequence, so that each output compressed data block has a fixed size. When a data block is read, the reading efficiency is effectively improved, and data reading in a random read scenario is completed with a small read amplification factor.

In addition, FIG. 8 shows the data block indexing method in the existing enhanced read-only file system (Enhanced Read-Only File System, erofs). In the data block address array data_addr, every entry is a block address that points to the address of an actual data block. When erofs builds an image and compresses data using the method shown in Figure 5, the layout on the storage device (for example, a disk) and the file content are fixed, so modification of files cannot be supported. However, in actual user scenarios, many compressed files on the storage device may need to be modified frequently, and erofs cannot meet this requirement.

The above data block index can be used to find the corresponding data block. It can be understood that the data block index is also an inode, that is, metadata. The inode is the area used to store metadata, that is, file-related attribute information such as the creator of the file, the creation date, the size, and the location of the data blocks. Each inode has a number, and the operating system uses different inode numbers to identify different files. Exemplarily, on the surface the user opens a file through its file name. In fact, the system first finds the corresponding inode number according to the file name, then obtains the inode information through the inode number, and then finds the address of the data block according to the inode information and reads the data.

That is to say, the inode records the attributes of the file and the actual storage location of the file, that is, the block numbers (block number) of its data blocks (a common block size is 4KB), and the file can be located through the inode. This structure is called an inode in Linux and a vnode in Unix. Basically, the information contained in the inode includes at least the following: (1) the type of the file; (2) the file access permissions; (3) the owner and group of the file; (4) the size of the file; (5) the number of links, that is, the total number of file names pointing to the inode; (6) the state change time (ctime), the latest access time (atime), and the latest modification time (mtime) of the file; (7) the special attributes of the file: SUID, SGID, and SBIT; (8) the pointer to the actual content of the file.
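Most of the inode fields listed above can be inspected from user space. The following sketch uses the standard os.stat call on a temporary file; the block pointers of item (8) are internal to the file system and are not exposed through this interface.

```python
import os
import stat
import tempfile

# Create a small temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"hello")
    path = f.name

info = os.stat(path)
print("inode number:", info.st_ino)
print("is a regular file:", stat.S_ISREG(info.st_mode))        # (1) file type
print("permissions:", oct(stat.S_IMODE(info.st_mode)))         # (2), (7)
print("owner uid / group gid:", info.st_uid, info.st_gid)      # (3)
print("size in bytes:", info.st_size)                          # (4)
print("link count:", info.st_nlink)                            # (5)
print("ctime, atime, mtime:",
      info.st_ctime, info.st_atime, info.st_mtime)             # (6)

os.remove(path)
```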

As shown in FIG. 8, the existing data block index format does not support extensibility; for example, it does not support append writing, block reservation, truncate, and the like. Append writing refers to adding new content to the original file without deleting its existing content. Block reservation means that the file system considers in advance which disk blocks can be allocated if the file grows, and reserves these disk blocks. Truncate refers to modifying files, such as deleting or adding content.

Exemplarily, as shown in FIG. 8, the data block index is denoted by blk entry, abbreviated as blk for ease of description. blk1 is the index of compressed data block 1, and the address of compressed data block 1 on the storage device is stored in blk1. blk2 is the index of compressed data block 2, and the address of compressed data block 2 on the storage device is stored in blk2. blk3 is the index of compressed data block 3, and the address of compressed data block 3 on the storage device is stored in blk3. blk4 is the index of compressed data block 4, and the address of compressed data block 4 on the storage device is stored in blk4. Therefore, the location of a compressed data block on the storage device can be determined according to the address stored in its blk.

In order to enable the read-write file system to support writing, overwriting, pre-allocation, truncate, and so on, the data compression method provided by the embodiment of the present application further includes: establishing a first index of each data block among the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and recording the mapping relationship between the first index and the j data blocks. Here, i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m. The first index is used to identify the storage location, in the storage medium, of each data block among the j data blocks, and the attribute information contained in each data block among the j data blocks. Of course, the first index is also used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.

The attribute information includes at least one of the following:

The first attribute is used to indicate whether the storage location of the compressed data block into which the data block is compressed is pre-allocated.

The second attribute is used to indicate whether the data page of the data block is valid, that is, whether it is a normal data page or a hole data page, where a hole data page can be understood as a blank data page.

The third attribute is used to indicate whether the data page of the data block is the first compressed page of the compressed data block of the data block.

The fourth attribute is used to indicate whether the data page of the data block is included in the compressed data pages of two compressed data blocks.

The fifth attribute is used to indicate whether the data page of the data block is a compressed page of the compressed data block obtained after the data block is compressed.

The sixth attribute is used to indicate the index address of the compressed data block where the data page of the data block is located.

The seventh attribute takes one of two values. When the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block. When the data page of the data block is not the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.

Exemplarily, as shown in FIG. 7 , the first index of the data block includes: blk entry, which corresponds to storing the address of the data block or compressed data block; and extent entry, which corresponds to storing extended attribute information. Among them, each extent entry corresponds to a blk entry one by one, and each data page has a corresponding extent entry and blk entry.

Among them, the collection representation of Extent entry members is as follows:

Exemplarily, the members included in the data block index may be as shown in set A. It should be noted that each data page has a corresponding set A.

The meaning of set members is explained as follows:

is_reserved is the first attribute mentioned above.

is_valid is the above-mentioned second attribute.

first_page is the above-mentioned third attribute.

cross_block is the fourth attribute mentioned above.

is_compress is the fifth attribute mentioned above.

blkidx is the sixth attribute mentioned above.

ofs is the seventh attribute mentioned above.
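The pairing of a blk entry with an extent entry can be pictured as a small record per data page. The sketch below is one possible in-memory representation only: the member names come from the description above, while the class names, field types, and example values are assumptions, and the on-disk bit layout is not specified here.

```python
from dataclasses import dataclass


@dataclass
class BlkEntry:
    addr: int            # address of the data block or compressed data block


@dataclass
class ExtentEntry:
    is_reserved: bool    # first attribute: storage location is pre-allocated
    is_valid: bool       # second attribute: normal page (True) or hole page (False)
    first_page: bool     # third attribute: first compressed page of the block
    cross_block: bool    # fourth attribute: page lies in two compressed blocks
    is_compress: bool    # fifth attribute: page is a compressed page
    blkidx: int          # sixth attribute: index address of the compressed block
    ofs: int             # seventh attribute: offset, or distance to the first page


@dataclass
class FirstIndex:
    blk: BlkEntry        # one blk entry ...
    extent: ExtentEntry  # ... paired one-to-one with one extent entry


# Illustrative values only; they do not describe a specific page of Figure 6.
index = FirstIndex(
    blk=BlkEntry(addr=0x1000),
    extent=ExtentEntry(is_reserved=False, is_valid=True, first_page=True,
                       cross_block=False, is_compress=True, blkidx=0, ofs=0),
)

# Because the attributes live in the index, a later modification of the file
# can be reflected by updating them, for example marking the page as a hole.
index.extent.is_valid = False
print(index)
```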

It can be seen that when applying the method shown in FIG. 6 to compress data in the read-write file system, the attributes contained in the index of the data block shown in FIG. 7 can be modified, so that the compressed file on the storage device can be modified.

The data compression method provided by the embodiment of the present application will be described below with reference to specific examples.

FIG. 9 is a schematic flowchart of a data compression method provided by an embodiment of the present application. As shown in Figure 9, the method includes:

S901. Obtain m data blocks in the data area of the readable and writable file system, where m is a positive integer greater than or equal to 1.

The m data blocks can be understood as data blocks that need to be written back. Wherein, write-back may refer to writing data into memory first for caching during a write operation, but not immediately writing data into a storage device (for example: disk). The data cached in the memory will be written to the storage device only under some specific conditions or operations (for example: a refresh mechanism, a synchronization (sync) operation, etc.).
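Write-back behaviour as described here can be sketched with a small cache class. This is a simplified illustration under stated assumptions: the class WriteBackCache and its parameters are hypothetical, a plain dictionary stands in for the storage device, and the flush condition is a simple count of cached blocks.

```python
class WriteBackCache:
    """Caches writes in memory and writes them back only on sync or threshold."""

    def __init__(self, device, flush_threshold):
        self.device = device                  # dict standing in for the disk
        self.flush_threshold = flush_threshold
        self.dirty = {}                       # block number -> cached data

    def write(self, block_no, data):
        self.dirty[block_no] = data           # cached only, device untouched
        if len(self.dirty) >= self.flush_threshold:
            self.sync()                       # refresh condition reached

    def sync(self):
        # The cached blocks are the blocks that need to be written back.
        self.device.update(self.dirty)
        self.dirty.clear()


device = {}
cache = WriteBackCache(device, flush_threshold=3)
cache.write(0, b"a")
cache.write(1, b"b")
print(device)          # {} : nothing has reached the device yet
cache.sync()
print(sorted(device))  # [0, 1]
```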

S902. Compress the m data blocks using a preset compression algorithm to sequentially obtain n compressed data blocks, wherein the first capacity of each compressed data block is the same, the first capacity represents the number of bytes of compressed data that the compressed data block can contain, and n is a positive integer greater than or equal to 1.

Wherein, the preset compression algorithm may be the LZ4 compression algorithm, and of course, other fixed-output compression algorithms may also be used, which are not specifically limited in this embodiment of the present application.

Wherein, m can be any positive integer. For example, m is 4, m is 10, or m is 20.

S902 can be specifically implemented as:

S9021. Allocate each data block in the m data blocks to the first set sequentially in a preset order.

The preset order may be the order of consecutive storage addresses. That is, the m data blocks form a contiguous sequence.

This first set may be referred to as the smallest compressible unit (cluster). In other words, the first set is the smallest compressible set of data blocks. For example, a 6kb data block set, a 7kb data block set, and a 5kb data block set are shown in FIG. 6 .

Exemplarily, the m data blocks are mapped to a segment of continuous addresses in the storage medium. Taking one data block as the starting point, fixed-size data sets are divided off in sequence according to the order of the addresses to which the data blocks are mapped in the storage medium. As shown in Figure 6, data block 0 and 1/2 of the data of data block 1 form a 6kb data set; 1/2 of the data of data block 1, 3/4 of the data of data block 2, and blank data pages form a 7kb data set; and 1/4 of the data of data block 2 and data block 3 form a 5kb data set.
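The sequential division into sets can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `split_into_sets`, the single fixed `rated_capacity`, the numeric values, and the handling of a trailing, partially filled set are all hypothetical and are not specified by the embodiment.

```python
from typing import List, Tuple

# (block number, start offset inside the block, end offset inside the block)
Piece = Tuple[int, int, int]


def split_into_sets(m: int, block_size: int, rated_capacity: int) -> List[List[Piece]]:
    """Walk m contiguous blocks in address order and cut them into sets."""
    sets: List[List[Piece]] = []
    current: List[Piece] = []
    filled = 0
    for blk in range(m):
        pos = 0
        while pos < block_size:
            # Take as much of this block as still fits into the current set.
            take = min(block_size - pos, rated_capacity - filled)
            current.append((blk, pos, pos + take))
            pos += take
            filled += take
            if filled == rated_capacity:
                # The set has reached its rated capacity and can be compressed.
                sets.append(current)
                current, filled = [], 0
    if current:
        # Assumption: a trailing, partially filled set is kept as its own set.
        sets.append(current)
    return sets


# Hypothetical sizes: four 4096-byte blocks and a 6144-byte rated capacity.
sets = split_into_sets(m=4, block_size=4096, rated_capacity=6144)
```

With these hypothetical sizes, the first set holds data block 0 and the first half of data block 1, and the second set holds the second half of data block 1 and data block 2, which mirrors how one data block may be split across two sets.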

S9022. Determine whether the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, where j is a positive integer greater than or equal to 1 and less than or equal to m. If the data capacity of the j data blocks is not equal to the rated capacity of the first set, execute S9021; if the data capacity of the j data blocks is equal to the rated capacity of the first set, execute S9023.

S9023. Perform a fixed compression operation on the j data blocks in the first set according to the set compression threshold, and obtain the i-th compressed data block.

Wherein, the set compression threshold characterizes the compression rate. Exemplarily, the set compression threshold may be expressed as:

set compression threshold = total data length - total data length * compression rate

S9024. Determine whether the total data length of the j data blocks is greater than the sum of the header data of the ith compressed data block, the total data length of the compressed data, and the set compression threshold. If yes, execute S903. Otherwise, commit the source data page to flash.
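The threshold expression and the check of S9024 can be worked through with a short sketch. The function names and the numeric values (a 6144-byte set, a compression rate of 0.75, a 16-byte header) are hypothetical and chosen only to make the arithmetic visible.

```python
def set_compression_threshold(total_len: int, compression_rate: float) -> int:
    # set compression threshold = total data length - total data length * compression rate
    return int(total_len - total_len * compression_rate)


def keep_compressed(total_len: int, header_len: int, compressed_len: int, threshold: int) -> bool:
    # S9024: keep the compressed result only if the source data is longer than
    # header + compressed data + set compression threshold.
    return total_len > header_len + compressed_len + threshold


# Hypothetical numbers for illustration.
threshold = set_compression_threshold(total_len=6144, compression_rate=0.75)
good = keep_compressed(6144, header_len=16, compressed_len=4096, threshold=threshold)
poor = keep_compressed(6144, header_len=16, compressed_len=4700, threshold=threshold)
```

Here the threshold is 1536 bytes. A 4096-byte result passes the check (16 + 4096 + 1536 = 5648 is less than 6144), whereas a 4700-byte result does not (16 + 4700 + 1536 = 6252), in which case the source data page would be committed uncompressed.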

S903. Establish the first index of each data block in the j data blocks corresponding to the i-th compressed data block in the n compressed data, and record the mapping relationship between the first index and the j data blocks; wherein, i is a positive integer greater than or equal to 1 and less than or equal to n; j is a positive integer greater than or equal to 1 and less than or equal to m.

It should be understood that each time a compressed data block is obtained by compression, an index is established for each of the data blocks corresponding to that compressed data block.

Wherein, the first index is used to identify the storage location of each data block contained in the j data blocks in the storage medium, and the attribute information contained in each data block contained in the j data blocks.

Exemplarily, taking the f2fs read-write file system of Linux as an example, the first index format of the data block on f2fs can be:

The data structure of the first index containing attribute information may be:

For example, Entry data structure:

Among them, the attribute information may include at least one of the following:

The first attribute (is_reserved) is used to represent whether the storage location of the compressed data block where the data block is compressed is pre-allocated;

The second attribute (is_valid) is used to represent whether the data page of the data block is valid;

The third attribute (first_page) is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block;

The fourth attribute (cross_block) is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks;

The fifth attribute (is_compress) is used to represent whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed;

The sixth attribute (blkidx) is used to represent the index address of the compressed data block where the data page of the data block is located.

The seventh attribute (ofs) is used to represent the following: when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block; when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
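The seven attributes can be pictured as fields of one index entry. The following is only a sketch: the class name `IndexEntry`, the use of a Python dataclass, and the default values are assumptions, and the actual on-disk layout and bit widths of the index are not reproduced here.

```python
from dataclasses import dataclass

FLAG_FIELDS = ("is_reserved", "is_valid", "first_page", "cross_block", "is_compress")


@dataclass
class IndexEntry:
    """One per-data-block index entry (hypothetical in-memory form)."""
    is_reserved: int = 0  # first attribute: is the compressed block's location pre-allocated
    is_valid: int = 0     # second attribute: is the data page valid
    first_page: int = 0   # third attribute: is this the first compressed page
    cross_block: int = 0  # fourth attribute: does the page span two compressed blocks
    is_compress: int = 0  # fifth attribute: is this a compressed page
    blkidx: int = 0       # sixth attribute: index address of the compressed data block
    ofs: int = 0          # seventh attribute: offset in the set, or distance to the first page

    def __post_init__(self) -> None:
        # The five flag attributes only take the values 0 and 1.
        for name in FLAG_FIELDS:
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")


entry = IndexEntry(is_valid=1, first_page=1, blkidx=0, ofs=0)
```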

FIG. 10 is a schematic flowchart of updating a data block index provided by the embodiment of the present application. In a specific implementable manner, as shown in Figure 10, the attribute information may include a third attribute (first_page) and a seventh attribute (ofs), and S903 may specifically be implemented as:

S1031. Determine whether the data page of each data block in the j data blocks is the first compressed page of the i-th compressed data block. If yes, the attribute value of the third attribute is assigned a value of 1; if not, the attribute value of the third attribute is assigned a value of 0.

S1032. When the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in its corresponding first set.

S1033. When the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the i-th compressed data block.

Of course, the attribute information may also include a fourth attribute (cross_block), and S903 may be specifically implemented as:

S1034. Determine whether the data page of each data block in the j data blocks is included in the compressed data pages of the two compressed blocks. If yes, the attribute value of the fourth attribute is assigned a value of 1; if not, the attribute value of the fourth attribute is assigned a value of 0.

Of course, the attribute information may also include a second attribute (is_valid), and S903 may be specifically implemented as:

S1035. Determine whether the data page of each data block in the j data blocks is valid. If yes, the attribute value of the second attribute is assigned a value of 1; if not, the attribute value of the second attribute is assigned a value of 0.

Of course, the attribute information may also include a sixth attribute (blkidx), and S903 may be specifically implemented as:

S1036. Determine the index address of the compressed data block where the data page of each data block in the m data blocks is located.

Wherein, according to the order of the storage locations of the m data blocks in the memory, the compression is performed with the size of the smallest fixed compression unit (such as the first set). When the first compression is completed (that is, the first compressed data block is compressed), the data pages of the complete data block corresponding to the first compressed data block are all at the index positions of the first compressed data block. For example, the data blocks corresponding to the first compressed data block include partial data of data block 0, data block 1, data block 2, and data block 3. The complete data blocks corresponding to the first compressed data block are data block 0, data block 1 and data block 2. Therefore, the data pages of data block 0, data block 1, and data block 2 are at the index position of the first compressed data block.

It should be noted here that, except that the seventh attribute needs to be attached to the third attribute, the data block index update processes corresponding to other attributes are independent of each other. In this embodiment of the present application, there is no specific limitation on the order of the data block index update process corresponding to each of the first attribute, the second attribute, the fourth attribute, the fifth attribute and the sixth attribute.

Exemplarily, it is assumed that, as shown in FIG. 11 , m data blocks include data block 0 (ie, block0), data block 1 (ie, block1), data block 2 (ie, block2), and data block 3 (ie, block3). Among them, block0, block1, block2, and block3 map a continuous address in memory. When mapping a segment of consecutive addresses in the memory according to the order of block0, block1, block2, and block3 (compression direction from left to right in Figure 11), the compression is performed with the size of the smallest fixed compression unit (such as the first set). During compression:

When block0 and a part of block1 reach the minimum fixed compression unit (such as 4kb), the first compression is performed to obtain the first compressed data block (compress blk0). At this point, the data block index of block0 is established, as shown in Table 3:

Table 3

Combining with FIG. 10 , it can be obtained that the data page of block0 falls on the first compressed page of the first compressed data block, so first_page is assigned a value of 1. The data page of block0 only falls on the first compressed page of the first compressed data block, so the assignment value of cross_block is 0. The index address of the data page of block0 in the first compressed data block is the serial number of the first compressed data block (compress blk0), so the assignment value of blkidx is 0. The data page of block0 falls on the first compressed page of the first compressed data block, and the offset of block0 in its corresponding first set is 0, so ofs is assigned a value of 0. The data page of block0 is a valid data page, so the assignment of is_valid is 1.

When the remaining part of block1, block2, and a part of block3 reach the minimum fixed compression unit (such as 4kb), the second compression is performed to obtain the second compressed data block (compress blk1). At this point, the data block indexes of block1 and block2 are established, as shown in Table 4:

Table 4

Combining with FIG. 10, it can be obtained that the data page of block1 falls on the first compressed page of the second compressed data block, so first_page is assigned a value of 1. The data page of block1 falls on the compressed pages of the first compressed data block and the second compressed data block, so cross_block is assigned a value of 1. The index address of the data page of block1 in the second compressed data block is the serial number of the second compressed data block (compress blk1), so blkidx is assigned a value of 1. The data page of block1 falls on the first compressed page of the second compressed data block, and the offset of block1 in the set of data blocks is Ofs1, so ofs is assigned the value Ofs1. The data page of block1 is a valid data page, so is_valid is assigned a value of 1.

Similarly, the data page of block2 does not fall on the first compressed page of the second compressed data block, so first_page is assigned a value of 0. The data page of block2 only falls on the compressed page of the second compressed data block, so cross_block is assigned a value of 0. The index address of the data page of block2 in the second compressed data block is the serial number of the second compressed data block (compress blk1), so blkidx is assigned a value of 1. The data page of block2 does not fall on the first compressed page of the second compressed data block, and the distance between the data page of block2 and the first compressed page of the second compressed data block is 1, so ofs is assigned a value of 1. The data page of block2 is a valid data page, so is_valid is assigned a value of 1.

When the remaining part of block3 reaches the minimum fixed compression unit (such as 4kb), the third compression is performed to obtain the third compressed data block (compress blk2). At this point, the data block index of block3 is established, as shown in Table 5:

Table 5

Combining with FIG. 10, it can be obtained that the data page of block3 falls on the first compressed page of the third compressed data block, so first_page is assigned a value of 1. The data page of block3 falls on the compressed pages of the second compressed data block and the third compressed data block, so cross_block is assigned a value of 1. The index address of the data page of block3 in the third compressed data block is the serial number of the third compressed data block (compress blk2), so blkidx is assigned a value of 2. The data page of block3 falls on the first compressed page of the third compressed data block, and the offset of block3 in the set of data blocks is Ofs2, so ofs is assigned a value of Ofs2. The data page of block3 is a valid data page, so is_valid is assigned a value of 1.
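The assignments walked through for block0 to block3 can be reproduced with a small sketch. It rests on several assumptions that the text does not state: every set covers a fixed number of source bytes (`rated_capacity`, here a hypothetical 7168 bytes for 4096-byte blocks), blkidx is the set that holds the last byte of a block, a block is on the first page of a set when it contains the first byte of that set, and Ofs1 and Ofs2 are taken to be the byte position inside the block at which that set begins. The function name `build_index` is likewise hypothetical.

```python
from typing import Dict, List


def build_index(m: int, block_size: int, rated_capacity: int) -> List[Dict[str, int]]:
    """Derive first_page, cross_block, blkidx and ofs for m contiguous blocks."""
    if rated_capacity < block_size:
        # Assumption: a block touches at most two sets.
        raise ValueError("rated_capacity must be at least block_size")
    entries = []
    for blk in range(m):
        start = blk * block_size            # first byte of the block
        end = start + block_size - 1        # last byte of the block
        first_set = start // rated_capacity
        last_set = end // rated_capacity    # set in which the block is completed
        set_start = last_set * rated_capacity
        first_page = int(set_start >= start)  # block contains the first byte of the set
        if first_page:
            ofs = set_start - start           # position in the block where the set begins
        else:
            ofs = blk - set_start // block_size  # distance to the first block of the set
        entries.append({
            "first_page": first_page,
            "cross_block": int(first_set != last_set),
            "blkidx": last_set,
            "ofs": ofs,
            "is_valid": 1,
        })
    return entries


index = build_index(m=4, block_size=4096, rated_capacity=7168)
```

Under these assumptions the sketch yields (first_page, cross_block, blkidx) of (1, 0, 0) for block0, (1, 1, 1) for block1, (0, 0, 1) for block2 and (1, 1, 2) for block3, with ofs equal to 0, 3072 (standing in for Ofs1), 1 and 2048 (standing in for Ofs2), respectively.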

S904. Determine whether all m data blocks have been compressed. If the compression is complete, the compressed pages of the compressed data blocks are submitted to the device. If not, execute S902.

FIG. 12 is a schematic flowchart of a data compression method provided in the embodiment of the present application. In some embodiments, as shown in FIG. 12, before S902 is performed, the data compression method provided in the embodiment of the present application further includes:

S905. Obtain a second set of data to be overwritten.

S906. Determine whether the compressed data block is included in the second set of data to be overwritten. If yes, execute S907; if not, execute the existing data overwriting process.

Wherein, the second set may include p compressed data blocks, and p is a positive integer greater than or equal to 1.

S907. Obtain the compressed page of the first target compressed data block among the p compressed data blocks in the second set, and q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1.

S908. Determine the position offset of the first target data block within the q data blocks.

Specifically, the index address of each compressed data block in the second set may be read, and the compressed data block is decompressed to obtain each data block corresponding to the compressed data block. Then determine the position offset of each data block among the q data blocks.

S909. According to the first compressed page and the position offset of the first data block in the compressed page, get the data page of the first data block.

S910. Determine that the data page of the first target data block is the data page to be overwritten with data.

S911. Overwrite the second data block into the data page of the first data block.

S912. Allocate the second data block to the first set.
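The overwrite flow of S907 to S912 can be sketched as a decompress, overwrite and recompress sequence. This is a simplified illustration with several assumptions: `zlib` from the Python standard library stands in for the preset compression algorithm (the embodiment names LZ4 as one option), the function name `overwrite_in_compressed` is hypothetical, and the whole set is recompressed at once rather than re-entering the set allocation of S9021.

```python
import zlib

BLOCK_SIZE = 4096  # hypothetical data block length


def overwrite_in_compressed(compressed: bytes, position_offset: int, new_block: bytes) -> bytes:
    """Overwrite one data block inside a compressed data block and recompress it."""
    # S907: decompress to recover the q data blocks behind the compressed page.
    pages = bytearray(zlib.decompress(compressed))
    if len(new_block) != BLOCK_SIZE or position_offset + BLOCK_SIZE > len(pages):
        raise ValueError("block does not fit at the given position offset")
    # S908 to S911: the position offset locates the data page to be overwritten.
    pages[position_offset:position_offset + BLOCK_SIZE] = new_block
    # S912 (simplified): the updated data is compressed again.
    return zlib.compress(bytes(pages))


old = zlib.compress(b"A" * BLOCK_SIZE + b"B" * BLOCK_SIZE)
new = overwrite_in_compressed(old, position_offset=BLOCK_SIZE, new_block=b"Z" * BLOCK_SIZE)
```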

To sum up, in the read-write file system f2fs, using the fixed-output compression mode provided by this application to compress designated so files, vdex files, odex files, and the like can achieve the following benefits. For example, in the process of installing 40 applications on an electronic device, an average time benefit of 12% can be obtained for each application. During application installation, there is append writing to the so file and an overwriting process for the vdex file and odex file. Averaged over the 40 applications, an additional start-up benefit of 8% can be obtained compared with the fixed-input compression mode.

After data is compressed using the above data compression method provided by the embodiment of the present application, the data needs to be read. FIG. 14 is a schematic flowchart of data reading provided by the embodiment of the present application. As shown in Figure 14, the data reading process is as follows:

S141. Read the first index of the first data block to obtain the index address of the first compressed data block corresponding to the first data block, where the first index includes attribute information of the first data block.

Wherein, the attribute information of the first data block may include at least one of the first attribute to the seventh attribute in the foregoing embodiment. Exemplarily, taking the attribute information of the first data block including the third attribute (first_page), the fourth attribute (cross_block), the sixth attribute (blkidx) and the seventh attribute (ofs) as an example:

In overwrite and read-only scenarios, assume that the first data block is data block 2 (block2) shown in FIG. 13. Read attribute information such as ofs, cross_block, and blkidx of block2, and obtain the index address of the first compressed data block corresponding to block2. Specifically, following Table 4, the value of ofs read for block2 is 1; the data page of block2 does not fall on the first compressed page of the second compressed data block, and the distance between the data page of block2 and the first compressed page of that compressed data block is 1. The value of cross_block read for block2 is 0, so it can be determined that the data page of block2 only falls on the compressed page of the first compressed data block corresponding to block2, and does not fall on the compressed pages of other compressed data blocks. The value of blkidx read for block2 is 1, so it can be determined that the index address of the first compressed data block corresponding to block2 is the serial number of that compressed data block, and the index address is obtained as 1.

S142. Read the index of the first compressed data block corresponding to the first data block.

S143. Decompress the first compressed data block according to the index of the first compressed data block to obtain multiple data blocks corresponding to the first compressed data block, where the multiple data blocks include the first data block.

Specifically, the first compressed data block is found on the device according to the index of the first compressed data block. After the first compressed data block is found, the first compressed data block is parsed, and multiple parsed data blocks are obtained. For example, as shown in Figure 13, the first compressed data block is compress blk1, and after compress blk1 is parsed, the following is obtained: part of the data of data block 1 (block1), part of the data of data block 2 (block2), and part of the data of data block 3 (block3).

S144. Determine the offset of the first data block in the multiple decompressed data blocks.

Specifically, according to the attribute information of the first data block, it can be concluded that the first data block is data block 2 (block2) shown in FIG. 13. As shown in Figure 13, the expression of the offset (dstofs) of block2 in the multiple data blocks parsed from compress blk1 is:

dstofs = block_size - ofs1 % block_size

Among them, dstofs represents the offset of block2 in the multiple data blocks parsed from compress blk1; block_size represents the length of the data block; ofs1 represents the attribute value of the seventh attribute; ofs1 % block_size represents the remainder of ofs1 divided by block_size.

S145. Obtain the data of the first data block according to the offset of the first data block in the multiple decompressed data blocks.

Following the above example, the first data block is block2, as shown in FIG. 13 and Table 4, and the data of block2 can be obtained.
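Steps S141 to S145 can be traced end to end with a small sketch for the block2 case. The following assumptions are made for illustration only: `zlib` from the Python standard library stands in for the preset compression algorithm, each set covers a hypothetical 7168 source bytes, `device` and `index` are hypothetical in-memory stand-ins for the storage device and the first index, and ofs1 is taken to be the ofs value of the block on the first compressed page of the same compressed data block (block1, with the hypothetical value 3072).

```python
import zlib

BLOCK_SIZE = 4096       # hypothetical data block length
RATED_CAPACITY = 7168   # hypothetical number of source bytes per set

# Four source blocks filled with 'A', 'B', 'C' and 'D' respectively.
data = b"".join(bytes([65 + i]) * BLOCK_SIZE for i in range(4))
# Stand-in device: one compressed data block per set, addressed by blkidx.
device = [zlib.compress(data[s:s + RATED_CAPACITY]) for s in range(0, len(data), RATED_CAPACITY)]
# Stand-in first index for block1 and block2, following the values in the text.
index = {
    1: {"first_page": 1, "cross_block": 1, "blkidx": 1, "ofs": 3072},
    2: {"first_page": 0, "cross_block": 0, "blkidx": 1, "ofs": 1},
}


def read_block2_like(blk: int) -> bytes:
    """Read a block that is one page after the first page and spans no other set."""
    entry = index[blk]                                   # S141: read the first index
    if entry["first_page"] or entry["cross_block"] or entry["ofs"] != 1:
        raise NotImplementedError("only the block2-style case is sketched")
    raw = zlib.decompress(device[entry["blkidx"]])       # S142, S143: locate and decompress
    ofs1 = index[blk - entry["ofs"]]["ofs"]              # ofs of the first-page block
    dstofs = BLOCK_SIZE - ofs1 % BLOCK_SIZE              # S144: offset in the decompressed data
    return raw[dstofs:dstofs + BLOCK_SIZE]               # S145: data of the block


block2 = read_block2_like(2)
```

With these values dstofs is 4096 - 3072 = 1024, so block2 begins 1024 bytes into the decompressed data, immediately after the remaining part of block1. Only one compressed data block is decompressed to serve the read.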

Specifically, the communication system described in this possible design is used to perform the functions of each device in the data compression method shown in FIG. 9 , so the same effect as the above data compression method can be achieved.

Fig. 15 is a data compression device provided by the embodiment of the present application. The data compression device 1500 may include: a first acquisition unit 1501, configured to acquire m data blocks in the data area of the readable and writable file system, where m is a positive integer greater than or equal to 1. The compression unit 1502 is configured to compress the m data blocks using a preset compression algorithm to obtain n compressed data blocks in turn, wherein the first capacity of each compressed data block is the same, the first capacity represents the number of bytes of compressed data that the compressed data block can contain, and n is a positive integer greater than or equal to 1. The updating unit 1503 is configured to establish the first index of each data block in the j data blocks corresponding to the i-th compressed data block in the n compressed data, and record the mapping relationship between the first index and the j data blocks. Wherein, i is a positive integer greater than or equal to 1 and less than or equal to n; j is a positive integer greater than or equal to 1 and less than or equal to m. Wherein, the first index is used to identify the storage location of each data block contained in the j data blocks in the storage medium, and the attribute information contained in each data block contained in the j data blocks.

In a specific implementable manner, the compression unit 1502 is configured to: sequentially allocate each data block in the m data blocks to the first set in a preset order. When the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, perform the compression operation on the j data blocks according to the set compression threshold, and obtain the ith compressed data block.

In a specific implementable manner, the updating unit 1503 is configured to: when the sum of the header data of the i-th compressed data block, the total data length of the compressed data, and the set compression threshold is less than or equal to the total data length of the j data blocks, establish the first index of each data block in the j data blocks.

In a specific implementable manner, the attribute information includes at least one of the following: the first attribute is used to indicate whether the storage location of the compressed data block where the data block is compressed is pre-allocated; the second attribute is used to indicate whether the data page of the data block is valid; the third attribute is used to indicate whether the data page of the data block is the first compressed page of the compressed data block of the data block; the fourth attribute is used to indicate whether the data page of the data block is included in the compressed data pages of two compressed blocks; the fifth attribute is used to represent whether the data page of the data block is the compressed page of the compressed data block after the data block is compressed; the sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located; the seventh attribute is used to represent the following: when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block; when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance from the data page of the data block to the first compressed page of the compressed data block.

In a specific implementable manner, the attribute information includes a third attribute, and the updating unit 1503 is further configured to: when the data page of each data block in the j data blocks is the first compressed page of the i-th compressed data block , the attribute value of the third attribute is assigned a value of 1. When the data page of each data block in the j data blocks is not the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 0.

In a specific implementable manner, the attribute information includes a seventh attribute, and the updating unit 1503 is further configured to: when the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block. When the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.

In a specific implementable manner, the attribute information includes a fourth attribute, and the updating unit 1503 is further configured to: when the data page of each data block in the j data blocks is included in the compressed data pages of two compressed blocks, The attribute value of the fourth attribute is assigned a value of 1. When the data page of each data block in the j data blocks is not included in the compressed data pages of the two compressed blocks, the attribute value of the fourth attribute is assigned a value of 0.

In some implementation manners, the device further includes: a second obtaining unit 1504, configured to obtain a second set of data to be overwritten, where the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1. The third obtaining unit 1505 is configured to obtain the compressed page of the first target compressed data block among the p compressed data blocks, and q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1. The first determining unit 1506 is configured to determine the position offset of the first target data block within the q data blocks. The second determining unit 1507 is configured to determine that the data page of the first target data block is the data page to be overwritten with data.

In some practicable manners, the device further includes: a first reading unit, configured to read the first index of the first data block, and obtain the index address of the first compressed data block corresponding to the first data block, wherein the first index includes attribute information of the first data block. The second reading unit is configured to read the index of the first compressed data block corresponding to the first data block. The decompression unit is configured to decompress the first compressed data block according to the index of the first compressed data block to obtain a plurality of data blocks corresponding to the first compressed data block, the plurality of data blocks including the first data block. The third determining unit is configured to determine the offset of the first data block in the decompressed multiple data blocks. The third obtaining unit is configured to obtain the data of the first data block according to the offset of the first data block in the decompressed multiple data blocks.

An embodiment of the present application also provides a device, which includes a unit for performing the steps described in any one of the foregoing methods.

An embodiment of the present application also provides a computer-readable storage medium, including instructions, which, when run on a computer, cause the computer to execute any one of the above methods.

The embodiment of the present application also provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute any one of the above methods.

The embodiment of the present application also provides a chip. The chip includes a processor and an interface circuit, the interface circuit is coupled to the processor, the processor is used to run computer programs or instructions to implement the above method, and the interface circuit is used to communicate with other modules outside the chip.

In the description of the present application, unless otherwise specified, "/" means "or", for example, A/B may mean A or B. The "and/or" in this article merely describes an association relationship between associated objects and means that three relationships are possible. For example, A and/or B can mean three situations: A exists alone, A and B exist at the same time, and B exists alone. In addition, "at least one" means one or more, and "plurality" means two or more. Words such as "first" and "second" do not limit the number or the order of execution, and do not indicate that the items are necessarily different.

In the description of this application, words such as "exemplary" or "for example" are used to mean an example, illustration, or explanation. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "for example" is intended to present related concepts in a concrete manner.

Through the description of the above embodiments, those skilled in the art can clearly understand that, for convenience and brevity of description, only the division of the above-mentioned functional modules is used as an example for illustration. In practical applications, the above-mentioned functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods may be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the modules or units is only a logical function division, and there may be other division methods in actual implementation. For example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical, or other forms.

The unit described as a separate component may or may not be physically separated, and the component displayed as a unit may be one physical unit or multiple physical units, that is, it may be located in one place, or may be distributed to multiple different places. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit. The above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.

Although the application has been described in conjunction with specific features and embodiments thereof, it will be apparent that various modifications and combinations can be made thereto without departing from the spirit and scope of the application. Accordingly, the specification and drawings are merely illustrative of the application as defined by the appended claims and are deemed to cover any and all modifications, variations, combinations or equivalents within the scope of this application. Obviously, those skilled in the art can make various changes and modifications to the application without departing from the spirit and scope of the application. In this way, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

It should be noted that the above is only a specific implementation of the application, but the protection scope of the application is not limited thereto, and any changes or replacements within the technical scope disclosed in the application shall be covered within the protection scope of the application. Therefore, the protection scope of the present application should be determined by the protection scope of the claims.

Claims

A data compression method, characterized in that the method comprises:

obtaining m data blocks in a data area of a readable and writable file system, where m is a positive integer greater than or equal to 1;

compressing the m data blocks using a preset compression algorithm to sequentially obtain n compressed data blocks, wherein the first capacity of each compressed data block is the same, the first capacity represents the number of bytes of compressed data that the compressed data block can contain, and n is a positive integer greater than or equal to 1;

establishing a first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and recording the mapping relationship between the first index and the j data blocks, wherein i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m;

wherein the first index is used to identify the storage location, in the storage medium, of each data block among the j data blocks, and the attribute information contained in each data block among the j data blocks.
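As an illustration only, the mapping between the first index and the j data blocks could be pictured as a table keyed by data block number. The following is a minimal sketch; the names `IndexEntry` and `build_index`, and the use of a dictionary, are assumptions made for this example and are not taken from the application.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    """Hypothetical per-data-block entry standing in for the 'first index'."""
    storage_location: int                            # assumed: address on the storage medium
    attributes: dict = field(default_factory=dict)   # attribute information of the block


def build_index(block_numbers, storage_location):
    """Create one entry for each of the j data blocks behind one compressed block."""
    return {number: IndexEntry(storage_location) for number in block_numbers}


# Three data blocks (j = 3) that were compressed into the same compressed block.
index = build_index([4, 5, 6], storage_location=0x2000)
```

A lookup such as `index[5]` then yields the location to read without scanning the other entries, which is the property the random-read scenario relies on.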
The method according to claim 1, wherein compressing the m data blocks using a preset compression algorithm to sequentially obtain n compressed data blocks comprises:

sequentially assigning each data block in the m data blocks to the first set in a preset order;

when the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, performing a compression operation on the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
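The grouping step above can be sketched as follows. This is a simplified illustration: `zlib` stands in for the preset compression algorithm, the 4096-byte block size and four-block rated capacity are assumed values, and the fixed first capacity of the compressed data blocks is not modeled.

```python
import zlib

BLOCK_SIZE = 4096                 # assumed size of one data block
RATED_CAPACITY = 4 * BLOCK_SIZE   # assumed rated capacity of the first set


def compress_in_sets(blocks, rated_capacity=RATED_CAPACITY):
    """Yield (block_numbers, compressed_bytes) each time the first set fills up.

    Blocks left over when the input ends before the set is full are not
    handled in this sketch.
    """
    current, size = [], 0
    for number, data in enumerate(blocks):
        current.append((number, data))
        size += len(data)
        if size == rated_capacity:
            payload = b"".join(d for _, d in current)
            yield [n for n, _ in current], zlib.compress(payload)
            current, size = [], 0   # start filling the next set


blocks = [bytes([i]) * BLOCK_SIZE for i in range(8)]
compressed_sets = list(compress_in_sets(blocks))
```

With eight input blocks and a set of four, the generator produces two compressed blocks, each remembering which data block numbers it was built from.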
The method according to claim 2, wherein establishing the first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises:

when the sum of the total data length of the header data and the compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establishing the first index of each data block in the j data blocks.
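The condition in this claim is a single inequality, shown here as a small check. The function name `should_build_index` and the byte counts in the example are hypothetical.

```python
def should_build_index(header_len, compressed_len, threshold, raw_total_len):
    """Return True when compression saved at least `threshold` bytes.

    header_len + compressed_len is the total length of the compressed block
    (header data plus compressed data); raw_total_len is the total length of
    the j uncompressed data blocks.
    """
    return header_len + compressed_len + threshold <= raw_total_len


# Four 4096-byte blocks compressed to 3000 bytes with a 16-byte header.
good = should_build_index(16, 3000, 1024, 4 * 4096)
# The same blocks compressed to only 16000 bytes: the saving is too small.
poor = should_build_index(16, 16000, 1024, 4 * 4096)
```

In the first case 16 + 3000 + 1024 = 4040 is within 16384, so the index is established; in the second case 16 + 16000 + 1024 = 17040 exceeds 16384.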
The method according to any one of claims 1-3, wherein the attribute information includes at least one of the following:

The first attribute is used to represent whether the storage location of the compressed data block into which the data block is compressed is pre-allocated;

The second attribute is used to represent whether the data page of the data block is valid;

The third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block;

The fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of two compressed blocks;

The fifth attribute is used to represent whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed;

The sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located;

The seventh attribute, wherein when the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block within the set corresponding to the compressed data block; and when the data page of the data block is not the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
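One way to picture the seven attributes is as fields packed into a single integer word. The sketch below assumes a layout of five one-bit flags, a 24-bit index address, and a 16-bit offset or distance; these widths, the field names, and the class `BlockAttributes` are illustrative assumptions, since the claims do not fix an encoding.

```python
from dataclasses import dataclass


@dataclass
class BlockAttributes:
    preallocated: bool        # first attribute
    valid: bool               # second attribute
    first_page: bool          # third attribute
    spans_two_blocks: bool    # fourth attribute
    is_compressed_page: bool  # fifth attribute
    cblock_index: int         # sixth attribute (assumed 24 bits)
    offset_or_distance: int   # seventh attribute (assumed 16 bits)

    def pack(self) -> int:
        """Pack all seven attributes into one integer (assumed layout)."""
        flags = (int(self.preallocated)
                 | int(self.valid) << 1
                 | int(self.first_page) << 2
                 | int(self.spans_two_blocks) << 3
                 | int(self.is_compressed_page) << 4)
        return (flags
                | (self.cblock_index & 0xFFFFFF) << 5
                | (self.offset_or_distance & 0xFFFF) << 29)

    @classmethod
    def unpack(cls, word: int) -> "BlockAttributes":
        """Inverse of pack()."""
        return cls(bool(word & 1), bool(word >> 1 & 1), bool(word >> 2 & 1),
                   bool(word >> 3 & 1), bool(word >> 4 & 1),
                   word >> 5 & 0xFFFFFF, word >> 29 & 0xFFFF)


attrs = BlockAttributes(False, True, True, False, True, 7, 3)
restored = BlockAttributes.unpack(attrs.pack())
```

Packing the attributes this way keeps each index entry a fixed size, so an entry can be located by multiplying the data block number by the entry size.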
The method according to claim 4, wherein the attribute information includes a third attribute, and establishing the first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises:

When the data page of each data block in the j data blocks is the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is not the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 0.
The method according to claim 4 or 5, wherein the attribute information includes a seventh attribute, and further includes:

When the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block;

When the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
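The interplay of the third and seventh attributes can be sketched as below. The sketch assumes that the pages of one compressed block are listed in ascending order, that the first listed page is the first compressed page, and that the distance is counted in pages; the function name and dictionary keys are hypothetical.

```python
def assign_third_and_seventh(page_numbers, offset_in_set):
    """Assign the third and seventh attributes for the pages of one compressed block.

    page_numbers:  data page numbers of the j data blocks, ascending (assumed).
    offset_in_set: offset within the set corresponding to the compressed block.
    """
    first = page_numbers[0]
    result = {}
    for page in page_numbers:
        if page == first:
            # First compressed page: the seventh attribute holds the offset.
            result[page] = {"third": 1, "seventh": offset_in_set}
        else:
            # Other pages: the seventh attribute holds the distance back
            # to the first compressed page.
            result[page] = {"third": 0, "seventh": page - first}
    return result


page_attrs = assign_third_and_seventh([8, 9, 10, 11], offset_in_set=512)
```

A reader that lands on page 11 can subtract its distance of 3 to reach page 8, whose entry carries the offset, without searching neighbouring entries.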
The method according to any one of claims 4-6, wherein the attribute information includes a fourth attribute, and establishing the first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises:

When the data page of each data block in the j data blocks is included in the compressed data pages of two compressed blocks, the attribute value of the fourth attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is not included in the compressed data pages of two compressed blocks, the attribute value of the fourth attribute is assigned a value of 0.
The method according to any one of claims 4-7, wherein the attribute information includes a second attribute, and establishing the first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises:

When the data page of each data block in the j data blocks is valid, the attribute value of the second attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is invalid, the attribute value of the second attribute is assigned a value of 0.
The method according to any one of claims 1-8, wherein, before compressing the m data blocks using the preset compression algorithm to sequentially obtain the n compressed data blocks, the method further comprises:

obtaining a second set of data to be overwritten, wherein the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1;

obtaining the compressed page of a first target compressed data block among the p compressed data blocks, and the q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1;

determining the position offset, within the q data blocks, of a first target data block among the q data blocks;

determining that the data page of the first target data block is the data page to be overwritten with data.
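The overwrite lookup above can be sketched in a few lines. Here the second set is modeled as a dictionary from a compressed block identifier to the list of its q data block numbers; this representation and the name `locate_overwrite_page` are assumptions for illustration.

```python
def locate_overwrite_page(second_set, target_cblock, target_block):
    """Return (position offset, data block number) of the page to overwrite.

    second_set:    maps each of the p compressed blocks to its q data blocks.
    target_cblock: identifier of the first target compressed data block.
    target_block:  number of the first target data block.
    """
    q_blocks = second_set[target_cblock]     # q data blocks behind the target
    offset = q_blocks.index(target_block)    # position offset within the q blocks
    return offset, q_blocks[offset]


second_set = {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}   # p = 2, q = 4
located = locate_overwrite_page(second_set, target_cblock=1, target_block=6)
```

Only the compressed block that holds the target page needs to be decompressed and rewritten; the other compressed blocks in the second set are left untouched.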
The method according to any one of claims 1-9, wherein the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each data block in the j data blocks.
A data compression device, characterized in that the device comprises:

The first acquisition unit is used to acquire m data blocks in the data area of the readable and writable file system, where m is a positive integer greater than or equal to 1;

A compression unit, configured to compress the m data blocks using a preset compression algorithm to sequentially obtain n compressed data blocks, wherein the first capacity of each compressed data block is the same, the first capacity represents the number of bytes of compressed data that the compressed data block can contain, and n is a positive integer greater than or equal to 1;

An update unit, configured to establish the first index of each data block in the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and record the mapping relationship between the first index and the j data blocks; wherein i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m;

Wherein, the first index is used to identify the storage location, in the storage medium, of each data block among the j data blocks, and the attribute information contained in each data block among the j data blocks.
The device according to claim 11, wherein the compression unit is used for:

sequentially assigning each data block in the m data blocks to the first set in a preset order;

When the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, perform a compression operation on the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
The device according to claim 12, wherein the updating unit is used for:

When the sum of the total data length of the header data and the compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establish the first index of each data block in the j data blocks.
The device according to any one of claims 11-13, wherein the attribute information includes at least one of the following:

The first attribute is used to represent whether the storage location of the compressed data block into which the data block is compressed is pre-allocated;

The second attribute is used to represent whether the data page of the data block is valid;

The third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block;

The fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of two compressed blocks;

The fifth attribute is used to represent whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed;

The sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located;

The seventh attribute, wherein when the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block within the set corresponding to the compressed data block; and when the data page of the data block is not the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
The device according to claim 14, wherein the attribute information includes a third attribute, and the updating unit is further configured to:

When the data page of each data block in the j data blocks is the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is not the first compressed page of the ith compressed data block, the attribute value of the third attribute is assigned a value of 0.
The device according to claim 14 or 15, wherein the attribute information includes a seventh attribute, and the updating unit is further configured to:

When the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block;

When the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
The device according to any one of claims 14-16, wherein the attribute information includes a fourth attribute, and the updating unit is further configured to:

When the data page of each data block in the j data blocks is included in the compressed data pages of two compressed blocks, the attribute value of the fourth attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is not included in the compressed data pages of two compressed blocks, the attribute value of the fourth attribute is assigned a value of 0.
The device according to any one of claims 14-17, wherein the attribute information includes a second attribute, and the updating unit is further configured to:

When the data page of each data block in the j data blocks is valid, the attribute value of the second attribute is assigned a value of 1;

When the data page of each data block in the j data blocks is invalid, the attribute value of the second attribute is assigned a value of 0.
The device according to any one of claims 11-18, further comprising:

The second acquisition unit is configured to acquire a second set of data to be overwritten, wherein the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1;

The third acquisition unit is configured to acquire the compressed page of a first target compressed data block among the p compressed data blocks, and the q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1;

A first determining unit, configured to determine the position offset, within the q data blocks, of a first target data block among the q data blocks;

The second determining unit is configured to determine that the data page of the first target data block is the data page to be overwritten with data.
The device according to any one of claims 11-19, wherein the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each data block in the j data blocks.
A device, characterized by comprising: a device for executing the data compression method according to any one of claims 1 to 10.
A computer-readable storage medium, characterized in that the computer-readable storage medium includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device executes the data compression method according to any one of claims 1 to 10.
A computer program, characterized in that, when the program is invoked by a processor, the data compression method according to any one of claims 1 to 10 is executed.
A system on a chip, characterized in that it includes one or more processors, and when the one or more processors execute instructions, the one or more processors perform the data compression method according to any one of claims 1 to 10.