WO2022262381A1 - Data compression method and apparatus - Google Patents
- Publication number
- WO2022262381A1 (application PCT/CN2022/085621)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- data
- compressed
- data block
- attribute
- page
- Prior art date
Classifications
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/60—General implementation details not specific to a particular type of compression
- H03M7/6047—Power optimization with respect to the encoder, decoder, storage or transmission
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/0608—Saving storage space on storage systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0638—Organizing or formatting or addressing of data
- G06F3/064—Management of blocks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0661—Format or protocol conversion arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0673—Single storage device
- G06F3/0679—Non-volatile semiconductor memory device, e.g. flash memory, one time programmable memory [OTP]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/0671—In-line storage system
- G06F3/0683—Plurality of storage devices
- G06F3/0688—Non-volatile semiconductor memory arrays
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
Definitions
- the present application relates to the technical field of data compression, in particular to a data compression method and device.
- Read-write file systems on Linux include, for example, F2FS, the second-generation journalling flash file system (journalling flash file system version 2, JFFS2), and the B-tree file system (BTRFS).
- Read-write file systems on Windows include, for example, NTFS. Because the metadata area accounts for only a small proportion of the entire file system, the data area occupies most of the device's storage capacity. Compressing the data in the data area can therefore reduce the size of input/output (I/O) requests and improve overall read and write performance.
- Existing data compression methods usually compress the original file data (or source data) in fixed-size minimum compressible units. The resulting compressed file data can include header data and compressed data.
- The header data represents the attribute information of the file data; the compressed data represents the content of the file data. The compressed file data is then saved to the storage medium.
- the existing compression schemes for read-write file systems have the problem of random read amplification, and the read efficiency is low.
- Embodiments of the present application provide a data compression method and device, which can solve the problem of random read amplification of a read-write file system and improve read efficiency.
- the embodiment of the present application provides a data compression method.
- The method may be executed by an electronic device, or by a component (for example, a chip, a chip system, or a processor) located in the electronic device. The following description takes an electronic device as the executing entity.
- The method includes: the electronic device acquires m data blocks in the data area of the read-write file system, where m is a positive integer greater than or equal to 1.
- The electronic device compresses the m data blocks using a preset compression algorithm and obtains n compressed data blocks in turn. Every compressed data block has the same first capacity, where the first capacity is the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1.
- The electronic device establishes a first index for each of the j data blocks corresponding to the i-th of the n compressed data blocks, and records the mapping relationship between the first index and the j data blocks. Here i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m.
- The first index identifies the storage location in the storage medium of each of the j data blocks, and the attribute information contained in each of the j data blocks.
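The steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: `zlib` stands in for the unspecified preset compression algorithm, the 4096-byte block size and capacity are assumed values, the 4-byte length header is a hypothetical layout, and the greedy "add blocks while the output still fits" loop is one simple way to obtain compressed blocks of equal capacity.

```python
import struct
import zlib

BLOCK_SIZE = 4096   # assumed size of one uncompressed data block, in bytes
CAPACITY = 4096     # assumed "first capacity" shared by every compressed block
HEADER = struct.Struct("<I")  # hypothetical header: payload length only


def compress_fixed_output(blocks, capacity=CAPACITY):
    """Pack m data blocks into n compressed blocks of identical capacity.

    Returns (compressed_blocks, first_index), where first_index[k] is
    (i, pos): data block k is stored in compressed block i, at position
    pos among the data blocks that were compressed together.
    """
    room = capacity - HEADER.size
    compressed_blocks, first_index = [], {}
    start = 0
    while start < len(blocks):
        end = start + 1
        payload = zlib.compress(blocks[start])
        if len(payload) > room:
            raise ValueError("a single block does not fit; store it raw instead")
        # Grow the set one block at a time while the output still fits.
        while end < len(blocks):
            trial = zlib.compress(b"".join(blocks[start:end + 1]))
            if len(trial) > room:
                break
            payload, end = trial, end + 1
        i = len(compressed_blocks)
        body = HEADER.pack(len(payload)) + payload
        compressed_blocks.append(body.ljust(capacity, b"\0"))  # pad to capacity
        for pos, k in enumerate(range(start, end)):
            first_index[k] = (i, pos)
        start = end
    return compressed_blocks, first_index
```

Because each data block's index entry names exactly one compressed block, a random read only has to fetch and decompress that one fixed-size block rather than a variable-length run.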
- the data compression method provided by the embodiment of the present application can effectively improve the reading efficiency when reading a data block, and can ensure that the random reading scenario completes data reading with a small read amplification factor.
- the attributes contained in the index of the data block can be modified, so that the compressed file on the storage device can be modified. It can be seen that the embodiment of the present application solves the problem of random read amplification in existing read-write file system compression schemes, and at the same time solves the problem that existing file systems with fixed output compression methods cannot support data and metadata updates.
- Compressing the m data blocks using the preset compression algorithm to obtain the n compressed data blocks in turn specifically includes: allocating each of the m data blocks to a first set in a preset order; and, when the data capacity of the j data blocks in the first set equals the rated capacity of the first set, compressing the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
- Establishing the first index of each of the j data blocks corresponding to the i-th of the n compressed data blocks specifically includes: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establishing the first index of each of the j data blocks.
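The threshold condition is a single inequality, and a minimal Python sketch makes it concrete. The function and parameter names are hypothetical, and all lengths are assumed to be in bytes; the passage does not say what happens when the condition fails, so the sketch only returns the decision.

```python
def should_index_as_compressed(header_len, compressed_len, threshold, raw_len):
    """Return True when compression saved at least `threshold` bytes.

    Implements: header_len + compressed_len + threshold <= raw_len,
    where raw_len is the total length of the j uncompressed data blocks.
    """
    return header_len + compressed_len + threshold <= raw_len
```

For example, with a 16-byte header, a 512-byte threshold, and 4096 bytes of source data, a 3000-byte compressed payload passes the check while a 3900-byte payload does not.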
- the attribute information includes at least one of the following: a first attribute, which is used to indicate whether the storage location of the compressed data block where the data block is compressed is pre-allocated.
- the second attribute is used to indicate whether the data page of the data block is valid.
- the third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block.
- the fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks.
- the fifth attribute is used to indicate whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed.
- the sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located.
- the seventh attribute has a value that depends on the position of the data page: when the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block; when the data page of the data block is not the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
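One way to picture the seven attributes is as a packed index entry. The Python sketch below is an assumption-laden illustration: the field names, the bit widths (five 1-bit flags, a 32-bit index address, a 16-bit offset or distance), and the bit order are all hypothetical, since the text does not specify an on-disk layout.

```python
from dataclasses import dataclass


@dataclass
class BlockIndex:
    preallocated: bool       # first attribute: storage location pre-allocated?
    valid: bool              # second attribute: data page valid?
    first_page: bool         # third attribute: first compressed page?
    in_two_blocks: bool      # fourth attribute: page spans two compressed blocks?
    compressed_page: bool    # fifth attribute: page is a compressed page?
    cblock_addr: int         # sixth attribute: index address (assumed 32 bits)
    offset_or_distance: int  # seventh attribute (assumed 16 bits)

    def pack(self) -> int:
        """Pack the entry into one integer using the assumed bit layout."""
        flags = (int(self.preallocated)
                 | (int(self.valid) << 1)
                 | (int(self.first_page) << 2)
                 | (int(self.in_two_blocks) << 3)
                 | (int(self.compressed_page) << 4))
        return (flags
                | ((self.cblock_addr & 0xFFFFFFFF) << 5)
                | ((self.offset_or_distance & 0xFFFF) << 37))

    @classmethod
    def unpack(cls, word: int) -> "BlockIndex":
        """Inverse of pack()."""
        return cls(
            bool(word & 1),
            bool((word >> 1) & 1),
            bool((word >> 2) & 1),
            bool((word >> 3) & 1),
            bool((word >> 4) & 1),
            (word >> 5) & 0xFFFFFFFF,
            (word >> 37) & 0xFFFF,
        )
```

Keeping every attribute in the per-block index, rather than inside the compressed payload, is what lets an entry be rewritten in place when a block is updated.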
- When the attribute information includes the third attribute, establishing the first index of each of the j data blocks corresponding to the i-th of the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is the first compressed page of the i-th compressed data block, assigning the attribute value of the third attribute a value of 1; when the data page of the data block is not the first compressed page of the i-th compressed data block, assigning the attribute value of the third attribute a value of 0.
- When the attribute information also includes the seventh attribute, the method further includes: when the attribute value of the third attribute is 1, updating the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block; when the attribute value of the third attribute is 0, updating the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
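The interplay of the third and seventh attributes can be sketched as below. This is one possible reading, not a confirmed layout: the function name is hypothetical, the distance is assumed to be counted in pages, and the offset recorded on the first page is simply passed in by the caller.

```python
def assign_third_and_seventh(num_pages, set_offset):
    """Return one (third, seventh) pair per data page of a compressed block.

    num_pages:  number of data pages mapped to the compressed block (j)
    set_offset: offset within the set, recorded on the first page only
    """
    entries = []
    for distance in range(num_pages):
        if distance == 0:
            entries.append((1, set_offset))  # first compressed page
        else:
            entries.append((0, distance))    # pages away from the first page
    return entries
```

With three pages and an offset of 512, the sketch yields `[(1, 512), (0, 1), (0, 2)]`: any page can find the first page of its compressed block by stepping back by its recorded distance.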
- When the attribute information includes the fourth attribute, establishing the first index of each of the j data blocks corresponding to the i-th of the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is included in the compressed data pages of two compressed blocks, assigning the attribute value of the fourth attribute a value of 1; when the data page of the data block is not included in the compressed data pages of two compressed blocks, assigning the attribute value of the fourth attribute a value of 0.
- When the attribute information includes the second attribute, establishing the first index of each of the j data blocks corresponding to the i-th of the n compressed data blocks specifically includes, for each of the j data blocks: when the data page of the data block is valid, assigning the attribute value of the second attribute a value of 1; when the data page of the data block is invalid, assigning the attribute value of the second attribute a value of 0.
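- Before the m data blocks are compressed using the preset compression algorithm to obtain the n compressed data blocks in turn, the method further includes: obtaining a second set of data to be overwritten, the second set including p compressed data blocks, where p is a positive integer greater than or equal to 1; obtaining the compressed page of a first target compressed data block among the p compressed data blocks, and the q data blocks corresponding to that compressed page, where q is a positive integer greater than or equal to 1; and determining the position offset of a first target data block within the q data blocks.
- The data page of the first target data block is determined as the data page to be overwritten with data.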
- the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- The method further includes: reading the first index of a first data block to obtain the index address of the first compressed data block corresponding to the first data block, where the first index includes attribute information of the first data block.
- The index of the first compressed data block corresponding to the first data block is read.
- The first compressed data block is decompressed according to its index to obtain the multiple data blocks corresponding to the first compressed data block, and the multiple data blocks include the first data block.
- The offset of the first data block within the decompressed data blocks is determined, and the data of the first data block is obtained according to that offset.
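The four read steps map onto a short Python sketch. The data structures are assumptions made for illustration: `first_index` and `cblock_index` are plain dictionaries, the storage medium is a `bytes` object, `zlib` stands in for the unspecified compression algorithm, and the block size is an assumed 4096 bytes.

```python
import zlib

BLOCK_SIZE = 4096  # assumed size of one uncompressed data block, in bytes


def read_block(block_no, first_index, cblock_index, medium):
    """Read one data block through its index, as in the four steps above."""
    # Step 1: the block's first index gives the compressed block's index address.
    entry = first_index[block_no]
    # Step 2: the compressed block's own index says where its bytes are stored.
    start, length = cblock_index[entry["cblock_addr"]]
    # Step 3: decompress that one compressed block.
    raw = zlib.decompress(medium[start:start + length])
    # Step 4: slice out the requested block by its offset.
    offset = entry["pos"] * BLOCK_SIZE
    return raw[offset:offset + BLOCK_SIZE]


# Tiny hypothetical layout: three data blocks in one compressed block.
blocks = [bytes([k]) * BLOCK_SIZE for k in range(3)]
payload = zlib.compress(b"".join(blocks))
medium = b"\xff" * 16 + payload
cblock_index = {7: (16, len(payload))}
first_index = {k: {"cblock_addr": 7, "pos": k} for k in range(3)}
data = read_block(1, first_index, cblock_index, medium)
```

Only the compressed block named by the index is read and decompressed, which is where the reduction in random read amplification comes from.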
- The embodiment of the present application provides a data compression device, which includes: a first acquisition unit configured to acquire m data blocks in the data area of a read-write file system, where m is a positive integer greater than or equal to 1.
- The compression unit is used to compress the m data blocks using a preset compression algorithm to obtain n compressed data blocks in turn. Every compressed data block has the same first capacity, where the first capacity is the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1.
- The update unit is configured to establish the first index of each of the j data blocks corresponding to the i-th of the n compressed data blocks, and to record the mapping relationship between the first index and the j data blocks. Here i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m.
- the first index is used to identify the storage location of each data block contained in the j data blocks in the storage medium, and the attribute information contained in each data block contained in the j data blocks.
- the data compression method provided by the embodiment of the present application can effectively improve the reading efficiency when reading a data block, and can ensure that the random reading scenario completes data reading with a small read amplification factor.
- the attributes contained in the index of the data block can be modified, so that the compressed file on the storage device can be modified. It can be seen that the embodiment of the present application solves the problem of random read amplification in the existing read-write file system compression scheme, and at the same time solves the problem that the existing file system with fixed output compression mode cannot support data and metadata update.
- The compression unit is configured to: allocate each of the m data blocks to the first set in a preset order.
- When the data capacity of the j data blocks in the first set equals the rated capacity of the first set, the compression unit performs a compression operation on the j data blocks according to the set compression threshold and obtains the i-th compressed data block.
- The updating unit is used for: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establishing the first index of each of the j data blocks.
- the attribute information includes at least one of the following: a first attribute, which is used to indicate whether the storage location of the compressed data block where the data block is compressed is pre-allocated.
- the second attribute is used to indicate whether the data page of the data block is valid.
- the third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block.
- the fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks.
- the fifth attribute is used to indicate whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed.
- the sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located.
- the seventh attribute has a value that depends on the position of the data page: when the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block; when the data page of the data block is not the first compressed page, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
- When the attribute information includes the third attribute, the update unit is further configured to, for each of the j data blocks: assign the attribute value of the third attribute a value of 1 when the data page of the data block is the first compressed page of the i-th compressed data block, and assign it a value of 0 when the data page of the data block is not the first compressed page of the i-th compressed data block.
- When the attribute information includes the seventh attribute, the updating unit is further configured to: when the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block in the set corresponding to the compressed data block; when the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
- When the attribute information includes the fourth attribute, the update unit is further configured to, for each of the j data blocks: assign the attribute value of the fourth attribute a value of 1 when the data page of the data block is included in the compressed data pages of two compressed blocks, and assign it a value of 0 when the data page of the data block is not included in the compressed data pages of two compressed blocks.
- When the attribute information includes the second attribute, the updating unit is further configured to, for each of the j data blocks: assign the attribute value of the second attribute a value of 1 when the data page of the data block is valid, and assign it a value of 0 when the data page of the data block is invalid.
- The device further includes: a second acquiring unit, configured to acquire a second set of data to be overwritten, where the second set includes p compressed data blocks and p is a positive integer greater than or equal to 1.
- The third obtaining unit is used to obtain the compressed page of a first target compressed data block among the p compressed data blocks, and the q data blocks corresponding to that compressed page, where q is a positive integer greater than or equal to 1.
- The first determining unit is configured to determine the position offset of a first target data block within the q data blocks.
- The second determining unit is configured to determine that the data page of the first target data block is the data page to be overwritten with data.
- the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- The device further includes: a first reading unit, configured to read the first index of a first data block and obtain the index address of the first compressed data block corresponding to the first data block, where the first index includes attribute information of the first data block.
- the second reading unit is configured to read the index of the first compressed data block corresponding to the first data block.
- the decompression unit is configured to decompress the first compressed data block according to the index of the first compressed data block to obtain multiple data blocks corresponding to the first compressed data block, the multiple data blocks including the first data block.
- the third determining unit is configured to determine the offset of the first data block in the decompressed multiple data blocks.
- the third obtaining unit is configured to obtain the data of the first data block according to the offset of the first data block in the decompressed multiple data blocks.
- An embodiment of the present application provides a device configured to execute the data compression method in the first aspect.
- An embodiment of the present application provides a computer-readable storage medium that includes computer instructions; when the computer instructions are run on an electronic device, the electronic device executes the data compression method described in the first aspect.
- An embodiment of the present application provides a computer program; when the program is called by a processor, the data compression method in the first aspect is executed.
- An embodiment of the present application provides a chip system, which includes one or more processors; when the one or more processors execute instructions, the one or more processors execute the data compression method of the first aspect.
- FIG. 1a is a block diagram of an operating system provided by an embodiment of the present application.
- FIG. 1b is a schematic structural diagram of a storage system provided by an embodiment of the present application.
- FIG. 2 is a schematic structural diagram of a solid-state hard disk of the storage system in FIG. 1b.
- FIG. 3 is a schematic structural diagram of a flash memory chip of the solid-state hard disk in FIG. 2.
- FIG. 4 is a schematic diagram of a flash memory translation layer corresponding to the flash memory chip in FIG. 3.
- FIG. 5 is a schematic diagram of a fixed input compression mode.
- FIG. 6 is a schematic diagram of a fixed output compression mode provided by an embodiment of the present application.
- FIG. 7 is a schematic diagram of a data block index provided by an embodiment of the present application.
- FIG. 8 is a schematic diagram of a data block index in an existing extendable read-only file system.
- FIG. 9 is a schematic flow chart of a data compression method provided by an embodiment of the present application.
- FIG. 10 is a schematic flow chart of updating a data block index provided by an embodiment of the present application.
- FIG. 11 is a schematic diagram of a data block index relationship during data compression provided by an embodiment of the present application.
- FIG. 12 is a schematic flow chart of another data compression method provided by an embodiment of the present application.
- FIG. 13 is a schematic diagram of a data block index relationship during an overwrite or read process provided by an embodiment of the present application.
- FIG. 14 is a schematic flow chart of data reading provided by an embodiment of the present application.
- FIG. 15 is a schematic structural diagram of a data compression device provided by an embodiment of the present application.
- FIG. 1a shows a block diagram of an operating system.
- An operating system is a computer program that manages computer hardware and software resources. Examples include the Unix, Windows, and Linux operating systems.
- the operating system needs to handle basic tasks such as managing and configuring memory, prioritizing the supply and demand of system resources, controlling input and output devices, operating the network, and managing the file system.
- the operating system also provides an interface for the user to interact with the system.
- the operating system kernel refers to the core part of most operating systems. It consists of those parts of the operating system used to manage memory, files, peripherals, and system resources. It is responsible for managing the system's processes, memory, device drivers, files, and network systems, and determines the performance and stability of the operating system.
- the operating system kernel is a system software that provides functions such as hardware abstraction layer, disk and file system control, and multitasking. It provides secure access to computer hardware for many applications, and it can determine when and how long an application will operate on a certain part of the computer's hardware. Since it is very complicated to directly operate on computer hardware, the operating system kernel can provide a set of hardware abstraction methods to complete these operations.
- The file system is a core module, that is, a main component, of the operating system kernel.
- The file system is a method of organizing files on the storage device. It is responsible for managing and storing file information: it lets users create, store, read, modify, and dump files, controls file access, and revokes files when users no longer use them.
- The file system provides an abstract representation of files in the kernel, completes the mapping of files to physical storage devices (such as disks and hard disks), and maps the physical addresses of files on storage devices to user-visible path names and file names, which facilitates fast reading, modification, and persistence of file data.
- File systems include read-write file systems and read-only file systems.
- A read-write file system is a file system that can write files to storage devices and read files from storage devices, for example: file allocation table (FAT), high performance file system (HPFS), new technology file system (NTFS), fourth extended file system (EXT4), and flash friendly file system (F2FS).
- a read-only file system is a file system that can only read files from a storage device, but cannot write files to a storage device, such as an extendable read-only file system (EROFS).
- FIG. 1b is a schematic structural diagram of a storage system.
- the application server 100 may be a physical machine or a virtual machine. Physical application servers include, but are not limited to, desktops, servers, laptops, and mobile devices.
- the application server accesses the storage system through the optical fiber switch 110 to access data.
- the switch 110 is only an optional device, and the application server 100 can also directly communicate with the storage system 120 through the network.
- the optical fiber switch 110 can also be replaced with an Ethernet switch, an InfiniBand switch, a RoCE (RDMA over Converged Ethernet) switch, or the like.
- the storage system 120 shown in FIG. 1b is a centralized storage system.
- the so-called centralized storage system refers to a central node composed of one or more master devices, where data is stored centrally, and all data processing services of the entire system are centrally deployed on this central node.
- the terminal or client is only responsible for the input and output of data, while the storage and control processing of data is completely handed over to the central node.
- the characteristic of the centralized storage system is that there is a unified entrance, and all data from external devices must pass through this entrance, and this entrance is the engine 121 of the centralized storage system.
- the engine 121 is the most core component in the centralized storage system, where many advanced functions of the storage system are implemented.
- As shown in FIG. 1b, there are one or more controllers in the engine 121.
- FIG. 1b takes an engine that includes two controllers as an example.
- There is a mirror channel between controller 0 and controller 1, so when controller 0 writes a piece of data into its memory 124, it can send a copy of the data to controller 1 through the mirror channel, and controller 1 stores the copy in its own local memory 124. Controller 0 and controller 1 are therefore mutual backups.
- If controller 0 fails, controller 1 can take over the services of controller 0.
- If controller 1 fails, controller 0 can take over the services of controller 1, which prevents a hardware failure from making the entire storage system 120 unavailable.
- If four controllers are deployed in the engine 121, there is a mirror channel between any two controllers, so any two controllers are mutual backups.
- the engine 121 also includes a front-end interface 125 and a back-end interface 126 , wherein the front-end interface 125 is used to communicate with the application server 100 to provide storage services for the application server 100 .
- the back-end interface 126 is used to communicate with the hard disk 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can be connected with more hard disks 134, thereby forming a very large storage resource pool.
- the disk enclosure 130 may be a SAS disk enclosure, or an NVMe disk enclosure, an IP disk enclosure, or other types of disk enclosures.
- the SAS hard disk enclosure adopts the SAS3.0 protocol, and each enclosure supports 25 SAS hard disks.
- the engine 121 is connected to the hard disk enclosure 130 through an onboard SAS interface or a SAS interface module.
- the NVMe disk enclosure is more like a complete computer system, and the NVMe disk is inserted into the NVMe disk enclosure. The NVMe disk enclosure is then connected to the engine 121 through the RDMA port.
- the controller 0 includes at least a processor 123 and a memory 124 .
- The processor 123 is a central processing unit (CPU), used for processing data access requests from outside the storage system (from a server or another storage system) and for processing requests generated inside the storage system.
- When the processor 123 receives a write data request sent by the application server 100 through the front-end port 125, it temporarily saves the data in the write data request in the memory 124.
- the processor 123 sends the data stored in the memory 124 to the hard disk 134 for persistent storage through the back-end port.
- the memory 124 refers to an internal memory directly exchanging data with the processor. It can read and write data at any time, and the speed is very fast. It is used as a temporary data storage for an operating system or other running programs.
- the memory includes at least two kinds of memory, for example, the memory can be random access memory.
- The random access memory is, for example, dynamic random access memory (DRAM) or storage class memory (SCM).
- DRAM is a semiconductor memory, which, like most Random Access Memory (RAM), is a volatile memory device.
- SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
- Storage-class memory can provide faster read and write speeds than hard disks, but the access speed is slower than DRAM, and the cost is also cheaper than DRAM.
- the DRAM and the SCM are only exemplary illustrations in this embodiment, and the memory may also include other random access memories, such as Static Random Access Memory (Static Random Access Memory, SRAM) and the like.
- The memory 124 can also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid state disk (SSD).
- multiple memories 124 and different types of memories 124 may be configured in the controller 0 .
- The memory 124 can be configured to have a power-protection function.
- The power-protection function means that the data stored in the memory 124 is not lost when the system is powered off and then powered on again.
- Memory with a power-protection function is called non-volatile memory.
- Both the memory 124 and the hard disk 134 may be a solid-state disk (also called a solid-state drive, SSD for short), which is a storage device that mainly uses flash memory (NAND flash) as persistent storage.
- the SSD 200 includes a NAND flash memory and a main controller (referred to as main control) 201.
- NAND flash memory includes multiple flash memory chips 205 for storing data.
- the main control 201 is the brain center of the SSD, responsible for some complex tasks, such as managing data storage, maintaining SSD performance and service life, and so on.
- the main control 201 is an embedded microchip, which includes a processor 202, and its function is like a command center, sending out all operation requests of the SSD.
- the processor 202 can perform functions such as reading/writing data, garbage collection, and wear leveling through firmware in the buffer.
- the SSD master 201 also includes a host interface 204 and several channel controllers. Wherein, the host interface 204 is used for communicating with the host.
- the host here can refer to any device such as a server, a personal computer, or an array controller.
- The main control 201 can operate multiple flash memory chips 205 in parallel, thereby increasing the underlying bandwidth. For example, assuming that there are 8 channels between the main control 201 and the flash memory chips, the main control 201 reads and writes data on 8 flash memory chips 205 in parallel through these 8 channels.
- A die is a package of one or more flash memory chips.
- A die can contain multiple planes.
- Multi-plane NAND is a design that can effectively improve performance.
- For example, a die is divided into two planes, and the block numbers in the two planes are interleaved by odd and even numbers, so odd-numbered and even-numbered blocks can be operated on alternately to improve performance.
- A plane contains multiple blocks.
- For example, a whole flash memory chip is composed of two planes: one plane stores the odd-numbered blocks, the other stores the even-numbered blocks, and the two planes can operate in parallel. This is just an example; the size of the page, the capacity of the block, and the capacity of the flash memory chip may have different specifications, which are not limited in this embodiment.
- the host writes data into the block, and when a block is full, the SSD master 201 will select the next block to continue writing.
- a page is the smallest unit of data writing.
- the master control 201 writes data into the block at the page granularity.
- Block is the smallest unit of data erasure. When the master control erases data, it can only erase the entire block at a time.
- The host accesses the SSD through logical block addresses (LBA), where each LBA represents a sector (512 B, for example), whereas inside the SSD the main control accesses the flash memory in units of pages (4 KB, for example). Therefore, every time the application server writes a piece of data, the SSD main control finds a page to write the data into. The address of that page is called the physical block address (PBA). The SSD internally records a mapping from LBA to PBA. With this mapping, the next time the host needs to read the data of a certain LBA, the SSD knows where in the flash memory chips to read the data from.
- FIG. 4 is a schematic diagram of a flash translation layer (FTL); the FTL is located in the firmware of the processor 202.
- Every time the host writes new data, a new mapping relationship is generated, and this mapping relationship is added to the FTL (first write) or changed in the FTL (overwrite).
- When reading data, the SSD first searches the FTL for the PBA corresponding to the LBA of the data, and then reads the data according to that PBA.
- The flash memory chip does not support overwriting in place, which means that when the host modifies the data at a certain LBA, the data cannot be changed directly at the PBA corresponding to this LBA; it must be written to a new PBA, and a new mapping is recorded in the FTL.
- PBA E a new location
- Invalid data, also called garbage data, refers to data that is not pointed to by any mapping relationship.
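- The LBA-to-PBA mapping, the out-of-place update, and the resulting invalid data can be illustrated with a minimal sketch. The class name `ToyFTL` and its methods are hypothetical and exist only for illustration; a real FTL also handles erase blocks, garbage collection, and wear leveling, which are omitted here.

```python
class ToyFTL:
    """Toy flash translation layer: out-of-place writes plus an LBA -> PBA map."""

    def __init__(self, num_pages):
        self.flash = [None] * num_pages  # physical pages, indexed by PBA
        self.l2p = {}                    # mapping table: LBA -> PBA
        self.next_free = 0               # next never-written physical page

    def write(self, lba, data):
        # Flash cannot be overwritten in place, so every write takes a new page.
        if self.next_free >= len(self.flash):
            raise RuntimeError("no free page left; garbage collection would be needed")
        pba = self.next_free
        self.next_free += 1
        self.flash[pba] = data
        self.l2p[lba] = pba              # add (first write) or change (overwrite)
        return pba

    def read(self, lba):
        # Look up the PBA for the LBA, then read that physical page.
        return self.flash[self.l2p[lba]]

    def invalid_pages(self):
        # Written pages that no mapping points to any more hold garbage data.
        live = set(self.l2p.values())
        return [pba for pba in range(self.next_free) if pba not in live]
```

Writing the same LBA twice leaves the first physical page unreferenced, which is exactly the invalid data described above.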
- the memory 124 also stores a software program, and the processor 123 runs the software program in the memory 124 to manage the hard disk.
- hard disks are abstracted into storage resource pools, and then divided into LUNs for use by servers.
- the LUN here is actually the hard disk seen on the server.
- some centralized storage systems are also file servers themselves, which can provide shared file services for servers.
- Data stored in memory 124 may be represented by a file system.
- A file system is a structured form of storing and organizing data files. All data in a computer consists of 0s and 1s, and a raw sequence of 0s and 1s stored on a hardware medium cannot be distinguished or managed directly. Therefore, the concept of a "file" is used to organize the data: data used for the same purpose is organized into different types of files according to the structure required by different applications. Different suffixes are usually used to indicate different file types, and each file is given a name that is easy to understand and remember. When there are many files, they are grouped according to some division method, and each group of files is placed in the same directory (or folder).
- Common file systems include FAT/FAT32/NTFS on Windows and EXT2/EXT3/EXT4/XFS/BtrFS on Linux.
- A file path is, for example, "/etc/systemd/system.conf" in Linux or "C:\Windows\System32\taskmgr.exe" in Windows.
- A path is a unique identifier for accessing a specific file. For example, D:\data\file.exe under Windows is the path of a file; it denotes the file file.exe in the data directory on the D partition.
- the file system is built on the block device.
- the file system not only records the file path, but also records which blocks form a file, and which blocks record directory/subdirectory information.
- Different file systems have different organizational structures.
- a block device such as a hard disk can usually be divided into multiple logical block devices, that is, a hard disk partition (Partition).
- Because the capacity and performance of a single medium are limited, multiple physical block devices can also be combined into one logical block device through certain technical means, such as the various RAID levels or JBOD.
- File systems can also be built on top of these logical block devices.
- An application on the application server does not need to care about the specific location on the underlying block device of the file to be accessed. It only needs to send the file name/ID of the file to the file system, and the file system looks up the path of the file according to the file name/ID.
- Common file access protocols include NFS, CIFS, and SMB, which are not limited in this embodiment.
- the file system in this application is a read-write file system.
- a read-write file system is a file system that can write files to storage devices and read files from storage devices, such as FAT, HPFS, NTFS, EXT4, F2FS, etc.
- a file system generally includes a metadata area and a data area, and the metadata area includes a super block and an index node (inode) area.
- the super block of the metadata area can include the control information of the file system, data structure, etc.
- The inode area of the metadata area can include the description information of a file, such as the file length and file type. The file type is, for example, a regular file (regular inode), a directory file (directory inode), a soft link (symbol link inode), or a special file (special inode).
- the data stored in the data area may be data obtained after file-level compression processing based on a lossless compression technology.
- the data in the data area is stored in the physical storage space of the storage medium (for example, disk, flash memory, etc.) according to a set of disk blocks.
- the data of the same file may be stored in continuous disk blocks, or may also be interleavedly stored in discontinuous disk blocks.
- the introduction of the concept of a disk block in the present application does not mean that the storage medium is only limited to a disk, and the disk block can be used to represent a small physical storage space obtained by dividing the physical storage space of the storage medium.
- the storage system in this application may also include a distributed storage system.
- the so-called distributed storage system refers to a system that stores data dispersedly on multiple independent storage nodes.
- Traditional network storage systems use centralized storage arrays to store all data.
- the performance of storage arrays is not only the bottleneck of system performance, but also the focus of reliability and security, which cannot meet the needs of large-scale storage applications.
- Sorted by data read and write capability, from strong to weak, the devices are: central processing unit (CPU) >> double data rate synchronous dynamic random access memory (DDR SDRAM) > flash memory chip (flash). It can be seen that the bottleneck of data access in the storage system is the input/output (IO) time overhead of data between memory and flash.
- Therefore, the files in the memory need to be compressed. The metadata area accounts for a small proportion of the entire file system, while the data area often occupies a relatively large share of the device storage capacity. Therefore, when writing data to the flash, compressing the data and writing the compressed data into the flash can reduce the flash storage capacity consumed and prolong the service life of the flash.
- Read-write file systems on Linux include, for example, F2FS, the second-generation journalling flash file system (JFFS2), and the B-tree file system (BTRFS).
- Read-write file systems on Windows include, for example, NTFS.
- The original file data (or source data) that needs to be compressed is divided into smallest compressible units of fixed size.
- The compressed file data includes header data and compressed data.
- The header data represents the attribute information of the file data.
- The compressed data represents the content of the file data. The compressed file data is then saved to flash, aligned to a size of 4 KB.
- For example, 4 data blocks (blocks) with continuous addresses are compressed as one cluster (cluster0) to obtain compressed file data composed of header data (header) + compressed data (compressed data). If the compressed file data is smaller than 4 KB, it is still stored on the flash in a 4 KB unit.
- The size of the original file data (or source data) shown in Figure 5 is 4 blocks; each block is 4 KB in size and corresponds to one logical page, and the logical pages of the original file data are numbered 0, 1, 2, 3.
- The original file data is compressed into compressed file data at a compression ratio of 75%, so the size of the compressed file data is 12 KB, that is, 3 blocks, and the compressed file data occupies 3 actual flash pages. The number of actual flash pages that must be read in order to read a single logical page is shown in Table 1.
- the compressed file data obtained through the data compression method shown in FIG. 5 has low reading efficiency in a random reading scenario.
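- The 4 KB alignment and the random-read cost of cluster compression can be sketched as follows. The sketch assumes that a cluster can only be decompressed as a whole, so reading any single logical page requires reading every flash page of the cluster; the helper names `align_up` and `cluster_read_cost` are hypothetical.

```python
PAGE_SIZE = 4 * 1024  # 4 KB flash page, as in the example above

def align_up(nbytes, unit=PAGE_SIZE):
    """Round a byte count up to the next multiple of `unit`."""
    return -(-nbytes // unit) * unit

def cluster_read_cost(compressed_bytes, logical_pages=4):
    """Flash pages that must be read to obtain each logical page of one cluster.

    Assumption: the cluster is decompressed as a whole, so every logical
    page costs a read of all flash pages occupied by the cluster.
    """
    flash_pages = align_up(compressed_bytes) // PAGE_SIZE
    return {page: flash_pages for page in range(logical_pages)}
```

Under this assumption, the 16 KB cluster compressed to 12 KB costs 3 flash page reads for each of logical pages 0 to 3, even when only one 4 KB logical page is wanted.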
- The embodiment of the present application provides a data compression method. The method obtains m data blocks in the data area of a read-write file system and compresses the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, wherein the first capacity of each compressed data block is the same, and the first capacity represents the number of bytes of compressed data that the compressed data block can contain.
- m and n are both positive integers greater than or equal to 1.
- The preset compression algorithm may be a compression algorithm that supports a fixed-output compression mode, such as the LZ4 (Lempel-Ziv 4) compression algorithm.
- the preset compression algorithm may also be other compression algorithms, which are not specifically limited in this embodiment of the present application.
- For example, the size of the source data is 16 KB. Taking each 4 KB of data as one data block and one logical page, the logical pages of the source data are numbered 0, 1, 2, 3, as shown in the first row of Table 2.
- The continuous 16 KB of source data is divided into three parts of 6 KB, 7 KB, and 5 KB.
- Each of the three parts is compressed using the preset compression algorithm (for example, LZ4) so that each piece of compressed data in the compressed data blocks is 4 KB in size.
- the data pages of the compressed data block are three pages, which are numbered respectively: compressed page 4, compressed page 5 and compressed page 6 as shown in FIG. 6 .
- the source data of logical page 0 is compressed into compressed page 4, so it is compressed into 1 page.
- a part of source data of logical page 1 is compressed in compressed page 4, and another part of source data of logical page 1 is compressed in compressed page 5, so it is compressed into 2 pages.
- a part of source data of logical page 2 is compressed in compressed page 5, and another part of source data of logical page 2 is compressed in compressed page 6, so it is compressed into 2 pages.
- the source data of logical page 3 is all compressed in compressed page 6, so it is compressed into 1 page.
- any one or more of the above logical pages may be read.
- the read efficiency can be calculated according to the following formula 2:
- the read efficiency of reading logical page 3 is the same as that of reading logical page 0.
- the read efficiency of reading logical page 2 is the same as that of reading logical page 1.
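- The page-to-page relationship above can be reproduced with a small sketch that intersects byte ranges. It assumes, following the description of Figure 6, that the three sets cover 6 KB, 5 KB, and 5 KB of actual source data (the 7 KB set being padded with blank pages); the function name `pages_to_read` is hypothetical.

```python
KB = 1024
PAGE = 4 * KB  # size of one logical page

def pages_to_read(logical_page, set_sizes, first_compressed_page):
    """Compressed pages that hold some part of the given logical page.

    set_sizes[i] is the number of source bytes packed into compressed
    page first_compressed_page + i.
    """
    lo, hi = logical_page * PAGE, (logical_page + 1) * PAGE
    needed, start = [], 0
    for i, size in enumerate(set_sizes):
        end = start + size
        if start < hi and lo < end:  # byte ranges overlap
            needed.append(first_compressed_page + i)
        start = end
    return needed
```

With `set_sizes = [6 * KB, 5 * KB, 5 * KB]` and compressed pages numbered from 4, logical pages 0 and 3 each need one compressed page, while logical pages 1 and 2 each need two.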
- In this way, the m data blocks in the data area of the read-write file system are compressed using a compression algorithm with a fixed-output compression mode, and n compressed data blocks containing the same number of bytes are obtained in sequence, so that every output compressed data block has a fixed size.
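- The idea of fixed-size output can be sketched as follows. This is not the LZ4 fixed-output mode: the sketch uses zlib from the Python standard library as a stand-in, and the 4-byte length header, the binary search, and the names `pack_fixed_output` and `unpack_block` are assumptions made only for illustration. For each output block it looks for a large prefix of the remaining source data whose compressed form still fits, then pads the block to the fixed size.

```python
import struct
import zlib

PAGE = 4096                   # fixed size of every compressed data block
HEADER = struct.Struct("<I")  # hypothetical header: length of the compressed payload

def _fits(chunk, page):
    return HEADER.size + len(zlib.compress(chunk)) <= page

def pack_fixed_output(data, page=PAGE):
    """Return a list of (src_start, src_end, block) with len(block) == page."""
    blocks, pos = [], 0
    while pos < len(data):
        # Binary search for a large source prefix that still fits in one page.
        # Compressed size is only roughly monotonic, so this is a heuristic;
        # `lo` is always either 1 or a length that was tested and fits.
        lo, hi = 1, len(data) - pos
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if _fits(data[pos:pos + mid], page):
                lo = mid
            else:
                hi = mid - 1
        payload = zlib.compress(data[pos:pos + lo])
        block = HEADER.pack(len(payload)) + payload
        blocks.append((pos, pos + lo, block.ljust(page, b"\0")))  # pad to fixed size
        pos += lo
    return blocks

def unpack_block(block):
    """Recover the source bytes stored in one fixed-size block."""
    (n,) = HEADER.unpack_from(block)
    return zlib.decompress(block[HEADER.size:HEADER.size + n])
```

Each tuple records which source byte range ended up in which fixed-size block, which is the information a per-page index needs in order to read one logical page without touching unrelated blocks.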
- the above data block index can find the corresponding data block.
- the data block index is also an inode, that is, metadata.
- the inode is the area used to store metadata, that is, the area used to store file-related attribute information, such as: the creator of the file, the date of creation, the size, the location of the data block, and so on.
- each inode has a number, and the operating system uses different inode numbers to identify different files.
- the user opens the file through the file name. In fact, the system first finds the corresponding inode number according to the file name, then obtains the inode information through the inode number, and then finds the address of the data block according to the inode information, and reads the data.
- The inode records the attributes of the file and the actual storage location of the file, that is, the data block numbers (block number); a block is commonly 4 KB in size. A file can be searched for and located through its inode.
- This structure is called an inode in Linux and a vnode in Unix.
- The information contained in the inode includes at least the following: (1) the type of the file; (2) the file access permissions; (3) the owner and group of the file; (4) the size of the file; (5) the number of links, that is, the total number of file names that point to the inode; (6) the state change time (ctime), the latest access time (atime), and the latest modification time (mtime) of the file; (7) the special attributes of the file: SUID, SGID, and SBIT; (8) the actual pointers to the file content.
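- The lookup chain described above (file name, then inode number, then inode information, then data block addresses) can be sketched with plain dictionaries. The `Inode` class and the `read_file` helper are hypothetical simplifications; a real inode carries all of the fields listed above.

```python
from dataclasses import dataclass, field

@dataclass
class Inode:
    size: int                                          # file size in bytes
    block_numbers: list = field(default_factory=list)  # data block locations

def read_file(name, directory, inode_table, blocks):
    """Resolve name -> inode number -> inode -> data blocks, then read the data."""
    ino = directory[name]      # file name to inode number
    inode = inode_table[ino]   # inode number to inode information
    data = b"".join(blocks[b] for b in inode.block_numbers)
    return data[:inode.size]   # the last block may be only partly used
```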
- FIG. 8 shows an existing data block index format, which is not extensible: for example, it does not support append writing, block reservation, or truncate.
- appending refers to adding new content on the basis of the original file without deleting the content in the original file.
- Block reservation refers to the fact that the file system considers in advance where the disk blocks can be allocated if the file grows, and reserves these disk blocks.
- Truncate refers to modifying files, such as deleting, adding, and so on.
- the data block index is referred to by blk entry, which is abbreviated as blk for ease of description.
- blk1 is an index of the compressed data block 1.
- blk2 is an index of the compressed data block 2
- the address of the compressed data block 2 on the storage device is stored in blk2.
- blk3 is an index of the compressed data block 3
- the address of the compressed data block 3 on the storage device is stored in blk3.
- blk4 is an index of the compressed data block 4, and the address of the compressed data block 4 on the storage device is stored in blk4. Therefore, the location of the compressed data block on the storage device can be determined according to the address stored in the blk.
- The data compression method provided by the embodiment of the present application further includes: establishing a first index for each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and recording the mapping relationship between the first index and the j data blocks.
- i is a positive integer greater than or equal to 1 and less than or equal to n.
- j is a positive integer greater than or equal to 1 and less than or equal to m.
- the first index is used to identify the storage location of each data block contained in the j data blocks in the storage medium, and the attribute information contained in each data block contained in the j data blocks.
- the first index is also used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- The attribute information includes at least one of the following:
- the first attribute is used to indicate whether the storage location of the compressed data block where the data block is compressed is pre-allocated.
- the second attribute is used to represent whether the data page of the data block is valid; that is, whether it is a normal data page or a hole data page, where the hole data page can be understood as a blank data page.
- the third attribute is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block.
- the fourth attribute is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks.
- the fifth attribute is used to indicate whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed.
- the sixth attribute is used to represent the index address of the compressed data block where the data page of the data block is located.
- The seventh attribute has two cases. When the data page of the data block is the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block in the set corresponding to the compressed data block. When the data page of the data block is not the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
- the first index of the data block includes: blk entry, which corresponds to storing the address of the data block or compressed data block; and extent entry, which corresponds to storing extended attribute information.
- each extent entry corresponds to a blk entry one by one, and each data page has a corresponding extent entry and blk entry.
- Extent entry members are as follows:
- the members included in the data block index may be shown in set A, and it should be noted that each data page has a corresponding set A.
- first_page is the above-mentioned third attribute.
- cross_block is the fourth attribute mentioned above.
- is_compress is the fifth attribute mentioned above.
- blkidx is the sixth attribute mentioned above.
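- The per-page index described above can be pictured as one blk entry paired with one extent entry. The following sketch uses the attribute names given in this application (is_reserved, is_valid, first_page, cross_block, is_compress, blkidx, ofs); the Python types, the class names `ExtentEntry` and `FirstIndex`, and the sample values are assumptions, since the on-disk field widths are not stated here.

```python
from dataclasses import dataclass

@dataclass
class ExtentEntry:
    is_reserved: bool  # first attribute: compressed block location pre-allocated?
    is_valid: bool     # second attribute: normal data page (True) or hole page (False)
    first_page: bool   # third attribute: first compressed page of its compressed block?
    cross_block: bool  # fourth attribute: page spans two compressed blocks?
    is_compress: bool  # fifth attribute: page belongs to a compressed data block?
    blkidx: int        # sixth attribute: index address of the compressed data block
    ofs: int           # seventh attribute: offset, or distance to the first compressed page

@dataclass
class FirstIndex:
    blk: int             # blk entry: address of the data block or compressed data block
    extent: ExtentEntry  # extent entry: extended attribute information

# Hypothetical entry for a logical page whose data spans two compressed blocks.
entry = FirstIndex(
    blk=4,
    extent=ExtentEntry(is_reserved=False, is_valid=True, first_page=False,
                       cross_block=True, is_compress=True, blkidx=4, ofs=1),
)
```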
- FIG. 9 is a schematic flowchart of a data compression method provided by an embodiment of the present application. As shown in Figure 9, the method includes:
- the m data blocks can be understood as data blocks that need to be written back.
- write-back may refer to writing data into memory first for caching during a write operation, but not immediately writing data into a storage device (for example: disk).
- the data cached in the memory will be written to the storage device only under some specific conditions or operations (for example: a refresh mechanism, a synchronization (sync) operation, etc.).
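- Write-back behaviour can be sketched with a small cache in front of a dictionary that stands in for the storage device. The class name `WriteBackCache` and its methods are hypothetical; the point is only that a write stays in memory until an explicit sync.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay in memory until sync() is called."""

    def __init__(self, device):
        self.device = device  # dict standing in for the storage device
        self.cache = {}       # block number -> cached data
        self.dirty = set()    # blocks modified but not yet written back

    def write(self, block_no, data):
        self.cache[block_no] = data
        self.dirty.add(block_no)  # data is not written to the device yet

    def read(self, block_no):
        if block_no in self.cache:
            return self.cache[block_no]
        return self.device.get(block_no)

    def sync(self):
        """Write all dirty blocks back to the device; return how many were flushed."""
        for block_no in sorted(self.dirty):
            self.device[block_no] = self.cache[block_no]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed
```

The dirty blocks collected at sync time play the role of the m data blocks that need to be written back.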
- the preset compression algorithm may be the LZ4 compression algorithm, and of course, other fixed-output compression algorithms may also be used, which are not specifically limited in this embodiment of the present application.
- n can be any positive integer.
- For example, m is 4, 10, or 20.
- S902 can be specifically implemented as:
- The preset sequence can be the order of consecutive storage addresses, that is, the m data blocks form a contiguous sequence.
- This first set may be referred to as the smallest compressible unit (cluster).
- the first set is the smallest compressible set of data blocks.
- For example, a 6 KB data block set, a 7 KB data block set, and a 5 KB data block set are shown in FIG. 6.
- For example, the m data blocks are mapped to continuous addresses in the storage medium. Taking one data block as the starting point, fixed-size data sets are divided off in sequence, following the order of the addresses to which the data blocks are mapped in the storage medium. As shown in Figure 6, data block 0 and 1/2 of the data of data block 1 form a 6 KB data set; the other 1/2 of the data of data block 1, 3/4 of the data of data block 2, and blank data pages form a 7 KB data set; and the remaining 1/4 of the data of data block 2 and data block 3 form a 5 KB data set.
- S9022 Determine whether the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, where j is a positive integer greater than or equal to 1 and less than or equal to m. If the data capacity of the j data blocks is not equal to the rated capacity of the first set, execute S9021; if the data capacity of the j data blocks is equal to the rated capacity of the first set, execute S9023.
- a compression threshold is set to characterize the compression rate.
- the expression formula for setting the compression threshold may be:
- S9024 Determine whether the total data length of the j data blocks is greater than the sum of the header data of the ith compressed data block, the total data length of the compressed data, and the set compression threshold. If yes, execute S903. Otherwise, commit the source data page to flash.
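- The decision in S9024 reduces to a single comparison, sketched below. The function name `should_store_compressed` and the sample numbers are hypothetical; how the compression threshold itself is computed is not reproduced here.

```python
def should_store_compressed(source_len, header_len, compressed_len, threshold):
    """S9024: keep the compressed form only if it saves more than the threshold.

    True  -> continue with S903 (index the compressed data block).
    False -> commit the uncompressed source data pages to flash.
    """
    return source_len > header_len + compressed_len + threshold
```

For instance, 6 KB of source data that compresses to 4000 bytes with a 16-byte header and a 512-byte threshold is stored compressed, whereas 4 KB of source data with the same compressed size is not.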
- the first index is used to identify the storage location of each data block contained in the j data blocks in the storage medium, and the attribute information contained in each data block contained in the j data blocks.
- For example, the first index format of the data block on F2FS can be:
- the data structure of the first index containing attribute information may be:
- For example, the entry data structure:
- the attribute information may include at least one of the following:
- the first attribute (is_reserved) is used to represent whether the storage location of the compressed data block where the data block is compressed is pre-allocated;
- the second attribute (is_valid) is used to represent whether the data page of the data block is valid
- the third attribute (first_page) is used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block;
- cross_block is used to represent whether the data page of the data block is included in the compressed data pages of the two compressed blocks
- the fifth attribute is used to represent whether the data page of the data block is a compressed page of the compressed data block after the data block is compressed;
- the sixth attribute (blkidx) is used to represent the index address of the compressed data block where the data page of the data block is located.
- the seventh attribute (ofs) has two meanings: when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block within the set corresponding to the compressed data block; when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
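The seven attributes can be pictured as one small record per data block. The sketch below is an illustration under assumptions: the class name `IndexEntry`, the field types, and the field name `is_compressed` for the fifth attribute (which the text leaves unnamed) are hypothetical, and no on-disk bit layout is implied.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    is_reserved: bool = False    # first attribute: is the compressed block's location pre-allocated?
    is_valid: bool = True        # second attribute: is the data page valid?
    first_page: bool = False     # third attribute: is this the first compressed page?
    cross_block: bool = False    # fourth attribute: does the page span two compressed blocks?
    is_compressed: bool = False  # fifth attribute (hypothetical field name)
    blkidx: int = 0              # sixth attribute: index address of the compressed block
    ofs: int = 0                 # seventh attribute: meaning depends on first_page

    def ofs_meaning(self) -> str:
        """Say how the seventh attribute should be read for this entry."""
        if self.first_page:
            return "offset of the data block within its set"
        return "distance to the first compressed page"


entry = IndexEntry(first_page=False, is_compressed=True, blkidx=1, ofs=1)
print(entry.ofs_meaning())
```

Overloading `ofs` on `first_page` keeps the entry compact, at the cost that a reader must always consult the third attribute before interpreting the seventh.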
- the attribute information may include a third attribute (first_page) and a seventh attribute (ofs), and S903 may specifically be implemented as:
- the attribute information may also include a fourth attribute (cross_block), and S903 may be specifically implemented as:
- the attribute information may also include the second attribute (is_valid), and S903 may be specifically implemented as:
- the attribute information may also include a sixth attribute (blkidx), and S903 may be specifically implemented as:
- the compression is performed with the size of the smallest fixed compression unit (such as the first set).
- the data pages of the complete data block corresponding to the first compressed data block are all at the index positions of the first compressed data block.
- the data blocks corresponding to the first compressed data block include partial data of data block 0, data block 1, data block 2, and data block 3.
- the complete data blocks corresponding to the first compressed data block are data block 0, data block 1 and data block 2. Therefore, the data pages of data block 0, data block 1, and data block 2 are at the index position of the first compressed data block.
- the data block index update processes corresponding to other attributes are independent of each other.
- m data blocks include data block 0 (ie, block0), data block 1 (ie, block1), data block 2 (ie, block2), and data block 3 (ie, block3).
- block0, block1, block2, and block3 map a continuous address in memory.
- the compression is performed with the size of the smallest fixed compression unit (such as the first set).
- the data page of block0 falls on the first compressed page of the first compressed data block, so first_page is assigned a value of 1.
- the data page of block0 only falls on the first compressed page of the first compressed data block, so the assignment value of cross_block is 0.
- the index address of the data page of block0 in the first compressed data block is the serial number of the first compressed data block (compress blk0), so the assignment value of blkidx is 0.
- the data page of block0 falls on the first compressed page of the first compressed data block, and the offset of block0 in its corresponding first set is 0, so ofs is assigned a value of 0.
- the data page of block0 is a valid data page, so the assignment of is_valid is 1.
- the data page of block1 falls on the first compressed page of the second compressed data block, so first_page is assigned a value of 1.
- the data page of Block1 falls on the compressed pages of the first compressed data block and the second compressed data block, so cross_block is assigned a value of 1.
- the index address of the data page of Block1 in the second compressed data block is the serial number of the second compressed data block (compress blk1), so the assignment value of blkidx is 1.
- the data page of Block1 falls on the first compressed page of the second compressed data block, and the offset of block1 in the set of data blocks is Ofs1, so ofs is assigned the value Ofs1.
- the data page of Block1 is a valid data page, so the value of is_valid is 1.
- the data page of block2 does not fall on the first compressed page of the second compressed data block, so first_page is assigned a value of 0.
- the data page of Block2 only falls on the compressed page of the second compressed data block, so the assignment value of cross_block is 0.
- the index address of the data page of Block2 in the second compressed data block is the serial number of the second compressed data block (compress blk1), so the assignment value of blkidx is 1.
- the data page of block2 does not fall on the first compressed page of the second compressed data block, and the distance between the data page of block2 and the first compressed page of the second compressed data block is 1, so ofs is assigned a value of 1.
- the data page of Block2 is a valid data page, so the value of is_valid is 1.
- the third compression is performed to obtain the third compressed data block (compress blk2).
- the data block index of block3 is established, as shown in Table 5:
- the data page of block3 falls on the first compressed page of the third compressed data block, so first_page is assigned a value of 1.
- the data page of block3 falls on the compressed pages of the second compressed data block and the third compressed data block, so cross_block is assigned a value of 1.
- the index address of the data page of block3 in the third compressed data block is the serial number of the third compressed data block (compress blk2), so the assignment value of blkidx is 2.
- the data page of Block3 falls on the first compressed page of the third compressed data block, and the offset of block3 in the set of data blocks is Ofs2, so ofs is assigned a value of Ofs2.
- the data page of Block3 is a valid data page, so the value of is_valid is 1.
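The assignments for block0 to block3 can be reproduced with a short sketch. It assumes each block is described by the first and last compressed block its data page lands in, plus its page position counted from the first compressed page of the last one; `build_entry`, `layout`, and the numeric values of `OFS1` and `OFS2` are hypothetical (the text leaves Ofs1 and Ofs2 symbolic).

```python
OFS1, OFS2 = 1000, 2000  # placeholder values only; the source keeps Ofs1 and Ofs2 symbolic


def build_entry(first_cblk, last_cblk, page_pos, set_offset, valid=True):
    """Derive the index attributes of one data block from where its data page lands."""
    first_page = int(page_pos == 0)
    return {
        "first_page": first_page,
        "cross_block": int(first_cblk != last_cblk),
        "blkidx": last_cblk,
        # ofs is the offset within the set on a first page, otherwise the page distance
        "ofs": set_offset if first_page else page_pos,
        "is_valid": int(valid),
    }


# (first compressed block, last compressed block, page position, offset within the set)
layout = {
    "block0": (0, 0, 0, 0),
    "block1": (0, 1, 0, OFS1),
    "block2": (1, 1, 1, 0),
    "block3": (1, 2, 0, OFS2),
}
index = {name: build_entry(*args) for name, args in layout.items()}
for name, entry in index.items():
    print(name, entry)
```

Under these assumptions block1 and block3 get cross_block = 1 because their data spans two compressed blocks, while block2 is the only entry whose ofs is a page distance rather than an offset within the set.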
- FIG. 12 is a schematic flowchart of a data compression method provided in an embodiment of the present application. As shown in FIG. 12, before S902 is performed, the data compression method provided in the embodiment of the present application further includes:
- the second set may include p compressed data blocks, and p is a positive integer greater than or equal to 1.
- the index address of each compressed data block in the second set may be read, and the compressed data block is decompressed to obtain each data block corresponding to the compressed data block. Then determine the position offset of each data block among the q data blocks.
- FIG. 14 is a schematic flowchart of data reading provided by an embodiment of the present application. As shown in FIG. 14, the data reading process is as follows:
- the attribute information of the first data block may include at least one of the first attribute to the seventh attribute in the foregoing embodiment.
- Take as an example the case where the attribute information of the first data block includes the third attribute (first_page), the fourth attribute (cross_block), the sixth attribute (blkidx) and the seventh attribute (ofs):
- the first data block is data block 2 (block2) shown in FIG. 13 .
- Read attribute information such as ofs, cross_block, blkidx of block2, and obtain the index address of the first compressed data block corresponding to block2.
- if the value of ofs read for block2 is 1, it can be determined that the data page of block2 does not fall on the first compressed page of the second compressed data block, and that the distance between the data page of block2 and the first compressed page of that compressed data block is 1.
- the first compressed data block is found on the device according to the index of the first compressed data block. After the first compressed data block is found, it is parsed to obtain multiple data blocks. For example, as shown in FIG. 13, the first compressed data block is compress blk1, and parsing compress blk1 yields part of the data of data block 1 (block1), part of the data of data block 2 (block2), and part of the data of data block 3 (block3).
- the first data block is data block 2 (block2) shown in FIG. 13 .
- the expression of the offset (dstofs) of block2 in the multiple data blocks parsed from compress blk1 is:
- dstofs represents the offset of block2 in the multiple data blocks parsed from compress blk1; block_size represents the length of a data block; ofs1 represents the attribute value of the seventh attribute; ofs1 % block_size represents the remainder of ofs1 divided by block_size.
- the first data block is block2, as shown in FIG. 13 and Table 4, and the data of block2 can be obtained.
- the communication system described in this possible design is used to perform the functions of each device in the data compression method shown in FIG. 9 , so the same effect as the above data compression method can be achieved.
- Fig. 15 is a data compression device provided by the embodiment of the present application.
- the data compression device 1500 may include: a first acquisition unit 1501, configured to acquire m data blocks in the data area of the readable and writable file system, where m is a positive integer greater than or equal to 1.
- the compression unit 1502 is configured to compress the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, wherein each compressed data block has the same first capacity, the first capacity represents the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1.
- the updating unit 1503 is configured to establish the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and record the mapping relationship between the first index and the j data blocks.
- i is a positive integer greater than or equal to 1 and less than or equal to n;
- j is a positive integer greater than or equal to 1 and less than or equal to m.
- the first index is used to identify the storage location in the storage medium of each of the j data blocks, and the attribute information contained in each of the j data blocks.
- the compression unit 1502 is configured to: sequentially allocate each data block in the m data blocks to the first set in a preset order. When the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, perform the compression operation on the j data blocks according to the set compression threshold, and obtain the ith compressed data block.
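The set-filling control flow of the compression unit can be sketched as follows. This is an illustration under assumptions: `compress_in_sets` is a hypothetical name, and `zlib` merely stands in for the unspecified preset compression algorithm. Unlike the scheme described here, `zlib` does not produce compressed blocks of equal capacity, so only the "fill the first set to its rated capacity, then compress" logic is shown.

```python
import zlib


def compress_in_sets(blocks, set_capacity):
    """Assign blocks to a set in order; compress the set once it reaches its rated capacity."""
    compressed, current, size = [], [], 0
    for block in blocks:
        current.append(block)
        size += len(block)
        if size == set_capacity:
            # j = len(current) data blocks map to this compressed data block
            compressed.append((len(current), zlib.compress(b"".join(current))))
            current, size = [], 0
    return compressed, current  # `current` holds blocks still waiting for a full set


blocks = [bytes([i]) * 4096 for i in range(5)]
compressed, pending = compress_in_sets(blocks, set_capacity=2 * 4096)
print(len(compressed), len(pending))  # 2 1
```

Recording j alongside each compressed payload is what later allows an index entry to be created for every source block in the set.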
- the updating unit 1503 is configured to: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establish the first index of each of the j data blocks.
- the attribute information includes at least one of the following: the first attribute, used to indicate whether the storage location of the compressed data block in which the data block resides after compression is pre-allocated; the second attribute, used to indicate whether the data page of the data block is valid; the third attribute, used to indicate whether the data page of the data block is the first compressed page of the compressed data block of the data block; the fourth attribute, used to indicate whether the data page of the data block is included in the compressed data pages of two compressed blocks; the fifth attribute, used to indicate whether the data page of the data block is a compressed page of the compressed data block obtained after the data block is compressed; the sixth attribute, used to indicate the index address of the compressed data block where the data page of the data block is located; the seventh attribute, whose attribute value is the offset of the data block within the set corresponding to the compressed data block when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, and is the distance between the data page of the data block and the first compressed page of the compressed data block when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block.
- the attribute information includes a third attribute
- the updating unit 1503 is further configured to: when the data page of each data block in the j data blocks is the first compressed page of the i-th compressed data block, assign the attribute value of the third attribute a value of 1; when the data page of each data block in the j data blocks is not the first compressed page of the i-th compressed data block, assign the attribute value of the third attribute a value of 0.
- the attribute information includes a seventh attribute
- the updating unit 1503 is further configured to: when the attribute value of the third attribute is 1, update the attribute value of the seventh attribute to the offset of the data block within the set corresponding to the compressed data block.
- when the attribute value of the third attribute is 0, update the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
- the attribute information includes a fourth attribute
- the updating unit 1503 is further configured to: when the data page of each data block in the j data blocks is included in the compressed data pages of two compressed blocks, assign the attribute value of the fourth attribute a value of 1; when the data page of each data block in the j data blocks is not included in the compressed data pages of two compressed blocks, assign the attribute value of the fourth attribute a value of 0.
- the attribute information includes a second attribute
- the updating unit 1503 is further configured to: assign a value of 1 to the attribute value of the second attribute when the data page of each data block in the j data blocks is valid.
- when the data page of each data block in the j data blocks is invalid, the attribute value of the second attribute is assigned a value of 0.
- a second obtaining unit 1504 configured to obtain a second set of data to be overwritten and written, the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1.
- the third obtaining unit 1505 is configured to obtain the compressed page of the first target compressed data block among the p compressed data blocks, and q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1.
- the first determining unit 1506 is configured to determine the position offset, within the q data blocks, of the first target data block among the q data blocks.
- the second determining unit 1507 is configured to determine that the data page of the first target data block is the data page to be overwritten with data.
- it further includes: a first reading unit, configured to read the first index of the first data block and obtain the index address of the first compressed data block corresponding to the first data block, wherein the first index includes attribute information of the first data block.
- the second reading unit is configured to read the index of the first compressed data block corresponding to the first data block.
- the decompression unit is configured to decompress the first compressed data block according to the index of the first compressed data block to obtain a plurality of data blocks corresponding to the first compressed data block, the plurality of data blocks including the first data block.
- the third determining unit is configured to determine the offset of the first data block in the decompressed multiple data blocks.
- the third obtaining unit is configured to obtain the data of the first data block according to the offset of the first data block in the decompressed multiple data blocks.
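The read path formed by these units can be sketched end to end. This is an illustration under several assumptions: `read_block`, the dictionary-based index, and the in-memory `store` are hypothetical; `zlib` stands in for the preset compression algorithm; each entry carries an already-computed byte offset `dstofs`; and the requested block lies entirely inside one compressed data block (cross_block = 0).

```python
import zlib

BLOCK_SIZE = 4096


def read_block(index, store, name, block_size=BLOCK_SIZE):
    """Read one data block by decompressing only the compressed block its index points to."""
    entry = index[name]              # first reading unit: read the first index
    blob = store[entry["blkidx"]]    # second reading unit: locate the compressed data block
    data = zlib.decompress(blob)     # decompression unit
    start = entry["dstofs"]          # third determining unit: offset in the decompressed data
    return data[start:start + block_size]  # obtaining unit: extract the requested block


blocks = {f"block{i}": bytes([65 + i]) * BLOCK_SIZE for i in range(3)}
store = {0: zlib.compress(b"".join(blocks.values()))}
index = {name: {"blkidx": 0, "dstofs": i * BLOCK_SIZE} for i, name in enumerate(blocks)}
print(read_block(index, store, "block1") == blocks["block1"])  # True
```

Because the index names a single compressed block, a random read decompresses one fixed-size unit instead of the whole file, which is the source of the small read amplification factor claimed for this scheme.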
- the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- the data compression method provided by the embodiment of the present application can effectively improve the reading efficiency when reading a data block, and can ensure that the random reading scenario completes data reading with a small read amplification factor.
- the attributes contained in the index of the data block can be modified, so that the compressed file on the storage device can be modified. It can be seen that the embodiment of the present application solves the problem of random read amplification in the existing read-write file system compression scheme, and at the same time solves the problem that the existing file system with fixed output compression mode cannot support data and metadata update.
- An embodiment of the present application also provides a device, which includes a unit for performing the steps described in any one of the foregoing methods.
- An embodiment of the present application also provides a computer-readable storage medium, including instructions, which, when run on a computer, cause the computer to execute any one of the above methods.
- the embodiment of the present application also provides a computer program product containing instructions, which, when run on a computer, causes the computer to execute any one of the above methods.
- the embodiment of the present application also provides a chip. The chip includes a processor and an interface circuit, the interface circuit is coupled to the processor, the processor is used to run computer programs or instructions to implement the above method, and the interface circuit is used to communicate with other modules outside the chip.
- words such as “exemplary” or “for example” are used to mean an example, illustration or explanation. Any embodiment or design scheme described as “exemplary” or “for example” in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design schemes. Rather, the use of words such as “exemplary” or “such as” is intended to present related concepts in a concrete manner.
- the disclosed devices and methods may be implemented in other ways.
- the device embodiments described above are only illustrative.
- the division of the modules or units is only a logical function division. In actual implementation, there may be other division methods.
- multiple units or components may be combined or integrated into another device, or some features may be omitted or not implemented.
- the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
- the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may be one physical unit or multiple physical units; that is, it may be located in one place or distributed across multiple different places. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
- the above-mentioned integrated units can be implemented in the form of hardware or in the form of software functional units.
Claims (24)
- A data compression method, characterized in that the method comprises: acquiring m data blocks in a data area of a readable and writable file system, where m is a positive integer greater than or equal to 1; compressing the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, wherein each compressed data block has the same first capacity, the first capacity represents the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1; establishing a first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and recording a mapping relationship between the first index and the j data blocks, where i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m; wherein the first index is used to identify the storage location in a storage medium of each of the j data blocks, and the attribute information contained in each of the j data blocks.
- The method according to claim 1, characterized in that compressing the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence comprises: allocating the data blocks of the m data blocks to a first set one by one in a preset order; and when the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, performing a compression operation on the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
- The method according to claim 2, characterized in that establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establishing the first index of each of the j data blocks.
- The method according to any one of claims 1-3, characterized in that the attribute information includes at least one of the following: a first attribute, used to represent whether the storage location of the compressed data block in which the data block resides after compression is pre-allocated; a second attribute, used to represent whether the data page of the data block is valid; a third attribute, used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block; a fourth attribute, used to represent whether the data page of the data block is included in the compressed data pages of two compressed blocks; a fifth attribute, used to represent whether the data page of the data block is a compressed page of the compressed data block obtained after the data block is compressed; a sixth attribute, used to represent the index address of the compressed data block where the data page of the data block is located; a seventh attribute, wherein when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block within the set corresponding to the compressed data block, and when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
- The method according to claim 4, characterized in that the attribute information includes the third attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises: when the data page of each data block in the j data blocks is the first compressed page of the i-th compressed data block, assigning the attribute value of the third attribute a value of 1; when the data page of each data block in the j data blocks is not the first compressed page of the i-th compressed data block, assigning the attribute value of the third attribute a value of 0.
- The method according to claim 4 or 5, characterized in that the attribute information includes the seventh attribute, and the method further comprises: when the attribute value of the third attribute is 1, updating the attribute value of the seventh attribute to the offset of the data block within the set corresponding to the compressed data block; when the attribute value of the third attribute is 0, updating the attribute value of the seventh attribute to the distance between the data page of the data block and the first compressed page of the compressed data block.
- The method according to any one of claims 4-6, characterized in that the attribute information includes the fourth attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises: when the data page of each data block in the j data blocks is included in the compressed data pages of two compressed blocks, assigning the attribute value of the fourth attribute a value of 1; when the data page of each data block in the j data blocks is not included in the compressed data pages of two compressed blocks, assigning the attribute value of the fourth attribute a value of 0.
- The method according to any one of claims 4-7, characterized in that the attribute information includes the second attribute, and establishing the first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks comprises: when the data page of each data block in the j data blocks is valid, assigning the attribute value of the second attribute a value of 1; when the data page of each data block in the j data blocks is invalid, assigning the attribute value of the second attribute a value of 0.
- The method according to any one of claims 1-8, characterized in that, before compressing the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, the method further comprises: acquiring a second set into which data is to be overwritten, the second set including p compressed data blocks, where p is a positive integer greater than or equal to 1; acquiring the compressed page of a first target compressed data block among the p compressed data blocks, and q data blocks corresponding to the compressed page of the first target compressed data block, where q is a positive integer greater than or equal to 1; determining the position offset, within the q data blocks, of a first target data block among the q data blocks; determining that the data page of the first target data block is the data page into which data is to be overwritten.
- The method according to any one of claims 1-9, characterized in that the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- A data compression device, characterized in that the device comprises: a first acquisition unit, configured to acquire m data blocks in a data area of a readable and writable file system, where m is a positive integer greater than or equal to 1; a compression unit, configured to compress the m data blocks using a preset compression algorithm to obtain n compressed data blocks in sequence, wherein each compressed data block has the same first capacity, the first capacity represents the number of bytes of compressed data that a compressed data block can contain, and n is a positive integer greater than or equal to 1; an updating unit, configured to establish a first index of each of the j data blocks corresponding to the i-th compressed data block among the n compressed data blocks, and record a mapping relationship between the first index and the j data blocks, where i is a positive integer greater than or equal to 1 and less than or equal to n, and j is a positive integer greater than or equal to 1 and less than or equal to m; wherein the first index is used to identify the storage location in a storage medium of each of the j data blocks, and the attribute information contained in each of the j data blocks.
- The device according to claim 11, characterized in that the compression unit is configured to: allocate the data blocks of the m data blocks to a first set one by one in a preset order; and when the data capacity of the j data blocks in the first set is equal to the rated capacity of the first set, perform a compression operation on the j data blocks according to a set compression threshold to obtain the i-th compressed data block.
- The device according to claim 12, characterized in that the updating unit is configured to: when the sum of the total data length of the header data and compressed data of the i-th compressed data block and the set compression threshold is less than or equal to the total data length of the j data blocks, establish the first index of each of the j data blocks.
- The device according to any one of claims 11-13, characterized in that the attribute information includes at least one of the following: a first attribute, used to represent whether the storage location of the compressed data block in which the data block resides after compression is pre-allocated; a second attribute, used to represent whether the data page of the data block is valid; a third attribute, used to represent whether the data page of the data block is the first compressed page of the compressed data block of the data block; a fourth attribute, used to represent whether the data page of the data block is included in the compressed data pages of two compressed blocks; a fifth attribute, used to represent whether the data page of the data block is a compressed page of the compressed data block obtained after the data block is compressed; a sixth attribute, used to represent the index address of the compressed data block where the data page of the data block is located; a seventh attribute, wherein when the data page of the data block belongs to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the offset of the data block within the set corresponding to the compressed data block, and when the data page of the data block does not belong to the first compressed page of the compressed data block of the data block, the attribute value of the seventh attribute is the distance between the data page of the data block and the first compressed page of the compressed data block.
- The apparatus according to claim 14, wherein the attribute information includes the third attribute, and the updating unit is further configured to: when the data page of each of the j data blocks is the first compressed page of the i-th compressed data block, set the value of the third attribute to 1; and when the data page of each of the j data blocks is not the first compressed page of the i-th compressed data block, set the value of the third attribute to 0.
- The apparatus according to claim 14 or 15, wherein the attribute information includes the seventh attribute, and the updating unit is further configured to: when the value of the third attribute is 1, update the value of the seventh attribute to the offset of the data block within the set corresponding to the compressed data block; and when the value of the third attribute is 0, update the value of the seventh attribute to the distance from the data page of the data block to the first compressed page of the compressed data block.
- The apparatus according to any one of claims 14 to 16, wherein the attribute information includes the fourth attribute, and the updating unit is further configured to: when the data page of each of the j data blocks is contained in the compressed data pages of two compressed blocks, set the value of the fourth attribute to 1; and when the data page of each of the j data blocks is not contained in the compressed data pages of two compressed blocks, set the value of the fourth attribute to 0.
- The apparatus according to any one of claims 14 to 17, wherein the attribute information includes the second attribute, and the updating unit is further configured to: when the data page of each of the j data blocks is valid, set the value of the second attribute to 1; and when the data page of each of the j data blocks is invalid, set the value of the second attribute to 0.
- The apparatus according to any one of claims 11 to 18, further comprising:
  - a second acquisition unit, configured to acquire a second set of data to be overwritten, wherein the second set includes p compressed data blocks, and p is a positive integer greater than or equal to 1;
  - a third acquisition unit, configured to acquire a compressed page of a first target compressed data block among the p compressed data blocks, and q data blocks corresponding to the compressed page of the first target compressed data block, wherein q is a positive integer greater than or equal to 1;
  - a first determining unit, configured to determine the position offset, within the q data blocks, of a first target data block among the q data blocks;
  - a second determining unit, configured to determine that the data page of the first target data block is the data page to be overwritten.
- The apparatus according to any one of claims 11 to 19, wherein the first index is used to identify the storage location of the i-th compressed data block in the storage medium, and the attribute information contained in each of the j data blocks.
- A device, configured to execute the data compression method according to any one of claims 1 to 10.
- A computer-readable storage medium, wherein the computer-readable storage medium includes computer instructions, and when the computer instructions are run on an electronic device, the electronic device is caused to execute the data compression method according to any one of claims 1 to 10.
- A computer program, wherein, when the program is invoked by a processor, the data compression method according to any one of claims 1 to 10 is executed.
- A chip system, comprising one or more processors, wherein, when the one or more processors execute instructions, the one or more processors perform the data compression method according to any one of claims 1 to 10.
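
The index-establishment condition and the third and seventh attribute updates recited in the claims above can be illustrated with a minimal Python sketch. This is not the applicant's implementation. The names `PageAttributes`, `should_build_index`, and `update_position_attributes` are hypothetical, lengths are assumed to be in bytes, and the attributes are assumed to be held as plain integer fields.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class PageAttributes:
    """Hypothetical per-data-block record holding the seven claimed attributes."""
    preallocated: int = 0              # first attribute: storage location pre-allocated?
    valid: int = 0                     # second attribute: data page valid?
    is_first_compressed_page: int = 0  # third attribute
    spans_two_blocks: int = 0          # fourth attribute
    is_compressed_page: int = 0        # fifth attribute
    compressed_block_index: int = 0    # sixth attribute: index address of the compressed block
    offset_or_distance: int = 0        # seventh attribute


def should_build_index(header_len: int, compressed_len: int,
                       threshold: int, block_lens: Sequence[int]) -> bool:
    """Return True when header + compressed data + threshold <= total length of the j blocks."""
    return header_len + compressed_len + threshold <= sum(block_lens)


def update_position_attributes(attrs: PageAttributes, is_first_page: bool,
                               offset_in_set: int,
                               distance_to_first_page: int) -> PageAttributes:
    """Set the third attribute, then derive the seventh attribute from it."""
    attrs.is_first_compressed_page = 1 if is_first_page else 0
    if attrs.is_first_compressed_page == 1:
        # First compressed page: store the block's offset within its set.
        attrs.offset_or_distance = offset_in_set
    else:
        # Otherwise: store the distance back to the first compressed page.
        attrs.offset_or_distance = distance_to_first_page
    return attrs
```

Under these assumptions, two 4096-byte data blocks compressed to 3000 bytes with a 16-byte header and a 512-byte threshold satisfy the condition (3528 ≤ 8192), so a first index would be established for each block. A single 4096-byte block compressed to 3900 bytes does not (4428 > 4096).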
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/567,644 US20240283463A1 (en) | 2021-06-16 | 2022-04-07 | Data compression method and apparatus |
JP2023577669A JP2024525170A (en) | 2021-06-16 | 2022-04-07 | Data compression method and device |
EP22823871.3A EP4336336A1 (en) | 2021-06-16 | 2022-04-07 | Data compression method and apparatus |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110667882.7 | 2021-06-16 | ||
CN202110667882.7A CN115480692A (en) | 2021-06-16 | 2021-06-16 | Data compression method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022262381A1 (en) | 2022-12-22 |
Family
ID=84419764
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/085621 WO2022262381A1 (en) | 2021-06-16 | 2022-04-07 | Data compression method and apparatus |
Country Status (5)
Country | Link |
---|---|
US (1) | US20240283463A1 (en) |
EP (1) | EP4336336A1 (en) |
JP (1) | JP2024525170A (en) |
CN (1) | CN115480692A (en) |
WO (1) | WO2022262381A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117389484A (en) * | 2023-12-12 | 2024-01-12 | 深圳大普微电子股份有限公司 | Data storage processing method, device, equipment and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117708070B (en) * | 2023-07-27 | 2024-08-02 | 荣耀终端有限公司 | File compression method and electronic equipment |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101727298A (en) * | 2009-11-04 | 2010-06-09 | 北京东方广视科技股份有限公司 | Method and device for realizing independent redundant arrays of inexpensive disks |
CN103020205A (en) * | 2012-12-05 | 2013-04-03 | 北京普泽天玑数据技术有限公司 | Compression and decompression method based on hardware accelerator card on distributive-type file system |
CN103516369A (en) * | 2013-06-20 | 2014-01-15 | 易乐天 | Method and system for self-adaptation data compression and decompression and storage device |
CN107947799A (en) * | 2017-11-28 | 2018-04-20 | 郑州云海信息技术有限公司 | A kind of data compression method and apparatus |
US20190339911A1 (en) * | 2018-05-04 | 2019-11-07 | EMC IP Holding Company LLC | Reporting of space savings due to compression in storage systems |
CN110557124A (en) * | 2018-05-30 | 2019-12-10 | 华为技术有限公司 | Data compression method and device |
- 2021-06-16: CN, CN202110667882.7A (CN115480692A), active, Pending
- 2022-04-07: WO, PCT/CN2022/085621 (WO2022262381A1), active, Application Filing
- 2022-04-07: JP, JP2023577669A (JP2024525170A), active, Pending
- 2022-04-07: US, US18/567,644 (US20240283463A1), active, Pending
- 2022-04-07: EP, EP22823871.3A (EP4336336A1), active, Pending
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117389484A (en) * | 2023-12-12 | 2024-01-12 | 深圳大普微电子股份有限公司 | Data storage processing method, device, equipment and storage medium |
CN117389484B (en) * | 2023-12-12 | 2024-04-26 | 深圳大普微电子股份有限公司 | Data storage processing method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
JP2024525170A (en) | 2024-07-10 |
US20240283463A1 (en) | 2024-08-22 |
EP4336336A1 (en) | 2024-03-13 |
CN115480692A (en) | 2022-12-16 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
Kwon et al. | Strata: A cross media file system | |
JP6664218B2 (en) | Efficient data object storage and retrieval | |
US8285967B1 (en) | Method for on-demand block map generation for direct mapped LUN | |
US9342256B2 (en) | Epoch based storage management for a storage device | |
US8578128B1 (en) | Virtual block mapping for relocating compressed and/or encrypted file data block blocks | |
US11061770B1 (en) | Reconstruction of logical pages in a storage system | |
US9075754B1 (en) | Managing cache backup and restore | |
TW201935243A (en) | SSD, distributed data storage system and method for leveraging key-value storage | |
US11256678B2 (en) | Reconstruction of links between logical pages in a storage system | |
US9021222B1 (en) | Managing incremental cache backup and restore | |
WO2022262381A1 (en) | Data compression method and apparatus | |
WO2022095346A1 (en) | Blockchain data storage method, system, device, and readable storage medium | |
US11099940B1 (en) | Reconstruction of links to orphaned logical pages in a storage system | |
US11625169B2 (en) | Efficient token management in a storage system | |
CN114860163A (en) | Storage system, memory management method and management node | |
US11334523B2 (en) | Finding storage objects of a snapshot group pointing to a logical page in a logical address space of a storage system | |
CN115427941A (en) | Data management system and control method | |
US11269547B2 (en) | Reusing overwritten portion of write buffer of a storage system | |
WO2021208239A1 (en) | Low-latency file system address space management method and system, and medium | |
US11232043B2 (en) | Mapping virtual block addresses to portions of a logical address space that point to the virtual block addresses | |
US11210230B2 (en) | Cache retention for inline deduplication based on number of physical blocks with common fingerprints among multiple cache entries | |
CN115904255A (en) | Data request method, device, equipment and storage medium | |
US11868256B2 (en) | Techniques for metadata updating and retrieval | |
US11360691B2 (en) | Garbage collection in a storage system at sub-virtual block granularity level | |
Zhou et al. | A file system bypassing volatile main memory: Towards a single-level persistent store |
Legal Events
Code | Title | Description |
---|---|---|
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22823871; Country of ref document: EP; Kind code of ref document: A1 |
WWE | WIPO information: entry into national phase | Ref document number: 18567644, Country of ref document: US; Ref document number: 2022823871, Country of ref document: EP |
ENP | Entry into the national phase | Ref document number: 2022823871; Country of ref document: EP; Effective date: 20231206 |
ENP | Entry into the national phase | Ref document number: 2023577669; Country of ref document: JP; Kind code of ref document: A |
NENP | Non-entry into the national phase | Ref country code: DE |