WO2019119336A1 - Multi-thread compression and decompression methods in generic data gz format, and device - Google Patents

Multi-thread compression and decompression methods in generic data gz format, and device

Info

Publication number
WO2019119336A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
gzdi
compression
thread
decompression
Prior art date
Application number
PCT/CN2017/117619
Other languages
French (fr)
Chinese (zh)
Inventor
朱泽轩
孙怡雯
Original Assignee
深圳大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳大学
Priority to PCT/CN2017/117619
Publication of WO2019119336A1

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Definitions

  • The invention belongs to the technical field of data processing, and in particular relates to a multi-thread compression and decompression method and device for the general data gz format.
  • At present, general-purpose compression schemes for text data mainly use the gz compression format.
  • For the gz format, the most widely used tools are zlib, which provides single-threaded gz compression, and pigz (a parallel implementation of gzip), which provides multi-threaded gz compression.
  • gz compression software based on zlib and pigz has the following main disadvantages:
  • General gz compression software often assumes that the input is a single character stream, that is, a single data source, and cannot process multi-source data well in parallel.
  • In the big data field, multi-source data is the most common case. For example, when collecting Internet user information, several pieces of user information may need to be compressed and saved to the same file at the same moment.
  • The zlib library implements only basic single-threaded gz compression and decompression, while pigz is a parallel gz compressor.
  • pigz mainly implements block-wise multi-threaded compression of a single data stream. For decompression it provides only a single-threaded solution, so decompression efficiency is limited by the single-thread computing power of the CPU.
  • Industry and academia also have a large demand for parallel multi-threaded decompression when reading massive data. For example, high-throughput DNA sequencing produces FASTA files of hundreds of GB, yet subsequent bioinformatics analysis can use only one thread for decompression and reading (an HPC compute node usually provides dozens of threads), which greatly extends the analysis time.
  • The invention provides a multi-thread compression and decompression method and device for the general data gz format. It aims to compress the original data with multiple threads, and to decompress the compressed data with multiple threads, on the premise that read and write operations are separated from the decompression and compression computation.
  • The present invention provides a multi-thread compression and decompression method for the general data gz format, comprising a compression step S1 and a decompression step S2, wherein the compression step S1 comprises:
  • Step S11: inputting original data, and dividing the original data into blocks to obtain M data blocks;
  • each data block is denoted Di, i ∈ [0, M-1];
  • Step S12: compressing the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
  • Step S13: writing the M pieces of compressed data gzDi to disk in order, and writing the corresponding M values of size(gzDi) into the preset space in order, to obtain compressed data;
  • the decompression step S2 comprises:
  • Step S21: inputting the compressed data, reading the written list of size(gzDi) values, and segmenting the compressed data according to that list to obtain M data blocks gzDi;
  • Step S22: decompressing the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
  • Step S23: concatenating the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
  • The present invention also provides a multi-thread compression and decompression device for the general data gz format, comprising a compression module 1 and a decompression module 2, wherein the compression module 1 comprises:
  • the blocking module 11, configured to input original data and divide the original data into blocks to obtain M data blocks;
  • each data block is denoted Di, i ∈ [0, M-1];
  • the compression module 12, configured to compress the M data blocks using N1 threads in the preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
  • the writing module 13, configured to write the M pieces of compressed data gzDi to disk in order, and to write the corresponding M values of size(gzDi) into the preset space in order, to obtain compressed data;
  • the decompression module 2 comprises:
  • the segmentation module 21, configured to input the compressed data, read the written list of size(gzDi) values, and segment the compressed data according to that list to obtain M data blocks gzDi;
  • the decompression module 22, configured to decompress the M data blocks gzDi using N2 threads in the preset second thread pool, to obtain M pieces of decompressed original data Di;
  • the serial module 23, configured to concatenate the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
  • The compression step first divides the input original data into blocks, then uses N1 threads to compress the data blocks, obtaining M pieces of compressed data gzDi and the corresponding size(gzDi), and finally writes gzDi to disk, with the M values of size(gzDi) written into the file header portion of the gz format.
  • The decompression step reads the written list of size(gzDi) values, segments the input compressed data according to that list to obtain M data blocks, decompresses the M data blocks using N2 threads to obtain M pieces of decompressed original data Di, and finally concatenates the original data Di to obtain the complete original data.
  • Compared with the prior art, the invention uses a parallel multi-thread gz compression method that is separated from read-write IO operations. Because read and write operations are separated from the decompression and compression computation, they can be scheduled according to the actual situation, which avoids IO contention among multi-threaded compressed writes.
  • The multi-thread gz decompression method lets software use more CPUs for decompression, which yields a higher data input rate and lets a single program reach higher compute utilization.
  • Because size(gzDi) is stored in the file header portion of the gz format, the data can be decompressed either with the existing single-threaded methods of zlib or pigz or with bgz multi-thread decompression. This preserves compatibility and keeps the cost of adopting the method very low.
  • FIG. 1 is a schematic flowchart of a multi-thread compression and decompression method of a general data gz format according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of a process of Bgz multi-thread compression and decompression provided by an embodiment of the present invention
  • FIG. 3 is a schematic block diagram of a multi-thread compression and decompression device of a general data gz format according to an embodiment of the present invention.
  • The present invention provides a multi-thread compression and decompression method and device for the general data gz format, and develops a multi-thread compression and decompression solution for multi-source data based on the currently widely used gz compression format.
  • The compression computation is separated from the write operation, which avoids IO contention among multi-threaded compressed writes;
  • the multi-thread gz decompression method lets software use more CPUs for decompression, which yields a higher data input rate and lets a single program reach higher compute utilization. The invention also takes software compatibility into account: the compressed data structure is specially designed so that the existing zlib and pigz methods can perform single-threaded decompression without modification.
  • The invention provides a flexible multi-thread compression and reading solution for big data. It is especially suitable for compressing and reading large volumes of multi-source data on high-performance computing platforms, and it lets big data software make fuller use of the computing power of high-performance computers (HPC).
  • Big data storage solutions must be economical, so compressed storage is an inevitable choice. Instead of storing the data directly, some computing resources are spent compressing it first; storing the reduced volume of data greatly lowers the use of IO resources and hard disk space. Such a solution uses all parts of the computer in a more balanced and coordinated way, avoids the situation where direct storage occupies IO while leaving the CPU idle, and makes full use of the hardware performance of the HPC.
  • the method provided by the present invention is specifically described below.
  • The invention is bgz (block gzip), a multi-thread compression and decompression algorithm for the general data gz format that is built on the zlib open-source library.
  • The bgz method uses the zlib data structures and the deflate method for its implementation and for reading and writing.
  • the method includes: a compression step S1 and a decompression step S2, wherein the compression step S1 includes:
  • Step S11 inputting original data, and performing block processing on the original data to obtain M pieces of data blocks;
  • each data block is denoted Di, i ∈ [0, M-1];
  • The original data in this embodiment is not limited in its source: it may come from one data source or from multiple sources. Nor is it limited in the number of pieces: it may be one piece of data or several.
  • First, the original data is prepared: the data to be compressed is loaded into memory and divided into blocks to obtain M data blocks. If there are multiple pieces of data, they are grouped by data source and the large pieces are divided into blocks.
  • The block size is adjustable and is set according to the memory configuration of the machine. The default is 10 MB. For example, if there are 10 pieces of data, nine of 1 MB each and one of 10 GB, only the 10 GB piece needs to be divided into blocks, and the other nine pieces can be processed as one piece.
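The blocking in step S11 can be illustrated with a short Python sketch. It is not the patent's implementation: the function name `split_into_blocks` and the choice to keep each source's blocks separate are assumptions made for illustration, and only the 10 MB default comes from the text.

```python
DEFAULT_BLOCK_SIZE = 10 * 1024 * 1024  # 10 MB default stated in the text


def split_into_blocks(sources, block_size=DEFAULT_BLOCK_SIZE):
    """Return the blocks D_0 .. D_{M-1} for one or more in-memory sources.

    Hypothetical helper: each source is cut into pieces of at most
    block_size bytes, so a small source yields exactly one block and an
    empty source yields none.
    """
    blocks = []
    for data in sources:
        for start in range(0, len(data), block_size):
            blocks.append(data[start:start + block_size])
    return blocks


# Small demonstration with a tiny block size so the result is easy to inspect.
blocks = split_into_blocks([b"a" * 25, b"b" * 3], block_size=10)
```

Concatenating the returned blocks in order reproduces the sources in order, which is the property the later concatenation step relies on.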
  • Step S12: compressing the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
  • The number of threads N1 is less than or equal to the maximum number of threads of the machine.
  • In the parallel compression process, each thread processes one data block at a time and the thread pool is reused until all data is compressed. Because different blocks take different amounts of time to compress, the thread pool must be scheduled flexibly so that all threads stay busy.
  • More specifically, the N1 threads in the preset first thread pool each compress one of N1 data blocks among the M data blocks. When any of the N1 threads finishes its data block, that thread takes one of the remaining uncompressed data blocks, until all M data blocks are compressed.
  • the purpose of preserving the preset space in the file header portion of the gz format is to record the size (gzDi) of the plurality of compressed data as a fast address index in the subsequent decompression process, thereby implementing multi-thread decompression.
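The scheduling described for step S12 can be sketched with Python's standard thread pool. This is an illustration under assumptions, not the patent's code: `compress_blocks` is a hypothetical name, each block becomes a standalone gzip member via `gzip.compress`, and the sizes are simply returned to the caller rather than written into a header.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def compress_blocks(blocks, n1=4, level=6):
    """Compress each block D_i into a gzip member gzD_i using n1 threads.

    Hypothetical helper. Returns (members, sizes) in block order, where
    sizes[i] plays the role of size(gzD_i) in the text.
    """
    def compress_one(block):
        return gzip.compress(block, compresslevel=level)

    with ThreadPoolExecutor(max_workers=n1) as pool:
        # map() hands the next pending block to whichever worker is free
        # and still returns results in the original block order.
        members = list(pool.map(compress_one, blocks))
    sizes = [len(m) for m in members]
    return members, sizes


members, sizes = compress_blocks([b"x" * 1000, b"y" * 10], n1=2)
```

The pool keeps every worker busy as long as blocks remain, which matches the requirement that a thread picks up a remaining block as soon as it finishes its current one.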
  • Step S13 sequentially writing the M pieces of the compressed data gzDi into the disk, and sequentially writing the size (gzDi) of the corresponding M pieces of the data gzDi into the preset space to obtain compressed data;
  • The compressed data gzDi is written to the hard disk and the corresponding size(gzDi) is recorded as required. On a single-disk system, a single thread writes; on a multi-machine distributed system, that is, a multi-disk system, the number of writing threads is determined by the actual number of hard disks.
  • Recording size(gzDi) is necessary for multi-thread decompression.
  • size(gzDi) is recorded in the data header of the gz compression format; depending on the requirements of the software system, it may also be recorded in memory or in an index list.
  • the decompression process can support single-thread decompression of zlib and pigz, or multi-thread decompression using size(gzDi).
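A minimal Python sketch of the single-writer case in step S13, and of why an ordinary reader still works, follows. The helper name `write_members` is an assumption, and the sizes are kept in an in-memory list (one of the options the text mentions) rather than in the gz header. The compatibility shown here rests on the gzip format allowing several members to be concatenated into one stream.

```python
import gzip
import io


def write_members(out, members):
    """Write gzD_0 .. gzD_{M-1} in order from one writer; return their sizes."""
    sizes = []
    for member in members:
        out.write(member)
        sizes.append(len(member))
    return sizes


buf = io.BytesIO()  # stands in for the output file on disk
members = [gzip.compress(b"alpha "), gzip.compress(b"beta")]
sizes = write_members(buf, members)

# A standard single-threaded reader sees one ordinary multi-member gzip stream.
restored = gzip.decompress(buf.getvalue())
```

The recorded sizes are not needed for this single-threaded path; they only matter when the stream is to be cut into blocks for parallel decompression.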
  • the decompression step S2 includes:
  • Step S21: inputting the compressed data, reading the written list of size(gzDi) values, and segmenting the compressed data according to that list to obtain M data blocks gzDi;
  • The list of size(gzDi) values serves as a fast address index for segmentation, which makes the subsequent multi-thread decompression possible.
  • Step S22: decompressing the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
  • When the second thread pool is set up in advance, a reasonable number of threads N2 must be chosen according to the performance of the machine and the number of blocks.
  • The number of threads N2 is less than or equal to the maximum number of threads of the machine.
  • In the parallel decompression process, each thread processes one compressed data block at a time and the thread pool is reused until all data is decompressed. More specifically, the N2 threads in the preset second thread pool each decompress one of N2 data blocks among the M data blocks gzDi. When any of the N2 threads finishes its data block, that thread takes one of the remaining data blocks, until all M data blocks are decompressed and the M pieces of decompressed original data Di are obtained.
  • The number of threads N1 in the first thread pool may equal the number of threads N2 in the second thread pool.
  • M is preferably an integer multiple of N1 and N2, so that computing resources do not sit idle.
  • N1 is not necessarily equal to N2.
  • Step S23: concatenating the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
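Steps S21 to S23 can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, the size list is passed in directly instead of being read from the gz header, and the input is assumed to contain exactly the concatenated members gzD_0 .. gzD_{M-1}.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from itertools import accumulate


def split_by_sizes(compressed, sizes):
    """S21: cut the stream into gzD_i using the size list as an offset index."""
    ends = list(accumulate(sizes))
    starts = [0] + ends[:-1]
    return [compressed[s:e] for s, e in zip(starts, ends)]


def decompress_parallel(compressed, sizes, n2=4):
    """S22 and S23: decompress the blocks with n2 threads, then concatenate."""
    chunks = split_by_sizes(compressed, sizes)
    with ThreadPoolExecutor(max_workers=n2) as pool:
        parts = list(pool.map(gzip.decompress, chunks))  # order preserved
    return b"".join(parts)


members = [gzip.compress(b"one"), gzip.compress(b"two"), gzip.compress(b"three")]
stream = b"".join(members)
sizes = [len(m) for m in members]
restored = decompress_parallel(stream, sizes, n2=2)
```

Because the running sum of the sizes gives each block's byte offset, no thread has to scan the stream to find where its block starts.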
  • Compression and decompression use thread pools to implement multi-threading. The compression process takes the original data as input and generates the compressed data and the block information; the multi-thread decompression process takes the compressed data and the block information as input and yields the decompressed original information, as shown in Figure 2.
  • the invention uses the thread pool to schedule multi-thread compression and decompression, and can better exert the performance of the hardware on different systems.
  • Its multi-threaded scheduling pseudo code is as follows:
  • a major innovation of the present invention is that the block information is stored in the gz compression format data header, thereby ensuring compatibility.
  • the processed format can still be decompressed using the original zlib or pigz single-thread decompression method without any modification.
  • it can extract the block information from the gz compressed file itself, thus achieving multi-thread fast decompression.
  • the data structure of the Gz compression format header is as follows:
  • the extra field of the gz header is empty.
  • The bgz method provided by the invention allocates a fixed-length extra field in advance, capable of storing the block information of 100 blocks, and updates this field continually as compression proceeds. If there are fewer than 100 data blocks, the remaining space is padded with zeros. If the amount of data is large and the number of blocks exceeds 100, another extra field with space for 100 blocks is allocated when the 101st block is compressed, and so on.
  • the bgz file stored on the hard disk follows the gzip file format and consists of multiple data blocks.
  • the contents of each data block are as follows:
  • Each data block consists of three parts: a header part, a data part, and a tail part. The header part runs from ID1 through the extra field, and CRC32 and ISIZE form the tail part. Except for the extra field, the content is consistent with the normal gzip format, which is defined as follows:
  • Bit 1 FHCRC – indicates the presence of a CRC16 header check field
  • Bit 2 FEXTRA – indicates the presence of an optional extra field
  • Bit 3 FNAME – indicates the presence of the original filename field
  • Bit 4 FCOMMENT – indicates the presence of a comment field
  • MTIME 4 bytes. Modification time, in Unix format.
  • OS 1 byte. Indicates the operating system, or more precisely the file system. The following definitions are available:
  • FLG.FEXTRA = 1 means an extra field is present; XLEN gives the extra field length, 800
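The header layout above can be made concrete with a Python sketch that builds one gzip member whose extra field holds a size index. Several details are assumptions rather than facts from the text: each slot is taken to be an 8-byte little-endian integer (100 slots × 8 bytes matches the stated XLEN of 800), the slots are stored raw without the subfield framing that the gzip specification recommends, and the function names are hypothetical.

```python
import gzip
import struct
import zlib

SLOTS = 100      # block-size entries per extra field (from the text)
SLOT_BYTES = 8   # assumption: 8 bytes per entry, giving XLEN = 800
FEXTRA = 0x04    # FLG bit 2


def gzip_member_with_index(payload, sizes):
    """Build a gzip member carrying up to 100 block sizes in its extra field."""
    if len(sizes) > SLOTS:
        raise ValueError("too many sizes for one extra field")
    padded = list(sizes) + [0] * (SLOTS - len(sizes))  # unused slots are zero
    extra = struct.pack(f"<{SLOTS}Q", *padded)
    # ID1, ID2, CM=8 (deflate), FLG, MTIME, XFL, OS=255 (unknown), XLEN
    header = struct.pack("<BBBBIBBH", 0x1F, 0x8B, 8, FEXTRA, 0, 0, 255, len(extra))
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no framing
    body = comp.compress(payload) + comp.flush()
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload) & 0xFFFFFFFF)
    return header + extra + body + trailer


def read_index(member):
    """Read the non-zero size entries back from the extra field."""
    xlen = struct.unpack_from("<H", member, 10)[0]  # XLEN sits at offset 10
    count = xlen // SLOT_BYTES
    return [s for s in struct.unpack_from(f"<{count}Q", member, 12) if s]


member = gzip_member_with_index(b"hello bgz", [123, 456])
```

A reader that knows nothing about the index skips the extra field using XLEN, so `gzip.decompress(member)` still returns the payload, while an index-aware reader can call `read_index` to recover the sizes.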
  • The multi-thread compression and decompression method for the general data gz format uses a parallel multi-thread gz compression method separated from read-write IO operations. Reading, writing, compression, and decompression are separated from one another, so compression and decompression can be matched sensibly to the resources of the actual computing platform.
  • The compression end and the decompression end need not be on the same machine.
  • Read and write threads and compression and decompression threads are allocated as needed to maximize the performance of the machine.
  • The multi-thread gz decompression method lets software use more CPUs for decompression, which yields a higher data input rate and lets a single program reach higher compute utilization;
  • The compressed data structure is specially designed to store size(gzDi) in the header of the gz format, so that the existing zlib and pigz methods can decompress the data without modification.
  • This greatly reduces users' concerns about version updates. In particular, when the data producer and the data user are separate, no version incompatibility arises, so the cost of adopting the method is very low; users need to replace their software only when they decide they have the relevant decompression requirements.
  • a multi-thread compression and decompression device of a general data gz format includes: a compression module 1 and a decompression module 2, wherein the compression module 1 includes:
  • the blocking module 11, configured to input original data and divide the original data into blocks to obtain M data blocks, wherein each data block is denoted Di, i ∈ [0, M-1];
  • the compression module 12, configured to compress the M data blocks using N1 threads in the preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
  • the writing module 13, configured to write the M pieces of compressed data gzDi to disk in order, and to write the corresponding M values of size(gzDi) into the preset space in order, to obtain compressed data;
  • the decompression module 2 comprises:
  • the segmentation module 21, configured to input the compressed data, read the written list of size(gzDi) values, and segment the compressed data according to that list to obtain M data blocks gzDi;
  • the decompression module 22, configured to decompress the M data blocks gzDi using N2 threads in the preset second thread pool, to obtain M pieces of decompressed original data Di;
  • the serial module 23, configured to concatenate the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
  • the gz compression format is used in Internet text data archival storage, network transmission, and general storage of FASTQ data (partial network transmission uses bz2 compression).
  • Multi-threaded gz compression schemes are widely used, mainly through the pigz method or by building on the zlib library to achieve multi-thread compression. These are the currently known application areas and application methods.

Abstract

Multi-thread compression and decompression methods in generic data gz format, applicable to the field of data processing technology. The step of compression is as follows (S1): inputting original data, and performing block division processing on the original data, to obtain M data blocks (S11); using N1 threads in a preset first thread pool to compress said M data blocks, and during compression, reserving a preset space in a file header portion in a gz format, to obtain M copies of compressed data gzDi and the sizes of the data gzDi (S12); writing said M copies of compressed data gzDi into a disk in sequence, and writing the corresponding sizes of the M copies of data gzDi into the preset space in sequence, to obtain compressed data (S13). The step of decompression is as follows (S2): inputting the compressed data, reading the list information of the written sizes, and dividing the compressed data according to the list information of the sizes, to obtain M data blocks gzDi (S21); using N2 threads in a preset second thread pool to decompress said M data blocks gzDi, to obtain M copies of decompressed original data Di (S22); and connecting the decompressed original data Di in series according to the list information of the sizes to obtain complete original data (S23). The present method achieves the purpose of multi-thread compression and multi-thread decompression.

Description

Multi-thread compression and decompression method and device for the general data gz format
Technical field
The invention belongs to the technical field of data processing, and in particular relates to a multi-thread compression and decompression method and device for the general data gz format.
Background
At present, general-purpose compression schemes for text data mainly use the gz compression format. For the gz format, the most widely used tools are zlib, which provides single-threaded gz compression, and pigz (a parallel implementation of gzip), which provides multi-threaded gz compression. gz compression software based on zlib and pigz has two main disadvantages:
1. General gz compression software often assumes that the input is a single character stream, that is, a single data source, and cannot process multi-source data well in parallel. In the big data field, multi-source data is the most common case: when collecting Internet user information, for example, several pieces of user information may need to be compressed and saved to the same file at the same moment. When the data volume is large enough, only parallel processing can meet the time requirements. The zlib library implements only basic single-threaded gz compression and decompression, while pigz is a parallel gz compressor. Saving data with pigz parallel compression causes serious IO contention and low IO resource utilization, because pigz binds compression to writing and decompression to reading; zlib does the same. Binding compression to writing and decompression to reading simplifies use, but it is not flexible enough to choose the best read-write configuration for a computer's CPU and IO performance. For a computer whose computing power far exceeds its IO capability, read and write operations must be separated from the decompression and compression computation to make full use of that computing power.
2. pigz mainly implements block-wise multi-threaded compression of a single data stream. For decompression it provides only a single-threaded solution, so decompression efficiency is limited by the single-thread computing power of the CPU. Industry and academia also have a large demand for parallel multi-threaded decompression when reading massive data. For example, high-throughput DNA sequencing produces FASTA files of hundreds of GB, yet subsequent bioinformatics analysis can use only one thread for decompression and reading (an HPC compute node usually provides dozens of threads), which greatly extends the analysis time.
Summary of the invention
The invention provides a multi-thread compression and decompression method and device for the general data gz format. It aims to compress the original data with multiple threads, and to decompress the compressed data with multiple threads, on the premise that read and write operations are separated from the decompression and compression computation.
The present invention provides a multi-thread compression and decompression method for the general data gz format, comprising a compression step S1 and a decompression step S2, wherein the compression step S1 comprises:
Step S11: inputting original data, and dividing the original data into blocks to obtain M data blocks;
wherein each data block is denoted Di, i ∈ [0, M-1];
Step S12: compressing the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
Step S13: writing the M pieces of compressed data gzDi to disk in order, and writing the corresponding M values of size(gzDi) into the preset space in order, to obtain compressed data;
wherein the decompression step S2 comprises:
Step S21: inputting the compressed data, reading the written list of size(gzDi) values, and segmenting the compressed data according to that list to obtain M data blocks gzDi;
Step S22: decompressing the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
Step S23: concatenating the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
The present invention also provides a multi-thread compression and decompression device for the general data gz format, comprising a compression module 1 and a decompression module 2, wherein the compression module 1 comprises:
the blocking module 11, configured to input original data and divide the original data into blocks to obtain M data blocks;
wherein each data block is denoted Di, i ∈ [0, M-1];
the compression module 12, configured to compress the M data blocks using N1 threads in the preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each gzDi;
the writing module 13, configured to write the M pieces of compressed data gzDi to disk in order, and to write the corresponding M values of size(gzDi) into the preset space in order, to obtain compressed data;
wherein the decompression module 2 comprises:
the segmentation module 21, configured to input the compressed data, read the written list of size(gzDi) values, and segment the compressed data according to that list to obtain M data blocks gzDi;
the decompression module 22, configured to decompress the M data blocks gzDi using N2 threads in the preset second thread pool, to obtain M pieces of decompressed original data Di;
the serial module 23, configured to concatenate the decompressed original data Di according to the list of size(gzDi) values, to obtain the complete original data.
Compared with the prior art, the present invention has the following beneficial effects. In the multi-thread compression and decompression method and device for the general-purpose data gz format provided by the invention, compression proceeds as follows: the input original data is first divided into blocks; N1 threads then compress the data blocks to obtain M pieces of compressed data gzDi and the corresponding size(gzDi); finally, gzDi is written to disk, with the M values size(gzDi) written into the file header portion of the gz format. Decompression proceeds as follows: the written list information of size(gzDi) is read, and the input compressed data is segmented according to that list to obtain M data blocks; N2 threads decompress the M data blocks to obtain M pieces of decompressed original data Di; finally, the original data Di are concatenated to obtain the complete original data. Compared with the prior art, the invention uses a parallel multi-thread gz compression method that is separated from read/write IO operations: reads and writes are decoupled from the decompression and compression computations and can be scheduled according to actual conditions, which effectively avoids IO contention among multi-thread compression writes. In addition, the multi-thread gz decompression method lets software use more CPUs for decompression, so that data can be fed in at a higher rate and a single program achieves higher compute utilization. At the same time, because this gz format stores size(gzDi) in the file header portion, the data can be decompressed either with the existing single-thread decompression methods of zlib or pigz or with bgz multi-thread decompression; compatibility is preserved, so the cost of adopting the method is very low.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of a multi-thread compression and decompression method for the general-purpose data gz format according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the bgz multi-thread compression and decompression process according to an embodiment of the present invention;
FIG. 3 is a schematic block diagram of a multi-thread compression and decompression device for the general-purpose data gz format according to an embodiment of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
The prior art has the following technical problems. On the one hand, when data is compressed and saved in parallel with pigz, pigz binds compression to writing and decompression to reading, so hardware utilization is very low when IO resources are limited relative to CPU computation. On the other hand, neither zlib nor pigz implements multi-thread decompression of gz.
To solve these technical problems, the present invention proposes a multi-thread compression and decompression method and device for the general-purpose data gz format, providing a multi-thread compression and decompression solution for multi-source data in the currently widely used gz compression format. Separating the compression computation from the write operation effectively avoids IO contention among multi-thread compression writes. The multi-thread gz decompression method lets software use more CPUs for decompression, so that data can be fed in at a higher rate and a single program achieves higher compute utilization. At the same time, the invention takes software compatibility into account: the compressed data structure is specially designed so that the existing zlib and pigz methods can perform single-thread decompression without modification.
With the development of the Internet and of electronic technology, the volume of data keeps growing and computer performance keeps improving. More suitable software is needed to connect data and hardware, and the present invention provides a flexible multi-thread compressed read/write solution for big data. For text data, it is especially suited to compressed reading and writing of massive multi-source data on high-performance computing platforms, allowing big-data software to make fuller use of the computing power of high-performance computers (HPC). A big-data storage scheme must be economical, so compressed storage is the inevitable choice. Compared with storing ordinary data directly, big data is compressed with a certain amount of computing resources before storage; reducing the total number of data characters before storing greatly reduces the use of IO resources and hard-disk space. Such a scheme uses all parts of the computer more fully and in a more coordinated way, avoids the situation in which direct storage occupies only IO and leaves the CPU unused, and brings the hardware performance of the HPC fully into play.
The method provided by the present invention is described in detail below. The invention is bgz (block gzip), a multi-thread compression and decompression algorithm for the general-purpose data gz format built on the open-source zlib library. First, the bgz method uses zlib's data structures and deflate method to implement compression and decompression functions that are separated from read/write IO operations, that is, gzwrite = bgzCompress + fwrite and gzread = fread + bgzDecompress. Multithreading is then implemented on top of this in-memory compression and decompression.
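The decomposition gzwrite = bgzCompress + fwrite can be illustrated with a minimal Python sketch. The names `bgz_compress` and `bgz_decompress` are hypothetical stand-ins, not the patent's functions, and Python's standard `gzip` module is used in place of direct zlib calls in C. The point is only that the compute step never touches a file handle, so the IO step can be scheduled separately.

```python
import gzip
import io


def bgz_compress(raw: bytes) -> bytes:
    """Hypothetical stand-in for bgzCompress: pure in-memory gzip compression."""
    return gzip.compress(raw)


def bgz_decompress(packed: bytes) -> bytes:
    """Hypothetical stand-in for bgzDecompress: pure in-memory decompression."""
    return gzip.decompress(packed)


raw = b"ACGT" * 1000

# Compute step: no file handle is involved, so it can run on any worker thread.
packed = bgz_compress(raw)

# IO step: a separate write, here into an in-memory buffer standing in for fwrite.
sink = io.BytesIO()
sink.write(packed)

# Reading back is likewise split into an IO step and a compute step.
restored = bgz_decompress(sink.getvalue())
```

Because the two steps are independent, a caller is free to run many compute steps in parallel while keeping a single writer.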
Referring to FIG. 1, an embodiment of the present invention provides a multi-thread compression and decompression method for the general-purpose data gz format. The method comprises a compression step S1 and a decompression step S2, wherein the compression step S1 comprises:
Step S11: inputting original data and dividing the original data into blocks to obtain M data blocks;
wherein each data block is denoted Di, i ∈ [0, M-1];
Specifically, the original data in this embodiment is not limited to one form of data source: it may come from a single data source or from multiple sources. Nor is it limited in the number of pieces: it may be one piece of data or several.
Specifically, the original data is prepared, the original data to be compressed is loaded into memory, and it is divided into blocks to obtain M pieces of data. If there are several pieces of data, they are classified by data source and only the large pieces are split into blocks. The block size is adjustable and is set according to the machine's memory configuration; the usual default is 10 MB. For example, given 10 pieces of data, 9 of 1 MB each and 1 of 10 GB, only the 10 GB piece needs to be split into blocks, and the other 9 pieces can be handled as a single piece.
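The blocking rule described above can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: `split_into_blocks` and `plan_blocks` are hypothetical names, and grouping all small sources into one leading block is just one reading of "handled as a single piece".

```python
from typing import List, Sequence

DEFAULT_BLOCK_SIZE = 10 * 1024 * 1024  # the 10 MB default mentioned in the text


def split_into_blocks(data: bytes, block_size: int = DEFAULT_BLOCK_SIZE) -> List[bytes]:
    """Cut one large piece of data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def plan_blocks(sources: Sequence[bytes], block_size: int = DEFAULT_BLOCK_SIZE) -> List[bytes]:
    """Group small sources into one block (assumption) and split only the large ones."""
    small = b"".join(s for s in sources if len(s) <= block_size)
    blocks = [small] if small else []
    for s in sources:
        if len(s) > block_size:
            blocks.extend(split_into_blocks(s, block_size))
    return blocks


# Tiny stand-in for "9 small files and 1 large file", with a 10-byte block size.
sources = [b"a" * 3, b"b" * 2, b"c" * 25]
blocks = plan_blocks(sources, block_size=10)
block_lengths = [len(b) for b in blocks]
```

Here the two small sources end up in one 5-byte block and the 25-byte source is cut into blocks of 10, 10 and 5 bytes.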
Step S12: compressing the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each piece of data gzDi;
Specifically, when the first thread pool is set up in advance, a reasonable thread count N1 must be chosen according to the machine's performance; in general, N1 is less than or equal to the machine's maximum number of threads.
Specifically, the parallel compression works as follows: one thread processes one data block, and the thread pool is reused until all data has been compressed. Different data takes different amounts of time to compress, so the thread pool must be scheduled flexibly to keep all threads busy. More specifically, the N1 threads in the preset first thread pool each compress one of N1 data blocks out of the M data blocks; when any one of the N1 threads finishes its data block, that thread goes on to process the remaining uncompressed data blocks, until all M data blocks have been compressed.
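The scheduling described in step S12 maps naturally onto a worker pool. The sketch below uses Python's `concurrent.futures.ThreadPoolExecutor` as an assumed substitute for the patent's thread pool; `compress_blocks` is a hypothetical name, and each block is compressed into an independent gzip member with `gzip.compress`. It does not reserve header space, which is a separate concern.

```python
import gzip
import os
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional, Sequence, Tuple


def compress_blocks(blocks: Sequence[bytes],
                    n_threads: Optional[int] = None) -> Tuple[List[bytes], List[int]]:
    """Compress every block independently and return (gzDi list, size(gzDi) list)."""
    if n_threads is None:
        # N1 should not exceed what the machine offers, nor the number of blocks.
        n_threads = max(1, min(len(blocks), os.cpu_count() or 1))
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map() hands the next pending block to whichever worker becomes idle,
        # and yields results in input order regardless of completion order.
        gz_blocks = list(pool.map(gzip.compress, blocks))
    sizes = [len(g) for g in gz_blocks]
    return gz_blocks, sizes


blocks = [b"A" * 1000, b"C" * 500, b"G" * 10]
gz_blocks, sizes = compress_blocks(blocks, n_threads=2)

# Concatenated gzip members form a valid gzip stream for an ordinary reader.
joined = gzip.decompress(b"".join(gz_blocks))
```

Returning the sizes alongside the compressed blocks keeps the information that the later segmentation step needs.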
Specifically, the preset space in the file header portion of the gz format is reserved in order to record size(gzDi) for the multiple pieces of compressed data; it serves as a fast address index during later decompression and thereby enables multi-thread decompression.
Step S13: writing the M pieces of compressed data gzDi to disk in order, and writing the corresponding M values size(gzDi) into the preset space in order, to obtain compressed data;
Specifically, the compressed data gzDi are written to the hard disk as required, and the corresponding size(gzDi) values are recorded. On a single-disk system, writing is single-threaded; on a multi-machine distributed system, that is, a multi-disk system, the number of writer threads is determined by the actual number of disks.
The recorded size(gzDi) is required for multi-thread decompression. In this embodiment size(gzDi) is recorded in the data header of the gz compression format, although it can also be recorded in memory or in an index list according to the needs of the software system. With this specially designed compressed data structure, decompression can use the single-thread decompression of zlib and pigz, or can use size(gzDi) for multi-thread decompression.
wherein the decompression step S2 comprises:
Step S21: inputting the compressed data, reading the written list information of size(gzDi), and segmenting the compressed data according to that list information to obtain M data blocks gzDi;
Specifically, the list information of size(gzDi) is used as a fast address index for segmentation, which in turn enables the subsequent multi-thread decompression.
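Using the size list as an address index amounts to turning sizes into cumulative offsets. The following sketch assumes the size list has already been read from wherever it was stored; `split_by_sizes` is a hypothetical helper name.

```python
import gzip
from itertools import accumulate
from typing import List, Sequence


def split_by_sizes(stream: bytes, sizes: Sequence[int]) -> List[bytes]:
    """Cut a concatenated stream into blocks gzDi using the size(gzDi) list."""
    ends = list(accumulate(sizes))            # end offset of each block
    total = ends[-1] if ends else 0
    if total != len(stream):
        raise ValueError("size list does not match the stream length")
    starts = [0] + ends[:-1]                  # start offset of each block
    return [stream[s:e] for s, e in zip(starts, ends)]


# Build a small multi-member stream to segment.
payloads = [b"first block", b"second", b"third block!"]
members = [gzip.compress(p) for p in payloads]
sizes = [len(m) for m in members]
stream = b"".join(members)

parts = split_by_sizes(stream, sizes)
```

No scanning of the compressed bytes is needed: each block boundary follows directly from the running sum of the sizes.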
Step S22: decompressing the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
Specifically, when the second thread pool is set up in advance, a reasonable thread count N2 must be chosen according to the machine's performance and the number of blocks; in general, N2 is less than or equal to the machine's maximum number of threads.
Specifically, the parallel decompression works as follows: one thread processes one compressed data block, and the thread pool is reused until all data has been decompressed. More specifically, the N2 threads in the preset second thread pool each decompress one of N2 data blocks out of the M data blocks gzDi; when any one of the N2 threads finishes its data block, that thread goes on to process the remaining data blocks not yet decompressed, until all M data blocks have been decompressed and M pieces of decompressed original data Di are obtained.
Specifically, in this embodiment the number of threads N1 in the first thread pool equals the number of threads N2 in the second thread pool. It should be noted that if the data volume is large and there are enough pieces of data, M is preferably an integer multiple of N1 and N2 so that computing resources do not sit idle. N1 is not, however, required to equal N2.
Step S23: concatenating the decompressed original data Di according to the list information of size(gzDi) to obtain the complete original data.
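Steps S22 and S23 together can be sketched as a parallel map followed by an ordered join. As before, this is a minimal illustration using Python's `ThreadPoolExecutor` and `gzip.decompress`, with `decompress_blocks` as a hypothetical name; it assumes the stream has already been segmented into independent gzip members.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import Sequence


def decompress_blocks(gz_blocks: Sequence[bytes], n_threads: int = 4) -> bytes:
    """Decompress each gzDi on a pool of N2 threads, then concatenate Di in order."""
    with ThreadPoolExecutor(max_workers=max(1, n_threads)) as pool:
        # Results come back in block order even if blocks finish out of order.
        parts = list(pool.map(gzip.decompress, gz_blocks))
    return b"".join(parts)


original_blocks = [b"ACGT" * 250, b"TTGA" * 100, b"N" * 37]
gz_blocks = [gzip.compress(b) for b in original_blocks]

restored = decompress_blocks(gz_blocks, n_threads=2)
```

The thread count used here is independent of the one used for compression, in line with the remark that N1 need not equal N2.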
It should be noted that both compression and decompression use thread-pool technology to implement multithreading. The compression process takes the original data as input and produces compressed data and block information; the multi-thread decompression process takes the compressed data and block information as input and recovers the original data, as shown in FIG. 2.
The present invention uses a thread pool to schedule multi-thread compression and decompression, so hardware performance can be exploited well on different systems. The multi-thread scheduling pseudocode is as follows:
[Figure PCTCN2017117619-appb-000001: multi-thread scheduling pseudocode]
It should be noted that one major innovation of the present invention is that the block information is stored in the data header of the gz compression format, which preserves compatibility. Data processed with the bgz compression method can still be decompressed with the existing single-thread decompression methods of zlib or pigz, without any modification. With bgz multi-thread decompression, the block information can be extracted from the gz compressed file itself, enabling fast multi-thread decompression.
The data structure of the gz compression format header is as follows:
[Figures PCTCN2017117619-appb-000002 and PCTCN2017117619-appb-000003: data structure of the gz compression format header]
From the above data structure it can be seen that the gz data header contains an extra field; under normal circumstances the extra field takes no part in decompression, and when the data is stored on disk the extra field of the gz header is empty. The bgz method provided by the present invention can allocate in advance an extra field of fixed length (able to hold the block information of 100 blocks) and keep updating it as compression proceeds. If there are fewer than 100 data blocks, the remaining space is filled with zeros. If there is more data and the number of blocks exceeds 100, another extra field with space for 100 blocks is allocated when the 101st block is compressed, and so on.
At present, the block information of one block occupies 8 bytes and mainly records the block's original size before compression, size(Di), and its size after compression, size(gzDi). Because every block is compressed independently, each compressed data block gzDi has its own gz data header, but not every header has an extra field containing block information: depending on the number of blocks, the extra field exists only in the header of the (100·i+1)-th compressed block, where i = 0, 1, 2, ...
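The fixed-size block table described above (100 slots of 8 bytes, zero-padded) can be sketched with Python's `struct` module. The text only states that each entry takes 8 bytes and records size(Di) and size(gzDi); splitting it into two little-endian unsigned 32-bit integers is an assumption made for this sketch, as are the names `pack_block_table` and `unpack_block_table`.

```python
import struct
from typing import List, Sequence, Tuple

SLOTS = 100                       # one table holds the information of 100 blocks
ENTRY = struct.Struct("<II")      # ASSUMED layout: size(Di), size(gzDi) as uint32 LE


def pack_block_table(entries: Sequence[Tuple[int, int]]) -> bytes:
    """Serialize (size(Di), size(gzDi)) pairs into a zero-padded 800-byte table."""
    if len(entries) > SLOTS:
        raise ValueError("a single table holds at most 100 blocks")
    body = b"".join(ENTRY.pack(raw_size, gz_size) for raw_size, gz_size in entries)
    return body.ljust(SLOTS * ENTRY.size, b"\x00")   # unused slots are filled with 0


def unpack_block_table(table: bytes) -> List[Tuple[int, int]]:
    """Read entries back, stopping at the first zero-filled (unused) slot."""
    entries = []
    for raw_size, gz_size in ENTRY.iter_unpack(table):
        if gz_size == 0:          # a real gzip member is never 0 bytes long
            break
        entries.append((raw_size, gz_size))
    return entries


entries = [(10 * 1024 * 1024, 2048), (512, 80)]
table = pack_block_table(entries)
decoded = unpack_block_table(table)
```

With this layout the table is exactly 100 × 8 = 800 bytes, and 32-bit sizes are sufficient for blocks of the default 10 MB size.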
A bgz file stored on disk follows the gzip file format and consists of multiple data blocks. Each data block is laid out as follows:
+---+---+---+---+---+---+---+---+---+---+=========//==========+=======//========+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | extra header fields | compressed data |     CRC32     |     ISIZE     |
+---+---+---+---+---+---+---+---+---+---+=========//==========+=======//========+---+---+---+---+---+---+---+---+
Each data block consists of three parts: a header, the data, and a trailer. Everything from ID1 through the extra header fields is the header; CRC32 and ISIZE form the trailer. Apart from the extra field, the contents are the same as in the ordinary gzip format and are defined as follows:
ID1 and ID2: 1 byte each. Fixed values, ID1 = 31 (0x1F) and ID2 = 139 (0x8B), identifying the gzip format.
CM: 1 byte. Compression method. Only one is currently in use: CM = 8, indicating the DEFLATE method.
FLG: 1 byte. Flags.
bit 0 FTEXT - indicates text data
bit 1 FHCRC - indicates that a CRC16 header check field is present
bit 2 FEXTRA - indicates that optional extra fields are present
bit 3 FNAME - indicates that an original file name field is present
bit 4 FCOMMENT - indicates that a comment field is present
bits 5-7 - reserved
MTIME: 4 bytes. Modification time, in Unix format.
XFL: 1 byte. Extra flags. When CM = 8, XFL = 2 indicates the slowest algorithm with maximum compression; XFL = 4 indicates the fastest algorithm with the least compression.
OS: 1 byte. Identifies the operating system or, more precisely, the file system. The following values are defined:
0 - FAT file system (MS-DOS, OS/2, NT/Win32)
1 - Amiga
2 - VMS/OpenVMS
3 - Unix
4 - VM/CMS
5 - Atari TOS
6 - HPFS file system (OS/2, NT)
7 - Macintosh
8 - Z-System
9 - CP/M
10 - TOPS-20
11 - NTFS file system (NT)
12 - QDOS
13 - Acorn RISCOS
255 - unknown
Extra header fields:
FLG.FEXTRA = 1 indicates that an extra field is present; XLEN gives the length of the extra field, which is 800.
+---+---+---+---+================//================+
|SI1|SI2| XLEN  |   optional data of XLEN bytes    |
+---+---+---+---+================//================+
FLG.FNAME = 0 indicates that no original file name is stored.
FLG.FCOMMENT = 0 indicates that there is no comment; if it equals 1, a comment is added.
FLG.FHCRC = 0 indicates that only the default CRC32 check is used; if it equals 1, a CRC16 header check is also present.
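The compatibility claim rests on the fact that standard gzip readers skip the extra field. The sketch below builds one gzip member by hand with FLG.FEXTRA set and then reads it with Python's stock `gzip` and `zlib` modules. It follows the subfield layout of RFC 1952 (a 2-byte XLEN, then SI1, SI2, a 2-byte LEN and the payload), which is an assumption about how the diagram above maps onto the standard; with that layout an 800-byte payload gives XLEN = 804 rather than 800. The subfield ID `b"bz"` and the function names are hypothetical.

```python
import gzip
import struct
import zlib

FEXTRA = 0x04  # FLG bit 2


def gzip_member_with_extra(raw: bytes, payload: bytes, si: bytes = b"bz") -> bytes:
    """Build one gzip member whose header carries `payload` in an extra subfield."""
    subfield = si + struct.pack("<H", len(payload)) + payload   # SI1 SI2 LEN data
    header = (
        b"\x1f\x8b"                          # ID1, ID2
        + b"\x08"                            # CM = 8 (DEFLATE)
        + bytes([FEXTRA])                    # FLG: only FEXTRA set
        + struct.pack("<I", 0)               # MTIME = 0 (not recorded)
        + b"\x00"                            # XFL
        + b"\xff"                            # OS = 255 (unknown)
        + struct.pack("<H", len(subfield))   # XLEN
        + subfield
    )
    deflater = zlib.compressobj(6, zlib.DEFLATED, -15)          # raw DEFLATE stream
    body = deflater.compress(raw) + deflater.flush()
    trailer = struct.pack("<II", zlib.crc32(raw) & 0xFFFFFFFF, len(raw) & 0xFFFFFFFF)
    return header + body + trailer


def read_extra(member: bytes) -> bytes:
    """Return the raw extra field of a gzip member, or b'' if FEXTRA is not set."""
    if not member[3] & FEXTRA:
        return b""
    (xlen,) = struct.unpack_from("<H", member, 10)
    return member[12:12 + xlen]


raw = b"ACGT" * 100
payload = b"\x00" * 800                      # zero-filled, reserved block table
member = gzip_member_with_extra(raw, payload)

plain = gzip.decompress(member)              # ordinary reader, extra field ignored
extra = read_extra(member)                   # aware reader, extra field recovered
```

An ordinary reader recovers the data without knowing about the table, while an aware reader can locate the table at a fixed offset in the header.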
The zlib-based multi-thread compression and decompression algorithm described above was developed to serve DNA sequence analysis. In this era of big-data bioinformatics analysis, the efficiency of compressing and storing base-pair information files is extremely important, so the gains from multi-thread compression and decompression alone are far from sufficient for massive data. The multi-thread approach can also be applied to file reading, writing and transmission; in addition, compressed storage can be implemented in hardware, which eliminates the memory read/write stage and makes compressed storage of files more efficient.
Furthermore, in this zlib-based multi-thread compression and decompression implementation, the file block information can be written into the extra field of the gz header. This integrates multi-thread compression and decompression more closely with the zlib library, with no need to create a separate index file for the record; multi-thread compression and decompression become simpler and more convenient to call, and the time spent reading and writing an index file is saved.
The multi-thread compression and decompression method for the general-purpose data gz format provided by the present invention uses a parallel multi-thread gz compression method that is separated from read/write IO operations. Reading, writing, compression and decompression are decoupled from one another, so the compression and decompression modes can be matched sensibly to the resources of the actual computing platform. The compression side and the decompression side may be on different machines, and read/write threads and compression/decompression threads are allocated as needed to get the most out of the machine. In addition, the multi-thread gz decompression method lets software use more CPUs for decompression, so that data can be fed in at a higher rate and a single program achieves higher compute utilization. At the same time, the invention takes software compatibility into account: the compressed data structure is specially designed, namely a gz format that stores size(gzDi) in the file header portion, so that the existing zlib and pigz methods can decompress it without modification. This greatly reduces users' concerns about version updates. In particular, where data producers and data users are separate, no version incompatibility arises, so the cost of adopting the method is very low; users need only switch software when they have a need for the corresponding decompression.
Referring to FIG. 3, an embodiment of the present invention provides a multi-thread compression and decompression device for the general-purpose data gz format, comprising a compression module 1 and a decompression module 2, wherein the compression module 1 comprises:
a blocking module 11, configured to input original data and divide the original data into blocks to obtain M data blocks, wherein each data block is denoted Di, i ∈ [0, M-1];
a compression module 12, configured to compress the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each piece of data gzDi;
a writing module 13, configured to write the M pieces of compressed data gzDi to disk in order, and to write the corresponding M values size(gzDi) into the preset space in order, to obtain compressed data;
wherein the decompression module 2 comprises:
a segmentation module 21, configured to input the compressed data, read the written list information of size(gzDi), and segment the compressed data according to that list information to obtain M data blocks gzDi;
a decompression module 22, configured to decompress the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
a concatenation module 23, configured to concatenate the decompressed original data Di according to the list information of size(gzDi) to obtain the complete original data.
As for fields of application, the gz compression format is used for archival storage of Internet text data, for network transmission, and for ordinary storage of FASTQ data (some network transmission uses bz2 compression). On software platforms with very large data volumes, multi-thread gz compression schemes are widely used: on Linux the pigz method is the main choice, while on other platforms such as Windows the zlib library is further developed to implement multi-thread compression. These are the application fields and usage patterns known at present.
In the field of high-throughput DNA sequencing, as bioinformatics computing develops further and high-performance computers are used for bioinformatics analysis, FASTQ data may be read many times for statistics or computation, so multi-thread decompression of the fastq.gz file format will also become an important potential field of application.
As cloud computing develops further, centralized processing and centralized storage of data will become more and more common; for both storage and transmission, multi-thread compression and decompression can be widely used.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (10)

  1. A multi-thread compression and decompression method for the general-purpose data gz format, characterized in that the method comprises a compression step S1 and a decompression step S2, wherein the compression step S1 comprises:
    Step S11: inputting original data and dividing the original data into blocks to obtain M data blocks;
    wherein each data block is denoted Di, i ∈ [0, M-1];
    Step S12: compressing the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M pieces of compressed data gzDi and the size size(gzDi) of each piece of data gzDi;
    Step S13: writing the M pieces of compressed data gzDi to disk in order, and writing the corresponding M values size(gzDi) into the preset space in order, to obtain compressed data;
    wherein the decompression step S2 comprises:
    Step S21: inputting the compressed data, reading the written list information of size(gzDi), and segmenting the compressed data according to that list information to obtain M data blocks gzDi;
    Step S22: decompressing the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M pieces of decompressed original data Di;
    Step S23: concatenating the decompressed original data Di according to the list information of size(gzDi) to obtain the complete original data.
  2. 如权利要求1所述的多线程压缩与解压方法,其特征在于,所述原始数据为多源数据。The multi-thread compression and decompression method according to claim 1, wherein the original data is multi-source data.
  3. 如权利要求1所述的多线程压缩与解压方法,其特征在于,所述利用预置的第一线程池中的N 1个线程分别压缩M份所述数据块,包括: Multithreading according to claim 1 compression and decompression method, wherein said first thread pool using N 1 of the preset compression threads are parts of the M data block, comprising:
    利用预置的第一线程池中的N 1个线程分别对应压缩M份所述数据块中的N 1个数据块,N 1个线程中的任一个线程对应的数据块压缩完毕后,继续利用所述线程处理剩余未压缩的数据块,直至M份所述数据块压缩完毕。 Preset by the first thread pool threads N 1 of the N 1 respectively corresponding to the compressed data blocks in the data block M parts, and after any one of the N 1 corresponding to the thread in the thread compressed data block is completed, continue to use The thread processes the remaining uncompressed data blocks until the M portions of the data blocks are compressed.
  4. The multi-thread compression and decompression method according to claim 1, wherein step S22 specifically comprises:
    using the N2 threads in the preset second thread pool to decompress N2 of the M data blocks gzDi, one block per thread; and, after any one of the N2 threads finishes decompressing its data block, continuing to use that thread to process the remaining undecompressed data blocks, until all M data blocks have been decompressed, to obtain M decompressed original data blocks Di.
  5. The multi-thread compression and decompression method according to any one of claims 1 to 4, wherein the number N1 of threads in the first thread pool is equal to the number N2 of threads in the second thread pool.
  6. A multi-thread compression and decompression device for the generic data gz format, wherein the device comprises a compression module 1 and a decompression module 2, and wherein the compression module 1 comprises:
    a blocking module 11, configured to input original data and divide the original data into blocks to obtain M data blocks;
    wherein each data block is denoted Di, i ∈ [0, M-1];
    a compression module 12, configured to compress the M data blocks using N1 threads in a preset first thread pool, reserving a preset space in the file header portion of the gz format during compression, to obtain M compressed data blocks gzDi and the size size(gzDi) of each data block gzDi;
    a writing module 13, configured to write the M compressed data blocks gzDi to disk in order, and to write the corresponding M sizes size(gzDi) in order into the preset space, to obtain compressed data;
    wherein the decompression module 2 comprises:
    a splitting module 21, configured to input the compressed data, read the written size(gzDi) list information, and split the compressed data according to the size(gzDi) list information to obtain M data blocks gzDi;
    a decompression module 22, configured to decompress the M data blocks gzDi using N2 threads in a preset second thread pool, to obtain M decompressed original data blocks Di;
    a concatenation module 23, configured to concatenate the decompressed original data blocks Di according to the size(gzDi) list information, to obtain the complete original data.
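The claims do not state which part of the gz file header holds the size list. One possibility, assumed here purely for illustration, is the optional extra field (FEXTRA) of the gzip header. The sketch below writes the size list into the extra field of an empty leading gzip member and reads it back, roughly as writing module 13 and splitting module 21 might. The subfield identifier `SZ`, the 4-byte size encoding, and the function names are all hypothetical.

```python
import gzip
import struct

SUBFIELD_ID = b"SZ"          # hypothetical subfield identifier, not a registered one
EMPTY_DEFLATE = b"\x03\x00"  # a final fixed-Huffman deflate block holding no data


def build_index_member(sizes):
    """Return an empty gzip member whose FEXTRA field stores the size list."""
    payload = struct.pack("<%dI" % len(sizes), *sizes)  # 4-byte little-endian sizes
    extra = SUBFIELD_ID + struct.pack("<H", len(payload)) + payload
    if len(extra) > 0xFFFF:
        raise ValueError("too many blocks for a single extra field")
    # Magic bytes, CM=deflate, FLG=FEXTRA, MTIME=0, XFL=0, OS=unknown.
    header = b"\x1f\x8b\x08\x04" + b"\x00" * 4 + b"\x00\xff"
    header += struct.pack("<H", len(extra)) + extra
    trailer = struct.pack("<II", 0, 0)  # CRC32 and ISIZE of empty data
    return header + EMPTY_DEFLATE + trailer


def write_container(blocks):
    """Compress each block as its own gzip member and prepend the index member."""
    gz_blocks = [gzip.compress(block) for block in blocks]
    return build_index_member([len(g) for g in gz_blocks]) + b"".join(gz_blocks)


def read_index(data):
    """Return (size list, offset of the first data block gzD0)."""
    if data[:4] != b"\x1f\x8b\x08\x04":
        raise ValueError("no index member found")
    (xlen,) = struct.unpack_from("<H", data, 10)
    extra = data[12:12 + xlen]
    if extra[:2] != SUBFIELD_ID:
        raise ValueError("unexpected extra subfield")
    (length,) = struct.unpack_from("<H", extra, 2)
    sizes = list(struct.unpack_from("<%dI" % (length // 4), extra, 4))
    first_block = 12 + xlen + len(EMPTY_DEFLATE) + 8  # header, body, trailer
    return sizes, first_block
```

Under these assumptions the output remains a sequence of ordinary gzip members, so a standard reader such as Python's `gzip.decompress` can still decode it serially, while a reader that understands the index can split the data and decompress the blocks in parallel. The extra field is limited to 65,535 bytes, which caps the number of blocks that this particular layout can index.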
  7. The multi-thread compression and decompression device according to claim 6, wherein the original data is multi-source data.
  8. The multi-thread compression and decompression device according to claim 6, wherein the compression module 12 is specifically configured to: use the N1 threads in the preset first thread pool to compress N1 of the M data blocks, one block per thread; after any one of the N1 threads finishes compressing its data block, continue to use that thread to process the remaining uncompressed data blocks, until all M data blocks have been compressed; and reserve a preset space in the file header portion of the gz format during compression, to obtain M compressed data blocks gzDi and the size size(gzDi) of each data block gzDi.
  9. The multi-thread compression and decompression device according to claim 6, wherein the decompression module 22 is specifically configured to: use the N2 threads in the preset second thread pool to decompress N2 of the M data blocks gzDi, one block per thread; and, after any one of the N2 threads finishes decompressing its data block, continue to use that thread to process the remaining undecompressed data blocks, until all M data blocks have been decompressed, to obtain M decompressed original data blocks Di.
  10. The multi-thread compression and decompression device according to any one of claims 6 to 9, wherein the number N1 of threads in the first thread pool is equal to the number N2 of threads in the second thread pool.
PCT/CN2017/117619 2017-12-21 2017-12-21 Multi-thread compression and decompression methods in generic data gz format, and device WO2019119336A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/117619 WO2019119336A1 (en) 2017-12-21 2017-12-21 Multi-thread compression and decompression methods in generic data gz format, and device

Publications (1)

Publication Number Publication Date
WO2019119336A1 true WO2019119336A1 (en) 2019-06-27

Family

ID=66994411

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/117619 WO2019119336A1 (en) 2017-12-21 2017-12-21 Multi-thread compression and decompression methods in generic data gz format, and device

Country Status (1)

Country Link
WO (1) WO2019119336A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110282849A1 (en) * 2007-08-23 2011-11-17 Thomson Reuters (Markets) Llc System and Method for Data Compression Using Compression Hardware
CN103384884A (en) * 2012-12-11 2013-11-06 华为技术有限公司 File compression method and device, file decompression method and device, and server
CN103516369A (en) * 2013-06-20 2014-01-15 易乐天 Method and system for self-adaptation data compression and decompression and storage device
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114064140A (en) * 2021-10-15 2022-02-18 南京南瑞继保电气有限公司 Fault recording data storage and access method and device and storage medium
CN114064140B (en) * 2021-10-15 2024-03-15 南京南瑞继保电气有限公司 Fault recording data storage and access method and device and storage medium

Similar Documents

Publication Publication Date Title
US9811424B2 (en) Optimizing restoration of deduplicated data
US9697228B2 (en) Secure relational file system with version control, deduplication, and error correction
US11681590B2 (en) File level recovery using virtual machine image level backup with selective compression
EP2898424B1 (en) System and method for managing deduplication using checkpoints in a file storage system
JP5732536B2 (en) System, method and non-transitory computer-readable storage medium for scalable reference management in a deduplication-based storage system
US8108446B1 (en) Methods and systems for managing deduplicated data using unilateral referencing
EP3376393B1 (en) Data storage method and apparatus
US11334255B2 (en) Method and device for data replication
CN108134609A (en) Multithreading compression and decompressing method and the device of a kind of conventional data gz forms
US11221992B2 (en) Storing data files in a file system
WO2016041401A1 (en) Method and device for writing data to cache
US9357007B2 (en) Controlling storing of data
US10509582B2 (en) System and method for data storage, transfer, synchronization, and security
CN101178726A (en) Method to efficiently use the disk space while unarchiving
US20160092131A1 (en) Storage system, storage system control method, and recording medium storing virtual tape device control program
CN104965835A (en) Method and apparatus for reading and writing files of a distributed file system
US20170192712A1 (en) Method and system for implementing high yield de-duplication for computing applications
US9678972B2 (en) Packing deduplicated data in a self-contained deduplicated repository
WO2021082926A1 (en) Data compression method and apparatus
WO2019119336A1 (en) Multi-thread compression and decompression methods in generic data gz format, and device
US11409604B1 (en) Storage optimization of pre-allocated units of storage
CN106708831B (en) FAT (file allocation table) image file processing method and device
US11113237B1 (en) Solid state cache index for a deduplicate storage system
Zhang et al. A Compatible LZMA ORC-Based Optimization for High Performance Big Data Load
US11436108B1 (en) File system agnostic content retrieval from backups using disk extents

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17935054

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 21/09/2020)

122 Ep: pct application non-entry in european phase

Ref document number: 17935054

Country of ref document: EP

Kind code of ref document: A1