CN111628779B - Parallel compression and decompression method and system for FASTQ file - Google Patents

Info

Publication number
CN111628779B
Authority
CN
China
Prior art keywords
data block
file
compression
offset
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010472611.1A
Other languages
Chinese (zh)
Other versions
CN111628779A (en)
Inventor
陈毓新
赵子健
李胜康
龚淳
黄志博
张勇
方林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BGI Shenzhen Co Ltd
Original Assignee
BGI Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BGI Shenzhen Co Ltd
Priority to CN202010472611.1A
Publication of CN111628779A
Application granted
Publication of CN111628779B
Legal status: Active (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The application relates to a method and a system for parallel compression and decompression of FASTQ files. The method comprises the following steps: S1, dividing a FASTQ file into a plurality of data blocks; S2, moving the file offsets of the head and the tail of each data block so that the head and the tail of adjacent data blocks match; S3, compressing the data blocks in parallel, wherein each data block is compressed independently by one working thread and the index information of the data block is recorded during compression; S4, decompressing the data blocks in parallel, wherein each data block is decompressed independently by one working thread based on the index information of the corresponding data block. By searching for the start of a short read sequence during compression and using the index information during decompression, the method solves the problem of thread blocking under the data-block scheme and achieves efficient parallel compression and decompression.

Description

Parallel compression and decompression method and system for FASTQ file
Technical Field
The application relates to a method and a system for parallel compression and decompression of FASTQ files, and belongs to the technical field of biological information data processing.
Background
Since the advent of DNA sequencing technology, the volume of sequencing data has grown continuously and at an increasing rate. As sequencing data accumulates, data management costs, including storage and transmission costs, keep rising, so sequencing data needs to be compressed to reduce cost. This is the purpose of sequencing data compression tools.
As a general genomic data storage format, the FASTQ file stores nucleic acid sequences and their corresponding quality scores. It represents each short read sequence in a unit of four lines, comprising an identifier, a base sequence, and quality values. The structure of a standard FASTQ file is as follows:
@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36
GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC
The first line starts with @ and holds the identifier of the short read sequence. The third line starts with + and either repeats the same identifier or contains nothing else. The second line is the base sequence, usually a string composed of the characters A, C, G, T, and N; other characters are rarely present. The fourth line holds the quality values; it has the same length as the second line and gives the sequencing confidence of each base. There are two quality value systems, starting from ! and @ respectively, and a common range is within 40 characters.
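As an illustration of the four-line structure described above, the following Python sketch parses records and checks the constraints the text mentions. The names `Read`, `parse_fastq`, and `phred33` are hypothetical, and the sketch assumes unwrapped four-line records and the quality system that starts from `!` (an offset of 33).

```python
from typing import Iterable, Iterator, List, NamedTuple


class Read(NamedTuple):
    identifier: str
    bases: str
    qualities: str


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Yield one Read per four-line record (assumes no line wrapping)."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            bases = next(it).rstrip("\n")
            plus = next(it).rstrip("\n")
            quals = next(it).rstrip("\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("record does not follow the @ / + line layout")
        if len(bases) != len(quals):
            raise ValueError("base and quality lines differ in length")
        yield Read(header[1:], bases, quals)


def phred33(qualities: str) -> List[int]:
    """Decode quality characters, assuming the system that starts from '!'."""
    return [ord(ch) - 33 for ch in qualities]


# Usage: the example record from the text, rebuilt line by line.
bases = "GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC"
quals = "I" * (len(bases) - 6) + "9IG9IC"
text = "\n".join([
    "@SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36",
    bases,
    "+",
    quals,
]) + "\n"
reads = list(parse_fastq(text.splitlines(keepends=True)))
```

The length check mirrors the rule that the fourth line has the same length as the second.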
The earliest approaches used general-purpose compression tools such as gzip and bzip2. Their compression was not ideal because they did not exploit the features of the FASTQ format; their only advantage was good performance.
Later, tools dedicated to compressing FASTQ files emerged, but at first their algorithmic strategies were not uniform. For example, SeqDB and G-SQZ both chose to encode base sequences together with quality values: SeqDB stores each combination of base and quality value in one byte, while G-SQZ treats each (base, quality value) pair as a symbol and applies zero-order Huffman coding after counting frequencies. Because this strategy compresses poorly, it was soon abandoned, and compressing the three information streams independently became the mainstream approach.
Many open-source tools inherit this framework, but their algorithms are not all alike. In detail, identifiers and quality values are basically entropy-coded, that is, redundancy is removed by exploiting the similarity within the information. Compression of the base sequence is more complicated: besides entropy coding, because reference genome sequences exist, many sequences can be aligned to the genome and replaced by their alignment information. Well-performing open-source tools of this kind include quip, fqzcomp, DSRC, and LFQC.
For file reading during compression, some tools read the file serially and do not support multithreading, such as quip and fastqz; some read the file serially but support multithreading with a fixed number of threads, because they split the work by task rather than by data, such as Fqzcomp and dualFqz; others read in parallel, including DSRC2, MFCompress, and GTZ. Parallel reading can match the number of CPUs in the actual running environment, so it is more efficient than the other two modes. For file writing during compression, most tools generate a single result file, while a few generate several (usually equal to the number of information streams split out); the former is the popular choice and is advantageous for convenience and file verification. For file reading during decompression, tools that compress with a single thread or by task partitioning read serially, whereas tools based on data partitioning read in parallel. For file writing during decompression, all tools write serially.
However, once parallel compression is used, the compression result is necessarily encoded and stored as data blocks, yet existing tools do not exploit the properties of data blocks when storing data to achieve an optimal decompression method. For example, the parallel decompression in publication CN103559020A uses a double buffer queue, and its overall decompression efficiency is limited by two factors: file reading and file writing. For reading, if a single thread reads the file and then distributes its contents to the working threads, the working threads incur waiting time. For writing, because the consistency of the file must be ensured, each thread has to write its decompressed data in turn, which causes serious thread blocking.
Disclosure of Invention
Aiming at the defects of the prior art, the application aims to provide a method and a system for parallel compression and decompression of FASTQ files, which effectively improve the efficiency of parallel compression and decompression of FASTQ files and solve the problem of work thread blocking in the compression and decompression process.
To achieve the above object, the present application provides a method for parallel compression and decompression of FASTQ files, including the following steps: S1, dividing a FASTQ file into a plurality of data blocks; S2, moving the file offsets of the head and the tail of each data block so that the head and the tail of adjacent data blocks match; S3, compressing the data blocks in parallel, wherein each data block is compressed independently by one working thread and the index information of each data block is recorded during compression; S4, decompressing the data blocks in parallel, wherein each data block is decompressed independently by one working thread based on the index information of that data block.
Further, the method for splitting the FASTQ file in S1 includes the following steps: S1.1, reading the first character of the first line of the FASTQ file; if it is not @, exiting the segmentation, and if it is @, proceeding to the next step; S1.2, dividing the file into a plurality of data blocks of equal size according to a preset value, wherein the last data block is smaller than or equal to the preset value, and after division each data block has a starting file offset and an ending file offset.
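Steps S1.1 and S1.2 can be sketched as follows. This is a minimal illustration rather than the patented implementation; the function names are hypothetical, and the file is represented by its size and its leading bytes to keep the example self-contained.

```python
from typing import List, Tuple


def looks_like_fastq(data: bytes) -> bool:
    """S1.1: the first character of the first line must be '@'."""
    return data[:1] == b"@"


def split_into_blocks(file_size: int, block_size: int) -> List[Tuple[int, int]]:
    """S1.2: return (start, end) byte offsets of equal-size blocks.

    The last block may be shorter than block_size (the preset value).
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [(start, min(start + block_size, file_size))
            for start in range(0, file_size, block_size)]


# Usage: a 10-byte file with a preset block size of 4 bytes.
blocks = split_into_blocks(file_size=10, block_size=4)
```

At this stage the offsets are raw byte positions; they may still fall in the middle of a record and are corrected in S2.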
Further, the method for moving the file offsets of the head and the tail of a data block in S2 includes the following steps: S2.1, starting from the initial file offset of a data block, reading characters one by one until a line L whose first character is @ appears, and recording the file offset a of the first character of line L; S2.2, judging whether the first character of line L+1 is @; if so, recording the file offset b of the first character of line L+1, and b is the target starting position; if not, a is the target starting position; S2.3, the file offset at the end of the data block is the starting file offset of the next data block.
Further, in S3, the parallel compression of the data blocks includes a plurality of cycles, each cycle including two steps: data block encoding and output. In the encoding step of the x-th cycle, data blocks $(x-1) \times N + 1, \dots, x \times N$ are allocated to the N working threads of the compression process.
Further, the output process is as follows: in the x-th cycle, the total file offset c after the output of the (x-1)-th cycle is obtained; c is the starting file offset of the x-th cycle. The sizes of the result code streams of the N working threads are $L_1, L_2, \dots, L_N$ respectively. The starting file offset of working thread i, which finishes encoding first, is $c$; the starting file offset of working thread j, which finishes encoding next, is $c + L_i$; and so on, until the starting file offset of working thread z, which finishes encoding last, is $c + L_1 + L_2 + \dots + L_N - L_z$.
Further, the index information in S3 includes compression result index information and original file index information, the compression result index information includes the size of the compression result of the data block and the offset of the compression result in the file; the original file index information includes an original size of the data block and an offset of the data block in the original file.
Further, before each data block is decompressed in S4, each working thread of the decompression process obtains the compression result index information and the original file index information. During parallel decompression, each working thread holds a handle $f_{in}$ pointing to the compressed file and a handle $f_{out}$ pointing to the decompression target file, where $f_{in}$ is used for reading the compressed file and $f_{out}$ is used for writing the target file.
Further, the decompression process is as follows: over a plurality of cycles, each working thread is assigned a data block, seeks directly to the specified position in the compressed file according to the compressed offset and compressed size of the data block, and reads content of the specified size; the working thread then seeks to the specified position in the decompression target file according to the original offset of the data block and writes the decompressed content. This process is decoupled among threads.
The application also discloses a system for parallel compression and decompression of FASTQ files, comprising: a data block segmentation module, used for segmenting the FASTQ file into a plurality of data blocks; a data block head and tail moving module, used for moving the file offsets of the head and the tail of each data block so that every data block consists of complete short read sequences; a compression module, used for compressing the data blocks in parallel according to the starting positions of the short read sequences, wherein each working thread of the compression process corresponds to one data block and the index information of each data block is recorded during compression; and a decompression module, used for decompressing the data blocks in parallel, wherein each working thread of the decompression process corresponds to one data block and decompresses the data in that block by reading the index information of the corresponding data block.
Due to the adoption of the technical scheme, the application has the following advantages:
1. In the compression part, the technical scheme of the application decouples the working threads in the file reading stage: each thread jumps to its own file position and starts compressing. In the decompression part, the working threads are decoupled in both the file reading stage and the file writing stage: each thread jumps to its own position in the compressed file to start decompressing, and then jumps to the corresponding position in the decompressed file to write. This effectively avoids the blocking caused by threads writing in sequence and makes compression and decompression of FASTQ file data more efficient and reliable.
2. In the compression part, the application provides parallel compression in which the output order may differ from the original file order, so that the data block of the working thread that finishes encoding first is output first. This further reduces waiting time during compression; especially for large data volumes processed over many cycles, it can effectively reduce compression time and improve the efficiency of the compression and decompression process.
Drawings
FIG. 1 is a schematic diagram of a method for parallel compression and decompression of FASTQ files in an embodiment of the present application;
FIG. 2 is a schematic diagram of a data block pattern of a FASTQ file in accordance with one embodiment of the present application;
FIG. 3 is a schematic diagram of a parallel compression process in which the output order of FASTQ files is consistent with the order of the original files in an embodiment of the present application;
FIG. 4 is a schematic diagram of a parallel compression process in which the output order of FASTQ files is not consistent with the original file order in an embodiment of the present application;
FIG. 5 is a schematic diagram of a parallel decompression process of FASTQ files in an embodiment of the present application.
Detailed Description
The present application is described in detail below with reference to specific examples so that those skilled in the art can better understand its technical direction. It should be understood, however, that the detailed description is provided only for a better understanding of the application and should not be taken to limit it. The terminology used in the description is for the purpose of description only and is not to be interpreted as indicating or implying relative importance.
Example 1
The application provides a method for compressing and decompressing FASTQ files in parallel, which is shown in figure 1 and comprises the following steps:
S1, segmenting the FASTQ file into a plurality of data blocks, wherein each data block contains complete short read sequences.
The data blocks are divided as follows: the FASTQ file to be compressed is split into several data blocks by size or by number of sequences. The data in each data block is encoded independently during compression. The specific encoding scheme, decoding scheme, and choice of encoder and decoder are not the main subject of the present application; any scheme that functions correctly and meets the encoding and decoding requirements of the present application can be used in this embodiment.
As shown in fig. 2, to achieve efficient parallel compression and decompression, the index information of each data block needs to be saved. The index information may be stored in the manner of mode 1 or mode 2 in fig. 2. In mode 1, the index information of all data blocks is gathered in one place in the file during compression, usually before or after all the data blocks, although it may be stored elsewhere in the file. During decompression, the main thread first reads this part of the data and uses it to distribute data blocks to the working threads of the decompression process. In mode 2, the index information of each data block is stored in the header of that data block during compression, so for each data block to be decompressed, this header is read first.
Index information is divided into two categories: compression result index information and original file index information.
The compression result index information includes the size of the compression result of the data block and the offset of the compression result in the file. The two can be converted into each other: the offset of a data block equals the size of the file header information plus the sizes of all preceding data blocks, so recording only one of the size and the offset is enough to realize the index function. If low complexity is the priority, only one of them may be kept; if performance is the priority, both may be recorded. With the compression result index recorded, a decompression working thread can quickly locate the target position, obtain the size of the target data block, and decompress it.
The original file index information includes the size of the original data of the data block and the offset of the data block in the original file. These two can likewise be converted into each other; if low complexity is the priority, only one of them may be kept, and if performance is the priority, both may be recorded. With the original file index recorded, the decompression working threads are decoupled when writing the file: each thread quickly locates its writing position in the decompressed result file through the original file index information and writes independently.
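The two kinds of index information, and the conversion between sizes and offsets, can be sketched as below. The `BlockIndex` record and both helper functions are hypothetical names, not a format defined by the application, and the sketch assumes the compressed blocks are stored in the same order as the original blocks, directly after a file header of known size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BlockIndex:
    comp_offset: int   # offset of the compression result in the compressed file
    comp_size: int     # size of the compression result
    orig_offset: int   # offset of the data block in the original file
    orig_size: int     # size of the original data


def offsets_from_sizes(sizes: List[int], header_size: int = 0) -> List[int]:
    """Offset of a block = header size + sizes of all preceding blocks."""
    offsets, position = [], header_size
    for size in sizes:
        offsets.append(position)
        position += size
    return offsets


def build_index(comp_sizes: List[int], orig_sizes: List[int],
                header_size: int) -> List[BlockIndex]:
    comp_offsets = offsets_from_sizes(comp_sizes, header_size)
    orig_offsets = offsets_from_sizes(orig_sizes)
    return [BlockIndex(co, cs, oo, osz)
            for co, cs, oo, osz in zip(comp_offsets, comp_sizes,
                                       orig_offsets, orig_sizes)]


# Usage: three blocks stored after an 8-byte file header.
index = build_index([10, 20, 5], [100, 100, 40], header_size=8)
```

Storing only the sizes saves space at the cost of this prefix-sum pass; storing both lets a thread seek without any computation.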
S2, shifting the file offsets of the head and the tail of each data block so that the head and the tail of adjacent data blocks match.
Since the original file is split into a plurality of data blocks and each data block is encoded independently, each data block must begin and end on a complete short read sequence. Therefore, after the file is split, the starting position of a short read sequence has to be searched for near each block boundary. A FASTQ file stores one short read sequence per four lines, that is, the basic structural unit of a FASTQ file is a short read sequence.
The method of shifting the file offsets of the head and the tail of each data block includes the following steps:
S2.1, reading the first character of the first line of the FASTQ file; if it is not @, exiting the search, and if it is @, proceeding to the next step;
S2.2, positioning to the starting position of the search according to the received file offset;
S2.3, reading characters one by one from the starting position until a line L beginning with @ appears, and recording the line number of line L and the position a in the file of the @ in line L;
S2.4, judging whether the first character of the next line is @; if so, recording the position b in the file of the @ in line L+1, and position b is the target starting position; if not, position a is the target starting position. The check is needed because a quality line can itself begin with @; in that case line L is a quality line and line L+1 is the identifier line of the next short read sequence.
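The search in S2.2 to S2.4 can be sketched as follows. The function names are hypothetical; the sketch assumes the file content is available as `bytes` with `\n` line endings, and it adds one guard that is not in the text: an `@` found on the final line is treated as a quality line, because an identifier line is always followed by three more lines.

```python
from typing import List, Tuple


def align_to_read_start(data: bytes, offset: int) -> int:
    """Return the offset of the first read identifier line at or after `offset`."""
    if offset <= 0:
        return 0
    # S2.2: move to the first line start at or after the raw offset.
    newline = data.find(b"\n", offset - 1)
    if newline == -1:
        return len(data)
    pos = newline + 1
    while pos < len(data):
        next_newline = data.find(b"\n", pos)
        if data[pos:pos + 1] == b"@":
            a = pos  # S2.3: line L starts with '@'
            if next_newline == -1 or next_newline + 1 >= len(data):
                return len(data)  # '@' on the final line: a quality line
            b = next_newline + 1  # S2.4: start of line L+1
            return b if data[b:b + 1] == b"@" else a
        if next_newline == -1:
            break
        pos = next_newline + 1
    return len(data)


def aligned_blocks(data: bytes, block_size: int) -> List[Tuple[int, int]]:
    """Align every raw block start; each block ends where the next one begins."""
    starts = [align_to_read_start(data, raw)
              for raw in range(0, len(data), block_size)]
    ends = starts[1:] + [len(data)]
    return [(s, e) for s, e in zip(starts, ends) if s < e]  # drop empty blocks


# Usage: three 16-byte records; the second one has a quality line starting with '@'.
data = (b"@r1\nACGT\n+\nIIII\n"
        b"@r2\nACGT\n+\n@III\n"
        b"@r3\nACGT\n+\nIIII\n")
blocks = aligned_blocks(data, block_size=20)
```

With a raw boundary at byte 20, the scan first meets the quality line `@III` at byte 27, sees that the following line also starts with `@`, and therefore selects byte 32.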
A file offset is the number of bytes moved forward or backward from a designated position, and is used to locate the required data in the file. File offsets are generally divided into three types: the first moves a number of bytes backward from the beginning of the file to find the target; the second moves a number of bytes forward from the end of the file; the third is relative, moving forward or backward from the current position in the file. All file offsets herein are of the first type.
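For readers who know the standard file APIs, the three offset types correspond to the three `whence` values of a seek call. The short Python example below uses an in-memory buffer as a stand-in for a file; it is only an illustration of the terminology.

```python
import io
import os

f = io.BytesIO(b"0123456789")

f.seek(3, os.SEEK_SET)    # type 1: counted from the beginning of the file
from_start = f.read(1)    # reads the byte at position 3; position is now 4

f.seek(2, os.SEEK_CUR)    # type 3: counted from the current position (4 + 2 = 6)
from_current = f.read(1)  # reads the byte at position 6

f.seek(-2, os.SEEK_END)   # type 2: counted from the end of the file (10 - 2 = 8)
from_end = f.read(1)      # reads the byte at position 8
```

Absolute offsets of the first type are what make an index usable by any thread, since they do not depend on where a given file handle currently points.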
S3, compressing the data blocks in parallel according to the starting positions of the short read sequences, wherein each working thread of the compression process corresponds to one data block and the index information of each data block is recorded during compression. Each working thread encodes one data block on its own. The encoding process differs between encoders; it is only required that encoding and decoding of the data blocks are independent, with no mutual dependence or influence.
First, the form in which each cycle waits for all threads to complete encoding is introduced.
The parallel compression of the data blocks includes a plurality of cycles, each cycle including two steps: data block encoding and output. In the encoding step, let the maximum number of working threads be Nmax and the number of idle working threads be N; in this form Nmax = N. Let the number of data blocks in the file to be compressed be M. During the encoding of the x-th cycle, data blocks $(x-1) \times N + 1, \dots, x \times N$ are allocated to the N working threads of the compression process, where $xN \le M$.
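The allocation rule for the x-th cycle can be written out as a small helper. The function names are hypothetical; block numbers are 1-based as in the text, and the final cycle is clipped to M so that it may contain fewer than N blocks.

```python
from typing import List


def blocks_for_cycle(x: int, n_threads: int, m_blocks: int) -> List[int]:
    """Data blocks (x-1)*N+1 .. x*N handled in cycle x, clipped to M."""
    first = (x - 1) * n_threads + 1
    last = min(x * n_threads, m_blocks)
    return list(range(first, last + 1))


def number_of_cycles(n_threads: int, m_blocks: int) -> int:
    """Cycles needed to cover M blocks with N threads (ceiling division)."""
    return (m_blocks + n_threads - 1) // n_threads


# Usage: M = 10 data blocks, N = 4 working threads.
schedule = [blocks_for_cycle(x, 4, 10)
            for x in range(1, number_of_cycles(4, 10) + 1)]
```

In this example the third cycle receives only blocks 9 and 10, so two of the four threads stay idle in that cycle.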
In the output step, because the contents of the data blocks differ, the time each working thread needs to encode its data block necessarily differs, so the order in which the working threads finish encoding differs from their original order; that is, the first working thread is not necessarily the first to finish encoding, and a data block that appears earlier in the original file is not necessarily encoded sooner. Therefore, depending on whether the result code streams of the data blocks are arranged in the order of the blocks in the original file, the output process is divided into two types: in the first type the output order is consistent with the original file order, and in the second type it is not.
For the first type of output, the output process, as shown in fig. 3, is as follows: in the x-th cycle, after the N working threads finish encoding, the size of the result code stream output by each working thread and the total file offset c after the output of the (x-1)-th cycle are obtained. With the sizes of the result code streams of the N working threads being $L_1, L_2, \dots, L_N$ respectively, the starting file offset of working thread 1 is $c$, the starting file offset of working thread 2 is $c + L_1$, and so on, and the starting file offset of working thread N is $c + L_1 + L_2 + \dots + L_{N-1}$. As long as each working thread has its own independent file handle, each working thread can move its file handle to the corresponding position in the output step and write independently. This reduces the waiting time of the working threads to some extent, but it cannot fundamentally eliminate that waiting. In addition, in this scheme it is not necessary to wait until all threads finish encoding before starting output: to reduce waiting time as much as possible, once the first n working threads have all finished encoding, their file offsets can be determined by the method above and they can output first. This approach suits the case where data blocks that appear earlier in the original file are encoded faster.
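A minimal sketch of the first output type is given below: once all code streams of a cycle are known, each thread opens its own file handle, seeks to its offset, and writes. The function names are hypothetical, the byte strings stand in for encoder output, and the temporary file exists only for the demonstration.

```python
import os
import tempfile
import threading
from typing import List, Tuple


def ordered_offsets(c: int, sizes: List[int]) -> Tuple[List[int], int]:
    """Thread k starts at c + L_1 + ... + L_(k-1); also return the new total offset."""
    offsets, position = [], c
    for size in sizes:
        offsets.append(position)
        position += size
    return offsets, position


def write_in_original_order(path: str, c: int, streams: List[bytes]) -> int:
    """Write one cycle's code streams in parallel, keeping the original block order."""
    offsets, new_c = ordered_offsets(c, [len(s) for s in streams])

    def output(offset: int, payload: bytes) -> None:
        with open(path, "r+b") as handle:  # independent handle per thread
            handle.seek(offset)
            handle.write(payload)

    threads = [threading.Thread(target=output, args=(o, s))
               for o, s in zip(offsets, streams)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return new_c


# Usage: two cycles written to a temporary file.
path = os.path.join(tempfile.mkdtemp(), "out.bin")
open(path, "wb").close()  # create the empty output file
c = write_in_original_order(path, 0, [b"AAA", b"BB", b"CCCC"])  # cycle 1
c = write_in_original_order(path, c, [b"D", b"EE"])             # cycle 2
with open(path, "rb") as fh:
    written = fh.read()
```

The writes themselves run concurrently, but no thread can start writing until every earlier stream in the cycle has a known size, which is the residual waiting the text describes.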
To eliminate the time a working thread waits after finishing a data block within a cycle, the second output method is needed. Its output process, as shown in fig. 4, is as follows: in the x-th cycle, the total file offset c after the output of the (x-1)-th cycle is obtained, and the sizes of the result code streams of the N working threads are $L_1, L_2, \dots, L_N$ respectively. The starting file offset of working thread i, which finishes encoding first, is $c$; the starting file offset of working thread j, which finishes encoding next, is $c + L_i$; and so on, until the starting file offset of working thread z, which finishes encoding last, is $c + L_1 + L_2 + \dots + L_N - L_z$. As shown in fig. 4, this method needs to additionally record the order in which the working threads finish encoding, and after all data blocks have been encoded and output, the output results are arranged according to the original file order. Compared with the output mode that preserves the original file order, this mode takes less time and gives higher compression efficiency. In addition, the order within each cycle may be recorded and the original file index information adjusted according to the output order of the current cycle; in the example of fig. 4, the original file index data would be recorded in the order of data blocks 3, 1, 4, 2.
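The offsets of the second output type depend only on the order of completion, as the following sketch shows. The function name and the sizes are made up for illustration; the completion order 3, 1, 4, 2 is the one mentioned for fig. 4.

```python
from typing import Dict, List, Tuple


def completion_order_offsets(c: int, sizes: Dict[int, int],
                             finish_order: List[int]) -> Tuple[Dict[int, int], int]:
    """Assign each block the current end of output at the moment it finishes.

    sizes maps block number -> code stream size; finish_order lists block
    numbers in the order in which their threads finished encoding.
    """
    offsets, position = {}, c
    for block in finish_order:
        offsets[block] = position
        position += sizes[block]
    return offsets, position


# Usage: four blocks finishing in the order 3, 1, 4, 2, starting at c = 100.
sizes = {1: 10, 2: 20, 3: 30, 4: 40}
offsets, new_c = completion_order_offsets(100, sizes, [3, 1, 4, 2])
```

The last block to finish (block 2) lands at c plus the sizes of all other blocks, which matches the expression $c + L_1 + L_2 + \dots + L_N - L_z$ with z = 2.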
In the last cycle, the number of working threads needed may be smaller than N, and tasks are allocated to working threads according to the data blocks that actually remain. In addition, the compressed file obtained through the compression process includes not only the data blocks but also the index information of the data blocks.
The above is the form in which each cycle waits for Nmax threads to complete encoding. Based on the second output method described above, a form that completely decouples all threads in the compression process can also be implemented, that is, an implementation with the number of idle threads N = 1.
In this form, the parallel compression of the data blocks comprises four steps: initial preparation, waiting for a thread to finish encoding, thread output, and data block allocation. In the initial preparation, the maximum number of working threads is set to Nmax, data blocks are allocated to the Nmax working threads, and encoding starts. As soon as a thread finishes encoding, it outputs its data block immediately after the end of the last compressed data block (or at the beginning of the output file if it is the first to finish), and the compression result index information and the original file index information of the current working thread are recorded (for use by the next working thread and by the decompression stage). After the working thread finishes its output, it is allocated the next data block to be compressed. This cycle repeats until all data blocks are compressed.
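The fully decoupled form can be sketched with a thread pool. This is one possible arrangement under stated assumptions, not the implementation of the application: `zlib` stands in for the unspecified FASTQ encoder, the function name and index layout are hypothetical, and a short lock is used only to reserve the output region, so the encoding and the write itself run without waiting on other threads.

```python
import os
import tempfile
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# (comp_offset, comp_size, orig_offset, orig_size), a hypothetical index layout
Entry = Tuple[int, int, int, int]


def compress_decoupled(in_path: str, out_path: str,
                       blocks: List[Tuple[int, int]], n_max: int = 4) -> Dict[int, Entry]:
    index: Dict[int, Entry] = {}
    lock = threading.Lock()
    end_of_output = 0
    open(out_path, "wb").close()  # create the empty output file

    def work(block_no: int, start: int, end: int) -> None:
        nonlocal end_of_output
        with open(in_path, "rb") as f_in:        # each thread reads on its own
            f_in.seek(start)
            payload = zlib.compress(f_in.read(end - start))
        with lock:                               # reserve space after the last block
            comp_offset = end_of_output
            end_of_output += len(payload)
            index[block_no] = (comp_offset, len(payload), start, end - start)
        with open(out_path, "r+b") as f_out:     # write without holding the lock
            f_out.seek(comp_offset)
            f_out.write(payload)

    # The pool hands the next pending block to whichever thread becomes free.
    with ThreadPoolExecutor(max_workers=n_max) as pool:
        futures = [pool.submit(work, i, s, e) for i, (s, e) in enumerate(blocks)]
        for future in futures:
            future.result()  # re-raise any worker exception
    return index


# Usage on a small temporary file; block boundaries are arbitrary here.
workdir = tempfile.mkdtemp()
src, dst = os.path.join(workdir, "in.fq"), os.path.join(workdir, "out.bin")
original = b"".join(b"@r%d\nACGT\n+\nIIII\n" % i for i in range(50))
with open(src, "wb") as fh:
    fh.write(original)
blocks = [(s, min(s + 100, len(original))) for s in range(0, len(original), 100)]
index = compress_decoupled(src, dst, blocks, n_max=3)
with open(dst, "rb") as fh:
    compressed = fh.read()
```

Because the blocks land in completion order, the recorded index is the only way to map a compressed region back to its place in the original file.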
S4, decompressing the data blocks in parallel, wherein each working thread of the decompression process corresponds to one data block and decompresses the data in that block by reading the index information of the corresponding data block.
Before each data block is decompressed, each working thread of the decompression process obtains the compression result index information and the original file index information. During parallel decompression, each working thread holds a handle $f_{in}$ pointing to the compressed file and a handle $f_{out}$ pointing to the decompression target file, where $f_{in}$ is used for reading the compressed file and $f_{out}$ is used for writing the target file. Because each working thread has its own handle $f_{in}$ to the compressed file, the working threads can read the compressed file in parallel, each processing only the data blocks assigned to it. Likewise, because each working thread has its own handle $f_{out}$ to the decompression target file and the file regions written by different working threads do not overlap, the working threads can write the target file in parallel.
As shown in fig. 5, each working thread of the decompression process runs a loop whose lower bound is 0 and whose upper bound is the number of data blocks in the whole compressed file. In each iteration, the sequence number of the data block is taken modulo the total number of working threads; if the result equals the sequence number of the working thread, the working thread processes that data block, and otherwise it continues with the next iteration. A working thread that owns a data block obtains the data block information directly from the data block sequence number, seeks directly to the specified position in the compressed file according to the compressed offset and compressed size in the data block information, reads content of the specified size, and decodes the compressed content to recover the original file content. The working thread then seeks to the specified position in the decompression target file according to the original offset of the data block and writes the original content.
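The decompression loop described above can be sketched as follows. As before, `zlib` is only a stand-in for the unspecified decoder, and the function name and the tuple layout of the index are hypothetical. Each thread owns its two handles and selects its blocks with the modulo test.

```python
import os
import tempfile
import threading
import zlib
from typing import List, Tuple

# (comp_offset, comp_size, orig_offset, orig_size), a hypothetical index layout
Entry = Tuple[int, int, int, int]


def decompress_parallel(comp_path: str, out_path: str,
                        index: List[Entry], n_threads: int) -> None:
    total = sum(entry[3] for entry in index)
    with open(out_path, "wb") as fh:
        fh.truncate(total)  # create the target file at its final size

    def work(thread_no: int) -> None:
        # Each thread owns its handles f_in and f_out.
        with open(comp_path, "rb") as f_in, open(out_path, "r+b") as f_out:
            for block_no, (co, cs, oo, _size) in enumerate(index):
                if block_no % n_threads != thread_no:
                    continue  # this block belongs to another thread
                f_in.seek(co)
                payload = f_in.read(cs)
                f_out.seek(oo)
                f_out.write(zlib.decompress(payload))

    threads = [threading.Thread(target=work, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Usage: build a compressed file whose blocks are stored out of order.
chunks = [b"@r1\nACGT\n+\nIIII\n", b"@r2\nGGCC\n+\n@III\n", b"@r3\nTTAA\n+\nIIII\n"]
orig_offsets = [sum(len(c) for c in chunks[:i]) for i in range(len(chunks))]
workdir = tempfile.mkdtemp()
comp_path = os.path.join(workdir, "reads.bin")
out_path = os.path.join(workdir, "reads.fq")
entries = {}
with open(comp_path, "wb") as fh:
    for block_no in (2, 0, 1):  # storage order differs from original order
        payload = zlib.compress(chunks[block_no])
        entries[block_no] = (fh.tell(), len(payload),
                             orig_offsets[block_no], len(chunks[block_no]))
        fh.write(payload)
index = [entries[i] for i in range(len(chunks))]
decompress_parallel(comp_path, out_path, index, n_threads=2)
with open(out_path, "rb") as fh:
    restored = fh.read()
```

Since the written regions never overlap, no lock is needed on the output file, and the restored file has the original order even though the compressed blocks were stored as 3, 1, 2.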
In the technical scheme of the embodiment, when in compression, unlike the mode that the whole input file can be read and data distribution can be carried out only through a single thread in the prior art, the working thread can be decoupled in a file reading stage by adopting a mode of parallel compression of a plurality of threads, and each thread jumps to a corresponding file position to start compression, so that the compression efficiency is obviously improved. When the part is decompressed, unlike the prior art that the whole input file can be read only through a single thread in decompression, the working thread can be decoupled in the file reading and writing stages according to the sequential writing mode, and each working thread jumps to the corresponding file position to start decompression and then jumps to the corresponding position of the decompressed file to perform writing, so that blocking caused by writing is effectively avoided.
Example two
Based on the same inventive concept, the present embodiment also discloses a parallel compression and decompression system for FASTQ files, including:
a data block segmentation module, configured to split the FASTQ file into a plurality of data blocks, each data block containing complete short reads;
a data block head-tail moving module, configured to move the file offsets of the head and tail of each data block so that the heads and tails of the data blocks are connected;
a compression module, configured to compress the data blocks in parallel, each data block being compressed independently by one worker thread, with index information of each data block recorded during compression;
a decompression module, configured to decompress the data blocks in parallel, each data block being decompressed independently by one worker thread, with index information of each data block recorded during decompression.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily appreciate variations or alternatives within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims (7)

1. A method for parallel compression and decompression of FASTQ files, comprising the steps of:
s1, dividing a FASTQ file into a plurality of data blocks;
s2, moving the file offset of the head and the tail of each data block to enable the head and the tail of each data block to be matched;
s3, compressing the data blocks in parallel, wherein each data block is independently compressed by a working thread, and the index information of each data block is recorded in the compression process;
the parallel compression of the data blocks in step S3 comprises a plurality of loops, each loop comprising two steps, data block encoding and output; in the data block encoding step of the x-th loop, data blocks $(x-1) \times N + 1, \ldots, x \times N$ are distributed to the N worker threads of the compression process, and each worker thread outputs its result code stream after completing its encoding;
s4, decompressing each data block in parallel, wherein each data block is independently decompressed by a working thread, and the index information of each data block is recorded in the decompression process;
the output process is as follows: in the x-th loop, the size of the result code stream output by each worker thread and the total file offset $c$ after the output of the (x-1)-th loop are obtained, where $c$ is the starting file offset of the x-th loop; with the sizes of the result code streams of the N worker threads denoted $L_1, L_2, \ldots, L_N$, the starting file offset of worker thread 1 is $c$, that of worker thread 2 is $c + L_1$, ..., and that of worker thread N is $c + L_1 + L_2 + \cdots + L_{N-1}$;
alternatively, the output process is: in the x-th loop, the total file offset $c$ after the output of the (x-1)-th loop is obtained, where $c$ is the starting file offset of the x-th loop; with the sizes of the result code streams of the N worker threads denoted $L_1, L_2, \ldots, L_N$, the starting file offset of worker thread $i$, which finishes encoding first, is $c$, that of worker thread $j$, which finishes encoding next, is $c + L_i$, ..., and that of worker thread $z$, which finishes encoding last, is $c + L_1 + L_2 + \cdots + L_N - L_z$.
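For illustration only, the two output-offset rules in claim 1 both reduce to a running sum over the code stream sizes of one loop. The sketch below is an assumption-laden reading of the claim, not part of it: the function names and the `finish_order` list (thread indices in the order they finished encoding) are hypothetical, and thread indices are 0-based here rather than 1-based.

```python
def start_offsets(c, sizes):
    """Thread-order rule: thread k starts at c + L_1 + ... + L_{k-1}.

    Returns the per-thread start offsets and the total file offset
    after this loop (the value of c for the next loop).
    """
    offsets, pos = [], c
    for size in sizes:
        offsets.append(pos)
        pos += size
    return offsets, pos


def start_offsets_by_completion(c, sizes, finish_order):
    """Completion-order rule: the first thread to finish starts at c,
    and each later finisher starts after all earlier finishers."""
    offsets = [0] * len(sizes)
    pos = c
    for tid in finish_order:
        offsets[tid] = pos
        pos += sizes[tid]
    return offsets, pos
```

Under the completion-order rule the blocks of one loop may land in the compressed file out of sequence, so a reader would have to rely on the recorded per-block index information to locate each block.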
2. The method for parallel compression and decompression of FASTQ files according to claim 1, wherein the method for splitting FASTQ files in S1 comprises the following steps:
s1.1, reading the first character of the first line of the FASTQ file, exiting segmentation if the first character is not @, and entering the next step if the first character is @;
s1.2, dividing the file into a plurality of data blocks with the same size according to the size of a preset value, wherein after dividing, each data block comprises file offset of the beginning and the end.
3. The method for parallel compression and decompression of FASTQ files according to claim 1, wherein the method for moving the file offset of the beginning and the end of the data block in S2 is as follows:
starting from the initial start file offset of a given data block, characters are read one by one until a line L whose first character is @ appears, and the file offset a of the first character of line L is recorded; it is then determined whether the first character of line L+1 is @; if so, the file offset b of the first character of line L+1 is recorded and b is taken as the target start position; if not, a is taken as the target start position; the end file offset of the data block is the start file offset of the next data block.
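The two-line check in claim 3 is needed because a FASTQ quality line may itself begin with @. A header line is followed by a sequence line, which does not begin with @, so if line L+1 also begins with @, line L must be a quality line and line L+1 is the real record header. The sketch below is a hedged illustration: the function name is hypothetical, it assumes four-line FASTQ records, it reads line by line rather than character by character, and its handling of a mid-line starting offset and of end of file is an assumption that the claim does not spell out.

```python
def adjust_block_start(path, offset):
    """Return the nearest record start at or after a tentative block offset."""
    if offset == 0:
        return 0
    with open(path, "rb") as f:
        # Assumption: if the offset falls mid-line, skip to the next line start.
        f.seek(offset - 1)
        if f.read(1) != b"\n":
            f.readline()
        while True:
            a = f.tell()
            line = f.readline()
            if not line:
                return a  # reached end of file without finding a record
            if line.startswith(b"@"):
                b = f.tell()
                nxt = f.readline()
                if not nxt:
                    return b  # line L was the final quality line
                # Line L+1 starting with @ means line L was a quality line.
                return b if nxt.startswith(b"@") else a
```

The end offset of each block then follows from the adjusted start offset of the next block, as the claim states.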
4. A method of parallel compression and decompression of FASTQ files according to any one of claims 1-3, wherein the index information in S3 includes compression result index information and original file index information, the compression result index information including the size of the compression result of the data block and the offset of the compression result in the file; the original file index information includes an original size of the data block and an offset of the data block in the original file.
5. The method for parallel compression and decompression of a FASTQ file according to claim 4, wherein each working thread of the decompression process acquires the compression result index information and original file index information before decompressing each data block in S4, and in the parallel decompression process each working thread includes a handle $f_{in}$ pointing to a compressed file and a handle $f_{out}$ pointing to the decompressed object file, wherein $f_{in}$ is for reading compressed files and $f_{out}$ is for writing to the target file.
6. The method for parallel compression and decompression of FASTQ files according to claim 5, wherein each of the worker threads of the decompression process starts a loop in which a modulus of the sequence number of the data block and the total number of worker threads is calculated, and if the modulus result is equal to the thread sequence number of the worker thread, the worker thread processes the data block, otherwise continues the loop calculation; the working thread directly acquires data block information through the sequence number of the data block, then directly shifts to the designated position of the compressed file according to the compression offset and the compression size in the data block information, and reads the content with the designated size; and the working thread shifts to the appointed position of the decompression target file according to the original offset of the data block, and writes the original file.
7. A system for parallel compression and decompression of FASTQ files, comprising:
the data block segmentation module is used for segmenting the FASTQ file into a plurality of data blocks;
the data block head-tail moving module is used for moving the file offset of the head and the tail of each data block so as to enable the head and the tail of each data block to be matched;
the compression module is used for compressing the data blocks in parallel, each data block is independently compressed by a working thread, and index information of each data block is recorded in the compression process;
the parallel compression of the data blocks by the compression module comprises a plurality of loops, each loop comprising two steps, data block encoding and output; in the data block encoding step of the x-th loop, data blocks $(x-1) \times N + 1, \ldots, x \times N$ are distributed to the N worker threads of the compression process, and each worker thread outputs its result code stream after completing its encoding;
the decompression module is used for decompressing each data block in parallel, each data block is independently decompressed by a working thread, and index information of each data block is recorded in the decompression process;
the output process is as follows: in the x-th loop, the size of the result code stream output by each worker thread and the total file offset $c$ after the output of the (x-1)-th loop are obtained, where $c$ is the starting file offset of the x-th loop; with the sizes of the result code streams of the N worker threads denoted $L_1, L_2, \ldots, L_N$, the starting file offset of worker thread 1 is $c$, that of worker thread 2 is $c + L_1$, ..., and that of worker thread N is $c + L_1 + L_2 + \cdots + L_{N-1}$;
alternatively, the output process is: in the x-th loop, the total file offset $c$ after the output of the (x-1)-th loop is obtained, where $c$ is the starting file offset of the x-th loop; with the sizes of the result code streams of the N worker threads denoted $L_1, L_2, \ldots, L_N$, the starting file offset of worker thread $i$, which finishes encoding first, is $c$, that of worker thread $j$, which finishes encoding next, is $c + L_i$, ..., and that of worker thread $z$, which finishes encoding last, is $c + L_1 + L_2 + \cdots + L_N - L_z$.
CN202010472611.1A 2020-05-29 2020-05-29 Parallel compression and decompression method and system for FASTQ file Active CN111628779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010472611.1A CN111628779B (en) 2020-05-29 2020-05-29 Parallel compression and decompression method and system for FASTQ file

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010472611.1A CN111628779B (en) 2020-05-29 2020-05-29 Parallel compression and decompression method and system for FASTQ file

Publications (2)

Publication Number Publication Date
CN111628779A CN111628779A (en) 2020-09-04
CN111628779B true CN111628779B (en) 2023-10-20

Family

ID=72260231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010472611.1A Active CN111628779B (en) 2020-05-29 2020-05-29 Parallel compression and decompression method and system for FASTQ file

Country Status (1)

Country Link
CN (1) CN111628779B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111984610A (en) * 2020-09-27 2020-11-24 苏州浪潮智能科技有限公司 Data compression method and device and computer readable storage medium
CN112860646B (en) * 2021-02-24 2022-12-02 上海泰宇信息技术股份有限公司 Method for distributed aggregate compression and unitary extraction of mass file files

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103427844A (en) * 2013-07-26 2013-12-04 华中科技大学 High-speed lossless data compression method based on GPU-CPU hybrid platform
CN103546160A (en) * 2013-09-22 2014-01-29 上海交通大学 Multi-reference-sequence based gene sequence stage compression method
CN106991134A (en) * 2017-03-13 2017-07-28 人和未来生物科技(长沙)有限公司 A kind of large data cloud storage method stored based on object
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN107851118A (en) * 2015-05-21 2018-03-27 基因福米卡数据系统有限公司 Storage, transmission and the compression of sequencing data of future generation
CN108134609A (en) * 2017-12-21 2018-06-08 深圳大学 Multithreading compression and decompressing method and the device of a kind of conventional data gz forms
CN108537007A (en) * 2017-03-04 2018-09-14 上海逐玛信息技术有限公司 A kind of access method for gene sequencing data
CN109582653A (en) * 2018-11-14 2019-04-05 网易(杭州)网络有限公司 Compression, decompression method and the equipment of file

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10090857B2 (en) * 2010-04-26 2018-10-02 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
KR101922129B1 (en) * 2011-12-05 2018-11-26 삼성전자주식회사 Method and apparatus for compressing and decompressing genetic information using next generation sequencing(NGS)

Also Published As

Publication number Publication date
CN111628779A (en) 2020-09-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant