CN103559020B - Parallel compression and decompression method for FASTQ files of DNA read sequence data - Google Patents

Parallel compression and decompression method for FASTQ files of DNA read sequence data

Info

Publication number
CN103559020B
CN103559020B (application CN201310551802.7A)
Authority
CN
China
Prior art keywords
compression
data
block
queue
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201310551802.7A
Other languages
Chinese (zh)
Other versions
CN103559020A (en)
Inventor
郑晶晶
王婷
张常有
詹科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Institute of Software Application Technology Guangzhou GZIS of CAS
Original Assignee
Institute of Software of CAS
Institute of Software Application Technology Guangzhou GZIS of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS, Institute of Software Application Technology Guangzhou GZIS of CAS filed Critical Institute of Software of CAS
Priority to CN201310551802.7A priority Critical patent/CN103559020B/en
Publication of CN103559020A publication Critical patent/CN103559020A/en
Application granted granted Critical
Publication of CN103559020B publication Critical patent/CN103559020B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

A parallel compression and parallel decompression method for FASTQ files of DNA read sequence data. Using circular double-buffer queues, circular double memory mapping and memory mapping combined with techniques such as data blocking, multi-threaded pipelined parallel compression/decompression and read-write order two-dimensional arrays, it realizes parallel compression and parallel decompression of FASTQ files across multiple processes and across the multiple threads within each process. It can be realized with MPI and OpenMP, or with MPI and Pthread. The present invention fully exploits the powerful computing capability of the multi-core CPUs in each computing node, and can overcome the processor and memory limitations suffered by serial compression and decompression programs.

Description

Parallel compression and decompression method for FASTQ files of DNA read sequence data
Technical field
The present invention relates to bioinformatics, data compression and high-performance computing, and in particular to a parallel compression and parallel decompression method for FASTQ files of DNA read sequence data.
Background art
One of the main tasks of bioinformatics is to collect and analyze large amounts of gene data. These data are essential to gene research, helping to identify the genes that prevent or cause disease and to develop targeted therapies. High-throughput sequencing methods and equipment produce massive short-read data. The common way to store, manage and transmit DNA read data is the FASTQ file format, which mainly contains the DNA read data, its annotation information and, for each DNA base, the corresponding Quality Score information representing the uncertainty of the sequencing and base-calling process. A FASTQ file also contains read labels and other descriptions such as the device name. Compared with other DNA storage formats (such as FASTA), the FASTQ format stores more information, but this also sharply increases file size and storage space. Effective lossless compression and decompression of base read data and its annotation information is currently a research hotspot.
For the compression of FASTQ file data, important progress has been made by the G-SQZ algorithm of the Translational Genomics Research Institute (TGen) in the United States (Tembe W. et al. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 2010; 26: 2192-2194) and the DSRC algorithm of Deorowicz Sebastian et al. (Deorowicz S., Grabowski S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 2011; 27: 860-862). Both algorithms use an index system that allows access at regular intervals (blocks for short), so that the required information can be decoded without starting from the beginning. G-SQZ mainly applies Huffman coding to <base, Quality Score> pairs, while DSRC encodes the base columns and Quality Score columns separately with Huffman coding supplemented by other fine-grained compression (such as run-length processing). The advantage of this class of methods is that part of the data can be decoded for random access while the relative order of the records is preserved, and the lossless compression efficiency is high. They represent a class of FASTQ file compression methods called, for convenience of exposition, block-indexed serial algorithms below.
The data to be analyzed in genome sequencing currently reaches the TB order of magnitude, and large sequencing centers are planning or installing PB-scale storage. For these massive data, to reduce storage space and transmission time and to enable real-time analysis of large numbers of gene sequences, real-time compression and decompression is required, which calls for the powerful computing capability of high-performance computing platforms. With the rapid development of such platforms, fully exploiting the computing power of the multi-core CPUs on each computing node to compress and decompress massive FASTQ files in real time can overcome the processor and memory limitations suffered by serial compression and decompression programs.
The G-SQZ and DSRC algorithms above are both serial; to date, no research article or patent has appeared on a parallel algorithm for this class based on the multi-core CPUs of multiple nodes.
Summary of the invention
Given that no research or patent has yet appeared on parallel algorithms for block-indexed serial algorithms such as G-SQZ and DSRC above, the object of the present invention is to provide parallel compression and decompression methods corresponding to this class of block-indexed serial compression/decompression algorithms for FASTQ files. Using many computing nodes and multi-core CPUs, the methods can be realized with MPI+OpenMP or with MPI+Pthread; they can fully exploit the powerful computing capability of high-performance computing platforms, significantly increase the speed of real-time analysis and processing of massive genome sequences, and provide an important technical foundation for broader applications of gene data.
The technical scheme is as follows.
A parallel compression method for FASTQ files of DNA read sequence data comprises the following steps:
One. Task partitioning across parallel compression processes
The start and end positions of the data to be processed by each process are determined according to the FASTQ file size, the number of parallel compression processes, and the characteristics of the read fragments in the FASTQ file (each fragment comprises base information and the corresponding annotation information; below, for convenience, it is called a record). Every process runs the task partitioning module, which assigns the raw data to be compressed approximately evenly across the processes to realize data parallelism. The processes therefore incur no communication cost among themselves while processing, which improves data-parallel efficiency. Each process produces an independent compressed file, and the order of the compressed data follows the process number. A sketch of this partitioning is given below.
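The following C fragment, using MPI, shows one way a rank might derive its byte range and snap it to a record boundary; next_record() is a simplified stand-in for the record-boundary scan (real FASTQ parsing must also rule out quality lines that happen to begin with '@'), and all names are illustrative rather than taken from the patent.

#include <mpi.h>
#include <stdint.h>

/* Advance pos to the next line that starts with '@' (simplified record
 * boundary test; see the caveat in the lead-in above). */
static int64_t next_record(const char *buf, int64_t pos, int64_t size)
{
    while (pos < size &&
           !(buf[pos] == '@' && (pos == 0 || buf[pos - 1] == '\n')))
        pos++;
    return pos;
}

/* Give each rank an approximately even, record-aligned byte range. */
void partition(const char *buf, int64_t size, int64_t *begin, int64_t *end)
{
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int64_t chunk = size / nprocs;
    *begin = (rank == 0)          ? 0    : next_record(buf, rank * chunk, size);
    *end   = (rank == nprocs - 1) ? size : next_record(buf, (rank + 1) * chunk, size);
}

Because adjacent ranks run the same deterministic boundary scan, rank i's end equals rank i+1's begin and no record is assigned twice.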
Two. Multi-threaded pipelined parallel compression within each compression process
The process processing module comprises one raw-data read thread, one compressed-data write thread and multiple compression worker threads; the number of worker threads can be set according to the CPU core count of the hardware and the process configuration.
The data handled by each process is divided by the raw-data read thread into multiple blocks, each containing a fixed number of records (the last block may contain fewer).
Each worker thread owns two circular double-buffer queues: a raw-data circular double-buffer queue and a compressed-data circular double-buffer queue. The two have similar structures; the buffer layout differs slightly according to the stored data and is detailed in the detailed description below. Each raw-data circular double-buffer queue comprises two queues: an empty-block buffer queue and a raw-data block queue. Each compressed-data circular double-buffer queue likewise comprises two queues: an empty-block buffer queue and a compressed-data block queue. Both circular double-buffer queues are processed in the same way.
Taking the raw-data circular double-buffer queue as an example, its processing is described in detail below (a code sketch of the queue follows the list):
(1) Initialization of the raw-data circular double-buffer queue: the empty-block buffer queue is instantiated with a certain number of empty block buffers, and the raw-data block queue is empty.
(2) The raw-data read thread reads a raw data block.
(3) An empty block buffer is obtained at the head of the empty-block buffer queue.
(4) The obtained empty block buffer is filled with the raw data block.
(5) The filled raw data block is placed at the tail of the raw-data block queue.
(6) A compression worker thread obtains the raw data block buffer at the head of the raw-data block queue and compresses its data.
(7) The raw data block buffer is emptied and placed back into the empty-block buffer queue.
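A minimal sketch of one such circular double-buffer queue, assuming a pthread-based implementation (the patent permits Pthread or OpenMP); the structure and names below are illustrative, not the patent's own code.

#include <pthread.h>

typedef struct block { struct block *next; /* payload fields omitted */ } block_t;

typedef struct {
    block_t        *head, *tail;
    pthread_mutex_t lock;
    pthread_cond_t  nonempty;
} queue_t;

static void q_push(queue_t *q, block_t *b)
{
    pthread_mutex_lock(&q->lock);
    b->next = NULL;
    if (q->tail) q->tail->next = b; else q->head = b;
    q->tail = b;
    pthread_cond_signal(&q->nonempty);
    pthread_mutex_unlock(&q->lock);
}

static block_t *q_pop(queue_t *q)   /* waits until a buffer is available */
{
    pthread_mutex_lock(&q->lock);
    while (!q->head)
        pthread_cond_wait(&q->nonempty, &q->lock);
    block_t *b = q->head;
    q->head = b->next;
    if (!q->head) q->tail = NULL;
    pthread_mutex_unlock(&q->lock);
    return b;
}

/* A circular double-buffer queue pairs two such queues: the producer pops
 * an empty buffer from `empty`, fills it, and pushes it to `full`; the
 * consumer pops from `full`, drains the buffer, and returns it to `empty`. */
typedef struct { queue_t empty, full; } dbq_t;

The steps above describe polling (a thread retries while the queue head is empty); the condition variable here is simply one common way to avoid the busy wait.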
Within each process, the parallel compression pipeline operates in units of raw data blocks; the concrete pipelined parallel flow is as follows:
(1) The raw-data read thread continually parses and reads raw data blocks according to the record data characteristics, cyclically searches the raw-data circular double-buffer queue of each compression worker thread for an empty block buffer, and on finding one fills it with the raw data block and releases the buffer to the tail of the raw-data block queue of that circular double-buffer queue.
(2) Each compression worker thread continually obtains raw data blocks from the head of the raw-data block queue of its raw-data circular double-buffer queue and compresses them.
(3) Each compression worker thread continually fills the compressed block data into an empty block buffer obtained from its compressed-data circular double-buffer queue, and releases the buffer to the tail of the compressed-data block queue of that circular double-buffer queue.
(4) The compressed-data write thread continually looks up, in ascending block-number order, the thread number holding each finished compressed block, obtains the block's compressed data at the head of the compressed-data block queue of that thread's compressed-data circular double-buffer queue, and writes it to the final compressed file.
The specific algorithm and termination condition of each thread above are given in the detailed description.
In the raw-data read thread, memory mapping combined with FASTQ data blocking is adopted to improve the reading speed of large data files. According to the blocking of the read fragments, the memory page size and the size of the mapped space, the position of each block within the memory-mapped space is computed, and the memory-mapped space is released and remapped when necessary. A clear benefit of memory mapping is that the process can read and write the memory directly, with essentially no extra data copies: file I/O such as fread/fwrite requires four data copies between kernel space and user space, whereas memory mapping needs only two, one from the input file into the memory-mapped region and one from the memory-mapped region to the output file. In effect, the process can operate on an ordinary file as if it were accessing memory. This technique is detailed in the detailed description.
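A minimal sketch of the windowed, page-aligned read mapping, assuming POSIX mmap; the field names and window policy are illustrative.

#include <sys/mman.h>
#include <unistd.h>
#include <stdint.h>

typedef struct {
    int     fd;      /* FASTQ file descriptor               */
    char   *buf;     /* current mapped window, or NULL      */
    size_t  len;     /* window length                       */
    int64_t base;    /* file offset of buf[0], page aligned */
} map_win_t;

/* (Re)map a window so that file offset `want` lies inside it. */
int win_remap(map_win_t *w, int64_t want, size_t win_size, int64_t file_size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (w->buf) munmap(w->buf, w->len);      /* release the previous mapping */
    w->base = want - (want % page);          /* align start to a page boundary */
    int64_t len = (w->base + (int64_t)win_size > file_size)
                ? file_size - w->base : (int64_t)win_size;
    w->len = (size_t)len;
    w->buf = mmap(NULL, w->len, PROT_READ, MAP_PRIVATE, w->fd, (off_t)w->base);
    return w->buf == MAP_FAILED ? -1 : 0;
}

The read thread then addresses a record at file offset p as w.buf[p - w.base], remapping when the remaining window is too short for the next block, as in steps (13) and (14) of the read thread later in this description.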
In the compressed-data write thread, the order in which the compressed blocks are written to the final compressed file must be identical to the order in which the raw-data read thread read the raw data blocks; a read-write order two-dimensional array is used here. The first dimension of the array represents the block number; the second dimension has size 2 and records, for each block, the assigned thread number and a compression-finished flag.
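A minimal sketch of this table and of the writer's ordered scan; fetch_block() and write_block() are hypothetical helpers standing in for the queue operations described above.

enum { MAX_BLOCKS = 1 << 20 };          /* illustrative bound             */
volatile int order[MAX_BLOCKS][2];      /* [b][0] = thread, [b][1] = done */

void *fetch_block(int worker_thread);   /* hypothetical: pop that worker's queue */
void  write_block(const void *blk);     /* hypothetical: append to output file   */

void writer_loop(int total_blocks)
{
    for (int b = 0; b < total_blocks; ++b) {
        while (!order[b][1])            /* wait for block b's finished flag */
            ;                           /* (a real thread would yield here) */
        write_block(fetch_block(order[b][0]));
    }
}

Since the writer consumes blocks strictly in ascending block number, the output file reproduces the original block order no matter how the workers interleave.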
The compressed file obtained by each process begins with a file header containing configuration information such as the record count per block, followed by the compressed data of each block in raw block order. The file ends with file tail data containing the location index of each block's compressed data within the file, the block count, and the position of the file tail within the whole file. This information supports parallel decoding and, for random access to a specific block, lets only part of the data be decoded instead of the whole file.
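A minimal sketch of this layout as C structures; the field names and widths are assumptions, since the patent fixes only what information appears where.

#include <stdint.h>

typedef struct {
    uint32_t records_per_block;   /* header: block configuration       */
} file_header_t;

typedef struct {
    uint64_t n_blocks;            /* total number of blocks            */
    uint64_t tail_offset;         /* where the tail starts in the file */
    /* preceded in the file by uint64_t block_offset[n_blocks], the    */
    /* location index of each block's compressed data                  */
} file_tail_t;

A decompressor can thus seek to the end of the file, read the tail, and jump straight to any block's compressed data.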
A parallel decompression method for FASTQ files of DNA read sequence data comprises the following steps:
One. Determining the compressed file each process handles according to its process number
A FASTQ file to be compressed yields as many compressed files as there are parallel compression processes. During decompression, the number of parallel decompression processes is set according to the number of compressed files, and the order of the decompressed output of each decompression process is determined by the order of the compressed files. The decompression processes incur no communication cost among themselves while processing, which improves data-parallel efficiency.
Two. Reading the compressed file tail to obtain the block configuration, block index and block count
Unlike the parallel compression method, the parallel decompression method begins in each process by obtaining from the tail of the compressed file the configuration of the record count per block, the index of each block's position within the compressed file, and the block count; this information is what distinguishes the parallel decompression method from the parallel compression method.
Three. Multi-threaded pipelined parallel decompression within each decompression process
Similar to the parallel compression method, the decompression process processing module comprises one compressed-data read thread, one decompressed-data write thread and multiple decompression worker threads; the number of worker threads can be set according to the CPU core count of the hardware and the process configuration.
Each decompression worker thread owns two circular double-buffer queues: a compressed-data circular double-buffer queue and a decompressed-data circular double-buffer queue. The two have similar structures; the buffer layout differs slightly according to the stored data and is detailed in the detailed description. Each compressed-data circular double-buffer queue comprises two queues: an empty-block buffer queue and a compressed-data block queue. Each decompressed-data circular double-buffer queue likewise comprises an empty-block buffer queue and a decompressed-data block queue. Both circular double-buffer queues are processed exactly as the raw-data circular double-buffer queue in the parallel compression method above, so the details are not repeated.
Within each process, the parallel decompression pipeline operates in units of compressed read-sequence data blocks; the concrete pipelined parallel flow is as follows:
(1) According to the compressed-block location index obtained from the file tail, the compressed-data read thread continually reads compressed blocks of known compressed size in ascending block-number order, cyclically searches the compressed-data circular double-buffer queue of each decompression worker thread for an empty block buffer at the queue head, and on finding one fills it with the compressed block data and releases the buffer to the tail of the compressed-data block queue of that circular double-buffer queue.
(2) Each decompression worker thread continually obtains compressed data blocks from the head of the compressed-data block queue of its compressed-data circular double-buffer queue and decompresses them.
(3) Each decompression worker thread continually fills the decompressed block data into an empty block buffer obtained from its decompressed-data circular double-buffer queue, and releases the buffer to the tail of the decompressed-data block queue of that circular double-buffer queue.
(4) The decompressed-data write thread continually looks up, in ascending block-number order, the thread number holding each finished decompressed block, obtains the block's decompressed data at the head of the decompressed-data block queue of that thread's decompressed-data circular double-buffer queue, and writes it to the final decompressed file.
The specific algorithm and termination condition of each thread above are given in the detailed description.
The compressed-data read thread adopts circular double memory mapping combined with data blocking to improve the reading speed of large data files. The key technique is the circular double memory mapping, which lets the decompression worker threads read compressed data for decompression in parallel with the memory mapping performed by the compressed-data read thread. There are two memory mappings, memory mapping 1 and memory mapping 2, into which the compressed blocks are placed cyclically in block order. From the compressed-block location index at the tail of the compressed file and the sizes of the two memory-mapped spaces, the memory-mapped buffer holding each compressed block and the block's position within that buffer are computed in ascending block-number order. The decompression worker threads use the circular double memory-mapped regions in the compressed-data circular double-buffer queue directly, reducing the number of data copies. For a memory mapping currently in use, remapping must wait until all decompression worker threads have finished using the data of its previous mapping. This technique is detailed in the detailed description; a synchronization sketch follows.
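A minimal sketch, assuming pthreads and POSIX mmap, of the synchronization this requires: each mapping keeps a count of blocks handed out but not yet consumed, and the reader blocks in remap_half() until the count drains. All names are illustrative.

#include <pthread.h>
#include <sys/mman.h>

typedef struct {
    char           *buf;
    size_t          len;
    int             pending;      /* blocks handed out, not yet released */
    pthread_mutex_t lock;
    pthread_cond_t  drained;
} map_half_t;

static map_half_t half[2];        /* memory mapping 1 and memory mapping 2 */

void release_block(int h)         /* called by a decompression worker
                                     when it finishes a block in half h */
{
    pthread_mutex_lock(&half[h].lock);
    if (--half[h].pending == 0)
        pthread_cond_signal(&half[h].drained);
    pthread_mutex_unlock(&half[h].lock);
}

void remap_half(int h, int fd, off_t base, size_t len)
{                                 /* called by the compressed-data reader */
    pthread_mutex_lock(&half[h].lock);
    while (half[h].pending > 0)   /* wait until the previous mapping's
                                     blocks are all consumed */
        pthread_cond_wait(&half[h].drained, &half[h].lock);
    if (half[h].buf) munmap(half[h].buf, half[h].len);
    half[h].buf = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, base);
    half[h].len = len;
    pthread_mutex_unlock(&half[h].lock);
}

While one half is being waited on and remapped, the workers keep decompressing blocks from the other half, which is what lets reading and decompression overlap.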
In the decompressed-data write thread, the order in which the decompressed blocks are written to the final decompressed file must be identical to the order in which the compressed-data read thread read the compressed blocks. As in the parallel compression method, an identical read-write order two-dimensional array is used here to record each block's assigned thread number and decompression-finished flag.
To improve I/O speed, the decompressed-data write thread also uses memory mapping combined with data blocking: a memory-mapped file of the appropriate size is created according to the number of blocks to decompress, and the decompressed data blocks are placed into the memory-mapped space sequentially in ascending block-number order. During this, the mapping is re-established as needed according to the write position, the size of the memory-mapped space and a remap threshold, and the remap threshold is adjusted. Details are given in the detailed description.
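A minimal sketch, assuming POSIX calls, of this write-side mapping: the output file is pre-sized and written through a mapped window that slides forward once the remap threshold is crossed. It assumes the threshold keeps the window start page aligned (the patent recomputes the alignment and threshold explicitly).

#include <sys/types.h>
#include <sys/mman.h>
#include <string.h>

typedef struct {
    int    fd;          /* decompressed output file       */
    char  *buf;         /* current mapped window          */
    size_t len;         /* window size                    */
    size_t used;        /* bytes written into this window */
    size_t threshold;   /* remap threshold                */
    off_t  file_off;    /* file offset of buf[0]          */
} out_map_t;

void out_write(out_map_t *o, const void *data, size_t n)
{
    memcpy(o->buf + o->used, data, n);        /* write through the mapping */
    o->used += n;
    if (o->used >= o->threshold) {            /* slide the window forward  */
        munmap(o->buf, o->len);
        o->file_off += (off_t)o->used;
        o->buf  = mmap(NULL, o->len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       o->fd, o->file_off);
        o->used = 0;
    }
}

The file itself would be created once with open() plus ftruncate() to the full qwFileSize, matching step (4) of the decompressed-data write thread below.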
For compression and decompression of FASTQ files of DNA read sequence data, the important progress of recent years has been the block-indexed serial algorithms G-SQZ and DSRC. No research article or patent on parallel algorithms of this class has yet appeared. As the data for genome sequence analysis reaches TB and even PB scale, real-time compression and decompression is needed to facilitate real-time analysis of large numbers of gene sequences, which requires the powerful computing capability of high-performance computing platforms. Studying parallel compression and decompression methods for the block-indexed serial algorithms G-SQZ and DSRC is therefore significant.
The present invention is the first to propose parallel compression and parallel decompression methods corresponding to the block-indexed serial compression/decompression algorithms G-SQZ and DSRC. Using circular double-buffer queues, circular double memory mapping and memory mapping combined with techniques such as data blocking, multi-threaded pipelined parallel compression and decompression, and read-write order two-dimensional arrays, it realizes parallel compression and parallel decompression of FASTQ files across multiple processes and across the multiple threads within each process.
The advantages of the present invention are:
(1) The present invention fully exploits the computing power of the multi-core CPUs in each computing node, and can overcome the processor and memory limitations suffered by serial compression/decompression programs. The parallel compression and parallel decompression methods are flexible to implement: they can be realized with MPI and OpenMP, or with MPI and Pthread.
(2) Because the present invention accommodates any block-wise compression and decompression algorithm, the parallel compression and decompression methods are not limited to parallelizing the two serial algorithms G-SQZ and DSRC: as long as a serial compression/decompression algorithm has the two features of blocking and indexing, these methods apply to its parallelization.
(3) The present invention fully exploits the powerful computing capability of high-performance computing platforms and can significantly increase the speed of real-time analysis and processing of massive genome sequences, providing an important technical foundation for broader applications of gene data.
Description of the drawings
Fig. 1 shows the multi-threaded pipelined parallel compression within a process in the parallel compression method of the present invention;
Fig. 2 shows the raw-data circular double-buffer queue of the present invention;
Fig. 3 shows the compressed data block buffer in the parallel compression method of the present invention;
Fig. 4 shows the multi-threaded pipelined parallel decompression within a process in the parallel decompression method of the present invention;
Fig. 5 shows the relation between the circular double memory mapping and the compressed data block buffer in the parallel decompression method of the present invention;
Fig. 6 shows the coordination of the circular double memory mapping between the read thread and the decompression worker threads in the parallel decompression method of the present invention;
In the figures: 1. memory mapping 1; 2. memory mapping 2; 3. time axis; 4. memory-mapped region pointer; 5. block start within the memory-mapped region; 6. compressed data block length; 7. compressed data block number.
Detailed description of the invention
The present invention provides a parallel compression/decompression method for FASTQ files of DNA read sequence data. To make the purpose, technical scheme and effect of the present invention clearer, the invention is described in more detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
The raw-data read thread in the parallel compression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Open the original DNA read sequence FASTQ file to be compressed.
(2) Obtain the page size of the file system of the machine on which the program runs.
(3) Set the size of the memory-mapped space according to the page size.
(4) According to the range of raw data assigned to the current process by the task partitioning module, set the mapping start (which must be aligned to a memory page boundary according to the page size) and the mapping size, and perform the memory mapping.
(5) Record the start position within the memory mapping of the first raw data block this process will compress.
(6) Cyclically search the raw-data circular double-buffer queue of each compression worker thread for an empty block buffer.
(7) If an empty block buffer exists, go to (8); otherwise go to (6).
(8) From the memory-mapped region, sequentially read a number of records at record granularity to form a raw data block; fill the empty block buffer, increment the block number, and record the block's record count. If the mapping end is reached, go to (9).
(9) Release this buffer to the tail of the raw-data block queue of the circular double-buffer queue.
(10) Set the block's assigned thread number in the read-write order two-dimensional array.
(11) If the end of the data assigned to this process has been reached, go to (15); otherwise go to (12).
(12) From the current read position within the mapping, compute the length of the raw data block just read and set the start position of the next block within the memory-mapped region.
(13) If the distance from the current read position to the mapping end is less than 1.5 times the length of the block just read, and the end of the file has not been reached, go to (14); otherwise go to (8).
(14) Release the previous memory mapping, compute the next mapping start and size from the file position at which reading continues (the start of the next block), and re-establish the memory mapping. Go to (8).
(15) Release the memory mapping and set the read-thread finished flag.
(16) The raw-data read thread ends.
The raw data block buffer in the raw-data circular double-buffer queue used in the steps above is shown in Fig. 3. The buffer is realized as a structure with three fields: a block data pointer, the block number and the record count of the block. The block data pointer points to a record array, each element of which points to a record object holding all data of the four FASTQ parts (title part, DNA sequence part, "+" part, Quality Score part), namely: for the title part, a title data pointer (pointing to the title data buffer), the title data length and the title reserved-space length; for the DNA sequence part, a sequence data pointer (pointing to the sequence data buffer), the sequence data length and the sequence reserved-space length; for the "+" part, a plus data pointer (pointing to the plus data buffer), the plus data length and the plus reserved-space length; for the Quality Score part, a Quality data pointer (pointing to the Quality data buffer), the Quality data length and the Quality reserved-space length; plus the DNA read truncation information, a sequence truncation vector and a quality truncation vector.
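A minimal sketch of this buffer as C structures; the per-part "data pointer / data length / reserved length" triple repeats across the four parts, so it is factored into one struct here. Names and widths are illustrative and the truncation vectors are omitted.

#include <stddef.h>
#include <stdint.h>

typedef struct {        /* one group of fields per FASTQ part */
    char  *data;        /* points to the part's data buffer   */
    size_t len;         /* data length                        */
    size_t reserved;    /* reserved-space length              */
} part_t;

typedef struct {
    part_t title;       /* title part         */
    part_t sequence;    /* DNA sequence part  */
    part_t plus;        /* "+" part           */
    part_t quality;     /* Quality Score part */
    /* sequence/quality truncation vectors omitted */
} record_t;

typedef struct {
    record_t **records;   /* block data pointer -> record array */
    uint64_t   block_no;  /* block number                       */
    uint32_t   n_records; /* record count in the block          */
} raw_block_t;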
The compression worker thread in the parallel compression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Preliminary preparation of the compression worker thread, including the creation and initialization of objects within the thread.
(2) Obtain the head of the raw-data block queue of the raw-data circular double-buffer queue.
(3) If the obtained queue head is empty, go to (2); otherwise go to (4).
(4) Compress the raw data block in the queue-head buffer, store the compressed data in the thread's extra buffer, and record the size of the compressed block.
(5) Release this buffer to the empty-block buffer queue of the circular double-buffer queue.
(6) Obtain the head of the empty-block buffer queue of the compressed-data circular double-buffer queue.
(7) If the obtained queue head is empty, go to (6); otherwise go to (8).
(8) Store the compressed block data cached in the thread's extra buffer into the empty block buffer at the queue head, and record the compressed data size and block number.
(9) Release this buffer to the tail of the compressed-data block queue of the circular double-buffer queue.
(10) Set the block's compression-finished flag in the read-write order two-dimensional array.
(11) If the raw-data read thread has ended and all blocks have been processed, go to (13); otherwise go to (2).
(12) If the compressed-data write thread has ended, go to (13); otherwise go to (2).
(13) Set this compression worker thread's finished flag.
(14) The compression worker thread ends.
Note that the data of each raw data block buffer in step (4) can be compressed, as required, with a block-indexed algorithm such as DSRC or G-SQZ.
The compressed data block buffer in the compressed-data circular double-buffer queue used in the steps above is realized as a structure with four fields: a compressed data block pointer, the compressed data block length, the compressed data block number and the record count of the block, where the compressed data block pointer points to a data buffer.
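In C this buffer might look as follows (a sketch with assumed field widths):

#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint8_t *data;       /* compressed data block pointer (owns a buffer) */
    size_t   len;        /* compressed data block length                  */
    uint64_t block_no;   /* compressed data block number                  */
    uint32_t n_records;  /* record count in the block                     */
} comp_block_t;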
The compressed-data write thread in the parallel compression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Preparation of the compressed-data write thread, including writing to the head of the compressed file the configuration information such as the record count per block.
(2) Set block number block_no = 0.
(3) Look up the compression-finished flag of block block_no in the read-write order two-dimensional array.
(4) If block block_no is fully compressed, go to (5); otherwise go to (3).
(5) Obtain the head of the compressed-data block queue of the compressed-data circular double-buffer queue.
(6) If the queue head is empty, go to (5); otherwise go to (7).
(7) Write the compressed data block at the queue head into the final compressed file.
(8) Release this buffer to the tail of the empty-block buffer queue of the circular double-buffer queue.
(9) Increment block_no.
(10) If the raw-data read thread has ended and all blocks have been written to the final compressed file, go to (11); otherwise go to (3).
(11) Write at the end of the compressed file the file tail information, such as the location index of each block's compressed bit stream within the compressed file, the total block count, and the start position of the file tail.
(12) Set the compressed-data write thread's finished flag.
(13) The compressed-data write thread ends.
The compressed-data read thread in the parallel decompression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Open the compressed file to be decompressed that is assigned to the process, obtaining file descriptor 1, i.e. fd1.
(2) Open the same compressed file again, obtaining file descriptor 2, i.e. fd2.
(3) Obtain the page size of the file system of the machine on which the program runs.
(4) Set the size of the memory-mapped space according to the page size.
(5) From the location index of each block's compressed bit stream at the tail of the compressed file, obtain the start and end positions within the compressed bit stream of all blocks this process is to decompress.
(6) From the start position above, set the mapping start (which must be aligned to a memory page boundary according to the page size), the mapping size and the mapping end point, and memory-map fd1 to obtain the address lpBuf1 of memory-mapped region 1. The current mapping start and end points are computed relative to the whole compressed file.
(7) Set the current mapped-region indicator current_buffer_symbol to 1.
(8) Initialize the mapped-region switch counter reverse_num to 0; initialize the start-block and end-block variables of the current mapped region current_lpbuffer to 0, and the start-block and end-block variables of the previous mapped region last_lpbuffer to 0.
(9) Set the number of the block currently awaiting decompression to 0.
(10) Look up the compressed bit stream location index to obtain the current block's start and end positions within the compressed file.
(11) If the current block's start and end positions both lie within the current mapped region, go to (20); otherwise go to (12).
(12) Increment the mapped-region switch counter reverse_num.
(13) Assign the current mapped region's start-block value to the previous mapped region's start-block variable, assign the current block number to the current mapped region's start-block variable, and assign (current block number - 1) to the previous mapped region's end-block variable.
(14) From the current block's start position within the compressed file, set the mapping start (aligned to a memory page boundary according to the page size), the mapping size and the mapping end point. The current mapping start and end points are computed relative to the whole compressed file.
(15) Switch the current mapped-region number, i.e. rotate the two mappings: if current_buffer_symbol is 1, change it to 2; if it is 2, change it to 1.
(16) If reverse_num >= 2, go to (17); otherwise go to (19).
(17) According to the start and end block numbers recorded for the previous mapped region, continuously poll the decompression-finished flags of the corresponding blocks in the read-write order two-dimensional array until all blocks in that range have been decompressed. Go to (18).
(18) If current_buffer_symbol = 1, release memory mapping 1; otherwise release memory mapping 2.
(19) If current_buffer_symbol = 1, memory-map fd1 again as mapping 1 and obtain the mapped address lpbuf1; otherwise memory-map fd2 again as mapping 2 and obtain the mapped address lpbuf2. The parameters of the re-established mapping are those computed in (14).
(20) Cyclically search the compressed-data circular double-buffer queue of each decompression worker thread for an empty block buffer.
(21) If an empty block buffer exists, go to (22); otherwise go to (20).
(22) According to current_buffer_symbol, the mapping start of the current mapped region, and the start and end positions of the current block, set the four fields of the empty block buffer structure obtained from the compressed-data circular double-buffer queue: the block start within the memory-mapped region, the memory-mapped region pointer, the compressed data block length and the compressed data block number, forming a compressed data block buffer.
(23) Release this buffer to the tail of the compressed-data block queue of the circular double-buffer queue.
(24) Set the block's assigned thread number in the read-write order two-dimensional array.
(25) Increment the number of the block awaiting decompression; if it exceeds the largest block number this process is to decompress, go to (26); otherwise go to (10).
(26) Wait for the write thread to end, waiting as long as necessary; once the write thread has ended, release memory mappings 1 and 2 and close fd1 and fd2.
(27) Set the compressed-data read thread's finished flag.
(28) The compressed-data read thread ends.
The compressed data block buffer in the compressed-data circular double-buffer queue used in the steps above is realized as a structure with four fields: the memory-mapped region pointer (4), the block start within the memory-mapped region (5), the compressed data block length (6) and the compressed data block number (7). To save space and the time of repeated data copies, the pointer directly references the memory-mapped region holding the compressed block, together with the block's start within that region.
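A sketch of this decompression-side buffer, for contrast with the compression-side one earlier: it does not own a copy of the data but points into the live memory-mapped region, which is the zero-copy design the paragraph describes. Field widths are assumptions.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *map_base;  /* memory-mapped region pointer (4)         */
    size_t         start;     /* block start within the mapped region (5) */
    size_t         len;       /* compressed data block length (6)         */
    uint64_t       block_no;  /* compressed data block number (7)         */
} mapped_block_t;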
Fig. 5 illustrates the relation between the circular double memory mapping and the compressed data block buffer in the parallel decompression method. Block 1 in the figure shows the relation between each buffer field and memory mappings (1) and (2). As can be seen, the decompression worker threads use the circular double memory-mapped regions in the compressed-data circular double-buffer queue directly, reducing the number of data copies. Note that the start and end points of each mapping most likely do not coincide with block boundaries, because the mapping start must be aligned to a memory page boundary; the mapped space size is a fixed value except for the last mapping, whose size is bounded by the file tail. A block may therefore lie only partly within one mapping, and such a block must be remapped in full within the other mapping.
The decompression worker thread in the parallel decompression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Preliminary preparation of the decompression worker thread, including obtaining the information gathered by the process at startup from the compressed file head and tail (the configuration of the record count per block, the number of compressed blocks the file contains, and each block's location index within the compressed file), and the creation and initialization of objects within the thread.
(2) Obtain the head of the compressed-data block queue of the compressed-data circular double-buffer queue.
(3) If the obtained queue head is empty, go to (2); otherwise go to (4).
(4) Read the four fields of the queue-head buffer structure (the block start within the memory-mapped region, the memory-mapped region pointer, the compressed data block length and the compressed data block number), decompress the block's compressed bit stream, and store the decompressed data in the thread's dedicated block record structure object.
(5) Release this buffer to the empty-block buffer queue of the circular double-buffer queue.
(6) Obtain the head of the empty-block buffer queue of the decompressed-data circular double-buffer queue.
(7) If the obtained queue head is empty, go to (6); otherwise go to (8).
(8) Store the decompressed block data cached in the thread's block record structure object into the empty block buffer at the queue head, and record the decompressed block's length and block number.
(9) Release this buffer to the tail of the decompressed-data block queue of the circular double-buffer queue.
(10) Set the block's decompression-finished flag in the read-write order two-dimensional array.
(11) Check the read-write order two-dimensional array: if all blocks have been decompressed, go to (13); otherwise go to (2).
(12) If the decompressed-data write thread has ended, go to (13); otherwise go to (2).
(13) Set this decompression worker thread's finished flag.
(14) The decompression worker thread ends.
Note that the data of each compressed block buffer in step (4) is decompressed, according to the compression algorithm used, with a block-indexed algorithm such as DSRC or G-SQZ.
The decompressed data block buffer in the decompressed-data circular double-buffer queue used in the steps above is realized as a structure with three fields: a decompressed data block pointer (pointing to a buffer), the decompressed data block length and the decompressed data block number.
Fig. 6 shows the coordination between the read thread and the decompression worker threads around the circular double memory mappings 1 (1) and 2 (2) in the parallel decompression method; the time axis is (3) and the block layout is the same as in Fig. 5. As the figure shows, when memory mapping 1 (1) is used a second time, remapping must wait until all decompression worker threads have processed every block in the original mapping 1: only after blocks 0 through i are processed can the previous mapping be released and remapped to the data of blocks j+1 through k. The same situation arises when memory mapping 2 (2) is used a second time to map new compressed blocks: all decompression threads must first finish blocks i+1 through j+1.
The decompressed-data write thread in the parallel decompression method for FASTQ files is explained in detail below; its implementation steps are as follows:
(1) Preparation of the decompressed-data write thread, including determining the name of the decompressed file.
(2) Obtain the page size of the file system of the machine on which the program runs.
(3) According to the number of blocks to decompress, set the size qwFileSize of the memory-mapped file; according to the page size, set the size of each memory-mapped region and the threshold for re-establishing the memory mapping.
(4) Create the decompressed file, obtain its file descriptor fd, and set the space occupied by the file to qwFileSize.
(5) Compute the size of this memory mapping, memory-map fd, and obtain the mapped memory address lpBuf.
(6) Set block number block_no = 0.
(7) Look up the decompression flag of block block_no in the read-write order two-dimensional array.
(8) If block block_no is fully decompressed, go to (9); otherwise go to (7).
(9) Obtain the head of the decompressed-data block queue of the decompressed-data circular double-buffer queue.
(10) If the queue head is empty, go to (9); otherwise go to (11).
(11) Write the decompressed data block at the queue head into the memory-mapped region; the data offset within the mapped region and the file data offset each increase by the size of the written data.
(12) Release this buffer to the tail of the empty-block buffer queue of the circular double-buffer queue.
(13) If the data written to the current mapped region reaches the threshold, release this memory mapping; from the current file data offset and the file size, compute the new mapping start, mapping size, mapped-region data offset and new threshold, and re-establish the memory mapping.
(14) Increment block_no.
(15) If all blocks have been written to the memory-mapped region, go to (16); otherwise go to (7).
(16) Release the memory mapping, so that the decompressed data is written to the final decompressed file, and close the file descriptor.
(17) Set the decompressed-data write thread's finished flag.
(18) The decompressed-data write thread ends.

Claims (6)

1. A parallel compression method for FASTQ files of DNA read sequence data, characterized in that it comprises a parallel-compression task partitioning part and a compression process processing part, as follows:
(1) Parallel-compression task partitioning part
The start and end positions of the data to be processed by each parallel compression process are determined according to the FASTQ file size, the number of parallel compression processes, and the data characteristics of each read fragment and each record in the FASTQ file; the raw data to be compressed is assigned approximately evenly across the parallel compression processes to realize data parallelism, so that the parallel compression processes incur no communication cost among themselves while processing, which improves data-parallel efficiency; each parallel compression process obtains an independent compressed file, and the order of the compressed data follows the process number;
(2) The compression process processing part is responsible for multi-threaded pipelined parallel compression within each parallel compression process
Each compression process processing part comprises one raw-data read thread, one compressed-data write thread and multiple compression worker threads; the number of compression worker threads is set according to the CPU core count of the hardware and the parallel compression processes;
The data to be compressed handled by each process is divided by the raw-data read thread into multiple blocks, each containing a fixed number of records, the last block possibly containing fewer than said fixed number; each compression worker thread owns two circular double-buffer queues, one being a raw-data circular double-buffer queue and the other a compressed-data circular double-buffer queue; each raw-data circular double-buffer queue comprises two queues, an empty-block buffer queue and a raw-data block queue; each compressed-data circular double-buffer queue likewise comprises two queues, an empty-block buffer queue and a compressed-data block queue;
Within each process, the parallel compression pipeline operates in units of raw data blocks; the concrete pipelined parallel flow is as follows:
(1) the raw-data read thread continually parses and reads raw data blocks according to the record data characteristics, cyclically searches the empty-block buffer queue in the raw-data circular double-buffer queue of each compression worker thread, and on finding an empty block buffer fills it with the raw data block and releases the buffer to the tail of the raw-data block queue of that circular double-buffer queue;
the raw-data read thread adopts memory mapping combined with data blocking;
(2) each compression worker thread continually obtains raw data blocks from the head of the raw-data block queue of its raw-data circular double-buffer queue and compresses them;
(3) each compression worker thread continually fills the compressed block data into an empty block buffer obtained from its compressed-data circular double-buffer queue, and releases the buffer to the tail of the compressed-data block queue of that circular double-buffer queue;
(4) the compressed-data write thread continually looks up, in ascending block-number order, the thread number holding each finished compressed block, obtains the block's compressed data at the head of the compressed-data block queue of that compression worker thread's compressed-data circular double-buffer queue, and writes it to the final compressed file.
2. The parallel compression method for FASTQ files of DNA read sequence data according to claim 1, characterized in that the raw-data circular double-buffer queue is processed as follows:
(1) initialization of the raw-data circular double-buffer queue: the empty-block buffer queue is instantiated with a certain number of empty block buffers, and the raw-data block queue is empty;
(2) the raw-data read thread reads a raw data block;
(3) an empty block buffer is obtained at the head of the empty-block buffer queue;
(4) the obtained empty block buffer is filled with the raw data block;
(5) the filled raw data block is placed at the tail of the raw-data block queue;
(6) a compression worker thread obtains the raw data block buffer at the head of the raw-data block queue and compresses its data;
(7) the raw data block buffer is emptied and placed back into the empty-block buffer queue.
3. The parallel compression method for FASTQ files of DNA read sequence data according to claim 1, characterized in that: the memory mapping combined with data blocking adopted by the raw-data read thread is used to improve the reading speed of large files; in combination with the data blocking, the position of each block within the memory-mapped space is computed according to the memory page size and the size of the mapped space, and the memory-mapped space is released and remapped when necessary.
4. A parallel decompression method for FASTQ files of DNA read sequence data, characterized by comprising the following parts:
(1) determining, according to the process number, the compressed file each parallel decompression process is to handle
A FASTQ file to be compressed yields as many compressed files as there are parallel compression processes; in parallel decompression, the number of parallel decompression processes is set according to the number of compressed files, and the order of the decompressed output of each parallel decompression process is determined by the order of the compressed files; the parallel decompression processes incur no communication cost among themselves while processing, which improves data-parallel efficiency;
(2) reading the compressed file tail to obtain the block configuration, block index and block count information
Each parallel decompression process begins by obtaining from the tail of the compressed file the configuration of the record count per block, the location index of each block within the compressed file, and the block count;
(3) multi-threaded pipelined parallel decompression within each parallel decompression process
Each parallel decompression process comprises one compressed-data read thread, one decompressed-data write thread and multiple decompression worker threads;
Each decompression worker thread owns two circular double-buffer queues, one being a compressed-data circular double-buffer queue and the other a decompressed-data circular double-buffer queue; each compressed-data circular double-buffer queue comprises two queues, an empty-block buffer queue and a compressed-data block queue; each decompressed-data circular double-buffer queue likewise comprises two queues, an empty-block buffer queue and a decompressed-data block queue;
Within each parallel decompression process, the parallel decompression pipeline operates in units of compressed blocks; the concrete pipelined parallel flow is as follows:
(1) according to the compressed-block location index obtained from the compressed file tail, the compressed-data read thread continually reads compressed blocks of known compressed size in ascending block-number order, cyclically searches the compressed-data circular double-buffer queue of each decompression worker thread for an empty block buffer at the queue head, and on finding one fills it with the compressed block data and releases the buffer to the tail of the compressed-data block queue of that circular double-buffer queue;
the compressed-data read thread adopts circular double memory mapping combined with data blocking;
(2) each decompression worker thread continually obtains compressed data blocks from the head of the compressed-data block queue of its compressed-data circular double-buffer queue and decompresses them;
(3) each decompression worker thread continually fills the decompressed block data into an empty block buffer obtained from its decompressed-data circular double-buffer queue, and releases the buffer to the tail of the decompressed-data block queue of that circular double-buffer queue;
(4) the decompressed-data write thread continually looks up, in ascending block-number order, the thread number holding each finished decompressed block, obtains the block's decompressed data at the head of the decompressed-data block queue of that thread's decompressed-data circular double-buffer queue, and writes it to the final decompressed file.
5. The parallel decompression method for FASTQ files of DNA read sequence data according to claim 4, characterized in that: the compressed data block buffer in said compressed-data circular double-buffer queue is realized as a structure comprising four fields, the memory-mapped region pointer, the block start within the memory-mapped region, the compressed data block length and the compressed data block number; to save space and the time of repeated data copies, the pointer directly references the memory-mapped region holding the compressed block, together with the block's start within that region.
6. The method for parallel decompression of DNA read sequence data in FASTQ files according to claim 4, characterized in that: the circular double memory mapping combined with data partitioning adopted by the said compressed-data reading thread is implemented as follows:
The key technique is circular double memory mapping, which allows the decompression worker threads to read compressed data for decompression while the compressed-data reading thread performs memory mapping in parallel; that is, there are two memory mappings, mapping 1 and mapping 2, and compressed blocks are placed into the two mappings alternately in block order. From the compressed-block location index at the tail of the compressed file and the sizes of the two memory-mapped spaces, the memory-mapped buffer holding each compressed block and the block's position within that buffer are computed in ascending block-number order. The decompression worker threads use the regions of this circular double memory mapping directly via the compressed-data circular double-buffer queue, reducing the number of data copies. A memory mapping currently in use may only be remapped after the data of its previous mapping have been consumed by all decompression worker threads.
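For illustration only, one plausible C sketch of the circular double memory mapping follows, reading fixed-size windows of the compressed file alternately into two mmap regions; the window size, the names, and the reading of "placed alternately in block order" as alternating fixed-size file windows are assumptions, and the hand-shake that delays remapping until all worker threads have finished with a window (reference count, barrier, or similar) is elided:

    #include <sys/mman.h>
    #include <sys/types.h>
    #include <stdint.h>
    #include <stddef.h>

    #define MAP_SPAN ((off_t)64 << 20) /* assumed 64 MiB per mapping window */

    /* Two alternating memory mappings over the compressed file. */
    typedef struct {
        uint8_t *addr[2]; /* mapping 1 and mapping 2 (NULL until first use) */
        off_t    base[2]; /* file offset currently covered by each mapping */
    } DoubleMap;

    /* Return a pointer to the compressed block at 'offset' (taken from the
       tail index), remapping the corresponding window if necessary. */
    static const uint8_t *map_block(DoubleMap *dm, int fd, off_t offset) {
        int   w    = (int)((offset / MAP_SPAN) % 2); /* alternate windows */
        off_t base = offset - (offset % MAP_SPAN);   /* page-aligned window start */
        if (dm->addr[w] == NULL || dm->base[w] != base) {
            if (dm->addr[w] != NULL)
                munmap(dm->addr[w], MAP_SPAN); /* only after all workers are done */
            dm->addr[w] = mmap(NULL, MAP_SPAN, PROT_READ, MAP_PRIVATE, fd, base);
            dm->base[w] = base;
        }
        return dm->addr[w] + (offset - base); /* block position in the window */
    }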
CN201310551802.7A 2013-11-07 2013-11-07 Parallel compression and decompression method for DNA read sequence data in FASTQ files Expired - Fee Related CN103559020B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310551802.7A CN103559020B (en) 2013-11-07 2013-11-07 Parallel compression and decompression method for DNA read sequence data in FASTQ files

Publications (2)

Publication Number Publication Date
CN103559020A CN103559020A (en) 2014-02-05
CN103559020B true CN103559020B (en) 2016-07-06

Family

ID=50013277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310551802.7A Expired - Fee Related CN103559020B (en) Parallel compression and decompression method for DNA read sequence data in FASTQ files

Country Status (1)

Country Link
CN (1) CN103559020B (en)

Families Citing this family (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103997514A (en) * 2014-04-23 2014-08-20 汉柏科技有限公司 File parallel transmission method and system
CN103984528A (en) * 2014-05-15 2014-08-13 中国人民解放军国防科学技术大学 Multithread concurrent data compression method based on FT processor platform
CN103995988B (en) * 2014-05-30 2017-02-01 周家锐 High-throughput DNA sequencing mass fraction lossless compression system and method
CN105760706B (en) * 2014-12-15 2018-05-29 深圳华大基因研究院 A kind of compression method of two generations sequencing data
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
CN106096332A (en) * 2016-06-28 2016-11-09 深圳大学 Parallel fast matching method and system thereof towards the DNA sequence stored
WO2018039983A1 (en) * 2016-08-31 2018-03-08 华为技术有限公司 Biological sequence data processing method and device
CN108629157B (en) * 2017-03-22 2021-08-31 深圳华大基因科技服务有限公司 Method for compressing and encrypting nucleic acid sequencing data
CN107169313A (en) * 2017-03-29 2017-09-15 中国科学院深圳先进技术研究院 The read method and computer-readable recording medium of DNA data files
CN107565975A (en) * 2017-08-30 2018-01-09 武汉古奥基因科技有限公司 The method of FASTQ formatted file Lossless Compressions
CN108363719B (en) * 2018-01-02 2022-10-21 中科边缘智慧信息科技(苏州)有限公司 Configurable transparent compression method in distributed file system
CN110572422A (en) * 2018-06-06 2019-12-13 北京京东尚科信息技术有限公司 Data downloading method and device
CN109062502A (en) * 2018-07-10 2018-12-21 郑州云海信息技术有限公司 A kind of data compression method, device, equipment and computer readable storage medium
CN109547355B (en) * 2018-10-17 2022-05-06 中国电子科技集团公司第四十一研究所 Storage analysis device and method based on gigabit Ethernet port receiver
CN109490895B (en) * 2018-10-25 2020-12-29 中国人民解放军海军工程大学 Interferometric synthetic aperture sonar signal processing system based on blade server
CN111294057A (en) * 2018-12-07 2020-06-16 上海寒武纪信息科技有限公司 Data compression method, encoding circuit and arithmetic device
US10554220B1 (en) 2019-01-30 2020-02-04 International Business Machines Corporation Managing compression and storage of genomic data
CN110247666B (en) * 2019-05-22 2023-08-18 深圳大学 System and method for hardware parallel compression
CN110349635B (en) * 2019-06-11 2021-06-11 华南理工大学 Parallel compression method for gene sequencing data quality fraction
CN110299187B (en) * 2019-07-04 2022-03-22 南京邮电大学 Parallelization gene data compression method based on Hadoop
CN111061434B (en) * 2019-12-17 2021-10-01 人和未来生物科技(长沙)有限公司 Gene compression multi-stream data parallel writing and reading method, system and medium
CN111370070B (en) * 2020-02-27 2023-10-27 中国科学院计算技术研究所 Compression processing method for big data gene sequencing file
CN111326216B (en) * 2020-02-27 2023-07-21 中国科学院计算技术研究所 Rapid partitioning method for big data gene sequencing file
CN111767256B (en) * 2020-05-22 2023-10-20 北京和瑞精湛医学检验实验室有限公司 Method for separating sample read data from fastq file
CN111767255B (en) * 2020-05-22 2023-10-13 北京和瑞精湛医学检验实验室有限公司 Optimization method for separating sample read data from fastq file
CN113568736A (en) * 2021-06-24 2021-10-29 阿里巴巴新加坡控股有限公司 Data processing method and device
CN113590051B (en) * 2021-09-29 2022-03-18 阿里云计算有限公司 Data storage and reading method and device, electronic equipment and medium
CN113672876B (en) * 2021-10-21 2022-02-01 南京拓界信息技术有限公司 OTG-based method and device for quickly obtaining evidence of mobile phone
CN114489518B (en) * 2022-03-28 2022-09-09 山东大学 Sequencing data quality control method and system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8495662B2 (en) * 2008-08-11 2013-07-23 Hewlett-Packard Development Company, L.P. System and method for improving run-time performance of applications with multithreaded and single threaded routines
CN103077006A (en) * 2012-12-27 2013-05-01 浙江工业大学 Multithreading-based parallel executing method for long transaction
CN103049680A (en) * 2012-12-29 2013-04-17 深圳先进技术研究院 gene sequencing data reading method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Parallel protein secondary structure prediction schemes using Pthread and OpenMP over hyper-threading technology; Wei Zhong et al.; The Journal of Supercomputing; 2007-07-31; Vol. 41, No. 1; pp. 1-16 *
Design and Analysis of Protein Quantification Software Based on MPI and CUDA; Zhan Ke et al.; Computer Science; 2013-03-31; Vol. 40, No. 3; pp. 36, 37, 54 *

Similar Documents

Publication Publication Date Title
CN103559020B (en) Parallel compression and decompression method for DNA read sequence data in FASTQ files
US11706020B2 (en) Circuit and method for overcoming memory bottleneck of ASIC-resistant cryptographic algorithms
CN103262058B (en) SIMD is utilized to carry out the methods, devices and systems of collision detection
US20170177597A1 (en) Biological data systems
Gharaibeh et al. Size matters: Space/time tradeoffs to improve gpgpu applications performance
Laguna et al. Seed-and-vote based in-memory accelerator for dna read mapping
US11791838B2 (en) Near-storage acceleration of dictionary decoding
You et al. Spatial join query processing in cloud: Analyzing design choices and performance comparisons
Rahn et al. Scalable distributed-memory external sorting
CN115756312A (en) Data access system, data access method, and storage medium
Rødland Compact representation of k-mer de Bruijn graphs for genome read assembly
Liu et al. GPU-accelerated BWT construction for large collection of short reads
He et al. A MPI-based parallel pyramid building algorithm for large-scale remote sensing images
CN105830160A (en) Apparatuses and methods for writing masked data to buffer
Chacón et al. FM-index on GPU: A cooperative scheme to reduce memory footprint
Tran et al. Exploring means to enhance the efficiency of GPU bitmap index query processing
WO2022237245A1 (en) Data processing method and apparatus
WO2015143708A1 (en) Method and apparatus for constructing suffix array
Feng et al. Accelerating Smith-Waterman alignment of species-based protein sequences on GPU
Jaspers Acceleration of read alignment with coherent attached FPGA coprocessors
CN107169313A (en) The read method and computer-readable recording medium of DNA data files
CN111788552B (en) System and method for low latency hardware memory
JP2023503034A (en) Pattern-based cache block compression
Wangl et al. Optimizing GPU-Based Graph Sampling and Random Walk for Efficiency and Scalability
Yoneki et al. Scale-up graph processing: a storage-centric view

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160706

Termination date: 20161107
