CN110349635A

CN110349635A - Parallel compression method for quality scores of gene sequencing data

Info

Publication number: CN110349635A
Application number: CN201910499892.7A
Authority: CN
Inventors: 董守斌; 柯璧新; 付佳兵; 胡金龙
Original assignee: South China University of Technology SCUT
Current assignee: South China University of Technology SCUT
Priority date: 2019-06-11
Filing date: 2019-06-11
Publication date: 2019-10-18
Anticipated expiration: 2039-06-11
Also published as: CN110349635B

Abstract

The invention discloses a parallel compression method for the quality scores of gene sequencing data, comprising the steps of: 1) splitting the FASTQ file data to obtain the quality score part; 2) computing, line by line, a score for each line of quality scores, and classifying the line according to that score; 3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, putting the quality scores in the class into a compute buffer queue as one data block and emptying the class; 4) having an idle computing unit take a data block from the compute buffer queue, transform it, encode it with a vectorization-optimized ZPAQ, and put the result into an output buffer queue; 5) having an output processing unit output the compressed data until all compressed data has been output, and then adding maintenance information. The technical solution of the invention offers high performance and strong scalability.

Description

Parallel compression method for quality scores of gene sequencing data

Technical field

The present invention relates to the technical field of compressing biological gene sequencing data, and in particular to a parallel compression method for the quality scores of gene sequencing data.

Background art

With the development of second-generation sequencing technology, the cost of gene sequencing has fallen rapidly; by contrast, the share of total expense taken up by storing and transmitting gene sequencing data has risen to an unprecedented level. Reducing the storage and transmission cost of gene sequencing data is therefore of great importance. Data compression effectively reduces the size of gene sequencing data and is a key technology for lowering storage and transmission cost. Although research on gene data compression tools has produced many results, many compression schemes suffer from slow compression speed, which severely limits their practicality.

The FASTQ format has become the common format for gene sequencing data. Within a FASTQ file, the quality score data has high randomness and noise and is comparatively hard to compress; it typically accounts for about 70% of the compressed file. The compression achieved on the quality score data therefore has a decisive influence on the compression achieved on the FASTQ data as a whole. In data compression, a model that better matches the actual distribution of the data yields better compression, but a complex model brings exponentially growing computational overhead. On the one hand, a compression algorithm must balance compression ratio against the consumption of computing resources. On the other hand, improvements in computer hardware, such as the widespread adoption of multi-core processors, make better compression achievable.

At present, general-purpose compression tools such as Gzip remain the most common way to compress FASTQ data. General-purpose tools are mature in design, have been validated in many scenarios, and offer excellent stability and robustness. However, they do not take into account the structure of the FASTQ format or the semantics specific to gene sequencing data, so neither compression ratio nor compression speed can be fully optimized, and they are of limited use against the explosive growth of gene sequencing data. In recent years researchers have proposed dedicated alternatives such as Quip, SCALCE, and QVZ. These schemes take the characteristics of FASTQ data fully into account and greatly improve the compression ratio. However, because they are designed primarily to raise the compression ratio, they often introduce complex models, and they either pay little attention to performance optimization or have structures that are hard to optimize. As a result, the processing speed of these dedicated schemes is still far from what practical use requires.

The present invention provides a parallel compression method for the quality scores of gene sequencing data. It takes into account both the characteristics of quality score data and the workflow of compression methods, and can be used to parallelize and accelerate existing quality score compression methods, yielding a parallel compression method with high processing speed and good scalability.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a parallel compression method for the quality scores of gene sequencing data. The method effectively raises the processing speed of quality score compression and scales well. It addresses the lack of practicality of dedicated quality score compression tools caused by their low processing speed, and meets the need to compress gene sequencing data efficiently in a big data setting.

To achieve the above purpose, the technical solution provided by the present invention is a parallel compression method for the quality scores of gene sequencing data, comprising the following steps:

1) splitting the FASTQ file data to obtain the quality score part;

2) computing, line by line, a score for each line of quality scores, and classifying the line according to that score into class 1 or class 2;

3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, putting the quality scores in the class into the compute buffer queue as one data block and emptying the class;

4) having an idle computing unit take a data block from the compute buffer queue, transform it, encode it with the vectorization-optimized ZPAQ, and put the result into the output buffer queue;

5) having the output processing unit output the compressed data until all compressed data has been output, and then adding maintenance information.

In step 1), the main thread reads the input FASTQ file data and, for every 4 lines read, retains the 4th line, which is the quality score data.

In step 2), a quality score evaluation function is applied to each line of quality score data in turn to obtain its score x. If x is greater than a given threshold θ, the line is put into class 1; otherwise the line is put into class 2 and an empty line is put into class 1.
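The classification rule and the role of the empty line can be illustrated with a minimal Python sketch. The patent does not define the evaluation function, so `score_line` below is a purely hypothetical stand-in (the share of characters that repeat their predecessor); `classify` and `merge` are illustrative names. The sketch also assumes that no real quality line is empty, so an empty line in class 1 unambiguously marks a line that went to class 2.

```python
from typing import List, Tuple


def score_line(line: str) -> float:
    """Hypothetical scorer (not from the patent): share of repeated neighbours."""
    if len(line) < 2:
        return 0.0
    repeats = sum(1 for a, b in zip(line, line[1:]) if a == b)
    return repeats / (len(line) - 1)


def classify(lines: List[str], theta: float) -> Tuple[List[str], List[str]]:
    """Split lines into class 1 (score > theta) and class 2 (everything else)."""
    class1: List[str] = []
    class2: List[str] = []
    for line in lines:
        if score_line(line) > theta:
            class1.append(line)
        else:
            class2.append(line)
            class1.append("")  # empty placeholder keeps the original position
    return class1, class2


def merge(class1: List[str], class2: List[str]) -> List[str]:
    """Restore the original order: each empty line pulls the next class-2 line."""
    remaining = iter(class2)
    return [line if line else next(remaining) for line in class1]


lines = ["IIIIIIII", "I#F:9,A5", "FFFFFFF#", "5A:#I9F,"]
class1, class2 = classify(lines, theta=0.5)
restored = merge(class1, class2)
```

Because class 1 records one entry per input line, the original line order can be recovered from the two classes alone, which is why the header counts empty lines in the class 1 line total.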

In step 3), the count t of consecutive times class 2 has been emptied is initialized to 0. Each time a line of quality scores is classified, the data sizes of class 1 and class 2, S₁ bytes and S₂ bytes, are computed. Let i be the index of a class. When some Sᵢ exceeds a given size of k bytes, or no more quality score data will be added to class i, the data in class i is put into the compute buffer queue as one data block and the class is emptied. At that point, if i is 2, the count t is increased by 1; otherwise t is set to 0. If t reaches a given number T, t is set to 0, the data in class 1 is also put into the compute buffer queue, and class 1 is emptied. Let n be the number of data blocks the buffer queue can hold. When the compute buffer queue is full, the main thread waits until the queue has free space for new data and then continues working.
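The bookkeeping for k, t, and T can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Dispatcher` class and its method names are hypothetical, a plain list stands in for the compute buffer queue, and each line is counted as its length plus one byte for the newline.

```python
from typing import Dict, List, Tuple


class Dispatcher:
    """Collects classified lines and emits data blocks following the k / T rule."""

    def __init__(self, k: int, T: int) -> None:
        self.k = k                      # size threshold in bytes
        self.T = T                      # consecutive class-2 flushes that force a class-1 flush
        self.t = 0                      # consecutive times class 2 has been emptied
        self.buffers: Dict[int, List[str]] = {1: [], 2: []}
        self.sizes: Dict[int, int] = {1: 0, 2: 0}
        self.blocks: List[Tuple[int, List[str]]] = []  # stand-in for the compute buffer queue

    def _flush(self, i: int) -> None:
        if self.buffers[i]:
            self.blocks.append((i, self.buffers[i]))
        self.buffers[i] = []
        self.sizes[i] = 0

    def add(self, i: int, line: str) -> None:
        self.buffers[i].append(line)
        self.sizes[i] += len(line) + 1  # assumption: one byte for the newline
        if self.sizes[i] > self.k:
            self._flush(i)
            self.t = self.t + 1 if i == 2 else 0
            if self.t >= self.T:
                self.t = 0
                self._flush(1)

    def finish(self) -> None:
        """No more input: emit whatever is left in both classes."""
        for i in (1, 2):
            self._flush(i)


dispatcher = Dispatcher(k=10, T=2)
for _ in range(4):
    dispatcher.add(1, "")        # placeholder that accompanies every class-2 line
    dispatcher.add(2, "ABCDE")
dispatcher.finish()
block_classes = [i for i, _ in dispatcher.blocks]
```

Forcing class 1 out after T consecutive class-2 flushes keeps the placeholder lines from accumulating indefinitely when almost every line falls into class 2.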

In step 4), when the compute buffer queue holds unprocessed data blocks and an idle computing unit exists, one idle computing unit takes a data block from the compute buffer queue. The computing unit applies a series of data transformation methods to the data block and then encodes it with the vectorization-optimized ZPAQ, specifying the input data and the compression model. If the width of a model's unprocessed data is less than the vector width, the pre-optimization scheme is used instead of the vectorized scheme. After encoding is complete, the computing unit puts the compressed data block into the output buffer queue.

In step 5), the output processing unit maintains the compressed data file. The file consists of three parts, a header, a data area, and an index table, whose contents are as follows:

Header: records the global information of the file, including the parameters needed for the transformation in step 4), the number of lines in class 1, the number of lines in class 2, and the starting position of the index table in the file, where the number of lines in class 1 includes empty lines;

Data area: contains each compressed data block;

Index table: contains as many index blocks as there are compressed data blocks. Each index block records information about its corresponding data block, including the data block's label, its starting position in the compressed file, the number of bytes it occupies, the line number in the original file of its first line, the line number in the original file of its last line, the number of empty lines that the block's class contains before the block's first line, and the class to which the block belongs;

The output processing unit first outputs the header information, reserving space in the header for the starting position of the index table. It then outputs the compressed data blocks from the output buffer queue one by one, generating the information for the corresponding index block as it goes. After all data blocks have been output, it outputs each index block of the index table and records the starting position of the index table in the header.
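The header, data area, and index table layout, including the reserved slot for the index table position, can be sketched with Python's `struct` module. The field widths, the little-endian byte order, and the omission of the transformation parameters from the header are all assumptions made for brevity; the patent does not specify a binary encoding.

```python
import io
import struct

# Assumed layout: class-1 line count, class-2 line count, index table offset.
HEADER = struct.Struct("<QQQ")
# Assumed layout: label, offset, size, first line, last line, empty lines before, class.
INDEX_ENTRY = struct.Struct("<IQQQQQB")


def write_container(out, blocks, class1_lines, class2_lines):
    """blocks yields (label, payload, first_line, last_line, empty_before, class_id)."""
    out.write(HEADER.pack(class1_lines, class2_lines, 0))  # reserve the offset slot
    entries = []
    for label, payload, first, last, empty_before, class_id in blocks:
        entries.append((label, out.tell(), len(payload), first, last, empty_before, class_id))
        out.write(payload)
    index_offset = out.tell()
    for entry in entries:
        out.write(INDEX_ENTRY.pack(*entry))
    out.seek(0)  # go back and fill in the reserved slot
    out.write(HEADER.pack(class1_lines, class2_lines, index_offset))
    out.seek(0, io.SEEK_END)
    return index_offset


def read_index(data: bytes):
    """Read the header, then every index entry from the recorded offset to the end."""
    _, _, index_offset = HEADER.unpack_from(data, 0)
    return [
        INDEX_ENTRY.unpack_from(data, pos)
        for pos in range(index_offset, len(data), INDEX_ENTRY.size)
    ]


buffer = io.BytesIO()
# Blocks arrive in completion order, which need not match input order.
completed = [(1, b"abc", 4, 7, 0, 2), (0, b"defgh", 0, 3, 0, 1)]
index_offset = write_container(buffer, completed, class1_lines=8, class2_lines=4)
index = read_index(buffer.getvalue())
```

Writing the index after the data area lets the writer emit blocks as they arrive without knowing their sizes in advance; only the fixed-size offset slot in the header has to be patched at the end.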

Compared with the prior art, the present invention has the following advantages and beneficial effects:

1. The invention is broadly applicable: it can compress FASTQ quality score data produced by a variety of sequencing devices, greatly raises the processing speed of gene data compression methods, and makes the compression algorithm more practical.

2. The invention takes into account that some downstream applications access compressed FASTQ quality score data with locality. It indexes the data in blocks and uses an efficient multithreaded parallel algorithm, effectively meeting the performance requirements of practical applications.

3. The invention applies vectorization optimization to the widely used compression module ZPAQ, so it can be applied to the many compression schemes that use this module, and it extends well.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of the structure of gene sequencing information.

Fig. 3 is a schematic diagram of the task assignment of each thread.

Fig. 4 is a schematic diagram of the compressed file structure of the present invention.

Detailed description of the embodiments

The present invention is further described below with reference to a specific embodiment.

As shown in Fig. 1, the parallel compression method for the quality scores of gene sequencing data provided by this embodiment comprises the following steps:

1) In the input FASTQ file, every 4 lines represent one gene sequencing record, as shown in Fig. 2. The 4th of these lines holds the quality scores and has the same length as the base sequence in the 2nd line; each quality score represents the sequencing accuracy of the base at the same position in the 2nd line. Only the quality scores are retained here and passed on to the compression process that follows.
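Extracting the quality lines amounts to keeping every 4th line of the input. A minimal Python sketch follows; the function name `quality_lines` is illustrative, and the sketch assumes a well-formed file in which every record occupies exactly 4 lines.

```python
from typing import Iterable, Iterator


def quality_lines(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the 4th line (the quality scores) of every 4-line FASTQ record."""
    for index, line in enumerate(fastq_lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


records = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "GGCA\n", "+\n", "FF#F\n",
]
scores = list(quality_lines(records))
```

The same generator works on an open file object, since iterating a text file yields its lines one at a time.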

2) For the extracted quality scores, the main thread computes a score for each line; a higher score indicates that the line contains more high-frequency substrings. According to this score, quality scores are assigned line by line to class 1 or class 2. Score computation, data partitioning, and the subsequent compression form different stages of a pipeline: the main thread handles the score computation and partitioning stages, while multiple worker threads handle the more time-consuming transformation and encoding stages. The division of labor among the threads is shown in Fig. 3.

3) As soon as the quality score data in a class reaches 32 MB, the data in that class is put into the compute buffer queue as one data block. If class 2 reaches 32 MB 3 times in a row, the data in class 1 is also put into the compute buffer queue. Whenever class 1 data is put into the compute buffer queue, the count of consecutive times class 2 has been emptied is reset to 0. After all quality score data has been read and classified, the main thread puts the remaining data of both classes into the compute buffer queue.

The capacity of the compute buffer queue is set to 2 times the number of worker threads. While the worker threads process data from the compute buffer queue, the main thread can keep adding data to it, so that a worker thread that finishes its current block finds another block ready to process immediately, which reduces idle waiting.

Semaphores and a mutex are used to maintain the state of the compute buffer queue. When the buffer queue is full, the main thread is suspended and releases its processor resources; when a worker thread finishes a computation and frees a slot in the buffer queue, the main thread is woken up and continues adding data. After all data blocks have been added, the main thread fills free slots of the buffer queue with empty data blocks carrying an end marker, which signal the end of input. The mutex ensures that the global information of the buffer queue is accessed by only one thread at a time, which prevents race conditions and keeps the data correct in a concurrent environment.

4) When the compute buffer queue holds unprocessed data blocks and an idle worker thread exists, one idle worker thread takes a data block from the compute buffer queue. This is managed with semaphores: once a worker thread is ready, it waits for a data block to arrive, and when it receives an empty data block carrying the end marker, it terminates and exits. The semaphores ensure that pending data blocks are handed to worker threads safely.
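The producer and consumer arrangement described above can be sketched with Python's standard library. Several substitutions are made here and should be read as assumptions: `queue.Queue` handles the blocking and locking internally in place of explicit semaphores and a mutex, `None` plays the role of the empty data block with an end marker, and `zlib` stands in for the ZPAQ encoder, which is not part of the standard library.

```python
import queue
import threading
import zlib

NUM_WORKERS = 3
# Capacity of 2x the worker count, as in the embodiment.
compute_queue: queue.Queue = queue.Queue(maxsize=2 * NUM_WORKERS)
output_queue: queue.Queue = queue.Queue()
END = None  # stand-in for the empty data block carrying the end marker


def worker() -> None:
    while True:
        item = compute_queue.get()      # blocks until a data block is available
        if item is END:
            break                       # end marker received: exit the thread
        label, data = item
        # zlib is only a placeholder for the transform + ZPAQ encoding stage.
        output_queue.put((label, zlib.compress(data)))


def run(blocks):
    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for thread in threads:
        thread.start()
    for item in enumerate(blocks):
        compute_queue.put(item)         # blocks while the queue is full
    for _ in threads:
        compute_queue.put(END)          # one end marker per worker
    for thread in threads:
        thread.join()
    results = []
    while not output_queue.empty():
        results.append(output_queue.get())
    return results


blocks = [bytes([73 + i]) * 1000 for i in range(8)]
results = run(blocks)
```

Results arrive in completion order rather than input order, which is why each one carries its label. Note that in CPython, threads give real parallelism here only because `zlib.compress` releases the global interpreter lock while it runs.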

Quality score data is represented with printable ASCII characters, which makes it easy for a person to inspect directly but makes the representation rather redundant. The worker thread first applies a lightweight transformation to the data block based on the distribution of the data, reversibly mapping some runs of consecutive quality scores to single bytes. The worker thread then encodes the transformed data with ZPAQ, using a specifically customized combination of models, to obtain the final compressed data block, and puts that block into the output buffer queue to await output.
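The patent does not spell out the transformation, so the following Python sketch shows just one possible reversible mapping of this kind, under stated assumptions: quality scores are printable ASCII (byte values below 128), and runs of the most frequent symbol are replaced by single bytes in the otherwise unused range 128 to 255. The names `pack_runs` and `unpack_runs` are hypothetical.

```python
from collections import Counter


def pack_runs(data: bytes, symbol: int) -> bytes:
    """Replace runs of 2..129 copies of `symbol` with one byte in 128..255."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == symbol:
            run = 1
            while i + run < len(data) and data[i + run] == symbol and run < 129:
                run += 1
            if run >= 2:
                out.append(128 + run - 2)
                i += run
                continue
        out.append(data[i])             # literal byte, always below 128
        i += 1
    return bytes(out)


def unpack_runs(packed: bytes, symbol: int) -> bytes:
    """Inverse of pack_runs: expand every byte >= 128 back into a run."""
    out = bytearray()
    for value in packed:
        if value >= 128:
            out.extend(bytes([symbol]) * (value - 128 + 2))
        else:
            out.append(value)
    return bytes(out)


data = b"IIIIIFFI#IIII"
symbol = Counter(data).most_common(1)[0][0]   # most frequent quality score
packed = pack_runs(data, symbol)
restored = unpack_runs(packed, symbol)
```

In a scheme like this the chosen symbol has to be stored alongside the data so the decoder can invert the mapping, which corresponds to the transformation parameters that the header records.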

5) The output thread first outputs the header information and moves the file pointer to the data area. Whenever a new data block is added to the output buffer queue, the output thread outputs the compressed data block, and it continues until all data blocks have been output, that is, until all compute threads have terminated and the output buffer queue is empty. The output thread then outputs the index blocks of the index table and finally records the starting position of the index table in the header, completing the output of all information. The file structure is shown in Fig. 4.

Because the index table records the information of every block, the output thread can output a data block as soon as its compression finishes. It does not have to wait unnecessarily in order to preserve the input order, so it can output promptly and release the worker thread. During decompression, the original order of the data blocks only needs to be read from the index blocks. The index blocks also mean that when an application needs to access only part of the data, only the relevant data blocks have to be decompressed, which improves access efficiency.
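Selecting the blocks needed for a partial read can be sketched as a filter over the index table. The `IndexEntry` fields below are a simplified, hypothetical subset of the index block contents, and line numbers are assumed to be inclusive.

```python
from typing import List, NamedTuple


class IndexEntry(NamedTuple):
    label: int
    offset: int       # starting position in the compressed file
    size: int         # bytes occupied
    first_line: int   # first original line covered (inclusive)
    last_line: int    # last original line covered (inclusive)
    class_id: int


def blocks_for_range(index: List[IndexEntry], start: int, end: int) -> List[IndexEntry]:
    """Return only the blocks whose line range overlaps [start, end], in input order."""
    hits = [e for e in index if e.first_line <= end and e.last_line >= start]
    return sorted(hits, key=lambda e: (e.first_line, e.class_id))


# Entries appear in completion order, which need not match input order.
index = [
    IndexEntry(2, 24, 100, 200, 299, 1),
    IndexEntry(0, 124, 90, 0, 99, 1),
    IndexEntry(1, 214, 80, 100, 199, 1),
    IndexEntry(3, 294, 40, 0, 299, 2),
]
needed = blocks_for_range(index, start=150, end=220)
needed_labels = [e.label for e in needed]
```

Here a request for lines 150 to 220 skips block 0 entirely; the class-2 block is still needed because class 1 holds only placeholders for the lines that were assigned to class 2.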

The embodiment described above is only a preferred embodiment of the present invention and does not limit its scope of implementation; any change made according to the form and principle of the present invention shall therefore fall within the scope of protection of the present invention.

Claims

1. A parallel compression method for the quality scores of gene sequencing data, characterized by comprising the following steps:

1) splitting the FASTQ file data to obtain the quality score part;

2) computing, line by line, a score for each line of quality scores, and classifying the line according to that score into class 1 or class 2;

3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, putting the quality scores in the class into the compute buffer queue as one data block and emptying the class;

4) having an idle computing unit take a data block from the compute buffer queue, transform it, encode it with the vectorization-optimized ZPAQ, and put the result into the output buffer queue;

5) having the output processing unit output the compressed data until all compressed data has been output, and then adding maintenance information.

2. The parallel compression method for the quality scores of gene sequencing data according to claim 1, characterized in that: in step 1), the main thread reads the input FASTQ file data and, for every 4 lines read, retains the 4th line, which is the quality score data.

3. The parallel compression method for the quality scores of gene sequencing data according to claim 1, characterized in that: in step 2), a quality score evaluation function is applied to each line of quality score data in turn to obtain its score x; if x is greater than a given threshold θ, the line is put into class 1; otherwise the line is put into class 2 and an empty line is put into class 1.

4. The parallel compression method for the quality scores of gene sequencing data according to claim 1, characterized in that: in step 3), the count t of consecutive times class 2 has been emptied is initialized to 0; each time a line of quality scores is classified, the data sizes of class 1 and class 2, S₁ bytes and S₂ bytes, are computed; letting i be the index of a class, when some Sᵢ exceeds a given size of k bytes, or no more quality score data will be added to class i, the data in class i is put into the compute buffer queue as one data block and the class is emptied; at that point, if i is 2, the count t is increased by 1, and otherwise t is set to 0; if t reaches a given number T, t is set to 0, the data in class 1 is also put into the compute buffer queue, and class 1 is emptied; letting n be the number of data blocks the buffer queue can hold, when the compute buffer queue is full, the main thread waits until the queue has free space for new data and then continues working.

5. The parallel compression method for the quality scores of gene sequencing data according to claim 1, characterized in that: in step 4), when the compute buffer queue holds unprocessed data blocks and an idle computing unit exists, one idle computing unit takes a data block from the compute buffer queue; the computing unit applies a series of data transformation methods to the data block and then encodes it with the vectorization-optimized ZPAQ, specifying the input data and the compression model; if the width of a model's unprocessed data is less than the vector width, the pre-optimization scheme is used instead of the vectorized scheme; after encoding is complete, the computing unit puts the compressed data block into the output buffer queue.

6. The parallel compression method for the quality scores of gene sequencing data according to claim 1, characterized in that: in step 5), the output processing unit maintains the compressed data file; the file consists of three parts, a header, a data area, and an index table, whose contents are as follows:

Header: records the global information of the file, including the parameters needed for the transformation in step 4), the number of lines in class 1, the number of lines in class 2, and the starting position of the index table in the file, where the number of lines in class 1 includes empty lines;

Data area: contains each compressed data block;

Index table: contains as many index blocks as there are compressed data blocks; each index block records information about its corresponding data block, including the data block's label, its starting position in the compressed file, the number of bytes it occupies, the line number in the original file of its first line, the line number in the original file of its last line, the number of empty lines that the block's class contains before the block's first line, and the class to which the block belongs;

The output processing unit first outputs the header information, reserving space in the header for the starting position of the index table; it then outputs the compressed data blocks from the output buffer queue one by one, generating the information for the corresponding index block as it goes; after all data blocks have been output, it outputs each index block of the index table and records the starting position of the index table in the header.