CN110349635A - Parallel compression method for gene sequencing data quality scores - Google Patents


Info

Publication number
CN110349635A
CN110349635A CN201910499892.7A CN201910499892A CN110349635A CN 110349635 A CN110349635 A CN 110349635A CN 201910499892 A CN201910499892 A CN 201910499892A CN 110349635 A CN110349635 A CN 110349635A
Authority
CN
China
Prior art keywords
data
classification
quality score
block
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910499892.7A
Other languages
Chinese (zh)
Other versions
CN110349635B (en)
Inventor
董守斌
柯璧新
付佳兵
胡金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT
Priority to CN201910499892.7A
Publication of CN110349635A
Application granted
Publication of CN110349635B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data

Abstract

The invention discloses a parallel compression method for gene sequencing data quality scores, comprising the steps of: 1) splitting the FASTQ file data to obtain the quality score portion; 2) computing, line by line, a score for each line of quality scores and classifying the line according to that score; 3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, putting the quality scores of that class into a computation buffer queue as one data block and clearing the class; 4) having an idle computing unit take one data block from the computation buffer queue, transform it, encode it with a vectorization-optimized ZPAQ, and put the result into an output buffer queue; 5) having an output processing unit write the compressed data until all compressed data has been output, and then adding the maintenance information. The technical solution offers high performance and good scalability.

Description

Parallel compression method for gene sequencing data quality scores
Technical field
The present invention relates to the technical field of compressing biological gene sequencing data, and in particular to a parallel compression method for gene sequencing data quality scores.
Background
With the development of second-generation sequencing technology, the cost of gene sequencing has fallen rapidly, while the share of total expense spent on storing and transmitting sequencing data has risen to an unprecedented level. Reducing the storage and transmission cost of gene sequencing data is therefore of great importance. Data compression effectively reduces the size of sequencing data and is a key technology for lowering these costs. Although research on gene data compression tools has produced many results, many compression schemes are slow, which severely limits their practicality.
FASTQ has become the general format for gene sequencing data. The quality score data in a FASTQ file is highly random and noisy, is comparatively hard to compress, and typically accounts for about 70% of the compressed file. How well the quality scores compress therefore largely determines how well the whole FASTQ file compresses. In data compression, a model that better fits the actual distribution of the data achieves better compression, but a complex model brings exponentially growing computational overhead. On the one hand, a compression algorithm must balance compression ratio against computing resource consumption. On the other hand, improvements in hardware performance, such as the widespread use of multi-core processors, make better compression achievable.
At present, general-purpose compression tools such as Gzip are still the most common way to compress FASTQ data. These tools are mature, have been validated in many scenarios, and are stable and robust. However, they take into account neither the structure of the FASTQ format nor the specific semantics of sequencing data, so neither their compression ratio nor their compression speed can be fully optimized, and they are of limited use against explosively growing volumes of sequencing data. In recent years, researchers have proposed dedicated alternatives such as Quip, SCALCE, and QVZ. These schemes take the characteristics of FASTQ data fully into account and greatly improve the compression ratio. However, because they are designed mainly to improve compression ratio, they often introduce complex models and either pay little attention to performance optimization or have structures that are hard to optimize, so their processing speed remains far from what practical applications require.
The present invention provides a parallel compression method for gene sequencing data quality scores. It takes into account the characteristics of quality score data and of the compression process, and can be used to accelerate existing quality score compression methods in parallel, yielding a parallel compression method with high processing speed and good scalability.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings and deficiencies of the prior art by proposing a parallel compression method for gene sequencing data quality scores. The method effectively increases the processing speed of quality score compression and scales well. It addresses the lack of practicality of dedicated quality score compression tools caused by their low processing speed, and meets the need to compress gene sequencing data efficiently in a big-data setting.
To achieve the above purpose, the technical solution provided by the present invention is a parallel compression method for gene sequencing data quality scores, comprising the following steps:
1) split the FASTQ file data to obtain the quality score portion;
2) line by line, compute a score for each line of quality scores and classify the line according to that score into class 1 or class 2;
3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, put the quality scores of that class into the computation buffer queue as one data block and clear the data in the class;
4) have an idle computing unit take one data block from the computation buffer queue, transform it, encode it with the vectorization-optimized ZPAQ, and put the result into the output buffer queue;
5) have the output processing unit write the compressed data until all compressed data has been output, and then add the maintenance information.
In step 1), the main thread reads the input FASTQ file data; for every 4 lines of input it obtains, it keeps the 4th line, which is the quality score data.
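As a minimal sketch of step 1), assuming a well-formed FASTQ input with exactly four lines per record, the quality lines can be selected by position alone. The function name `quality_lines` and the sample records are illustrative and do not come from the patent.

```python
import io

def quality_lines(fastq_stream):
    """Yield the 4th line of every 4-line FASTQ record (the quality string)."""
    for line_number, line in enumerate(fastq_stream):
        if line_number % 4 == 3:
            yield line.rstrip("\n")

# Two made-up records: identifier, bases, separator, quality scores.
sample = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nGGCA\n+\n#5II\n"
)
qualities = list(quality_lines(sample))
```

Here `qualities` holds one string per read, each as long as the corresponding base sequence.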
In step 2), a quality score evaluation function is applied to each line of quality score data in turn to obtain its score x. If x is greater than a given threshold θ, the line is put into class 1; otherwise, the line is put into class 2 and an empty line is put into class 1.
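The classification rule can be sketched as follows. The patent does not give the evaluation function, so `line_score` below is a purely hypothetical stand-in (the share of the line taken by its most frequent symbol). The `merge` function shows one plausible reading of why an empty line is placed in class 1: it marks where a class 2 line belongs, so the original order can be rebuilt. That reading assumes real quality lines are never empty.

```python
from collections import Counter

def line_score(quality):
    """Hypothetical score: share of the line taken by its most common symbol."""
    if not quality:
        return 0.0
    most_common_count = Counter(quality).most_common(1)[0][1]
    return most_common_count / len(quality)

def classify(lines, theta):
    """Split lines into class 1 (score > theta) and class 2 (everything else)."""
    class1, class2 = [], []
    for quality in lines:
        if line_score(quality) > theta:
            class1.append(quality)
        else:
            class2.append(quality)
            class1.append("")  # empty placeholder marks the class 2 line's position
    return class1, class2

def merge(class1, class2):
    """Invert classify(): an empty class 1 entry means 'take the next class 2 line'."""
    remaining = iter(class2)
    return [quality if quality else next(remaining) for quality in class1]

lines = ["IIIIIIII", "#5I?I#A9", "IIIIIII#"]
class1, class2 = classify(lines, theta=0.7)
```

With this stand-in score, uniform lines land in class 1 and noisy lines in class 2, so each class is more homogeneous than the mixed input.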
In step 3), the count t of consecutive times class 2 has been emptied is initialized to 0. Each time a line of quality scores is classified, the data sizes of class 1 and class 2, S_1 bytes and S_2 bytes, are computed. Let i be the index of a class. When S_i exceeds a given size of k bytes, or class i will receive no more quality score data, the data in class i is put into the computation buffer queue as one data block and the data in that class is cleared. If i is 2, the count t is incremented by 1; otherwise t is set to 0. If t reaches a given number T, t is set to 0, and the data in class 1 is also put into the computation buffer queue and cleared from class 1. Let n be the number of data blocks the buffer queue can hold. When the computation buffer queue is full, the main thread waits until the queue has free space for new data and then continues working.
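The bookkeeping with k, t, and T can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the input is a sequence of `(class_id, line)` pairs, each line is counted with one extra byte for its newline, the queue is modelled as a plain list, and an empty class 1 buffer is not queued when t reaches T. The name `make_blocks` is hypothetical.

```python
def make_blocks(classified_lines, k, T):
    """Group (class_id, line) pairs into blocks.

    Returns a list of (class_id, [lines]) in the order the blocks were queued.
    """
    buffers = {1: [], 2: []}
    sizes = {1: 0, 2: 0}
    queued = []
    t = 0  # consecutive times class 2 has been emptied

    def flush(i):
        queued.append((i, buffers[i]))
        buffers[i] = []
        sizes[i] = 0

    for class_id, line in classified_lines:
        buffers[class_id].append(line)
        sizes[class_id] += len(line) + 1  # assumed: +1 byte for the newline
        for i in (1, 2):
            if sizes[i] > k:
                flush(i)
                t = t + 1 if i == 2 else 0
        if t >= T:
            t = 0
            if buffers[1]:
                flush(1)  # force class 1 out so it does not lag far behind class 2
    # End of input: no class will receive more data, so flush what is left.
    for i in (1, 2):
        if buffers[i]:
            flush(i)
    return queued

items = [(2, "AAAA"), (2, "BBBB"), (1, "IIII"), (2, "CCCC"), (2, "DDDD"), (1, "JJ")]
blocks = make_blocks(items, k=8, T=2)
```

In this run class 2 fills twice in a row, so the pending class 1 data is queued right after the second class 2 block even though it is still below k bytes.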
In step 4), when the computation buffer queue holds unprocessed data blocks and an idle computing unit exists, an idle computing unit takes one data block from the computation buffer queue. The computing unit applies a series of data transformations to the block and then encodes it with the vectorization-optimized ZPAQ, given the input data and the compression model. If the width of the data the model has yet to process is smaller than the vector width, the pre-optimization scheme is used instead of the vectorized scheme. After encoding is complete, the computing unit puts the compressed data block into the output buffer queue.
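The patent does not spell out the transformations, but the embodiment describes a lightweight, reversible mapping of several consecutive quality scores to a single byte. The sketch below is one hypothetical instance of such a mapping: when a block uses at most 16 distinct symbols, each symbol gets a 4-bit code and two codes share a byte. The names `pack_block` and `unpack_block`, and the 16-symbol limit, are assumptions made for illustration.

```python
def pack_block(data: bytes):
    """Map each symbol to a 4-bit code and pack two codes per byte.

    Returns (alphabet, length, packed), or None if the block uses more
    than 16 distinct symbols and cannot be packed this way.
    """
    alphabet = bytes(sorted(set(data)))
    if len(alphabet) > 16:
        return None
    code = {symbol: index for index, symbol in enumerate(alphabet)}
    packed = bytearray()
    for position in range(0, len(data), 2):
        high = code[data[position]]
        # Pad an odd-length block with code 0; `length` removes it on unpack.
        low = code[data[position + 1]] if position + 1 < len(data) else 0
        packed.append((high << 4) | low)
    return alphabet, len(data), bytes(packed)

def unpack_block(alphabet: bytes, length: int, packed: bytes) -> bytes:
    """Invert pack_block() using the stored alphabet and original length."""
    out = bytearray()
    for byte in packed:
        out.append(alphabet[byte >> 4])
        out.append(alphabet[byte & 0x0F])
    return bytes(out[:length])

block = b"IIII#5II\nIIIIIII#\n"
alphabet, length, packed = pack_block(block)
restored = unpack_block(alphabet, length, packed)
```

The alphabet and length play the role of transform parameters that must be stored alongside the compressed data so that decompression can reverse the mapping.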
In step 5), the output processing unit maintains the compressed data file. The file consists of three parts, a header, a data area, and an index table, with the following contents:
Header: records the global information of the file, including the parameters needed for the transformation in step 4), the number of lines in class 1, the number of lines in class 2, and the start position of the index table in the file, where the class 1 line count includes empty lines;
Data area: contains each compressed data block;
Index table: contains as many index blocks as there are compressed data blocks. Each index block records information about its corresponding data block, including the block label, the start position of the block in the compressed file, the number of bytes the block occupies, the line number in the original file of the block's first line, the line number in the original file of the block's last line, the number of empty lines that precede the block's first line within the block's class, and the class the block belongs to;
The output processing unit first writes the header information, reserving space in the header for the start position of the index table. It then writes the compressed data blocks from the output buffer queue in turn and generates the corresponding index block information for each. After all data blocks have been written, it writes each index block of the index table and records the start position of the index table in the header.
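The write order described above (header with a reserved slot, data blocks, index table, then patching the header) can be sketched with Python's `struct` module. The field widths, the byte order, and the reduced set of fields are assumptions made to keep the example short; the patent lists more index fields (line numbers, empty-line counts) and transform parameters than are modelled here.

```python
import io
import struct

# Assumed layouts, little-endian without padding.
HEADER = struct.Struct("<QQQ")  # class 1 line count, class 2 line count, index table offset
INDEX = struct.Struct("<IQQB")  # block label, start offset, byte length, class id

def write_container(out, blocks, class1_lines, class2_lines):
    """Write blocks given as (label, class_id, payload) in completion order."""
    out.write(HEADER.pack(class1_lines, class2_lines, 0))  # offset slot reserved
    index = []
    for label, class_id, payload in blocks:
        index.append((label, out.tell(), len(payload), class_id))
        out.write(payload)
    index_offset = out.tell()
    for entry in index:
        out.write(INDEX.pack(*entry))
    out.seek(0)  # go back and record where the index table starts
    out.write(HEADER.pack(class1_lines, class2_lines, index_offset))

def read_blocks(src):
    """Return {label: (class_id, payload)} using only the header and index table."""
    src.seek(0)
    _, _, index_offset = HEADER.unpack(src.read(HEADER.size))
    src.seek(0, io.SEEK_END)
    count = (src.tell() - index_offset) // INDEX.size
    result = {}
    for n in range(count):
        src.seek(index_offset + n * INDEX.size)
        label, start, length, class_id = INDEX.unpack(src.read(INDEX.size))
        src.seek(start)
        result[label] = (class_id, src.read(length))
    return result

buffer = io.BytesIO()
# Block 1 finished before block 0, so it is written first.
write_container(buffer, [(1, 2, b"bbb"), (0, 1, b"aaaaa")], class1_lines=10, class2_lines=4)
restored = read_blocks(buffer)
```

Because each index entry carries the block label and offset, a reader can fetch a single block or restore the original order without scanning the data area.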
Compared with the prior art, the present invention has the following advantages and beneficial effects:
1. The invention is broadly applicable: it can compress FASTQ quality score data produced by a variety of sequencing devices, can greatly increase the processing speed of gene data compression methods, and makes the compression algorithm more practical.
2. The invention takes into account that some downstream applications access compressed FASTQ quality score data with locality. It indexes the data in blocks and uses an efficient multithreaded parallel computation scheme, which effectively meets the performance requirements of practical applications.
3. The invention applies vectorization optimization to the commonly used compression module ZPAQ, so it can be applied widely to compression schemes that use this module and has good extensibility.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the structure of gene sequencing information.
Fig. 3 is a schematic diagram of the task allocation among threads.
Fig. 4 is a schematic diagram of the compressed file structure of the present invention.
Detailed description of the embodiments
The present invention is further described below with reference to a specific embodiment.
As shown in Fig. 1, the parallel compression method for gene sequencing data quality scores provided by this embodiment comprises the following steps:
1) In the input FASTQ file, every 4 lines represent one piece of gene sequencing information, as shown in Fig. 2. The 4th of these lines holds the quality scores. It has the same length as the base sequence in the 2nd line, and each quality score represents the sequencing accuracy of the base at the same position in the 2nd line. Only the quality scores are retained and passed on to the compression process below.
2) For the extracted quality scores, the main thread computes a score for each line; a higher score means the line contains more high-frequency substrings. According to this score, the quality scores are assigned line by line to class 1 or class 2. Score computation, data partitioning, and the subsequent compression form different stages of a pipeline: the main thread handles the score computation and partitioning stages, and multiple worker threads handle the more time-consuming transformation and encoding stages. The division of labor among threads is shown in Fig. 3.
3) When the quality score data of a class reaches 32 MB, the data in that class is put into the computation buffer queue as one data block. If class 2 reaches 32 MB three times in a row, the data in class 1 is also put into the computation buffer queue. Whenever class 1 data is put into the computation buffer queue, the count of consecutive times class 2 has been emptied is reset to 0. After all quality score data has been read and classified, the main thread puts the data remaining in each of the two classes into the computation buffer queue.
The size of the computation buffer queue is set to twice the number of worker threads. While the worker threads process data from the computation buffer queue, the main thread can keep adding data to it, so that a worker thread that finishes its current block has ready data to process immediately, which reduces idle waiting time.
Semaphores and a mutex are used to maintain the state of the computation buffer queue. When the buffer queue is full, the main thread is suspended and releases processor resources. When a worker thread finishes computing and frees a data block slot in the buffer queue, the main thread is woken up and continues adding data to the queue. After all data blocks have been added, the main thread puts empty data blocks carrying an end marker into the free slots of the buffer queue to signal the end of input. The mutex ensures that the global information of the buffer queue is accessed by only one thread at a time, which avoids race conditions and protects the correctness of the data in a concurrent environment.
4) When the computation buffer queue holds unprocessed data blocks and an idle worker thread exists, an idle worker thread takes one data block from the computation buffer queue. This is managed with semaphores: once a worker thread is ready, it waits for data blocks to arrive, and it terminates when it receives an empty data block carrying an end marker. The semaphores ensure that pending data blocks are distributed to the worker threads safely.
Quality score data is represented with printable ASCII characters, which is convenient for direct human inspection but makes the representation rather redundant. The worker thread first applies a lightweight transformation to the data block according to the distribution of the data, reversibly mapping several consecutive quality scores to a single byte. It then encodes the transformed data with ZPAQ, using a specifically customized combination of models, to obtain the final compressed data block, and puts that block into the output buffer queue to await output.
5) The output thread first writes the header information and moves the file pointer to the data area. Whenever a new data block is added to the output buffer queue, the output thread writes the compressed data block. This continues until all data blocks have been written, that is, until all computing threads have terminated and the output buffer queue is empty. The output thread then writes the index blocks of the index table, and finally records in the header the start position of the index table in the file, which completes the output. The file structure is shown in Fig. 4.
Because the index table records the information of every block, the output thread can write a data block as soon as its compression is complete. It does not need to wait in order to preserve the input order, so blocks are written promptly and worker threads are released. During decompression, the original order of the data blocks is simply read from the index blocks. The index blocks also mean that an application that needs only local data can decompress just the relevant data blocks, which improves access efficiency.
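The producer, worker, and output roles in this embodiment can be sketched with Python's standard `queue` and `threading` modules. Several substitutions are made for illustration and are not part of the patent: `queue.Queue` handles the blocking and locking that the embodiment implements with semaphores and a mutex, `zlib` stands in for the vectorized ZPAQ coder, `None` serves as the end marker, and results are collected in a list instead of being written to a file.

```python
import queue
import threading
import zlib

def run_pipeline(blocks, workers=2):
    """Compress blocks in parallel; return (label, payload) in completion order."""
    compute_queue = queue.Queue(maxsize=2 * workers)  # twice the worker count
    output_queue = queue.Queue()
    results = []

    def worker():
        while True:
            item = compute_queue.get()
            if item is None:  # end marker: tell the writer, then exit
                output_queue.put(None)
                return
            label, data = item
            output_queue.put((label, zlib.compress(data)))  # stand-in for ZPAQ

    def writer():
        finished = 0
        while finished < workers:
            item = output_queue.get()
            if item is None:
                finished += 1
            else:
                results.append(item)  # emitted as soon as it arrives

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    threads.append(threading.Thread(target=writer))
    for thread in threads:
        thread.start()
    for label, data in enumerate(blocks):
        compute_queue.put((label, data))  # blocks while the queue is full
    for _ in range(workers):
        compute_queue.put(None)  # one end marker per worker
    for thread in threads:
        thread.join()
    return results

blocks = [bytes([65 + n]) * 1000 for n in range(8)]
results = run_pipeline(blocks)
# Labels play the role of the index table: sorting by label restores input order.
restored = [zlib.decompress(payload) for _, payload in sorted(results)]
```

The order of `results` may differ from run to run, which is exactly the situation the index table is meant to handle.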
The embodiment described above is only a preferred embodiment of the invention and does not limit the scope of its practice; any change made in accordance with the shape and principle of the invention shall fall within its scope of protection.

Claims (6)

1. A parallel compression method for gene sequencing data quality scores, characterized by comprising the following steps:
1) split the FASTQ file data to obtain the quality score portion;
2) line by line, compute a score for each line of quality scores and classify the line according to that score into class 1 or class 2;
3) when the amount of quality score data in a class reaches a threshold, or no more quality scores will be added to the class, put the quality scores of that class into the computation buffer queue as one data block and clear the data in the class;
4) have an idle computing unit take one data block from the computation buffer queue, transform it, encode it with the vectorization-optimized ZPAQ, and put the result into the output buffer queue;
5) have the output processing unit write the compressed data until all compressed data has been output, and then add the maintenance information.
2. The parallel compression method for gene sequencing data quality scores according to claim 1, characterized in that: in step 1), the main thread reads the input FASTQ file data; for every 4 lines of input it obtains, it keeps the 4th line, which is the quality score data.
3. The parallel compression method for gene sequencing data quality scores according to claim 1, characterized in that: in step 2), a quality score evaluation function is applied to each line of quality score data in turn to obtain its score x; if x is greater than a given threshold θ, the line is put into class 1; otherwise, the line is put into class 2 and an empty line is put into class 1.
4. The parallel compression method for gene sequencing data quality scores according to claim 1, characterized in that: in step 3), the count t of consecutive times class 2 has been emptied is initialized to 0; each time a line of quality scores is classified, the data sizes of class 1 and class 2, S_1 bytes and S_2 bytes, are computed; let i be the index of a class; when S_i exceeds a given size of k bytes, or class i will receive no more quality score data, the data in class i is put into the computation buffer queue as one data block and the data in that class is cleared; if i is 2, the count t is incremented by 1, otherwise t is set to 0; if t reaches a given number T, t is set to 0, and the data in class 1 is also put into the computation buffer queue and cleared from class 1; let n be the number of data blocks the buffer queue can hold; when the computation buffer queue is full, the main thread waits until the queue has free space for new data and then continues working.
5. The parallel compression method for gene sequencing data quality scores according to claim 1, characterized in that: in step 4), when the computation buffer queue holds unprocessed data blocks and an idle computing unit exists, an idle computing unit takes one data block from the computation buffer queue; the computing unit applies a series of data transformations to the block and then encodes it with the vectorization-optimized ZPAQ, given the input data and the compression model; if the width of the data the model has yet to process is smaller than the vector width, the pre-optimization scheme is used instead of the vectorized scheme; after encoding is complete, the computing unit puts the compressed data block into the output buffer queue.
6. The parallel compression method for gene sequencing data quality scores according to claim 1, characterized in that: in step 5), the output processing unit maintains the compressed data file; the file consists of three parts, a header, a data area, and an index table, with the following contents:
Header: records the global information of the file, including the parameters needed for the transformation in step 4), the number of lines in class 1, the number of lines in class 2, and the start position of the index table in the file, where the class 1 line count includes empty lines;
Data area: contains each compressed data block;
Index table: contains as many index blocks as there are compressed data blocks; each index block records information about its corresponding data block, including the block label, the start position of the block in the compressed file, the number of bytes the block occupies, the line number in the original file of the block's first line, the line number in the original file of the block's last line, the number of empty lines that precede the block's first line within the block's class, and the class the block belongs to;
the output processing unit first writes the header information, reserving space in the header for the start position of the index table; it then writes the compressed data blocks from the output buffer queue in turn and generates the corresponding index block information for each; after all data blocks have been written, it writes each index block of the index table and records the start position of the index table in the header.
CN201910499892.7A 2019-06-11 2019-06-11 Parallel compression method for gene sequencing data quality fraction Active CN110349635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910499892.7A CN110349635B (en) 2019-06-11 2019-06-11 Parallel compression method for gene sequencing data quality fraction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910499892.7A CN110349635B (en) 2019-06-11 2019-06-11 Parallel compression method for gene sequencing data quality fraction

Publications (2)

Publication Number Publication Date
CN110349635A true CN110349635A (en) 2019-10-18
CN110349635B CN110349635B (en) 2021-06-11

Family

ID=68181767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910499892.7A Active CN110349635B (en) 2019-06-11 2019-06-11 Parallel compression method for gene sequencing data quality fraction

Country Status (1)

Country Link
CN (1) CN110349635B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782609A (en) * 2020-05-22 2020-10-16 北京和瑞精准医学检验实验室有限公司 Method for rapidly and uniformly fragmenting fastq file
CN114489518A (en) * 2022-03-28 2022-05-13 山东大学 Sequencing data quality control method and system

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8594951B2 (en) * 2011-02-01 2013-11-26 Life Technologies Corporation Methods and systems for nucleic acid sequence analysis
CN103559020B (en) * 2013-11-07 2016-07-06 中国科学院软件研究所 A kind of DNA reads ordinal number according to the compression of FASTQ file in parallel and decompression method
CA2980769A1 (en) * 2015-04-02 2016-10-06 The Jackson Laboratory Method for detecting genomic variations using circularised mate-pair library and shotgun sequencing
CN107851118A (en) * 2015-05-21 2018-03-27 基因福米卡数据系统有限公司 Storage, transmission and the compression of sequencing data of future generation
JP6946292B2 (en) * 2015-08-06 2021-10-06 エイアールシー バイオ リミテッド ライアビリティ カンパニー Systems and methods for genome analysis
CN105391454B (en) * 2015-12-14 2017-08-11 季检 A kind of DNA sequencing qualities fraction lossless compression method
WO2018031739A1 (en) * 2016-08-10 2018-02-15 New York Genome Center, Inc. Ultra-low coverage genome sequencing and uses thereof
CN108287983A (en) * 2017-01-09 2018-07-17 朱瑞星 A kind of method and apparatus for carrying out compression and decompression to genome
WO2018152542A1 (en) * 2017-02-17 2018-08-23 The Board Of Trustees Of The Leland Stanford Junior University Accurate and sensitive unveiling of chimeric biomolecule sequences and applications thereof
CN106991134B (en) * 2017-03-13 2019-04-05 人和未来生物科技(长沙)有限公司 A kind of large data cloud storage method based on object storage
CN107704728B (en) * 2017-09-26 2021-01-19 华南理工大学 Cloud computing acceleration method for gene sequence comparison

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111782609A (en) * 2020-05-22 2020-10-16 北京和瑞精准医学检验实验室有限公司 Method for rapidly and uniformly fragmenting fastq file
CN111782609B (en) * 2020-05-22 2023-10-13 北京和瑞精湛医学检验实验室有限公司 Method for rapidly and uniformly slicing fastq file
CN114489518A (en) * 2022-03-28 2022-05-13 山东大学 Sequencing data quality control method and system

Also Published As

Publication number Publication date
CN110349635B (en) 2021-06-11

Similar Documents

Publication Publication Date Title
CN110059067B (en) Water conservancy space vector big data storage management method
CN104361113B (en) A kind of OLAP query optimization method under internal memory flash memory mixing memory module
CN106709067A (en) Multi-source heterogeneous spatial data flow method based on Oracle database
CN106547882A (en) A kind of real-time processing method and system of big data of marketing in intelligent grid
CN103309958A (en) OLAP star connection query optimizing method under CPU and GPU mixing framework
CN110222029A (en) A kind of big data multidimensional analysis computational efficiency method for improving and system
CN105808358B (en) A kind of data dependence thread packet mapping method for many-core system
CN107247624A (en) A kind of cooperative optimization method and system towards Key Value systems
JP4267707B2 (en) N-way processing of bit strings in data flow architecture
CN110349635A (en) A kind of parallel compression method of gene sequencing quality of data score
CN106055590A (en) Power grid data processing method and system based on big data and graph database
CN106844607A (en) A kind of SQLite data reconstruction methods suitable for non-integer major key and idle merged block
CN103198099A (en) Cloud-based data mining application method facing telecommunication service
CN110427404A (en) A kind of across chain data retrieval system of block chain
CN102136007B (en) Small world property-based engineering information organization method
CN110134646A (en) The storage of knowledge platform service data and integrated approach and system
CN113254517A (en) Service providing method based on internet big data
CN106815320B (en) Investigation big data visual modeling method and system based on expanded three-dimensional histogram
CN108334532A (en) A kind of Eclat parallel methods, system and device based on Spark
CN103942235A (en) Distributed computation system and method for large-scale data set cross comparison
CN107168989A (en) One kind is multi-source heterogeneous to isolate structural data method for transformation and system
CN103714192A (en) Adaptive R-tree based large-data-volume three-dimensional railway design model rendering method
CN107784032A (en) Gradual output intent, the apparatus and system of a kind of data query result
CN110162531B (en) Distributed concurrent data processing task decision method
JPWO2011099114A1 (en) Hybrid database system and operation method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant