CN117133365A - High-throughput genome sequencing quality score data parallel compression method - Google Patents
- Publication number
- CN117133365A (application number CN202311018059.9A)
- Authority
- CN
- China
- Prior art keywords
- file
- sequence
- compressed
- parallel
- ith
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
- G16B50/50—Compression of genetic data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention relates to the technical field of data compression and storage, and provides a high-throughput genome sequencing quality score data parallel compression method. The method comprises the following steps: dividing an original gene sequencing file; randomly sampling the file to be compressed and carrying out k-mer analysis on the sampled data to obtain statistical characteristic information, establishing a parallel sequence partition model for binary classification, and splicing the two partition files obtained by the binary classification according to a splicing parameter; predicting a compression ratio gain for the file to be compressed by a multiple linear regression analysis prediction method, and establishing a parallel four-level run prediction mapping model to eliminate data redundancy; and carrying out context modeling on the two redundancy elimination sub-files through a multi-core processor cluster, and carrying out cascade compression in combination with arithmetic coding to obtain a final compressed file. While significantly reducing the compression time and peak memory overhead for quality score data, the invention also improves the quality score data compression ratio, reduces the size of the stored file, and saves the construction cost of storage infrastructure.
Description
Technical Field
The invention relates to the technical field of data compression and storage, in particular to a high-throughput genome sequencing quality score data parallel compression method.
Background
High-Throughput Genome Sequencing Data (HTGSD) is an important type of biological big data and is widely used in fields such as drug development, genome-wide association analysis, virus tracing, and precision diagnosis and treatment. In recent years, with the rapid development of high-throughput sequencing technology, the cost of HTGSD sequencing has fallen significantly, and as a result the volume of HTGSD has grown faster than Moore's law. For example, the national gene bank sequence archive system alone holds 11372 TB of high-throughput genome sequencing data compressed with the GZIP algorithm. This growth in scale reflects the importance and widespread use of biotechnology in the current era, and also opens up more room and opportunity for the depth and breadth of genomics and biomedical research.
HTGSD is typically stored in the FastQ file format, in which the quality score data (Quality Scores Data, QSD) occupies up to 70% of the space in a losslessly compressed FastQ file. QSD therefore plays a key role in raising the compression ratio of a FastQ file, where the compression ratio is the ratio of the file size before compression to the file size after compression. How to compress and store large-scale high-throughput genome sequencing quality score data so as to keep pace with the generation speed of genome big data, and to reduce the cost of infrastructure construction and of data sharing and transmission, is an important problem that currently needs to be solved.
With the continuous development of high-throughput sequencing technology, traditional serial compression algorithms for high-throughput genome sequencing quality score data and simple CPU parallel compression algorithms have difficulty meeting the requirement of rapid, real-time processing of massive data. At the same time, existing lossless compression algorithms still leave considerable room for improving the compression ratio.
Disclosure of Invention
The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides a high-throughput genome sequencing quality score data parallel compression method.
The invention provides a high-throughput genome sequencing quality score data parallel compression method, which comprises the following steps:
S100: dividing an original gene sequencing file to obtain quality score data as a file to be compressed;
S200: randomly sampling the file to be compressed, carrying out k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
S400: predicting a compression ratio gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression ratio gain, and carrying out data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
S500: carrying out context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and carrying out cascade compression in combination with arithmetic coding to obtain a final compressed file.
The invention provides a high-throughput genome sequencing quality score data parallel compression method, which further comprises the following steps:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data and sequencing quality score data.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel sequence partition model in the step S200 is as follows:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the partition result label computed by the p-th processor computing core for the i-th sequence; I is the indicator function, whose value is 0 when its argument is less than α; α is the partition threshold; n is the average sequence length; k is the size of the sliding window used for feature extraction in the k-mer analysis; M is the number of sequences randomly sampled from the file to be compressed; i is the first index value; j is the second index value; i′ is the third index value; j′ is the fourth index value; NFactor is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the k-mer term of the i-th sequence is the k-mer sequence of length k acquired by the p-th processor computing core at the j-th position of the i-th sequence; and q_{i′,j′:j′+k+1} is the k-mer sequence at the j′-th position of the i′-th randomly sampled sequence q_{i′}.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in the step S400 is as follows:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the compression ratio gain; w_0, w_1, w_2, w_3 and w_4 are the first to fifth weight parameters calculated by the multiple linear regression analysis prediction method; the mode-character term is the mode character computed by the p-th processor computing core for the i-th sequence; n is the average sequence length; N is the number of spliced sequences; i is the first index value; j is the second index value; v is the fifth index value; I is the indicator function; the two character terms denote the characters at the j-th and (j-1)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core; the run-length term is the run length computed by the p-th processor computing core at the j-th position of the i-th sequence; |Θ| is the size of the character value space of the quality score data; and the final term is the size of the character value space of the k-mer sequence acquired by the p-th processor computing core at the j-th position of the i-th sequence.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel four-level run prediction mapping model in the step S400 is as follows:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the mapping result label computed by the p-th processor computing core for the j-th position of the i-th sequence of the v-th file; β is the dynamic run prediction switch; and the two character terms denote the characters at the (j+1)-th and (j+2)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core.
According to the high throughput genome sequencing quality score data parallel compression method provided by the invention, the step S500 comprises the following steps:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively according to a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing context modeling and cascade compression on the plurality of blocks to be compressed in parallel through the multi-core processor cluster to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the value range of the splicing parameters in the step S300 is as follows:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
The high-throughput genome sequencing quality score data parallel compression method provided by the invention aims to improve the compression ratio of large-scale high-throughput genome sequencing quality score data while reducing compression and decompression time and peak memory overhead. The method is built on the parallel sequence partition model, the parallel four-level run prediction mapping model and a multi-core CPU cluster. It improves the compression and storage efficiency of genome sequencing data and has important value in industrial and academic applications.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for parallel compression of high throughput genome sequencing quality score data according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "coupled," "coupled," and "connected" should be construed broadly, and may be either a fixed connection, a removable connection, or an integral connection, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium. The specific meaning of the above terms in embodiments of the present invention will be understood in detail by those of ordinary skill in the art.
In embodiments of the invention, unless expressly specified and limited otherwise, a first feature "up" or "down" on a second feature may be that the first and second features are in direct contact, or that the first and second features are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" and "on" a second feature may be a first feature being directly above or obliquely above the second feature, or simply indicating that the first feature is level higher than the second feature. The first feature being "under", "below" and "beneath" the second feature may be the first feature being directly under or obliquely below the second feature, or simply indicating that the first feature is less level than the second feature.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification, and the features of the different embodiments or examples, may be combined by those skilled in the art without contradiction.
An embodiment provided by the present invention is described below with reference to fig. 1.
The invention provides a high-throughput genome sequencing quality score data parallel compression method, which comprises the following steps:
S100: dividing an original gene sequencing file to obtain quality score data as a file to be compressed;
The original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data and sequencing quality score data.
Further, the original gene sequencing file is in FastQ format, and step S100 obtains, through data segmentation, the sequencing quality score data file to be compressed for the subsequent compression and storage operations.
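The data segmentation of step S100 can be illustrated with a minimal Python sketch. It assumes the common FastQ layout in which every record occupies exactly four lines (description, DNA sequence, additional information, quality scores); the function names `split_fastq` and `extract_quality_scores` are hypothetical and do not come from the patent.

```python
from typing import Iterable, Iterator, List


def extract_quality_scores(lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality score line (the 4th line) of each 4-line FastQ record."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


def split_fastq(lines: Iterable[str]) -> List[List[str]]:
    """Split FastQ lines into four streams.

    Stream order: description, DNA sequence, additional information, quality scores.
    Assumes exactly four lines per record (an assumption of this sketch).
    """
    streams: List[List[str]] = [[], [], [], []]
    for index, line in enumerate(lines):
        streams[index % 4].append(line.rstrip("\n"))
    return streams


# Two toy records; the quality lines become the "file to be compressed".
example = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "GGCA\n", "+\n", "FF:F\n",
]
quality = list(extract_quality_scores(example))
```

In practice the same generator can be fed an open file handle, so the quality score stream is produced without loading the whole FastQ file into memory.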
S200: randomly sampling the file to be compressed, carrying out k-mer analysis on sampling data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
The expression of the parallel sequence partition model in step S200 is:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the partition result label computed by the p-th processor computing core for the i-th sequence; I is the indicator function, whose value is 0 when its argument is less than α; α is the partition threshold; n is the average sequence length; k is the size of the sliding window used for feature extraction in the k-mer analysis; M is the number of sequences randomly sampled from the file to be compressed; i is the first index value; j is the second index value; i′ is the third index value; j′ is the fourth index value; NFactor is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the k-mer term of the i-th sequence is the k-mer sequence of length k acquired by the p-th processor computing core at the j-th position of the i-th sequence; and q_{i′,j′:j′+k+1} is the k-mer sequence at the j′-th position of the i′-th randomly sampled sequence q_{i′}.
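Because the partition expression itself is not available in this text, the following Python sketch only illustrates the general idea described above: sample M sequences, count k-mers with a sliding window of size k, normalize the counts, and label each sequence 0 or 1 by comparing a score against the threshold α. The scoring rule (mean normalized frequency of a sequence's k-mers) and all function names are assumptions of this sketch, not the patented formula.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def kmers(seq: str, k: int) -> List[str]:
    """Return every length-k window of seq (sliding window with step 1)."""
    return [seq[j:j + k] for j in range(len(seq) - k + 1)]


def sample_kmer_frequencies(seqs: Sequence[str], m: int, k: int,
                            seed: int = 0) -> Dict[str, float]:
    """Randomly sample up to m sequences and return normalized k-mer frequencies."""
    rng = random.Random(seed)
    sample = rng.sample(list(seqs), min(m, len(seqs)))
    counts = Counter(km for s in sample for km in kmers(s, k))
    n_factor = sum(counts.values())  # assumed normalization: total sampled k-mers
    if n_factor == 0:
        return {}
    return {km: c / n_factor for km, c in counts.items()}


def partition_label(seq: str, freqs: Dict[str, float], k: int, alpha: float) -> int:
    """Label a sequence 0 if its mean k-mer frequency is below alpha, else 1."""
    windows = kmers(seq, k)
    if not windows:
        return 0
    score = sum(freqs.get(km, 0.0) for km in windows) / len(windows)
    return 0 if score < alpha else 1


def partition(seqs: Sequence[str], freqs: Dict[str, float], k: int,
              alpha: float) -> Tuple[List[str], List[str]]:
    """Split sequences into two partition lists according to their labels."""
    first: List[str] = []
    second: List[str] = []
    for s in seqs:
        (first if partition_label(s, freqs, k, alpha) == 0 else second).append(s)
    return first, second
```

Each label depends only on the sequence itself and the shared frequency table, so the labelling loop can be distributed across processor cores without coordination, which matches the per-core formulation in the text.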
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
The value range of the splicing parameter in step S300 is as follows:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
S400: predicting a compression ratio gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression ratio gain, and carrying out data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
The expression of the compression ratio gain predicted by the multiple linear regression analysis prediction method in step S400 is as follows:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the compression ratio gain; w_0, w_1, w_2, w_3 and w_4 are the first to fifth weight parameters calculated by the multiple linear regression analysis prediction method; the mode-character term is the mode character computed by the p-th processor computing core for the i-th sequence; n is the average sequence length; N is the number of spliced sequences; i is the first index value; j is the second index value; v is the fifth index value; I is the indicator function; the two character terms denote the characters at the j-th and (j-1)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core; the run-length term is the run length computed by the p-th processor computing core at the j-th position of the i-th sequence; |Θ| is the size of the character value space of the quality score data; and the final term is the size of the character value space of the k-mer sequence acquired by the p-th processor computing core at the j-th position of the i-th sequence.
Further, the objective of the multiple linear regression prediction is to solve for the optimal weight vector of the parallel four-level run prediction mapping model. In the above formula, the parameter multiplied by the first weight is the proportion of mode characters among the total characters of the sequence; the parameter multiplied by the second weight parameter is the proportion of characters with adjacent quality values larger than 7 among the total characters of the string sequence; the parameter multiplied by the third weight parameter is the proportion of characters with run length larger than 3 among the total characters of the sequence; and the parameter multiplied by the fourth weight parameter is the ratio of the value space occupied by the characters of the current sequence to the source character value space.
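The four features described above can be sketched in Python for a single quality score string. Several points are assumptions of this sketch rather than statements of the patent: "adjacent quality values larger than 7" is read as an absolute difference greater than 7 between neighbouring characters, w_0 is treated as an intercept with w_1 to w_4 multiplying the four features, and the names `sequence_features` and `predict_gain` are hypothetical.

```python
from collections import Counter
from itertools import groupby
from typing import List, Sequence


def sequence_features(seq: str, alphabet_size: int,
                      diff_threshold: int = 7,
                      run_threshold: int = 3) -> List[float]:
    """Compute four per-sequence features as proportions.

    1. share of characters equal to the mode (most frequent) character
    2. share of positions whose quality differs from the previous one by
       more than diff_threshold (assumed reading of the text)
    3. share of characters that sit inside a run longer than run_threshold
    4. distinct characters used, relative to the source alphabet size
    """
    n = len(seq)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    mode_count = Counter(seq).most_common(1)[0][1]
    jumps = sum(1 for a, b in zip(seq, seq[1:])
                if abs(ord(b) - ord(a)) > diff_threshold)
    run_lengths = [len(list(group)) for _, group in groupby(seq)]
    in_long_runs = sum(length for length in run_lengths if length > run_threshold)
    return [mode_count / n, jumps / n, in_long_runs / n,
            len(set(seq)) / alphabet_size]


def predict_gain(features: Sequence[float], weights: Sequence[float]) -> float:
    """Linear prediction: weights[0] is an assumed intercept, the rest pair with features."""
    intercept, *coefficients = weights
    return intercept + sum(w * f for w, f in zip(coefficients, features))
```

The weights would be fitted by ordinary least squares on measured compression ratio gains; that fitting step is left out here to keep the sketch dependency-free.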
The expression of the parallel four-level run prediction mapping model in step S400 is:
[Expression omitted in the source text.]
wherein the left-hand side of the expression is the mapping result label computed by the p-th processor computing core for the j-th position of the i-th sequence of the v-th file; β is the dynamic run prediction switch; and the two character terms denote the characters at the (j+1)-th and (j+2)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core.
Further, the four parameters that are multiplied by the weight parameters in the expression of the compression ratio gain predicted by the multiple linear regression analysis prediction method are recorded as the sequence feature vector, and the parallel four-level run prediction mapping model is established by taking this sequence feature vector as input.
S500: carrying out context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and carrying out cascade compression in combination with arithmetic coding to obtain a final compressed file.
Step S500 includes:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively according to a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing context modeling and cascade compression on the plurality of blocks to be compressed in parallel through the multi-core processor cluster to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
The method further includes:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
Furthermore, the ZPAQ algorithm is used for both cascade compression and cascade decompression; the final compressed file is Q.save, and the losslessly recovered output file is Q.recovery.
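The block, compress, merge and decompress flow of steps S510 to S600 can be sketched in Python as follows. This is only an illustration of the data flow: `zlib` from the standard library stands in for the ZPAQ context modeling and arithmetic coding stage, a thread pool on one machine stands in for the multi-core processor cluster, and the length-prefixed container layout is an invention of this sketch.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """S510: cut the input into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_parallel(data: bytes, block_size: int = 1 << 20,
                      workers: int = 4) -> bytes:
    """S520 + S530: compress blocks concurrently, then pack them into one blob."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(zlib.compress, blocks))  # map keeps block order
    parts = [struct.pack("<I", len(packed))]            # header: block count
    for chunk in packed:
        parts.append(struct.pack("<I", len(chunk)))     # per-block length prefix
        parts.append(chunk)
    return b"".join(parts)


def decompress_parallel(blob: bytes, workers: int = 4) -> bytes:
    """S600: unpack the blob and decompress its blocks concurrently."""
    (count,) = struct.unpack_from("<I", blob, 0)
    offset = 4
    chunks: List[bytes] = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        chunks.append(blob[offset:offset + size])
        offset += size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))
```

Because every block is compressed independently, blocks can be decompressed in any order and on any worker, which is what makes the parallel lossless recovery in S600 possible. The trade-off is that context statistics are not shared across block boundaries, so very small blocks tend to lower the compression ratio.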
In some embodiments, the invention is evaluated and verified on 11 sets of real open-source data from the NCBI (National Center for Biotechnology Information) database. The total file size is 72618889 KB, comprising 219959565 quality score sequences and 20363018799 quality score characters. In the experiments, the QSD parallel compression algorithm of the invention was tested against four benchmark methods, ZPAQ, qscomp, LCQS and CMIC, and the compression performance was obtained by running the 11 data sets on a single node with 28 CPU computing cores.
The experiments collected the normalized peak memory overhead and the normalized time overhead; lower time and peak memory overhead indicate better algorithm performance. The results show that the compression method provided by the invention achieves the best overall result on 4 of the 5 metrics (compression ratio, compression time, compression peak memory overhead, decompression time and decompression peak memory overhead). This indicates that the large-scale high-throughput genome sequencing quality score data parallel compression method provided by the invention is effective in improving the compression ratio of large-scale QSD while reducing compression time, decompression time and peak memory overhead.
In some embodiments, the parallel acceleration ratio of the invention is verified with different numbers of cores on a single node and with 4 cores on multiple nodes. Multi-core CPUs accelerate the sequence partitioning and the four-level run prediction mapping calculation, and a multi-core CPU cluster accelerates the ZPAQ context modeling and arithmetic coding cascade compression.
The experiments collect two parallel acceleration ratios: one is the ratio between the parallel computation time overhead and the serial computation time overhead; the other is the ratio between the multi-node computation time overhead and the single-node computation time overhead. The results show that, as the data scale increases, the parallel acceleration ratio of both compression and decompression rises overall as the number of CPU cores grows, which indicates a clear multi-core CPU parallel acceleration effect on a single node. The multi-core CPU cluster also delivers a clear parallel acceleration effect, particularly for decompression.
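The two acceleration ratios can be computed from measured wall-clock times as sketched below. The orientation of the ratio is an assumption: the sketch divides the baseline time (serial, or single-node) by the accelerated time (parallel, or multi-node), so that a larger value means a stronger acceleration effect. The function names and the timings in the usage lines are hypothetical and are not results from the source.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Acceleration ratio, assumed here to be baseline time / accelerated time."""
    if baseline_seconds <= 0 or accelerated_seconds <= 0:
        raise ValueError("timings must be positive")
    return baseline_seconds / accelerated_seconds


def parallel_efficiency(speedup_value: float, workers: int) -> float:
    """Fraction of ideal linear scaling achieved with the given worker count."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup_value / workers


if __name__ == "__main__":
    # Hypothetical timings for illustration only
    single_node = speedup(baseline_seconds=120.0, accelerated_seconds=30.0)
    print(single_node, parallel_efficiency(single_node, workers=8))
```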
The high-throughput genome sequencing quality score data parallel compression method provided by the invention takes improving the QSD compression rate and reducing time and peak memory consumption as its overall goal. It improves the QSD compression rate while significantly reducing QSD compression time and peak memory overhead, which reduces the size of the files to be stored and saves the construction cost of storage infrastructure.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (8)
1. A method for parallel compression of high throughput genome sequencing quality score data, comprising:
S100: splitting an original gene sequencing file to obtain the quality score data as the file to be compressed;
S200: randomly sampling the file to be compressed, performing k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
S400: predicting a compression rate gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression rate gain, and performing data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
S500: performing context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and performing cascade compression in combination with arithmetic coding to obtain a final compressed file.
2. The high throughput genome sequencing quality score data parallel compression method of claim 1, further comprising:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
3. The high throughput genome sequencing quality score data parallel compression method of claim 1, wherein the raw gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data, and sequencing quality score data.
4. The method according to claim 1, wherein the expression of the parallel sequence partitioning model in step S200 is:
wherein $i$ is the first index value, $j$ is the second index value, $i'$ is the third index value, and $j'$ is the fourth index value; […] is the partition result label computed by the $p$-th processor computing core for the $i$-th sequence; $I$ is the indicator function; $\alpha$ is the partition threshold, and the value of $I$ is 0 when the tested quantity is less than $\alpha$; $n$ is the average sequence length; $k$ is the sliding window size used for feature extraction in the k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; NFactor is the k-mer frequency normalization factor for the random sampling of the file to be compressed; […] is the k-mer sequence of length $k$ obtained by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; and $q_{i',j':j'+k+1}$ is the k-mer sequence at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.
5. The method according to claim 1, wherein the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in step S400 is:
wherein […] is the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ are the first, second, third, fourth and fifth weight parameters obtained by the multiple linear regression analysis prediction method; […] is the mode character computed by the $p$-th processor computing core for the $i$-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index value, $j$ is the second index value, and $v$ is the fifth index value; $I$ is the indicator function; […] is the character at the $j$-th position of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core; […] is the character at the $(j-1)$-th position of the $i$-th QSD sequence in the same file; […] is the run length computed by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; $\Theta$ is the size of the character value space of the quality score data; and […] is the size of the character value space of the k-mer sequence obtained by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence.
6. The method according to claim 5, wherein the expression of the parallel four-level run prediction mapping model in step S400 is:
wherein […] is the mapping result label computed by the $p$-th processor computing core for the $j$-th position of the $i$-th sequence of the $v$-th file; $\beta$ is the dynamic run prediction switch; […] is the character at the $(j+1)$-th position of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core; and […] is the character at the $(j+2)$-th position of the $i$-th QSD sequence in the same file.
7. The method of claim 1, wherein step S500 comprises:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing context modeling on the plurality of blocks to be compressed in parallel through the multi-core processor cluster and performing cascade compression to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
8. The method for parallel compression of high-throughput genome sequencing quality score data according to claim 1, wherein the value range of the splicing parameter in step S300 is:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
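The random sampling and k-mer analysis of step S200 can be illustrated with the sketch below. It only shows how M sequences might be sampled and how sliding-window k-mer frequencies might be normalized; it does not reproduce the partition model of claim 4, whose expression is not available here. The function names, the fixed seed and the choice of the total k-mer count as the normalization factor are assumptions.

```python
import random
from collections import Counter


def sample_sequences(sequences: list[str], m: int, seed: int = 0) -> list[str]:
    """Draw up to m quality score sequences at random (M in the claims)."""
    rng = random.Random(seed)  # fixed seed only to keep the sketch reproducible
    return rng.sample(sequences, min(m, len(sequences)))


def kmer_frequencies(sampled: list[str], k: int) -> dict[str, float]:
    """Slide a window of size k over each sampled sequence and return
    normalized k-mer frequencies."""
    counts: Counter[str] = Counter()
    for seq in sampled:
        for j in range(len(seq) - k + 1):
            counts[seq[j:j + k]] += 1
    total = sum(counts.values())  # assumed normalization factor (cf. NFactor)
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


if __name__ == "__main__":
    reads = ["FFFF:FFF", "FF,FFFFF", "FFFFFFFF", ":FFFF,FF"]  # toy quality strings
    stats = kmer_frequencies(sample_sequences(reads, m=3), k=3)
    print(sorted(stats.items(), key=lambda item: -item[1])[:3])
```

Statistics of this kind are what a partition model could then compare against a threshold such as α to assign each sequence to one of the two partition sub-files.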
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311018059.9A CN117133365A (en) | 2023-08-14 | 2023-08-14 | High-throughput genome sequencing quality score data parallel compression method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117133365A (en) | 2023-11-28 |
Family ID: 88857598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311018059.9A Pending CN117133365A (en) | 2023-08-14 | 2023-08-14 | High-throughput genome sequencing quality score data parallel compression method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117133365A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||