CN117133365A - High-throughput genome sequencing quality score data parallel compression method - Google Patents

Info

Publication number
CN117133365A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311018059.9A
Other languages
Chinese (zh)
Inventor
王刚
孙辉
刘晓光
郑营锋
马汇东
王潇霏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nankai University
Original Assignee
Nankai University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nankai University
Priority to CN202311018059.9A
Publication of CN117133365A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50: Compression of genetic data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/18: Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of data compression and storage, and provides a parallel compression method for high-throughput genome sequencing quality score data. The method comprises: dividing an original gene sequencing file; randomly sampling the file to be compressed and performing k-mer analysis on the sampled data to obtain statistical characteristic information; establishing a parallel sequence partition model to perform binary classification, and splicing the two partition files obtained by the binary classification according to a splicing parameter; predicting a compression rate gain for the file to be compressed by a multiple linear regression analysis prediction method, and establishing a parallel four-level run prediction mapping model to eliminate data redundancy; and performing context modeling on the two redundancy elimination sub-files through a multi-core processor cluster, combined with arithmetic coding in cascade compression, to obtain a final compressed file. The invention improves the compression rate of quality score data while significantly reducing compression time and peak memory overhead, which reduces the size of the stored files and saves the construction cost of storage infrastructure.

Description

High-throughput genome sequencing quality score data parallel compression method
Technical Field
The invention relates to the technical field of data compression and storage, in particular to a high-throughput genome sequencing quality score data parallel compression method.
Background
High-throughput genome sequencing data (HTGSD) is an important type of biological big data and is widely used in drug development, genome-wide association analysis, virus tracing, precision diagnosis and treatment, and other fields. In recent years, the rapid development of high-throughput sequencing technology has significantly reduced sequencing costs, and as a result the volume of HTGSD has grown at a rate exceeding Moore's law. For example, the national gene bank sequence archiving system alone maintains 11372 TB of high-throughput genome sequencing data compressed with the GZIP algorithm. This growth reflects the importance and widespread use of biotechnology in the current era, and it also provides greater room and opportunity for the depth and breadth of genomics and biomedical research.
HTGSD is typically stored in the FastQ file format, in which the quality score data (QSD) occupies up to 70% of the space of a losslessly compressed FastQ file. QSD therefore plays a key role in increasing the compression rate of a FastQ file, where the compression rate is the ratio of the file size before compression to the file size after compression. How to compress and store large-scale high-throughput genome sequencing quality score data, so as to keep pace with the rate at which genomic big data is generated and to reduce the costs of infrastructure construction and of data sharing and transmission, is an important problem that currently needs to be solved.
With the continuous development of high-throughput sequencing technology, traditional serial compression algorithms for high-throughput genome sequencing quality score data, as well as simple CPU-parallel compression algorithms, have difficulty meeting the demand for fast, real-time processing of massive data. In addition, existing lossless compression algorithms still leave considerable room for improving the compression rate.
Disclosure of Invention
The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides a high-throughput genome sequencing quality score data parallel compression method.
The invention provides a high-throughput genome sequencing quality score data parallel compression method, which comprises the following steps:
S100: dividing an original gene sequencing file to obtain quality score data as a file to be compressed;
S200: randomly sampling the file to be compressed, performing k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
S400: predicting the file to be compressed by a multiple linear regression analysis prediction method to obtain a compression rate gain, establishing a parallel four-level run prediction mapping model according to the compression rate gain, and performing data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
S500: performing context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and performing cascade compression in combination with arithmetic coding to obtain a final compressed file.
The high-throughput genome sequencing quality score data parallel compression method provided by the invention further comprises:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data and sequencing quality score data.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel sequence partition model in the step S200 is as follows:
where the result is the partition result label computed by the p-th processor computing core for the i-th sequence, expressed through $I$; $\alpha$ is the partition threshold, and $I$ takes the value 0 when the tested quantity is less than $\alpha$; $n$ is the average sequence length; $k$ is the size of the sliding window used for feature extraction in k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; $i$ is the first index value, $j$ the second index value, $i'$ the third index value, and $j'$ the fourth index value; NFactor is the k-mer frequency normalization factor for random sampling of the file to be compressed; the k-mer term of the i-th sequence is the k-mer sequence of length $k$ acquired by the p-th processor computing core at the j-th position of the i-th sequence; and $q_{i',j':j'+k+1}$ is the k-mer sequence computed at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in the step S400 is as follows:
where the left-hand side is the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$, and $w_4$ are the first, second, third, fourth, and fifth weight parameters computed by the multiple linear regression analysis prediction method; the mode term is the mode character computed by the p-th processor computing core for the i-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index value, $j$ the second index value, and $v$ the fifth index value; $I$ is the indicator function; the two character terms denote the characters at the j-th and (j-1)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core; the run term is the run length computed by the p-th processor computing core at the j-th position of the i-th sequence; $|\Theta|$ is the size of the character value space of the quality score data; and the last term is the size of the character value space of the k-mer sequence acquired by the p-th processor computing core at the j-th position of the i-th sequence.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel four-level run prediction mapping model in the step S400 is as follows:
where the left-hand side is the mapping result label computed by the p-th processor computing core for the j-th position of the i-th sequence of the v-th file; $\beta$ is the dynamic run prediction switch; and the two remaining character terms denote the characters at the (j+1)-th and (j+2)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core.
According to the high throughput genome sequencing quality score data parallel compression method provided by the invention, the step S500 comprises the following steps:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing parallel context modeling and cascade compression on the plurality of blocks to be compressed through the multi-core processor cluster to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the value range of the splicing parameters in the step S300 is as follows:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
The high-throughput genome sequencing quality score data parallel compression method provided by the invention aims to improve the compression rate of large-scale high-throughput genome sequencing quality score data while reducing compression time, decompression time, and peak memory overhead. It realizes parallel compression of large-scale quality score data based on the parallel sequence partition model, the parallel four-level run prediction mapping model, and a multi-core CPU cluster, improves the efficiency of compressing and storing genome sequencing data, and is of significant value in both industrial and academic applications.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a method for parallel compression of high throughput genome sequencing quality score data according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.
In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "coupled" and "connected" should be construed broadly: a connection may, for example, be fixed, removable, or integral; it may be mechanical or electrical; and it may be direct or indirect through an intermediate medium. The specific meaning of the above terms in embodiments of the present invention will be understood by those of ordinary skill in the art according to the specific circumstances.
In embodiments of the invention, unless expressly specified and limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over," or "on" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under," "below," or "beneath" a second feature may mean that the first feature is directly under or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
In the description of the present specification, a description referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and those skilled in the art may combine the different embodiments or examples described in this specification, and their features, provided they do not contradict each other.
An embodiment provided by the present invention is described below with reference to fig. 1.
The invention provides a high-throughput genome sequencing quality score data parallel compression method, which comprises the following steps:
S100: dividing an original gene sequencing file to obtain quality score data as a file to be compressed;
The original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data, and sequencing quality score data.
Further, the original gene sequencing file is in FastQ format, and step S100 obtains, through data segmentation, the sequencing quality score data file to be compressed for the subsequent compression and storage operations.
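Step S100 can be illustrated with a minimal sketch. It assumes the common four-line FastQ record layout (description, DNA sequence, additional information, quality scores) with no wrapped lines; the function name `extract_quality_scores` and the sample records are hypothetical and are not taken from the patent.

```python
from typing import Iterable, Iterator, List


def extract_quality_scores(fastq_lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality score line of every four-line FastQ record."""
    for index, line in enumerate(fastq_lines):
        # Assumed record layout: description, DNA sequence,
        # additional information ('+'), quality scores.
        if index % 4 == 3:
            yield line.rstrip("\n")


# Hypothetical two-record input, used only for illustration.
example_fastq: List[str] = [
    "@read1\n", "ACGT\n", "+\n", "IIII\n",
    "@read2\n", "TTGA\n", "+\n", "FF:F\n",
]
quality_lines = list(extract_quality_scores(example_fastq))
```

Because the function consumes any iterable of lines, it could also be applied to an open file handle so that the whole file never has to be held in memory.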
S200: randomly sampling the file to be compressed, performing k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
The expression of the parallel sequence partition model in step S200 is:
where the result is the partition result label computed by the p-th processor computing core for the i-th sequence, expressed through $I$; $\alpha$ is the partition threshold, and $I$ takes the value 0 when the tested quantity is less than $\alpha$; $n$ is the average sequence length; $k$ is the size of the sliding window used for feature extraction in k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; $i$ is the first index value, $j$ the second index value, $i'$ the third index value, and $j'$ the fourth index value; NFactor is the k-mer frequency normalization factor for random sampling of the file to be compressed; the k-mer term of the i-th sequence is the k-mer sequence of length $k$ acquired by the p-th processor computing core at the j-th position of the i-th sequence; and $q_{i',j':j'+k+1}$ is the k-mer sequence computed at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.
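The sampling and partitioning idea of step S200 can be sketched as follows. The exact scoring expression is not reproduced in the text, so this is only one plausible reading and not the patented formula: k-mer frequencies are estimated from a random sample, each sequence is scored by the mean sampled frequency of its own k-mers, and the label is 0 when the score falls below the threshold. All function names, the scoring rule, and the default parameter values are assumptions.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def kmer_frequencies(samples: List[str], k: int) -> Dict[str, float]:
    """Normalized k-mer frequencies over the sampled sequences."""
    counts: Counter = Counter()
    for seq in samples:
        for j in range(len(seq) - k + 1):  # sliding window of size k
            counts[seq[j:j + k]] += 1
    total = sum(counts.values())  # assumed stand-in for a normalization factor
    return {kmer: c / total for kmer, c in counts.items()} if total else {}


def partition_label(seq: str, freqs: Dict[str, float], k: int, alpha: float) -> int:
    """Return 0 if the mean sampled frequency of the k-mers is below alpha, else 1."""
    windows = [seq[j:j + k] for j in range(len(seq) - k + 1)]
    if not windows:
        return 0
    score = sum(freqs.get(w, 0.0) for w in windows) / len(windows)
    return 0 if score < alpha else 1


def partition(seqs: List[str], k: int = 2, sample_size: int = 2,
              alpha: float = 0.2, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Split sequences into two partitions using sampled k-mer statistics."""
    rng = random.Random(seed)
    samples = rng.sample(seqs, min(sample_size, len(seqs)))
    freqs = kmer_frequencies(samples, k)
    labels = [partition_label(s, freqs, k, alpha) for s in seqs]
    first = [s for s, label in zip(seqs, labels) if label == 1]
    second = [s for s, label in zip(seqs, labels) if label == 0]
    return first, second
```

Each label depends only on its own sequence and on the shared frequency table, which is why this kind of computation can be distributed across processor cores; the sketch itself runs serially.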
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
The value range of the splicing parameter in step S300 is as follows:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
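The text does not spell out the splicing operation itself. A minimal sketch under one assumed reading, in which every τ consecutive sequences of a partition are concatenated into one longer sequence, is shown below together with a check of the stated range. The helper names `tau_in_range` and `splice` are hypothetical.

```python
from typing import List


def tau_in_range(tau: int, r: int, lower: int = 10000) -> bool:
    """Check the stated constraint 10000 <= tau <= r."""
    return lower <= tau <= r


def splice(seqs: List[str], tau: int) -> List[str]:
    """Concatenate every tau consecutive sequences (assumed meaning of splicing)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return ["".join(seqs[i:i + tau]) for i in range(0, len(seqs), tau)]
```

The toy call `splice(["AB", "CD", "EF"], 2)` uses a τ far below the stated lower bound purely to keep the example readable. Plain concatenation also discards sequence boundaries, so a lossless scheme would have to record them separately; this sketch does not.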
S400: predicting the file to be compressed by a multiple linear regression analysis prediction method to obtain a compression rate gain, establishing a parallel four-level run prediction mapping model according to the compression rate gain, and performing data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
The expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in step S400 is as follows:
where the left-hand side is the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$, and $w_4$ are the first, second, third, fourth, and fifth weight parameters computed by the multiple linear regression analysis prediction method; the mode term is the mode character computed by the p-th processor computing core for the i-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index value, $j$ the second index value, and $v$ the fifth index value; $I$ is the indicator function; the two character terms denote the characters at the j-th and (j-1)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core; the run term is the run length computed by the p-th processor computing core at the j-th position of the i-th sequence; $|\Theta|$ is the size of the character value space of the quality score data; and the last term is the size of the character value space of the k-mer sequence acquired by the p-th processor computing core at the j-th position of the i-th sequence.
Further, the objective of the multiple linear regression prediction is to solve for the optimal weight vector of the parallel four-level run prediction mapping model. In the above formula, the quantity multiplied by the first weight parameter is the proportion of mode characters among the total sequence characters; the quantity multiplied by the second weight parameter is the proportion of characters whose adjacent quality values are larger than 7 among the total character string sequence characters; the quantity multiplied by the third weight parameter is the proportion of characters with a run length larger than 3 among the total sequence characters; and the quantity multiplied by the fourth weight parameter is the ratio of the value space occupied by the current sequence characters to the source character value space.
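The four quantities described above can be sketched for a single sequence as follows. Several points are assumptions rather than statements from the patent: the "adjacent quality values larger than 7" feature is read as an absolute difference between neighboring characters greater than 7, the run feature counts characters belonging to runs longer than 3, $w_0$ is treated as an intercept, and the names `sequence_features` and `predicted_gain` are hypothetical. Fitting the weights by regression on training data is not shown.

```python
from collections import Counter
from typing import List, Sequence


def sequence_features(seq: str, alphabet_size: int,
                      gap: int = 7, min_run: int = 3) -> List[float]:
    """Four illustrative features of one quality score sequence."""
    n = len(seq)
    if n == 0:
        return [0.0, 0.0, 0.0, 0.0]
    # 1. Share of the mode (most frequent) character.
    mode_share = Counter(seq).most_common(1)[0][1] / n
    # 2. Share of positions whose value differs from the previous
    #    character by more than `gap` (assumed reading).
    jumps = sum(1 for a, b in zip(seq, seq[1:]) if abs(ord(b) - ord(a)) > gap)
    jump_share = jumps / n
    # 3. Share of characters that belong to runs longer than `min_run`.
    long_run_chars = 0
    run = 1
    for a, b in zip(seq, seq[1:]):
        if a == b:
            run += 1
        else:
            if run > min_run:
                long_run_chars += run
            run = 1
    if run > min_run:
        long_run_chars += run
    run_share = long_run_chars / n
    # 4. Share of the source alphabet actually used by this sequence.
    alphabet_share = len(set(seq)) / alphabet_size
    return [mode_share, jump_share, run_share, alphabet_share]


def predicted_gain(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear model w0 + w1*f1 + ... + w4*f4, with w0 as an assumed intercept."""
    if len(weights) != len(features) + 1:
        raise ValueError("expected one more weight than features")
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))
```

For the hypothetical sequence `"IIIII5II"` with an alphabet size of 4, the mode character `I` covers 7 of 8 positions, two neighboring pairs differ by more than 7, and one run of length 5 exceeds the run threshold.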
The expression of the parallel four-level run prediction mapping model in step S400 is:
where the left-hand side is the mapping result label computed by the p-th processor computing core for the j-th position of the i-th sequence of the v-th file; $\beta$ is the dynamic run prediction switch; and the two remaining character terms denote the characters at the (j+1)-th and (j+2)-th positions of the i-th QSD sequence in the v-th spliced partition file handled by the p-th processor computing core.
Further, the four quantities that are multiplied by the weight parameters in the expression of the compression rate gain are recorded as the sequence feature vector, and the parallel four-level run prediction mapping model can be established by taking this sequence feature vector as input.
S500: performing context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and performing cascade compression in combination with arithmetic coding to obtain a final compressed file.
Step S500 includes:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing parallel context modeling and cascade compression on the plurality of blocks to be compressed through the multi-core processor cluster to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
The method further includes:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
Furthermore, the ZPAQ algorithm is used for both cascade compression and cascade decompression; the final compressed file is Q.save, and the decompressed, losslessly recovered output file is Q.recovery.
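The block, compress, merge, and decompress flow of steps S510 to S530 and S600 can be sketched on a single machine as follows. Python's standard library has no ZPAQ binding, so `zlib` is swapped in purely as a stand-in codec, and a thread pool stands in for the multi-core processor cluster. The length-prefixed container layout and the function names are assumptions made for this illustration.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """S510 (sketch): cut the input into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_parallel(data: bytes, block_size: int = 1 << 20, workers: int = 4) -> bytes:
    """S520 and S530 (sketch): compress blocks concurrently, then pack them."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(zlib.compress, blocks))  # map keeps block order
    out = [struct.pack("<I", len(packed))]  # header: number of blocks
    for blob in packed:
        out.append(struct.pack("<I", len(blob)))  # per-block length prefix
        out.append(blob)
    return b"".join(out)


def decompress_parallel(archive: bytes, workers: int = 4) -> bytes:
    """S600 (sketch): unpack the blocks and decompress them concurrently."""
    (count,) = struct.unpack_from("<I", archive, 0)
    offset = 4
    blobs = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", archive, offset)
        offset += 4
        blobs.append(archive[offset:offset + size])
        offset += size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, blobs))
```

Because every block is compressed independently, blocks can be handled in any order or on any worker; the trade-off is that context shared across block boundaries is lost, so smaller blocks generally compress less well.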
In some embodiments, the invention is evaluated and verified on 11 sets of real, openly available data from the NCBI (National Center for Biotechnology Information) database. The total file size is 72618889 kilobytes, comprising 219959565 quality score sequences and 20363018799 quality score characters in total. In the experiments, the QSD parallel compression algorithm of the invention and four benchmark methods (ZPAQ, qscomp, LCQS, and CMIC) are tested and evaluated, and the compression performance is obtained by running the 11 data sets on a single node with 28 CPU computing cores.
The experiments collect the normalized peak memory overhead and the normalized time overhead of each algorithm; lower time and peak memory overhead indicate better algorithm performance. The results show that the proposed compression method achieves the overall best result on 4 of the 5 metrics considered: compression rate, compression time, compression peak memory overhead, decompression time, and decompression peak memory overhead. This indicates that the proposed large-scale high-throughput genome sequencing quality score data parallel compression method is effective in improving the compression rate of large-scale QSD and in reducing compression time, decompression time, and peak memory overhead.
In some embodiments, multi-core CPUs are used to accelerate sequence partitioning and four-level run prediction mapping, and a multi-core CPU cluster is used to accelerate ZPAQ context modeling and arithmetic coding cascade compression. The parallel acceleration ratio of the invention is verified with different numbers of cores on a single node and with 4 cores on multiple nodes.
The experiments collect two parallel acceleration ratios: one is the ratio between the serial and the parallel computation time overhead, and the other is the ratio between the single-node and the multi-node computation time overhead. The results show that, as the data scale grows, the parallel acceleration ratios of compression and decompression generally rise with the number of CPU cores, which indicates that multi-core CPU parallel acceleration on a single node is effective for the proposed method. Parallel acceleration on the multi-core CPU cluster is also effective, and it is particularly pronounced for decompression.
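As a small illustration of how an acceleration ratio of this kind is usually computed, the sketch below assumes the conventional definition (baseline time divided by accelerated time); the text itself only says "ratio". The function names and all timings are hypothetical and are not measurements from the patent.

```python
def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Conventional speedup: baseline time divided by accelerated time."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated time must be positive")
    return baseline_seconds / accelerated_seconds


def parallel_efficiency(speedup_value: float, workers: int) -> float:
    """Speedup per worker; 1.0 would mean perfect linear scaling."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    return speedup_value / workers


# Hypothetical timings, for illustration only.
single_node_speedup = speedup(120.0, 30.0)                 # serial vs. parallel
multi_node_speedup = speedup(30.0, 10.0)                   # one node vs. several
efficiency = parallel_efficiency(single_node_speedup, 8)   # 8 assumed cores
```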
The overall goal of the high-throughput genome sequencing quality score data parallel compression method provided by the invention is to improve the QSD compression rate while reducing time and peak memory consumption. The method improves the QSD compression rate while significantly reducing QSD compression time and peak memory overhead, which reduces the size of the stored files and saves the construction cost of storage infrastructure.
Finally, it should be noted that the above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for parallel compression of high throughput genome sequencing quality score data, comprising:
S100: splitting a raw gene sequencing file to obtain the quality score data as a file to be compressed;
S200: randomly sampling the file to be compressed, performing k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;
S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to a splicing parameter to obtain a first spliced sub-file and a second spliced sub-file;
S400: predicting a compression ratio gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression ratio gain, and performing data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to obtain a corresponding first redundancy elimination sub-file and second redundancy elimination sub-file;
S500: performing context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and performing cascade compression in combination with arithmetic coding to obtain a final compressed file.
2. The high throughput genome sequencing quality score data parallel compression method of claim 1, further comprising:
S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.
3. The high throughput genome sequencing quality score data parallel compression method of claim 1, wherein the raw gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data, and sequencing quality score data.
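The four components listed in claim 3 match the layout of a FASTQ record, so the splitting of step S100 can be sketched as follows. This is a minimal illustration that assumes strict four-line FASTQ records (description, DNA sequence, additional information, quality scores); the function name and the sample reads are hypothetical.

```python
from typing import Iterable, Iterator


def extract_quality_scores(lines: Iterable[str]) -> Iterator[str]:
    """Yield the quality score line of each four-line record.

    Assumes every record occupies exactly four lines, with the
    quality score string on the fourth line.
    """
    for index, line in enumerate(lines):
        if index % 4 == 3:
            yield line.rstrip("\n")


# Two made-up reads, for illustration only.
fastq_text = (
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nTTGA\n+\nFF:F\n"
)
quality_lines = list(extract_quality_scores(fastq_text.splitlines()))
```

The resulting list of quality strings plays the role of the file to be compressed in the later steps.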
4. The method according to claim 1, wherein the expression of the parallel sequence partitioning model in step S200 is:
wherein i is the first index value, j is the second index value, i′ is the third index value, and j′ is the fourth index value; the partition result label is computed by the p-th processor computing core for the i-th sequence; I is the indicator function, whose value is 0 when its argument is less than the partition threshold α; n is the average sequence length; k is the sliding window size used for feature extraction in the k-mer analysis; M is the number of samples drawn when randomly sampling the file to be compressed; NFactor is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the k-mer sequence of length k is obtained by the p-th processor computing core at the j-th position of the i-th sequence; and q_{i′,j′:j′+k+1} is the k-mer sequence at the j′-th position of the i′-th randomly sampled sequence q_{i′}.
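The quantities defined above (random sampling, k-mer windows, a normalization factor and a threshold α) can be illustrated with a minimal sketch. The scoring rule used here, the mean normalized frequency of a sequence's k-mers compared against α, is an assumption made for illustration and is not the claimed expression; all function names and data are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """Return all length-k windows of seq, one per start position."""
    return [seq[j:j + k] for j in range(len(seq) - k + 1)]


def build_kmer_table(sequences: List[str], k: int, sample_size: int,
                     seed: int = 0) -> Dict[str, float]:
    """Count k-mers over a random sample and normalize counts to frequencies."""
    rng = random.Random(seed)
    sample = rng.sample(sequences, min(sample_size, len(sequences)))
    counts = Counter(km for s in sample for km in kmers(s, k))
    total = sum(counts.values())  # plays the role of a normalization factor
    return {km: c / total for km, c in counts.items()} if total else {}


def partition_label(seq: str, table: Dict[str, float], k: int,
                    alpha: float) -> int:
    """Assumed rule: label 1 if the mean k-mer frequency reaches alpha, else 0."""
    windows = kmers(seq, k)
    if not windows:
        return 0
    score = sum(table.get(km, 0.0) for km in windows) / len(windows)
    return 1 if score >= alpha else 0


# Made-up quality strings: eight highly regular, two irregular.
sequences = ["IIIIIIII"] * 8 + ["A#B!C$D%"] * 2
table = build_kmer_table(sequences, k=2, sample_size=10)
labels = [partition_label(s, table, k=2, alpha=0.1) for s in sequences]
```

Because each sequence is scored independently against a shared table, the labeling loop can be distributed across processor cores without coordination.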
5. The method according to claim 1, wherein the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in step S400 is:
wherein the predicted quantity is the compression ratio gain; w_0, w_1, w_2, w_3 and w_4 are the first, second, third, fourth and fifth weight parameters obtained by the multiple linear regression analysis prediction method; the mode character is computed by the p-th processor computing core for the i-th sequence; n is the average sequence length; N is the number of spliced sequences; i is the first index value, j is the second index value, and v is the fifth index value; I is the indicator function; the character terms denote the characters at the j-th position and the (j-1)-th position of the i-th QSD sequence in the v-th splicing partition file, as processed by the p-th processor computing core; the run length is computed by the p-th processor computing core at the j-th position of the i-th sequence; Θ is the size of the character value space of the quality score data; and the last term is the size of the character value space of the k-mer sequences obtained by the p-th processor computing core at the j-th position of the i-th sequence.
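Several of the per-sequence quantities named above (the mode character, the comparison of the characters at positions j and j-1 through an indicator function, and the run length at each position) can be computed as in the following minimal sketch. It only illustrates these statistics; how they are weighted by w_0 to w_4 in the regression is not reproduced here, and all function names are hypothetical.

```python
from collections import Counter
from typing import List


def mode_character(seq: str) -> str:
    """Return the most frequent character of seq (empty string if seq is empty)."""
    if not seq:
        return ""
    return Counter(seq).most_common(1)[0][0]


def adjacent_repeat_fraction(seq: str) -> float:
    """Fraction of positions j >= 1 whose character equals the one at j-1."""
    if len(seq) < 2:
        return 0.0
    repeats = sum(1 for j in range(1, len(seq)) if seq[j] == seq[j - 1])
    return repeats / (len(seq) - 1)


def run_lengths(seq: str) -> List[int]:
    """Length of the run of identical characters ending at each position."""
    lengths: List[int] = []
    for j, ch in enumerate(seq):
        if j > 0 and ch == seq[j - 1]:
            lengths.append(lengths[-1] + 1)
        else:
            lengths.append(1)
    return lengths


# Made-up quality string, for illustration only.
example = "IIIFFI"
example_mode = mode_character(example)
example_repeat_fraction = adjacent_repeat_fraction(example)
example_runs = run_lengths(example)
```

Long runs and a high repeat fraction are the kind of redundancy that a run-based mapping can remove before entropy coding.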
6. The method according to claim 5, wherein the expression of the parallel four-level run prediction mapping model in step S400 is:
wherein the mapping result label is computed by the p-th processor computing core for the j-th position of the i-th sequence of the v-th file; β is the dynamic run prediction switch; and the character terms denote the characters at the (j+1)-th position and the (j+2)-th position of the i-th QSD sequence in the v-th splicing partition file, as processed by the p-th processor computing core.
7. The method of claim 1, wherein step S500 comprises:
S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;
S520: performing parallel context modeling and cascade compression on the plurality of blocks to be compressed through the multi-core processor cluster to obtain a plurality of files to be merged;
S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
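Steps S510 to S530, together with the lossless recovery of step S600, follow a split, compress-in-parallel and pack pattern that can be sketched as below. This is a minimal single-machine illustration under stated assumptions: zlib stands in for the ZPAQ context modeling and arithmetic coding, a thread pool stands in for the multi-core processor cluster, and the container layout (a block count followed by length-prefixed blocks) is hypothetical.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """S510 (sketch): cut the input into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_parallel(data: bytes, block_size: int = 1 << 16,
                      workers: int = 4) -> bytes:
    """S520 + S530 (sketch): compress blocks concurrently, then pack them."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(zlib.compress, blocks))  # order preserved
    parts = [struct.pack("<I", len(compressed))]  # header: number of blocks
    for chunk in compressed:
        parts.append(struct.pack("<I", len(chunk)))  # per-block length prefix
        parts.append(chunk)
    return b"".join(parts)


def decompress_parallel(packed: bytes, workers: int = 4) -> bytes:
    """S600 (sketch): unpack the blocks and decompress them concurrently."""
    (count,) = struct.unpack_from("<I", packed, 0)
    offset = 4
    chunks = []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", packed, offset)
        offset += 4
        chunks.append(packed[offset:offset + size])
        offset += size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))


# Made-up, highly repetitive quality data, for illustration only.
original = b"IIIIFFFF:::IIII\n" * 5000
packed = compress_parallel(original, block_size=4096)
restored = decompress_parallel(packed)
```

Because every block is compressed independently, blocks can also be decompressed independently, which is consistent with the strong decompression speedup reported for the cluster setting.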
8. The method for parallel compression of high-throughput genome sequencing quality score data according to claim 1, wherein the range of values of the splicing parameters in step S300 is:
10000≤τ≤r;
wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
CN202311018059.9A 2023-08-14 2023-08-14 High-throughput genome sequencing quality score data parallel compression method Pending CN117133365A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311018059.9A CN117133365A (en) 2023-08-14 2023-08-14 High-throughput genome sequencing quality score data parallel compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311018059.9A CN117133365A (en) 2023-08-14 2023-08-14 High-throughput genome sequencing quality score data parallel compression method

Publications (1)

Publication Number Publication Date
CN117133365A true CN117133365A (en) 2023-11-28

Family

ID=88857598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311018059.9A Pending CN117133365A (en) 2023-08-14 2023-08-14 High-throughput genome sequencing quality score data parallel compression method

Country Status (1)

Country Link
CN (1) CN117133365A (en)

Similar Documents

Publication Publication Date Title
CN106687966B (en) Method and system for data analysis and compression
US11620567B2 (en) Method, apparatus, device and storage medium for predicting protein binding site
Zhang et al. Real-time mapping of nanopore raw signals
US8798936B2 (en) Methods and systems for data analysis using the Burrows Wheeler transform
JP2019535057A5 (en)
CN113393911B (en) Ligand compound rapid pre-screening method based on deep learning
CN104699998A (en) Method and device for compressing and decompressing genome
CN113488104B (en) Cancer driving gene prediction method and system based on local and global network centrality analysis
CN113539364B (en) Method for predicting protein phosphorylation by deep neural network framework
CN113743453A (en) Population quantity prediction method based on random forest
CN117133365A (en) High-throughput genome sequencing quality score data parallel compression method
CN111816246A (en) Method for identifying driving gene from difference network
CN112259157A (en) Protein interaction prediction method
CN116386724A (en) Method and device for predicting protein interaction, electronic device and storage medium
CN114758721A (en) Deep learning-based transcription factor binding site positioning method
CN113053461A (en) Target-based gene cluster directional mining method
CN113838525B (en) Prediction method and system for pathogenic gene pair
CN117059181A (en) High-flux genome sequence data compression parallel optimization method
Cai et al. Application and research progress of machine learning in Bioinformatics
Liu et al. A novel Group Template Pattern Classifiers (GTPCs) method in protein secondary structure prediction
CN111640467B (en) DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence
Yi et al. ACO: lossless quality score compression based on adaptive coding order
KR20230134429A (en) Cancer status diagnostic determining apparatus using cell-free dna amd method thereof
Wu et al. A Convolutional Neural Network-Based Approach to Identify the Origins of Replication in Saccharomyces Cerevisiae
CN117637035A (en) Classification model and method for multiple groups of credible integration of students based on graph neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination