CN117133365A

CN117133365A - High-throughput genome sequencing quality score data parallel compression method

Info

Publication number: CN117133365A
Application number: CN202311018059.9A
Authority: CN
Inventors: 王刚; 孙辉; 刘晓光; 郑营锋; 马汇东; 王潇霏
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2023-08-14
Filing date: 2023-08-14
Publication date: 2023-11-28

Abstract

The invention relates to the technical field of data compression and storage, and provides a high-throughput genome sequencing quality score data parallel compression method. The method comprises the following steps: dividing an original gene sequencing file; randomly sampling the resulting file and carrying out k-mer analysis on the sampled data to obtain statistical characteristic information; establishing a parallel sequence partition model to perform binary classification, and splicing the two partition files obtained by the binary classification according to splicing parameters; predicting a compression rate gain for the file to be compressed by a multiple linear regression analysis prediction method, and establishing a parallel four-level run prediction mapping model to eliminate data redundancy; and carrying out context modeling on the two redundancy-eliminated sub-files through a multi-core processor cluster, followed by cascade compression in combination with arithmetic coding, to obtain a final compressed file. While significantly reducing the compression time and peak memory overhead for quality score data, the invention also improves the quality score data compression rate, reduces the size of the files to be stored, and saves construction cost for storage infrastructure.

Description

High-throughput genome sequencing quality score data parallel compression method

Technical Field

The invention relates to the technical field of data compression and storage, in particular to a high-throughput genome sequencing quality score data parallel compression method.

Background

High-throughput genome sequencing data (HTGSD) is an important type of biological big data and is widely used in fields such as drug development, genome-wide association analysis, virus tracing, and precision diagnosis and treatment. In recent years, with the rapid development of high-throughput sequencing technology, sequencing costs have fallen significantly, and as a result the volume of HTGSD has been growing faster than Moore's law. For example, the sequence archiving system of the national gene bank alone holds 11,372 TB of high-throughput genome sequencing data compressed with the GZIP algorithm. This growth reflects the importance and widespread use of biotechnology in the current era, and it also opens up more room and opportunity for deeper and broader genomics and biomedical research.

HTGSD is typically stored in the FastQ file format, in which the quality score data (QSD) occupies up to 70% of the space of a losslessly compressed FastQ file. QSD therefore plays a key role in improving the compression rate of a FastQ file, that is, the ratio of the file size before compression to the file size after compression. How to compress and store large-scale high-throughput genome sequencing quality score data so as to keep pace with the rate at which genome big data is generated, and to reduce the cost of infrastructure construction and of data sharing and transmission, is an important problem that currently needs to be solved.

As high-throughput sequencing technology continues to develop, traditional serial compression algorithms for high-throughput genome sequencing quality score data, and simple CPU-parallel compression algorithms, struggle to meet the demand for rapid, real-time processing of massive data. At the same time, existing lossless compression algorithms still leave considerable room for improving the compression rate.

Disclosure of Invention

The present invention is directed to solving at least one of the technical problems existing in the related art. Therefore, the invention provides a high-throughput genome sequencing quality score data parallel compression method.

The invention provides a high-throughput genome sequencing quality score data parallel compression method, which comprises the following steps:

S100: dividing an original gene sequencing file to obtain the quality score data as a file to be compressed;

S200: randomly sampling the file to be compressed, carrying out k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;

S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to splicing parameters to obtain a first spliced sub-file and a second spliced sub-file;

S400: predicting a compression rate gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression rate gain, and carrying out data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to correspondingly obtain a first redundancy elimination sub-file and a second redundancy elimination sub-file;

S500: carrying out context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and carrying out cascade compression in combination with arithmetic coding to obtain a final compressed file.
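To make the role of context modeling in step S500 more concrete, the following Python sketch estimates how many bits an ideal arithmetic coder would spend on a quality string under an adaptive context model. It is only an illustration under stated assumptions: the function name `adaptive_code_length_bits`, the add-one smoothing and the fixed context order are hypothetical choices, not the model used by the invention.

```python
import math
from collections import defaultdict


def adaptive_code_length_bits(data: str, order: int, alphabet_size: int) -> float:
    """Ideal code length in bits of `data` under an adaptive context model.

    Each symbol is predicted from the `order` symbols before it, using counts
    seen so far with add-one smoothing (an assumed, simple estimator). An
    arithmetic coder driven by the same probabilities approaches this total.
    """
    counts = defaultdict(lambda: defaultdict(int))  # context -> symbol -> count
    totals = defaultdict(int)                       # context -> total count
    bits = 0.0
    for pos, symbol in enumerate(data):
        context = data[max(0, pos - order):pos]
        probability = (counts[context][symbol] + 1) / (totals[context] + alphabet_size)
        bits -= math.log2(probability)
        # Update the model only after coding, so a decoder can mirror it.
        counts[context][symbol] += 1
        totals[context] += 1
    return bits


if __name__ == "__main__":
    quality = "IIIIFFFF" * 50
    for order in (0, 2):
        print(order, round(adaptive_code_length_bits(quality, order, 2), 1))
```

On repetitive quality strings the higher-order model typically needs fewer bits than the order-0 model, which is the effect that context modeling followed by arithmetic coding exploits.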

The invention provides a high-throughput genome sequencing quality score data parallel compression method, which further comprises the following steps:

S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.

According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data and sequencing quality score data.
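The division of the original file into these four kinds of data can be sketched as follows. This is a minimal illustration that assumes the common four-line FastQ record layout (description line, DNA sequence line, additional-information line, quality line); the names `FastqStreams` and `split_fastq` are hypothetical.

```python
from typing import Iterable, NamedTuple


class FastqStreams(NamedTuple):
    descriptions: list  # sequencing description information lines
    sequences: list     # sequencing DNA sequence lines
    additional: list    # sequencing additional information lines
    qualities: list     # quality score lines, i.e. the file to be compressed


def split_fastq(lines: Iterable[str]) -> FastqStreams:
    """Split four-line FastQ records into four separate streams."""
    rows = [line.rstrip("\n") for line in lines]
    if len(rows) % 4 != 0:
        raise ValueError("FastQ input must contain a multiple of 4 lines")
    streams = FastqStreams([], [], [], [])
    for start in range(0, len(rows), 4):
        # Line 0..3 of each record goes to stream 0..3 respectively.
        for stream, row in zip(streams, rows[start:start + 4]):
            stream.append(row)
    return streams


if __name__ == "__main__":
    record = ["@read1\n", "ACGT\n", "+\n", "IIFF\n"]
    print(split_fastq(record).qualities)
```

Only the `qualities` stream is passed on to the later steps; the other three streams are outside the scope of the quality score compression described here.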

According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel sequence partition model in the step S200 is as follows:

wherein the expression yields the partition result flag computed by the $p$-th processor computing core for the $i$-th sequence; $I$ is the indicator function, which takes the value 0 when its argument is less than $\alpha$; $\alpha$ is the partition threshold; $n$ is the average sequence length; $k$ is the size of the sliding window used for feature extraction in the k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; $i$ is the first index; $j$ is the second index; $i'$ is the third index; $j'$ is the fourth index; $NFactor$ is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the first k-mer term is the k-mer sequence of length $k$ acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; and $q_{i',\,j':j'+k+1}$ is the k-mer sequence computed at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.
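The general idea of sampling, k-mer counting and threshold-based binary partitioning can be sketched in Python as below. The scoring rule is an assumption made purely for illustration (the average sampled frequency of a sequence's k-mers, normalized by the largest sampled count and compared with a threshold `alpha`); it is not the expression of the parallel sequence partition model, and all names are hypothetical.

```python
import random
from collections import Counter


def kmer_counts(seqs, k):
    """Count every length-k sliding window over the given sequences."""
    counts = Counter()
    for q in seqs:
        for j in range(len(q) - k + 1):
            counts[q[j:j + k]] += 1
    return counts


def partition_flags(seqs, k=4, m=1000, alpha=0.5, seed=0):
    """Return one 0/1 flag per sequence (assumed scoring rule, see lead-in)."""
    rng = random.Random(seed)
    sample = rng.sample(seqs, min(m, len(seqs)))   # random sampling of M sequences
    counts = kmer_counts(sample, k)
    nfactor = max(counts.values(), default=1)      # hypothetical normalization factor
    flags = []
    for q in seqs:
        windows = max(len(q) - k + 1, 1)
        hits = sum(counts[q[j:j + k]] for j in range(len(q) - k + 1))
        score = hits / (windows * nfactor)
        flags.append(0 if score < alpha else 1)    # indicator against threshold alpha
    return flags


if __name__ == "__main__":
    data = ["IIIIIIII"] * 5 + ["A#B$C%D&"]
    print(partition_flags(data))
```

Because every sequence is scored independently once the sampled counts exist, the loop over `seqs` is the part that could be distributed across processor cores.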

According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in the step S400 is as follows:

wherein the expression yields the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ are respectively the first, second, third, fourth and fifth weight parameters computed by the multiple linear regression analysis prediction method; the mode-character term is the mode character computed by the $p$-th processor computing core for the $i$-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index; $j$ is the second index; $v$ is the fifth index; $I$ is the indicator function; the two character terms denote the characters at the $j$-th and $(j-1)$-th positions of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core; the run-length term is the run length computed by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; $|\Theta|$ is the size of the character value space of the quality score data; and the final term is the size of the character value space of the k-mer sequence acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence.

According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the expression of the parallel four-level run prediction mapping model in the step S400 is as follows:

wherein the expression yields the mapping result label computed by the $p$-th processor computing core for the $j$-th position of the $i$-th sequence of the $v$-th file; $\beta$ is the dynamic run prediction switch; and the two character terms denote the characters at the $(j+1)$-th and $(j+2)$-th positions of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core.
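As a rough intuition for run-based redundancy elimination, the sketch below uses plain run-length mapping with an on/off switch. It is deliberately a swapped-in, simpler technique: it is not the four-level run prediction mapping model, and the `enabled` flag only loosely stands in for the dynamic run prediction switch. The function names are hypothetical.

```python
from itertools import groupby


def run_map(seq: str, enabled: bool = True):
    """Map a quality string to (character, run length) pairs.

    With `enabled=False` every character is emitted as a run of length 1,
    which mimics leaving the data unmapped when no gain is expected.
    """
    if not enabled:
        return [(ch, 1) for ch in seq]
    return [(ch, len(list(group))) for ch, group in groupby(seq)]


def run_unmap(pairs) -> str:
    """Invert run_map so that the original string is recovered losslessly."""
    return "".join(ch * length for ch, length in pairs)


if __name__ == "__main__":
    mapped = run_map("IIIIFF#")
    print(mapped, run_unmap(mapped))
```

The round trip through `run_unmap` shows the property any such mapping needs for lossless compression: the mapped form must determine the original string exactly.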

According to the high throughput genome sequencing quality score data parallel compression method provided by the invention, the step S500 comprises the following steps:

S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;

S520: carrying out parallel context modeling and cascade compression on the plurality of blocks to be compressed through the multi-core processor cluster to obtain a plurality of files to be merged;

S530: merging and packing the plurality of files to be merged to obtain the final compressed file.
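Steps S510 to S530, together with the parallel decompression of step S600, follow a block, compress and merge pattern that can be sketched as below. This is a sketch under assumptions: `zlib` from the Python standard library stands in for the context modeling and arithmetic coding cascade, a thread pool on one machine stands in for the multi-core processor cluster, and the length-prefixed container layout is hypothetical.

```python
import struct
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data: bytes, block_size: int):
    """S510 (sketch): cut the input into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def compress_parallel(data: bytes, block_size: int = 1 << 20, workers: int = 4) -> bytes:
    """S520 and S530 (sketch): compress blocks concurrently, then pack them."""
    blocks = split_blocks(data, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        packed = list(pool.map(zlib.compress, blocks))  # order is preserved
    parts = [struct.pack("<I", len(packed))]            # block count header
    for chunk in packed:
        parts.append(struct.pack("<I", len(chunk)))     # per-block length prefix
        parts.append(chunk)
    return b"".join(parts)


def decompress_parallel(archive: bytes, workers: int = 4) -> bytes:
    """S600 (sketch): unpack the container and decompress blocks concurrently."""
    (count,) = struct.unpack_from("<I", archive, 0)
    offset, chunks = 4, []
    for _ in range(count):
        (size,) = struct.unpack_from("<I", archive, offset)
        offset += 4
        chunks.append(archive[offset:offset + size])
        offset += size
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(zlib.decompress, chunks))


if __name__ == "__main__":
    original = b"IIIIFFFF##" * 5000
    archive = compress_parallel(original, block_size=4096)
    print(len(original), len(archive), decompress_parallel(archive) == original)
```

Because each block is compressed independently, blocks can be processed in any order and on any worker; the price is that statistics are not shared across block boundaries, so the block size trades compression rate against parallelism.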

According to the high-throughput genome sequencing quality score data parallel compression method provided by the invention, the value range of the splicing parameters in the step S300 is as follows:

$10000 \le \tau \le r$;

wherein τ is a splicing parameter, and r is the total number of character string sequences in the file to be compressed.
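One possible reading of the splicing step is that every group of τ consecutive sequences in a partition sub-file is concatenated into one longer unit. The sketch below follows that reading, which is an assumption rather than something the text states; it also records the original lengths so that splicing stays reversible. The function names are hypothetical.

```python
def splice_sequences(seqs, tau):
    """Concatenate groups of `tau` sequences; enforce 10000 <= tau <= r."""
    r = len(seqs)  # total number of character string sequences
    if not 10000 <= tau <= r:
        raise ValueError("tau must satisfy 10000 <= tau <= r")
    spliced, lengths = [], []
    for start in range(0, r, tau):
        group = seqs[start:start + tau]
        spliced.append("".join(group))
        lengths.append([len(q) for q in group])  # needed to undo the splice
    return spliced, lengths


def unsplice(spliced, lengths):
    """Recover the original sequences from spliced units and stored lengths."""
    seqs = []
    for text, sizes in zip(spliced, lengths):
        pos = 0
        for size in sizes:
            seqs.append(text[pos:pos + size])
            pos += size
    return seqs


if __name__ == "__main__":
    data = ["IIFF", "#I"] * 12500
    units, sizes = splice_sequences(data, 10000)
    print(len(units), unsplice(units, sizes) == data)
```

Longer spliced units give a later run-based or context-based stage more contiguous data to work on, which may be why the lower bound on τ is fairly large.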

The high-throughput genome sequencing quality score data parallel compression method provided by the invention aims to improve the compression rate of large-scale high-throughput genome sequencing quality score data while reducing compression and decompression time and peak memory overhead. It realizes parallel compression of such data based on the parallel sequence partition model, the parallel four-level run prediction mapping model and a multi-core CPU cluster, improves the efficiency of compressing and storing genome sequencing data, and has important value in industrial and academic applications.

Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Drawings

In order to illustrate the technical solutions of the invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a method for parallel compression of high throughput genome sequencing quality score data according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention. The following examples are illustrative of the invention but are not intended to limit the scope of the invention.

In the description of the embodiments of the present invention, it should be noted that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the embodiments of the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the embodiments of the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.

In describing embodiments of the present invention, it should be noted that, unless explicitly stated and limited otherwise, the terms "coupled" and "connected" should be construed broadly: a connection may be, for example, a fixed connection, a removable connection, or an integral connection; it may be mechanical or electrical; and it may be direct or indirect through an intermediate medium. The specific meaning of the above terms in embodiments of the present invention will be understood in detail by those of ordinary skill in the art.

In embodiments of the invention, unless expressly specified and limited otherwise, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that they are in indirect contact via an intervening medium. Moreover, a first feature being "above," "over" or "on" a second feature may mean that the first feature is directly above or obliquely above the second feature, or may simply indicate that the first feature is at a higher level than the second feature. A first feature being "under," "below" or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or may simply indicate that the first feature is at a lower level than the second feature.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the embodiments of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

An embodiment provided by the present invention is described below with reference to FIG. 1.

The original gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data and sequencing quality score data.

Further, the original gene sequencing file is in FastQ format, and step S100 obtains, through data segmentation, the sequencing quality score data file to be compressed, on which the subsequent compression and storage operations are performed.

The expression of the parallel sequence partition model in step S200 is:

wherein the expression yields the partition result flag computed by the $p$-th processor computing core for the $i$-th sequence; $I$ is the indicator function, which takes the value 0 when its argument is less than $\alpha$; $\alpha$ is the partition threshold; $n$ is the average sequence length; $k$ is the size of the sliding window used for feature extraction in the k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; $i$ is the first index; $j$ is the second index; $i'$ is the third index; $j'$ is the fourth index; $NFactor$ is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the first k-mer term is the k-mer sequence of length $k$ acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; and $q_{i',\,j':j'+k+1}$ is the k-mer sequence computed at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.

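The range of values of the splicing parameter in step S300 is as follows:

$10000 \le \tau \le r$, wherein $\tau$ is the splicing parameter and $r$ is the total number of character string sequences in the file to be compressed.

The expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in step S400 is as follows: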

wherein the expression yields the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ are respectively the first, second, third, fourth and fifth weight parameters computed by the multiple linear regression analysis prediction method; the mode-character term is the mode character computed by the $p$-th processor computing core for the $i$-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index; $j$ is the second index; $v$ is the fifth index; $I$ is the indicator function; the two character terms denote the characters at the $j$-th and $(j-1)$-th positions of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core; the run-length term is the run length computed by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; $|\Theta|$ is the size of the character value space of the quality score data; and the final term is the size of the character value space of the k-mer sequence acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence.

Further, the objective of the multiple linear regression prediction is to solve for the optimal weight vector of the parallel four-level run prediction mapping model. In the above formula, the parameter multiplied by the first weight parameter is the proportion of mode characters among all sequence characters; the parameter multiplied by the second weight parameter is the proportion of characters with adjacent quality values larger than 7 among all string sequence characters; the parameter multiplied by the third weight parameter is the proportion of characters with run length larger than 3 among all sequence characters; and the parameter multiplied by the fourth weight parameter is the ratio of the character value space used by the current sequence to the source character value space.
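The four proportions described above, and a linear combination of them, can be sketched in Python as follows. Several points are assumptions made for illustration only: "adjacent quality values larger than 7" is read here as an absolute difference of more than 7 between neighbouring characters, the first of the five weights is treated as an intercept, and the weight values in the usage line are arbitrary placeholders rather than fitted results.

```python
from collections import Counter
from itertools import groupby


def sequence_features(seqs, theta_size):
    """Return the four feature proportions for a list of quality strings."""
    total = sum(len(q) for q in seqs)
    if total == 0:
        raise ValueError("at least one non-empty sequence is required")
    # 1) characters equal to the mode character of their own sequence
    mode_hits = sum(Counter(q).most_common(1)[0][1] for q in seqs if q)
    # 2) positions whose neighbour differs by more than 7 (assumed reading)
    jumps = sum(abs(ord(a) - ord(b)) > 7 for q in seqs for a, b in zip(q[1:], q))
    # 3) characters that belong to runs longer than 3
    long_runs = sum(
        length
        for q in seqs
        for length in (len(list(group)) for _, group in groupby(q))
        if length > 3
    )
    # 4) share of the source character value space actually used
    used = len(set("".join(seqs)))
    return [mode_hits / total, jumps / total, long_runs / total, used / theta_size]


def predict_gain(weights, features):
    """Linear prediction: weights[0] as intercept (assumed), then one per feature."""
    w0, *rest = weights
    return w0 + sum(w * x for w, x in zip(rest, features))


if __name__ == "__main__":
    feats = sequence_features(["IIIIIF", "##IIII"], theta_size=41)
    print(feats, predict_gain([0.1, 0.5, -0.2, 0.4, -0.3], feats))
```

In practice the weights would come from fitting the regression on files whose actual gain is known; the sketch only shows how a feature vector is turned into a prediction.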

The expression of the parallel four-level run prediction mapping model in step S400 is:

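Further, the four parameters that are multiplied by the weight parameters in the expression for the compression rate gain predicted by the multiple linear regression analysis prediction method are recorded as the sequence feature vector, and the parallel four-level run prediction mapping model can be established by taking this sequence feature vector as input.

Step S500 includes steps S510 to S530 described above: dividing the two redundancy elimination sub-files into blocks through the data blocking strategy, carrying out parallel context modeling and cascade compression on the blocks through the multi-core processor cluster, and merging and packing the resulting files into the final compressed file.

The method further includes step S600, in which the final compressed file is decompressed in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.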

Furthermore, the ZPAQ algorithm is used for cascade compression and cascade decompression; the final compressed file is named Q.save, and the losslessly recovered decompressed output file is named Q.recovery.

In some embodiments, the invention is evaluated and verified on 11 sets of real open-source data from the NCBI (National Center for Biotechnology Information) database. The total file size is 72,618,889 KB, comprising 219,959,565 quality score sequences and 20,363,018,799 quality score characters in total. In the experiments, the QSD parallel compression algorithm of the invention and four benchmark methods, ZPAQ, qscomp, LCQS and CMIC, are tested and evaluated, and compression performance is measured by running the 11 data sets on a single node with 28 CPU computing cores.
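The two counts reported for the test data give the average quality string length over all 11 data sets, which corresponds to the average sequence length used in the models. The short calculation below only restates the reported totals; the variable names are illustrative.

```python
TOTAL_SEQUENCES = 219_959_565      # quality score sequences reported above
TOTAL_CHARACTERS = 20_363_018_799  # quality score characters reported above

# Average number of quality characters per sequence across the data sets.
average_length = TOTAL_CHARACTERS / TOTAL_SEQUENCES

if __name__ == "__main__":
    print(f"average quality string length: {average_length:.2f} characters")
```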

The experiments collect the normalized peak memory overhead and the normalized time overhead; the lower an algorithm's time and peak memory overhead, the better its performance. The results show that the compression method provided by the invention achieves the best overall result on 4 of the 5 metrics: compression rate, compression time, compression peak memory overhead, decompression time and decompression peak memory overhead. This indicates that the large-scale high-throughput genome sequencing quality score data parallel compression method provided by the invention is effective in improving the compression rate of large-scale QSD while reducing compression time, decompression time and peak memory overhead.

In some embodiments, multi-core CPUs accelerate the sequence partitioning and four-level run prediction mapping calculations, and a multi-core CPU cluster accelerates ZPAQ for the cascade of context modeling and arithmetic coding. The parallel speedup of the invention is verified with different numbers of cores on a single node and with 4 cores on multiple nodes.

The experiments collect two parallel speedups: one is the ratio between the parallel computation time and the serial computation time, and the other is the ratio between the multi-node computation time and the single-node computation time. The results show that, as the data scale grows, the parallel speedup of compression and decompression of the large-scale high-throughput genome sequencing quality score data parallel compression method provided by the invention trends upward overall as the number of CPU cores increases, which indicates that multi-core CPU parallel acceleration on a single node is significant. Parallel acceleration by the multi-core CPU cluster is also significant, particularly for decompression.

The high-throughput genome sequencing quality score data parallel compression method provided by the invention takes improving the QSD compression rate and reducing time and peak memory consumption as its overall goal. While significantly reducing QSD compression time and peak memory overhead, it improves the QSD compression rate, reduces the size of the files to be stored, and saves construction cost for storage infrastructure.
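A speedup of this kind is conventionally computed as the reference time divided by the accelerated time, so that values above 1 mean the accelerated run is faster. The text does not spell out the order of the ratio, so the sketch below adopts that convention as an assumption, and the timings in the usage lines are made-up illustrative numbers, not measurements from the experiments.

```python
def speedup(reference_seconds: float, accelerated_seconds: float) -> float:
    """Conventional speedup: reference time divided by accelerated time."""
    if reference_seconds < 0 or accelerated_seconds <= 0:
        raise ValueError("timings must be positive")
    return reference_seconds / accelerated_seconds


if __name__ == "__main__":
    # Hypothetical timings, for illustration only.
    print(speedup(120.0, 30.0))  # serial run versus multi-core run
    print(speedup(30.0, 12.0))   # single-node run versus multi-node run
```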

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

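1. A method for parallel compression of high throughput genome sequencing quality score data, comprising:

S100: dividing an original gene sequencing file to obtain the quality score data as a file to be compressed;

S200: randomly sampling the file to be compressed, carrying out k-mer analysis on the sampled data to obtain statistical characteristic information of the file to be compressed, and establishing a parallel sequence partition model according to the statistical characteristic information;

S300: performing binary classification on the file to be compressed through the parallel sequence partition model to obtain a first partition sub-file and a second partition sub-file, and splicing the first partition sub-file and the second partition sub-file according to splicing parameters to obtain a first spliced sub-file and a second spliced sub-file;

S400: predicting a compression rate gain for the file to be compressed by a multiple linear regression analysis prediction method, establishing a parallel four-level run prediction mapping model according to the compression rate gain, and carrying out data redundancy elimination on the first spliced sub-file and the second spliced sub-file respectively through the parallel four-level run prediction mapping model to correspondingly obtain a first redundancy elimination sub-file and a second redundancy elimination sub-file;

S500: carrying out context modeling on the first redundancy elimination sub-file and the second redundancy elimination sub-file through a multi-core processor cluster, and carrying out cascade compression in combination with arithmetic coding to obtain a final compressed file.

2. The high throughput genome sequencing quality score data parallel compression method of claim 1, further comprising:

S600: decompressing the final compressed file in parallel through the multi-core processor cluster to obtain a losslessly recovered final output file.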

3. The high throughput genome sequencing quality score data parallel compression method of claim 1, wherein the raw gene sequencing file comprises sequencing description information data, sequencing DNA sequence data, sequencing additional information data, and sequencing quality score data.

4. The method according to claim 1, wherein the expression of the parallel sequence partitioning model in step S200 is:

wherein $i$ is the first index; $j$ is the second index; $i'$ is the third index; $j'$ is the fourth index; the expression yields the partition result flag computed by the $p$-th processor computing core for the $i$-th sequence; $I$ is the indicator function, which takes the value 0 when its argument is less than $\alpha$; $\alpha$ is the partition threshold; $n$ is the average sequence length; $k$ is the size of the sliding window used for feature extraction in the k-mer analysis; $M$ is the number of samples drawn when randomly sampling the file to be compressed; $NFactor$ is the k-mer frequency normalization factor for the random sampling of the file to be compressed; the first k-mer term is the k-mer sequence of length $k$ acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; and $q_{i',\,j':j'+k+1}$ is the k-mer sequence computed at the $j'$-th position of the $i'$-th randomly sampled sequence $q_{i'}$.

5. The method according to claim 1, wherein the expression of the compression rate gain predicted by the multiple linear regression analysis prediction method in step S400 is:

wherein the expression yields the compression rate gain; $w_0$, $w_1$, $w_2$, $w_3$ and $w_4$ are respectively the first, second, third, fourth and fifth weight parameters computed by the multiple linear regression analysis prediction method; the mode-character term is the mode character computed by the $p$-th processor computing core for the $i$-th sequence; $n$ is the average sequence length; $N$ is the number of spliced sequences; $i$ is the first index; $j$ is the second index; $v$ is the fifth index; $I$ is the indicator function; the two character terms denote the characters at the $j$-th and $(j-1)$-th positions of the $i$-th QSD sequence in the $v$-th spliced partition file handled by the $p$-th processor computing core; the run-length term is the run length computed by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence; $|\Theta|$ is the size of the character value space of the quality score data; and the final term is the size of the character value space of the k-mer sequence acquired by the $p$-th processor computing core at the $j$-th position of the $i$-th sequence.

6. The method according to claim 5, wherein the expression of the parallel four-level run prediction mapping model in step S400 is:

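7. The method of claim 1, wherein step S500 comprises:

S510: dividing the first redundancy elimination sub-file and the second redundancy elimination sub-file respectively through a data blocking strategy to obtain a plurality of blocks to be compressed;

S520: carrying out parallel context modeling and cascade compression on the plurality of blocks to be compressed through the multi-core processor cluster to obtain a plurality of files to be merged;

S530: merging and packing the plurality of files to be merged to obtain the final compressed file.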

8. The method for parallel compression of high-throughput genome sequencing quality score data according to claim 1, wherein the range of values of the splicing parameters in step S300 is:

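$10000 \le \tau \le r$, wherein $\tau$ is the splicing parameter and $r$ is the total number of character string sequences in the file to be compressed.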