CN111640467A

CN111640467A - Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Info

Publication number: CN111640467A
Application number: CN202010446416.1A
Authority: CN
Inventors: 牛毅; 马明明; 李甫; 田英轩; 石光明
Original assignee: Xidian University
Current assignee: Xidian University
Priority date: 2020-05-25
Filing date: 2020-05-25
Publication date: 2020-09-08
Anticipated expiration: 2040-05-25
Also published as: CN111640467B

Abstract

The invention provides a lossless compression method for DNA sequencing quality scores based on an adaptive coding order. It mainly addresses the low compression ratio of existing quality score compression methods, which results from insufficiently accurate prediction models. The implementation scheme is as follows: 1) extracting the quality score data and the base data from the FASTQ file into two coding compression blocks P₁ and P₂; 2) calculating the mean quality score of each row of the first coding compression block P₁ and quantizing it to obtain an M × 1 row-mean matrix F; 3) collecting the context information, base information and row-mean information of the character being coded; 4) setting two identifiers C and D and uniformly quantizing the information collected in step 3) to construct a coding model; 5) driving an adaptive arithmetic coder with the coding model, and traversing and compressing the first coding compression block P₁ in a serpentine coding order along the direction of strongest correlation. The invention improves compression efficiency and can be used for storing and transmitting gene data.

Description

Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Technical Field

The invention belongs to the technical field of data compression, and particularly relates to a lossless compression method for DNA sequencing quality scores, which can be used for compressing biological gene sequencing data.

Background

Sequencing has gradually become a widely used technique in biological research. It acquires the genetic information of different organisms and helps improve our understanding of the living world. With the rapid development of next-generation high-throughput sequencing (NGS), sequencing companies represented by Illumina continue to develop new sequencing technologies, so that sequencing costs are falling rapidly: the price of human whole genome sequencing (WGS) has dropped to 1000 dollars or lower, and is still falling faster than Moore's law would predict. Under these conditions, the volume of next-generation sequencing data is expected to exceed that of astronomical data, and the overhead of storing and transmitting such data keeps growing. Reducing the size of gene sequencing data through data compression, and thereby reducing storage and transmission costs, is therefore of great significance. Research on gene compression tools has produced many results, but no existing scheme reduces the code stream from the perspective of coding order, so compression efficiency still has room for improvement.

Next-generation sequencing instruments generate thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information that the sequencing produces. Each short read contains three parts: first, metadata describing information such as the sequencing platform; second, the DNA base sequence, which records the DNA fragment obtained in the current short read; third, the quality scores, which indicate the measurement reliability of each symbol in the corresponding DNA base sequence. The quality score data in the FASTQ format is highly random and noisy, depends on factors such as the sequencing instrument and the sequencing method, generally comprises dozens of distinct characters, is difficult to compress, and generally accounts for about 70% of a compressed file. The compression result for the quality score data therefore plays a key role in the compression of FASTQ data as a whole.

At present, typical methods for lossless compression of quality scores in gene sequencing data mainly include the following:

The first uses existing general-purpose text compression tools, such as Gzip and 7z, which are the most common way of compressing FASTQ files. These tools are designed for ordinary character sequences and do not consider the particular characteristics of quality score data, so they compress gene sequencing data poorly.

The second comprises improved run-length and dictionary methods developed for gene data compression. In most cases these methods compress worse than entropy coding methods and cannot reduce the compression rate as far as possible.

The third comprises compression algorithms designed for quality scores, such as Quip, which use a high-order Markov model to predictively code the quality scores. Although they compress well, they occupy a large amount of storage, their prediction models are overly complex to compute, and they do not consider the influence of coding order on compression, so compression time is long and the algorithms are not robust.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, so as to maximize the compression effect without increasing compression time.

The technical scheme of the invention is as follows: first, the base sequences and quality score sequences are extracted from a FASTQ file; then the mean of each row of quality score data is calculated and quantized, and a prediction model is constructed from context information, mean information and base information; finally, a serpentine coding order is used to drive an arithmetic coder to code the sequence, thereby compressing the quality scores. The concrete implementation is as follows:

(1) extracting the quality score data and the base data from the FASTQ file:

(1a) statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P₁ and P₂ of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score line, and N ≤ 150;

(1b) extracting the quality score data and the base data stored in the FASTQ file into the first coding compression block P₁ and the second coding compression block P₂, respectively;

(2) calculating the mean quality score of each row of the first coding compression block P₁ and quantizing it to obtain an M × 1 row-mean matrix F;

(3) collecting the context information, base information and row-mean information of the character being coded, uniformly quantizing them, and calculating the final coding model:

(3a) building a model for the current character q to be coded: recording the four preceding characters q₁, q₂, q₃ and q₄; taking from the second coding compression block P₂ the base information corresponding to the current character and to the previous character, denoted j₁ and j₂; and taking from the row-mean matrix F the mean of the row containing q, denoted f, where f is the quantized result; for edge characters lacking some of this information, q₁, q₂, q₃ and q₄ take the same symbol or are set to zero;

(3b) reducing model cost by quantizing the whole model: the larger of the first two characters q₁ and q₂ is denoted A, the larger of the last two characters q₃ and q₄ is denoted B, two different identifiers C and D are created, and the final coding model of the current coded symbol is calculated:

P_now = A·B·C·D·j₁·j₂·f

where C = 1 when q₁ = q₂ and C = 0 otherwise; D = 1 when q₃ = q₄ and D = 0 otherwise; and P_now is the probability estimate for the current coded symbol;

(4) driving an adaptive arithmetic coder with the final coding model, and traversing and compressing the first coding compression block P₁ in a serpentine coding order along the direction of strongest correlation.

Compared with the prior art, the invention has the following advantages:

1. The invention makes full use of the probability update mechanism of the arithmetic coder, so that for FASTQ files of the same length its compression ratio on quality score data is better than that of all current algorithms.

2. The invention compresses the mean quality score of each row together with the quality score data, which makes the mean values easy to collect and access in downstream processing.

3. Because the designed coding model has a simple structure, the invention is highly portable, is easy to optimize and to integrate into the compression of a whole FASTQ file, can be applied widely in compression schemes that use this module, and extends well.

Drawings

FIG. 1 is a flow chart of an implementation of the present invention;

FIG. 2 is a schematic diagram of the quantization of the quality score row means in the present invention;

FIG. 3 is a schematic diagram of a serpentine scan sequence used in the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples.

Referring to fig. 1, the implementation steps of the invention are as follows:

Step 1, extracting the quality score data and the base data from the FASTQ file.

Gene sequencing can produce thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information produced by sequencing. In the FASTQ file format, each short read occupies four lines, each terminated by a line break, where:

the first line starts with the '@' character, followed by a unique sequence identifier and optionally a sequence description, the identifier being separated from the description by a space;

the second line is the nucleotide sequence, i.e. the base data, consisting only of the five characters { 'A', 'T', 'C', 'G', 'N' }, where the character 'N' represents an ambiguous base;

the third line starts with the character '+', optionally followed by the sequence identifier and description again, and acts as a separator;

the last line is the quality score line, in which each character gives the quality of the base at the corresponding position in the second line. The quality score is the number Q = -10 log₁₀ P, where P is the probability that the corresponding base in the read was called incorrectly. Quality scores are typically represented by ASCII characters in the range [33, 73] or [64, 104], and are used both for quality control of the raw data and for downstream processing.
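The relationship between a quality character, its score Q and the error probability P can be illustrated with a short Python sketch. The function names are hypothetical, and the default ASCII offset of 33 is an assumption corresponding to the [33, 73] range mentioned above; files using the [64, 104] range would need an offset of 64.

```python
import math

def phred_from_error_prob(p):
    """Quality score Q = -10 * log10(P) for an error probability P."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse mapping: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

def decode_quality_line(line, offset=33):
    """Convert the ASCII characters of a quality line to integer scores.

    offset=33 is an assumption; use 64 for files in the [64, 104] range.
    """
    return [ord(ch) - offset for ch in line]

scores = decode_quality_line("I5!")   # 'I' is ASCII 73, '5' is 53, '!' is 33
print(scores)
```

With an offset of 33, the character 'I' decodes to a score of 40, which corresponds to an error probability of 10⁻⁴.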

The specific implementation of this step is as follows:

1.1) statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P₁ and P₂ of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score line, and N ≤ 150;

1.2) extracting the quality score data and the base data stored in the FASTQ file into the first coding compression block P₁ and the second coding compression block P₂, respectively;
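As a rough illustration of steps 1.1) and 1.2), the following Python sketch reads four-line FASTQ records and groups them into pairs of blocks, one holding quality lines (playing the role of P₁) and one holding base lines (playing the role of P₂). The function name `read_blocks`, the representation of a block as a list of strings, and the small `max_rows` value are assumptions made for the example, not details taken from the patent.

```python
import io

def read_blocks(handle, max_rows):
    """Yield (quality_block, base_block) pairs of at most max_rows reads each."""
    quals, bases = [], []
    while True:
        header = handle.readline()
        if not header:                      # end of file
            break
        if not header.strip():              # tolerate blank lines between records
            continue
        seq = handle.readline().rstrip("\n")
        handle.readline()                   # '+' separator line, ignored here
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed FASTQ record")
        quals.append(qual)
        bases.append(seq)
        if len(quals) == max_rows:          # block is full: hand it over
            yield quals, bases
            quals, bases = [], []
    if quals:                               # final, possibly shorter block
        yield quals, bases

sample = "@r1\nATCGG\n+\nEFGHI\n@r2\nGGCTA\n+\nIIHGF\n@r3\nNATCG\n+\n!!EFG\n"
for p1, p2 in read_blocks(io.StringIO(sample), max_rows=2):
    print(p1, p2)
```

Processing one block at a time keeps memory use bounded regardless of file size, which matches the block-by-block processing described for this step.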

Because most FASTQ files contain fewer than 40 distinct quality score characters and adjacent scores do not jump sharply, a prediction model that exploits the correlation in the data can be designed to improve compression. At the same time, using too many correlated characters not only increases time and computational complexity but in some cases also incurs model cost, so an appropriately sized compression block is needed to gather statistics on the correlation between quality scores. Within the limits of the available computing resources, a larger compression block gives better compression; to stay within the maximum memory, this embodiment uses a compression block of 2000000 × 160. The total number of models is set to 40 × 40 × 40 × 16. During actual compression, one compression block of data is processed at a time until the end of the file.

Step 2, calculating the mean quality score of each row of the first coding compression block P₁ and quantizing it to obtain an M × 1 row-mean matrix F.

2.1) averaging each row of the first coding compression block P₁ of size M × N: the N quality score values of the row are summed and divided by N to obtain the mean quality score of that row;
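Step 2.1) amounts to a per-row average, sketched below in Python. The function name `row_means` is hypothetical, and subtracting an ASCII offset of 33 before averaging is an assumption; the text does not state whether the mean is taken over raw ASCII codes or over offset-corrected scores.

```python
def row_means(block, offset=33):
    """Return the mean quality score of every row of a quality block.

    block is a list of quality strings; offset=33 is an assumed ASCII offset.
    """
    means = []
    for row in block:
        scores = [ord(ch) - offset for ch in row]
        means.append(sum(scores) / len(scores))   # sum of N values divided by N
    return means

print(row_means(["II", "!I"]))
```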

2.2) quantizing and storing the mean quality score obtained for each row:

Referring to FIG. 2, after the mean quality score of each row has been computed, the means are clustered according to their distribution: mean values that occur frequently are subdivided, while values that are low and occur rarely are merged, which improves coding efficiency. For a specific file, a dedicated quantization scheme could be designed from its mean distribution to achieve the best effect, but this would increase the amount of computation and add considerable computing time. This embodiment therefore selects a quantization scheme that extends well and is easy to implement: every two adjacent mean values are treated as the same case, and all of the low, rarely occurring values are treated as a single case. Summarizing the quantization experience, the quantization results are as follows:

If f_i < (num-15), then f_i = (num-15);

If (num-15) ≤ f_i < (num-13), then f_i = (num-13);

If (num-13) ≤ f_i < (num-11), then f_i = (num-11);

If (num-11) ≤ f_i < (num-9), then f_i = (num-9);

If (num-9) ≤ f_i < (num-7), then f_i = (num-7);

If (num-7) ≤ f_i, then f_i = (num-6);

where num = 40 is the total number of coded symbols, f_i is the mean of the current row, and i ∈ [1, M].

The quantized row means f_i are arranged as a column to obtain the M × 1 row-mean matrix F.
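The six quantization rules can be written as a small Python function. This is a sketch that transcribes the thresholds as stated, with num defaulting to 40; the function name `quantize_row_mean` is hypothetical.

```python
def quantize_row_mean(f_i, num=40):
    """Map a row mean to one of six representative levels, per the rules above."""
    if f_i < num - 15:
        return num - 15
    if f_i < num - 13:
        return num - 13
    if f_i < num - 11:
        return num - 11
    if f_i < num - 9:
        return num - 9
    if f_i < num - 7:
        return num - 7
    return num - 6

# Build the row-mean column F for a few example means.
F = [quantize_row_mean(m) for m in (10.0, 26.5, 32.9, 38.0)]
print(F)
```

With num = 40 the possible outputs are 25, 27, 29, 31, 33 and 34, so f contributes only six distinct values to the context model.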

Step 3, collecting the context information, base information and row-mean information of the character being coded, uniformly quantizing them, and calculating the final coding model.

3.1) building a model for the current character q to be coded: recording the four preceding characters q₁, q₂, q₃ and q₄; taking from P₂ the base information corresponding to the current character and to the previous character, denoted j₁ and j₂; and denoting the mean of the row containing q as f, where f is the quantized result; for edge characters lacking some of this information, q₁, q₂, q₃ and q₄ may take the same symbol or be set to zero.

For example, suppose the contents of the first coding compression block P₁ are E, F, G, H, I and the contents of the second coding compression block P₂ are A, T, C, G, G.

When building the coding model for the third character "G" of the first coding compression block P₁, the four preceding characters are: q₁ = F, q₂ = E, q₃ = 0, q₄ = 0; the base information corresponding to the current character and the previous character is: j₁ = C, j₂ = T; the row mean is: mean(ASCII(E) + ASCII(F) + ASCII(G) + ASCII(H) + ASCII(I));

When building the coding model for the fifth character "I" of P₁, the four preceding characters are: q₁ = H, q₂ = G, q₃ = F, q₄ = E; the base information corresponding to the current character and the previous character is: j₁ = G, j₂ = G; the row mean is: mean(ASCII(E) + ASCII(F) + ASCII(G) + ASCII(H) + ASCII(I));

It can thus be seen that a model can be built in the same way for edge characters that lack some of the above information.
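The worked example can be reproduced with a short Python sketch. The function name `context_for` is hypothetical; the sketch takes the preceding characters from the same row, as the example does, and pads missing positions with 0. Padding a missing previous base with 0 as well is an assumption, since the text only specifies the treatment of q₁ to q₄.

```python
def context_for(qual_row, base_row, pos, pad=0):
    """Return ((q1, q2, q3, q4), j1, j2) for the character at index pos.

    q1 is the nearest preceding quality character and q4 the farthest;
    positions before the start of the row are filled with pad.
    """
    prev = []
    for k in range(1, 5):
        idx = pos - k
        prev.append(qual_row[idx] if idx >= 0 else pad)
    j1 = base_row[pos]                            # base under the current character
    j2 = base_row[pos - 1] if pos >= 1 else pad   # base under the previous character
    return tuple(prev), j1, j2

print(context_for("EFGHI", "ATCGG", 2))   # third character, "G"
print(context_for("EFGHI", "ATCGG", 4))   # fifth character, "I"
```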

3.2) Given that the total number of models is limited in practice, model cost is reduced by quantizing the whole model: the larger of q₁ and q₂ is denoted A, the larger of q₃ and q₄ is denoted B, and two different identifiers C and D are created, where C indicates whether q₁ and q₂ are equal and D indicates whether q₃ and q₄ are equal. The model finally selected for the current coded symbol is therefore: P_now = A·B·C·D·j₁·j₂·f,

where P_now is the probability estimate for the current coded symbol.
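One possible reading of step 3.2) is sketched below in Python. It treats the "·" in the formula as combining the seven components into a single context label that selects a probability table, rather than as numeric multiplication; that interpretation, the function name `model_key`, and the use of integer quality scores for q₁ to q₄ are all assumptions.

```python
def model_key(q1, q2, q3, q4, j1, j2, f):
    """Combine the quantized context into one hashable key (A, B, C, D, j1, j2, f)."""
    a = max(q1, q2)                # A: larger of the two nearest scores
    b = max(q3, q4)                # B: larger of the two farther scores
    c = 1 if q1 == q2 else 0       # C: flags whether q1 and q2 are equal
    d = 1 if q3 == q4 else 0       # D: flags whether q3 and q4 are equal
    return (a, b, c, d, j1, j2, f)

print(model_key(38, 40, 12, 12, "C", "T", 33))
```

Replacing four raw scores by two maxima and two one-bit flags shrinks the number of distinct contexts considerably, which is the model cost reduction the step describes.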

Step 4, driving the adaptive arithmetic coder with the designed final coding model, and traversing and compressing the first coding compression block P₁ in a serpentine coding order along the direction of strongest correlation.

4.1) obtaining a more accurate probability estimate P_now for the current coded character from the final coding model, and sending it to the adaptive arithmetic coder as the optimal prediction result;
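The probability estimate handed to an adaptive arithmetic coder is commonly maintained as per-context symbol counts that are updated after every coded symbol. The Python sketch below shows only that estimation step, not the arithmetic coder itself; the class name `AdaptiveModel`, the initial count of 1 per symbol, and the increment of 1 are assumptions rather than details given in the patent.

```python
import math
from collections import defaultdict

class AdaptiveModel:
    """Per-context adaptive frequency table (estimation only, no coder)."""

    def __init__(self, num_symbols=40):
        # Every context starts with a count of 1 per symbol (assumed initialization).
        self.counts = defaultdict(lambda: [1] * num_symbols)

    def probability(self, key, symbol):
        """Current estimate of P(symbol | context key)."""
        counts = self.counts[key]
        return counts[symbol] / sum(counts)

    def update(self, key, symbol):
        """Adapt the table after the symbol has been coded."""
        self.counts[key][symbol] += 1

model = AdaptiveModel(num_symbols=4)
key = (40, 12, 0, 1, "C", "T", 33)        # an example context label
for symbol in (2, 2, 2):
    p = model.probability(key, symbol)
    print(f"p = {p:.3f}, ideal cost = {-math.log2(p):.3f} bits")
    model.update(key, symbol)
```

As the same symbol recurs in a context, its estimated probability rises and its ideal code length falls, which is how a more accurate context model translates into a smaller code stream.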

4.2) the coder traverses and compresses the block:

During coding, the characters are scanned and coded one by one. The conventional scan traverses row by row by default: after a whole row has been traversed, scanning continues from the start of the next row. This embodiment instead scans column by column: after a column has been coded, the end of the next column is taken as the starting point and that column is traversed in the reverse direction, from bottom to top. Repeating this cycle yields an overall serpentine scan, as shown in FIG. 3. All characters are traversed and coded in this way to achieve the final lossless compression.
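The serpentine column scan can be expressed as a generator of (row, column) positions. This Python sketch assumes that the first column is traversed from top to bottom and that all rows have the same length; the function name `serpentine_order` is hypothetical.

```python
def serpentine_order(num_rows, num_cols):
    """Yield (row, col) positions column by column, reversing direction each column."""
    for col in range(num_cols):
        if col % 2 == 0:
            rows = range(num_rows)                 # even columns: top to bottom
        else:
            rows = range(num_rows - 1, -1, -1)     # odd columns: bottom to top
        for row in rows:
            yield row, col

block = ["EF", "GH", "IJ"]                         # 3 rows, 2 columns
visited = "".join(block[r][c] for r, c in serpentine_order(3, 2))
print(visited)
```

Because each column starts where the previous one ended, consecutive coded symbols are always neighbors in the block, so the adaptive model never sees a jump from the bottom of one column to the top of the next.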

The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.

Claims

1. A lossless compression method for DNA sequencing quality scores based on an adaptive coding order, characterized by comprising the following steps:

(1) extracting the quality score data and the base data from the FASTQ file:

(1a) statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P₁ and P₂ of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score line, and N ≤ 150;

P_now＝A·B·C·D·j₁·j₂·f

(4) driving an adaptive arithmetic coder with the designed final coding model, and traversing and compressing the first coding compression block P₁ in a serpentine coding order along the direction of strongest correlation.

2. The method according to claim 1, wherein the characteristics of the DNA sequencing data in (1a) are that the data contains thousands of reads, each read has four lines, the second line is the base data and the fourth line is the quality score data, all DNA sequencing quality score data is encoded as ASCII characters, and the number of coded symbol types equals the maximum quality score minus the minimum quality score plus 1.

3. The method of claim 1, wherein the quantization in (2) of the mean quality score of each row of the first coding compression block P₁ is performed as follows:

If f_i < (num-15), then f_i = (num-15);

If (num-15) ≤ f_i < (num-13), then f_i = (num-13);

If (num-13) ≤ f_i < (num-11), then f_i = (num-11);

If (num-11) ≤ f_i < (num-9), then f_i = (num-9);

If (num-9) ≤ f_i < (num-7), then f_i = (num-7);

If (num-7) ≤ f_i, then f_i = (num-6);

where f_i is the mean quality score of each row, num is the total number of coded symbols and equals 40, and i ∈ [1, M]; quantizing f_i for every row yields the M × 1 row-mean matrix F.

4. The method of claim 1, wherein driving the adaptive arithmetic coder with the designed final coding model in (4) means feeding the probability estimate P_now of the current symbol to the adaptive arithmetic coder as the optimal prediction result.

5. The method according to claim 1, characterized in that traversing and compressing the first coding compression block P₁ in a serpentine coding order along the direction of strongest correlation in (4) means scanning the first coding compression block P₁ column by column from top to bottom, traversing the next column in reverse from bottom to top after a column has been scanned, and alternating in this way until the whole compression block has been traversed.