CN112863600B

CN112863600B - Data compression method based on exon region insertion

Info

Publication number: CN112863600B
Application number: CN202110388432.4A
Authority: CN
Inventors: 张云翔; 李杨; 刘博; 王亚东
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-04-12
Filing date: 2021-04-12
Publication date: 2022-05-24
Anticipated expiration: 2041-04-12
Also published as: CN112863600A

Abstract

A data compression method based on exon region insertion, relating to the field of data compression. The invention aims to solve the problems of low running speed, narrow compression application range and high compression storage consumption in conventional data compression methods. The method comprises the following steps: preprocessing sequencing short-read DNA data to obtain an exon data set; performing quality control on the exon data set to obtain abnormal values and storing them in a hash table; storing the abnormal values in the hash table in order; compressing and storing the bases in the abnormal values stored in the hash table using Huffman coding; and using the LYZip local decompression method to judge whether the depth of the accumulated insertion sequences has reached 30X. If the depth is greater than 30X, insertion compression cannot be carried out; if it is less than 30X and remains less than 30X after the newly added insertion sequences are accumulated, the compression steps are repeated. The invention is used for compressing data.

Description

Data compression method based on exon region insertion

Technical Field

The invention belongs to the field of data compression, and particularly relates to a data compression method based on exon region insertion.

Background

With the development of bioinformatics, sequencing technology has entered the third-generation era, and third-generation sequencing has become a main research direction in the field. However, its rapid development has brought new problems: the volume of sequences produced outgrows the storage space of databases, and data grows faster than computer capacity. Compressing sequencing data to keep pace with this growth has therefore become an urgent problem.

At present, two methods are mainly used when compressed data already exists and new sequencing data must be compressed into it. The first decompresses the existing compressed data, merges it with the new sequencing data, sorts the result and recompresses it. The sorting relies on the sort function of samtools, which consumes more and more time as the number of sequences grows and thereby slows down the run; in addition, normally aligned base sequences are also compressed, which increases the storage consumed by compression. The second is the LYZip incremental compression method, but its compression depth is only 10X, so its range of application is not wide enough and the amount of data it can compress is small. Existing data compression methods therefore suffer from low running speed, a narrow range of application and high compression storage consumption.

Disclosure of Invention

The invention aims to solve the problems of low running speed, narrow compression application range and high compression storage consumption of the conventional data compression method, and provides a data compression method based on exon region insertion.

A data compression method based on exon region insertion comprises the following specific processes:

step one, preprocessing sequencing short-read DNA data to obtain an exon data set;

step two, performing quality control on the exon data set to obtain abnormal values and storing the abnormal values in a hash table;

step three, storing the abnormal values in the hash table in order;

step four, compressing and storing the bases in the abnormal values stored in the hash table by using Huffman coding;

step five, judging whether the depth of the accumulated insertion sequences reaches 30X by using the LYZip local decompression method; if the depth is greater than 30X, insertion compression cannot be carried out; if the depth is less than 30X and remains less than 30X after the newly added insertion sequences are accumulated, steps one to three are repeated to perform insertion compression again.

The invention has the beneficial effects that:

The invention improves the incremental compression algorithm in LYZip. Starting from sequencing short-read data already compressed to a certain depth by the TPBWT algorithm, it extracts the exon interval data of gene regions and inserts this exon data at the required positions without completely decompressing the original compressed sequencing file, completing the compression at the same time. The invention relies on two characteristics of the TPBWT structure: the next column can be derived from the index structure of the previous column, so no index structure needs to be stored separately; and every site contains the identifier '2'. The first characteristic means that an insertion does not need to modify any index information, which reduces the processing steps and improves the running speed. The second means that every site in the compression interval is occupied, so the subsequent insertion compression algorithm can locate the site to be inserted in linear time. Together, these raise the speed of insertion compression, raise the maximum compression depth to 30X, widen the range of application of the compression and further improve the compression storage capacity.

Drawings

FIG. 1 is a block diagram of the present invention.

Detailed Description

The first embodiment: the exon region insertion compression method can only be applied to sequencing data that has already been compressed. The compressed sequencing data must be third-generation sequencing data for which a TPBWT self-indexing structure has been built. The TPBWT self-indexing structure allows the compressed data to be located quickly, so that the start and end positions at which data must be inserted can be found. Exon region insertion compression mainly takes advantage of two properties of the TPBWT data structure.

First, the TPBWT structure can derive the next column from the index structure of the previous column, so no index structure needs to be stored separately. This ensures that an insertion does not require additional changes to the index information, which reduces the processing steps and increases the running speed.

Second, all sites contain the identifier '2'. Sequencing data is mapped to the reference genome site by site, but compression takes the column corresponding to each site as the object of compression. For sites covered by sequencing short reads, the self-indexing algorithm adds the marker '2' at the end of each column to indicate the end of the column. For sites not covered by any sequencing short read, the self-indexing algorithm also adds the identifier '2' to the column, where it acts as a placeholder. Setting the identifier '2' therefore occupies every site in the compression interval, so the subsequent insertion compression method can locate the site to be inserted in linear time, which improves its compression speed.

In this embodiment, a data compression method based on exon region insertion specifically includes the following steps:

Step one, preprocessing the sequencing short-read DNA data to obtain an exon data set, comprising the following steps:

Step 1.1, screening the data to be compressed and removing sequencing short reads that do not contain gene fragments:

inputting the TPBWT-transformed compressed data, the sequencing short-read data set and a reference genome, screening the data to be compressed against the reference genome, and removing sequencing short reads that do not contain gene fragments. Each input sequencing short read is checked against the exon intervals recognized in public databases. If the read does not lie within any exon interval, it is considered useless and is deleted. If the read contains one or more exons (intervals), it is retained and passed to the next step;

Step 1.2, trimming the sequencing short reads that contain gene fragments:

the exon regions of the short reads retained in step 1.1 are screened one by one, and if a read contains one or more exon regions, those regions are retained completely;

Step 1.3, extracting the base sequences of the exon intervals of the trimmed sequencing short reads to obtain the exon data set:

after screening and trimming, each short read may contain one or more exons, and exons from the same read are stored together.
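The screening, trimming and extraction described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the half-open interval convention, the gap-free alignment of each read to the reference, and the names `extract_exons` and `build_exon_dataset` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: exon intervals are half-open [start, end) reference coordinates.
Interval = Tuple[int, int]

def extract_exons(read_start: int, read_seq: str,
                  exons: List[Interval]) -> List[Tuple[Interval, str]]:
    """Return every exon interval fully contained in the read, with its bases.

    Assumes the read aligns to the reference without gaps, so reference
    position p corresponds to read_seq[p - read_start].
    """
    read_end = read_start + len(read_seq)
    kept = []
    for start, end in exons:
        if read_start <= start and end <= read_end:
            bases = read_seq[start - read_start:end - read_start]
            kept.append(((start, end), bases))
    return kept

def build_exon_dataset(reads: List[Tuple[int, str]],
                       exons: List[Interval]) -> List[List[Tuple[Interval, str]]]:
    """Drop reads that contain no exon; keep exons of the same read together."""
    dataset = []
    for read_start, read_seq in reads:
        hits = extract_exons(read_start, read_seq, exons)
        if hits:  # reads without any exon are considered useless and skipped
            dataset.append(hits)
    return dataset
```

For example, a read `"ACGTACGTAC"` starting at reference position 0 with exon intervals `(2, 5)` and `(6, 9)` yields one group holding the two exon sequences, while a read that contains no exon is dropped.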

The sequencing data input for compression is third-generation sequencing data, most of whose reads are longer than 1000 bp. Longer reads cover as much of a gene region as possible. However, genes are separated by intergenic regions, which can be long. Exon and intron regions occur only within genes and play a decisive role in the expression of traits. This information is the most important in the whole sequencing sequence and requires dedicated lossless compression. The exon fragments extracted from the exon intervals do not need to be numbered, because the object of compression is the exon base sequence itself. Third-generation sequencing reads are used as input only because they can contain exon fragments to a greater extent.

Step two, performing quality control on the exon data set to obtain abnormal values and storing the abnormal values in a hash table, comprising the following steps:

Step 2.1, overall quality control of the exons:

calculating the average sequencing quality score of all the bases of each exon. If the average is lower than 30, the sequencing of the whole exon is considered problematic and the overall data quality poor, and the whole exon is discarded; if the average is higher than 30, the whole sequence is considered reliable, and it is retained and passed to the quality control of variation inside the exon.
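A minimal sketch of this overall check follows, assuming the per-base quality scores arrive as a SAM QUAL string in the standard Phred+33 encoding. The function names are illustrative, and because the text does not say what happens when the average is exactly 30, retaining the exon in that case is an assumption.

```python
from statistics import mean

QUALITY_THRESHOLD = 30  # threshold stated in the text

def phred_scores(qual: str, offset: int = 33) -> list:
    """Decode a SAM QUAL string (Phred+33 ASCII) into integer quality scores."""
    return [ord(ch) - offset for ch in qual]

def passes_overall_qc(qual: str, threshold: int = QUALITY_THRESHOLD) -> bool:
    """Keep an exon when the mean quality of its bases is not below the threshold.

    Assumption: a mean of exactly `threshold` is retained; the text only
    covers the "lower than 30" and "higher than 30" cases.
    """
    return mean(phred_scores(qual)) >= threshold
```

With this encoding, `"IIII"` decodes to four scores of 40 and passes, while `"5555"` decodes to four scores of 20 and is discarded.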

Step 2.2, quality control of variation inside the exons:

Step 2.2.1, introducing the reference genome and aligning the base sequences that passed the overall exon quality control against it; the quality scores of the bases that do not match the reference genome are recorded as abnormal quality scores;

the sequencing quality score of each base can be read from the input alignment result file in SAM format;

bases that match the reference are restored from the base at the corresponding site of the reference genome, so only the abnormal values are stored during compression. This greatly reduces the storage space required by compression and also improves the overall compression speed.
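The idea of storing only abnormal values and restoring everything else from the reference can be sketched as follows. This is a simplified illustration that handles substitutions only; the dictionary layout (site mapped to the observed base) and the name `restore_exon` are assumptions, not the format used by the invention.

```python
def restore_exon(reference: str, start: int, end: int, outliers: dict) -> str:
    """Rebuild the bases of the exon interval [start, end).

    Sites listed in `outliers` (site -> observed base) keep the stored base;
    every other site is restored from the reference genome.
    """
    return "".join(outliers.get(site, reference[site])
                   for site in range(start, end))
```

For instance, with the reference `"ACGTACGT"`, the interval `[2, 6)` and a single stored abnormal value `{3: "A"}`, the restored exon is `"GAAC"`: only one base had to be stored for four sites.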

Step 2.2.2, storing the sites whose abnormal quality score is greater than the preset threshold (30) in the hash table under the corresponding site, and discarding the sites whose abnormal quality score is smaller than the preset threshold (30);

this preserves the continuity of the exon regions of the sequence while eliminating bases that may have been miscalled. If the abnormal quality score is higher than the preset threshold (30), the base at that position is considered accurate, and the variation (single nucleotide variation, insertion/deletion variation or structural variation) is put into the abnormal-value hash table under the corresponding position of the reference genome for subsequent compression.

Step 2.2.3, filling the discarded sites in the base sequence with the bases at the corresponding sites of the reference genome, and storing the filled sites in an array;

through these two quality control steps, fragments with low quality scores and low-quality variations inside fragments are removed. Finally, the variations in the exons that meet the requirements are extracted and stored in the hash table for the subsequent compression operation.
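The internal quality control can be sketched in Python as below. The sketch covers single-base mismatches only (indels and structural variations are out of scope), uses a plain `dict` as the hash table, and treats a score exactly equal to the threshold as discarded; these choices and the name `variant_qc` are assumptions for illustration.

```python
def variant_qc(exon_start: int, bases: str, quals: list,
               reference: str, threshold: int = 30):
    """Split the mismatching sites of one exon into stored and discarded ones.

    Returns (outliers, filled):
      outliers: dict site -> observed base, for mismatches with quality > threshold
      filled:   list of bases in which discarded mismatches are replaced by the
                reference base
    """
    outliers = {}
    filled = []
    for offset, (base, quality) in enumerate(zip(bases, quals)):
        site = exon_start + offset
        ref_base = reference[site]
        if base == ref_base:
            filled.append(base)        # matches the reference, nothing to store
        elif quality > threshold:
            outliers[site] = base      # reliable variation, keep as abnormal value
            filled.append(base)
        else:
            filled.append(ref_base)    # likely miscalled base, fill from reference
    return outliers, filled
```

A high-quality mismatch ends up in `outliers`, while a low-quality mismatch is silently replaced by the reference base, which keeps the exon sequence continuous.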

Step three, storing the abnormal values in the hash table in order, comprising the following steps:

Step 3.1, decompressing the required range using the LYZip local decompression strategy:

the interval that needs to be decompressed is the range of the exon intervals. Local decompression runs from the start position of the compressed file to the maximum end position of the exons. Only the base sequence and the list of abnormal values are decompressed.

Step 3.2, storing the newly added abnormal values in the hash table of each column in order:

the original relative position information is not changed. For example, if abnormal value a of sequence A and abnormal value b of sequence B are at a certain position and a is located before b, this order is kept in the abnormal value table.

Step 3.3, performing the EXPBWT transformation on the original data to be compressed, while the newly added abnormal values in the hash table are not EXPBWT-transformed;

EXPBWT is a self-indexing sequence transformation algorithm that uses the relative positions of the 0 and 1 characters in the previous column to rearrange the following column. At each position, the sequences whose character is 0 are grouped together, and the sequences whose character is 1 are grouped together. Throughout the process the 0 group is always placed before the 1 group. After the EXPBWT transformation, sequences with the most similar prefixes are adjacent to each other.
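The column-by-column reordering described here resembles a positional Burrows-Wheeler transform. The sketch below implements that generic technique on equal-length {0,1} rows; it is offered as an illustration of the principle, not as the actual EXPBWT or TPBWT code, and the function names are hypothetical. It also shows why no separate index is needed: the order used for each column is recomputed from the previous column alone.

```python
def prefix_sort_columns(rows):
    """Emit each column in the order induced by the previous columns.

    rows: list of equal-length lists of 0/1 values.
    After each column, rows holding 0 are placed (stably) before rows holding 1.
    """
    if not rows:
        return []
    order = list(range(len(rows)))
    columns = []
    for k in range(len(rows[0])):
        columns.append([rows[i][k] for i in order])
        zeros = [i for i in order if rows[i][k] == 0]
        ones = [i for i in order if rows[i][k] == 1]
        order = zeros + ones  # the 0 group always precedes the 1 group
    return columns

def restore_rows(columns):
    """Invert prefix_sort_columns using only the transformed columns."""
    if not columns:
        return []
    n = len(columns[0])
    rows = [[] for _ in range(n)]
    order = list(range(n))
    for column in columns:
        for pos, i in enumerate(order):
            rows[i].append(column[pos])
        zeros = [i for pos, i in enumerate(order) if column[pos] == 0]
        ones = [i for pos, i in enumerate(order) if column[pos] == 1]
        order = zeros + ones  # same rule as the forward pass, no stored index
    return rows
```

Calling `restore_rows(prefix_sort_columns(rows))` returns the original rows, which demonstrates that the permutation does not have to be saved alongside the data.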

Compared with the original TPBWT, EXPBWT changes the object of compression: the objects compressed by EXPBWT are exons. As a result, the sequences are all of equal length and aligned. This is completely different from the unequal-length, unaligned sequencing short reads targeted by the original TPBWT compression algorithm.

The abnormal values are not EXPBWT-transformed, and their initial positions are not changed. Consequently, the relative positions of the "1" characters in the {0,1} sequence into which the base sequence is converted do not correspond to the relative positions of the abnormal values in the abnormal value list, because the {0,1} sequence is EXPBWT-transformed while the abnormal values in the list are not.

Each column of the original compressed data is EXPBWT-transformed, while the newly added sequences are not. This reduces the time required for restoration and transformation. The invention adopts the strategy of storing abnormal values: all newly added abnormal values are stored in order in the hash table of each column, which ensures that they are handled differently during decompression.

Step 3.4, storing all abnormal values at the end of each column of the corresponding abnormal value table, and recording in each column the number of newly added sequences:

all sites of EXPBWT contain the identifier '2'; for sites covered by sequencing short reads, the self-indexing algorithm adds the marker '2' at the end of each column to indicate the end of the column. Each column of the hash table therefore ends with '2', and the run length of '2' is either 1 or 2. Every run-length code occupies 1 byte, of which 2 bits encode the character '2' and 1 bit encodes the run length, so 5 bits of free space remain. These 5 bits can represent numbers in the interval 0 to 31, so data with a depth of 30X can be stored. 30X is also the maximum compression depth of exon insertion compression: no matter how many times insertion compression is repeated, the additionally inserted data of each site is placed at the end and can reach a depth of 30X.
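The byte budget described above (2 bits for the character, 1 bit for the run length, 5 free bits) can be made concrete with a small packing sketch. The bit order, the 2-bit code `0b10` for the character '2', and the function names are assumptions; the text only fixes the sizes of the three fields.

```python
SYMBOL_END = 0b10  # assumed 2-bit code for the end character '2'

def pack_end_marker(run_length: int, count: int) -> int:
    """Pack the end marker into one byte: [2 bits symbol][1 bit run][5 bits count]."""
    if run_length not in (1, 2):
        raise ValueError("run length of '2' must be 1 or 2")
    if not 0 <= count <= 31:
        raise ValueError("count must fit in 5 bits (0 to 31)")
    return (SYMBOL_END << 6) | ((run_length - 1) << 5) | count

def unpack_end_marker(byte: int):
    """Return (symbol code, run length, count) from a packed end-marker byte."""
    symbol = byte >> 6
    run_length = ((byte >> 5) & 0b1) + 1
    count = byte & 0b11111
    return symbol, run_length, count
```

Under this layout, `pack_end_marker(1, 5)` gives the byte value 133, and unpacking it returns the symbol code, a run length of 1 and the stored count 5.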

While abnormal values are continually stored at the end, the number of abnormal values must be recorded each time. For example, if 10 compressed sequences are inserted at a position and 5 of them have a variation there, the value recorded is 5. Insertion compression targets the variations in the exon region; bases that match the reference genome are not compressed.
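The bookkeeping in this example can be sketched as follows, assuming the hash table is a mapping from site to an ordered list of variant bases and that a sequence matching the reference at the site is represented by `None`. Both the representation and the name `append_variants` are illustrative assumptions.

```python
from collections import defaultdict

def append_variants(table, counts, site, observed):
    """Append the variants seen at `site` to the end of its column.

    observed: one entry per inserted sequence; None means the base matches
    the reference and is therefore not stored.
    Returns the updated number of stored variants at the site.
    """
    variants = [base for base in observed if base is not None]
    table[site].extend(variants)   # appended in input order, existing order kept
    counts[site] += len(variants)
    return counts[site]

# Ten inserted sequences, five of which carry a variation at site 7.
table, counts = defaultdict(list), defaultdict(int)
observed = ["A", None, "G", None, "A", None, "C", None, "T", None]
recorded = append_variants(table, counts, 7, observed)
```

Here `recorded` is 5, matching the example in the text: only the five variant bases are stored, in their original relative order.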

Step four, compressing the bases {A, T, C, G, N} in the abnormal values (SNP, indel, SV) stored in the hash table by using Huffman coding:

the end character '2' of each decompressed column is compressed by run-length coding, and the previously counted number is stored in the last 5 bits of the run-length code of '2'.
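For readers unfamiliar with Huffman coding, the following is a generic textbook sketch applied to a string of bases. It uses only the Python standard library; the way ties are broken and the bit-string output format are choices made for this example and are not taken from the patent.

```python
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    """Build a prefix-free code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:  # a single symbol still needs a one-bit code
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table)
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def encode(text: str, codes: dict) -> str:
    """Concatenate the code of every symbol in the text."""
    return "".join(codes[sym] for sym in text)

def decode(bits: str, codes: dict) -> str:
    """Decode a bit string produced by encode with the same code table."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)
```

Frequent bases receive shorter codes, so a list of abnormal values dominated by one or two bases compresses to fewer bits than a fixed-width encoding of the five symbols.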

Step five, judging whether the depth of the accumulated insertion sequences reaches 30X by using the LYZip local decompression method. If the depth is greater than 30X, insertion compression cannot be carried out. If the depth is less than 30X and remains less than 30X after the newly added insertion sequences are accumulated, steps one to three are repeated to perform insertion compression again.

The number of variations at each site is accumulated at the end of each column. For example, if 5 variations were inserted at site K before this operation, the number recovered from the last 5 bits of the final character '2' is 5. If 10 new variations are added, the value is updated to 15.
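The accumulation and the depth check can be sketched together. The function name is hypothetical, and allowing a total of exactly 30 is an assumption: the text covers only the "greater than 30X" and "less than 30X" cases while also stating that data of depth 30X can be stored.

```python
MAX_DEPTH = 30  # maximum insertion depth stated in the text

def accumulate_count(existing: int, added: int, max_depth: int = MAX_DEPTH) -> int:
    """Return the updated per-site count, refusing insertions beyond max_depth.

    existing: value recovered from the last 5 bits of the end character '2'
    added:    number of newly added variations at the same site
    """
    total = existing + added
    if total > max_depth:
        raise ValueError("accumulated depth exceeds the limit; insertion compression is not possible")
    return total
```

With the numbers from the example, `accumulate_count(5, 10)` returns 15, whereas `accumulate_count(25, 10)` raises an error because the accumulated depth would exceed 30.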

Claims

1. A data compression method based on exon region insertion is characterized in that the method comprises the following specific processes:

thirdly, decompressing the base sequence and the abnormal value list by using an LYZip local decompression strategy;

the range of local decompression is the range of exon intervals;

the local decompression process is from the starting position of the compressed file to the maximum ending position of the exon;

if the abnormal value a of the A sequence is positioned at a certain position and the abnormal value b of the B sequence is positioned before a, this order is also kept unchanged in the abnormal value table;

step three, performing EXTPWBT transformation on original data to be compressed, and not performing EXTPWBT transformation on the newly added abnormal values in the hash table;

step four, storing all abnormal values at the end of each column of the corresponding hash table, and setting the quantity information of the newly added sequence in each column;

2. The method of claim 1, wherein the method comprises the following steps: in the first step, the sequencing short-read DNA data is preprocessed to obtain an exon data set, and the method comprises the following steps:

screening data to be compressed, and removing sequencing short reads which do not contain gene segments in a sequencing sequence:

inputting TPBWT-transformed compressed data, a sequencing short-read data set and a reference genome, screening the data to be compressed against the reference genome, and removing sequencing short reads that do not contain gene fragments; screening each input sequencing short read against the exon intervals recognized in public databases, and if the read is not within any exon interval, deleting it as useless; if the read contains one or more exons, retaining it;

and step two, shearing the sequencing short read containing the gene fragment:

each short read sequence is screened and spliced to contain one or more exons, and the exons from the same sequence are stored together.

3. The method of claim 2, wherein the method comprises the following steps: in the second step, the quality control is performed on the exon data set, and the abnormal values are stored in the hash table, comprising the following steps:

step two, integrally controlling the quality of the exons;

and step two, performing internal variant quality control on the exons and storing the obtained abnormal values in a hash table.

4. A method for data compression based on exon region insertion as claimed in claim 3, wherein: in the second step, the overall quality control of the exons comprises the following steps:

calculating the average sequencing quality score of all the bases of each exon; if the average is lower than 30, the sequencing of the whole exon is considered problematic and the overall data quality poor, and the whole exon is discarded; if the average is higher than 30, the whole sequence is considered reliable, and the reliable sequence is retained and passed to the quality control of variation inside the exon.

5. The method of claim 4, wherein the method comprises the following steps: and in the second step, quality control of variation inside the exon comprises the following steps:

secondly, storing the sites with the abnormal quality scores larger than the preset threshold value into a hash table under the corresponding sites, and discarding the sites with the abnormal quality scores smaller than the preset threshold value by comparison;

and step two and three, filling up the abandoned sites in the base sequence through bases on corresponding sites of the reference sequence genome, and storing the filled sites in an array.

6. The method of claim 5, wherein the method comprises the following steps: in the third step, all abnormal values are stored at the end of the hash table corresponding to each column, and each column is provided with the quantity information of the newly added sequence, which comprises the following steps:

firstly, obtaining a free space in a hash table:

the TPBWT transformation adds the symbol '2' at the end position of each column to indicate that the column has ended, so each column of the hash table ends with '2'; every run-length code occupies 1 byte, of which encoding '2' takes 2 bits and encoding the run length takes 1 bit, so the free space in the hash table is 5 bits;

then, the free space is used to obtain the storage depth:

by using the 5-bit space to represent numbers between 0 and 31, data with a depth of 30X can be stored; 30X is also the maximum compression depth of exon insertion compression, and no matter how many times insertion compression is repeated, the additionally inserted data of each site is put at the end and can reach a depth of 30X.

7. The method of claim 6, wherein the method comprises the following steps: the abnormal values in the hash table in the fourth step include: snp, indel, sv.

8. The method of claim 7, wherein the method comprises the following steps: the bases in the outliers in step four include: { A, T, C, G, N }.

9. The method of claim 8, wherein the method comprises the following steps: in the second step, the quality score of each base of the base sequence is obtained from the input alignment result file in SAM format.