CN111640467B - Lossless compression method for DNA sequencing quality scores based on an adaptive coding order - Google Patents

Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Info

Publication number
CN111640467B
Authority
CN
China
Prior art keywords
num
row
data
compression
quality score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010446416.1A
Other languages
Chinese (zh)
Other versions
CN111640467A (en)
Inventor
牛毅
马明明
李甫
田英轩
石光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010446416.1A
Publication of CN111640467A
Application granted
Publication of CN111640467B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention provides a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, which mainly addresses the low compression ratio of existing quality score compression methods caused by insufficiently accurate prediction models. The implementation scheme is as follows: 1) extracting the quality score data and the base data in a FASTQ file into two coding compression blocks P_1 and P_2; 2) calculating the mean quality score of each row of the first coding compression block P_1 and quantizing it to obtain an M × 1 row mean matrix F; 3) collecting the context information, base information and row mean information of the character being coded; 4) setting two identifiers C and D, and jointly quantizing the information collected in step 3) to construct a coding model; 5) driving an adaptive arithmetic coder with the coding model, and traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation. The invention improves compression efficiency and can be used for storing and transmitting gene data.

Description

Lossless compression method for DNA sequencing quality scores based on an adaptive coding order
Technical Field
The invention belongs to the technical field of data compression, and particularly relates to a lossless compression method for DNA sequencing quality scores, which can be used for compressing gene sequencing data.
Background
Sequencing has gradually become a widely used technique in biological research; acquiring the genetic information of different organisms helps improve our understanding of the living world. With the rapid development of next-generation high-throughput sequencing (NGS), sequencing companies represented by Illumina continue to develop new sequencing technologies, so sequencing costs have fallen rapidly: the price of human whole genome sequencing (WGS) has dropped to 1000 dollars or even lower, and is still falling faster than Moore's law. As a result, the volume of next-generation sequencing data will exceed that of astronomical data, and the overhead of storing and transmitting such data keeps increasing. It is therefore of great significance to reduce the size of gene sequencing data through data compression, thereby reducing storage and transmission costs. Research on gene compression tools has produced many results, but no existing scheme reduces the bitstream from the perspective of coding order, so compression efficiency still has room for improvement.
Next-generation sequencing generates thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information that sequencing produces. Each short read contains three parts: first, metadata describing information such as the sequencing platform; second, the DNA base sequence, recording the DNA fragment obtained in the current short read; third, the quality scores, indicating the measurement reliability of each symbol in the corresponding DNA base sequence. The quality score data in a FASTQ file is highly random and noisy, depends on factors such as the sequencing instrument and the sequencing method, and generally contains dozens of distinct characters. It is difficult to compress and generally accounts for about 70% of the compressed file, so the compression of the quality score data plays a key role in the compression of the whole FASTQ file.
At present, typical methods for lossless compression of quality scores in gene sequencing data mainly fall into the following categories:
The first uses existing general-purpose text compression tools such as Gzip and 7z, which are the most common way of compressing FASTQ files. These tools are designed for ordinary character sequences and do not consider the particular characteristics of quality score data, so they compress gene sequencing data poorly.
The second comprises improved run-length and dictionary methods developed for gene data compression. In most cases these compress worse than entropy coding methods and cannot minimize the compressed size.
The third comprises compression algorithms designed for quality scores, such as Quip, which use a high-order Markov model for predictive coding of the quality scores. Although they achieve good compression, the storage they occupy is large, the prediction model is too complex to compute, and the influence of coding order on compression is not considered, so compression time is long and the robustness of the algorithm is poor.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, so as to maximize the compression effect without increasing compression time.
The technical scheme of the invention is as follows: first, the base sequences and quality score sequences in a FASTQ file are extracted; then, the mean of each row of quality score data is calculated and quantized, and a prediction model is constructed from context information, mean information and base information; finally, an arithmetic coder is driven to code the data in a serpentine coding order, thereby compressing the quality scores. The concrete implementation is as follows:
(1) Extracting quality score data and base data from the FASTQ file:
(1a) Statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
(1b) Using the first coding compression block P_1 and the second coding compression block P_2 to extract, respectively, the quality score data and the base data stored in the FASTQ file;
(2) Calculating the mean quality score of each row of the first coding compression block P_1 extracted from the FASTQ file, and quantizing it to obtain an M × 1 row mean matrix F;
(3) Collecting the context information, base information and row mean information of the character being coded, jointly quantizing them, and calculating the final coding model:
(3a) Building a model for the current coding character q: taking the four preceding characters q_1, q_2, q_3 and q_4; taking the base information in the second coding compression block P_2 corresponding to the current character and the previous character, denoted j_1 and j_2; and taking the mean of the row containing q from the row mean matrix F, denoted f, where f is the quantized result; for edge characters lacking the above information, the missing q_1, q_2, q_3 and q_4 take the same symbol or are set to zero;
(3b) Reducing model cost by quantizing the whole model: the larger of the first two characters q_1 and q_2 is denoted A, the larger of the last two characters q_3 and q_4 is denoted B, two identifiers C and D are created, and the final coding model of the current coding symbol is calculated:
P_now = A·B·C·D·j_1·j_2·f
where C = 1 when q_1 = q_2 and C = 0 otherwise; D = 1 when q_3 = q_4 and D = 0 otherwise; and P_now is the probability estimate for the current coding symbol;
(4) Driving an adaptive arithmetic coder with the final coding model, and traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
Compared with the prior art, the invention has the following advantages:
1. The invention makes full use of the probability update mechanism of the arithmetic coder, so that its compression ratio for quality score data in FASTQ files of the same read length is better than that of all current algorithms.
2. The invention compresses the mean quality score of each row along with the quality score data, which facilitates statistics on and access to the means in downstream processing.
3. Because the designed coding model has a simple structure, the invention is highly portable, is easy to optimize and to integrate into compression of the whole FASTQ file, can be widely applied in compression schemes that use this module, and has good extensibility.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of the quantization of the quality score row means in the present invention;
FIG. 3 is a schematic diagram of the serpentine scan order used in the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the implementation steps of the invention are as follows:
Step 1, extracting quality score data and base data from the FASTQ file.
Gene sequencing produces thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information produced by sequencing. In the FASTQ file format, each short read occupies four lines, separated by line breaks, where:
the first line starts with the '@' character, followed by a unique sequence identifier and an optional sequence description, the identifier being separated from the description by a space;
the second line is the nucleotide sequence, i.e. the base data, consisting only of the five characters { 'A', 'T', 'C', 'G', 'N' }, where the character 'N' represents an ambiguous base;
the third line starts with the character '+', optionally followed by the sequence identifier and description again, and acts as a separator;
the last line is the quality score line, in which each character gives the quality of the base at the corresponding position in the second line. The quality score is the number Q = -10 log_10(P), where P is the probability that the corresponding nucleotide in the read is erroneous. Quality scores are typically written as ASCII characters with codes in [33:73] or [64:104], and are used both for quality control of the raw data and for downstream processing.
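The four-line record layout and the relation Q = -10 log_10(P) can be illustrated with a minimal Python sketch. The function names `parse_fastq_record` and `phred_to_error_prob`, the sample record, and the default offset of 33 are assumptions made for illustration; they are not part of the patent.

```python
def parse_fastq_record(lines):
    """Split one four-line FASTQ record into (read_id, bases, quality)."""
    header, bases, separator, quality = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(bases) != len(quality):
        raise ValueError("base line and quality line differ in length")
    return header[1:], bases, quality


def phred_to_error_prob(char, offset=33):
    """Invert Q = -10 * log10(P) for one quality character.

    offset=33 is assumed here (the [33:73] range); use 64 for [64:104].
    """
    q = ord(char) - offset
    return 10 ** (-q / 10)


# Hypothetical record used only to exercise the two helpers.
record = ["@read1 example\n", "ATCGN\n", "+\n", "IIH5!\n"]
read_id, bases, quality = parse_fastq_record(record)
error_probs = [phred_to_error_prob(c) for c in quality]
```

With offset 33, the character 'I' (code 73) corresponds to Q = 40 and '!' (code 33) to Q = 0, i.e. an error probability of 1.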
The specific implementation of this step is as follows:
1.1 Statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
1.2 Using the first coding compression block P_1 and the second coding compression block P_2 to extract, respectively, the quality score data and the base data stored in the FASTQ file;
Since most FASTQ files contain fewer than 40 distinct quality score characters and the values do not jump sharply, a prediction model that exploits the correlation in the data can be designed to improve compression. At the same time, using too many related characters as context not only increases time and computational complexity but in some cases also causes a model cost problem, so the correlation between quality scores must be collected over suitably sized compression blocks. Within the limits of the available computing resources, the larger the compression block, the better the compression; so as not to exceed the maximum memory, this embodiment uses a compression block of 2000000 × 160. The total number of models is set to 40 × 40 × 40 × 16. During actual compression, one compression block of data is processed at a time until the end of the file.
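The block-wise extraction described above might be sketched as follows. This is only an illustration under stated assumptions: `read_blocks` is a hypothetical helper, the lines are held in an in-memory list, and short reads are padded with a NUL character, a detail the patent does not specify.

```python
def read_blocks(lines, m_rows, n_cols, pad="\0"):
    """Yield (P1, P2) pairs of at most m_rows rows each.

    P1 collects quality score rows and P2 the matching base rows.
    Padding short rows with `pad` is an assumption of this sketch.
    """
    p1, p2 = [], []
    for i in range(0, len(lines) - 3, 4):
        bases = lines[i + 1].rstrip("\n")
        quality = lines[i + 3].rstrip("\n")
        if len(quality) > n_cols or len(bases) != len(quality):
            raise ValueError(f"record {i // 4} does not fit the block")
        p1.append(quality.ljust(n_cols, pad))
        p2.append(bases.ljust(n_cols, pad))
        if len(p1) == m_rows:
            yield p1, p2
            p1, p2 = [], []
    if p1:  # last, partially filled block at the end of the file
        yield p1, p2


# Three hypothetical reads, processed two rows at a time.
fastq_lines = [
    "@r1", "ATCG", "+", "IIII",
    "@r2", "GGCTA", "+", "HHHHH",
    "@r3", "AT", "+", "5I",
]
blocks = list(read_blocks(fastq_lines, m_rows=2, n_cols=5))
```

A real implementation would stream the file rather than hold all lines in memory, and would use the block dimensions chosen in the embodiment.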
Step 2, calculating the mean quality score of each row of the first coding compression block P_1 extracted from the FASTQ file, and quantizing it to obtain an M × 1 row mean matrix F.
2.1 Averaging each row of the first coding compression block P_1 of size M × N: the N quality score values of a row are summed and divided by N to obtain the mean quality score of that row;
2.2 Quantizing the row means obtained and storing them:
Referring to fig. 2, after the mean quality score of each row has been computed, the means are clustered according to their distribution: mean values that occur frequently are subdivided, and mean values that are low and occur rarely are merged, which improves coding efficiency. For a particular file, a dedicated quantization scheme could be designed from its mean distribution to achieve the best effect, but this would increase the amount of computation and add considerable computing time. This embodiment therefore selects a quantization scheme that is easy to extend and implement: every two adjacent mean values are treated as one case, and all low, rarely occurring mean values are treated as a single case. Summarizing the quantization experience, the quantization rules are as follows:
If f_i < (num - 15), then f_i = (num - 15);
If (num - 15) ≤ f_i < (num - 13), then f_i = (num - 13);
If (num - 13) ≤ f_i < (num - 11), then f_i = (num - 11);
If (num - 11) ≤ f_i < (num - 9), then f_i = (num - 9);
If (num - 9) ≤ f_i < (num - 7), then f_i = (num - 7);
If (num - 7) ≤ f_i, then f_i = (num - 6);
where num = 40 is the total number of coded symbols, f_i is the mean of the current row, and i ∈ [1, M];
The quantized row means f_i are stacked as a column to obtain the M × 1 row mean matrix F.
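The quantization rules above translate almost directly into code. The sketch below is a hedged illustration: `quantize_row_mean` and `row_mean_matrix` are hypothetical names, and subtracting an ASCII offset of 33 before averaging is an assumption made so that the means fall on the same 0 to 40 scale as the thresholds.

```python
def quantize_row_mean(mean, num=40):
    """Map a row mean to its quantized value using the six rules above."""
    thresholds = (num - 15, num - 13, num - 11, num - 9, num - 7)
    if mean < thresholds[0]:
        return thresholds[0]
    for low, high in zip(thresholds, thresholds[1:]):
        if low <= mean < high:
            return high
    return num - 6  # mean >= num - 7


def row_mean_matrix(block, offset=33, num=40):
    """Return the M x 1 matrix F as a list with one quantized mean per row.

    The offset is an assumption of this sketch, not stated in the patent.
    """
    matrix = []
    for row in block:
        mean = sum(ord(c) - offset for c in row) / len(row)
        matrix.append(quantize_row_mean(mean, num))
    return matrix


F = row_mean_matrix(["IIIII", "55555"])
```

Here a row of 'I' characters (Q = 40) lands in the top bin and a row of '5' characters (Q = 20) in the lowest bin.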
Step 3, collecting the context information, base information and row mean information of the character being coded, jointly quantizing them, and calculating the final coding model.
3.1 Building a model for the current coding character q: taking the four preceding characters q_1, q_2, q_3 and q_4; taking the base information in P_2 corresponding to the current character and the previous character, denoted j_1 and j_2; and denoting the mean of the row containing q as f, where f is the quantized result; for edge characters lacking the above information, the missing q_1, q_2, q_3 and q_4 may take the same symbol or be set to zero.
For example, suppose the first coding compression block P_1 contains E, F, G, H, I and the second coding compression block P_2 contains A, T, C, G, G;
When a coding model is built for the third character "G" in the first coding compression block P_1, the four preceding characters take the values q_1 = F, q_2 = E, q_3 = 0, q_4 = 0; the base information corresponding to the current character and the previous character is j_1 = C, j_2 = T; and the row mean is f = mean(ASCII(E), ASCII(F), ASCII(G), ASCII(H), ASCII(I));
When a coding model is built for the fifth character "I" in P_1, the four preceding characters take the values q_1 = H, q_2 = G, q_3 = F, q_4 = E; the base information corresponding to the current character and the previous character is j_1 = G, j_2 = G; and the row mean is f = mean(ASCII(E), ASCII(F), ASCII(G), ASCII(H), ASCII(I));
It can thus be seen that a model can be built in the same way for edge characters lacking the above information.
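The worked example can be reproduced with a short Python sketch. The helper name `context_for` is hypothetical, and missing neighbours are set to zero, which is one of the two options the text allows.

```python
def context_for(quality_row, base_row, pos):
    """Collect (q1..q4, j1, j2) for the character at index `pos`.

    q1 is the nearest preceding quality character and q4 the farthest;
    neighbours that fall outside the row are replaced by 0.
    """
    prev = [quality_row[pos - k] if pos - k >= 0 else 0 for k in range(1, 5)]
    j1 = base_row[pos]                          # base of the current character
    j2 = base_row[pos - 1] if pos >= 1 else 0   # base of the previous character
    return prev, j1, j2


p1_row = "EFGHI"   # quality characters from the example
p2_row = "ATCGG"   # matching bases from the example

third = context_for(p1_row, p2_row, 2)   # character "G"
fifth = context_for(p1_row, p2_row, 4)   # character "I"
```

The results match the values listed in the example: for "G" the context is F, E, 0, 0 with bases C and T, and for "I" it is H, G, F, E with bases G and G.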
3.2 Considering that in practice the total number of models is limited, model cost is reduced by quantizing the whole model: the larger of q_1 and q_2 is denoted A, the larger of q_3 and q_4 is denoted B, and two identifiers C and D are created, C indicating whether q_1 and q_2 are equal and D indicating whether q_3 and q_4 are equal. The model finally selected for the current coding symbol is therefore: P_now = A·B·C·D·j_1·j_2·f.
Here, P_now is the probability estimate for the current coding symbol.
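One way to read the expression for P_now is as a joint context (A, B, C, D, j_1, j_2, f) that selects an adaptive frequency table, rather than as an arithmetic product; this reading is an assumption of the sketch below. The class `AdaptiveModel`, the function `context_key`, the initial count of 1 per symbol, and the alphabet size of 41 are likewise illustrative choices, and the arithmetic coder itself is not implemented here.

```python
import math
from collections import defaultdict


def context_key(q1, q2, q3, q4, j1, j2, f):
    """Quantize the raw context into the tuple (A, B, C, D, j1, j2, f)."""
    a = max(q1, q2)
    b = max(q3, q4)
    c = 1 if q1 == q2 else 0
    d = 1 if q3 == q4 else 0
    return (a, b, c, d, j1, j2, f)


class AdaptiveModel:
    """Per-context symbol counts that are updated after each coded symbol."""

    def __init__(self, alphabet_size):
        # Every symbol starts with count 1 (an assumption of this sketch).
        self.counts = defaultdict(lambda: [1] * alphabet_size)

    def probability(self, key, symbol):
        counts = self.counts[key]
        return counts[symbol] / sum(counts)

    def update(self, key, symbol):
        self.counts[key][symbol] += 1


model = AdaptiveModel(alphabet_size=41)          # scores 0..40, hypothetical
key = context_key(38, 40, 40, 40, "C", "T", 34)  # hypothetical context values
p_before = model.probability(key, 40)
model.update(key, 40)
p_after = model.probability(key, 40)
ideal_bits = -math.log2(p_after)  # ideal code length for this symbol
```

An arithmetic coder spends roughly -log2(P_now) bits on a symbol, so the sharper the estimate under a given context, the shorter the output.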
Step 4, driving the adaptive arithmetic coder with the final coding model, and traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
4.1 The final coding model yields a more accurate probability estimate P_now for the current coding character, which is sent to the adaptive arithmetic coder as the best prediction;
4.2 The coder performs traversal coding and compression:
During coding, the characters are scanned and coded one by one. The conventional scan traverses row by row: when a whole row has been traversed, scanning continues from the start of the next row. This embodiment instead scans by column: after one column has been coded, scanning starts from the end of the next column and traverses it in the reverse direction, from bottom to top, and so on alternately, so that the overall path is a serpentine scan, as shown in fig. 3. All characters are traversed and coded in this way to achieve the final lossless compression.
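The serpentine column scan can be expressed as a small generator of (row, column) positions. This is a minimal sketch with a hypothetical function name `serpentine_order`; it assumes the first column is traversed from top to bottom.

```python
def serpentine_order(m_rows, n_cols):
    """Yield (row, col) positions column by column, alternating direction."""
    for col in range(n_cols):
        if col % 2 == 0:
            rows = range(m_rows)                # top to bottom
        else:
            rows = range(m_rows - 1, -1, -1)    # bottom to top
        for row in rows:
            yield row, col


order = list(serpentine_order(3, 2))
```

For a 3 × 2 block the order is (0,0), (1,0), (2,0), (2,1), (1,1), (0,1), so consecutive positions are always adjacent in the block.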
The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.

Claims (5)

1. A lossless compression method for DNA sequencing quality scores based on an adaptive coding order, characterized by comprising the following steps:
(1) Extracting quality score data and base data from the FASTQ file:
(1a) Statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of rows of quality score data processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
(1b) Using the first coding compression block P_1 and the second coding compression block P_2 to extract, respectively, the quality score data and the base data stored in the FASTQ file;
(2) Calculating the mean quality score of each row of the first coding compression block P_1 extracted from the FASTQ file, and quantizing it to obtain an M × 1 row mean matrix F;
(3) Collecting the context information, base information and row mean information of the character being coded, jointly quantizing them, and calculating the final coding model:
(3a) Building a model for the current coding character q: taking the four preceding characters q_1, q_2, q_3 and q_4; taking the base information in the second coding compression block P_2 corresponding to the current character and the previous character, denoted j_1 and j_2; and taking the mean of the row containing q from the row mean matrix F, denoted f, where f is the quantized result; for edge characters lacking the above information, the missing q_1, q_2, q_3 and q_4 take the same symbol or are set to zero;
(3b) Reducing model cost by quantizing the whole model: the larger of the first two characters q_1 and q_2 is denoted A, the larger of the last two characters q_3 and q_4 is denoted B, two identifiers C and D are created, and the final coding model of the current coding symbol is calculated:
P_now = A·B·C·D·j_1·j_2·f
where C = 1 when q_1 = q_2 and C = 0 otherwise; D = 1 when q_3 = q_4 and D = 0 otherwise; and P_now is the probability estimate for the current coding symbol;
(4) Driving an adaptive arithmetic coder with the final coding model, and traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
2. The method of claim 1, wherein the characteristics of the DNA sequencing data in (1a) are that the DNA sequencing data contains thousands of reads, each read occupies four lines, the second line is base data, the fourth line is quality score data, the DNA sequencing quality score data as a whole is encoded in ASCII, and the number of coded symbol types = maximum quality score value - minimum quality score value + 1.
3. The method of claim 1, wherein the quantization in (2) of the mean quality score of each row of the first coding compression block P_1 is implemented as follows:
If f_i < (num - 15), then f_i = (num - 15);
If (num - 15) ≤ f_i < (num - 13), then f_i = (num - 13);
If (num - 13) ≤ f_i < (num - 11), then f_i = (num - 11);
If (num - 11) ≤ f_i < (num - 9), then f_i = (num - 9);
If (num - 9) ≤ f_i < (num - 7), then f_i = (num - 7);
If (num - 7) ≤ f_i, then f_i = (num - 6);
where f_i is the mean quality score of each row, num = 40 is the total number of coded symbols, and i ∈ [1, M]; quantizing f_i for every row yields the M × 1 row mean matrix F.
4. The method of claim 1, wherein driving the adaptive arithmetic coder with the designed final coding model in (4) means sending the probability estimate P_now of the current symbol to the adaptive arithmetic coder as the best prediction.
5. The method of claim 1, wherein traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation in (4) means scanning the first coding compression block P_1 column by column from top to bottom, traversing the next column in reverse from bottom to top after one column has been scanned, and so on back and forth until the whole compression block has been traversed.
CN202010446416.1A 2020-05-25 2020-05-25 Lossless compression method for DNA sequencing quality scores based on an adaptive coding order Active CN111640467B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010446416.1A CN111640467B (en) 2020-05-25 2020-05-25 Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010446416.1A CN111640467B (en) 2020-05-25 2020-05-25 Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Publications (2)

Publication Number Publication Date
CN111640467A CN111640467A (en) 2020-09-08
CN111640467B true CN111640467B (en) 2023-03-24

Family

ID=72332834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010446416.1A Active CN111640467B (en) 2020-05-25 2020-05-25 Lossless compression method for DNA sequencing quality scores based on an adaptive coding order

Country Status (1)

Country Link
CN (1) CN111640467B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995988A (en) * 2014-05-30 2014-08-20 周家锐 High-throughput DNA sequencing quality score lossless compression system and method
CN105391454A (en) * 2015-12-14 2016-03-09 季检 DNA sequencing quality score lossless compression method
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN108306650A (en) * 2018-01-16 2018-07-20 厦门极元科技有限公司 The compression method of gene sequencing data
WO2019144312A1 (en) * 2018-01-24 2019-08-01 深圳大学 Gpu-accelerated dna sequence compression method and system
CN110111852A (en) * 2018-01-11 2019-08-09 广州明领基因科技有限公司 A kind of magnanimity DNA sequencing data lossless Fast Compression platform

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10090857B2 (en) * 2010-04-26 2018-10-02 Samsung Electronics Co., Ltd. Method and apparatus for compressing genetic data
US12080384B2 (en) * 2015-06-16 2024-09-03 Gottfried Wilhelm Leibniz Universitaet Hannover Method for compressing genomic data

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103995988A (en) * 2014-05-30 2014-08-20 周家锐 High-throughput DNA sequencing quality score lossless compression system and method
WO2015180203A1 (en) * 2014-05-30 2015-12-03 周家锐 High-throughput dna sequencing quality score lossless compression system and compression method
CN105391454A (en) * 2015-12-14 2016-03-09 季检 DNA sequencing quality score lossless compression method
CN106100641A (en) * 2016-06-12 2016-11-09 深圳大学 Multithreading quick storage lossless compression method and system thereof for FASTQ data
WO2017214765A1 (en) * 2016-06-12 2017-12-21 深圳大学 Multi-thread fast storage lossless compression method and system for fastq data
CN110111852A (en) * 2018-01-11 2019-08-09 广州明领基因科技有限公司 A kind of magnanimity DNA sequencing data lossless Fast Compression platform
CN108306650A (en) * 2018-01-16 2018-07-20 厦门极元科技有限公司 The compression method of gene sequencing data
WO2019144312A1 (en) * 2018-01-24 2019-08-01 深圳大学 Gpu-accelerated dna sequence compression method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
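High-throughput DNA sequence data compression algorithm based on codebook index transform; 谭丽 (Tan Li) et al.; 《电子学报》 (Acta Electronica Sinica); 2015-05-15 (No. 05); full text *
Research on compression of short-read biological data based on high-throughput sequencing; 孟倩 (Meng Qian); 《计算机应用与软件》 (Computer Applications and Software); 2017-04-15 (No. 04); full text *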

Also Published As

Publication number Publication date
CN111640467A (en) 2020-09-08


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant