CN111640467A - DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence - Google Patents

Info

Publication number
CN111640467A
Authority
CN
China
Prior art keywords
num
row
data
compression
mass fraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010446416.1A
Other languages
Chinese (zh)
Other versions
CN111640467B (en)
Inventor
牛毅
马明明
李甫
田英轩
石光明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN202010446416.1A
Publication of CN111640467A
Application granted
Publication of CN111640467B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/50 Compression of genetic data

Abstract

The invention provides a lossless compression method for DNA sequencing quality scores based on an adaptive coding order. It mainly addresses the low compression ratio of existing quality score compression methods, which stems from prediction models that are not accurate enough. The implementation scheme is as follows: 1) extract the quality score data and the base data from the FASTQ file into two coding compression blocks P_1 and P_2, respectively; 2) calculate the mean quality score of each row of the first coding compression block P_1 and quantize it to obtain an M × 1 row mean matrix F; 3) collect the context information, base information and row mean information of the characters to be coded; 4) set two identifiers C and D and uniformly quantize the information collected in step 3) to construct a coding model; 5) drive an adaptive arithmetic coder with the coding model, and traverse and compress the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation. The invention improves compression efficiency and can be used for storing and transmitting gene data.

Description

DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence
Technical Field
The invention belongs to the technical field of data compression, and particularly relates to a lossless compression method for DNA sequencing quality scores, which can be used for compressing biological gene sequencing data.
Background
Sequencing has gradually become a widely used technique in biological research. Acquiring the genetic information of different organisms helps improve our understanding of the living world. With the rapid development of next-generation high-throughput sequencing (NGS), sequencing companies represented by Illumina keep developing new sequencing technologies, so sequencing costs have fallen rapidly: the price of human whole genome sequencing (WGS) has dropped to 1000 dollars or even lower, and it is still falling faster than Moore's law would predict. As a result, the volume of next-generation sequencing data is expected to exceed that of astronomical data, and the overhead of storing and transmitting it keeps growing. Reducing the size of gene sequencing data through data compression, and thereby the storage and transmission costs, is therefore of great significance. Research on gene compression tools has produced many results, but no existing scheme reduces the code stream from the perspective of coding order, so compression efficiency still has room for improvement.
Next-generation sequencers generate thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information that sequencing produces. Each short read contains three parts: first, metadata describing information such as the sequencing platform; second, the DNA base sequence, which records the DNA fragment obtained in the current short read; third, the quality scores, which indicate the measurement reliability of each symbol in the corresponding DNA base sequence. Quality score data in the FASTQ format is highly random and noisy, depends on factors such as the sequencing instrument and the sequencing method, and generally comprises dozens of distinct characters. It is difficult to compress and generally accounts for about 70% of a compressed file, so the compression result for the quality score data plays a key role in the compression of FASTQ data as a whole.
At present, typical methods for lossless compression of quality scores in gene sequencing data mainly include the following:
The first uses existing general-purpose text compression tools, such as Gzip and 7z, which are the most common way to compress FASTQ files. These tools are designed for ordinary character sequences and do not take the particular characteristics of quality score data into account, so they compress gene sequencing data poorly.
The second comprises improved run-length and dictionary methods developed for gene data compression. In most cases these methods compress worse than entropy coding methods and cannot minimize the compressed size.
The third comprises compression algorithms designed for quality scores, such as Quip, which use a high-order Markov model for predictive coding of the quality scores. Although they compress well, they occupy a large amount of storage, their prediction models are too complex to compute, and they do not consider the influence of coding order on compression, so compression time is long and the algorithms are not robust.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a lossless compression method for DNA sequencing quality scores based on an adaptive coding order, so as to maximize the compression effect without increasing the compression time.
The technical scheme of the invention is as follows: first, the base sequences and quality score sequences are extracted from a FASTQ file; then the mean of each row of quality score data is calculated and quantized, and a prediction model is constructed from the context information, mean information and base information; finally, an arithmetic coder is driven to code the sequence in a serpentine coding order, thereby compressing the quality scores. The concrete implementation is as follows:
(1) extract the quality score data and base data from the FASTQ file:
(1a) statistically analyze the characteristics of the DNA sequencing data and create two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of quality score rows processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
(1b) extract the quality score data and the base data stored in the FASTQ file into the first coding compression block P_1 and the second coding compression block P_2, respectively;
(2) calculate the mean quality score of each row of the first coding compression block P_1 and quantize it to obtain an M × 1 row mean matrix F;
(3) collect the context information, base information and row mean information of the character to be coded, quantize them uniformly, and calculate the final coding model:
(3a) establish a model for the current character q to be coded: take the four preceding characters q_1, q_2, q_3 and q_4; take the base information in the second coding compression block P_2 corresponding to the current character and to the previous character, denoted j_1 and j_2; and take the mean of the row containing q from the row mean matrix F, denoted f, where f is a quantized value; for edge characters lacking this information, q_1, q_2, q_3 and q_4 take the same symbol or are set to zero;
(3b) reduce the model cost by quantizing the whole model: denote the larger of the first two characters q_1 and q_2 as A and the larger of the last two characters q_3 and q_4 as B, create two identifiers C and D, and calculate the final coding model of the current symbol:
P_now = A · B · C · D · j_1 · j_2 · f
where C = 1 when q_1 = q_2 and C = 0 otherwise; D = 1 when q_3 = q_4 and D = 0 otherwise; P_now is the probability estimate for the current symbol;
(4) drive the adaptive arithmetic coder with the designed final coding model, and traverse and compress the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
Compared with the prior art, the invention has the following advantages:
1. The invention makes full use of the probability update mechanism of the arithmetic coder, so that its compression ratio for the quality score data in FASTQ files of the same length is superior to that of all current algorithms.
2. The invention compresses the mean quality score of each row together with the quality score data, which facilitates statistics on and access to the means in downstream processing.
3. Because the designed coding model has a simple structure, the invention is highly portable, is easy to optimize and to integrate into the compression of a whole FASTQ file, can be widely applied in compression schemes that use this module, and has good extensibility.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention;
FIG. 2 is a schematic diagram of the quantization of the quality score row means in the present invention;
FIG. 3 is a schematic diagram of a serpentine scan sequence used in the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
Referring to fig. 1, the implementation steps of the invention are as follows:
Step 1, extract the quality score data and base data from the FASTQ file.
Gene sequencing can produce thousands of short reads, which are typically stored in the widely accepted text-based FASTQ format, containing all the information produced by sequencing. In the FASTQ file format, each short read consists of four rows separated by line breaks, where:
the first row starts with the '@' character, followed by a unique sequence identifier and an optional sequence description, with the identifier separated from the description by a space;
the second row is the nucleotide sequence, i.e. the base data, consisting only of the five characters { 'A', 'T', 'C', 'G', 'N' }, where the character 'N' represents an ambiguous base;
the third row starts with the '+' character, optionally followed by the sequence identifier and description again, and acts as a separator;
the last row is the quality score row, in which each character gives the quality of the base at the corresponding position in the second row. The quality score corresponds to the number Q = -10 log10 P, where P is the probability that the corresponding base call in the read is erroneous. Quality scores are typically represented by ASCII characters in the range [33:73] or [64:104], and are used both for quality control of the raw data and for downstream processing.
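The four-row layout and the relation Q = -10 log10 P can be illustrated with a minimal Python sketch. The names FastqRecord, parse_fastq and error_probability are hypothetical and not part of the patent, and the default ASCII offset of 33 is an assumption corresponding to the [33:73] range (use 64 for the [64:104] range).

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    header: str  # row 1 without the leading '@'
    bases: str   # row 2: characters from {A, T, C, G, N}
    quals: str   # row 4: one quality character per base


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield one record for every complete group of four rows."""
    for i in range(0, len(lines) - 3, 4):
        header, bases, plus, quals = (s.rstrip("\n") for s in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at row {i + 1}")
        if len(bases) != len(quals):
            raise ValueError(f"base/quality length mismatch at row {i + 1}")
        yield FastqRecord(header[1:], bases, quals)


def error_probability(qual_char: str, offset: int = 33) -> float:
    """Invert Q = -10 * log10(P) for a single quality character."""
    q = ord(qual_char) - offset  # assumed offset; 64 for the other range
    return 10.0 ** (-q / 10.0)
```

With offset 33, the character 'I' (ASCII 73) decodes to Q = 40, i.e. an error probability of about 1 in 10000.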
The specific implementation of this step is as follows:
1.1) statistically analyze the characteristics of the DNA sequencing data and create two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of quality score rows processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
1.2) extract the quality score data and the base data stored in the FASTQ file into the first coding compression block P_1 and the second coding compression block P_2, respectively;
Since most FASTQ files contain fewer than 40 distinct quality score characters and adjacent values do not jump sharply, a prediction model that exploits the correlation in the data can be designed to improve the compression effect. At the same time, using too many correlated characters not only increases time and computational complexity but in some cases also raises the model cost, so compression blocks of an appropriate size are needed to collect the correlation statistics among quality scores. Within the limits of the available computing resources, a larger compression block gives better compression; so as not to exceed the maximum memory, this embodiment uses a compression block of 2000000 × 160. The total number of models is set to 40 × 40 × 40 × 16. During actual compression, one compression block of data is processed at a time until the end of the file.
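Block-wise extraction as in step 1.2) might look like the following sketch. It assumes the FASTQ file has already been read into a list of rows; build_blocks and max_rows (standing in for M) are illustrative names, and rows are kept as Python strings rather than a fixed M × N array.

```python
from typing import Iterator, List, Tuple


def build_blocks(lines: List[str], max_rows: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (P1, P2) pairs holding at most max_rows reads each.

    P1 collects the quality score rows (row 4 of each read),
    P2 collects the base rows (row 2 of each read).
    """
    p1: List[str] = []
    p2: List[str] = []
    for i in range(0, len(lines) - 3, 4):
        p2.append(lines[i + 1].rstrip("\n"))
        p1.append(lines[i + 3].rstrip("\n"))
        if len(p1) == max_rows:
            yield p1, p2
            p1, p2 = [], []
    if p1:  # last, possibly shorter block at the end of the file
        yield p1, p2
```

In the embodiment max_rows would be 2000000; a small value is enough to see that the final block may contain fewer rows than the others.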
Step 2, calculate the mean quality score of each row of the first coding compression block P_1 and quantize it to obtain the M × 1 row mean matrix F.
2.1) average each row of the first coding compression block P_1 of size M × N: the N quality score values of each row are added and divided by N to obtain the mean quality score of that row;
2.2) quantize the mean quality score of each row and store it:
Referring to fig. 2, after the mean quality score of each row has been computed, the means are clustered according to their distribution: mean values that occur frequently are subdivided, while low mean values that occur rarely are merged, which improves coding efficiency. For a specific file, a dedicated quantization scheme could be designed from the distribution of the means to achieve the best effect, but this increases the amount of computation and adds considerable computing time. This embodiment therefore selects a quantization scheme that is extensible and easy to implement: every two adjacent mean values are treated as one case, and all low, rarely occurring mean values are treated as a single case. Based on experience, the quantization rules are as follows:
if f_i < (num-15), then f_i = (num-15);
if (num-15) ≤ f_i < (num-13), then f_i = (num-13);
if (num-13) ≤ f_i < (num-11), then f_i = (num-11);
if (num-11) ≤ f_i < (num-9), then f_i = (num-9);
if (num-9) ≤ f_i < (num-7), then f_i = (num-7);
if (num-7) ≤ f_i, then f_i = (num-6);
where num is the total number of coded symbols, 40, f_i is the mean of the current row, and i ∈ [1, M];
The quantized means f_i of all rows are arranged as a column to obtain the M × 1 row mean matrix F.
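The quantization rules of step 2.2) translate directly into a small Python sketch. The function names are hypothetical, rows are assumed to be non-empty, and it is an assumption here that quality values are obtained by subtracting an ASCII offset of 33 before averaging.

```python
from typing import List


def quantize_row_mean(mean: float, num: int = 40) -> int:
    """Map a row mean onto one of six levels following the rules above."""
    if mean < num - 15:
        return num - 15
    # Each later band [num - prev, num - k) maps to its upper bound num - k.
    for k in (13, 11, 9, 7):
        if mean < num - k:
            return num - k
    return num - 6  # all means at or above num - 7


def row_mean_matrix(p1: List[str], offset: int = 33) -> List[int]:
    """Return the quantized mean of every row of P1 as a list of length M."""
    means = (sum(ord(c) - offset for c in row) / len(row) for row in p1)
    return [quantize_row_mean(m) for m in means]
```

With num = 40 the six levels are 25, 27, 29, 31, 33 and 34, so F needs only a few distinct symbols per row.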
Step 3, collect the context information, base information and row mean information of the character to be coded, quantize them uniformly, and calculate the final coding model.
3.1) establish a model for the current character q to be coded: take the four preceding characters q_1, q_2, q_3 and q_4; take the base information in P_2 corresponding to the current character and to the previous character, denoted j_1 and j_2; and denote the mean of the row containing q as f, where f is a quantized value; for edge characters lacking this information, q_1, q_2, q_3 and q_4 may take the same symbol or be set to zero.
For example, suppose the first coding compression block P_1 contains E, F, G, H, I and the second coding compression block P_2 contains A, T, C, G, G;
when a coding model is built for the third character G in the first coding compression block P_1, the four preceding characters take the values q_1 = F, q_2 = E, q_3 = 0, q_4 = 0; the base information for the current character and the previous character is j_1 = C, j_2 = T; and the row mean is mean(ASCII(E) + ASCII(F) + ASCII(G) + ASCII(H) + ASCII(I));
when a coding model is built for the fifth character I in P_1, the four preceding characters take the values q_1 = H, q_2 = G, q_3 = F, q_4 = E; the base information for the current character and the previous character is j_1 = G, j_2 = G; and the row mean is mean(ASCII(E) + ASCII(F) + ASCII(G) + ASCII(H) + ASCII(I));
it can thus be seen that the model can be built in the same way for edge characters lacking the above information.
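The worked example can be reproduced with a short Python sketch. The function name context_at is hypothetical; the string "0" is used as a stand-in for the zero padding of missing characters, and padding j_2 at the first position is an assumption, since the text only specifies padding for q_1 to q_4.

```python
from typing import List, Tuple


def context_at(quals: str, bases: str, pos: int, pad: str = "0") -> Tuple[List[str], str, str]:
    """Return ([q1, q2, q3, q4], j1, j2) for the character at index pos.

    q1 is the nearest preceding quality character and q4 the farthest;
    positions before the start of the row are filled with pad.
    """
    prev = [quals[pos - k] if pos - k >= 0 else pad for k in range(1, 5)]
    j1 = bases[pos]                           # base of the current character
    j2 = bases[pos - 1] if pos >= 1 else pad  # base of the previous character
    return prev, j1, j2


# Rows taken from the example: P1 = E, F, G, H, I and P2 = A, T, C, G, G.
third = context_at("EFGHI", "ATCGG", 2)
fifth = context_at("EFGHI", "ATCGG", 4)
```

Here third evaluates to (['F', 'E', '0', '0'], 'C', 'T') and fifth to (['H', 'G', 'F', 'E'], 'G', 'G'), matching the values listed in the example.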
3.2) Since the total number of models is limited in practice, the model cost is reduced by quantizing the whole model: the larger of q_1 and q_2 is denoted A and the larger of q_3 and q_4 is denoted B, and two identifiers C and D are created, where C indicates whether q_1 and q_2 are equal and D indicates whether q_3 and q_4 are equal. The model finally selected for the current symbol is therefore: P_now = A · B · C · D · j_1 · j_2 · f,
where P_now is the probability estimate for the current symbol.
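One way to read the expression for P_now is that the seven quantities together select a context, and each context keeps its own adaptive symbol statistics. The Python sketch below follows that reading, which is an assumption rather than something the text states explicitly. The names context_key and AdaptiveContextModel are hypothetical, quality values are assumed to be integers in [0, 40), and the count-based estimate with all counts starting at 1 is a common choice for adaptive coders, not necessarily the one used in the invention.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def context_key(q1: int, q2: int, q3: int, q4: int,
                j1: str, j2: str, f: int) -> Tuple:
    """Combine the quantized quantities A, B, C, D, j1, j2 and f into one key."""
    a = max(q1, q2)
    b = max(q3, q4)
    c = 1 if q1 == q2 else 0
    d = 1 if q3 == q4 else 0
    return (a, b, c, d, j1, j2, f)


class AdaptiveContextModel:
    """Per-context symbol counts that are updated after every coded symbol."""

    def __init__(self, alphabet_size: int = 40) -> None:
        # Every count starts at 1 so that no symbol has zero probability.
        self.counts: Dict[Hashable, List[int]] = defaultdict(
            lambda: [1] * alphabet_size)

    def probability(self, key: Hashable, symbol: int) -> float:
        table = self.counts[key]
        return table[symbol] / sum(table)

    def update(self, key: Hashable, symbol: int) -> None:
        self.counts[key][symbol] += 1
```

An arithmetic coder would query probability before coding each symbol and call update afterwards, so that encoder and decoder keep identical statistics.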
Step 4, drive the adaptive arithmetic coder with the designed final coding model, and traverse and compress the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
4.1) the final coding model yields a more accurate probability estimate P_now for the current character, which is sent to the adaptive arithmetic coder as the optimal prediction result;
4.2) the coder traverses and compresses the block:
During coding, the characters are scanned and coded one by one. The conventional scan traverses row by row: when a whole row has been traversed, scanning continues from the start of the next row. This embodiment instead scans column by column: after a column has been coded, scanning continues from the end of the next column and proceeds in the reverse direction, and so on, so that the overall path is serpentine, as shown in fig. 3. Once all characters have been traversed and coded, the lossless compression is complete.
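The serpentine scan can be sketched as a generator of (row, column) positions. The function name serpentine_order is hypothetical, and the sketch assumes that the first column is scanned from top to bottom.

```python
from typing import Iterator, Tuple


def serpentine_order(m: int, n: int) -> Iterator[Tuple[int, int]]:
    """Visit an m x n block column by column, reversing direction each column."""
    for col in range(n):
        # Even columns run top to bottom, odd columns bottom to top.
        rows = range(m) if col % 2 == 0 else range(m - 1, -1, -1)
        for row in rows:
            yield row, col


order = list(serpentine_order(3, 2))
```

For a 3 × 2 block, order is [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]: each position is adjacent to the previous one, so the scan never jumps back to the top of the block between columns.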
The above description is only one specific example of the present invention and does not constitute any limitation of the present invention. It will be apparent to persons skilled in the relevant art that various modifications and changes in form and detail can be made therein without departing from the principles and arrangements of the invention, but these modifications and changes are still within the scope of the invention as defined in the appended claims.

Claims (5)

1. A lossless compression method for DNA sequencing quality scores based on an adaptive coding order, characterized by comprising the following steps:
(1) extracting the quality score data and base data from the FASTQ file:
(1a) statistically analyzing the characteristics of the DNA sequencing data and creating two coding compression blocks P_1 and P_2 of size M × N, where M is the number of rows of a compression block, i.e. the number of quality score rows processed at one time, N is the number of columns of a compression block, i.e. the length of a quality score row, and N ≤ 150;
(1b) extracting the quality score data and the base data stored in the FASTQ file into the first coding compression block P_1 and the second coding compression block P_2, respectively;
(2) calculating the mean quality score of each row of the first coding compression block P_1 and quantizing it to obtain an M × 1 row mean matrix F;
(3) collecting the context information, base information and row mean information of the character to be coded, quantizing them uniformly, and calculating the final coding model:
(3a) establishing a model for the current character q to be coded: taking the four preceding characters q_1, q_2, q_3 and q_4; taking the base information in the second coding compression block P_2 corresponding to the current character and to the previous character, denoted j_1 and j_2; and taking the mean of the row containing q from the row mean matrix F, denoted f, where f is a quantized value; for edge characters lacking this information, q_1, q_2, q_3 and q_4 take the same symbol or are set to zero;
(3b) reducing the model cost by quantizing the whole model: denoting the larger of the first two characters q_1 and q_2 as A and the larger of the last two characters q_3 and q_4 as B, creating two identifiers C and D, and calculating the final coding model of the current symbol:
P_now = A · B · C · D · j_1 · j_2 · f
where C = 1 when q_1 = q_2 and C = 0 otherwise; D = 1 when q_3 = q_4 and D = 0 otherwise; P_now is the probability estimate for the current symbol;
(4) driving the adaptive arithmetic coder with the designed final coding model, and traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation.
2. The method according to claim 1, wherein the characteristics of the DNA sequencing data in (1a) are that the data contains thousands of reads, each read has four rows, of which the second row is the base data and the fourth row is the quality score data, all DNA sequencing quality score data is encoded in ASCII, and the number of distinct coded symbols equals the maximum quality score value minus the minimum quality score value plus 1.
3. The method of claim 1, wherein the mean quality score of each row of the first coding compression block P_1 in (2) is quantized as follows:
if f_i < (num-15), then f_i = (num-15);
if (num-15) ≤ f_i < (num-13), then f_i = (num-13);
if (num-13) ≤ f_i < (num-11), then f_i = (num-11);
if (num-11) ≤ f_i < (num-9), then f_i = (num-9);
if (num-9) ≤ f_i < (num-7), then f_i = (num-7);
if (num-7) ≤ f_i, then f_i = (num-6);
where f_i is the mean quality score of each row, num is the total number of coded symbols, equal to 40, and i ∈ [1, M]; quantizing f_i for every row yields the M × 1 row mean matrix F.
4. The method of claim 1, wherein driving the adaptive arithmetic coder with the designed final coding model in (4) means sending the probability estimate P_now of the current symbol to the adaptive arithmetic coder as the optimal prediction result.
5. The method according to claim 1, wherein traversing and compressing the first coding compression block P_1 in a serpentine coding order along the direction of strongest correlation in (4) means scanning the first coding compression block P_1 column by column from top to bottom, traversing the next column in reverse from bottom to top after a column has been scanned, and continuing back and forth in this way until the whole compression block has been traversed.
CN202010446416.1A 2020-05-25 2020-05-25 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence Active CN111640467B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010446416.1A CN111640467B (en) 2020-05-25 2020-05-25 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010446416.1A CN111640467B (en) 2020-05-25 2020-05-25 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence

Publications (2)

Publication Number Publication Date
CN111640467A (en) 2020-09-08
CN111640467B (en) 2023-03-24

Family

ID=72332834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010446416.1A Active CN111640467B (en) 2020-05-25 2020-05-25 DNA sequencing quality fraction lossless compression method based on self-adaptive coding sequence

Country Status (1)

Country Link
CN (1) CN111640467B (en)

Also Published As

Publication number Publication date
CN111640467B (en) 2023-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant