CN109979537B - Multi-sequence-oriented gene sequence data compression method - Google Patents

Info

Publication number
CN109979537B
CN109979537B CN201910197033.2A CN201910197033A CN109979537B CN 109979537 B CN109979537 B CN 109979537B CN 201910197033 A CN201910197033 A CN 201910197033A CN 109979537 B CN109979537 B CN 109979537B
Authority
CN
China
Prior art keywords
sequence
compressed
gene
gene sequence
reference sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910197033.2A
Other languages
Chinese (zh)
Other versions
CN109979537A (en)
Inventor
季一木
李可
尧海昌
刘尚东
王汝传
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Lemote Information Technology Co ltd
Nanjing University of Posts and Telecommunications
Original Assignee
Jiangsu Lemote Information Technology Co ltd
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Lemote Information Technology Co ltd and Nanjing University of Posts and Telecommunications
Priority to CN201910197033.2A
Publication of CN109979537A
Application granted
Publication of CN109979537B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a multi-sequence-oriented gene sequence data compression method, intended mainly to address the excessive volume of gene data and to reduce the cost of storing and transmitting it. First, a reference sequence is selected from the gene sequences to be compressed; the reference sequence and the non-reference sequences are then compressed in different ways. Each non-reference sequence is XORed with the reference sequence, the result is divided into a matrix and matrix-coded, and the sequence is finally stored as a series of (row, column) two-tuples. The reference sequence is compressed separately using a k-mer algorithm. The method achieves a high compression ratio and a high compression speed, and because the two-tuple codes do not depend on the order of the gene sequence, it lends itself to distributed storage and analysis of gene sequences.

Description

Multi-sequence-oriented gene sequence data compression method
Technical Field
The invention relates to the technical field of gene sequence compression in the field of big data, in particular to a method for compressing gene sequence data for multiple sequences.
Background
Genes are DNA fragments that carry genetic effects and are closely tied to life. Research on gene data yields deeper insight into the mechanisms of life and disease, plays an increasingly important role in biomedicine and related biotechnology industries, and, through the study of human genes, helps advance precision medicine and address one of the three major livelihood issues. Gene data has therefore attracted wide international attention for its social and scientific value. Since the international Human Genome Project formally started in 1990, continuous progress in gene sequencing technology has steadily lowered sequencing cost and raised sequencing speed, and many countries and organizations have launched genome projects. On December 28, 2017, China launched its 100,000-person genome project, the first major national program in human genome research in China and, at that time, the largest human genome project in the world. As sequencing projects expand, the amount of sequence data generated grows exponentially and is expected to grow even faster in the future. The growth rate of gene data far exceeds that of storage capacity and transmission bandwidth, putting great pressure on both. Storing gene data more efficiently and relieving this pressure is therefore important for gene research and its applications.
DNA sequence data differs from other data: it is a very long sequence made up of only the four symbols A, G, C, and T, so its alphabet is simple but its length is large. The function of a large part of the DNA sequence has not yet been determined, and any loss during data compression could cause immeasurable damage, so DNA sequences must be compressed losslessly. In addition, the arrangement of bases in a DNA sequence is not random; it follows specific probability distributions and regularities. DNA sequences are also highly similar to one another. First, DNA sequences of different species are highly similar, and sequences within the same species even more so. Second, many exact repeats occur between different fragments of DNA within the same organism. Drawing on these characteristics, industry and academia have proposed a number of DNA sequence compression methods. A search of the prior-art literature shows that in 2000, T Matsumoto and K Sadakane, in "Biological sequence compression algorithms" in Genome Informatics, proposed the CTW + LZ method, which combines Context Tree Weighting (CTW) with LZ compression and compresses different fragments of a DNA sequence using several coding models. In 2002, "DNACompress: fast and effective DNA sequence compression" proposed the DNACompress method, which uses the PatternHunter tool to search for repeated and approximately repeated segments of a DNA sequence, improving the overall speed of the method. In 2005, G Korodi and I Tabus, in "An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression" in ACM Transactions on Information Systems, proposed the GeNML method, which compresses DNA fragments with different data characteristics using different coding strategies and probability models.
In 2013, Sebastian Wandelt and Ulf Leser, in "FRESCO: Referential Compression of Highly Similar Sequences", proposed a fast gene compression method called FRESCO, which expresses the gene to be compressed in terms of a reference gene. In 2015, Xiaojing Xie, Shuigeng Zhou and Jihong Guan, in "CoGI: Towards Compressing Genomes as an Image" in IEEE/ACM Transactions on Computational Biology and Bioinformatics, proposed representing gene data as an image so that it can be compressed with image compression techniques. These DNA sequence compression methods fall into two broad categories, reference-free compression and reference-based compression, and both have effectively improved compression ratio and compression efficiency. However, these methods do not fully exploit the bioinformatic characteristics of the gene fragments, the detailed repetition within fragments, or the characteristics shared between gene sequences, so the compression ratio and compression efficiency for DNA sequences remain low.
Disclosure of Invention
Purpose of the invention: to overcome the shortcomings of the prior art, the invention provides a multi-sequence-oriented gene sequence data compression method that can significantly improve compression efficiency and enable efficient storage.
Technical scheme: to achieve the above technical effect, the invention provides the following technical scheme:
A method for compressing gene sequence data for multiple sequences comprises the following steps:
(1) selecting a reference sequence: let n be the number of gene sequences to be compressed; map each gene sequence to be compressed, through digitization, to a high-dimensional numeric vector in Euclidean space; then compute, for each gene sequence to be compressed, the sum of its Euclidean distances to the other n-1 gene sequences to be compressed, and take the gene sequence with the smallest sum as the reference sequence;
(2) aligning each gene sequence to be compressed with the reference sequence: traverse the gene sequence to be compressed from beginning to end, comparing each position with the corresponding position in the reference sequence; when a portion is encountered that the gene sequence to be compressed has in excess of the reference sequence, delete that portion from the gene sequence to be compressed and attach it separately to the tail of the gene sequence to be compressed as a quadruple (T, P, L, C), where T is the attachment type (deletion or addition), P is the position of the deleted content in the gene sequence to be compressed, L is the length of the deleted content, and C is the deleted content; when a portion is encountered that the gene sequence to be compressed lacks relative to the reference sequence, fill in the missing portion at the corresponding position in the gene sequence to be compressed, and attach a record of the filled-in content to the tail of the gene sequence to be compressed as a triple (T, P, L);
(3) compressing the reference sequence separately using a segmented coding method based on a k-mer algorithm;
(4) binary-coding all gene sequences, including the reference sequence, so that every gene sequence is converted into a binary bit sequence;
(5) XORing each gene sequence to be compressed with the reference sequence, after which the parts of the gene sequence to be compressed that are identical to the reference sequence are all 0 and the bits that differ are 1;
(6) matrixing the gene sequences to be compressed after the XOR with the reference sequence: let l be the binary length of the reference sequence and select a suitable w; divide each gene sequence to be compressed into equal sections of width w and place the sections into a two-dimensional matrix, where the first section of the first gene sequence to be compressed is the first row of the matrix, the first section of the second gene sequence to be compressed is the second row, and so on, until the first section of the (n-1)th gene sequence to be compressed is the (n-1)th row; the second section of the first gene sequence to be compressed is then the nth row, the second section of the second gene sequence to be compressed is the (n+1)th row, and so on, until all n-1 sequences have been entered into the matrix; all gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w;
(7) coding the two-dimensional matrix obtained in step (6): divide the elements equal to 1 in the two-dimensional matrix into a plurality of sub-matrices whose elements are all 1, and code each sub-matrix as follows: for a sub-matrix with only one element, set that element to -1; for a sub-matrix with 2 or more elements, set the upper-left element to 1, the lower-right element to 2, and the remaining elements to 0;
(8) storing the gene sequences to be compressed after the processing of step (7): record, in order, the row number and column number of each element of the two-dimensional matrix whose value is 1, 2, or -1 as a two-tuple (doublet), thereby converting the information of the gene sequences to be compressed into two-tuple information; entropy-code the elements of the doublets using variable-length coding: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.
Further, in step (3), compressing the reference sequence separately using the segmented coding method based on a k-mer algorithm specifically comprises: first dividing the reference sequence into equal-length segments of length m; then selecting a suitable value of k, searching each segment for the k-mer with the highest repetition rate, and recording its total number of repetitions and the position of each repetition within the segment; and then performing segmented coding to replace the k-mer subsequences that appear repeatedly in the sequence.
Advantageous effects: compared with the prior art, the invention has the following advantages:
The multi-sequence-oriented gene sequence data compression method provided by the invention converts the gene sequence into two-tuple form, turning a gene sequence whose order must be strictly preserved into a set of records that is independent of order, which helps improve the efficiency of gene compression and analysis through distributed storage and computation.
Drawings
FIG. 1 is a flow chart of the multi-sequence-oriented gene sequence data compression method according to the present invention;
FIG. 2 is a flow chart of reference sequence selection;
FIG. 3 is a schematic diagram of the alignment of a non-reference sequence to the reference sequence;
FIG. 4 is a schematic diagram of the gene sequence matrixing scheme;
FIG. 5 is a flow chart of the k-mer algorithm.
Detailed Description
The present invention will be further described with reference to the accompanying drawings.
The invention provides a multi-sequence-oriented gene sequence data compression method, intended mainly to address the excessive volume of gene sequence data and the resulting high storage and transmission costs. A reference sequence is first selected from the gene sequences to be compressed; the non-reference sequences are then compressed against it, and finally the reference sequence is compressed separately. FIG. 1 is a flow chart of the method, which comprises the following steps:
Step one, selecting a reference sequence: the reference sequence selection process is shown in FIG. 2. Given several gene sequences to be compressed, one of them must be chosen as the reference sequence, and the quality of this choice is an important factor in the gene compression ratio. Let n be the number of gene sequences to be compressed. Each gene sequence to be compressed is mapped, through digitization, to a high-dimensional numeric vector in Euclidean space; for each gene sequence, the sum of its Euclidean distances to the other n-1 gene sequences is then computed, and the gene sequence with the smallest sum is taken as the reference sequence.
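A minimal Python sketch of the reference selection in step one follows. The patent does not say how bases are digitized, so the mapping `BASE_VALUE`, the zero padding of shorter sequences, and the function names are illustrative assumptions rather than the patented procedure.

```python
import math

# Hypothetical numeric mapping; the source does not specify the digitization.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_vector(seq, length):
    """Map a sequence to a numeric vector, zero-padded to `length` (assumption)."""
    vec = [BASE_VALUE[base] for base in seq]
    return vec + [0] * (length - len(vec))


def select_reference(seqs):
    """Return the index of the sequence with the smallest summed Euclidean distance."""
    length = max(len(s) for s in seqs)
    vecs = [to_vector(s, length) for s in seqs]
    # Distance of a vector to itself is 0, so including it does not change the sum.
    totals = [sum(math.dist(v, u) for u in vecs) for v in vecs]
    return min(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    print(select_reference(["ACGT", "ACGA", "ACGG"]))
```

Computing all pairwise distances costs on the order of n² vector comparisons, which may matter for large collections; the sketch makes no attempt to reduce that cost.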
Step two, aligning the sequence to be compressed with the reference sequence: the alignment procedure is shown in FIG. 3. Portions that the sequence to be compressed has in addition to the reference sequence are temporarily deleted, and the deleted information is attached separately to the tail of the sequence as a quadruple <attachment Type (addition or deletion), Position in the sequence, Length, Content>. Portions that the sequence to be compressed lacks relative to the reference sequence are temporarily filled in with the corresponding content of the reference sequence, and a record is attached to the tail of the sequence as a triple <attachment Type (addition or deletion), Position in the sequence, Length>.
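The alignment in step two can be sketched as below. The patent does not name an alignment algorithm, so this example swaps in Python's standard `difflib.SequenceMatcher` to find the differing regions. Recording positions in the coordinates of the aligned sequence, and the names `align_to_reference` and `restore`, are assumptions made so that the example can be inverted.

```python
from difflib import SequenceMatcher


def align_to_reference(ref, seq):
    """Return (aligned, records); `aligned` has the same length as `ref`.

    Records use positions in the aligned sequence (an assumption):
      ("add", P, L, C): seq had extra content C, removed at position P
      ("del", P, L):    seq lacked L bases at P, filled in from ref
    """
    aligned, records = [], []
    size = 0  # current length of the aligned sequence
    matcher = SequenceMatcher(None, ref, seq, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        common = min(i2 - i1, j2 - j1)
        aligned.append(seq[j1:j1 + common])  # matched or substituted bases
        size += common
        if j2 - j1 > common:                 # extra content in seq
            extra = seq[j1 + common:j2]
            records.append(("add", size, len(extra), extra))
        elif i2 - i1 > common:               # content missing from seq
            fill = ref[i1 + common:i2]
            records.append(("del", size, len(fill)))
            aligned.append(fill)
            size += len(fill)
    return "".join(aligned), records


def restore(aligned, records):
    """Undo align_to_reference using the attached records."""
    out, cursor = [], 0
    for rec in records:  # records are emitted in increasing position order
        kind, pos, length = rec[0], rec[1], rec[2]
        out.append(aligned[cursor:pos])
        cursor = pos
        if kind == "add":
            out.append(rec[3])      # put the removed content back
        else:
            cursor = pos + length   # skip the bases borrowed from ref
    out.append(aligned[cursor:])
    return "".join(out)
```

After this step every aligned sequence has the same length as the reference, which is what the later XOR and matrixing steps rely on.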
Step three, compressing and storing the reference sequence separately: the reference sequence is compressed on its own using a segmented coding method based on a k-mer algorithm. A k-mer is a short sequence fragment consisting of k contiguous bases in a DNA sequence; when k takes a suitable value, the k-mer frequency distribution of a DNA sequence contains all the information of the genome and thus constitutes an equivalent representation of the sequence. The DNA sequence is first divided into equal-length fragments of length m. A suitable value of k is then selected, each fragment is searched for the k-mer with the highest repetition rate, and the total number of repetitions of that k-mer and the position of each repetition within the fragment are recorded. Segmented coding is then applied to replace the k-mer subsequences that appear repeatedly in the sequence. In this embodiment, the DNA sequence is divided into equal-length fragments of length 64 and k is set to 3; that is, each 64-base sub-fragment is searched for the 3-mer with the highest repetition rate, and its total number of repetitions and the position of each repetition within the fragment are recorded. Coding is then performed in the form of a triple <subsequence number No., 3-mer Type, distance increment vector d = (d1, d2, d3, ..., dn) of all occurrences of the 3-mer in the subsequence>, which replaces the k-mer subsequences that appear repeatedly in the sequence, as shown in FIG. 5.
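One way to read the segmented k-mer coding of step three is sketched below in Python. The source describes the triple <No., Type, d> but not how the remaining bases are kept, so storing a `residual` string alongside the triple, measuring the first increment from the start of the segment, and the function names are assumptions added to make the sketch lossless.

```python
from collections import Counter


def encode_segment(no, segment, k=3):
    """Encode one segment as (no, kmer, deltas, residual).

    deltas are distance increments between successive non-overlapping
    occurrences of the most frequent k-mer; the first is measured from
    the start of the segment. residual is the segment without them.
    """
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    if not counts:                       # segment shorter than k
        return (no, "", [], segment)
    kmer = counts.most_common(1)[0][0]
    positions, start = [], segment.find(kmer)
    while start != -1:                   # left-to-right, non-overlapping
        positions.append(start)
        start = segment.find(kmer, start + k)
    deltas = [p - q for p, q in zip(positions, [0] + positions[:-1])]
    return (no, kmer, deltas, segment.replace(kmer, ""))


def decode_segment(kmer, deltas, residual):
    """Rebuild a segment from its k-mer, increments, and residual bases."""
    out, taken, pos = "", 0, 0
    for d in deltas:
        pos += d                         # absolute start of this occurrence
        need = pos - len(out)            # residual bases that precede it
        out += residual[taken:taken + need] + kmer
        taken += need
    return out + residual[taken:]


def compress_reference(ref, m=64, k=3):
    """Split ref into segments of length m and encode each one."""
    return [encode_segment(no, ref[i:i + m], k)
            for no, i in enumerate(range(0, len(ref), m))]


def decompress_reference(encoded):
    return "".join(decode_segment(kmer, deltas, residual)
                   for _no, kmer, deltas, residual in encoded)
```

For the segment `ACGACGTTACG` the most frequent 3-mer is `ACG` at positions 0, 3, and 8, so the increments are `[0, 3, 5]` and the residual is `TT`.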
Step four, binary-coding all gene sequences, including the reference sequence: a gene sequence consists of the four bases A, G, C, and T, so each base can be represented in binary form. The gene sequence is thereby converted into a binary bit sequence.
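Because there are four bases, two bits per base are enough. The short Python sketch below illustrates the idea; the particular assignment of bit patterns in `CODE` is an assumption, since the source does not state which base maps to which code.

```python
# Assumed 2-bit code per base; the patent does not fix this mapping.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
BASE = {bits: base for base, bits in CODE.items()}


def to_bits(seq):
    """Convert a base sequence into a string of '0'/'1' characters."""
    return "".join(CODE[base] for base in seq)


def from_bits(bits):
    """Convert a bit string produced by to_bits back into bases."""
    return "".join(BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


if __name__ == "__main__":
    print(to_bits("ACGT"))  # 00011011
```

A string of characters is used here only for readability; a practical implementation would more likely pack the bits into bytes or integers.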
Step five, XORing each gene sequence to be compressed with the reference sequence: after the XOR, the parts of the gene sequence to be compressed that are identical to the reference sequence become 0, and the bits that differ become 1.
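The XOR of step five can be illustrated on bit strings as follows. This is a sketch that assumes both inputs are equal-length strings of '0' and '1' characters, as they would be after alignment and binary coding; `xor_bits` is a hypothetical helper name.

```python
def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("bit strings must have the same length")
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


if __name__ == "__main__":
    reference = "00011011"
    sequence = "00011111"
    diff = xor_bits(sequence, reference)
    print(diff)                       # 00000100
    print(xor_bits(diff, reference))  # 00011111, the original sequence
```

XOR is its own inverse, so applying it again with the same reference recovers the original bits, and sequences that are close to the reference yield mostly zeros.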
Step six, as shown in FIG. 4, matrixing the gene sequences to be compressed after the XOR with the reference sequence: let l be the binary length of the reference sequence and select a suitable w; divide each gene sequence to be compressed into equal sections of width w and place the sections into a two-dimensional matrix. The first section of the first gene sequence to be compressed is the first row of the matrix, the first section of the second gene sequence to be compressed is the second row, and so on, until the first section of the (n-1)th gene sequence to be compressed is the (n-1)th row. The second section of the first gene sequence to be compressed is then the nth row, the second section of the second gene sequence to be compressed is the (n+1)th row, and so on, until all n-1 sequences have been entered into the matrix. All gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w.
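The row ordering of step six interleaves the sequences section by section. A small Python sketch, assuming that all XORed bit strings have the same length l and that l is a multiple of w (the source does not say how a remainder would be handled), is:

```python
def to_matrix(xored, w):
    """Arrange n-1 XORed bit strings into a matrix of width w.

    Row order: section 0 of every sequence, then section 1 of every
    sequence, and so on, giving (n-1) * l / w rows in total.
    """
    l = len(xored[0])
    if any(len(bits) != l for bits in xored) or l % w != 0:
        raise ValueError("all inputs must share a length divisible by w")
    rows = []
    for start in range(0, l, w):          # one section at a time
        for bits in xored:                # same section of each sequence
            rows.append([int(ch) for ch in bits[start:start + w]])
    return rows


if __name__ == "__main__":
    for row in to_matrix(["00110000", "00010001"], w=4):
        print(row)
```

Placing the same section of different sequences in adjacent rows means that a difference shared by several sequences at the same position shows up as a vertical run of ones, which the sub-matrix coding of the next step can capture as one block.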
Step seven, coding the two-dimensional matrix obtained in step six: divide the elements equal to 1 in the two-dimensional matrix into a plurality of sub-matrices whose elements are all 1, and code each sub-matrix as follows: for a sub-matrix with only one element, set that element to -1; for a sub-matrix with 2 or more elements, set the upper-left element to 1, the lower-right element to 2, and the remaining elements to 0.
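The source does not say how the all-ones sub-matrices are chosen. The Python sketch below uses a simple greedy strategy as an assumption: scan row by row, and at each uncovered 1 grow a rectangle to the right and then downward. Other partitions would be equally valid under the description.

```python
def encode_matrix(mat):
    """Mark all-ones rectangles: -1 for single cells, 1/2 for corners."""
    rows, cols = len(mat), len(mat[0])
    out = [[0] * cols for _ in range(rows)]
    used = [[False] * cols for _ in range(rows)]

    def free(r, c):
        return mat[r][c] == 1 and not used[r][c]

    for r in range(rows):
        for c in range(cols):
            if not free(r, c):
                continue
            c2 = c
            while c2 + 1 < cols and free(r, c2 + 1):       # grow right
                c2 += 1
            r2 = r
            while r2 + 1 < rows and all(free(r2 + 1, j)    # grow down
                                        for j in range(c, c2 + 1)):
                r2 += 1
            for i in range(r, r2 + 1):                     # mark covered
                for j in range(c, c2 + 1):
                    used[i][j] = True
            if (r2, c2) == (r, c):
                out[r][c] = -1                             # single element
            else:
                out[r][c], out[r2][c2] = 1, 2              # two corners
    return out


if __name__ == "__main__":
    for row in encode_matrix([[1, 1, 0], [1, 1, 0], [0, 0, 1]]):
        print(row)
```

With this coding a block of any size costs at most two marked cells, so long runs of differences are represented compactly.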
Step eight, storing the gene sequences to be compressed after the processing of step seven: record, in order, the row number and column number of each element of the two-dimensional matrix whose value is 1, 2, or -1 as a two-tuple (doublet), thereby converting the information of the gene sequences to be compressed into two-tuple information; entropy-code the elements of the doublets using variable-length coding: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.
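The storage step can be sketched in Python as below. Several details are assumptions because the source leaves them open: doublets are grouped by marker value so the three kinds stay distinguishable, the boundary values 255 and 65535 are assigned to the shorter encoding, bytes are written big-endian, and values above 2^24 - 1 are rejected.

```python
def collect_doublets(encoded):
    """Group (row, column) doublets by marker value, scanning row by row."""
    groups = {1: [], 2: [], -1: []}
    for r, row in enumerate(encoded):
        for c, value in enumerate(row):
            if value in groups:
                groups[value].append((r, c))
    return groups


def encode_number(value):
    """Encode a row or column number in 1, 2, or 3 bytes (big-endian)."""
    if value < 0 or value > 0xFFFFFF:
        raise ValueError("value does not fit in three bytes")
    if value <= 255:
        size = 1
    elif value <= 65535:
        size = 2
    else:
        size = 3
    return value.to_bytes(size, "big")


def serialize(doublets):
    """Concatenate the variable-length encodings of each (row, column)."""
    return b"".join(encode_number(r) + encode_number(c) for r, c in doublets)


if __name__ == "__main__":
    groups = collect_doublets([[0, 0, 1, 2], [0, 0, 0, -1]])
    print(groups)
    print(serialize([(0, 2), (300, 3)]).hex())
```

A decoder would also need to know how many bytes each number occupies; the source does not describe that mechanism, so the sketch covers only the encoding side.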
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (2)

1. A method for compressing gene sequence data for a plurality of sequences, comprising the steps of:
(1) selecting a reference sequence: let n be the number of gene sequences to be compressed; map each gene sequence to be compressed, through digitization, to a high-dimensional numeric vector in Euclidean space; then compute, for each gene sequence to be compressed, the sum of its Euclidean distances to the other n-1 gene sequences to be compressed, and take the gene sequence with the smallest sum as the reference sequence;
(2) aligning each gene sequence to be compressed with the reference sequence: traverse the gene sequence to be compressed from beginning to end, comparing each position with the corresponding position in the reference sequence; when a portion is encountered that the gene sequence to be compressed has in excess of the reference sequence, delete that portion from the gene sequence to be compressed and attach it separately to the tail of the gene sequence to be compressed as a quadruple (T, P, L, C), where T is the attachment type (deletion or addition), P is the position of the deleted content in the gene sequence to be compressed, L is the length of the deleted content, and C is the deleted content; when a portion is encountered that the gene sequence to be compressed lacks relative to the reference sequence, fill in the missing portion at the corresponding position in the gene sequence to be compressed, and attach a record of the filled-in content to the tail of the gene sequence to be compressed as a triple (T, P, L);
(3) compressing the reference sequence separately using a segmented coding method based on a k-mer algorithm;
(4) binary-coding all gene sequences, including the reference sequence, so that every gene sequence is converted into a binary bit sequence;
(5) XORing each gene sequence to be compressed with the reference sequence, after which the parts of the gene sequence to be compressed that are identical to the reference sequence are all 0 and the bits that differ are 1;
(6) matrixing the gene sequences to be compressed after the XOR with the reference sequence: let l be the binary length of the reference sequence and select a suitable w; divide each gene sequence to be compressed into equal sections of width w and place the sections into a two-dimensional matrix, where the first section of the first gene sequence to be compressed is the first row of the matrix, the first section of the second gene sequence to be compressed is the second row, and so on, until the first section of the (n-1)th gene sequence to be compressed is the (n-1)th row; the second section of the first gene sequence to be compressed is then the nth row, the second section of the second gene sequence to be compressed is the (n+1)th row, and so on, until all n-1 sequences have been entered into the matrix; all gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w;
(7) coding the two-dimensional matrix obtained in step (6): divide the elements equal to 1 in the two-dimensional matrix into a plurality of sub-matrices whose elements are all 1, and code each sub-matrix as follows: for a sub-matrix with only one element, set that element to -1; for a sub-matrix with 2 or more elements, set the upper-left element to 1, the lower-right element to 2, and the remaining elements to 0;
(8) storing the gene sequences to be compressed after the processing of step (7): record, in order, the row number and column number of each element of the two-dimensional matrix whose value is 1, 2, or -1 as a two-tuple (doublet), thereby converting the information of the gene sequences to be compressed into two-tuple information; entropy-code the elements of the doublets using variable-length coding: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.
2. The method for compressing gene sequence data for multiple sequences according to claim 1, wherein in step (3), compressing the reference sequence separately using the segmented coding method based on a k-mer algorithm specifically comprises: first dividing the reference sequence into equal-length segments of length m; then selecting a suitable value of k, searching each segment for the k-mer with the highest repetition rate, and recording its total number of repetitions and the position of each repetition within the segment; and then performing segmented coding to replace the k-mer subsequences that appear repeatedly in the sequence.
CN201910197033.2A 2019-03-15 2019-03-15 Multi-sequence-oriented gene sequence data compression method Active CN109979537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910197033.2A CN109979537B (en) 2019-03-15 2019-03-15 Multi-sequence-oriented gene sequence data compression method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910197033.2A CN109979537B (en) 2019-03-15 2019-03-15 Multi-sequence-oriented gene sequence data compression method

Publications (2)

Publication Number Publication Date
CN109979537A CN109979537A (en) 2019-07-05
CN109979537B true CN109979537B (en) 2020-12-18

Family

ID=67079015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910197033.2A Active CN109979537B (en) 2019-03-15 2019-03-15 Multi-sequence-oriented gene sequence data compression method

Country Status (1)

Country Link
CN (1) CN109979537B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112700819B (en) * 2020-12-31 2021-11-30 云舟生物科技(广州)有限公司 Gene sequence processing method, computer storage medium and electronic device
CN113496762B (en) * 2021-05-20 2022-09-27 山东大学 Biological gene sequence summary data generation method and system
CN115798591B (en) * 2022-12-23 2023-05-23 哈尔滨星云医学检验所有限公司 Genome sequence compression method based on Hilbert fractal
CN117153270B (en) * 2023-10-30 2024-02-02 吉林华瑞基因科技有限公司 Gene second-generation sequencing data processing method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012098515A1 (en) * 2011-01-19 2012-07-26 Koninklijke Philips Electronics N.V. Method for processing genomic data
CN103546160A (en) * 2013-09-22 2014-01-29 上海交通大学 Multi-reference-sequence based gene sequence stage compression method
CN108287985A (en) * 2018-01-24 2018-07-17 深圳大学 A kind of the DNA sequence dna compression method and system of GPU acceleration
CN108350494A (en) * 2015-08-06 2018-07-31 阿柯生物有限公司 System and method for genome analysis

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160103953A1 (en) * 2014-10-10 2016-04-14 International Business Machines Corporation Biological sequence tandem repeat characterization
CN106021985B (en) * 2016-05-17 2019-03-29 杭州和壹基因科技有限公司 A kind of genomic data compression method
CN106295250B (en) * 2016-07-28 2019-03-29 北京百迈客医学检验所有限公司 Short sequence quick comparison analysis method and device was sequenced in two generations
US20180157787A1 (en) * 2016-10-19 2018-06-07 Pacific Biosciences Of California, Inc. Coding genome reconstruction from transcript sequences

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Compression of Multiple DNA Sequences Using Intra-Sequence and Inter-Sequence Similarities;Kin-On Cheng等;《IEEE/ACM Transactions on Computational Biology and Bioinformatics》;20150224;第1322-1332页 *
Lossless Segment Based DNA Compression;T V Mridula等;《2011 3rd International Conference on Electronics Computer Technology》;20110707;第298-302页 *
基于Memetic优化的智能DNA序列数据压缩算法;周家锐 等;《电子学报》;20130331(第3期);第513-518页 *

Also Published As

Publication number Publication date
CN109979537A (en) 2019-07-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant