CN109979537B

CN109979537B - Multi-sequence-oriented gene sequence data compression method

Info

Publication number: CN109979537B
Application number: CN201910197033.2A
Authority: CN
Inventors: 季一木; 李可; 尧海昌; 刘尚东; 王汝传
Original assignee: Jiangsu Lemote Information Technology Co ltd; Nanjing University of Posts and Telecommunications
Current assignee: Jiangsu Lemote Information Technology Co ltd; Nanjing University of Posts and Telecommunications
Priority date: 2019-03-15
Filing date: 2019-03-15
Publication date: 2020-12-18
Anticipated expiration: 2039-03-15
Also published as: CN109979537A

Abstract

The invention provides a multi-sequence-oriented gene sequence data compression method, which mainly addresses the excessive volume of gene data and reduces the cost of storing and transmitting it. First, a reference sequence is selected from the gene sequences to be compressed; the non-reference sequences and the reference sequence are then compressed in different ways. Each non-reference sequence is XORed with the reference sequence, then divided into a matrix and matrix-coded, and the gene sequence is finally coded into two-tuple form for storage; the reference sequence is compressed separately using a k-mer algorithm. The compression method has a high compression ratio and a high compression speed, and the two-tuple codes are independent of their order, which benefits distributed storage and analysis of gene sequences.

Description

Multi-sequence-oriented gene sequence data compression method

Technical Field

The invention relates to the technical field of gene sequence compression in the field of big data, in particular to a method for compressing gene sequence data for multiple sequences.

Background

Genes are DNA fragments that have a genetic effect and are closely tied to life. Research on gene data yields deeper insight into the mechanisms of life and of disease and plays an increasingly important role in biomedicine and related biotechnology industries; research on human genes in particular promotes precision medicine and helps address one of the three major livelihood issues. Gene data is therefore valued internationally for its social and scientific value. Since the international Human Genome Project formally started in 1990, continuing progress in gene sequencing technology has steadily lowered the cost and raised the speed of sequencing, and many countries and organizations have launched genome projects. On December 28, 2017, China launched its 100,000 Human Genomes Project, the first major national project in the field of human genome research in China and, at the time, the largest human genome project in the world. As sequencing projects expand, the volume of sequence data grows exponentially and is expected to grow even faster in the future. Gene data is growing far faster than storage capacity and transmission bandwidth, which puts great pressure on storage and transmission. Storing gene data more efficiently, and thereby relieving that pressure, is therefore important for gene research and its applications.

DNA sequence data differs from other data: it is a very long sequence built from only four symbols, A, G, C and T, so its alphabet is simple but its length is large. The function of a large part of the DNA sequence has not yet been determined, and any loss during compression could do immeasurable damage, so DNA sequences must be compressed losslessly. In addition, the arrangement of base pairs in a DNA sequence is not random but follows specific probability distributions and regularities. DNA sequences are also highly similar to one another. First, DNA sequence similarity between different species is high, and similarity between individuals of the same species is even more pronounced. Second, many fragments repeat exactly within the DNA of a single individual. Industry and academia have proposed many DNA sequence compression methods that exploit these characteristics. A literature search of the prior art shows the following. In 2000, T. Matsumoto and K. Sadakane, in "Biological sequence compression algorithms" in Genome Informatics, proposed the CTW+LZ method, which combines the Context Tree Weighting (CTW) method with LZ compression and compresses different fragments of a DNA sequence with several coding models. In 2002, "DNACompress: fast and effective DNA sequence compression" proposed the DNACompress method, which uses the PatternHunter tool to search for repeated and approximately repeated segments of a DNA sequence, improving the overall speed of the method. In 2005, G. Korodi and I. Tabus, in "An Efficient Normalized Maximum Likelihood Algorithm for DNA Sequence Compression" in ACM Transactions on Information Systems, proposed the GeNML method, which applies different coding strategies and probability models to DNA fragments with different data characteristics. In 2013, Sebastian Wandelt and Ulf Leser, in "FRESCO: Referential Compression of Highly Similar Sequences", proposed a fast gene compression method called FRESCO, which expresses the gene to be compressed in terms of a reference gene. In 2015, Xiaojing Xie, Shuigeng Zhou and Jihong Guan, in "CoGI: Towards Compressing Genomes as an Image" in IEEE/ACM Transactions on Computational Biology and Bioinformatics, proposed representing gene data as an image so that it can be compressed with image compression techniques. These DNA sequence compression methods fall into two broad categories, reference-free methods and reference-based methods, and both have improved compression ratio and compression efficiency. However, these methods do not fully exploit the bioinformatic characteristics of the gene fragments or the detailed repetition within fragments, and they do not sufficiently exploit the relationships between gene sequences, so the compression ratio and compression efficiency of DNA sequences remain low.

Disclosure of Invention

The purpose of the invention is as follows: in order to make up the defects of the prior art, the invention provides a gene sequence data compression method for multiple sequences, which can obviously improve the compression efficiency and realize high-efficiency storage.

The technical scheme is as follows: in order to achieve the technical effects, the invention provides the following technical scheme:

a method for compressing gene sequence data for multiple sequences comprises the following steps:

(1) selecting a reference sequence: denoting the number of gene sequences to be compressed as n, mapping each gene sequence to be compressed to a high-dimensional numeric vector in Euclidean space by digitization, then calculating, for each gene sequence to be compressed, the sum of its Euclidean distances to the other n-1 gene sequences to be compressed, and taking the gene sequence to be compressed with the smallest sum of Euclidean distances as the reference sequence;

(2) aligning each gene sequence to be compressed with the reference sequence: traversing the gene sequence to be compressed from beginning to end and comparing each position with the corresponding position in the reference sequence; when the gene sequence to be compressed contains extra content relative to the reference sequence, deleting the extra content from the gene sequence to be compressed and attaching the deleted content separately to the tail of the gene sequence to be compressed as a quadruple (T, P, L, C), wherein T represents the attachment type, which is either deletion or addition, P represents the position of the deleted content in the gene sequence to be compressed, L represents the length of the deleted content, and C represents the deleted content; when the gene sequence to be compressed lacks content relative to the reference sequence, filling in the missing content at the corresponding position in the gene sequence to be compressed and attaching a record of the filled-in content to the tail of the gene sequence to be compressed as a triple (T, P, L);

(3) independently compressing the reference sequence by adopting a segmented coding method based on a k-mer algorithm;

(4) carrying out binary coding on all gene sequences including the reference sequence, and converting all the gene sequences into binary bit sequences;

(5) carrying out XOR on the gene sequence to be compressed and a reference sequence, and after the XOR processing, wherein the parts of the gene sequence to be compressed, which are the same as the reference sequence, are all 0, and the different parts are 1;

(6) matrixing the gene sequences to be compressed after the XOR with the reference sequence: denoting the binary length of the reference sequence as l, selecting a suitable width w, dividing each gene sequence to be compressed into equal segments of width w, and placing the segments into a two-dimensional matrix; wherein the first segment of the first gene sequence to be compressed is taken as the first row of the two-dimensional matrix, the first segment of the second gene sequence to be compressed is taken as the second row, and so on until the first segment of the (n-1)th gene sequence to be compressed is taken as the (n-1)th row; then the second segment of the first gene sequence to be compressed is taken as the nth row, the second segment of the second gene sequence to be compressed is taken as the (n+1)th row, and so on until all n-1 sequences have been placed into the two-dimensional matrix; all gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w;

(7) coding the two-dimensional matrix obtained in step (6): dividing the elements equal to 1 in the two-dimensional matrix into a plurality of sub-matrices whose elements are all 1; each sub-matrix is encoded as follows: for a sub-matrix with only one element, the element is coded as -1; for a sub-matrix with 2 or more elements, the upper-left element is set to 1, the lower-right element is set to 2, and the remaining elements are set to 0;

(8) storing the gene sequences to be compressed after the processing in step (7): recording, in order and in two-tuple form, the row numbers and column numbers of the elements whose values are 1, 2 and -1 in the two-dimensional matrix, thereby converting the information of the gene sequences to be compressed into two-tuple information; entropy coding is then performed on the elements of the two-tuples using a variable-length code: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.

Further, the step (3) of individually compressing the reference sequence by using a segmented coding method based on a k-mer algorithm specifically comprises the steps of: firstly, dividing a reference sequence into equal-length segments with the length of m, then selecting a proper k value, searching a k-mer sequence with the highest repetition rate in each segment, recording the total repetition times and the position of each repetition in the segment, and then carrying out sectional coding to replace a k-mer subsequence which repeatedly appears in the sequence.

Beneficial effects: compared with the prior art, the invention has the following advantages:

the multi-sequence-oriented gene sequence data compression method provided by the invention converts gene sequences into two-tuple form, turning gene sequence data whose order must be strictly preserved into records whose order does not matter, which makes it easier to use distributed storage and computation to improve the efficiency of gene compression and analysis.

Drawings

FIG. 1 is a flow chart of a method for compressing gene sequence data for multiple sequences according to the present invention;

FIG. 2 is a flow chart of reference sequence selection;

FIG. 3 is a schematic diagram of a non-reference sequence to reference sequence alignment scheme;

FIG. 4 is a schematic diagram of a gene sequence matrixing scheme;

FIG. 5 is a flow chart of the k-mer algorithm.

Detailed Description

The present invention will be further described with reference to the accompanying drawings.

The invention provides a multi-sequence-oriented gene sequence data compression method, which mainly addresses problems such as the excessive volume of gene sequence data and the high cost of storing and transmitting it. A reference sequence is first selected from the gene sequences to be compressed, the non-reference sequences are compressed against it, and finally the reference sequence is compressed separately. FIG. 1 is a flow chart of the method, which comprises the following steps:

Step one, selecting a reference sequence: the selection process is shown in FIG. 2. A reference sequence must be chosen from among the gene sequences to be compressed, and the quality of this choice strongly affects the compression ratio. Denote the number of gene sequences to be compressed as n. Each gene sequence to be compressed is mapped to a high-dimensional numeric vector in Euclidean space by digitization; then, for each gene sequence to be compressed, the sum of its Euclidean distances to the other n-1 gene sequences is calculated, and the gene sequence with the smallest sum of Euclidean distances is taken as the reference sequence.
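The selection rule in step one can be sketched as follows. This is only an illustration under stated assumptions: the base-to-number mapping `BASE_VALUE` is hypothetical (the text does not specify the digitization), and the sequences are assumed to have equal length so that the vectors are directly comparable.

```python
import math

# Hypothetical digitization: the source does not fix a base-to-number mapping.
BASE_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}


def digitize(seq):
    """Map a base string to a numeric vector, one coordinate per base."""
    return [BASE_VALUE[base] for base in seq]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def select_reference(seqs):
    """Return the index of the sequence whose summed distance to all others is smallest."""
    vectors = [digitize(s) for s in seqs]
    totals = [
        sum(euclidean(v, w) for j, w in enumerate(vectors) if j != i)
        for i, v in enumerate(vectors)
    ]
    return min(range(len(seqs)), key=totals.__getitem__)


sequences = ["ACGTACGT", "ACGTACGA", "ACGAACGA"]
ref_index = select_reference(sequences)
print(ref_index, sequences[ref_index])
```

In this toy input the middle sequence differs from each of the other two at a single position, so it has the smallest distance sum and is chosen. The pairwise computation is quadratic in n, which may matter for large collections.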

Step two, aligning each sequence to be compressed with the reference sequence: the alignment procedure is shown in FIG. 3. Content that the sequence to be compressed has in addition to the reference sequence is temporarily removed, and the removed information is attached separately to the tail of the sequence as a quadruple of the form <attachment Type (addition or deletion), Position in the sequence, Length, Content>. Content that the sequence to be compressed lacks relative to the reference sequence is temporarily filled in with the corresponding content of the reference sequence, and a record of the filled-in content is attached to the tail of the sequence as a triple of the form <attachment Type (addition or deletion), Position in the sequence, Length>.
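A minimal sketch of this alignment bookkeeping is given below. The text does not say how differences are located, so Python's `difflib.SequenceMatcher` is swapped in here purely as a stand-in. The type labels `"add"` and `"del"`, the choice to record positions in the coordinates of the original sequence, and the helper `restore` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align_to_reference(ref, seq):
    """Return (aligned, records); aligned has the same length as ref."""
    aligned, records = [], []
    matcher = SequenceMatcher(None, ref, seq, autojunk=False)
    for tag, r0, r1, s0, s1 in matcher.get_opcodes():
        if tag == "equal":
            aligned.append(seq[s0:s1])
            continue
        # Overlapping part of a mismatch is kept as-is (left for the XOR step).
        common = min(r1 - r0, s1 - s0)
        aligned.append(seq[s0:s0 + common])
        if s1 - s0 > common:
            # Surplus content in seq: remove it and keep a quadruple (T, P, L, C).
            extra = seq[s0 + common:s1]
            records.append(("add", s0 + common, len(extra), extra))
        elif r1 - r0 > common:
            # Content missing from seq: fill from ref and keep a triple (T, P, L).
            fill = ref[r0 + common:r1]
            aligned.append(fill)
            records.append(("del", s0 + common, len(fill)))
    return "".join(aligned), records


def restore(aligned, records):
    """Undo the records left to right to recover the original sequence."""
    out = aligned
    for rec in records:
        if rec[0] == "add":
            _, pos, _, content = rec
            out = out[:pos] + content + out[pos:]
        else:
            _, pos, length = rec
            out = out[:pos] + out[pos + length:]
    return out


ref = "ACGTACGTAC"
seq = "ACGTTTACGAC"
aligned, records = align_to_reference(ref, seq)
print(aligned, records)
print(restore(aligned, records) == seq)
```

Because every record is undone in order, the original sequence can be recovered from the aligned sequence plus its attached records, which is what lossless compression requires of this step.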

Step three, compressing and storing the reference sequence separately: the reference sequence is compressed on its own using a segmented coding method based on a k-mer algorithm. A k-mer is a short sequence fragment consisting of k contiguous bases in a DNA sequence; when k takes a suitable value, the k-mer frequency distribution of a DNA sequence contains all the information of the genome and thus constitutes an equivalent representation of the sequence. First, the DNA sequence is divided into equal-length fragments of length m; then a suitable k is selected, the k-mer with the highest repetition rate is searched for in each fragment, and its total number of repetitions and the position of each repetition in the fragment are recorded; segmented coding is then performed to replace the k-mer subsequences that occur repeatedly in the sequence. In this embodiment, the DNA sequence is divided into equal-length fragments of length 64 and k is set to 3; that is, the 3-mer with the highest repetition rate is searched for in each 64-base sub-fragment, and its total number of repetitions and the position of each repetition in the fragment are recorded. Coding is then performed in the form of a triplet <subsequence number No., 3-mer Type, distance increment vector d = (d1, d2, d3, ..., dn) of all occurrences of the 3-mer in the subsequence>, which replaces the k-mer subsequences that occur repeatedly in the sequence, as shown in FIG. 5.
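The segmented k-mer coding can be sketched as below, using m = 64 and k = 3 from the embodiment. Several details are assumptions, since the text leaves them open: the distance increment vector is read as the first position followed by the gaps between successive occurrences, only non-overlapping occurrences are replaced, and the bases that remain after removal are kept as a `residual` string so that the segment can be rebuilt.

```python
from collections import Counter


def encode_segment(no, segment, k=3):
    """Return ((no, kmer, deltas), residual) for one segment."""
    if len(segment) < k:
        return (no, "", []), segment
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    kmer = counts.most_common(1)[0][0]
    # Collect non-overlapping occurrences from left to right.
    positions = []
    i = segment.find(kmer)
    while i != -1:
        positions.append(i)
        i = segment.find(kmer, i + k)
    # Distance increments: first position, then gaps between occurrences.
    deltas = [p - q for p, q in zip(positions, [0] + positions[:-1])]
    # Residual: the segment with the recorded occurrences removed.
    pieces, prev = [], 0
    for p in positions:
        pieces.append(segment[prev:p])
        prev = p + k
    pieces.append(segment[prev:])
    return (no, kmer, deltas), "".join(pieces)


def decode_segment(triplet, residual):
    """Rebuild a segment from its triplet and residual."""
    _, kmer, deltas = triplet
    out, taken, length, pos = [], 0, 0, 0
    for d in deltas:
        pos += d
        need = pos - length  # residual bases that precede this occurrence
        out.append(residual[taken:taken + need])
        out.append(kmer)
        taken += need
        length = pos + len(kmer)
    out.append(residual[taken:])
    return "".join(out)


def compress_reference(ref, m=64, k=3):
    """Encode each length-m segment of the reference independently."""
    return [encode_segment(n, ref[s:s + m], k)
            for n, s in enumerate(range(0, len(ref), m))]


triplet, residual = encode_segment(0, "ACGACGTTACG")
print(triplet, residual)
print(decode_segment(triplet, residual))
```

For the toy segment `ACGACGTTACG`, the 3-mer `ACG` occurs at positions 0, 3 and 8, so the increments are 0, 3 and 5 and only `TT` remains as residual.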

Step four, binary coding of all gene sequences, including the reference sequence. A gene sequence consists of the four bases A, G, C and T, so each base can be represented by a fixed binary code (two bits suffice for four symbols). Each gene sequence is thus converted into a binary bit sequence.
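As a small illustration of this step, the sketch below uses a two-bit code per base. The particular assignment in `CODE` is an assumption; the text only states that each base is represented in binary.

```python
# Hypothetical 2-bit assignment; the source does not specify which base gets which code.
CODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in CODE.items()}


def to_bits(seq):
    """Convert a base string into a string of '0'/'1' characters."""
    return "".join(CODE[base] for base in seq)


def from_bits(bits):
    """Convert a bit string (length divisible by 2) back into bases."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


print(to_bits("ACGT"))        # 00011011
print(from_bits("00011011"))  # ACGT
```

With this coding the binary length l of a sequence is twice its number of bases.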

Step five, XORing each gene sequence to be compressed with the reference sequence. After the XOR, the bits of the gene sequence to be compressed that are the same as in the reference sequence become 0, and the bits that differ become 1.
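The XOR step can be sketched on bit strings as follows. Representing bits as `'0'`/`'1'` characters and requiring equal lengths (which the alignment step is meant to guarantee) are assumptions of this sketch.

```python
def xor_bits(a, b):
    """Bitwise XOR of two equal-length bit strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return "".join("1" if x != y else "0" for x, y in zip(a, b))


ref_bits = "00011011"  # reference
seq_bits = "00011000"  # sequence to be compressed, differs in the last two bits
diff = xor_bits(seq_bits, ref_bits)
print(diff)                      # 00000011
print(xor_bits(diff, ref_bits))  # 00011000, the original sequence again
```

Since XOR is its own inverse, XORing the result with the reference once more recovers the original bits. For highly similar sequences the result is mostly zeros, which is what the later matrix coding relies on.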

Step six, matrixing the gene sequences to be compressed after the XOR with the reference sequence, as shown in FIG. 4: denote the binary length of the reference sequence as l, select a suitable width w, divide each gene sequence to be compressed into equal segments of width w, and place the segments into a two-dimensional matrix. The first segment of the first gene sequence to be compressed is taken as the first row of the two-dimensional matrix, the first segment of the second gene sequence to be compressed as the second row, and so on until the first segment of the (n-1)th gene sequence to be compressed is taken as the (n-1)th row. Then the second segment of the first gene sequence to be compressed is taken as the nth row, the second segment of the second gene sequence to be compressed as the (n+1)th row, and so on until all n-1 sequences have been placed into the two-dimensional matrix. All gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w.
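The row ordering described in step six can be sketched as below. The function name `matrixize` is illustrative, and the sketch assumes that l is an exact multiple of w; the text does not say how a remainder would be handled.

```python
def matrixize(xored, w):
    """Stack width-w segments: segment 0 of every sequence, then segment 1, and so on."""
    if not xored:
        return []
    length = len(xored[0])
    if any(len(s) != length for s in xored) or length % w != 0:
        raise ValueError("all sequences must share a length divisible by w")
    matrix = []
    for start in range(0, length, w):   # segment index
        for bits in xored:              # the n-1 non-reference sequences, in order
            matrix.append([int(b) for b in bits[start:start + w]])
    return matrix


xored = ["00000011", "11000000"]  # two XORed sequences with l = 8
for row in matrixize(xored, 4):
    print(row)
# [0, 0, 0, 0]
# [1, 1, 0, 0]
# [0, 0, 1, 1]
# [0, 0, 0, 0]
```

Here n-1 = 2, l = 8 and w = 4, so the matrix has (n-1) × l/w = 4 rows. Placing the same segment of different sequences in adjacent rows means that a difference shared by several sequences shows up as a vertical run of ones.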

Step seven, coding the two-dimensional matrix obtained in step six: the elements equal to 1 in the two-dimensional matrix are divided into a plurality of sub-matrices whose elements are all 1. Each sub-matrix is encoded as follows: for a sub-matrix with only one element, the element is coded as -1; for a sub-matrix with 2 or more elements, the upper-left element is set to 1, the lower-right element is set to 2, and the remaining elements are set to 0.
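A possible way to carry out this coding is sketched below. The text does not specify how the ones are partitioned into all-ones sub-matrices, so the greedy rule used here (grow each rectangle to the right, then downward, scanning in row-major order) is an assumption and is not necessarily the partition the inventors intend.

```python
def encode_submatrices(matrix):
    """Mark each all-ones rectangle by its corners: 1 (top-left), 2 (bottom-right), -1 (single)."""
    rows = len(matrix)
    cols = len(matrix[0]) if matrix else 0
    covered = [[False] * cols for _ in range(rows)]
    coded = [[0] * cols for _ in range(rows)]

    def free(r, c):
        return matrix[r][c] == 1 and not covered[r][c]

    for r in range(rows):
        for c in range(cols):
            if not free(r, c):
                continue
            # Grow to the right along the current row.
            c2 = c
            while c2 + 1 < cols and free(r, c2 + 1):
                c2 += 1
            # Grow downward while the whole column span is still uncovered ones.
            r2 = r
            while r2 + 1 < rows and all(free(r2 + 1, x) for x in range(c, c2 + 1)):
                r2 += 1
            for y in range(r, r2 + 1):
                for x in range(c, c2 + 1):
                    covered[y][x] = True
            if (r, c) == (r2, c2):
                coded[r][c] = -1      # single-element sub-matrix
            else:
                coded[r][c] = 1       # upper-left corner
                coded[r2][c2] = 2     # lower-right corner
    return coded


matrix = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
for row in encode_submatrices(matrix):
    print(row)
# [1, 0, 0, 0]
# [0, 2, 0, -1]
# [0, 0, 0, 0]
# [0, 1, 2, 0]
```

The 2×2 block of ones collapses to two marked corners, so larger uniform regions of differences cost no more to record than small ones.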

Step eight, storing the gene sequences to be compressed after the processing in step seven: the row numbers and column numbers of the elements whose values are 1, 2 and -1 in the two-dimensional matrix are recorded in order in two-tuple form, which converts the information of the gene sequences to be compressed into two-tuple information. Entropy coding is then performed on the elements of the two-tuples using a variable-length code: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.
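The byte-width rule in step eight can be sketched as follows. Several points are assumptions because the text does not settle them: the boundary values 255 and 65535 are given the shorter width they still fit in, bytes are written big-endian, and the sketch does not show how a decoder would learn each value's width or whether a position marks a 1, a 2 or a -1.

```python
def byte_width(value):
    """Number of bytes used for one row or column number."""
    if value < 0:
        raise ValueError("row and column numbers are non-negative")
    if value <= 0xFF:
        return 1
    if value <= 0xFFFF:
        return 2
    if value <= 0xFFFFFF:
        return 3
    raise ValueError("value does not fit in three bytes")


def encode_value(value):
    """Big-endian bytes of the chosen width (byte order is an assumption)."""
    return value.to_bytes(byte_width(value), "big")


def marked_positions(coded):
    """(row, column) two-tuples of the elements equal to 1, 2 or -1, in row-major order."""
    return [(r, c)
            for r, row in enumerate(coded)
            for c, v in enumerate(row)
            if v in (1, 2, -1)]


def encode_positions(pairs):
    """Concatenate the variable-length codes of every row and column number."""
    return b"".join(encode_value(r) + encode_value(c) for r, c in pairs)


coded = [
    [1, 0, 0, 0],
    [0, 2, 0, -1],
    [0, 0, 0, 0],
    [0, 1, 2, 0],
]
pairs = marked_positions(coded)
print(pairs)
print(encode_positions(pairs).hex())
print(byte_width(200), byte_width(300), byte_width(70000))
```

In this small matrix every index fits in one byte, so the five two-tuples take ten bytes in total; the widths only grow once row or column numbers exceed 255 or 65535.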

The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims

1. A method for compressing gene sequence data for a plurality of sequences, comprising the steps of:

(6) matrixing the gene sequences to be compressed after the XOR with the reference sequence: denoting the binary length of the reference sequence as l, selecting a suitable width w, dividing each gene sequence to be compressed into equal segments of width w, and placing the segments into a two-dimensional matrix; wherein the first segment of the first gene sequence to be compressed is taken as the first row of the two-dimensional matrix, the first segment of the second gene sequence to be compressed is taken as the second row, and so on until the first segment of the (n-1)th gene sequence to be compressed is taken as the (n-1)th row; then the second segment of the first gene sequence to be compressed is taken as the nth row, the second segment of the second gene sequence to be compressed is taken as the (n+1)th row, and so on until all n-1 sequences have been placed into the two-dimensional matrix; all gene sequences to be compressed are thereby converted into a two-dimensional matrix of width w and length (n-1) × l/w;

(7) coding the two-dimensional matrix obtained in step (6): dividing the two-dimensional matrix into a plurality of sub-matrices, wherein the elements in each sub-matrix are all 1; each sub-matrix is encoded as follows: for a sub-matrix with only one element, the element is coded as -1; for a sub-matrix with 2 or more elements, the upper-left element is set to 1, the lower-right element is set to 2, and the remaining elements are set to 0;

(8) storing the gene sequences to be compressed after the processing in step (7): recording, in order and in two-tuple form, the row numbers and column numbers of the elements whose values are 1, 2 and -1 in the two-dimensional matrix, thereby converting the information of the gene sequences to be compressed into two-tuple information; entropy coding is then performed on the elements of the two-tuples using a variable-length code: one byte for elements smaller than 255, two bytes for elements larger than 255 and smaller than 65535, and three bytes for elements larger than 65535.

2. The method for compressing gene sequence data for multiple sequences according to claim 1, wherein the step (3) of compressing the reference sequence separately by using a segmented coding method based on k-mer algorithm comprises the following specific steps: firstly, dividing a reference sequence into equal-length segments with the length of m, then selecting a proper k value, searching a k-mer sequence with the highest repetition rate in each segment, recording the total repetition times and the position of each repetition in the segment, and then carrying out sectional coding to replace a k-mer subsequence which repeatedly appears in the sequence.