CN115798591B

CN115798591B - Genome sequence compression method based on Hilbert fractal

Info

Publication number: CN115798591B
Application number: CN202211680607.XA
Authority: CN
Inventors: 刘志岩; 郑青松; 郭方
Original assignee: Harbin Xingyun Medical Laboratory Co ltd
Current assignee: Xingyun Gene Technology Co ltd
Priority date: 2022-12-23
Filing date: 2022-12-23
Publication date: 2023-05-23
Anticipated expiration: 2042-12-23
Also published as: CN115798591A

Abstract

The invention provides a genome sequence compression method based on Hilbert fractal. The gene sequence to be compressed is digitally mapped, and a gene reference sequence is determined through Euclidean distance, so that the reference sequence is determined more accurately. A redundancy elimination operation is performed on the sequence to be compressed and the gene reference sequence, and the redundancy-eliminated reference sequence, after being matched with the sequence to be compressed, is stored in the form of binary groups (2-tuples). Multi-mode extraction is performed on all binary group data by Hilbert fractal transformation. The mean value of each extracted mode is reduced in dimension to eliminate the linear correlation among the mean values, and the dimension-reduced mean value of each mode is compressed independently, thereby improving compression efficiency.

Description

Genome sequence compression method based on Hilbert fractal

Technical Field

The invention relates to the technical field of biological information, in particular to a genome sequence compression method based on Hilbert fractal.

Background

In recent years, with the continuous progress of next-generation sequencing technology, gene sequencing has become faster and cheaper, and it has been widely adopted in fields such as biology, medicine, health, criminal investigation and agriculture. As a result, the amount of raw data generated by gene sequencing grows explosively, by 3 to 5 times per year or even faster. Moreover, the data of each sequencing sample is large, so the storage, management, retrieval and transmission of massive gene sequencing data face both technical and cost challenges.

Data compression is one of the techniques that alleviate this challenge. It is the process of converting data into a form more compact than the original format in order to reduce storage space. The original input data is a sequence of symbols that needs to be reduced in size. The symbols are encoded by a compressor, and the output is the encoded data. Typically at some later time, the encoded data is input to a decompressor, where it is decoded and reconstructed, and the original data is output in the form of a symbol sequence. If the output data and the input data are always identical, the compression scheme is called lossless, and the encoder is called a lossless encoder. Otherwise, it is a lossy compression scheme.

According to comparative studies of existing gene sequencing data compression methods, general-purpose compression algorithms, reference-free compression algorithms and reference-based compression algorithms share the following problems: 1. there is room to further reduce the compression rate; 2. when a relatively good compression ratio is obtained, the compression/decompression time of the algorithm is relatively long, so time cost becomes a new problem. Furthermore, reference-based compression algorithms generally achieve better compression ratios than general-purpose and reference-free compression algorithms. However, for a reference-based compression algorithm, the selection of the reference genome may make algorithm performance unstable: when the same target sample data is processed, performance may differ significantly depending on which reference genome is selected; and when the same reference genome selection strategy is used, performance may also differ significantly across different gene sequencing samples of the same kind.

Disclosure of Invention

In order to solve the technical problems, the invention provides a genome sequence compression method based on Hilbert fractal, which comprises the following steps:

S1, digitally mapping a gene sequence to be compressed, and determining a gene reference sequence through Euclidean distance;

S2, performing a redundancy elimination operation on the sequence to be compressed and the gene reference sequence;

S3, after the redundancy-eliminated reference sequence is matched with the sequence to be compressed, storing it in the form of a binary group;

S4, performing multi-mode extraction on all binary group data by Hilbert fractal transformation;

S5, reducing the dimension of the extracted mean value of each mode, eliminating the linear correlation among the mean values, and independently compressing the dimension-reduced mean value of each mode.

Further, step S1 includes: given n gene sequences in total, digitally mapping the n gene sequences into high-dimensional numeric vectors in Euclidean space; calculating, for each gene sequence, the sum of the Euclidean distances between its high-dimensional numeric vector and those of the other n-1 gene sequences; and taking the gene sequence whose vector has the smallest Euclidean distance sum as the gene reference sequence.

Further, step S2 includes:

S2.1, calculating hash values of the gene reference sequence and of the sequence to be compressed, taking the reference hash value generated from the gene reference sequence as an index, matching the reference hash value against each hash value in the hash value sequence generated from the sequence to be compressed, and removing from the sequence to be compressed the gene sequences whose hash values do not match.

S2.2, traversing the gene reference sequence with step length s to obtain continuous sub-reference sequences, taking the continuous sub-reference sequences as an index, and sorting the gene sequences in the matched sequence to be compressed according to the index.

S2.3, calculating hash values of the continuous sub-reference sequences and of the gene sequences in the matched sequence to be compressed to form hash table data blocks.

S2.4, inserting into the hash table data blocks the offsets of the continuous sub-reference sequences and of the matched sequence to be compressed within all n gene reference sequences, recording the data blocks in which a conflict occurs, performing redundancy deletion on each sub-reference sequence and matched sequence to be compressed in the conflicting data blocks, and retaining the non-redundant sub-reference sequences and matched sequences to be compressed.

Further, step S4 includes:

Step 4.1: establishing a data input system, and sampling the binary group data to obtain a binary group data set;

Step 4.2: performing modal decomposition on the obtained binary group data set by the Hilbert fractal transformation method, decomposing the binary group data set into a plurality of intrinsic modes.

Further, step 4.2 includes:

Step 4.21: adding a filling sequence $\omega(t)$ to the obtained binary group data set $x(t)$ to obtain a filled data set $X(t)$:

$$X(t) = x(t) + \omega(t)$$

Step 4.22: decomposing the filled data set $X(t)$ into a plurality of modes by modal decomposition:

$$X(t) = \sum_{j=1}^{n} h_j(t) + r_n(t)$$

where $h_j(t)$ is the j-th mode obtained by decomposing $X(t)$, $r_n(t)$ is the residual after decomposing $X(t)$, and $n$ is the number of decomposed modes;

Step 4.23: adding a different filling sequence $\omega_i(t)$ ($i = 1, 2, \ldots, n$) to the data set $x(t)$ each time, repeating steps 4.21 and 4.22, and obtaining the data set $X_i(t)$ for the i-th decomposition:

$$X_i(t) = x(t) + \omega_i(t)$$

which is decomposed into:

$$X_i(t) = \sum_{j=1}^{n} h_{ij}(t) + r_{in}(t)$$

where $h_{ij}(t)$ is the j-th mode obtained by decomposing $X_i(t)$, and $r_{in}(t)$ is the residual after decomposing $X_i(t)$;

Step 4.24: calculating the mean value of each mode obtained by decomposition:

$$\bar{h}_j(t) = \frac{1}{n} \sum_{i=1}^{n} h_{ij}(t)$$

Compared with the prior art, the invention has the following beneficial technical effects:

The gene sequence to be compressed is digitally mapped, and a gene reference sequence is determined through Euclidean distance, so that the reference sequence can be determined more accurately. A redundancy elimination operation is performed on the sequence to be compressed and the gene reference sequence, and the redundancy-eliminated reference sequence, after being matched with the sequence to be compressed, is stored in the form of binary groups. Multi-mode extraction is performed on all binary group data by Hilbert fractal transformation. The mean value of each extracted mode is reduced in dimension to eliminate the linear correlation among the mean values, and the dimension-reduced mean value of each mode is compressed independently, thereby improving compression efficiency.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a genome sequence compression method based on Hilbert fractal.

FIG. 2 is a schematic flow chart of the redundancy elimination operation of the sequence to be compressed and the gene reference sequence.

FIG. 3 is a schematic diagram of a genome sequence compression system based on Hilbert fractal of the present invention.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

In the drawings of the specific embodiments of the present invention, the connection relationships between the parts of the device are shown in order to describe the working principle of each element in the system more clearly. The drawings only distinguish the relative positional relationships between the elements and shall not be construed as limiting the signal transmission direction, the connection sequence, or the size, dimension and shape of any part within an element or structure.

FIG. 1 is a schematic flow chart of a genome sequence compression method based on Hilbert fractal, which comprises the following steps:

S1, digitally mapping a gene sequence to be compressed, and determining a gene reference sequence through Euclidean distance.

Given n gene sequences in total, the gene sequences to be compressed are digitally mapped into high-dimensional numeric vectors in Euclidean space. For each gene sequence, the sum of the Euclidean distances between its high-dimensional numeric vector and those of the other n-1 sequences to be compressed is calculated, and the gene sequence whose vector has the smallest Euclidean distance sum is taken as the gene reference sequence.
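The selection rule in S1 can be illustrated with a minimal Python sketch. The source does not specify the digital mapping, so the normalized k-mer count vector used here (`kmer_vector`, with `k = 2`) is purely an assumption for illustration, as are the function names and the toy sequences.

```python
import math
from itertools import product


def kmer_vector(seq: str, k: int = 2) -> list[float]:
    """Hypothetical digital mapping: normalized k-mer counts over A, C, G, T."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    vec = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows containing other symbols such as N
            vec[pos] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def select_reference(seqs: list[str], k: int = 2) -> int:
    """Return the index of the sequence with the smallest Euclidean distance sum."""
    vectors = [kmer_vector(s, k) for s in seqs]
    # The distance of a vector to itself is 0, so summing over all vectors
    # equals summing over the other n-1 vectors.
    sums = [sum(math.dist(v, w) for w in vectors) for v in vectors]
    return min(range(len(seqs)), key=sums.__getitem__)


seqs = ["ACGTACGTAC", "ACGTACGTTC", "ACGTACGAAC", "TTTTTTTTTT"]
ref_index = select_reference(seqs)
print("reference sequence:", seqs[ref_index])
```

The chosen sequence is the medoid of the set under this mapping: the member closest, in total, to all the others. The outlier `TTTTTTTTTT` is therefore never selected in this toy input.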

S2, performing redundancy elimination operation on the sequence to be compressed and the gene reference sequence, as shown in FIG. 2, specifically comprising the following steps:

S2.1, calculating hash values of the gene reference sequence and of the sequence to be compressed: a reference hash value is generated from the gene reference sequence and taken as an index, and a hash value sequence is generated from the sequence to be compressed. The reference hash value is matched against each hash value in the hash value sequence, the matching result for each hash value is determined, and the gene sequences in the sequence to be compressed whose hash values do not match are removed.
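One possible reading of the filtering in S2.1 is sketched below. The source does not say what is hashed or which hash function is used, so hashing every length-`k` window with CRC32 and keeping a candidate when it shares at least one hash with the reference are assumptions made only for illustration.

```python
import zlib


def kmer_hashes(seq: str, k: int) -> set[int]:
    """CRC32 of every length-k window (hash function and k are assumptions)."""
    return {zlib.crc32(seq[i:i + k].encode()) for i in range(len(seq) - k + 1)}


def filter_by_reference(reference: str, candidates: list[str], k: int = 8) -> list[str]:
    """Keep only candidates that share at least one window hash with the reference."""
    ref_index = kmer_hashes(reference, k)  # reference hash values used as the index
    return [c for c in candidates if kmer_hashes(c, k) & ref_index]


reference = "ACGTACGTTGCAACGTTAGC"
candidates = ["TTGCAACGTTAG", "GGGGGGGGGGGG", "ACGTACGTTGCA"]
kept = filter_by_reference(reference, candidates)
print(kept)
```

A hash match does not prove equality of the underlying windows, so a stricter variant would compare the matching windows directly before keeping a candidate.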

S2.2, traversing the gene reference sequence with step length s to obtain continuous sub-reference sequences of a specified length, taking the sub-reference sequences as an index, and sorting the gene sequences in the matched sequence to be compressed according to the index.

The gene reference sequence consists of a series of the bases A, C, T and G. To facilitate analysis and processing of the data, the invention introduces the continuous sub-reference sequence, which is the name given to a short continuous segment of the ACTG reference sequence. Once the step length s is determined, a plurality of continuous sub-reference sequences of length s are obtained from the gene reference sequence.

A fixed-length ACTG sub-reference sequence is taken every s positions, and the length of each continuous sub-reference sequence equals the step length s, which can be defined by the user. Assuming that the total length of the gene reference sequence is N, the whole gene reference sequence yields N/s continuous sub-reference sequences. The redundancy elimination optimization of this embodiment aims to reduce the number of sequences to be compressed as far as possible through the algorithm while still ensuring the quality of the continuous sub-reference sequences.
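The traversal described above can be sketched in a few lines of Python. The function name and the choice to drop a trailing fragment shorter than s (so that the count is the integer part of N/s) are assumptions; the source only states that the count is N/s.

```python
def sub_reference_sequences(reference: str, s: int) -> list[tuple[int, str]]:
    """Return (offset, sub-sequence) pairs of length s, taken every s positions."""
    if s <= 0:
        raise ValueError("step length s must be positive")
    # A trailing fragment shorter than s is dropped in this sketch.
    return [(off, reference[off:off + s])
            for off in range(0, len(reference) - s + 1, s)]


reference = "ACGTACGTTGCAACGTTAGC"  # N = 20
blocks = sub_reference_sequences(reference, s=5)
for offset, sub in blocks:
    print(offset, sub)
```

With N = 20 and s = 5 this produces N/s = 4 sub-reference sequences, each paired with the offset that later steps record in the hash table.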

S2.3, calculating hash values of each continuous sub-reference sequence to form a hash table data block.

The hash table data blocks are a plurality of data blocks containing hash values, with each hash value occupying one data block. The data of each data block may also include information on whether the current data block is idle, information on whether its hash value is in conflict, and an index pointing from the current data block to the next conflicting data block. This information is used to carry out the corresponding processing when a gene sequence is inserted into a data block, deleted, or queried.

The hash table data block capacity records the upper limit on the number of data blocks in the hash table. The number of used data blocks represents the number of hash values currently inserted. The idle data block index indicates the position of the current idle data block, so that a data block can be allocated quickly to a newly inserted gene sequence.
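The bookkeeping fields described above can be pictured with the following Python sketch. All class, field and method names are hypothetical, and the slot function (`hash_value % capacity`) and the append-only allocation (no deletion, so the idle index simply advances) are simplifying assumptions rather than details taken from the source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataBlock:
    hash_value: int = 0
    offset: int = -1
    free: bool = True                 # whether this data block is idle
    collided: bool = False            # whether its hash slot is shared with another block
    next_index: Optional[int] = None  # next data block in the conflict chain


@dataclass
class HashTable:
    capacity: int                     # upper limit on the number of data blocks
    used: int = 0                     # number of hash values currently inserted
    free_index: int = 0               # position of the next idle data block
    blocks: list = field(default_factory=list)
    heads: dict = field(default_factory=dict)  # slot -> index of first block

    def __post_init__(self) -> None:
        self.blocks = [DataBlock() for _ in range(self.capacity)]

    def insert(self, hash_value: int, offset: int) -> int:
        if self.used >= self.capacity:
            raise OverflowError("hash table is full")
        idx = self.free_index
        block = self.blocks[idx]
        block.hash_value, block.offset, block.free = hash_value, offset, False
        slot = hash_value % self.capacity
        if slot in self.heads:
            tail = self.heads[slot]
            while self.blocks[tail].next_index is not None:
                tail = self.blocks[tail].next_index
            self.blocks[tail].next_index = idx          # link into the conflict chain
            self.blocks[tail].collided = block.collided = True
        else:
            self.heads[slot] = idx
        self.used += 1
        self.free_index += 1
        return idx

    def lookup(self, hash_value: int) -> list[int]:
        """Return the offsets of all blocks storing this hash value."""
        offsets, idx = [], self.heads.get(hash_value % self.capacity)
        while idx is not None:
            block = self.blocks[idx]
            if block.hash_value == hash_value:
                offsets.append(block.offset)
            idx = block.next_index
        return offsets


table = HashTable(capacity=4)
table.insert(10, 0)
table.insert(14, 5)   # 14 % 4 == 10 % 4, so this lands in the same slot
table.insert(3, 10)
print(table.lookup(14), table.used)
```

Keeping the conflict chain as indices into a preallocated array, instead of separate linked nodes, matches the description of each block holding "an index pointing to the next conflicting data block".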

S2.4, inserting into the hash table data blocks the offsets of the continuous sub-reference sequences and of the matched sequence to be compressed within all n gene reference sequences, recording the data blocks in which a conflict occurs, performing redundancy deletion on each sub-reference sequence and matched sequence to be compressed in the conflicting data blocks, and retaining the non-redundant sub-reference sequences and matched sequences to be compressed.
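A minimal sketch of the conflict-driven redundancy deletion in S2.4 follows, under assumptions the source does not confirm: a conflict is taken to mean that two sub-sequences produce the same hash value, an entry is treated as redundant only when its content is also identical to an earlier entry, and CRC32 stands in for the unspecified hash function.

```python
import zlib


def deduplicate_blocks(blocks: list[tuple[int, str]]):
    """blocks: (offset, sub-sequence) pairs. Returns (kept, conflicts)."""
    table: dict[int, list[tuple[int, str]]] = {}
    kept, conflicts = [], []
    for offset, sub in blocks:
        h = zlib.crc32(sub.encode())
        bucket = table.setdefault(h, [])
        if bucket:
            conflicts.append((h, offset))  # record the conflicting data block
            if any(sub == previous for _, previous in bucket):
                continue                   # identical content: redundant, drop it
        bucket.append((offset, sub))
        kept.append((offset, sub))
    return kept, conflicts


blocks = [(0, "ACGTA"), (5, "CGTTG"), (10, "ACGTA"), (15, "TTAGC")]
kept, conflicts = deduplicate_blocks(blocks)
print(kept)
print([offset for _, offset in conflicts])
```

Comparing the actual content inside a conflicting bucket keeps two different sub-sequences that merely happen to share a hash value, so only true duplicates are deleted.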

S3, after the redundancy-eliminated reference sequence is matched with the sequence to be compressed, storing it in the form of a binary group of <offset position, length>.

The offsets and lengths, within the gene reference sequences, of the non-redundant sequences to be compressed are stored in the form of <offset position, length> binary groups; the offsets and lengths of the non-redundant sub-reference sequences are likewise stored in the form of <offset position, length> binary groups.
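To make the <offset position, length> representation concrete, the sketch below greedily encodes a target sequence as tuples pointing into a reference. The brute-force longest-match search, the `min_len` threshold and the handling of unmatched bases as single-character literals are all assumptions for illustration; the source describes only the tuples themselves.

```python
def encode(reference: str, target: str, min_len: int = 4) -> list:
    """Greedy encoding of target as (offset, length) tuples into reference.

    Bases with no match of at least min_len are emitted as literal strings
    (an assumption; the source only describes the tuples).
    """
    out, i = [], 0
    while i < len(target):
        best_off, best_len = -1, 0
        for off in range(len(reference)):
            length = 0
            while (off + length < len(reference) and i + length < len(target)
                   and reference[off + length] == target[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = off, length
        if best_len >= min_len:
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(target[i])
            i += 1
    return out


def decode(reference: str, items: list) -> str:
    """Rebuild the target from tuples and literals."""
    return "".join(reference[it[0]:it[0] + it[1]] if isinstance(it, tuple) else it
                   for it in items)


reference = "ACGTACGTTGCAACGTTAGC"
target = "TTGCAACGNACGTACG"
items = encode(reference, target)
print(items)
```

The quadratic search is kept only for readability; in practice the hash index over sub-reference sequences built in S2 would be the natural way to locate candidate offsets.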

S4, performing multi-mode extraction on all data in the form of binary groups, and adopting Hilbert fractal transformation as a multi-mode extraction method, wherein the method specifically comprises the following steps:

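Step 4.1: establishing a data input system, and sampling the binary group data to obtain a binary group data set $x(t)$.

Step 4.2: performing modal decomposition on the obtained binary group data set by the Hilbert fractal transformation method, decomposing the binary group data set into a plurality of intrinsic modes, specifically:

Step 4.21: adding a filling sequence $\omega(t)$ to the obtained binary group data set $x(t)$ to obtain a filled data set $X(t)$:

$$X(t) = x(t) + \omega(t)$$

Step 4.22: decomposing the filled data set $X(t)$ into a plurality of modes by modal decomposition:

$$X(t) = \sum_{j=1}^{n} h_j(t) + r_n(t)$$

where $h_j(t)$ is the j-th mode obtained by decomposing $X(t)$, $r_n(t)$ is the residual after decomposing $X(t)$, and $n$ is the number of decomposed modes.

Step 4.23: adding a different filling sequence $\omega_i(t)$ ($i = 1, 2, \ldots, n$) to the data set $x(t)$ each time, repeating steps 4.21 and 4.22, and obtaining the data set $X_i(t)$ for the i-th decomposition:

$$X_i(t) = x(t) + \omega_i(t)$$

which is decomposed into:

$$X_i(t) = \sum_{j=1}^{n} h_{ij}(t) + r_{in}(t)$$

where $h_{ij}(t)$ is the j-th mode obtained by decomposing $X_i(t)$, and $r_{in}(t)$ is the residual after decomposing $X_i(t)$.

Step 4.24: calculating the mean value of each mode obtained by decomposition:

$$\bar{h}_j(t) = \frac{1}{n} \sum_{i=1}^{n} h_{ij}(t)$$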
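The loop formed by steps 4.21 to 4.24 (add a filling sequence, decompose, repeat, average each mode) can be sketched as follows. Two loud assumptions apply. First, the source does not define the filling sequence, so zero-mean Gaussian noise is used here. Second, `toy_decompose` is only a stand-in built from moving averages so that the example runs with the standard library; it is not the Hilbert fractal transformation of the source.

```python
import random


def moving_average(x: list[float], width: int) -> list[float]:
    half = width // 2
    out = []
    for t in range(len(x)):
        window = x[max(0, t - half):t + half + 1]
        out.append(sum(window) / len(window))
    return out


def toy_decompose(x: list[float], n_modes: int):
    """Stand-in decomposer (NOT the source's transform): peel off detail layers
    with widening moving averages so that x == sum(modes) + residual."""
    modes, residual = [], list(x)
    for j in range(n_modes):
        smooth = moving_average(residual, 2 ** (j + 1) + 1)
        modes.append([r - s for r, s in zip(residual, smooth)])
        residual = smooth
    return modes, residual


def ensemble_mode_means(x, n_modes=3, trials=20, noise_std=0.1, seed=0):
    """Average each mode over several decompositions of x plus a filling sequence."""
    rng = random.Random(seed)
    sums = [[0.0] * len(x) for _ in range(n_modes)]
    for _ in range(trials):
        omega = [rng.gauss(0.0, noise_std) for _ in x]   # filling sequence (assumed Gaussian)
        filled = [a + w for a, w in zip(x, omega)]       # X_i(t) = x(t) + omega_i(t)
        modes, _residual = toy_decompose(filled, n_modes)
        for j, mode in enumerate(modes):
            for t, value in enumerate(mode):
                sums[j][t] += value
    return [[value / trials for value in row] for row in sums]  # mean of each mode


x = [float((3 * t) % 7) for t in range(32)]
means = ensemble_mode_means(x)
print(len(means), len(means[0]))
```

Because the filling sequences are zero-mean and independent across trials, their contribution to each averaged mode shrinks as the number of trials grows, which is the usual motivation for averaging over an ensemble.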

In a preferred embodiment, main mode analysis can be used for multivariate statistics and processing of the modes: a plurality of correlated modes in the original mode space are converted into mutually uncorrelated main modes in a new space, so that the original modes are compressed and reduced in dimension while little information is lost.
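The decorrelation described here resembles principal component analysis. The sketch below is written under that assumption and is not confirmed by the source. It treats each mode mean as one variable, estimates the covariance matrix, extracts the leading eigenvectors by power iteration with deflation, and projects the data onto them. All names and the toy data are hypothetical.

```python
import math
import random


def covariance(rows):
    """rows: m centered variables, each with T observations."""
    m, T = len(rows), len(rows[0])
    return [[sum(rows[a][t] * rows[b][t] for t in range(T)) / (T - 1)
             for b in range(m)] for a in range(m)]


def top_eigenpair(C, iters=500, seed=0):
    """Leading eigenvalue and eigenvector of a symmetric matrix by power iteration."""
    rng = random.Random(seed)
    m = len(C)
    v = [rng.random() + 0.1 for _ in range(m)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:
            break
        v = [x / norm for x in w]
    lam = sum(v[a] * C[a][b] * v[b] for a in range(m) for b in range(m))
    return lam, v


def principal_modes(mode_means, k):
    """Project m correlated mode means onto their k leading uncorrelated directions."""
    centered = [[x - sum(row) / len(row) for x in row] for row in mode_means]
    C = covariance(centered)
    m, T = len(centered), len(centered[0])
    projected = []
    for _ in range(k):
        lam, v = top_eigenpair(C)
        projected.append([sum(v[a] * centered[a][t] for a in range(m)) for t in range(T)])
        # Deflation: remove the direction just found so the next pass finds the next one.
        C = [[C[a][b] - lam * v[a] * v[b] for b in range(m)] for a in range(m)]
    return projected


a = [float(t) for t in range(8)]
w = [(-1.0) ** (t + 1) for t in range(8)]
mode_means = [a, [2 * p + q for p, q in zip(a, w)], [-q for q in w]]  # 3 correlated rows
pcs = principal_modes(mode_means, k=2)
print(len(pcs), len(pcs[0]))
```

In this toy input the three rows span only two independent directions, so two main modes carry all of the variance; that is the sense in which the dimension is reduced with little information loss.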

FIG. 3 is a schematic structural diagram of a genome sequence compression system based on Hilbert fractal of the present invention. The genome sequence compression system comprises a gene reference sequence determining unit, a redundancy removing unit, a data storage unit, a Hilbert fractal transformation unit and a compression unit.

The gene reference sequence determining unit is used for digitally mapping the gene sequence to be compressed and determining the gene reference sequence through Euclidean distance.

The redundancy removing unit is used for performing the redundancy elimination operation on the sequence to be compressed and the gene reference sequence.

The data storage unit is used for storing the redundancy-eliminated reference sequence, after it is matched with the sequence to be compressed, in the form of binary groups.

The Hilbert fractal transformation unit is used for carrying out multi-mode extraction on all binary group data by adopting Hilbert fractal transformation.

The compression unit is used for reducing the dimension of the extracted average value of each mode, eliminating the linear correlation among the average values, and independently compressing the average value of each mode after dimension reduction.
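The idea of compressing each dimension-reduced mode mean independently can be sketched as below. The source does not name a back-end coder or a serialization format, so the use of `zlib` and little-endian 64-bit floats via `struct` is an assumption, as are the function names.

```python
import struct
import zlib


def compress_components(components: list[list[float]]) -> list[bytes]:
    """Compress each component as its own stream (zlib is an assumed back end)."""
    return [zlib.compress(struct.pack(f"<{len(c)}d", *c), level=9)
            for c in components]


def decompress_components(blobs: list[bytes]) -> list[list[float]]:
    """Invert compress_components, recovering every value exactly."""
    out = []
    for blob in blobs:
        raw = zlib.decompress(blob)
        out.append(list(struct.unpack(f"<{len(raw) // 8}d", raw)))
    return out


components = [[0.5, 1.5, 2.5, 3.5], [0.0] * 100]
blobs = compress_components(components)
print([len(b) for b in blobs])
```

Keeping one stream per component means a highly regular component is not penalized by being interleaved with a noisier one, and each component can be decoded without touching the others.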

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in, or transmitted via, a computer-readable storage medium. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

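1. A genome sequence compression method based on Hilbert fractal, characterized by comprising the following steps:

S1, digitally mapping a gene sequence to be compressed, and determining a gene reference sequence through Euclidean distance;

S2, performing a redundancy elimination operation on the sequence to be compressed and the gene reference sequence;

S3, after the redundancy-eliminated reference sequence is matched with the sequence to be compressed, storing it in the form of a binary group with an offset position and a length;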

S4, performing multi-mode extraction on all binary group data by Hilbert fractal transformation, comprising the following steps:

Step 4.1: establishing a data input system, and sampling the binary group data to obtain a binary group data set;

Step 4.2: performing modal decomposition on the obtained binary group data set by the Hilbert fractal transformation method, decomposing the binary group data set into a plurality of intrinsic modes, comprising the following steps:

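Step 4.21: adding a filling sequence $\omega(t)$ to the obtained binary group data set $x(t)$ to obtain a filled data set $X(t)$:

$$X(t) = x(t) + \omega(t)$$

Step 4.22: decomposing the filled data set $X(t)$ into a plurality of modes by modal decomposition:

$$X(t) = \sum_{j=1}^{n} h_j(t) + r_n(t)$$

where $h_j(t)$ is the j-th mode obtained by decomposing $X(t)$, $r_n(t)$ is the residual after decomposing $X(t)$, and $n$ is the number of decomposed modes;

Step 4.23: adding a different filling sequence $\omega_i(t)$ ($i = 1, 2, \ldots, n$) to the data set $x(t)$ each time, repeating steps 4.21 and 4.22, and obtaining the data set $X_i(t)$ for the i-th decomposition:

$$X_i(t) = x(t) + \omega_i(t)$$

which is decomposed into:

$$X_i(t) = \sum_{j=1}^{n} h_{ij}(t) + r_{in}(t)$$

where $h_{ij}(t)$ is the j-th mode obtained by decomposing $X_i(t)$, and $r_{in}(t)$ is the residual after decomposing $X_i(t)$;

Step 4.24: calculating the mean value of each mode obtained by decomposition:

$$\bar{h}_j(t) = \frac{1}{n} \sum_{i=1}^{n} h_{ij}(t)$$;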

S5, reducing the dimension of the average value of each mode, carrying out mode multivariate statistics and processing by adopting main mode analysis, converting the correlated multi-mode in the original mode space into the uncorrelated main mode in the new space, eliminating the linear correlation among the average values of each mode, and independently compressing the average value of each mode after dimension reduction.

2. The genome sequence compression method according to claim 1, wherein step S1 comprises: given n gene sequences in total, digitally mapping the n gene sequences into high-dimensional numeric vectors in Euclidean space; calculating, for each gene sequence, the sum of the Euclidean distances between its high-dimensional numeric vector and those of the other n-1 gene sequences; and taking the gene sequence whose vector has the smallest Euclidean distance sum as the gene reference sequence.

3. The method of genomic sequence compression according to claim 2, wherein step S2 comprises:

S2.1, calculating hash values of the gene reference sequence and of the sequence to be compressed, taking the reference hash value generated from the gene reference sequence as an index, matching the reference hash value against each hash value in the hash value sequence generated from the sequence to be compressed, and removing from the sequence to be compressed the gene sequences whose hash values do not match;

S2.2, traversing the gene reference sequence with step length s to obtain continuous sub-reference sequences, taking the continuous sub-reference sequences as an index, and sorting the gene sequences in the matched sequence to be compressed according to the index;

S2.3, calculating hash values of the continuous sub-reference sequences and of the gene sequences in the matched sequence to be compressed to form hash table data blocks;