CN115098455A

CN115098455A - Genome data lossless compression method based on deep learning and related equipment

Info

Publication number: CN115098455A
Application number: CN202210743081.9A
Authority: CN
Inventors: 王荣杰; 刘贤明; 朱泽轩
Original assignee: Peng Cheng Laboratory
Current assignee: Peng Cheng Laboratory
Priority date: 2022-06-28
Filing date: 2022-06-28
Publication date: 2022-09-23

Abstract

The invention discloses a deep learning-based lossless compression method for genome data and related equipment. The method comprises the following steps: learning context relationship features and non-local features of a genome sequence with a deep learning model; based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the probability of each of the candidate bases that immediately follow the base context; and feeding the prediction probabilities output by the deep learning model to an arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file. The invention learns the correlations within the genome context through a deep learning model, predicts the probability of the current base to be encoded from the bases that have already been compressed, and finally outputs a compression result file through arithmetic coding, thereby achieving lossless compression of genome data.

Description

Genome data lossless compression method based on deep learning and related equipment

Technical Field

The invention relates to the technical field of data compression, and in particular to a deep learning-based genome data lossless compression method, system, terminal and computer readable storage medium.

Background

With the development of second-generation genome sequencing technology (NGS, high-throughput sequencing), a great deal of genome sequencing data has been generated, together with a great deal of genome sequence data assembled from that sequencing data. This massive amount of genome sequence data puts a tremendous strain on storage and transmission. Because genomes of the same species are highly similar, they lend themselves to data compression. However, the existing reference-genome-based compression methods all require an alignment (mapping) or approximate alignment process, which is often time-consuming, and they require the same reference genome for both compression and decompression. In addition, genomic variation (mutations, insertions, deletions) means that variant sites cannot be matched perfectly, so the positions and bases of the variant sites must be stored.

In recent years, deep learning methods, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have achieved tremendous success in computer vision and text processing, and are regarded as feature extraction networks suited to text and images. In text prediction, the input data can be converted into Word2Vec embeddings, and a network trained on them captures contextual features that can be used for character prediction. However, genome data has its own particularities: it contains only four base characters (A, C, G, T) and a large number of short sequence repeats, approximate repeats and reverse-complement repeats. Therefore, text features extracted simply with Word2Vec, convolutional neural networks and recurrent neural networks in the conventional way are not well suited to genome data.

Accordingly, the prior art is yet to be improved and developed.

Disclosure of Invention

The main object of the invention is to provide a deep learning-based genome data lossless compression method, system, terminal and computer readable storage medium, so as to solve the problem in the prior art that, after the context features of a genome sequence have been extracted, the base probability information output by the network cannot be used to output a compression result file.

In order to achieve the above object, the present invention provides a deep learning-based genome data lossless compression method, which comprises the following steps:

learning context relationship features and non-local features of the genome sequence with a deep learning model;

based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the probability of each of the candidate bases that immediately follow the base context;

and feeding the prediction probabilities output by the deep learning model to an arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file.

In the deep learning-based genome data lossless compression method, learning the context relationship features and non-local features of the genome sequence with the deep learning model specifically comprises:

given a data set D = {D_1, D_2, …, D_n} containing n genome sequence samples, splitting each training sample D_i into n-gram subsequences of length k, wherein the first k-1 bases of each n-gram subsequence are the input sequence and the k-th base is the supervision label, i.e. the base to be predicted;

encoding the n-gram subsequences in one-hot form as follows:

A→1000

C→0100

G→0010

T→0001;

and, for the encoded base sequences, extracting the context relationship features and the non-local features in the data set D with separate feature extraction networks.

In the deep learning-based genome data lossless compression method, local feature extraction for the context relationship features is performed with a convolutional neural network, and the non-local features are extracted with an LSTM network.

In the deep learning-based genome data lossless compression method, the candidate bases that immediately follow the base context comprise A, C, G and T.
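The splitting and one-hot encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names are hypothetical, and the stride-1 sliding window is an assumption, since the text does not say how the subsequences overlap.

```python
# One-hot codes exactly as listed in the text: A→1000, C→0100, G→0010, T→0001.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}
BASES = "ACGT"

def split_into_ngrams(sequence, k):
    """Slide a window of length k over the sequence (stride 1 is an assumption)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def make_training_pairs(sequence, k):
    """Return (one-hot context of k-1 bases, index of the k-th base) pairs."""
    pairs = []
    for ngram in split_into_ngrams(sequence, k):
        context = [ONE_HOT[base] for base in ngram[:-1]]  # network input
        label = BASES.index(ngram[-1])                    # supervision label
        pairs.append((context, label))
    return pairs

pairs = make_training_pairs("ACGTAC", k=4)
for context, label in pairs:
    print(context, "->", BASES[label])
```

Each pair holds the k-1 encoded context bases as input and the index of the k-th base as the target that the classifier is trained to predict.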

In the deep learning-based genome data lossless compression method, predicting with the deep learning model, based on the context relationship features and the non-local features, the probability of each of the candidate bases that immediately follow an input base context specifically comprises:

inputting the context relationship features and the non-local features into a feature mapping network;

when a base context is input, outputting prediction probabilities for A, C, G and T with the softmax function, the classifier being trained with a cross-entropy loss.

In the deep learning-based genome data lossless compression method, feeding the prediction probabilities output by the deep learning model to the arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file specifically comprises:

directly connecting the base probability information output by the deep learning network to the arithmetic coder;

encoding the base to be compressed according to its predicted probability with arithmetic coding, thereby converting the base prediction probabilities into a compressed bit stream, and outputting the bit stream to a compression result file.

In the deep learning-based genome data lossless compression method, the non-local features comprise information on upstream and downstream gene regulation associations.

In the deep learning-based genome data lossless compression method, the size of the convolution kernel in the deep learning model is determined by the length of the pattern sequence to be extracted and the length of the tandem repeat.

In the deep learning-based genome data lossless compression method, learning the context relationship features of the genome sequence with the deep learning model further comprises:

when the convolutional neural network performs local feature extraction, using the sparse connectivity and weight sharing of the convolutional neural network to avoid the influence of sequencing errors in the base sequence and to extract the locally correlated features of the base sequence.

In the deep learning-based genome data lossless compression method, learning the non-local features of the genome sequence with the deep learning model further comprises:

analyzing the long-range correlations of the base sequence with a recurrent neural network layer, and extracting the non-local correlated features of the base sequence by means of the long-term memory provided by the fully connected nodes within the recurrent neural network layer.
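As an illustration of the feature mapping and softmax output just described, the sketch below concatenates a local and a non-local feature vector, applies a single linear layer, and converts the four scores into probabilities for A, C, G and T. The concatenation, the single linear layer, and all numeric values are assumptions made for the example; the text does not specify the structure of the feature mapping network.

```python
import math

BASES = "ACGT"

def feature_mapping(local_feat, nonlocal_feat, weights, bias):
    """Concatenate both feature vectors and apply one linear layer (an assumption)."""
    z = list(local_feat) + list(nonlocal_feat)
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood (in nats) of the true next base."""
    return -math.log(probs[target_index])

# Hypothetical feature values and weights, chosen only for illustration.
local_feat = [1.0, 0.0]
nonlocal_feat = [0.5]
weights = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.0, -1.0]]
bias = [0.0, 0.0, 0.0, 0.0]

logits = feature_mapping(local_feat, nonlocal_feat, weights, bias)
probs = softmax(logits)
loss = cross_entropy(probs, BASES.index("A"))
print(dict(zip(BASES, probs)), loss)
```

During training the loss would be averaged over all (context, next base) pairs and minimized; a lower loss means the true base receives a higher probability.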
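To make the link between predicted probabilities and the compressed bit stream concrete, here is a minimal arithmetic coder sketch. It uses exact rational arithmetic for clarity, whereas practical coders use fixed-precision integers. The `predict` function is a simple adaptive count model standing in for the deep learning network, not the patent's model, and the sequence length is assumed to be passed to the decoder separately.

```python
import math
from fractions import Fraction

BASES = "ACGT"

def predict(history):
    """Stand-in predictor (NOT the patent's network): order-1 counts, add-one smoothing."""
    counts = [1, 1, 1, 1]
    if history:
        prev = history[-1]
        for a, b in zip(history, history[1:]):
            if a == prev:
                counts[BASES.index(b)] += 1
    total = sum(counts)
    return [Fraction(n, total) for n in counts]

def encode(seq):
    """Narrow [low, low + width) once per base, then emit the shortest binary fraction inside."""
    low, width = Fraction(0), Fraction(1)
    for i, base in enumerate(seq):
        probs = predict(seq[:i])          # uses only bases that are already coded
        s = BASES.index(base)
        low += width * sum(probs[:s], Fraction(0))
        width *= probs[s]
    nbits = 0
    while True:
        scale = 2 ** nbits
        code = math.ceil(low * scale)
        if Fraction(code, scale) < low + width:
            return format(code, "b").zfill(nbits) if nbits else ""
        nbits += 1

def decode(bits, n):
    """Replay the same predictions and pick the sub-interval that contains the code value."""
    value = Fraction(int(bits, 2), 2 ** len(bits)) if bits else Fraction(0)
    low, width = Fraction(0), Fraction(1)
    out = ""
    for _ in range(n):
        cum = Fraction(0)
        for s, p in enumerate(predict(out)):
            if low + width * cum <= value < low + width * (cum + p):
                out += BASES[s]
                low += width * cum
                width *= p
                break
            cum += p
    return out

seq = "ACGT" * 4
bits = encode(seq)
restored = decode(bits, len(seq))
print(len(bits), "bits for", len(seq), "bases; lossless:", restored == seq)
```

Because the decoder calls the same predictor on the bases it has already recovered, it reproduces the encoder's probabilities step by step, which is what makes the scheme lossless.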
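The role of the kernel size and of weight sharing can be illustrated with a one-dimensional convolution over one-hot encoded bases. In this sketch the kernel width equals the length of a hypothetical repeat unit ("CAG"), and the kernel weights are set by hand rather than learned; both choices are assumptions made for the example.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def conv1d(encoded, kernel):
    """Valid 1-D convolution: the same kernel weights are reused at every position."""
    width = len(kernel)
    return [
        sum(encoded[i + j][c] * kernel[j][c] for j in range(width) for c in range(4))
        for i in range(len(encoded) - width + 1)
    ]

seq = "TTCAGCAGCAGTT"
encoded = [ONE_HOT[b] for b in seq]

# Kernel width 3 matches the repeat unit; a trained network would learn the weights.
kernel = [ONE_HOT[b] for b in "CAG"]
response = conv1d(encoded, kernel)

# Positions where all three kernel columns match, i.e. where the repeat unit starts.
hits = [i for i, r in enumerate(response) if r == len(kernel)]
print(response, hits)
```

Each output value depends only on the few bases under the kernel (sparse connectivity), and a single set of weights detects the pattern wherever it occurs (weight sharing).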
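For reference, a single step of a standard LSTM cell is sketched below in plain Python. The hidden size, the random weights and the gate ordering are assumptions made for illustration; the text does not give the dimensions or parameters of the recurrent layer.

```python
import math
import random

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0], "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step; W maps concat(x, h) to the four gate pre-activations."""
    n = len(h)
    z = list(x) + list(h)
    pre = [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]
    i = [sigmoid(p) for p in pre[0:n]]            # input gate
    f = [sigmoid(p) for p in pre[n:2 * n]]        # forget gate
    o = [sigmoid(p) for p in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(p) for p in pre[3 * n:4 * n]]  # candidate cell update
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(n)]   # cell state carries memory
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(n)]  # exposed hidden state
    return h_new, c_new

random.seed(0)
hidden = 3  # hypothetical hidden size
W = [[random.uniform(-0.5, 0.5) for _ in range(4 + hidden)] for _ in range(4 * hidden)]
b = [0.0] * (4 * hidden)

h, c = [0.0] * hidden, [0.0] * hidden
for base in "ACGTTGCA":
    h, c = lstm_step(ONE_HOT[base], h, c, W, b)
print(h)
```

The cell state `c` is updated additively through the forget and input gates, which is what lets information from distant bases persist across many steps.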

In addition, to achieve the above object, the present invention further provides a deep learning-based genome data lossless compression system, wherein the deep learning-based genome data lossless compression system includes:

the sequence local feature extraction module is used for learning to obtain the context relationship features of the genome sequence based on a deep learning model;

the non-local feature extraction module is used for extracting non-local features in the genome sequence;

the base probability output module is used for jointly predicting, from the context relationship features and the non-local features, the probability of each of the candidate bases that immediately follow the base context;

and the probability coding module is used for feeding the prediction probabilities output by the deep learning model to the arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file.

In addition, to achieve the above object, the present invention further provides a terminal, wherein the terminal comprises a memory, a processor, and a deep learning-based genome data lossless compression program stored in the memory and executable on the processor, and the program, when executed by the processor, implements the steps of the deep learning-based genome data lossless compression method described above.

In addition, to achieve the above object, the present invention further provides a computer readable storage medium, wherein the computer readable storage medium stores a deep learning-based genome data lossless compression program, and the program, when executed by a processor, implements the steps of the deep learning-based genome data lossless compression method described above.

In the invention, context relationship features and non-local features of a genome sequence are learned with a deep learning model; based on the context relationship features and the non-local features, when a base context is input, the deep learning model predicts the probability of each of the candidate bases that immediately follow the base context; and the prediction probabilities output by the deep learning model are fed to an arithmetic coder, which encodes the base to be compressed according to its predicted probability and outputs a compression result file. The invention learns the correlations within the genome context through a deep learning model, predicts the probability of the current base to be encoded from the bases that have already been compressed, and finally outputs a compression result file through arithmetic coding, thereby achieving lossless compression of genome data.

Drawings

FIG. 1 is a flow chart of the genome data lossless compression method based on deep learning according to the preferred embodiment of the present invention;

FIG. 2 is a schematic overall compression flow diagram of the deep learning-based genome data lossless compression method according to the present invention;

FIG. 3 is a schematic diagram of the deep learning-based genome data lossless compression system according to the present invention;

FIG. 4 is a schematic diagram of the operating environment of a terminal according to a preferred embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In a preferred embodiment of the present invention, as shown in FIG. 1, the deep learning-based genome data lossless compression method includes the following steps:

Step S10: learning context relationship features and non-local features of the genome sequence with a deep learning model.

Specifically, deep learning learns the intrinsic regularities and levels of representation of sample data, and the information obtained during this learning greatly helps in interpreting data such as text, images and sound. Its ultimate aim is to give machines a human-like ability to analyze and learn, so that they can recognize data such as text, images and sound.

Assume a data set D = {D_1, D_2, …, D_n} containing n genome sequence samples. Each training sample D_i is split into n-gram subsequences of length k, wherein the first k-1 bases of each n-gram subsequence are the input sequence and the k-th base is the supervision label, i.e. the base to be predicted. The n-gram subsequences are encoded in one-hot form as follows:

A→1000

C→0100

G→0010

T→0001;

For the context relationship features (in gene sequences, neighboring bases are strongly correlated in some regions, for example short tandem repeats and pattern repeats), a convolutional neural network (CNN) is used for local feature extraction, and the size of the convolution kernel can be determined by the length of the pattern sequence to be extracted and the length of the tandem repeat. For the non-local features (in gene sequences, besides local context features there is also information on upstream and downstream gene regulation associations), an LSTM (Long Short-Term Memory) network is used for extraction.

Step S20: based on the context relationship features and the non-local features, when a base context is input, the deep learning model predicts the probability of each of the candidate bases that immediately follow the base context.

Specifically, the candidate bases that immediately follow the base context comprise A, C, G and T. The context relationship features and the non-local features are input into a feature mapping network; when a base context is input, prediction probabilities for A, C, G and T are output with the softmax function, and the classifier is trained with a cross-entropy loss.

Because a genome sequence has only four bases (A, C, G, T), only the probabilities of these four bases need to be output. The probability of the base to be encoded is predicted jointly from the local and non-local features of the sequence, and the probability value of each base is output with a softmax function.

Step S30: feeding the prediction probabilities output by the deep learning model to an arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file.

Specifically, the base probability information output by the deep learning network is directly connected to the arithmetic coder; the base to be compressed is encoded according to its predicted probability with arithmetic coding, so that the base prediction probabilities are converted into a compressed bit stream, which is output to a compression result file. The invention makes full use of the predictive performance of the deep learning network: in most cases the base with the highest softmax probability output by the network is exactly the base being encoded, and variant bases account for only a small part of the genome sequence. The invention therefore provides a method for outputting a compression result file from deep learning-based base probability information: a deep learning network extracts the correlated features within the genome context and predicts the probability of the base to be encoded, and arithmetic coding outputs the compression result file, achieving lossless compression of genome data.

According to the invention, a description of the context relationship features of a genome sequence is learned by the network model. Based on this context relationship, when bases are input, the network predicts the probabilities of the four bases (A, C, G, T) that may immediately follow, and these probabilities are obtained through a softmax function. Finally, the base to be compressed is encoded according to its predicted probability with arithmetic coding, and a compression result file is output.

As shown in FIG. 2, the overall compression process of the invention is divided into four parts: an input layer, a convolutional layer, a recurrent neural network layer, and an output layer. The base data are read in sequentially; a base probability prediction model is built through numerical encoding of the base data, sequence feature extraction by the convolutional neural network, sequence feature extraction by the recurrent neural network, and so on; and compression is achieved in combination with arithmetic coding. The specific steps are as follows:

(1) Input layer: this layer takes the n-gram subsequences of the genome sequence as input and, after one-hot encoding, passes them to the convolutional layer of the neural network.

(2) Convolutional layer: this layer identifies context information in the base sequence with a convolutional neural network (CNN); exploiting CNN advantages such as sparse connectivity and weight sharing, it avoids the influence of sequencing errors in the base sequence and extracts the locally correlated features of the base sequence.

(3) Recurrent neural network layer: this layer analyzes the long-range correlations of the base sequence and extracts its non-local correlated features by means of the long-term memory provided by the fully connected nodes within the recurrent neural network (RNN) layer.

(4) Base probability prediction: the probability of the base to be encoded is predicted by combining the local and non-local correlations of the base sequence, and the base sequence is compressed by driving arithmetic coding with the base probability estimates.
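The reason accurate predictions translate into a small output file can be made explicit: an ideal arithmetic coder spends about -log2(p) bits on a base to which the model assigned probability p. The sketch below evaluates this cost for a hypothetical run of 99 confidently predicted bases and one poorly predicted variant base; the probabilities 0.99 and 0.01 are invented for illustration and are not results from the patent.

```python
import math

def code_length_bits(p_true):
    """Ideal arithmetic-coding cost, in bits, of a base predicted with probability p_true."""
    return -math.log2(p_true)

def average_bits_per_base(probs_of_true_bases):
    """Mean ideal cost over a run of bases."""
    return sum(code_length_bits(p) for p in probs_of_true_bases) / len(probs_of_true_bases)

# Hypothetical probabilities assigned to the true base at each of 100 positions.
probs_of_true_bases = [0.99] * 99 + [0.01]

avg = average_bits_per_base(probs_of_true_bases)
print("uniform guess:", code_length_bits(0.25), "bits/base")
print("confident model:", avg, "bits/base")
```

A uniform guess over four bases costs 2 bits per base, while a model that is usually right pays a small fraction of a bit for most positions and a larger penalty only at the rare variant sites.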

The invention obtains good compression results in a lossless compression experiment on human mitochondrial genomes. 1000 human mitochondrial genomes were split into a training set and a test set: 900 human mitochondrial genome sequences were randomly selected as the training set, and the remaining 100 were used as the test set. The compression results on the test set were compared with the general-purpose compression tool Gzip, the DNA-specific compression tools MFCompress and DMcompress, and the reference-based tool ERGC; the comparison is shown in Table 1 below:

Table 1: Comparison of compression results of the DeepDNA method with other methods on the data set of 100 human mitochondrial genomes

As can be seen from Table 1, on the human mitochondrial genome test set the general-purpose text compressor Gzip achieves 1.45 bits/base, while DMcompress and MFCompress achieve 0.07 bits/base. The reference-genome-based method ERGC, using the reference sequence with ID NC_012920.1, achieves 1.46 bits/base, similar to Gzip. The deep learning-based compression method proposed by the invention, DeepDNA, achieves 0.03 bits/base, far better than both the general-purpose text compressor Gzip and the reference-based ERGC; moreover, reference-based compression is constrained by the need for the same reference genome during compression and decompression.

The results of independently compressing 5 individual mitochondrial genomes drawn at random from the test set, compared with the general-purpose compression tool Gzip and the DNA-specific compression tools MFCompress and DMcompress, are shown in Table 2 below:

Table 2: Comparison of compression results of the DeepDNA method with other methods on 5 randomly selected human mitochondrial genomes

To verify the effectiveness of compressing single genome sequences, 5 human mitochondrial genomes were randomly drawn from the 100 test genomes and compressed independently. Table 2 lists the compression results for these 5 sequences. The DeepDNA method proposed by the invention achieves good compression on all 5 single genome sequences when compared with the general-purpose text compressor Gzip, the finite-context-model-based methods MFCompress and DMcompress, and the reference-based method ERGC. The DeepDNA result is below 0.05 bits/base, whereas the results of the other four methods on single genomes are all above 2 bits/base. The method of the invention therefore also outperforms the other compression methods on single-genome compression.
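For readers who want to reproduce the metric, the sketch below computes bits/base for a Gzip-compressed sequence, assuming bits/base means the compressed size in bits divided by the number of bases. The input is a synthetic random sequence, not real mitochondrial data, so the value it prints is unrelated to the figures reported in Table 1; the length 16569 is chosen only to resemble a human mitochondrial genome.

```python
import gzip
import random

def bits_per_base(sequence, compressed_size_bytes):
    """Compressed size in bits divided by the number of bases (assumed definition)."""
    return 8 * compressed_size_bytes / len(sequence)

random.seed(0)
# Synthetic stand-in sequence; real experiments would read genome files instead.
seq = "".join(random.choice("ACGT") for _ in range(16569))

gz_size = len(gzip.compress(seq.encode("ascii")))
gz_bpb = bits_per_base(seq, gz_size)

# Baseline: naive packing of each base into 2 bits, rounded up to whole bytes.
packed_bpb = bits_per_base(seq, (2 * len(seq) + 7) // 8)

print("gzip:", round(gz_bpb, 3), "bits/base; 2-bit packing:", round(packed_bpb, 3), "bits/base")
```

The same function applies to the output of any compressor, which allows general-purpose tools and DNA-specific tools to be compared on one scale.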

Further, as shown in fig. 3, based on the above genome data lossless compression method based on deep learning, the present invention also provides a genome data lossless compression system based on deep learning, wherein the genome data lossless compression system based on deep learning includes:

the sequence local feature extraction module 51 is configured to learn to obtain context relationship features of a genome sequence based on a deep learning model;

a non-local feature extraction module 52, configured to extract non-local features in the genome sequence;

a base probability output module 53, configured to jointly predict, from the context relationship features and the non-local features, the probability of each of the candidate bases that immediately follow the base context;

and a probability coding module 54, configured to feed the prediction probabilities output by the deep learning model to the arithmetic coder, encode the base to be compressed according to its predicted probability, and output a compression result file.

The sequence local feature extraction module 51 is used to extract the context relationship features of the genome sequence. Because neighboring bases in gene sequences are strongly correlated in some regions, for example short tandem repeats and pattern repeats, a convolutional neural network is used to extract the context features of the gene sequence, and the size of the convolution kernel may be determined by the length of the pattern sequence to be extracted and the length of the tandem repeat.

The non-local feature extraction module 52 is configured to extract the non-local features of the genome sequence. Besides local context features, the genome sequence also contains information on upstream and downstream gene regulation associations, and an LSTM is used to extract this non-local information.

The base probability output module 53 is configured to output the probabilities of the four bases. A genome sequence has only four bases (A, C, G, T); the probability of the base to be encoded is predicted jointly from the local and non-local features of the sequence, and the probability value of each base is output with a softmax function.

The probability coding module 54 converts the base prediction probabilities into a compressed bit stream with arithmetic coding and outputs the bit stream to a compression result file.

Further, as shown in fig. 4, based on the above method and system for lossless compression of genomic data based on deep learning, the present invention also provides a terminal, which includes a processor 10, a memory 20 and a display 30. Fig. 4 shows only some of the components of the terminal, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.

In some embodiments the memory 20 may be an internal storage unit of the terminal, such as a hard disk or internal memory of the terminal. In other embodiments the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card provided on the terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the terminal. The memory 20 is used for storing application software installed on the terminal and various types of data, such as the program code installed on the terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores a deep learning-based genome data lossless compression program 40, which can be executed by the processor 10 to implement the deep learning-based genome data lossless compression method of the present application.

The processor 10 may be, in some embodiments, a Central Processing Unit (CPU), a microprocessor or other data Processing chip, which is used to run program codes stored in the memory 20 or process data, such as performing the deep learning-based genome data lossless compression method.

The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the terminal and for displaying a visual user interface. The components 10-30 of the terminal communicate with each other via a system bus.

In an embodiment, the following steps are implemented when the processor 10 executes the deep learning based genome data lossless compression program 40 in the memory 20:

Wherein learning the context relationship features and non-local features of the genome sequence with the deep learning model specifically comprises:

A→1000

C→0100

G→0010

T→0001;

For the context relationship features, local feature extraction is performed with a convolutional neural network; the non-local features are extracted with an LSTM network.

Wherein the candidate bases that immediately follow the base context comprise A, C, G and T.

Wherein predicting with the deep learning model, based on the context relationship features and the non-local features, the probability of each of the candidate bases that immediately follow an input base context specifically comprises:

when a base context is input, outputting prediction probabilities for A, C, G and T with the softmax function, the classifier being trained with a cross-entropy loss.

Wherein feeding the prediction probabilities output by the deep learning model to the arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file specifically comprises:

directly connecting the base probability information output by the deep learning network to the arithmetic coder;

Wherein the non-local features comprise information on upstream and downstream gene regulation associations.

Wherein the size of the convolution kernel in the deep learning model is determined by the length of the pattern sequence to be extracted and the length of the tandem repeat.

Wherein, the deep learning model learns the context relationship characteristics of the genome sequence, and further comprises:

Wherein, the deep learning model learns to obtain the non-local characteristics of the genome sequence, and further comprises:

The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a deep learning-based genome data lossless compression program, and the program, when executed by a processor, implements the steps of the deep learning-based genome data lossless compression method described above.

In summary, the present invention provides a deep learning-based genome data lossless compression method and related equipment, wherein the method includes: learning context relationship features and non-local features of a genome sequence with a deep learning model; based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the probability of each of the candidate bases that immediately follow the base context; and feeding the prediction probabilities output by the deep learning model to an arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file. The invention learns the correlations within the genome context through a deep learning model, predicts the probability of the current base to be encoded from the bases that have already been compressed, and finally outputs a compression result file through arithmetic coding, thereby achieving lossless compression of genome data.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims

1. A deep learning-based genome data lossless compression method, characterized in that the method comprises the following steps:

learning based on a deep learning model to obtain context relationship features and non-local features of the genome sequence;

based on the context relationship features and the non-local features, when a base context is input, predicting, by the deep learning model, the prediction probability corresponding to each of a plurality of bases immediately following the base context;

and feeding the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by the arithmetic coding according to its prediction probability, and outputting a compression result file.

2. The deep learning-based genome data lossless compression method according to claim 1, wherein the learning based on the deep learning model to obtain the context relationship features and non-local features of the genome sequence specifically comprises:

A→1000

C→0100

G→0010

T→0001;
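
The one-hot mapping listed above can be sketched in Python as follows. The table values come from the mapping itself; the helper names `one_hot_encode` and `one_hot_decode` are hypothetical, and the handling of symbols other than A, C, G and T is an assumption, since the mapping does not cover them.

```python
# One-hot table taken from the mapping above: A→1000, C→0100, G→0010, T→0001.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def one_hot_encode(sequence):
    """Return one 4-element vector per base of the input string."""
    try:
        return [ONE_HOT[base] for base in sequence.upper()]
    except KeyError as err:
        # Assumption: symbols outside A/C/G/T (for example N) are rejected.
        raise ValueError(f"unsupported base: {err.args[0]!r}") from None

def one_hot_decode(vectors):
    """Invert one_hot_encode by locating the single 1 in each vector."""
    return "".join("ACGT"[vector.index(1)] for vector in vectors)

encoded = one_hot_encode("GATTACA")
```

Each base becomes a row of a matrix with four columns, which is the form a convolutional or recurrent layer would typically consume.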

3. The deep learning-based genome data lossless compression method according to claim 2, characterized in that, for the context relationship features, a convolutional neural network is used for local feature extraction; and for the non-local features, a long short-term memory (LSTM) network is used for extraction.

4. The deep learning-based genome data lossless compression method according to claim 2, wherein the plurality of bases immediately following the base context include A, C, G and T.
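
As a hedged illustration of what a prediction over the four bases A, C, G and T looks like, the Python sketch below turns four raw model outputs into a probability distribution and computes the ideal cost, in bits, of coding one base. The use of a softmax output layer, the example logits, and the names `softmax` and `code_length_bits` are assumptions made for illustration; the text above does not specify them.

```python
import math

BASES = "ACGT"

def softmax(logits):
    """Map four raw scores to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def code_length_bits(probs, base):
    """Ideal arithmetic-coding cost of `base`: -log2 of its predicted probability."""
    return -math.log2(probs[BASES.index(base)])

# Hypothetical model outputs for the next base, in the order A, C, G, T.
probs = softmax([2.0, 0.5, 0.5, 0.1])
cost_if_a = code_length_bits(probs, "A")
cost_uniform = code_length_bits([0.25] * 4, "A")
```

A uniform prediction costs 2 bits per base, the same as a fixed two-bit encoding; compression is gained only when the model assigns the true base a probability above 0.25.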

5. The deep learning-based genome data lossless compression method according to claim 4, wherein, based on the context relationship features and the non-local features, when a base context is input, predicting, by the deep learning model, the prediction probability corresponding to each of the plurality of bases immediately following the base context specifically comprises:

6. The deep learning-based genome data lossless compression method according to claim 5, wherein feeding the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by the arithmetic coding according to its prediction probability, and outputting a compression result file specifically comprises:

7. The deep learning-based genome data lossless compression method according to claim 1, wherein the non-local features include upstream and downstream gene regulatory association information.

8. The deep learning-based genome data lossless compression method according to claim 1, wherein the size of the convolution kernel in the deep learning model is determined by the length of the motif sequence to be extracted and the length of the tandem repeat.
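
The relationship between kernel size and the length of the pattern to be detected can be sketched in Python as follows. This is an illustration under stated assumptions: the kernel weights are set by hand to match one pattern, whereas in a trained network they would be learned, and the names `motif_kernel` and `conv1d_valid` are hypothetical.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def motif_kernel(motif):
    """Build a kernel of size len(motif) x 4 whose weights equal the
    one-hot encoding of the pattern (hand-set for illustration)."""
    return [ONE_HOT[base] for base in motif]

def conv1d_valid(sequence, kernel):
    """Slide the kernel over the one-hot sequence without padding and
    return one score per window position."""
    x = [ONE_HOT[base] for base in sequence]
    size = len(kernel)
    return [
        sum(x[i + j][c] * kernel[j][c] for j in range(size) for c in range(4))
        for i in range(len(x) - size + 1)
    ]

# A 3-base unit "ACG" repeated in tandem, flanked by T's.
scores = conv1d_valid("TTACGACGACGTT", motif_kernel("ACG"))
peaks = [i for i, s in enumerate(scores) if s == 3]
```

With a kernel exactly as long as the repeat unit, the score reaches its maximum of 3 at offsets 2, 5 and 8, one peak per copy of the unit. A shorter kernel could not match the whole unit, which is one way to read the dependence of kernel size on pattern length.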

9. The deep learning-based genome data lossless compression method according to claim 3, wherein the deep learning model learning the context relationship features of the genome sequence further comprises:

10. The deep learning-based genome data lossless compression method according to claim 3, wherein the deep learning model learns the non-local features of the genome sequence, and further comprising:

the long-distance correlation of the base sequence is analyzed by a recurrent neural network layer, and the non-local correlation features in the base sequence are extracted by using the long-term memory capability of the fully connected node structure of the recurrent neural network layer.
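
How a recurrent layer carries information across long distances can be sketched with a single LSTM cell in plain Python. The standard LSTM gate equations are used; the hidden size, the random weight initialization, and the names `lstm_step` and `run_lstm` are assumptions for illustration and are not taken from the text above.

```python
import math
import random

GATES = ("input", "forget", "output", "cell")

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, weights, biases):
    """One LSTM time step. Every gate is fully connected to the
    concatenation of the current input x and the previous hidden state h."""
    z = list(x) + list(h)

    def affine(name):
        return [sum(w * v for w, v in zip(row, z)) + bias
                for row, bias in zip(weights[name], biases[name])]

    i = [sigmoid(a) for a in affine("input")]
    f = [sigmoid(a) for a in affine("forget")]
    o = [sigmoid(a) for a in affine("output")]
    g = [math.tanh(a) for a in affine("cell")]
    # The cell state c is the long-term memory: the forget gate decides how
    # much of it survives, the input gate how much new content is written.
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new

def run_lstm(vectors, hidden=3, seed=0):
    """Run the cell over a sequence of input vectors with untrained,
    randomly initialized weights (hypothetical) and return the final h."""
    rng = random.Random(seed)
    n_in = len(vectors[0])
    weights = {name: [[rng.uniform(-0.5, 0.5) for _ in range(n_in + hidden)]
                      for _ in range(hidden)] for name in GATES}
    biases = {name: [0.0] * hidden for name in GATES}
    h, c = [0.0] * hidden, [0.0] * hidden
    for x in vectors:
        h, c = lstm_step(x, h, c, weights, biases)
    return h

# One-hot vectors for the bases A, C, G.
final_state = run_lstm([(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0)])
```

The final hidden state depends on every earlier input through the cell state, so it can in principle summarize bases far upstream of the current position.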

11. A deep learning based genome data lossless compression system, characterized in that the deep learning based genome data lossless compression system comprises:

a sequence local feature extraction module, configured to learn the context relationship features of the genome sequence based on the deep learning model;

and a probability coding module, configured to feed the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encode the base to be compressed by the arithmetic coding according to its prediction probability, and output a compression result file.

12. A terminal, characterized in that the terminal comprises: a memory, a processor, and a deep learning-based genome data lossless compression program stored on the memory and executable on the processor, the deep learning-based genome data lossless compression program when executed by the processor implementing the steps of the deep learning-based genome data lossless compression method according to any one of claims 1 to 10.

13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a deep learning-based genome data lossless compression program, which when executed by a processor implements the steps of the deep learning-based genome data lossless compression method according to any one of claims 1 to 10.