CN115098455A - Genome data lossless compression method based on deep learning and related equipment - Google Patents


Publication number
CN115098455A
Authority
CN
China
Prior art keywords
deep learning
base
lossless compression
context
genome
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210743081.9A
Other languages
Chinese (zh)
Inventor
王荣杰
刘贤明
朱泽轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202210743081.9A
Publication of CN115098455A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1744: Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Public Health (AREA)
  • Bioethics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a deep learning-based lossless compression method for genome data and related equipment. The method comprises: learning, with a deep learning model, the context relationship features and the non-local features of a genome sequence; based on these features, when a base context is input, predicting with the deep learning model the probability of each candidate base for the position immediately following that context; and feeding the prediction probabilities output by the deep learning model into an arithmetic coder, which encodes the base to be compressed according to its predicted probability and outputs a compression result file. The invention thus learns the correlations within genome contexts through a deep learning model, predicts the probability of the current base to be encoded from the bases already compressed, and finally outputs a compression result file through arithmetic coding, thereby achieving lossless compression of genome data.

Description

Genome data lossless compression method based on deep learning and related equipment
Technical Field
The invention relates to the technical field of data compression, and in particular to a deep learning-based lossless compression method, system, terminal, and computer-readable storage medium for genome data.
Background
With the development of second-generation (NGS, high-throughput) genome sequencing technology, a large amount of genome sequencing data has been generated, together with a large amount of genome sequence data assembled from it. This massive volume of genome sequence data puts tremendous pressure on storage and transmission. Because genomes of the same species are highly similar, they lend themselves to compression. However, existing reference-based compression methods all require an alignment (mapping) or approximate alignment (similar mapping) step, which is often time-consuming; they also require a reference genome for both compression and decompression, and the reference genome must be the same in both. Moreover, because of genomic variation (mutations, insertions, deletions), variant sites do not match the reference perfectly, so the position and base information of each variant site must be stored.
In recent years, deep learning methods, particularly convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have achieved great success in both computer vision and text processing, and are regarded as feature extraction networks suitable for text and images. In text prediction, the input is converted into Word2Vec embeddings and contextual features are captured through training, so that the network can predict characters. Genomic data, however, has its own particularities: it contains only four base characters (A, C, G, T) and a large number of short sequence repeats, near repeats, and reverse-complement repeats. Consequently, text features extracted in the conventional way with Word2Vec, convolutional neural networks, and recurrent neural networks are not well suited to genome data.
Accordingly, the prior art is yet to be improved and developed.
Disclosure of Invention
The main object of the invention is to provide a deep learning-based lossless compression method, system, terminal, and computer-readable storage medium for genome data, so as to solve the problem in the prior art that, after the context features of a genome sequence have been extracted, the base probability information output by the network cannot be used to output a compression result file.
To achieve the above object, the invention provides a deep learning-based genome data lossless compression method, which comprises the following steps:
learning, with a deep learning model, the context relationship features and the non-local features of the genome sequence;
based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the probability of each of a plurality of candidate bases for the position immediately following the base context;
and feeding the prediction probabilities of the plurality of bases output by the deep learning model into an arithmetic coder, encoding the base to be compressed according to its predicted probability, and outputting a compression result file.
In the deep learning-based genome data lossless compression method, learning the context relationship features and the non-local features of the genome sequence with the deep learning model specifically comprises:
letting the data set $D = \{D_1, D_2, \ldots, D_n\}$ contain n genome sequence samples in total, and splitting each training sample $D_i$ into n-gram subsequences of length k, wherein the first k-1 bases of each n-gram subsequence are the input sequence and the k-th base is the supervision label, i.e. the base to be predicted;
encoding the n-gram subsequences in one-hot form, as follows:
A→1000
C→0100
G→0010
T→0001;
and, for the encoded base sequences, extracting the context relationship features and the non-local features of the data set D through respective feature extraction networks.
In the deep learning-based genome data lossless compression method, the context relationship features are extracted as local features by a convolutional neural network, and the non-local features are extracted by an LSTM network.
In the deep learning-based genome data lossless compression method, the plurality of bases immediately following the base context comprise A, C, G and T.
In the deep learning-based genome data lossless compression method, predicting, based on the context relationship features and the non-local features, the probability of each of the plurality of bases immediately following an input base context specifically comprises:
inputting the context relationship features and the non-local features into a feature mapping network;
and, when a base context is input, outputting a predicted probability for each of A, C, G and T with the softmax function, the classifier being trained with a cross-entropy loss.
In the deep learning-based genome data lossless compression method, feeding the prediction probabilities of the plurality of bases output by the deep learning model into the arithmetic coder, encoding the base to be compressed, and outputting a compression result file specifically comprises:
connecting the base probability information output by the deep learning network directly to the arithmetic coder;
and encoding the base to be compressed according to its predicted probability with arithmetic coding, thereby converting the base prediction probabilities into a compressed bit stream that is written to the compression result file.
In the deep learning-based genome data lossless compression method, the non-local features comprise information related to upstream and downstream gene regulation.
In the deep learning-based genome data lossless compression method, the size of the convolution kernel in the deep learning model is determined by the length of the pattern sequence to be extracted and the length of the tandem repeats.
In the deep learning-based genome data lossless compression method, learning the context relationship features of the genome sequence with the deep learning model further comprises:
when the convolutional neural network performs local feature extraction, using the sparse connectivity and weight sharing of the convolutional neural network to mitigate the effect of sequencing errors in the base sequence and to extract the locally correlated features of the base sequence.
In the deep learning-based genome data lossless compression method, learning the non-local features of the genome sequence with the deep learning model further comprises:
analyzing the long-range correlations of the base sequence with a recurrent neural network layer, and extracting the non-locally correlated features of the base sequence by means of the long-term memory provided by the fully connected structure among the nodes of the recurrent layer.
In addition, to achieve the above object, the invention further provides a deep learning-based genome data lossless compression system, which comprises:
a sequence local feature extraction module, configured to learn the context relationship features of the genome sequence with a deep learning model;
a non-local feature extraction module, configured to extract the non-local features of the genome sequence;
a base probability output module, configured to jointly use the context relationship features and the non-local features to predict the probability of each of a plurality of bases immediately following the base context;
and a probability coding module, configured to feed the prediction probabilities of the plurality of bases output by the deep learning model into an arithmetic coder, encode the base to be compressed according to its predicted probability, and output a compression result file.
In addition, to achieve the above object, the invention further provides a terminal, which comprises a memory, a processor, and a deep learning-based genome data lossless compression program stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the deep learning-based genome data lossless compression method described above.
In addition, to achieve the above object, the invention further provides a computer-readable storage medium storing a deep learning-based genome data lossless compression program which, when executed by a processor, implements the steps of the deep learning-based genome data lossless compression method described above.
In the invention, the context relationship features and the non-local features of a genome sequence are learned with a deep learning model; based on these features, when a base context is input, the deep learning model predicts the probability of each of a plurality of bases immediately following the base context; the prediction probabilities output by the deep learning model are fed into an arithmetic coder, the base to be compressed is encoded according to its predicted probability, and a compression result file is output. The invention thus learns the correlations within genome contexts through a deep learning model, predicts the probability of the current base to be encoded from the bases already compressed, and finally outputs a compression result file through arithmetic coding, thereby achieving lossless compression of genome data.
Drawings
FIG. 1 is a flow chart of the genome data lossless compression method based on deep learning according to the preferred embodiment of the present invention;
FIG. 2 is a schematic overall compression flow diagram of the deep learning-based genome data lossless compression method according to the present invention;
FIG. 3 is a schematic diagram of the deep learning-based genome data lossless compression system according to the present invention;
FIG. 4 is a schematic diagram of the operating environment of a terminal according to a preferred embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
In a preferred embodiment of the invention, as shown in FIG. 1, the deep learning-based genome data lossless compression method includes the following steps:
Step S10: learning, with a deep learning model, the context relationship features and the non-local features of the genome sequence.
Specifically, deep learning learns the intrinsic regularities and hierarchical representations of sample data, and the information obtained during learning is very helpful for interpreting data such as text, images, and sound. Its ultimate aim is to give machines a human-like ability to analyze and learn, so that they can recognize data such as text, images, and sound.
Let the data set $D = \{D_1, D_2, \ldots, D_n\}$ contain n genome sequence samples in total. Each training sample $D_i$ is split into n-gram subsequences of length k, where the first k-1 bases of each n-gram subsequence are the input sequence and the k-th base is the supervision label, i.e. the base to be predicted.
The n-gram subsequences are encoded in one-hot form, as follows:
A→1000
C→0100
G→0010
T→0001;
For the encoded base sequences, the context relationship features and the non-local features of the data set D are extracted through respective feature extraction networks.
For the context relationship features (in gene sequences, neighbouring bases are strongly correlated in some regions, for example short tandem repeats and pattern repeats), a convolutional neural network (CNN) is used for local feature extraction, and the size of the convolution kernel can be determined by the length of the pattern sequence to be extracted and the length of the tandem repeats. For the non-local features (besides local context features, gene sequences also carry information related to upstream and downstream gene regulation), an LSTM (Long Short-Term Memory) network is used for extraction.
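The n-gram splitting and one-hot encoding described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `split_ngrams` and `one_hot_encode` are hypothetical, and a sliding window with stride 1 is an assumption, since the text does not state how the subsequences are positioned.

```python
# One-hot table taken from the encoding listed in the text.
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def split_ngrams(sequence, k):
    """Yield (context, target) pairs: the first k-1 bases and the k-th base to predict.

    Assumption: windows slide one base at a time over the sample.
    """
    for start in range(len(sequence) - k + 1):
        window = sequence[start:start + k]
        yield window[:-1], window[-1]

def one_hot_encode(context):
    """Map each base of the context to its 4-dimensional one-hot vector."""
    return [ONE_HOT[base] for base in context]

pairs = list(split_ngrams("ACGTAC", k=4))
# pairs == [("ACG", "T"), ("CGT", "A"), ("GTA", "C")]
encoded_first_context = one_hot_encode(pairs[0][0])
```

Each pair supplies one training example: the encoded context is the network input and the target base is the label used by the cross-entropy loss.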
Step S20: based on the context relationship features and the non-local features, when a base context is input, the deep learning model predicts the probability of each of a plurality of bases immediately following the base context.
Specifically, the plurality of bases immediately following the base context comprise A, C, G and T. The context relationship features and the non-local features are input into a feature mapping network; when a base context is input, a predicted probability is output for each of A, C, G and T with the softmax function, and the classifier is trained with a cross-entropy loss.
Because a genome sequence contains only four bases (A, C, G, T), only the probabilities of these four bases need to be output. The probability of the base to be encoded is predicted jointly from the local features and the non-local features of the sequence, and the probability value of each base is produced by a Softmax function.
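As an illustration of the output stage, the sketch below applies softmax to four scores (one per base) and evaluates the cross-entropy loss for the true next base. The input scores are made-up numbers standing in for the output of the feature mapping network, which the text does not specify further.

```python
import math

BASES = "ACGT"

def softmax(logits):
    """Convert four raw scores into a probability distribution over A, C, G, T."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_base):
    """Negative log-probability (in nats) assigned to the true next base."""
    return -math.log(probs[BASES.index(target_base)])

# Hypothetical scores; a trained network would produce these from the base context.
probs = softmax([2.0, 0.5, 0.1, -1.0])
loss = cross_entropy(probs, "A")
```

Dividing the loss by `math.log(2)` expresses it in bits, which is the ideal code length an arithmetic coder would spend on that base; lowering the training loss therefore directly lowers the achievable bits per base.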
Step S30: the prediction probabilities of the bases output by the deep learning model are fed into an arithmetic coder, the base to be compressed is encoded according to its predicted probability, and a compression result file is output.
Specifically, the base probability information output by the deep learning network is connected directly to the arithmetic coder. The base to be compressed is encoded according to its predicted probability, which converts the base prediction probabilities into a compressed bit stream that is written to the compression result file. The invention makes full use of the prediction performance of the deep learning network: in most cases the base with the highest Softmax probability output by the network is exactly the base being encoded, and variant bases make up only a small part of the genome sequence. The invention therefore provides a method for outputting a compression result file from deep learning-based base probability information: a deep learning network extracts the correlated features within genome contexts and predicts the probability of the base to be encoded, and arithmetic coding outputs the compression result file, achieving lossless compression of genome data.
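The coupling between a probability model and an arithmetic coder can be sketched as below. This is an illustrative sketch under several assumptions: it uses exact rational arithmetic (`fractions.Fraction`) for clarity rather than the fixed-precision integer arithmetic of a practical coder; `toy_model` is a hypothetical stand-in for the trained network; and the sequence length is passed to the decoder explicitly, since the text does not say how the end of the stream is signalled.

```python
import math
from fractions import Fraction

BASES = "ACGT"

def toy_model(history):
    """Hypothetical stand-in for the network: favours repeating the previous base."""
    if not history:
        return {b: Fraction(1, 4) for b in BASES}
    return {b: Fraction(7, 10) if b == history[-1] else Fraction(1, 10) for b in BASES}

def arithmetic_encode(sequence, model):
    """Narrow [low, low + width) by each base's predicted probability."""
    low, width = Fraction(0), Fraction(1)
    for i, base in enumerate(sequence):
        probs = model(sequence[:i])
        for candidate in BASES:
            if candidate == base:
                width *= probs[candidate]
                break
            low += width * probs[candidate]
    bits = 0
    while True:  # shortest binary fraction code / 2**bits inside the final interval
        bits += 1
        code = math.ceil(low * 2 ** bits)
        if Fraction(code, 2 ** bits) < low + width:
            return code, bits

def arithmetic_decode(code, bits, length, model):
    """Replay the same model predictions to recover the bases exactly."""
    value = Fraction(code, 2 ** bits)
    low, width = Fraction(0), Fraction(1)
    decoded = []
    for _ in range(length):
        probs = model("".join(decoded))
        for candidate in BASES:
            span = width * probs[candidate]
            if value < low + span:
                decoded.append(candidate)
                width = span
                break
            low += span
    return "".join(decoded)

sequence = "AAAACCCCGT"
code, bits = arithmetic_encode(sequence, toy_model)
restored = arithmetic_decode(code, bits, len(sequence), toy_model)
```

Because encoder and decoder query the model with the same history of already processed bases, they derive identical intervals, which is what makes the scheme lossless. Well-predicted bases shrink the interval only slightly and so cost few bits.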
In the invention, a description of the context relationships of the genome sequence is learned by the network model. Based on these relationships, when a base context is input, the network predicts the probability of each of the four bases (A, C, G, T) at the position immediately following it, the probabilities being obtained through a Softmax function. Finally, the base to be compressed is encoded according to its predicted probability with arithmetic coding, and a compression result file is output.
As shown in FIG. 2, the overall compression process of the invention is divided into four parts: an input layer, a convolutional layer, a recurrent neural network layer, and an output layer. Base data are read in sequentially; a base probability prediction model is built through numerical encoding of the base data, sequence feature extraction by the convolutional neural network, and sequence feature extraction by the recurrent neural network; compression is then achieved in combination with arithmetic coding. The specific steps are as follows:
(1) Input layer: this layer takes the n-gram subsequences of the genome sequence as input and one-hot encodes them to form the input of the convolutional layer.
(2) Convolutional layer: this layer identifies context information in the base sequence with a convolutional neural network (CNN). It uses the sparse connectivity and weight sharing of the CNN to mitigate the effect of sequencing errors in the base sequence and to extract the locally correlated features of the base sequence.
(3) Recurrent neural network layer: this layer analyzes the long-range correlations of the base sequence and extracts its non-locally correlated features by means of the long-term memory provided by the fully connected structure among the nodes of the recurrent neural network (RNN) layer.
(4) Base probability prediction: the probability of the base to be encoded is predicted by combining the local and non-local correlations of the base sequence, and the base sequence is compressed by driving arithmetic coding with the base probability estimates.
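To make the role of the convolutional layer concrete, the sketch below slides one shared kernel along a one-hot encoded sequence. It is an illustration only: the kernel weights are set by hand to match the motif "CAG" (a trained network would learn its weights), and the kernel width of 3 is chosen to equal the repeat unit length, in line with the statement that kernel size follows the pattern length.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def conv1d(sequence, kernel):
    """Apply one shared kernel (width x 4 weights) at every position of the sequence."""
    encoded = [ONE_HOT[base] for base in sequence]
    width = len(kernel)
    return [
        sum(w * x
            for kernel_row, base_vec in zip(kernel, encoded[i:i + width])
            for w, x in zip(kernel_row, base_vec))
        for i in range(len(encoded) - width + 1)
    ]

# Hand-set weights that respond to "CAG"; purely illustrative.
motif_kernel = [ONE_HOT[base] for base in "CAG"]

scores = conv1d("TCAGCAGCAGT", motif_kernel)
# scores == [0, 3, 0, 0, 3, 0, 0, 3, 0]
noisy_score = conv1d("CTG", motif_kernel)
# noisy_score == [2]
```

The same twelve weights are reused at every position (weight sharing), so a tandem repeat produces a periodic response. A copy of the motif with one substituted base still scores 2 out of 3, which is one way to read the claim that the convolutional layer is less sensitive to isolated sequencing errors.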
The invention achieved good compression results in a lossless compression experiment on human mitochondrial genomes. 1000 human mitochondrial genomes were divided into a training set and a test set: 900 sequences were randomly selected as the training set and the remaining 100 were used as the test set. The compression results on the test set were compared with the general-purpose compression tool Gzip and with the DNA-specific compression tools MFCompress, DMcompress, and ERGC; the comparison is shown in Table 1 below:
Table 1: Comparison of compression results of the DeepDNA method with other methods on the 100 human mitochondrial genome test set
As can be seen from Table 1, on the human mitochondrial genome test set the general-purpose text compressor Gzip achieved 1.45 bits/base, while DMcompress and MFCompress achieved 0.07 bits/base. The reference-based method ERGC, using the reference genome sequence NC_012920.1, achieved 1.46 bits/base, similar to Gzip. The deep learning-based method proposed by the invention, DeepDNA, achieved 0.03 bits/base, far better than both Gzip and the reference-based ERGC; reference-based compression is additionally constrained by the need for the same reference genome during compression and decompression.
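For orientation, the bits/base figures above can be converted into approximate file sizes. The sketch below is simple arithmetic on the reported rates; the genome length of 16,569 bases is that of the reference sequence NC_012920.1 and is used here as an assumption, since individual samples may differ slightly in length, and the text does not state whether the reported rates include any model or header overhead.

```python
def bits_per_base(compressed_bytes, total_bases):
    """Compression rate: compressed size in bits divided by the number of bases."""
    return compressed_bytes * 8 / total_bases

def expected_bytes(rate_bits_per_base, total_bases):
    """Approximate compressed size in bytes implied by a given rate."""
    return rate_bits_per_base * total_bases / 8

GENOME_LENGTH = 16_569  # assumption: length of NC_012920.1

naive_bytes = expected_bytes(2.0, GENOME_LENGTH)    # 2 bits per base, no modelling
gzip_bytes = expected_bytes(1.45, GENOME_LENGTH)    # rate reported for Gzip
deepdna_bytes = expected_bytes(0.03, GENOME_LENGTH) # rate reported for DeepDNA
```

Under these assumptions a plain 2-bit encoding needs about 4.1 kB per genome, the reported Gzip rate corresponds to about 3.0 kB, and the reported DeepDNA rate to roughly 62 bytes.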
Five individual mitochondrial genomes were also randomly selected and compressed independently; the results, compared with the general-purpose compression tool Gzip and the DNA-specific compression tools MFCompress and DMcompress, are shown in Table 2 below:
Table 2: Comparison of compression results of the DeepDNA method with other methods on 5 randomly selected human mitochondrial genomes
To verify the effectiveness of the method on single genome sequences, 5 human mitochondrial genomes were randomly drawn from the 100 human mitochondrial genomes and compressed independently. Table 2 lists the compression results for these 5 sequences. The DeepDNA method proposed by the invention achieved good compression on all 5 single-genome sequences when compared with the general-purpose text compressor Gzip, the finite-context-model-based compressors MFCompress and DMcompress, and the reference-based compressor ERGC. The DeepDNA result was below 0.05 bits/base, whereas the other four methods all exceeded 2 bits/base on a single genome. The method of the invention therefore also outperforms the other compression methods on single-genome compression.
Further, as shown in FIG. 3, based on the above deep learning-based genome data lossless compression method, the invention also provides a deep learning-based genome data lossless compression system, which comprises:
a sequence local feature extraction module 51, configured to learn the context relationship features of a genome sequence with a deep learning model;
a non-local feature extraction module 52, configured to extract the non-local features of the genome sequence;
a base probability output module 53, configured to jointly use the context relationship features and the non-local features to predict the probability of each of a plurality of bases immediately following the base context;
and a probability coding module 54, configured to feed the prediction probabilities of the plurality of bases output by the deep learning model into an arithmetic coder, encode the base to be compressed according to its predicted probability, and output a compression result file.
The sequence local feature extraction module 51 extracts the context-related features of the genome sequence. Because neighbouring bases are strongly correlated in some regions of a gene sequence (for example short tandem repeats and pattern repeats), a convolutional neural network is used to extract the context features, and the size of the convolution kernel can be determined by the length of the pattern sequence to be extracted and the length of the tandem repeats.
The non-local feature extraction module 52 extracts the non-local features of the genome sequence. Besides local context features, a genome sequence also carries information related to upstream and downstream gene regulation; an LSTM is used to extract this non-local information.
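The way an LSTM carries information across long distances can be illustrated with a single cell step written in plain Python. This is a generic textbook LSTM cell, not the patent's network: the hidden size, the random weights, and the helper names `lstm_step` and `random_weights` are all assumptions made for the sketch.

```python
import math
import random

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, weights):
    """One LSTM step: gates are computed from the input x and previous hidden state h."""
    z = x + h  # list concatenation of input and hidden vectors

    def gate(name, activation):
        matrix, bias = weights[name]
        return [activation(sum(w * v for w, v in zip(row, z)) + b)
                for row, b in zip(matrix, bias)]

    forget = gate("forget", sigmoid)
    write = gate("input", sigmoid)
    output = gate("output", sigmoid)
    candidate = gate("candidate", math.tanh)
    # The cell state keeps old content (forget gate) and adds new content (input gate).
    c_new = [f * cj + i * g for f, cj, i, g in zip(forget, c, write, candidate)]
    h_new = [o * math.tanh(cj) for o, cj in zip(output, c_new)]
    return h_new, c_new

def random_weights(input_size, hidden_size, seed=0):
    """Untrained weights for illustration; zero biases."""
    rng = random.Random(seed)

    def layer():
        matrix = [[rng.uniform(-0.5, 0.5) for _ in range(input_size + hidden_size)]
                  for _ in range(hidden_size)]
        return matrix, [0.0] * hidden_size

    return {name: layer() for name in ("forget", "input", "output", "candidate")}

HIDDEN = 3  # hypothetical hidden size
weights = random_weights(4, HIDDEN)
h, c = [0.0] * HIDDEN, [0.0] * HIDDEN
for base in "ACGTACGT":
    h, c = lstm_step(list(ONE_HOT[base]), h, c, weights)
```

When the forget gate is close to 1 and the input gate close to 0, the cell state passes through a step unchanged, which is the mechanism that lets the network relate bases that are far apart in the sequence.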
The base probability output module 53 outputs the probabilities of the four bases. Because a genome sequence contains only four bases (A, C, G, T), the probability of the base to be encoded is predicted jointly from the local features and the non-local features of the sequence, and the probability value of each base is produced by a Softmax function.
The probability coding module 54 converts the base prediction probabilities into a compressed bit stream by arithmetic coding and writes it to the compression result file.
Further, as shown in fig. 4, based on the above method and system for lossless compression of genomic data based on deep learning, the present invention also provides a terminal, which includes a processor 10, a memory 20 and a display 30. Fig. 4 shows only some of the components of the terminal, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In some embodiments the memory 20 may be an internal storage unit of the terminal, such as a hard disk or internal memory of the terminal. In other embodiments the memory 20 may be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the terminal. Further, the memory 20 may include both an internal storage unit and an external storage device of the terminal. The memory 20 is used to store the application software installed on the terminal and various types of data, such as the program code installed on the terminal. The memory 20 may also be used to temporarily store data that has been output or is to be output. In an embodiment, the memory 20 stores a deep learning-based genome data lossless compression program 40, which can be executed by the processor 10 to implement the deep learning-based genome data lossless compression method of the present application.
In some embodiments the processor 10 may be a Central Processing Unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code stored in the memory 20 or to process data, for example to perform the deep learning-based genome data lossless compression method.
The display 30 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch panel, or the like in some embodiments. The display 30 is used for displaying information at the terminal and for displaying a visual user interface. The components 10-30 of the terminal communicate with each other via a system bus.
In an embodiment, the following steps are implemented when the processor 10 executes the deep learning based genome data lossless compression program 40 in the memory 20:
learning, based on a deep learning model, context relationship features and non-local features of a genome sequence;
based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the prediction probabilities corresponding to a plurality of candidate bases for the position immediately following the base context;
and connecting the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file.
Wherein learning, based on the deep learning model, the context relationship features and non-local features of the genome sequence specifically comprises:
given a data set $D = \{D_1, D_2, \ldots, D_n\}$ containing $n$ genome sequence samples in total, splitting each training sample $D_i$ into n-gram subsequences of length $k$, wherein the first $k-1$ bases of each n-gram subsequence form the input sequence and the $k$-th base serves as the supervision label, i.e. the base to be predicted;
and encoding the n-gram subsequences in one-hot form, as follows:
A→1000
C→0100
G→0010
T→0001;
and, for the encoded base sequences, extracting the context relationship features and the non-local features of the data set D through respective feature extraction networks.
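The splitting and encoding steps above can be sketched as follows. This is only an illustrative sketch: the function names, the toy sequence and the stride-1 sliding window are assumptions, since the description does not state how the window advances.

```python
# Illustrative sketch only: names and the stride-1 window are assumptions.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def split_into_ngrams(sequence, k):
    """Slide a window of length k over the sequence (stride 1, assumed).

    Returns (context, target) pairs: the first k-1 bases are the input,
    the k-th base is the supervision label.
    """
    return [(sequence[i:i + k - 1], sequence[i + k - 1])
            for i in range(len(sequence) - k + 1)]

def one_hot_encode(bases):
    """Map each base to its 4-dimensional one-hot vector."""
    return [ONE_HOT[b] for b in bases]

pairs = split_into_ngrams("ACGTAC", k=4)
encoded_context = one_hot_encode(pairs[0][0])
```

For the toy sequence "ACGTAC" with k = 4, the first pair has the context "ACG" and the label "T".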
For the context relationship features, a convolutional neural network (CNN) is used for local feature extraction; for the non-local features, a long short-term memory (LSTM) network is used for extraction.
Wherein the plurality of candidate bases for the position immediately following the base context comprise A, C, G and T.
Wherein predicting with the deep learning model, based on the context relationship features and the non-local features, the prediction probabilities respectively corresponding to the plurality of candidate bases immediately following an input base context specifically comprises:
inputting the context relationship features and the non-local features into a feature mapping network;
when a base context is input, outputting prediction probabilities for A, C, G and T respectively by using a softmax function, the classifier being trained with a cross-entropy loss.
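The softmax output and the cross-entropy loss mentioned above can be illustrated with a short sketch. The logits are hypothetical values standing in for the output of the feature mapping network, which is not reproduced here.

```python
import math

BASES = "ACGT"

def softmax(logits):
    """Turn four raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_base):
    """Negative log-probability assigned to the true next base."""
    return -math.log(probs[BASES.index(target_base)])

logits = [2.0, 0.5, 0.1, -1.0]   # hypothetical scores for A, C, G, T
probs = softmax(logits)
loss = cross_entropy(probs, "A")
```

Measured in base 2, the negative log-probability of the true base is the ideal code length in bits for that base, which is why minimizing the cross-entropy loss also tends to shorten the arithmetic-coded output.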
Wherein connecting the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file specifically comprises:
feeding the base probability information output by the deep learning network directly into the arithmetic coder;
and encoding the base to be compressed by arithmetic coding according to its predicted probability, thereby converting the base prediction probabilities into a compressed bit stream that is output to the compression result file.
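How predicted probabilities drive an arithmetic coder can be sketched as below. This is a simplified illustration under stated assumptions: `toy_model` is a hypothetical stand-in for the trained network, exact rational arithmetic replaces the finite-precision integer arithmetic with renormalization used by practical coders, and the sequence length is assumed to be stored alongside the bit stream.

```python
import math
from fractions import Fraction

BASES = "ACGT"

def toy_model(context):
    """Hypothetical predictor: returns P(A), P(C), P(G), P(T) for the next base."""
    if not context:
        return [Fraction(1, 4)] * 4
    probs = [Fraction(1, 6)] * 4
    probs[BASES.index(context[-1])] = Fraction(1, 2)  # favour repeating the last base
    return probs

def encode(sequence, model):
    """Narrow the interval [low, low + width) once per base, then emit bits."""
    low, width = Fraction(0), Fraction(1)
    for i, base in enumerate(sequence):
        probs = model(sequence[:i])
        idx = BASES.index(base)
        low += width * sum(probs[:idx], Fraction(0))
        width *= probs[idx]
    nbits = 1
    while True:  # shortest binary fraction that lies inside the final interval
        numerator = math.ceil(low * 2 ** nbits)
        if Fraction(numerator, 2 ** nbits) < low + width:
            return format(numerator, "0{}b".format(nbits))
        nbits += 1

def decode(bits, length, model):
    """Replay the same model to find which sub-interval holds the coded value."""
    value = Fraction(int(bits, 2), 2 ** len(bits))
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        cum = Fraction(0)
        for base, p in zip(BASES, model("".join(out))):
            if low + width * cum <= value < low + width * (cum + p):
                out.append(base)
                low, width = low + width * cum, width * p
                break
            cum += p
    return "".join(out)

sequence = "AAACCGTTTT"
bits = encode(sequence, toy_model)
restored = decode(bits, len(sequence), toy_model)
```

Because the decoder queries the same model with the bases it has already recovered, the round trip is lossless; the better the model predicts each base, the wider the final interval and the fewer bits are needed.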
Wherein the non-local features comprise upstream and downstream gene regulatory association information.
Wherein the size of the convolution kernel in the deep learning model is determined by the length of the motif sequence to be extracted and the length of the tandem repeat.
Wherein learning the context relationship features of the genome sequence with the deep learning model further comprises:
when the convolutional neural network performs local feature extraction, using the sparse connectivity and weight sharing of the convolutional neural network to avoid the influence of sequencing errors in the base sequence and to extract the locally correlated features of the base sequence.
Wherein learning the non-local features of the genome sequence with the deep learning model further comprises:
analyzing the long-range correlations of the base sequence with a recurrent neural network layer, and extracting the non-local correlation features of the base sequence by using the long-term memory capability provided by the fully connected structure of the nodes within the recurrent neural network layer.
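Sparse connectivity and weight sharing can be illustrated with a hand-written one-dimensional convolution over one-hot encoded bases. The kernel below is a hypothetical motif detector chosen for illustration, not a trained filter from the described model.

```python
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def conv1d(encoded, kernel):
    """Valid 1D convolution (cross-correlation) over one-hot encoded bases.

    The same kernel weights are applied at every position (weight sharing),
    and each output depends only on len(kernel) adjacent bases (sparse
    connectivity).
    """
    width = len(kernel)
    outputs = []
    for start in range(len(encoded) - width + 1):
        window = encoded[start:start + width]
        outputs.append(sum(w * x
                           for row_w, row_x in zip(kernel, window)
                           for w, x in zip(row_w, row_x)))
    return outputs

# Hypothetical kernel of width 3 that responds most strongly to the motif "ACG".
kernel = [ONE_HOT[b] for b in "ACG"]
responses = conv1d([ONE_HOT[b] for b in "TACGACGT"], kernel)
one_mismatch = conv1d([ONE_HOT[b] for b in "ACT"], kernel)
```

Here `responses` peaks wherever "ACG" occurs, and a window that differs from the motif by a single base ("ACT") still produces a partial response of 2 out of 3, which gives one intuition for why local filters can tolerate isolated substitution errors.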
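The long-term memory behaviour of a recurrent layer can be illustrated with a single-unit LSTM cell written out by hand. This is a sketch under strong assumptions: the standard LSTM gate equations are used, the hidden state is a single scalar, and the weights in `PARAMS` are hand-picked for illustration rather than learned.

```python
import math

ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, params):
    """One step of a single-unit LSTM cell.

    params maps a gate name to (input weights, recurrent weight, bias).
    """
    def pre(gate):
        w, u, b = params[gate]
        return sum(wi * xi for wi, xi in zip(w, x)) + u * h + b
    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new

# Hand-picked (not trained) weights: write to the cell only on base A, forget almost nothing.
PARAMS = {
    "input":     ([10.0, -10.0, -10.0, -10.0], 0.0, 0.0),
    "forget":    ([0.0, 0.0, 0.0, 0.0], 0.0, 10.0),
    "output":    ([0.0, 0.0, 0.0, 0.0], 0.0, 10.0),
    "candidate": ([5.0, 0.0, 0.0, 0.0], 0.0, 0.0),
}

def run(sequence):
    h, c = 0.0, 0.0
    for base in sequence:
        h, c = lstm_step(ONE_HOT[base], h, c, PARAMS)
    return h

h_with_a = run("A" + "C" * 50)     # an A seen 50 bases ago is still remembered
h_without_a = run("C" * 51)        # no A was ever seen
```

With these weights, the hidden state after 50 intervening bases still clearly distinguishes the sequence that started with A from the one that did not, which is the kind of long-range dependence the recurrent layer is meant to capture.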
The invention also provides a computer readable storage medium, wherein the computer readable storage medium stores a deep learning based genome data lossless compression program, and when the deep learning based genome data lossless compression program is executed by a processor, the deep learning based genome data lossless compression program realizes the steps of the deep learning based genome data lossless compression method.
In summary, the present invention provides a deep learning-based genome data lossless compression method and related equipment, wherein the method comprises: learning, based on a deep learning model, context relationship features and non-local features of a genome sequence; based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the prediction probabilities corresponding to a plurality of candidate bases for the position immediately following the base context; and connecting the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file. The invention learns the correlations within the genome context through the deep learning model, predicts the probability of the current base to be encoded from the base sequence information that has already been compressed, and finally outputs a compression result file by means of arithmetic coding, thereby achieving lossless compression of genome data.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.

Claims (13)

1. A deep learning-based genome data lossless compression method, characterized in that the deep learning-based genome data lossless compression method comprises the following steps:
learning, based on a deep learning model, context relationship features and non-local features of a genome sequence;
based on the context relationship features and the non-local features, when a base context is input, predicting with the deep learning model the prediction probabilities corresponding to a plurality of candidate bases for the position immediately following the base context;
and connecting the prediction probabilities respectively corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file.
2. The deep learning-based genome data lossless compression method according to claim 1, wherein learning, based on the deep learning model, the context relationship features and non-local features of the genome sequence specifically comprises:
given a data set $D = \{D_1, D_2, \ldots, D_n\}$ containing $n$ genome sequence samples in total, splitting each training sample $D_i$ into n-gram subsequences of length $k$, wherein the first $k-1$ bases of each n-gram subsequence form the input sequence and the $k$-th base serves as the supervision label, i.e. the base to be predicted;
and encoding the n-gram subsequences in one-hot form, as follows:
A→1000
C→0100
G→0010
T→0001;
and, for the encoded base sequences, extracting the context relationship features and the non-local features of the data set D through respective feature extraction networks.
3. The deep learning-based genome data lossless compression method according to claim 2, characterized in that, for the context relationship features, a convolutional neural network is used for local feature extraction; and for the non-local features, an LSTM network is used for extraction.
4. The deep learning-based genome data lossless compression method according to claim 2, wherein the plurality of candidate bases for the position immediately following the base context comprise A, C, G and T.
5. The deep learning-based genome data lossless compression method according to claim 4, wherein predicting with the deep learning model, based on the context relationship features and the non-local features, the prediction probabilities respectively corresponding to the plurality of candidate bases immediately following an input base context specifically comprises:
inputting the context relationship features and the non-local features into a feature mapping network;
when a base context is input, outputting prediction probabilities for A, C, G and T respectively by using a softmax function, the classifier being trained with a cross-entropy loss.
6. The deep learning-based genome data lossless compression method according to claim 5, wherein connecting the prediction probabilities respectively corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file specifically comprises:
feeding the base probability information output by the deep learning network directly into the arithmetic coder;
and encoding the base to be compressed by arithmetic coding according to its predicted probability, thereby converting the base prediction probabilities into a compressed bit stream that is output to the compression result file.
7. The deep learning-based genome data lossless compression method according to claim 1, wherein the non-local features include upstream and downstream gene regulatory association information.
8. The deep learning-based genome data lossless compression method according to claim 1, wherein the size of the convolution kernel in the deep learning model is determined by the length of the motif sequence to be extracted and the length of the tandem repeat.
9. The deep learning-based genome data lossless compression method according to claim 3, wherein learning the context relationship features of the genome sequence with the deep learning model further comprises:
when the convolutional neural network performs local feature extraction, using the sparse connectivity and weight sharing of the convolutional neural network to avoid the influence of sequencing errors in the base sequence and to extract the locally correlated features of the base sequence.
10. The deep learning-based genome data lossless compression method according to claim 3, wherein learning the non-local features of the genome sequence with the deep learning model further comprises:
analyzing the long-range correlations of the base sequence with a recurrent neural network layer, and extracting the non-local correlation features of the base sequence by using the long-term memory capability provided by the fully connected structure of the nodes within the recurrent neural network layer.
11. A deep learning-based genome data lossless compression system, characterized in that the deep learning-based genome data lossless compression system comprises:
a sequence local feature extraction module, used for learning the context relationship features of a genome sequence based on a deep learning model;
a non-local feature extraction module, used for extracting the non-local features of the genome sequence;
a base probability output module, used for jointly using the context relationship features and the non-local features to predict the prediction probabilities corresponding to a plurality of candidate bases for the position immediately following the base context;
and a probability coding module, used for connecting the prediction probabilities corresponding to the plurality of bases output by the deep learning model to arithmetic coding, encoding the base to be compressed by arithmetic coding according to its predicted probability, and outputting a compression result file.
12. A terminal, characterized in that the terminal comprises: a memory, a processor, and a deep learning-based genome data lossless compression program stored on the memory and executable on the processor, the deep learning-based genome data lossless compression program when executed by the processor implementing the steps of the deep learning-based genome data lossless compression method according to any one of claims 1 to 10.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a deep learning-based genome data lossless compression program, which when executed by a processor implements the steps of the deep learning-based genome data lossless compression method according to any one of claims 1 to 10.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210743081.9A CN115098455A (en) 2022-06-28 2022-06-28 Genome data lossless compression method based on deep learning and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210743081.9A CN115098455A (en) 2022-06-28 2022-06-28 Genome data lossless compression method based on deep learning and related equipment

Publications (1)

Publication Number Publication Date
CN115098455A true CN115098455A (en) 2022-09-23

Family

ID=83293941

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210743081.9A Pending CN115098455A (en) 2022-06-28 2022-06-28 Genome data lossless compression method based on deep learning and related equipment

Country Status (1)

Country Link
CN (1) CN115098455A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579058A (en) * 2022-11-01 2023-01-06 阿里巴巴(中国)有限公司 Lossless compression method for genome data, and method and apparatus for predicting genetic variation
CN116823492A (en) * 2023-05-05 2023-09-29 陕西长瑞安驰信息技术集团有限公司 Data storage method and system
WO2024114597A1 (en) * 2022-12-02 2024-06-06 City University Of Hong Kong Reinforcement-learning-based network transmission of compressed genome sequence

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115579058A (en) * 2022-11-01 2023-01-06 阿里巴巴(中国)有限公司 Lossless compression method for genome data, and method and apparatus for predicting genetic variation
CN115579058B (en) * 2022-11-01 2023-12-01 阿里巴巴(中国)有限公司 Lossless compression method of genome data, prediction method and device of genetic variation
WO2024114597A1 (en) * 2022-12-02 2024-06-06 City University Of Hong Kong Reinforcement-learning-based network transmission of compressed genome sequence
CN116823492A (en) * 2023-05-05 2023-09-29 陕西长瑞安驰信息技术集团有限公司 Data storage method and system
CN116823492B (en) * 2023-05-05 2024-04-02 上海原力枫林信息技术有限公司 Data storage method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination